In the interest of paying down a bit of technical debt in the Homelab, I recently started to update the CloudNativePG Postgres images to their new variants.
Where before, the Postgres operand images (see the GitHub repo) were based on the official Postgres containers, they’re now based on Debian and the Debian Postgres packages.
With this switch, instead of just having one image per Postgres version, there are now a few variants (example tags follow below the list):
- Minimal: These images only contain what’s necessary to run a CNPG Postgres instance
- Standard: These images come with everything minimal contains, plus a few addons like PGAudit
- System: These images are deprecated, and they are equivalent to the old image before switching to Debian, including the Barman Cloud Plugin
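To make the naming concrete: the variant is part of the image tag. The standard tag below is the one I actually ended up using later; the minimal one is my best guess at the same scheme, so double-check it against the registry before relying on it.
ghcr.io/cloudnative-pg/postgresql:17.6-minimal-bullseye
ghcr.io/cloudnative-pg/postgresql:17.6-standard-bullseye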
My main goal with this action was to switch away from the old/system images, as they’re deprecated and will go away at some point.
Before I start in on how it went, one thing to mention is that it would have been nice if there had been some information available about upgrading from one image type to another. It turns out that switching from the legacy image to the standard image works out of the box - but the docs never said anything to that effect anywhere. Initially, there was even a note in the Readme stating that a switch was not possible, but this commit removed it.
To test what really works and what doesn’t, I started out with the database cluster for my Wallabag deployment. That’s currently the tool in the Homelab I could live with being down for a few days if the entire action went south.
As the first step, I needed to decide which precise variant of the new CNPG Postgres images I actually wanted to use. The most important check here was to ensure that “bullseye” was actually the right OS version, by running this command:
kubectl exec -it -n wallabag wallabag-pg-cluster-1 -- cat /etc/os-release
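For context, on a bullseye-based image the relevant lines of /etc/os-release should look something like this:
PRETTY_NAME="Debian GNU/Linux 11 (bullseye)"
VERSION_ID="11"
VERSION_CODENAME=bullseye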
That confirmed that the old images were based on Debian bullseye. Because the DB was using Postgres 17.2, I tried to use 17.2-standard-bullseye. This did not work, and I got an image not found error. Checking a bit further, I first got pretty annoyed with GitHub’s package page. And I’m honestly wondering whether I’m doing something wrong, because I just can’t seem to figure out how to search in the tags of the CNPG postgres-container image repo.
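The only workaround I can think of is to query the registry directly instead of the GitHub UI; a rough sketch, assuming the crane CLI is installed:
# list all tags of the CNPG Postgres image and narrow them down to the Postgres 17 standard bullseye variants
crane ls ghcr.io/cloudnative-pg/postgresql | grep '^17\..*-standard-bullseye$'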
But luckily, CNPG itself provides image lists, for example here. From that, I was able to see that the newest Postgres 17 image was 17.6, so I entered that into my Wallabag Helm chart’s Cluster:
apiVersion: postgresql.cnpg.io/v1
kind: Cluster
metadata:
  name: wallabag-pg-cluster
spec:
  imageName: "ghcr.io/cloudnative-pg/postgresql:17.6-standard-bullseye"
After a helm upgrade on my chart, the CNPG operator automatically switched first the replica and then the primary over to the new image, seemingly without any issue at all.
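For completeness, there was nothing more to it than the chart upgrade plus watching the operator work; a sketch with placeholder release and chart names, not my actual values:
# release and chart names are placeholders
helm upgrade wallabag ./charts/wallabag -n wallabag
# watch the operator roll the instances over to the new image
kubectl get pods -n wallabag -w
kubectl get cluster wallabag-pg-cluster -n wallabag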
I decided to stay with bullseye for now to not do too many things at once. Updating the OS to trixie will come in a follow-up task.
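If the tag scheme stays the same, that should again be just an imageName bump, assuming a trixie variant of the standard image is published by then:
imageName: "ghcr.io/cloudnative-pg/postgresql:17.6-standard-trixie"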
Then came the Harbor update, and that one went horribly wrong.
Harbor was one of the first services I set up back when I started my migration to k8s, as I didn’t have it running on my Nomad-based Homelab. So it was still on Postgres 16.2. The first step was switching it to the new images, but still using Postgres 16. My thinking was that it was probably a good idea to not combine a switch of the image type with a major Postgres update. This update went swimmingly, without any issues at all.
Then I switched from 16.10 to 17.6. Major updates are supported by CNPG by just updating the imageName in the Cluster CRD. These major updates work offline, so the application will not be able to access the database while the update is ongoing, which didn’t bother me too much.
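Concretely, that meant nothing more than bumping the tag in the Cluster resource, roughly like this (the cluster name is a placeholder for whatever the Harbor chart actually uses):
apiVersion: postgresql.cnpg.io/v1
kind: Cluster
metadata:
  name: harbor-pg-cluster
spec:
  # previously: ghcr.io/cloudnative-pg/postgresql:16.10-standard-bullseye
  imageName: "ghcr.io/cloudnative-pg/postgresql:17.6-standard-bullseye"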
Initially, it looked like everything was fine. CNPG launches a major upgrade Kubernetes Job, which upgrades the primary first. This went through without issue again. The problems started when CNPG tried to bring up the replica, which is seemingly always created from scratch after a major upgrade. The Pod of the new replica never reached the Running state and kept getting restarted after a while.
First of all, I think the logging could be improved, because the following message gets written to the Pod’s logs multiple times every second:
{
[...]
"error_severity":"FATAL",
"sql_state_code":"57P03",
"message":"the database system is starting up",
[...]
}
It just gets spammed and makes the actually informative log entries difficult to spot. Towards the end, I saw the following errors:
{
[...]
"error_severity":"FATAL",
"sql_state_code":"XX000",
"message":"requested timeline 41 is not a child of this server's history",
"detail":"Latest checkpoint is at 123/BD0019F0 on timeline 1, but in the
history of the requested timeline, the server forked off from that timeline
at 5/A90000A0.",
[...]
}
I initially thought that this was due to an error I had seen previously, where the WALs on the replica’s disk somehow got “out of sync” with the primary and Postgres was unable to handle that. It sometimes happens during random node drains, for example. The prescribed solution for that problem is to delete both the Pod and the volume of the replica. This had helped previously, but didn’t do anything this time. After the replica started up again, I saw the same error as above. Wondering how a completely fresh replica could suddenly have problems, I went through the logs again and found these lines:
{
[...]
"error_severity":"LOG",
"sql_state_code":"00000",
"message":"restored log file \"00000002.history\" from archive"
[...]
}
There were many more messages like this, always with different files. This indicated to me that the replica was somehow using the backups to get itself up to speed? And that those backups were somehow wrong or broken? I had previously tested the CNPG backups, most recently after my update to the Barman Cloud Plugin, and they were working fine for bootstrapping a fresh cluster. So I decided to nuke them entirely, meaning I deleted the entire content of the directory for the Harbor cluster in my CNPG backup S3 bucket. Then I deleted the Pod again, and after a new Pod was created by the CNPG controller, the replica came up without any further issues.
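For the record, the cleanup amounted to something like the following; bucket, prefix, and Pod names are placeholders, and the S3 command obviously depends on which CLI you use:
# wipe the (apparently broken) backups for this cluster from the backup bucket
aws s3 rm s3://my-cnpg-backups/harbor-pg-cluster/ --recursive
# delete the stuck replica Pod so the operator creates a fresh one
kubectl delete pod -n harbor harbor-pg-cluster-2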
I’m genuinely unsure what’s going on here, and I have too little experience with database management to investigate further. But it looks to me like some of the backup files were produced by Postgres 16, the new Postgres 17 replica was not able to handle them properly, and it consequently ran into a desync?
This wasn’t a problem specific to the Harbor Postgres cluster, either. I also had to do major updates from Postgres 16 to 17 for my Grafana and Woodpecker clusters, and they showed exactly the same issue. In both cases, deleting the backups fixed the problem.
On the positive side, none of the above led to any data loss, as the primaries stayed up and healthy through it all. But the entire episode hasn’t exactly reinforced my trust in CloudNativePG.