Wherein I migrate my CloudNativePG setup to the Barman Cloud Plugin.

During my migration from Nomad to Kubernetes, I started using CNPG for my database needs. For more details, have a look at this post. I configured their backup solution right away. It consists of a component which runs in the same Pod as the main Postgres instance and backs up both the Write-Ahead Log (WAL) and the full database, all while the instance is kept up and running. Both can then be copied to an S3 bucket for long-term storage.

This solution has been part of the main container up to now, but it looks like the project is aiming for a more plugin-driven architecture, and their first step was to extract this backup functionality into the Barman Cloud Plugin. I learned about this through an entry in their 1.26 release notes back in May. They also re-organized their operand container images at the beginning of the year. There are now three image types:

  • minimal: Images based on Debian with only the minimum of packages to support CloudNativePG
  • standard: Minimal image, plus a few tools like PGAudit
  • system: Equivalent to the old images, but now based on the “standard” image and with Barman Cloud Backup still integrated

As the README mentions:

IMPORTANT: The system images are deprecated and will be removed once in-core support for Barman Cloud in CloudNativePG is phased out. While you can still use them as long as in-core Barman Cloud remains available, you should plan to migrate to either a minimal or standard image together with the Barman Cloud plugin—or adopt another supported backup solution.

So at some point soon, running CNPG with backups will no longer be possible without also running the Barman Cloud Plugin (or another supported backup solution).

What I’m currently missing (or have completely overlooked?) are instructions for how to migrate from the system image to either standard or minimal. I also distinctly remember reading that you cannot just replace the system image with standard or minimal, but for the life of me, I can’t find where I read that at the moment. 🤦

Preparations: cert-manager

For the migration, I followed the official docs, and their first step is installing the Barman Cloud Plugin, documented here. The install has one prerequisite: cert-manager.

I’ve not been using cert-manager in my Homelab up to now, because I generally don’t need internal certs, and my external Let’s Encrypt cert is a wildcard cert, which requires DNS challenges. My current DNS host does not offer any API to change DNS records, so I can’t use cert-manager there either.

But now I needed it. I used the official Helm chart, following the installation docs here.

My values.yaml file looks like this:

global:
  commonLabels:
    homelab/part-of: cert-manager
crds:
  enabled: true
  keep: true
replicaCount: 1
enableCertificateOwnerRef: true
resources:
  requests:
    cpu: 200m
    memory: 256Mi
  limits:
    memory: 512Mi
prometheus:
  enabled: false
webhook:
  resources:
    requests:
      cpu: 200m
      memory: 100Mi
    limits:
      memory: 256Mi
  extraArgs:
    - "--logging-format=json"
cainjector:
  enabled: true
  extraArgs:
    - "--logging-format=json"
extraArgs:
  - "--logging-format=json"

The limits are likely a bit high, but I like to run a new app with higher limits for a while, gathering a few weeks’ worth of metrics before settling on tighter resource requests and limits.
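
For reference, installing the chart with these values boils down to something like the following (chart repo and name as per the cert-manager docs; pinning a version is up to you):

helm repo add jetstack https://charts.jetstack.io
helm repo update
helm install cert-manager jetstack/cert-manager \
  --namespace cert-manager \
  --create-namespace \
  --values values.yaml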

The deployment went pretty smoothly. I did not set up any Issuers, as the Barman Plugin manifest brings its own self-signed Issuer along, and I do not intend to use cert-manager for anything else for now.
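
For context, such a self-signed Issuer is conceptually just a few lines; a sketch of what the plugin ships (the actual name and namespace in the plugin manifest may differ):

apiVersion: cert-manager.io/v1
kind: Issuer
metadata:
  name: selfsigned-issuer
  namespace: cnpg-system
spec:
  selfSigned: {}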

But later during the Barman Plugin deployment, I got these error messages:

  Error: 3 errors occurred:
        * Internal error occurred: failed calling webhook "webhook.cert-manager.io": failed to call webhook: Post "https://cert-manager-webhook.cert-manager.svc:443/validate?timeout=30s": net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)
        * Internal error occurred: failed calling webhook "webhook.cert-manager.io": failed to call webhook: Post "https://cert-manager-webhook.cert-manager.svc:443/validate?timeout=30s": net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)
        * Internal error occurred: failed calling webhook "webhook.cert-manager.io": failed to call webhook: Post "https://cert-manager-webhook.cert-manager.svc:443/validate?timeout=30s": net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)

This indicated that the kube-apiserver was not able to talk to the webhook that cert-manager installs. I took a quick look at the Cilium firewall logs, as I was pretty sure that my network policies were to blame.

For that, I first figured out which host the webhook Pod was running on:

kubectl get -n cert-manager pods -o wide

Next, I needed to find the Cilium Pod responsible for that host:

kubectl get -n kube-system pods -o wide | grep cilium | grep <HOSTNAME>

Then I could launch the cilium monitor:

kubectl -n kube-system exec -ti cilium-smjsx -- cilium monitor --type drop

And this was the output:

xx drop (Policy denied) flow 0x0 to endpoint 2635, ifindex 6, file bpf_lxc.c:2127, , identity remote-node->2444: 10.8.0.108:38980 -> 10.8.15.207:10250 tcp SYN

For info, 10.8.15.207 was the webhook Pod. The thing is, I thought I had already set up the network policy for the cert-manager namespace to allow access from the kube-apiserver:

apiVersion: "cilium.io/v2"
kind: CiliumNetworkPolicy
metadata:
  name: cert-manager-kube-system
spec:
  endpointSelector:
    matchLabels:
      app.kubernetes.io/component: webhook
  ingress:
    - fromEndpoints:
      - matchLabels:
          io.kubernetes.pod.namespace: kube-system
          component: kube-apiserver

But that’s where the drop message from the Cilium monitor comes into play, specifically this part:

identity remote-node->2444: 10.8.0.108:38980 -> 10.8.15.207:10250

First, the identity was not a Pod, but the generic remote-node identity. Checking the IP, I found that it belonged to the Cilium host interface of one of my control plane nodes. That makes sense, considering that the kube-apiserver runs on the host network, not the cluster’s Pod network.

My next attempt to get the desired network policy setup was to use the kube-apiserver identity.
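
For illustration, that attempt looked roughly like this, matching Cilium’s special kube-apiserver entity instead of Pod labels:

apiVersion: "cilium.io/v2"
kind: CiliumNetworkPolicy
metadata:
  name: cert-manager-kube-system
spec:
  endpointSelector:
    matchLabels:
      app.kubernetes.io/component: webhook
  ingress:
    - fromEntities:
        - kube-apiserver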

But that, similarly, did not work. An explanation can be found in this issue: Cilium derives the kube-apiserver identity from the endpoints of the kubernetes Service in the default namespace. In my case, those were the local network host IPs of my three control plane nodes, not the IPs of their Cilium host interfaces, which is why this approach failed as well.

What I finally landed on were node identities. The final network policy looks like this:

apiVersion: "cilium.io/v2"
kind: CiliumNetworkPolicy
metadata:
  name: cert-manager-kube-system
spec:
  endpointSelector:
    matchLabels:
      app.kubernetes.io/component: webhook
  ingress:
    - fromNodes:
        - matchLabels:
            node-role.kubernetes.io/control-plane: ""

It’s a bit broader than I would like, as it allows anything on the control plane nodes that communicates via the host interface to access the webhook. But it’s the best I could come up with.

Deploying the Barman Cloud Plugin

The deployment of the plugin itself is not too involved. The only annoying thing is that it is only provided as an all-in-one manifest. So I split that up into its individual manifests and turned them into a proper Helm chart. Again, mostly copy and paste.

The only open question was: how would I handle updates? So in addition to the separated manifests, I also put the official all-in-one YAML file into my repo. When it comes time to update, I only need to overwrite that file with the new release, and Git will tell me which parts of my chart need updating.

Perhaps not the most elegant solution, but I’m pretty sure it will work just fine.
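
In practice, an update then looks roughly like this (the release asset URL is an assumption, check the plugin’s releases page for the actual one):

# Overwrite the stored all-in-one manifest with the new release (URL is illustrative)
curl -fsSL -o barman-cloud-plugin.yaml \
  https://github.com/cloudnative-pg/plugin-barman-cloud/releases/latest/download/manifest.yaml
# Let Git show what changed, then port those changes into the Helm chart templates
git diff barman-cloud-plugin.yaml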

Now onto the actual migration.

Migrating my CNPG clusters

The migration itself needed a few manual steps, but it was very straightforward and went without any problems. I followed the official docs, found here.

As an example, let’s look at my Wallabag DB, which was the first one I migrated:

---
apiVersion: postgresql.cnpg.io/v1
kind: Cluster
metadata:
  name: wallabag-pg-cluster
spec:
  instances: 2
  imageName: "ghcr.io/cloudnative-pg/postgresql:17.2"
  bootstrap:
    initdb:
      database: wallabag
      owner: wallabag
  resources:
    requests:
      memory: 200M
      cpu: 150m
  postgresql:
    parameters:
      [...]
  storage:
    size: 1.5G
    storageClass: rbd-fast
  backup:
    barmanObjectStore:
      endpointURL: http://my-ceph-rook-cluster:80
      destinationPath: "s3://backup-cnpg/"
      s3Credentials:
        accessKeyId:
          name: backups-s3-secret-wallabag
          key: AccessKey
        secretAccessKey:
          name: backups-s3-secret-wallabag
          key: SecretKey
    retentionPolicy: "30d"
---
apiVersion: postgresql.cnpg.io/v1
kind: ScheduledBackup
metadata:
  name: wallabag-pg-backup
spec:
  method: barmanObjectStore
  immediate: true
  schedule: "0 30 1 * * *"
  backupOwnerReference: self
  cluster:
    name: wallabag-pg-cluster

Wallabag is not something I use too much - I’m really bad at the “reading it later” part of “Read it later”. 🤦 So it was the ideal database to start with, as I could live with it being down for a little while, should anything go wrong.

As the docs state, the first step is to add the new ObjectStore object:

apiVersion: barmancloud.cnpg.io/v1
kind: ObjectStore
metadata:
  name: wallabag-pg-store
spec:
  retentionPolicy: "30d"
  configuration:
    endpointURL: http://my-ceph-rook-cluster:80
    destinationPath: "s3://backup-cnpg/"
    s3Credentials:
      accessKeyId:
        name: backups-s3-secret-wallabag
        key: AccessKey
      secretAccessKey:
        name: backups-s3-secret-wallabag
        key: SecretKey

This is a verbatim copy of the spec.backup.barmanObjectStore element of the original Cluster object, plus the spec.backup.retentionPolicy. I then deployed the chart to create the ObjectStore. On its own, this doesn’t do anything yet.
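
A quick sanity check that the new object actually exists (the resource name here is my assumption based on the CRD group, and the namespace matches my Wallabag setup):

kubectl get -n wallabag objectstores.barmancloud.cnpg.io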

The next step is the reconfiguration of the Cluster. For this, I removed the entire spec.backup section and replaced it with a spec.plugins section, which looks like this:

  plugins:
    - name: barman-cloud.cloudnative-pg.io
      isWALArchiver: true
      parameters:
        barmanObjectName: wallabag-pg-store

Note that the plugins[0].parameters.barmanObjectName entry needs to be the name of the previously created ObjectStore. Then the Helm chart can be deployed again, and this is where the change happens. CNPG will now restart each of the Pods for the Wallabag cluster. Each Pod will gain a new plugin-barman-cloud container, which will run the backup steps from now on.
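
A quick way to confirm that the new sidecar is present after the restart (Pod and namespace names match the example above):

kubectl get -n wallabag pod wallabag-pg-cluster-2 \
  -o jsonpath='{range .spec.containers[*]}{.name}{"\n"}{end}'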

To verify that the backups were actually working after that, I checked the logs for the plugin-barman-cloud container with kubectl logs -n wallabag wallabag-pg-cluster-2 -c plugin-barman-cloud:

{"level":"info","ts":"2025-09-07T09:56:22.497125867Z","msg":"Archived WAL file","walName":"/var/lib/postgresql/data/pgdata/pg_wal/0000001A0000000200000019","startTime":"2025-09-07T09:56:14.33717675Z","endTime":"2025-09-07T09:56:22.496787018Z","elapsedWalTime":8.159610286,"logging_pod":"wallabag-pg-cluster-2"}
{"level":"info","ts":"2025-09-07T10:01:15.059185561Z","msg":"Executing barman-cloud-wal-archive","logging_pod":"wallabag-pg-cluster-2","walName":"/var/lib/postgresql/data/pgdata/pg_wal/0000001A000000020000001A","options":["--endpoint-url","http://my-ceph-rook-cluster.svc:80","--cloud-provider","aws-s3","s3://backup-cnpg/","wallabag-pg-cluster","/var/lib/postgresql/data/pgdata/pg_wal/0000001A000000020000001A"]}
{"level":"info","ts":"2025-09-07T10:01:23.348694147Z","msg":"Archived WAL file","walName":"/var/lib/postgresql/data/pgdata/pg_wal/0000001A000000020000001A","startTime":"2025-09-07T10:01:15.059148765Z","endTime":"2025-09-07T10:01:23.348614703Z","elapsedWalTime":8.289465957,"logging_pod":"wallabag-pg-cluster-2"}

Once that was done, the last step was the ScheduledBackup. There were two changes to make: replacing method: barmanObjectStore with method: plugin, and adding a pluginConfiguration section:

apiVersion: postgresql.cnpg.io/v1
kind: ScheduledBackup
metadata:
  name: wallabag-pg-backup
spec:
  method: plugin
  immediate: true
  schedule: "0 30 1 * * *"
  backupOwnerReference: self
  cluster:
    name: wallabag-pg-cluster
  pluginConfiguration:
    name: barman-cloud.cloudnative-pg.io
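
For the impatient, an on-demand backup can also be triggered with a Backup object using the plugin method; a sketch based on the ScheduledBackup above (the object name is illustrative):

apiVersion: postgresql.cnpg.io/v1
kind: Backup
metadata:
  name: wallabag-pg-backup-manual
spec:
  method: plugin
  pluginConfiguration:
    name: barman-cloud.cloudnative-pg.io
  cluster:
    name: wallabag-pg-cluster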

In the end, I just waited for a night to pass to make sure that the base backups also happened. I checked those by looking at the base/ paths for each cluster in my CNPG backup bucket, for example with s3cmd -c ~/.s3-k8s ls -H "s3://backup-cnpg/mastodon-pg-cluster/base/20250908T013024/":

2025-09-08 01:34  1460   s3://backup-cnpg/mastodon-pg-cluster/base/20250908T013024/backup.info
2025-09-08 01:34     3G  s3://backup-cnpg/mastodon-pg-cluster/base/20250908T013024/data.tar

Yupp, files are there.

And that’s it already. Overall, it took just a rather lazy afternoon to do it all. Sure, I would have liked it to be a bit more automated, but eh. It was a one-time thing, and the manual changes were not complicated, so I could do them with at most half my brain engaged.

The one thing I’m hoping for is that they might add the Barman Cloud Plugin as an optional component of the CNPG Helm chart, which would make deploying the overall solution a bit easier while still allowing users to replace Barman with another backup solution.