Wherein I migrate my CloudNativePG setup to the Barman Cloud Plugin.
During my migration from Nomad to Kubernetes, I started using CNPG for my database needs. For more details, have a look at this post. I configured their backup solution right away. It consists of a component which runs in the same Pod as the main Postgres instance and backs up both the Write-Ahead Log (WAL) and the full database, all while the instance is kept up and running. Those backups can then be copied to an S3 bucket for long-term storage.
This solution has been part of the main container up to now, but it looks like the project is aiming for a more plugin-driven architecture, and their first step was to extract this backup functionality into the Barman Cloud Plugin. I learned about this through an entry in their 1.26 release notes back in May. In addition, they also re-organized their operand container images at the beginning of the year. There are now three image types:
- minimal: Images based on Debian with only the minimum of packages to support CloudNativePG
- standard: Minimal image, plus a few tools like PGAudit
- system: Equivalent to the old images, but now based on the “standard” image and with Barman Cloud Backup still integrated
As the Readme mentions:
> IMPORTANT: The system images are deprecated and will be removed once in-core support for Barman Cloud in CloudNativePG is phased out. While you can still use them as long as in-core Barman Cloud remains available, you should plan to migrate to either a minimal or standard image together with the Barman Cloud plugin—or adopt another supported backup solution.
So at some point soon, running CNPG with backups without also running the Barman Cloud Plugin will not be possible anymore.
What I’m currently missing (or have completely overlooked?) are instructions for how to migrate from the system image to either standard or minimal. I also distinctly remember reading that you cannot just replace the system image with standard or minimal. But for the life of me, I can’t find where I read that at the moment. 🤦
Preparations: cert-manager
For the migration, I followed the official docs, and their first step is installing the Barman Cloud Plugin, documented here. The install has one prerequisite, namely that it requires cert-manager.
I’ve not been using cert-manager in my Homelab up to now, because I generally don’t need internal certs and my external Let’s Encrypt cert is a wildcard cert, which requires DNS challenges. And my current DNS host does not support any kind of API to change DNS records, so I can’t use cert-manager here either.
But now I needed it. I used the official Helm chart, following the installation docs here.
My `values.yaml` file looks like this:
```yaml
global:
  commonLabels:
    homelab/part-of: cert-manager
crds:
  enabled: true
  keep: true
replicaCount: 1
enableCertificateOwnerRef: true
resources:
  requests:
    cpu: 200m
    memory: 256Mi
  limits:
    memory: 512Mi
prometheus:
  enabled: false
webhook:
  resources:
    requests:
      cpu: 200m
      memory: 100Mi
    limits:
      memory: 256Mi
  extraArgs:
    - "--logging-format=json"
cainjector:
  enabled: true
  extraArgs:
    - "--logging-format=json"
extraArgs:
  - "--logging-format=json"
```
The limits are likely a bit high, but I like to just run a new app for a bit with higher limits to gather a few weeks worth of metrics to determine tighter resource requests/limits.
The deployment went pretty smoothly, and I did not set up any Issuers, as the Barman Plugin manifest brings its own self-signed Issuer along, and I do not intend to use cert-manager for anything else for now.
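For reference, a self-signed Issuer of the kind the plugin manifest ships is conceptually tiny. A sketch, with name and namespace purely illustrative:

```yaml
# Sketch of a cert-manager self-signed Issuer, like the one the
# Barman Cloud Plugin manifest brings along.
# Name and namespace are illustrative, not the plugin's actual values.
apiVersion: cert-manager.io/v1
kind: Issuer
metadata:
  name: selfsigned-issuer
  namespace: cnpg-system
spec:
  selfSigned: {}
```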
But later during the Barman Plugin deployment, I got these error messages:
```
Error: 3 errors occurred:
	* Internal error occurred: failed calling webhook "webhook.cert-manager.io": failed to call webhook: Post "https://cert-manager-webhook.cert-manager.svc:443/validate?timeout=30s": net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)
	* Internal error occurred: failed calling webhook "webhook.cert-manager.io": failed to call webhook: Post "https://cert-manager-webhook.cert-manager.svc:443/validate?timeout=30s": net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)
	* Internal error occurred: failed calling webhook "webhook.cert-manager.io": failed to call webhook: Post "https://cert-manager-webhook.cert-manager.svc:443/validate?timeout=30s": net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)
```
This indicated that the kube-apiserver was not able to talk to the webhook cert-manager installs. I took a quick look into the Cilium firewall logs, as I was pretty sure that my network policies were wrong.
For that, I first figured out on which host the webhook was running with this command:
```shell
kubectl get -n cert-manager pods -o wide
```
Next, I needed to find the Cilium Pod responsible for that host:
```shell
kubectl get -n kube-system pods -o wide | grep cilium | grep <HOSTNAME>
```
Then I could launch the cilium monitor:
```shell
kubectl -n kube-system exec -ti cilium-smjsx -- cilium monitor --type drop
```
And this was the output:
```
xx drop (Policy denied) flow 0x0 to endpoint 2635, ifindex 6, file bpf_lxc.c:2127, , identity remote-node->2444: 10.8.0.108:38980 -> 10.8.15.207:10250 tcp SYN
```
For info, `10.8.15.207` was the webhook Pod. The thing is, I thought I had already set up the network policy for the cert-manager namespace to allow access from the kube-apiserver:
```yaml
apiVersion: "cilium.io/v2"
kind: CiliumNetworkPolicy
metadata:
  name: cert-manager-kube-system
spec:
  endpointSelector:
    matchLabels:
      app.kubernetes.io/component: webhook
  ingress:
    - fromEndpoints:
        - matchLabels:
            io.kubernetes.pod.namespace: kube-system
            component: kube-apiserver
```
But that’s where the drop message from the Cilium monitor comes into play, specifically this part:
```
identity remote-node->2444: 10.8.0.108:38980 -> 10.8.15.207:10250
```
First, the identity was not a Pod, but the generic `remote-node` identity. Checking the IP, I found that it was the IP of the Cilium host interface for one of my control plane nodes. Which makes sense, considering that the kube-apiserver runs on the host network, not the cluster’s Pod network.

My next attempt to get the desired network policy setup was to use the `kube-apiserver` identity. But that, similarly, did not work. An explanation can be found in this issue. Namely, Cilium derives the `kube-apiserver` identity from the endpoints of the `kubernetes` service in the `default` namespace. Those, in my case, were the local network host IPs of my three control plane nodes, not the IPs of their Cilium host interfaces, which is why that approach did not work either.
What I finally landed on were node identities. The final network policy looks like this:
```yaml
apiVersion: "cilium.io/v2"
kind: CiliumNetworkPolicy
metadata:
  name: cert-manager-kube-system
spec:
  endpointSelector:
    matchLabels:
      app.kubernetes.io/component: webhook
  ingress:
    - fromNodes:
        - matchLabels:
            node-role.kubernetes.io/control-plane: ""
```
It’s a bit broader than I would like, as it allows every Pod which communicates via the host interface of a control plane node to access the webhook. But it’s the best I could come up with.
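One way to tighten it at least a little might be to also restrict the destination port. The drop message showed the webhook listening on port 10250, and Cilium ingress rules accept a `toPorts` block. A sketch I have not tested in my cluster:

```yaml
# Hypothetical tightening of the policy above: same fromNodes rule,
# but limited to the webhook's port (10250, per the drop message).
apiVersion: "cilium.io/v2"
kind: CiliumNetworkPolicy
metadata:
  name: cert-manager-kube-system
spec:
  endpointSelector:
    matchLabels:
      app.kubernetes.io/component: webhook
  ingress:
    - fromNodes:
        - matchLabels:
            node-role.kubernetes.io/control-plane: ""
      toPorts:
        - ports:
            - port: "10250"
              protocol: TCP
```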
Deploying Barman Cloud Plugin
The deployment of the plugin itself is not too involved. The only annoying thing is that it is only provided as an all-in-one manifest. So I took the different manifests and transformed them into a proper Helm chart. Again, mostly copy and paste.
The only issue was: How would I do updates? So in addition to the actual, separated manifests, I also put the official all-in-one yaml file into my repo. So when it comes time to update, I only need to overwrite the old manifest and Git will tell me which parts I would need to update.
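For completeness: Kustomize could also consume the upstream manifest directly, which would turn an update into a one-line version bump. A hypothetical sketch, with the URL left as a placeholder since it depends on the release:

```yaml
# kustomization.yaml, a hypothetical sketch. Kustomize can reference
# a remote manifest in its resources list; the URL below is a
# placeholder for the versioned all-in-one manifest, not a real link.
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
  - <URL of the versioned all-in-one plugin manifest>
```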
Perhaps not the most elegant solution, but I’m pretty sure it will work just fine.
Now onto the actual migration.
Migrating my CNPG clusters
The migration itself needed a few manual steps, but it was very straightforward and didn’t cause any problems at all. I followed the official docs from here.
For an example, let’s look at my Wallabag DB, which was the first one I migrated:
```yaml
---
apiVersion: postgresql.cnpg.io/v1
kind: Cluster
metadata:
  name: wallabag-pg-cluster
spec:
  instances: 2
  imageName: "ghcr.io/cloudnative-pg/postgresql:17.2"
  bootstrap:
    initdb:
      database: wallabag
      owner: wallabag
  resources:
    requests:
      memory: 200M
      cpu: 150m
  postgresql:
    parameters:
      [...]
  storage:
    size: 1.5G
    storageClass: rbd-fast
  backup:
    barmanObjectStore:
      endpointURL: http://my-ceph-rook-cluster:80
      destinationPath: "s3://backup-cnpg/"
      s3Credentials:
        accessKeyId:
          name: backups-s3-secret-wallabag
          key: AccessKey
        secretAccessKey:
          name: backups-s3-secret-wallabag
          key: SecretKey
    retentionPolicy: "30d"
---
apiVersion: postgresql.cnpg.io/v1
kind: ScheduledBackup
metadata:
  name: wallabag-pg-backup
spec:
  method: barmanObjectStore
  immediate: true
  schedule: "0 30 1 * * *"
  backupOwnerReference: self
  cluster:
    name: wallabag-pg-cluster
```
Wallabag is not something I use too much - I’m really bad at the “reading it later” part of “Read it later”. 🤦 So it was the ideal database to start with, as I could live with it being down for a little while, should anything go wrong.
As the docs state, the first step is to add the new ObjectStore object:
```yaml
apiVersion: barmancloud.cnpg.io/v1
kind: ObjectStore
metadata:
  name: wallabag-pg-store
spec:
  retentionPolicy: "30d"
  configuration:
    endpointURL: http://my-ceph-rook-cluster:80
    destinationPath: "s3://backup-cnpg/"
    s3Credentials:
      accessKeyId:
        name: backups-s3-secret-wallabag
        key: AccessKey
      secretAccessKey:
        name: backups-s3-secret-wallabag
        key: SecretKey
```
This is a verbatim copy of the `spec.backup.barmanObjectStore` element of the original `Cluster` object, plus `spec.backup.retentionPolicy`. I then deployed the chart to create the ObjectStore. This doesn’t do anything yet.
The next step is the reconfiguration of the Cluster. For this, I removed the entire `spec.backup` section and replaced it with a `spec.plugins` section, which looks like this:
```yaml
plugins:
  - name: barman-cloud.cloudnative-pg.io
    isWALArchiver: true
    parameters:
      barmanObjectName: wallabag-pg-store
```
Note that the `plugins[0].parameters.barmanObjectName` entry needs to be the name of the previously created ObjectStore. Then the Helm chart can be deployed again, and this is where the change happens. CNPG will now restart each of the Pods for the Wallabag cluster. Each Pod will gain a new `plugin-barman-cloud` container, which will run the backup steps from now on.

To verify that the backups were actually working after that, I checked the logs for the `plugin-barman-cloud` container with `kubectl logs -n wallabag wallabag-pg-cluster-2 -c plugin-barman-cloud`:
```
{"level":"info","ts":"2025-09-07T09:56:22.497125867Z","msg":"Archived WAL file","walName":"/var/lib/postgresql/data/pgdata/pg_wal/0000001A0000000200000019","startTime":"2025-09-07T09:56:14.33717675Z","endTime":"2025-09-07T09:56:22.496787018Z","elapsedWalTime":8.159610286,"logging_pod":"wallabag-pg-cluster-2"}
{"level":"info","ts":"2025-09-07T10:01:15.059185561Z","msg":"Executing barman-cloud-wal-archive","logging_pod":"wallabag-pg-cluster-2","walName":"/var/lib/postgresql/data/pgdata/pg_wal/0000001A000000020000001A","options":["--endpoint-url","http://my-ceph-rook-cluster.svc:80","--cloud-provider","aws-s3","s3://backup-cnpg/","wallabag-pg-cluster","/var/lib/postgresql/data/pgdata/pg_wal/0000001A000000020000001A"]}
{"level":"info","ts":"2025-09-07T10:01:23.348694147Z","msg":"Archived WAL file","walName":"/var/lib/postgresql/data/pgdata/pg_wal/0000001A000000020000001A","startTime":"2025-09-07T10:01:15.059148765Z","endTime":"2025-09-07T10:01:23.348614703Z","elapsedWalTime":8.289465957,"logging_pod":"wallabag-pg-cluster-2"}
```
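To recap the Cluster change, the migrated spec boils down to this (abbreviated to the parts that changed; everything else stays as before):

```yaml
# Abbreviated sketch of the Wallabag Cluster after the migration.
apiVersion: postgresql.cnpg.io/v1
kind: Cluster
metadata:
  name: wallabag-pg-cluster
spec:
  instances: 2
  # ...all other settings unchanged...
  # The spec.backup section is gone entirely, replaced by:
  plugins:
    - name: barman-cloud.cloudnative-pg.io
      isWALArchiver: true
      parameters:
        barmanObjectName: wallabag-pg-store
```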
Once that was done, the last step was the ScheduledBackup. There were two changes to be made: replacing `method: barmanObjectStore` with `method: plugin`, and adding a `pluginConfiguration` section:
```yaml
apiVersion: postgresql.cnpg.io/v1
kind: ScheduledBackup
metadata:
  name: wallabag-pg-backup
spec:
  method: plugin
  immediate: true
  schedule: "0 30 1 * * *"
  backupOwnerReference: self
  cluster:
    name: wallabag-pg-cluster
  pluginConfiguration:
    name: barman-cloud.cloudnative-pg.io
```
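As an aside, waiting should not strictly be necessary: by my reading of the CNPG docs, a one-off Backup object with the same method and pluginConfiguration triggers a base backup immediately. An untested sketch:

```yaml
# Untested sketch of a one-off, plugin-based Backup.
# The object name is illustrative.
apiVersion: postgresql.cnpg.io/v1
kind: Backup
metadata:
  name: wallabag-pg-backup-manual
spec:
  method: plugin
  pluginConfiguration:
    name: barman-cloud.cloudnative-pg.io
  cluster:
    name: wallabag-pg-cluster
```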
Then I just waited for a night to pass, to make sure that the base backups also happened. Those I checked by looking at the `base/` paths for each cluster in my CNPG backup bucket, for example with `s3cmd -c ~/.s3-k8s ls -H "s3://backup-cnpg/mastodon-pg-cluster/base/20250908T013024/"`:
```
2025-09-08 01:34      1460  s3://backup-cnpg/mastodon-pg-cluster/base/20250908T013024/backup.info
2025-09-08 01:34        3G  s3://backup-cnpg/mastodon-pg-cluster/base/20250908T013024/data.tar
```
Yupp, files are there.
And that’s it already. Overall, it just took a rather lazy afternoon to do it all. Sure, I could have wished that it was a bit more automated, but eh. It was a one-time thing, and the manual changes were not complicated, so I could do it with at most half my brain engaged.
The one thing I’m hoping for is that they might add the Barman Cloud plugin as an optional component to the CNPG Helm chart, which would make the deployment for the overall solution a bit easier, while still allowing users to replace Barman with another backup solution.