Wherein I’m finally done with the backups in my Homelab’s k8s cluster.
This is part 15 of my k8s migration series.
Finally, I’m done. After months of writing Python code for my backup operator. Especially during the middle phase of the implementation, after the initial planning and design, it felt like a slog. Dozens of tasks, many functions to implement, and seemingly no end in sight. I’m rather elated to finally be able to write another post in the k8s migration series.
In this post, I will not write much about how the operator is implemented. Instead, I will give it the same treatment I gave the other apps I’ve deployed into the cluster up to now, as if I hadn’t written it myself.
If you’re interested in the implementation, take a look at the series of posts I wrote about it.
Suffice it to say for now that the operator reads a CRD defining some S3 buckets and PersistentVolumeClaims to be backed up, and launches Pods which use rclone and restic to do just that.
Infrastructure Preparation
Besides the backup operator itself, I also need some additional infrastructure. The backups themselves use restic with an S3 bucket as a target. I’m going with one bucket per app here. So before I can run the first backups, I need a couple of S3 users and buckets.
If you would like to read a bit more about my S3 setup, have a look at this post.
The first two things needed are the backups and service-backup-user users. The backups user is the owner of all of the backup buckets, while service-backup-user is a reduced-permissions user for the actual backups:
apiVersion: ceph.rook.io/v1
kind: CephObjectStoreUser
metadata:
  name: backups
  namespace: rook-cluster
spec:
  store: rgw-bulk
  clusterNamespace: rook-cluster
  displayName: "Common user for backup buckets"
---
apiVersion: ceph.rook.io/v1
kind: CephObjectStoreUser
metadata:
  name: service-backup-user
  namespace: rook-cluster
spec:
  store: rgw-bulk
  clusterNamespace: rook-cluster
  displayName: "User for service backups"
With these two manifests, Rook Ceph will create two S3 users in my bulk storage, which is the part of my Ceph cluster backed by HDDs.
Because I’m doing the bucket management itself through Ansible, I also need to push these secrets to my Vault instance, to make them available during Ansible runs. Although, now that I’m writing this, I’m wondering whether Ansible has a k8s Secrets lookup plugin? Something to look into later.
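(For what it’s worth, the kubernetes.core collection does ship a k8s lookup plugin, so a variant without the Vault round-trip might look roughly like this. This is an untested sketch; the Secret name and data keys are the ones from the Rook-created Secret referenced in the PushSecret below, and the data values come back base64-encoded:)

vars:
  # Untested sketch: fetch the Rook-created Secret directly from the k8s API.
  # query() returns a list of matching objects; Secret data values are base64-encoded.
  rook_backup_secret: "{{ query('kubernetes.core.k8s', kind='Secret', namespace='rook-cluster', resource_name='rook-ceph-object-user-rgw-bulk-backups') | first }}"
  s3_access: "{{ rook_backup_secret.data.AccessKey | b64decode }}"
  s3_secret: "{{ rook_backup_secret.data.SecretKey | b64decode }}"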
For pushing Secrets to Vault (and creating Secrets from Vault data), I’m using external-secrets. Specifically, PushSecret in this case:
apiVersion: external-secrets.io/v1alpha1
kind: PushSecret
metadata:
  name: s3-backupsuser
  namespace: rook-cluster
spec:
  deletionPolicy: Delete
  refreshInterval: 30m
  secretStoreRefs:
    - name: homelab-vault
      kind: ClusterSecretStore
  selector:
    secret:
      name: rook-ceph-object-user-rgw-bulk-backups
  data:
    - match:
        secretKey: AccessKey
        remoteRef:
          remoteKey: secrets/backups
          property: access
    - match:
        secretKey: SecretKey
        remoteRef:
          remoteKey: secrets/backups
          property: secret
As always with secrets-related stuff, this is a bit obfuscated.
What this manifest does is take the Secret automatically created by Rook Ceph, rook-ceph-object-user-rgw-bulk-backups, and push the S3 access key and secret key to the backups entry in the secrets KV store in Vault.
Then I’m creating the S3 buckets themselves. I’m doing this with the Ansible amazon.aws.s3_bucket module. The Ansible play looks like this:
- hosts: candc
  name: Play for creating the backup buckets
  tags:
    - backup
  vars:
    s3_access: "{{ lookup('hashi_vault', 'secret=secret/backups:access token='+vault_token+' url='+vault_url) }}"
    s3_secret: "{{ lookup('hashi_vault', 'secret=secret/backups:secret token='+vault_token+' url='+vault_url) }}"
  tasks:
    - name: Create service backup buckets
      tags:
        - backup
        - buckets
      amazon.aws.s3_bucket:
        name: backup-{{ item }}
        access_key: "{{ s3_access }}"
        secret_key: "{{ s3_secret }}"
        ceph: true
        endpoint_url: https://s3.example.com
        state: present
        policy: "{{ lookup('ansible.builtin.template','bucket-policies/backup-services.json.template') }}"
      loop:
        - audiobookshelf
This play first fetches the credentials pushed into Vault with the PushSecret above, using the hashi_vault lookup plugin. Be cautious when looking for info on Vault in Ansible: Ansible’s own secret storage is unfortunately also called vault. The lookup uses the Vault token I have to generate on my C&C host before I can do pretty much anything, which in turn needs credentials not stored on said host.
My backup buckets always follow the backup-$APP naming convention, and I’m iterating over the apps I need backup buckets for via a loop.
Also worth mentioning is the policy set here, which is the S3 bucket policy for the new bucket. It’s created from this template:
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Action": [
        "s3:GetObject",
        "s3:DeleteObject",
        "s3:PutObject",
        "s3:ListBucket",
        "s3:GetBucketLocation"
      ],
      "Effect": "Allow",
      "Resource": [
        "arn:aws:s3:::backup-{{ item }}/*",
        "arn:aws:s3:::backup-{{ item }}"
      ],
      "Principal": {
        "AWS": [
          "arn:aws:iam:::user/service-backup-user"
        ]
      }
    },
    {
      "Action": [
        "s3:GetObject",
        "s3:ListBucket",
        "s3:GetBucketLocation"
      ],
      "Effect": "Allow",
      "Resource": [
        "arn:aws:s3:::backup-{{ item }}/*",
        "arn:aws:s3:::backup-{{ item }}"
      ],
      "Principal": {
        "AWS": [
          "arn:aws:iam:::user/external-backup-user"
        ]
      }
    }
  ]
}
Through the magic of Jinja2 and some naming conventions, this policy template will allow my service backup user to access all of the APIs needed by restic, meaning read and write access. The second user, external-backup-user, is the user I use to run backups to an external HDD. This user is more restricted than the service backup user, because it only needs read access and never writes to the backup buckets.
Short aside: Why use Ansible for the bucket creation, instead of Rook’s ObjectBucketClaim? Simple answer: Because of policies. Until very recently, there was no way to configure a bucket policy via an ObjectBucketClaim, so I would have needed to reach for Ansible or something else anyway. That’s why I decided to go ahead and do the bucket creation with Ansible as well.
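For comparison, a plain ObjectBucketClaim would have looked roughly like this. A sketch only: the StorageClass name is an assumption, and it lacks the bucket policy that made me go the Ansible route in the first place:

apiVersion: objectbucket.io/v1alpha1
kind: ObjectBucketClaim
metadata:
  name: backup-audiobookshelf
  namespace: backups
spec:
  bucketName: backup-audiobookshelf
  # Assumed name of a StorageClass pointing at the rgw-bulk object store.
  storageClassName: rgw-bulk-bucket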
Just for completeness’ sake, I also created an ExternalSecret for my restic backup password:
apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
  name: "restic"
  labels:
    homelab/part-of: service-backups
spec:
  secretStoreRef:
    name: homelab-vault
    kind: ClusterSecretStore
  refreshInterval: "1h"
  target:
    creationPolicy: 'Owner'
  data:
    - secretKey: pw
      remoteRef:
        key: secret/restic
        property: password
Incidentally, looking at the SecretStore name: I really need to stop prefixing everything with “homelab” or “hl”. 😅
Last but not least, I also need a sort of scratch volume, to which backed-up S3 buckets are copied before being slurped up by restic. It’s a PVC looking like this:
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: vol-service-backup-scratch
  labels:
    homelab/part-of: service-backups
spec:
  storageClassName: homelab-fs
  resources:
    requests:
      storage: 50Gi
  accessModes:
    - ReadWriteMany
It needs to be RWX because it’s shared among all backups for all apps, instead of having one volume per app. So instead of my customary Ceph RBD volume, it’s a CephFS volume. This is one part of my backup setup I still need to improve. At some point, fully cloning an S3 bucket to a local disk and then feeding it into restic might no longer be feasible.
Anyway, that’s all the yak shaving necessary. Let’s look at the backup operator itself.
The operator deployment
Because this is an operator, the first thing to consider is what access it needs to the k8s API. For this, I defined one Role and one ClusterRole. The ClusterRole is necessary so the operator can access a number of resources in all namespaces, while the Role is for things where it only needs access in its own namespace.
Let’s begin with the ClusterRole:
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: hlbo-cluster-role
rules:
  # Needed for Kopf Framework
  - apiGroups: [apiextensions.k8s.io]
    resources: [customresourcedefinitions]
    verbs: [list, watch]
  - apiGroups: [""]
    resources: [events]
    verbs: [create]
  - apiGroups: [""]
    resources:
      - namespaces
    verbs:
      - list
      - watch
  - apiGroups: [""]
    resources:
      - persistentvolumes
      - persistentvolumeclaims
    verbs:
      - get
  - apiGroups: ["storage.k8s.io"]
    resources:
      - volumeattachments
    verbs:
      - get
      - list
  - apiGroups: ["mei-home.net"]
    resources:
      - homelabbackupconfigs
      - homelabservicebackups
    verbs:
      - get
      - watch
      - list
      - patch
      - update
  - apiGroups: ["batch"]
    resources:
      - jobs
    verbs:
      - get
      - watch
      - list
      - create
A number of things in here are requirements of the Kopf framework I used to implement the operator. It needs to be able to watch CRDs because it handles them; the HomelabBackupConfigs and HomelabServiceBackups are the two CRDs I introduced. PersistentVolumeClaims, PersistentVolumes and VolumeAttachments are needed because that’s what the operator backs up. Both PersistentVolumes and VolumeAttachments are cluster-level resources, and because PVCs generally live in the namespace of the app they’re used by, the operator needs cluster-wide access for those as well. Finally, the cluster-wide access to Jobs is due to a quirk of Kopf. I really only need to access Jobs in the operator’s own namespace, to launch and monitor them. But I’m using Kopf’s event handler mechanism to watch for Job events, so I can react when a Job finishes or fails, and Kopf only knows a single, global setting for which APIs it uses: either the cluster-level ones or the namespaced ones. This can’t be configured per handler, only for the entire instance.
So in the end, even though I only need control over Jobs in the operator’s own namespace, I still have to grant cluster-wide access.
Next, the Role:
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: hlbo-role
  namespace: backups
rules:
  - apiGroups: [""]
    resources:
      - configmaps
    verbs:
      - get
      - watch
      - list
      - patch
      - update
      - create
      - delete
I’m creating ConfigMaps for the individual Jobs during the backup run, so I need access to them. But in this case, I implemented all of the necessary API access myself with explicit calls, without Kopf’s involvement. This allowed me to scope the access rights to a single namespace.
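Not shown so far, but needed to wire everything up: both roles get bound to the operator’s ServiceAccount, hlbo-account, which the Deployment below references. The bindings are the usual boilerplate and look roughly like this (the binding names are my guess here, the rest follows from the manifests above):

apiVersion: v1
kind: ServiceAccount
metadata:
  name: hlbo-account
  namespace: backups
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: hlbo-cluster-role-binding
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: hlbo-cluster-role
subjects:
  - kind: ServiceAccount
    name: hlbo-account
    namespace: backups
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: hlbo-role-binding
  namespace: backups
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: hlbo-role
subjects:
  - kind: ServiceAccount
    name: hlbo-account
    namespace: backups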
Then there’s the general backup configuration, which is set with the HomelabBackupConfig CRD. These are configuration options which don’t differ per app, and so can be set centrally, instead of having a block of similar config settings in every individual app’s backup config. For my deployment, it looks like this:
apiVersion: mei-home.net/v1alpha1
kind: HomelabBackupConfig
metadata:
  name: backup-config
  namespace: backups
  labels:
    homelab/part-of: hlbo
spec:
  serviceBackup:
    schedule: "30 1 * * *"
    scratchVol: vol-service-backup-scratch
    s3BackupConfig:
      s3Host: s3.example.com:443
      s3Credentials:
        secretName: s3-backup-buckets-cred
        accessKeyIDProperty: AccessKey
        secretKeyProperty: SecretKey
    s3ServiceConfig:
      s3Host: s3.example.com:443
      s3Credentials:
        secretName: s3-backup-buckets-cred
        accessKeyIDProperty: AccessKey
        secretKeyProperty: SecretKey
    resticPasswordSecret:
      secretName: restic
      secretKey: pw
    resticRetentionPolicy:
      daily: 7
      weekly: 6
      monthly: 6
      yearly: 1
    jobSpec:
      jobNS: "backups"
      image: harbor.example.com/homelab/hn-backup:5.0.0
      command:
        - "hn-backup"
        - "kube-services"
This configures service backups to run every night at 01:30. It also configures the credentials and S3 servers for both the location of the apps’ S3 buckets and the location of the backup buckets. These are currently the same, but if I ever run two types of S3, e.g. because for some reason I decide to add a second Ceph cluster or a MinIO instance, I can have the service and backup buckets on different S3 servers.
Also of interest might be the retention policy. This keeps the backups for the last 7 days, the backups for the Sundays of the last 6 weeks, the backups of the last day of the month for the last 6 months and finally the backup from December 31st of the previous year.
Finally, there’s the definition of the container image and command to run during individual backups, just in case I ever decide to change my setup for the individual backups but want to keep the operator going.
And here, finally, the operator’s deployment:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: hlbo
  labels:
    homelab/part-of: hlbo
spec:
  replicas: 1
  selector:
    matchLabels:
      homelab/app: hlbo
      homelab/part-of: hlbo
  strategy:
    type: "Recreate"
  template:
    metadata:
      labels:
        homelab/app: hlbo
        homelab/part-of: hlbo
    spec:
      serviceAccountName: hlbo-account
      containers:
        - name: hl-backup-operator
          image: harbor.example.com/homelab/hl-backup-operator:1.1.0
          args:
            - "-A"
            - "-v"
            - "-d"
          imagePullPolicy: Always
          resources:
            requests:
              cpu: 50m
              memory: 50Mi
The imagePullPolicy: Always is mostly for the current, still somewhat “beta” phase of use, so I can easily switch to using :dev images. The args are all for Kopf. The -A says that Kubernetes’ cluster API should be used, while -v and -d enable lots of debug output.
That’s it, operator deployed. Now onto configuring a backup.
Configuring backups for my Audiobookshelf instance
Audiobookshelf was the first user-facing workload I deployed in k8s after setting up all the monitoring and infrastructure. It stores everything on a single PersistentVolume, including progress and listened-to episodes for all of my podcasts. As such, I only need to back up that single PVC, and I’m good to go.
Backups are configured via the HomelabServiceBackups CRD. For my Audiobookshelf, it looks like this:
apiVersion: mei-home.net/v1alpha1
kind: HomelabServiceBackup
metadata:
  name: backup-audiobookshelf
  labels:
    {{- range $label, $value := .Values.commonLabels }}
    {{ $label }}: {{ $value | quote }}
    {{- end }}
spec:
  backupBucketName: "backup-audiobookshelf"
  backups:
    - type: pvc
      name: abs-data-volume
      namespace: audiobookshelf
The only configuration needed is the name of the backup bucket and a list of the S3 buckets and PVCs to be backed up.
In this case, my Audiobookshelf deployment only has a single PVC:
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: abs-data-volume
  labels:
    {{- range $label, $value := .Values.commonLabels }}
    {{ $label }}: {{ $value | quote }}
    {{- end }}
spec:
  storageClassName: rbd-bulk
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 100Gi
The operator will figure out where that volume is currently mounted and launch a backup Job on that host.
And that’s it! The backups are finally working, and by now several weeks’ worth of backups have been successful. It was a pretty long detour, but I did have at least some fun writing a small-ish project that I’m actually using.
The next installment of this series will come pretty soon, because I’m already done migrating my Drone CI instance on Nomad to a Woodpecker CI instance on k8s. The only thing left to do is to write the blog post.