Wherein I’m finally done with the backups in my Homelab’s k8s cluster.
This is part 15 of my k8s migration series.
Finally, I’m done. After months of writing Python code for my backup operator. Especially during the middle phase of the implementation, after the initial planning and design, it felt like a slog. Dozens of tasks, many functions to implement, and seemingly no end in sight. I’m rather elated to finally be able to write another post in the k8s migration series.
In this post, I will not write much about how the operator is implemented. Instead, I will give it the same treatment I gave the other apps I’ve deployed into the cluster up to now, as if I hadn’t written it myself.
If you’re interested in the implementation, take a look at the series of posts I wrote about it.
Suffice it to say for now that the operator reads a CRD defining some S3 buckets and PersistentVolumeClaims to be backed up, and launches Pods which use rclone and restic to do just that.
Infrastructure Preparation
Besides the backup operator itself, I also need some additional infrastructure. The backups themselves use restic with an S3 bucket as a target. I’m going with one bucket per app here. So before I can run the first backups, I need a couple of S3 users and buckets.
If you would like to read a bit more about my S3 setup, have a look at this post.
The first two things needed are the backups and service-backup-user users. The backups user is the owner of all of the backup buckets, while service-backup-user is a reduced-permissions user for the actual backups:
apiVersion: ceph.rook.io/v1
kind: CephObjectStoreUser
metadata:
  name: backups
  namespace: rook-cluster
spec:
  store: rgw-bulk
  clusterNamespace: rook-cluster
  displayName: "Common user for backup buckets"
---
apiVersion: ceph.rook.io/v1
kind: CephObjectStoreUser
metadata:
  name: service-backup-user
  namespace: rook-cluster
spec:
  store: rgw-bulk
  clusterNamespace: rook-cluster
  displayName: "User for service backups"
With these two manifests, Rook Ceph will create two S3 users in my bulk storage, which is the part of my Ceph cluster backed by HDDs.
Because I’m doing the bucket management itself through Ansible, I also need to push these secrets to my Vault instance, to make them available during Ansible runs. Although, now that I’m writing this, I’m wondering whether Ansible has a k8s Secrets lookup plugin? Something to look into later.
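(For what it’s worth, the kubernetes.core collection does ship a k8s lookup plugin, so a variant without the Vault round-trip might look roughly like this. This is an untested sketch; the Secret name and data keys are the ones from the Rook-created Secret referenced in the PushSecret below, and the data values come back base64-encoded:)

vars:
  # Untested sketch: fetch the Rook-created Secret directly from the k8s API.
  # query() returns a list of matching objects; Secret data values are base64-encoded.
  rook_backup_secret: "{{ query('kubernetes.core.k8s', kind='Secret', namespace='rook-cluster', resource_name='rook-ceph-object-user-rgw-bulk-backups') | first }}"
  s3_access: "{{ rook_backup_secret.data.AccessKey | b64decode }}"
  s3_secret: "{{ rook_backup_secret.data.SecretKey | b64decode }}"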
For pushing Secrets to Vault (and creating Secrets from Vault data), I’m using external-secrets. Specifically, PushSecret in this case:
apiVersion: external-secrets.io/v1alpha1
kind: PushSecret
metadata:
  name: s3-backupsuser
  namespace: rook-cluster
spec:
  deletionPolicy: Delete
  refreshInterval: 30m
  secretStoreRefs:
    - name: homelab-vault
      kind: ClusterSecretStore
  selector:
    secret:
      name: rook-ceph-object-user-rgw-bulk-backups
  data:
    - match:
        secretKey: AccessKey
        remoteRef:
          remoteKey: secrets/backups
          property: access
    - match:
        secretKey: SecretKey
        remoteRef:
          remoteKey: secrets/backups
          property: secret
As always with secrets-related stuff, this is a bit obfuscated.
What this manifest does is take the Secret automatically created by Rook Ceph, rook-ceph-object-user-rgw-bulk-backups, and push the S3 access key and secret key to the backups entry in the secrets KV store in Vault.
Then I’m creating the S3 buckets themselves. I’m doing this with the Ansible amazon.aws.s3_bucket module. The Ansible play looks like this:
- hosts: candc
  name: Play for creating the backup buckets
  tags:
    - backup
  vars:
    s3_access: "{{ lookup('hashi_vault', 'secret=secret/backups:access token='+vault_token+' url='+vault_url) }}"
    s3_secret: "{{ lookup('hashi_vault', 'secret=secret/backups:secret token='+vault_token+' url='+vault_url) }}"
  tasks:
    - name: Create service backup buckets
      tags:
        - backup
        - buckets
      amazon.aws.s3_bucket:
        name: backup-{{ item }}
        access_key: "{{ s3_access }}"
        secret_key: "{{ s3_secret }}"
        ceph: true
        endpoint_url: https://s3.example.com
        state: present
        policy: "{{ lookup('ansible.builtin.template','bucket-policies/backup-services.json.template') }}"
      loop:
        - audiobookshelf
This play first fetches the credentials pushed into Vault with the PushSecret above, using the hashi_vault lookup plugin. Be cautious when looking for info on Vault in Ansible: Ansible’s own secret storage is unfortunately also called vault. The lookup uses the Vault token I have to generate on my C&C host before I can do pretty much anything, which in turn needs credentials not stored on said host.
My backup buckets always follow the backup-$APP naming convention, and I’m iterating over the apps I need backup buckets for via a loop.
Also worth mentioning is the policy set here, which is the S3 bucket policy for the new bucket. It’s created from this template:
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Action": [
        "s3:GetObject",
        "s3:DeleteObject",
        "s3:PutObject",
        "s3:ListBucket",
        "s3:GetBucketLocation"
      ],
      "Effect": "Allow",
      "Resource": [
        "arn:aws:s3:::backup-{{ item }}/*",
        "arn:aws:s3:::backup-{{ item }}"
      ],
      "Principal": {
        "AWS": [
          "arn:aws:iam:::user/service-backup-user"
        ]
      }
    },
    {
      "Action": [
        "s3:GetObject",
        "s3:ListBucket",
        "s3:GetBucketLocation"
      ],
      "Effect": "Allow",
      "Resource": [
        "arn:aws:s3:::backup-{{ item }}/*",
        "arn:aws:s3:::backup-{{ item }}"
      ],
      "Principal": {
        "AWS": [
          "arn:aws:iam:::user/external-backup-user"
        ]
      }
    }
  ]
}
Through the magic of Jinja2 and some naming conventions, this policy template will allow my service backup user to access all of the APIs needed by restic, meaning read and write access. The second user, external-backup-user, is the user I use to run backups to an external HDD. This user is more restricted than the service backup user, because it only needs read access and never writes to the backup buckets.
Short aside: Why use Ansible for the bucket creation, instead of Rook’s ObjectBucketClaim? Simple answer: Because of policies. Until very recently, there was no way to configure a bucket policy via an ObjectBucketClaim, so I would have needed to reach for Ansible or something else anyway. That’s why I decided to go ahead and do the bucket creation with Ansible as well.
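For comparison, a plain ObjectBucketClaim would have looked roughly like this. A sketch only: the StorageClass name is an assumption, and it lacks the bucket policy that made me go the Ansible route in the first place:

apiVersion: objectbucket.io/v1alpha1
kind: ObjectBucketClaim
metadata:
  name: backup-audiobookshelf
  namespace: backups
spec:
  bucketName: backup-audiobookshelf
  # Assumed name of a StorageClass pointing at the rgw-bulk object store.
  storageClassName: rgw-bulk-bucket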
Just for completeness’ sake, I also created an ExternalSecret for my restic backup password:
apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
  name: "restic"
  labels:
    homelab/part-of: service-backups
spec:
  secretStoreRef:
    name: homelab-vault
    kind: ClusterSecretStore
  refreshInterval: "1h"
  target:
    creationPolicy: 'Owner'
  data:
    - secretKey: pw
      remoteRef:
        key: secret/restic
        property: password
Incidentally, looking at the SecretStore name: I really need to stop prefixing everything with “homelab” or “hl”. 😅
Last but not least, I also need a sort of scratch volume, to which backed-up S3 buckets are copied before being slurped up by restic. It’s a PVC looking like this:
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: vol-service-backup-scratch
  labels:
    homelab/part-of: service-backups
spec:
  storageClassName: homelab-fs
  resources:
    requests:
      storage: 50Gi
  accessModes:
    - ReadWriteMany
It needs to be RWX because it’s shared among all backups for all apps, instead of having one volume per app. So instead of my customary Ceph RBD volume, it’s a CephFS volume. This is one part of my backup setup I still need to improve. At some point, fully cloning an S3 bucket to a local disk and then feeding it into restic might no longer be feasible.
Anyway, that’s all the yak shaving necessary. Let’s look at the backup operator itself.
The operator deployment
Because this is an operator, the first thing to consider is what access it needs to the k8s API. For this, I defined one Role and one ClusterRole. The ClusterRole is necessary so the operator can access a number of resources in all namespaces, while the Role is for things where it only needs access in its own namespace.
Let’s begin with the ClusterRole:
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: hlbo-cluster-role
rules:
  # Needed for Kopf Framework
  - apiGroups: [apiextensions.k8s.io]
    resources: [customresourcedefinitions]
    verbs: [list, watch]
  - apiGroups: [""]
    resources: [events]
    verbs: [create]
  - apiGroups: [""]
    resources:
      - namespaces
    verbs:
      - list
      - watch
  - apiGroups: [""]
    resources:
      - persistentvolumes
      - persistentvolumeclaims
    verbs:
      - get
  - apiGroups: ["storage.k8s.io"]
    resources:
      - volumeattachments
    verbs:
      - get
      - list
  - apiGroups: ["mei-home.net"]
    resources:
      - homelabbackupconfigs
      - homelabservicebackups
    verbs:
      - get
      - watch
      - list
      - patch
      - update
  - apiGroups: ["batch"]
    resources:
      - jobs
    verbs:
      - get
      - watch
      - list
      - create
A number of things in here are requirements of the Kopf framework I used to implement the operator. It needs to be able to watch CRDs because it handles them; the HomelabBackupConfigs and HomelabServiceBackups are the two CRDs I introduced. PersistentVolumeClaims, PersistentVolumes and VolumeAttachments are needed because that’s what the operator backs up. Both PersistentVolumes and VolumeAttachments are cluster-level resources, and because PVCs generally live in the namespace of the app they’re used by, the operator needs cluster-wide access for those as well. Finally, the cluster-wide access to Jobs is due to a quirk of Kopf. I really only need to access Jobs in the operator’s own namespace, to launch and monitor them. But I’m using Kopf’s event handler mechanism to watch for Job events, so I can react when a Job finishes or fails, and Kopf only knows a single, global setting for which APIs it uses: either the cluster-level ones or the namespaced ones. This can’t be configured per handler, only for the entire instance.
So in the end, even though I only need control over Jobs in the operator’s own namespace, I still have to grant cluster-wide access.
Next, the Role:
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: hlbo-role
  namespace: backups
rules:
  - apiGroups: [""]
    resources:
      - configmaps
    verbs:
      - get
      - watch
      - list
      - patch
      - update
      - create
      - delete
I’m creating ConfigMaps for the individual Jobs during the backup run, so I need access to them. But in this case, I implemented all of the necessary API access myself with explicit calls, without Kopf’s involvement. This allowed me to scope the access rights to a single namespace.
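Not shown so far, but needed to wire everything up: both roles get bound to the operator’s ServiceAccount, hlbo-account, which the Deployment below references. The bindings are the usual boilerplate and look roughly like this (the binding names are my guess here, the rest follows from the manifests above):

apiVersion: v1
kind: ServiceAccount
metadata:
  name: hlbo-account
  namespace: backups
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: hlbo-cluster-role-binding
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: hlbo-cluster-role
subjects:
  - kind: ServiceAccount
    name: hlbo-account
    namespace: backups
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: hlbo-role-binding
  namespace: backups
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: hlbo-role
subjects:
  - kind: ServiceAccount
    name: hlbo-account
    namespace: backups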
Then there’s the general backup configuration, which is set with the HomelabBackupConfig CRD. These are configuration options which don’t differ per app, and so can be set centrally, instead of having a block of similar config settings in every individual app’s backup config. For my deployment, it looks like this:
apiVersion: mei-home.net/v1alpha1
kind: HomelabBackupConfig
metadata:
  name: backup-config
  namespace: backups
  labels:
    homelab/part-of: hlbo
spec:
  serviceBackup:
    schedule: "30 1 * * *"
    scratchVol: vol-service-backup-scratch
    s3BackupConfig:
      s3Host: s3.example.com:443
      s3Credentials:
        secretName: s3-backup-buckets-cred
        accessKeyIDProperty: AccessKey
        secretKeyProperty: SecretKey
    s3ServiceConfig:
      s3Host: s3.example.com:443
      s3Credentials:
        secretName: s3-backup-buckets-cred
        accessKeyIDProperty: AccessKey
        secretKeyProperty: SecretKey
    resticPasswordSecret:
      secretName: restic
      secretKey: pw
    resticRetentionPolicy:
      daily: 7
      weekly: 6
      monthly: 6
      yearly: 1
    jobSpec:
      jobNS: "backups"
      image: harbor.example.com/homelab/hn-backup:5.0.0
      command:
        - "hn-backup"
        - "kube-services"
This configures service backups to run every night at 01:30. It also configures the credentials and S3 servers for both the location of the apps’ S3 buckets and the location of the backup buckets. These are currently the same, but if I ever run two types of S3, e.g. because for some reason I decide to add a second Ceph cluster or a MinIO instance, I can have the service and backup buckets on different S3 servers.
Also of interest might be the retention policy. This keeps the backups for the last 7 days, the backups for the Sundays of the last 6 weeks, the backups of the last day of the month for the last 6 months and finally the backup from December 31st of the previous year.
Finally, there’s the definition of the container image and command to run during individual backups, just in case I ever decide to change my setup for the individual backups but want to keep the operator going.
And here, finally, the operator’s deployment:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: hlbo
  labels:
    homelab/part-of: hlbo
spec:
  replicas: 1
  selector:
    matchLabels:
      homelab/app: hlbo
      homelab/part-of: hlbo
  strategy:
    type: "Recreate"
  template:
    metadata:
      labels:
        homelab/app: hlbo
        homelab/part-of: hlbo
    spec:
      serviceAccountName: hlbo-account
      containers:
        - name: hl-backup-operator
          image: harbor.example.com/homelab/hl-backup-operator:1.1.0
          args:
            - "-A"
            - "-v"
            - "-d"
          imagePullPolicy: Always
          resources:
            requests:
              cpu: 50m
              memory: 50Mi
The imagePullPolicy: Always is mostly for the current, still somewhat “beta” phase of use, so I can easily switch to using :dev images. The args are all for Kopf. The -A says that Kubernetes’ cluster API should be used, while -v and -d enable lots of debug output.
That’s it, operator deployed. Now onto configuring a backup.
Configuring backups for my Audiobookshelf instance
Audiobookshelf was the first user-facing workload I deployed in k8s after setting up all the monitoring and infrastructure. It stores everything on a single PersistentVolume, including progress and listened-to episodes for all of my podcasts. As such, I only need to back up that single PVC, and I’m good to go.
Backups are configured via the HomelabServiceBackups CRD. For my Audiobookshelf, it looks like this:
apiVersion: mei-home.net/v1alpha1
kind: HomelabServiceBackup
metadata:
  name: backup-audiobookshelf
  labels:
    {{- range $label, $value := .Values.commonLabels }}
    {{ $label }}: {{ $value | quote }}
    {{- end }}
spec:
  backupBucketName: "backup-audiobookshelf"
  backups:
    - type: pvc
      name: abs-data-volume
      namespace: audiobookshelf
The only configuration needed is the name of the backup bucket and a list of the S3 buckets and PVCs to be backed up.
In this case, my Audiobookshelf deployment only has a single PVC:
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: abs-data-volume
  labels:
    {{- range $label, $value := .Values.commonLabels }}
    {{ $label }}: {{ $value | quote }}
    {{- end }}
spec:
  storageClassName: rbd-bulk
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 100Gi
The operator will figure out where that volume is currently mounted and launch a backup Job on that host.
And that’s it! The backups are finally working, and by now several weeks’ worth of backups have been successful. It was a pretty long detour, but I did have at least some fun writing a small-ish project that I’m actually using.
The next installment of this series will come pretty soon, because I’m already done migrating my Drone CI instance on Nomad to a Woodpecker CI instance on k8s. The only thing left to do is to write the blog post.