In the last post of my Backup Operator series, I lamented the state of permissions in the kopf Kubernetes Operator framework. After some thinking, I decided to go ahead with kopf and just accept the permission/RBAC ugliness.

I’ve just finished implementing the first cluster state change in the operator, so I thought this would be a good place to write a post about my approach and setup.

The journey up to now has been pretty interesting. I learned a bit about the Kubernetes API, and a lot about how cooperative multitasking with coroutines works in Python.

Why write an entire operator?

I’ve already written some things about my backup setup in the Kubernetes migration post which triggered this operator implementation.

Just to give a short refresher: I need to run daily backups of the persistent volumes and S3 buckets of the services running in my Homelab. I currently do that by launching a run-to-completion job on every one of my Nomad hosts, which backs up all the volumes that happen to be mounted on that host at the time. I can’t do that in k8s, because it seems to lack a run-to-completion, run-on-every-host type of workload. Jobs can do the run-to-completion part, and DaemonSets can do the run-on-every-host part, but there doesn’t seem to be a workload type which can do both in one. And that’s why I’ve decided to write my own operator.

There are two main benefits to this approach, compared to my previous one. First, I will be able to explicitly schedule the second stage of my backup, which copies certain backups onto an external disk. Right now, I just schedule that phase an hour after the previous one. Second, I will be able to package the backup config with each individual service. In my current approach, the definition of which volumes and buckets to back up lives in the backup job’s config. With the Kubernetes operator, I will introduce a CRD that can be deployed together with each service, e.g. as part of its Helm chart.

Overview of the approach

As I’ve mentioned above, I will write the operator in Python and use the kopf framework to do it. This is simply because I’m currently familiar with three languages: C++, C and Python. And Python is the most comfortable of the three. Due to the RBAC problems I described in my last post, I briefly looked into other possibilities. But the Kubernetes ecosystem seems to mostly live in Golang, which I haven’t written anything in yet. And the main goal currently is to get ahead with the Homelab migration to k8s, not to learn yet another programming language. 🙂

There will be a total of three custom resources the operator will look for. The first one, HomelabBackupConfig, will be a one-per-cluster resource and looks like this:

apiVersion: apiextensions.k8s.io/v1
kind: CustomResourceDefinition
metadata:
  name: homelabbackupconfigs.mei-home.net
spec:
  scope: Namespaced
  group: mei-home.net
  names:
    kind: HomelabBackupConfig
    plural: homelabbackupconfigs
    singular: homelabbackupconfig
  versions:
    - name: v1alpha1
      served: true
      storage: true
      schema:
        openAPIV3Schema:
          type: object
          description: "This object describes the general configuration of all backups created by the Homelab backup operator."
          properties:
            spec:
              type: object
              properties:
                serviceBackup:
                  type: object
                  description: "The configuration for all service level backups created by the operator instance."
                  properties:
                    schedule:
                      type: string
                      description: "The schedule on which all service level backups will be executed."
                    scratchVol:
                      type: string
                      description: "The name of the PVC for scratch space. Needs to be a RWX volume."
                    s3BackupConfig:
                      type: object
                      description: "Configuration for S3 access to the backup buckets."
                      properties:
                        s3Host:
                          type: string
                          description: "The S3 server hosting the backup buckets."
                        s3Credentials:
                          type: object
                          description: "The S3 credentials for the backup S3 user."
                          properties:
                            secretName:
                              type: string
                              description: "The name of the Secret containing the credentials."
                            accessKeyIDProperty:
                              type: string
                              description: "The name of the property in the secretName secret with the AWS_ACCESS_KEY_ID"
                            secretKeyProperty:
                              type: string
                              description: "The name of the property in the secretName secret with the AWS_SECRET_ACCESS_KEY"
                    s3ServiceConfig:
                      type: object
                      description: "Configuration for S3 access to the service buckets which should be backed up."
                      properties:
                        s3Host:
                          type: string
                          description: "The S3 server hosting the buckets which should be backed up."
                        s3Credentials:
                          type: object
                          description: "The S3 credentials for the service S3 user."
                          properties:
                            secretName:
                              type: string
                              description: "The name of the Secret containing the credentials."
                            accessKeyIDProperty:
                              type: string
                              description: "The name of the property in the secretName secret with the AWS_ACCESS_KEY_ID"
                            secretKeyProperty:
                              type: string
                              description: "The name of the property in the secretName secret with the AWS_SECRET_ACCESS_KEY"
                    resticPasswordSecret:
                      type: object
                      description: "The Secret with the Restic password for the backups."
                      properties:
                        secretName:
                          type: string
                          description: "The name of the Secret containing the password."
                        secretKey:
                          type: string
                          description: "The name of the property in the secretName Secret which contains the Restic password."
                    jobSpec:
                      type: object
                      description: "Configuration of the Job launched for each service backup."
                      properties:
                        image:
                          type: string
                          description: "The container image to be used for all service Jobs."
                        command:
                          type: array
                          description: "The command handed to Job.spec.template.containers.command"
                          items:
                            type: string
                        env:
                          type: array
                          description: "Additional entries for the containers.env list. These entries can only be of the name,value variety. Other forms of env entries are not supported for now."
                          items:
                            type: object
                            properties:
                              name:
                                type: string
                                description: "The name of the env variable to add."
                              value:
                                type: string
                                description: "The value of the env variable to add."

This resource configures all of the common settings which will be shared by all of the individual service backups I will describe next.
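
To make that wall of schema a bit more concrete, here is what an instance of it might look like. All names and values here are made up purely for illustration:

apiVersion: mei-home.net/v1alpha1
kind: HomelabBackupConfig
metadata:
  name: homelab-backup-config
  namespace: backup-tests
spec:
  serviceBackup:
    schedule: "30 18 * * *"
    scratchVol: "backup-scratch"
    s3BackupConfig:
      s3Host: "s3.example.com"
      s3Credentials:
        secretName: "backup-s3-creds"
        accessKeyIDProperty: "access-key"
        secretKeyProperty: "secret-key"
    s3ServiceConfig:
      s3Host: "s3.example.com"
      s3Credentials:
        secretName: "service-s3-creds"
        accessKeyIDProperty: "AWS_ACCESS_KEY_ID"
        secretKeyProperty: "AWS_SECRET_ACCESS_KEY"
    resticPasswordSecret:
      secretName: "restic-password"
      secretKey: "password"
    jobSpec:
      image: "registry.example.com/backup-runner:latest"
      command:
        - "/usr/local/bin/run-backup.sh"
      env:
        - name: LOG_LEVEL
          value: "info"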

My backups will be running with restic, backing up each service into its own S3 bucket on my Rook Ceph cluster. As all service level backups will work like this and back up to the same S3 service, it makes sense to centralize the configuration instead of copying it into every service backup CRD. This configuration happens in the s3BackupConfig:

s3BackupConfig:
  type: object
  description: "Configuration for S3 access to the backup buckets."
  properties:
    s3Host:
      type: string
      description: "The S3 server hosting the backup buckets."
    s3Credentials:
      type: object
      description: "The S3 credentials for the backup S3 user."
      properties:
        secretName:
          type: string
          description: "The name of the Secret containing the credentials."
        accessKeyIDProperty:
          type: string
          description: "The name of the property in the secretName secret with the AWS_ACCESS_KEY_ID"
        secretKeyProperty:
          type: string
          description: "The name of the property in the secretName secret with the AWS_SECRET_ACCESS_KEY"

Flexibility in what the k8s Secrets have to look like is pretty important to me. I’ve been annoyed by some of the Helm charts I’ve been using prescribing exactly what the properties in the Secret need to be named, so I introduced a config option here to define not only the Secret’s name, but also the names of the properties holding the access and secret keys for the S3 credentials. The s3ServiceConfig has the same structure, but will be used for the credentials for accessing the S3 buckets of the services themselves, instead of the S3 backup buckets.
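
For example, sticking with the made-up config instance from above, the backup credentials Secret could then use whatever property names I already happen to have, like this purely illustrative one:

apiVersion: v1
kind: Secret
metadata:
  name: backup-s3-creds
  namespace: backup-tests
type: Opaque
stringData:
  access-key: "AKIAEXAMPLE"
  secret-key: "not-a-real-secret"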

The resticPasswordSecret configures the Secret holding the restic password which unlocks the restic encryption keys.

Finally, there’s the jobSpec. This will likely still change in the future, as I have not yet implemented that part. This spec will be used to create the Jobs which will run the actual backups. One will be created for each of the HomelabServiceBackup instances I will describe next. I will not go into detail on this part of the CRD today and will instead save it for when I’ve actually implemented the Job creation.

Then there’s the HomelabServiceBackup CRD:

apiVersion: apiextensions.k8s.io/v1
kind: CustomResourceDefinition
metadata:
  name: homelabservicebackups.mei-home.net
spec:
  scope: Namespaced
  group: mei-home.net
  names:
    kind: HomelabServiceBackup
    plural: homelabservicebackups
    singular: homelabservicebackup
  versions:
    - name: v1alpha1
      served: true
      storage: true
      schema:
        openAPIV3Schema:
          type: object
          description: "This object describes the configuration of the backups for a specific service."
          properties:
            spec:
              type: object
              properties:
                backupBucketName:
                  type: string
                  description: "The name of the S3 bucket to which the backup should be made."
                backups:
                  type: array
                  description: "The elements, like PVCs and S3 buckets to back up for this service."
                  items:
                    type: object
                    properties:
                      type:
                        type: string
                        description: "The Type of the element, either s3 or pvc."
                        enum:
                          - s3
                          - pvc
                      name:
                        type: string
                        description: "The name of the element, either the name of an S3 bucket or a PVC"
            status:
              type: object
              description: "Status of this service backup"
              properties:
                nextBackup:
                  type: string
                  description: "Date and time of the next backup run"
                lastBackup:
                  type: object
                  description: "Status of latest backup"
                  properties:
                    state:
                      type: integer
                      description: "State of the last backup. 1: Failed, 0: Successful"
                    timestamp:
                      type: string
                      description: "Date and time the last backup run was executed"

This CRD describes the backups to be done for an individual service. It contains two main parts, the status and the spec. In the spec, I’m configuring the S3 bucket to be used for the backup, and a list of things to back up. Right now, I’ve only got PersistentVolumeClaims and S3 buckets in mind. An instantiation might look like this:

apiVersion: mei-home.net/v1alpha1
kind: HomelabServiceBackup
metadata:
  name: test-service-backup
  namespace: backup-tests
  labels:
    homelab/part-of: hlbo
spec:
  backupBucketName: "non-existant-bucket"
  backups:
    - type: pvc
      name: non-existant-pvc
    - type: pvc
      name: another-non-existant-pvc
    - type: s3
      name: non-existant-S3-bucket

Kopf overview

Kopf has a relatively nice approach to listening for changes to the resources it is supposed to be watching. It makes use of Kubernetes’ watch API, and it combines the raw Kubernetes events into a nicer interface than you would get from the plain events alone.

The main mechanism is event handlers for a small set of events. These handlers can be defined for each of four different event categories:

  1. Creation of a new resource
  2. Resume of the handler for an already existing resource after an operator restart
  3. Deletion of a resource
  4. Change of a resource

In addition, there are daemons, which are long-running handlers. Instead of running to completion for every event, they stay active from the moment a resource is created to the moment it is deleted. They are automatically started after operator restarts as well.

Finally, there is a generic event handler, which gets the full firehose of Kubernetes events, without the nice extras like diffs that you get with kopf’s event category handlers.

The handlers are Python functions with a decorator which describes the event category they should listen on and the CRD they should listen for. Those handlers can also be combined, so the same Python function can handle both creation of a new resource and resumption after an operator restart.

Handlers generally come in two flavors: using threads or using coroutines. I spontaneously decided to go with the coroutine approach, because I had never used Python’s asyncio before, but I was familiar with coroutines from C and C++.
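
A minimal illustration of the two flavors, with made-up handler names; kopf runs the synchronous one in a thread pool, while the asynchronous one is awaited on kopf’s event loop:

import kopf


# Sync flavor: kopf executes this handler in a thread pool executor.
@kopf.on.create('homelabbackupconfigs')
def created_sync(spec, **_):
    print(f"Created with spec: {spec}")


# Async flavor: kopf awaits this coroutine on its asyncio event loop.
@kopf.on.create('homelabbackupconfigs')
async def created_async(spec, **_):
    print(f"Created with spec: {spec}")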

Handling the HomelabBackupConfig CRD

There isn’t too much to do for this CRD. There is only ever supposed to be one instance of it, and the only thing which needs to be done is to store it in memory in the operator and make it available to the handlers of the HomelabServiceBackup CRD, so they can use the config to launch their Jobs.

The implementation of the handlers themselves I kept pretty simple:

import asyncio

import kopf

import hl_backup_operator.homelab_backup_config as backupconf


@kopf.on.startup()
async def create_backup_config_cond(memo, **_):
    memo.backup_conf_cond = asyncio.Condition()


@kopf.on.create('homelabbackupconfigs')
@kopf.on.resume('homelabbackupconfigs')
@kopf.on.update('homelabbackupconfigs')
async def create_resume_update_handler(spec, meta, memo, **kwargs):
    await backupconf.handle_creation_and_change(meta["name"],
                                                memo.backup_conf_cond, spec)


@kopf.on.delete('homelabbackupconfigs')
async def delete_handler(meta, **kwargs):
    backupconf.handle_deletion(meta["name"])

This sets up a combined handler for creation, resumption and updates of the CRD. It also creates a Condition which I will later use in the HomelabServiceBackup handlers to notify them when the config changed.

The homelab_backup_config module looks like this:

import datetime
import logging

import croniter

__CONFIG = None


async def handle_creation_and_change(name, cond, spec):
    global __CONFIG
    __CONFIG = spec
    logging.info(f"Set backup config from {name} to: {spec}")
    async with cond:
        cond.notify_all()


def handle_deletion(name):
    global __CONFIG
    __CONFIG = None
    logging.warning(f"Config {name} deleted. No backups will be scheduled!")


def get_config():
    return __CONFIG


def get_next_service_time():
    if not __CONFIG:
        logging.error(
            "Service schedule time requested, but no config present.")
        return None
    if ("serviceBackup" not in __CONFIG
            or "schedule" not in __CONFIG["serviceBackup"]):
        logging.error("Config serviceBackup.schedule is missing.")
        return None

    now = datetime.datetime.now(datetime.timezone.utc)
    cron = croniter.croniter(__CONFIG["serviceBackup"]["schedule"], now)
    return cron.get_next(datetime.datetime)


def get_service_backup_spec():
    if not __CONFIG or "serviceBackup" not in __CONFIG:
        logging.error("Config serviceBackup is missing.")
        return None
    else:
        return __CONFIG["serviceBackup"]

As I said, I kept it really simple. This implementation stores the spec as received from the handler in a module-level variable __CONFIG and then has a couple of functions to make it available to the rest of the operator. The only really interesting part is the get_next_service_time function. It looks at the spec.serviceBackup.schedule value, which is a string in cron format, for example like this:

spec:
  serviceBackup:
    schedule: "30 18 * * *"

I decided to keep all times in UTC internally, just to prevent confusing myself. Instead of writing my own cron parser, I used croniter. It not only provides a parser for the cron format, but also a helper to get the date and time of the next scheduled execution, which I make use of here.
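
As a quick, standalone illustration of how croniter behaves, independent of the operator code:

import datetime

import croniter

# Given a cron expression and a starting point, croniter can compute
# the next matching datetime. A timezone-aware start keeps the results
# timezone-aware as well.
now = datetime.datetime(2024, 5, 25, 12, 0, tzinfo=datetime.timezone.utc)
cron = croniter.croniter("30 18 * * *", now)

print(cron.get_next(datetime.datetime))  # 2024-05-25 18:30:00+00:00
# Calling get_next again advances to the run after that.
print(cron.get_next(datetime.datetime))  # 2024-05-26 18:30:00+00:00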

Implementing the HomelabServiceBackup handling

The HomelabServiceBackup resource describes the backup for an individual service. In the operator, it will ultimately need to launch a Job to run the backup of the configured PersistentVolumeClaims and S3 buckets belonging to the service.

The first thing I implemented was waiting for the scheduled execution time of the backup. For this, I initially thought to use kopf’s timers, but quickly realized that those only allow for a fixed interval, while I needed an adaptable wait depending on the schedule configured in the HomelabBackupConfig. For that reason, I reached for kopf’s daemons. These are long-running handlers, and one is created for each instance of the watched resource.

The handler function itself is again simple, as I just call a separate function in a module:

import asyncio

import kopf

import hl_backup_operator.homelab_service_backup as servicebackup


@kopf.on.startup()
async def create_backup_config_cond(memo, **_):
    memo.backup_conf_cond = asyncio.Condition()


@kopf.daemon("homelabservicebackups", initial_delay=30)
async def service_backup_daemon(name, namespace, spec, memo, stopped, **_):
    await servicebackup.homelab_service_daemon(name, namespace, spec, memo,
                                               stopped)

The daemon will spend most of its time waiting, as it only needs to do something in two cases:

  1. When the scheduled time for a backup has arrived
  2. When the backup schedule changes

Let’s look at the second case first, as it is the reason for using the memo. The memo is a generic container handled by kopf and made available to all handlers. I’m creating a Condition during operator startup. Every daemon will wait on this Condition, and the handler for HomelabBackupConfig updates will notify all waiters on it when the HomelabBackupConfig changes. This is necessary because the schedule is configured in the HomelabBackupConfig, so daemons might need to adjust their wait timer.

Here is what that waiting currently looks like:

class WakeupReason(Enum):
    TIMER = auto()
    SCHEDULE_UPDATE = auto()


async def cond_waiter(cond):
    async with cond:
        await cond.wait()


async def wait_for(waittime, update_condition):
    cond_task = asyncio.create_task(cond_waiter(update_condition),
                                    name="condwait")
    sleep_task = asyncio.create_task(asyncio.sleep(waittime), name="sleepwait")
    done, pending = await asyncio.wait([cond_task, sleep_task],
                                       return_when=asyncio.FIRST_COMPLETED)

    for p in pending:
        p.cancel()
    wake_reasons = []
    for d in done:
        if d.get_name() == "condwait":
            wake_reasons.append(WakeupReason.SCHEDULE_UPDATE)
        elif d.get_name() == "sleepwait":
            wake_reasons.append(WakeupReason.TIMER)

    return wake_reasons

As I’ve noted before, I’m using Python’s asyncio module, so instead of threads, I’m using coroutines. Luckily, the Python standard library already provides the means to wait for multiple tasks and even tell me which task is done waiting when the function returns. So here, I’m creating two tasks. One is waiting on the given waittime. This is the difference between the current time and the next scheduled backup, in seconds. The second one is waiting on the condition I mentioned previously. This condition will be notified by the handler for the HomelabBackupConfig when that resource changes. This is necessary because the daemon might need to adjust its wait time if the schedule for backups has changed.

Finally, I’m checking which task finished waiting and return a list of enums telling the caller why it woke up, so it can take different actions.

Then there’s the main loop of the daemon:

async def homelab_service_daemon(name, namespace, spec, memo, stopped):
    logging.info(f"Launching daemon for {namespace}/{name}.")
    while not stopped:
        logging.debug(f"In main loop of {namespace}/{name} with spec: {spec}")
        next_run = backupconfig.get_next_service_time()
        wait_time = next_run - datetime.datetime.now(datetime.timezone.utc)
        await wait_for(wait_time.total_seconds(), memo.backup_conf_cond)
    logging.info(f"Finished daemon for {namespace}/{name}.")

This doesn’t do much at the moment, as I haven’t implemented the backups themselves yet. It runs in an endless loop, checking the stopped variable, which will be set to True by kopf if the HomelabServiceBackup this daemon is handling is deleted or the operator is stopped. Kopf will also throw a CancelledError into the coroutine in those cases, so the daemon is stopped even if it is currently waiting.

The waiting time is computed with the get_next_service_time function I discussed above.
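
I don’t catch that CancelledError anywhere yet, because there is nothing to clean up. If that ever changes, the usual asyncio pattern would be to wrap the daemon’s loop like this. This is just a sketch with the loop body elided, not what the current code does:

import asyncio
import logging


async def homelab_service_daemon(name, namespace, spec, memo, stopped):
    try:
        while not stopped:
            # ... compute the next run, update the status, wait ...
            await asyncio.sleep(1)  # stand-in for the real waiting logic
    except asyncio.CancelledError:
        # Raised by kopf into the daemon when the HomelabServiceBackup is
        # deleted or the operator shuts down while we are awaiting something.
        logging.info(f"Daemon for {namespace}/{name} cancelled, cleaning up.")
        raise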

Implementing status updates

The goal which triggered this blog post was finally getting the scheduled triggering and the status updates of the HomelabServiceBackup implemented, which was my first change of the cluster state via the operator.

My goal was to have each daemon update a field in its HomelabServiceBackup resource with the scheduled time of the next backup, which would ultimately look like this:

status:
  nextBackup: "2024-05-25T18:30:00+00:00"

The status.nextBackup field is what I was interested in setting. I first looked at the Kubernetes Python Client, but found that it did not support asyncio. But I quickly found kubernetes_asyncio. An interesting thing I learned while looking at these two libraries is that they were, for the most part, not hand-written. Instead, they use the openapi-generator to automatically generate the API code from the Kubernetes API definition. Which is pretty cool to see, to be honest. It leads to boatloads of repeated code, but the alternative of writing all that code by hand probably doesn’t bear thinking about.

Of course, one of the downsides of using the Python API client was that it would not have API support for the CRDs I’ve written for my own cluster. Instead, I needed to use the generic CustomObjectsApi.

Initially, because I wanted to specifically update the status of my resources, I looked at the patch_namespaced_custom_object_status API. But running that API against a resource which did not have a status set yet just returned a 404. It took me a long while to realize that the 404 was not due to an error on my end, but simply because the resource needs to already have a status for the status API to work.

So instead, I reached for the patch_namespaced_custom_object API. That, too, had a lot of issues. I initially thought I was the first person to use the Python API package for accessing custom objects. All the examples I could find stated that this should work:

import asyncio
from kubernetes_asyncio import client, config
from kubernetes_asyncio.client.api_client import ApiClient
from pprint import pprint
import json

async def main():
    await config.load_kube_config()

    async with ApiClient() as api:
        mine = client.CustomObjectsApi(api)
        res = await mine.patch_namespaced_custom_object("mei-home.net", "v1alpha1",
                "backups", "homelabservicebackups", "test-service-backup",
                body={"status":{"lastBackup": {"state":1, "timestamp":"foobar"}}}
                )
        pprint(res)
asyncio.run(main())

But it did not. Instead, I kept getting errors like this back:

kubernetes_asyncio.client.exceptions.ApiException: (415)
Reason: Unsupported Media Type
HTTP response body: {"kind":"Status","apiVersion":"v1","metadata":{},"status":"Failure",
"message":"the body of the request was in an unknown format - accepted media types
include: application/json-patch+json, application/merge-patch+json,
application/apply-patch+yaml",
"reason":"UnsupportedMediaType",
"code":415}

I finally found this bug. It seems to indicate that the issue is a wrong media type getting set in the content-type header. This led me to the examples file, which shows that a specific content type can be forced by adding _content_type='application/merge-patch+json' as a parameter to the patch_namespaced_custom_object call. With that addition, I was finally able to properly update the time of the next backup in the status, by adding these lines to the homelab_service_daemon function from before:

status_body = {
    "status": {
        "nextBackup": next_run.isoformat()
    }
}
await kubeapi.patch_mei_home_custom_object(
    namespace, kubeapi.HOMELABSERVICEBACKUP_PLURAL, name, status_body)

The patch_mei_home_custom_object function is just a thin wrapper around the patch_namespaced_custom_object function from above.
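
For completeness, such a wrapper might look roughly like this. The GROUP and VERSION constants are my guesses at what lives in the kubeapi module (HOMELABSERVICEBACKUP_PLURAL is referenced above), and it assumes the kubernetes_asyncio configuration has already been loaded during operator startup:

from kubernetes_asyncio import client
from kubernetes_asyncio.client.api_client import ApiClient

# Hypothetical module-level constants for my CRD group and version.
GROUP = "mei-home.net"
VERSION = "v1alpha1"
HOMELABSERVICEBACKUP_PLURAL = "homelabservicebackups"


async def patch_mei_home_custom_object(namespace, plural, name, body):
    async with ApiClient() as api:
        custom_api = client.CustomObjectsApi(api)
        # Forcing the content type works around the 415 error described above.
        return await custom_api.patch_namespaced_custom_object(
            GROUP, VERSION, namespace, plural, name, body,
            _content_type="application/merge-patch+json")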

Some notes on testing

Writing UTs was not always simple here. First of all, I needed to employ a lot of mocks to remove any attempted k8s cluster access. I’m seriously considering buying some additional Pis and setting up a test cluster. 😁

My first generic issue was: How do I even properly unit test asyncio code? Luckily, that issue was easy to answer, at least in the abstract: I used pytest-asyncio. It allows me to add @pytest.mark.asyncio to my test functions, or to entire test classes, and the pytest plugin will automatically set up the event loop infrastructure and execute the test functions with it.

Still, I had a particular challenge with testing the waiting code, specifically when it comes to testing whether the Condition properly fires. As a reminder, here is what the code looks like:

async def cond_waiter(cond):
    async with cond:
        await cond.wait()


async def wait_for(waittime, update_condition):
    cond_task = asyncio.create_task(cond_waiter(update_condition),
                                    name="condwait")
    sleep_task = asyncio.create_task(asyncio.sleep(waittime), name="sleepwait")
    done, pending = await asyncio.wait([cond_task, sleep_task],
                                       return_when=asyncio.FIRST_COMPLETED)

    for p in pending:
        p.cancel()
    wake_reasons = []
    for d in done:
        if d.get_name() == "condwait":
            wake_reasons.append(WakeupReason.SCHEDULE_UPDATE)
        elif d.get_name() == "sleepwait":
            wake_reasons.append(WakeupReason.TIMER)

    return wake_reasons

And here is my initial attempt at the test code:

import asyncio
from unittest.mock import AsyncMock, Mock

import pytest

import hl_backup_operator.homelab_service_backup as sut


@pytest.mark.asyncio
class TestCondWait:

    async def test_cond_wait_works(self):
        cond = asyncio.Condition()
        test_task = asyncio.create_task(sut.wait_for(15, cond))
        async with cond:
            cond.notify_all()
        await test_task
        res = test_task.result()
        assert res == [sut.WakeupReason.SCHEDULE_UPDATE]

I’m trying to test whether the Condition works properly. My thinking is that the code path goes like this:

  1. [testcode]: Creates an async task ready to run, executing the function under test.
  2. [appcode]: Runs until it hits the asyncio.wait line
  3. [appcode]: Now waits for either the timer to expire or the Condition to be triggered, hands back execution to the [testcode]
  4. [testcode]: Executes the cond.notify_all function
  5. [testcode]: Awaits the task, handing execution back to [appcode]
  6. [appcode]: Gets notified in cond_waiter and runs to completion

But that was not what happened. Sprinkling in some print statements, I found that the test code continues running after the create_task call, straight through the notify_all call. The first time wait_for gets to do anything is when the test code hits the await test_task line. Only then does it reach the await cond.wait line. But at this point, the test code has already executed the notify_all, and the wait_for function does not return until the timer of the sleepwait task expires, resulting in a failed UT.

The only way I found around this issue is to have the test code explicitly hand execution off. I did this by introducing an await asyncio.sleep(0.05) before the async with cond: line of the test function. Then the wait_for function gets to run until it hits the await cond.wait, gets properly notified, and the test reliably succeeds.
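
For reference, the working version of the test then looks roughly like this. It is the same test as above, just with the explicit hand-off added:

import asyncio

import pytest

import hl_backup_operator.homelab_service_backup as sut


@pytest.mark.asyncio
class TestCondWait:

    async def test_cond_wait_works(self):
        cond = asyncio.Condition()
        test_task = asyncio.create_task(sut.wait_for(15, cond))
        # Hand execution to the event loop so wait_for can run up to its
        # `await cond.wait()` before the condition gets notified.
        await asyncio.sleep(0.05)
        async with cond:
            cond.notify_all()
        await test_task
        res = test_task.result()
        assert res == [sut.WakeupReason.SCHEDULE_UPDATE]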

This was, yet again, a case where the UT ends up being more complicated than the actual code.

One more issue I hit had to do with the merciless advance of time. Have another look at the homelab_service_daemon function:

async def homelab_service_daemon(name, namespace, spec, memo, stopped):
    logging.info(f"Launching daemon for {namespace}/{name}.")
    while not stopped:
        logging.debug(f"In main loop of {namespace}/{name} with spec: {spec}")
        next_run = backupconfig.get_next_service_time()
        wait_time = next_run - datetime.datetime.now(datetime.timezone.utc)
        status_body = {
            "status": {
                "nextBackup": next_run.isoformat()
            }
        }
        await kubeapi.patch_mei_home_custom_object(
            namespace, kubeapi.HOMELABSERVICEBACKUP_PLURAL, name, status_body)
        await wait_for(wait_time.total_seconds(), memo.backup_conf_cond)
    logging.info(f"Finished daemon for {namespace}/{name}.")

It has to compute the waiting time as the difference between the current time and the time of the next scheduled backup. But how to handle datetime.now in UTs? I initially tried to do this with a bit of fuzziness when comparing the arguments handed to the mocked wait_for with the expected wait time, but that seemed a bit too brittle.

Freezegun to the rescue. It provides a nice API to patch datetime.now (and several other related functions) so that it always returns a deterministic value. Using it in a UT to verify that homelab_service_daemon calls wait_for as expected could look like this:

@pytest.fixture()
def mock_wait_for(self, mocker):
    wait_for_mock = AsyncMock(spec=sut.wait_for)
    mocker.patch('hl_backup_operator.homelab_service_backup.wait_for',
                  side_effect=wait_for_mock)
    return wait_for_mock

async def test_daemon_waits_correctly(self, mocker, mock_wait_for):
    mock_memo = Mock()

    mock_stopped = Mock()
    mock_stopped_bool = Mock(side_effect=[False, True])
    mock_stopped.__bool__ = mock_stopped_bool

    time_now = datetime(year=2024, month=5, day=22, hour=19, minute=12,
                        second=10, tzinfo=timezone.utc)
    time_trigger = datetime(year=2024, month=5, day=22, hour=19, minute=12,
                            second=12, tzinfo=timezone.utc)

    mock_next_service_time = Mock(return_value=time_trigger)
    mocker.patch(
        'hl_backup_operator.homelab_backup_config.get_next_service_time',
        side_effect=mock_next_service_time)
    with freezegun.freeze_time(time_now):
        await sut.homelab_service_daemon("tests", "testns", {}, mock_memo,
                                          mock_stopped)

    mock_wait_for.assert_awaited_once_with(2, mock_memo.backup_conf_cond)

I’m mocking away both the wait_for and get_next_service_time functions, and I’m also defining two fixed times, one “current” time and one trigger time. In the with freezegun.freeze_time(time_now) context, datetime.now will reliably return time_now instead of the actual current time. And with that, I don’t need to rely on any fuzziness when testing time-related functionality.

Next steps

Now that I’m finally happy with the groundwork, I still need to implement a couple of features before starting on the backup Jobs themselves. The first one is proper handling of the case where there is no HomelabBackupConfig configured. Currently, the homelab_service_daemon function would crash, because get_next_service_time would return None due to not having any configured schedule. That is easily fixable by extending the waiting time to “forever”. With the Condition mechanism already in place, the daemons will be woken up once a HomelabBackupConfig appears and can then return to the right schedule.
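
A sketch of how the daemon’s main loop could handle that, reusing the cond_waiter and wait_for helpers from above and omitting the status update for brevity. This is not implemented yet, so the details may still change:

async def homelab_service_daemon(name, namespace, spec, memo, stopped):
    while not stopped:
        next_run = backupconfig.get_next_service_time()
        if next_run is None:
            # No HomelabBackupConfig present: wait only on the Condition,
            # which the config handler notifies once a config appears.
            await cond_waiter(memo.backup_conf_cond)
            continue
        wait_time = next_run - datetime.datetime.now(datetime.timezone.utc)
        await wait_for(wait_time.total_seconds(), memo.backup_conf_cond)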

The second missing feature is mostly for testing purposes. Right now, I’m only able to centrally set the schedule, which applies to all service daemons. This is bound to become cumbersome once I start testing the Job creation and monitoring, so I want to be able to trigger a single service daemon’s backup immediately. I will likely introduce another parameter in the HomelabServiceBackup CRD for that.

Alright, that’s all I have to say for now. This is my first “programming” post on this blog, and I’m honestly not sure how it came out. Were you actually able to follow, or was it a confused mess? Was it actually interesting to read? I’d be glad for some feedback, e.g. via my Fediverse account.