Wherein I migrate my control plane to the Raspberry Pi 4 nodes it is intended to run on.
This is part 26 of my k8s migration series.
This one did not remotely go as well as I thought. Initially, I wasn’t even sure that this was going to be worth a blog post. But my own impatience and the slowly aging Pi 4 conspired to ensure I’ve got something to write about.
But let’s start with where we are. This is very likely the penultimate post of this series. As I’m writing this, the migration is done:
task proj:homelab.k8s.migration stats
Category Data
Pending 8
Waiting 0
Recurring 0
Completed 808
Deleted 48
Total 864
Those last eight remaining tasks are just some cleanup. But at the beginning of the weekend, I still had one major thing to do: migrating my control plane from the three virtual machines it had been living on for over a year to the three Raspberry Pi 4 4GB with attached SATA SSDs which had served as my control plane before.
Control plane here means the k8s control plane, consisting of etcd, the kube-apiserver, the kube-controller-manager and the kube-scheduler. In addition, I had kube-vip running to provide a virtual IP for the k8s API. The MONs of my Rook Ceph cluster were running there as well. And finally, my Vault instances are also assigned to those nodes.
While the kube control plane components probably don’t need any explanation, the other pieces do. Let’s start with the Ceph MONs. Why put them here, instead of on the Ceph nodes themselves? Mostly habit: it was the setup I had previously. Originally born from the thought that I might be running my Ceph nodes on Pi 4 as well. And on those hosts, memory would have been at a premium. I ended up not going with that idea, but I still liked the thought of having control plane nodes which run the server/controller components of all my major services. In the Nomad cluster setup, these nodes were running the Consul, Vault and Nomad servers as well as the Ceph MONs. I liked that setup and decided to keep it for the k8s setup. I couldn’t run the MONs on any worker nodes, because none of those have local storage. They all have their root disks on Ceph RBDs, which means they could only run the MONs for that same Ceph cluster until the first time they all went down at the same time. 😉
The reason for running Vault on the control plane nodes is one of convenience. I’ve got some automation for regular node updates. But my Vault instances need manual unsealing. This means that after the reboot as part of the regular update, I would need to manually unseal the instance on the host which was just updated. This is fine in the current setup - the controllers are the first nodes to be updated anyway, so I just need to pay attention right at the beginning of the node update playbook. And after those nodes have been restarted and their Vault instances have been unsealed, I can go and do something else.
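For illustration, that manual step boils down to running the unseal command against the freshly rebooted instance, something like this sketch (namespace and Pod names are just placeholders):
# Unseal the Vault instance on the node that was just updated.
# Namespace and Pod name are placeholders - adjust them to your deployment.
kubectl -n vault exec -it vault-0 -- vault operator unseal
# Vault prompts for an unseal key share; repeat until the unseal threshold is reached.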
So I needed to migrate the kube control plane and the MONs over to the Pis. I would need to do the following steps:
- Set up Kubernetes on the three Pi 4
- Join the three Pi 4 to the kube control plane
- Add MONs on the three new nodes, for a total of 6 Ceph MONs
- Add the new MONs to the MON lists and reboot everything
- Remove the old control plane nodes
The most complicated step here was the MON migration. That’s because the MONs are generally configured via their IPs, so some configuration had to change. Specifically, the configs outside the k8s cluster needed manual adaptation, and the most important one here was the MON list used by my netbooting hosts to get their root disks. Just to make sure everything was okay, I needed to reboot all netbooting hosts in my Homelab.
In preparation for the move, I pinned the Ceph MON Deployments to the existing control plane nodes, to make sure that they would only migrate when I told them to:
cephClusterSpec:
  placement:
    mon:
      nodeAffinity:
        requiredDuringSchedulingIgnoredDuringExecution:
          nodeSelectorTerms:
            - matchExpressions:
                - key: "homelab/role"
                  operator: In
                  values:
                    - "controller"
                - key: "kubernetes.io/hostname"
                  operator: In
                  values:
                    - "oldcp1"
                    - "oldcp2"
                    - "oldcp3"
      tolerations:
        - effect: NoSchedule
          key: node-role.kubernetes.io/control-plane
          operator: Exists
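A quick way to double-check that the pinning sticks is to look at which nodes the MON Pods actually get scheduled on:
# Rook labels the MON Pods with app=rook-ceph-mon; -o wide shows the node column
kubectl -n rook-cluster get pods -l app=rook-ceph-mon -o wide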
Migrating the k8s control plane
This was reasonably easy to accomplish. I just needed to join the three Pi 4 into the cluster as control plane nodes.
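I won’t go through the full join procedure here, but for a kubeadm-style setup it looks roughly like this sketch (the API endpoint, token, hash and certificate key are placeholders taken from the output of the first two commands):
# On an existing control plane node: re-upload the control plane certificates
# and print a fresh join command (kubeadm-style setup assumed).
sudo kubeadm init phase upload-certs --upload-certs
sudo kubeadm token create --print-join-command

# On each Pi 4, combining the output of the two commands above:
sudo kubeadm join <api-endpoint>:6443 --token <token> \
  --discovery-token-ca-cert-hash sha256:<hash> \
  --control-plane --certificate-key <certificate-key>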
But here I hit my first stumbling block. While most components - nominally including the Cilium Pod - came up, the Fluentbit Pod for log collection did not. Instead, on both hosts, it showed errors like these:
[2025/04/10 22:16:59] [ info] [filter:kubernetes:kubernetes.0] local POD info OK
[2025/04/10 22:16:59] [ info] [filter:kubernetes:kubernetes.0] testing connectivity with API server...
[2025/04/10 22:17:09] [error] [net] connection #60 timeout after 10 seconds to: kubernetes.default.svc:443
[2025/04/10 22:17:09] [error] [filter:kubernetes:kubernetes.0] kube api upstream connection error
[2025/04/10 22:17:09] [ warn] [filter:kubernetes:kubernetes.0] could not get meta for POD fluentbit-fluent-bit-ls6tt
After some fruitless research, I found this line in the logs of the Cilium Pods of the new control plane hosts:
Failed to initialize datapath, retrying later" module=agent.datapath.orchestrator error="failed to delete xfrm policies on node configuration changed: protocol not supported" retry-delay=10s
This brought me to the Cilium system requirements docs. And there it states pretty clearly that for the exact Ubuntu version I’m running, an additional package with kernel modules was needed. I didn’t have any issues with my Pi 4 worker nodes before, though. This was because those already had the linux-modules-extra-raspi package installed, as that’s needed for Ceph support, and all of my worker nodes use Ceph RBDs for their root disks. But the controller nodes never needed that, due to having local storage.
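The fix itself is just one package install and a reboot per node:
# Install the extra kernel modules for the Ubuntu raspi kernel, then reboot
# so the node comes back up with everything Cilium needs available
sudo apt install -y linux-modules-extra-raspi
sudo reboot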
After installing the additional package, the new nodes worked properly. What I found a bit disappointing was that the Cilium Pods did not show any indication that anything was wrong, besides that single log line I showed above.
Another interesting sign that something was wrong was that I saw entries like these in my firewall logs:
HomelabInterface 2025-04-11T00:34:59 300.300.300.4:39696 310.310.17.198:4240 tcp Block all local access
HomelabInterface 2025-04-11T00:34:54 300.300.300.5:42022 310.310.19.209:2020 tcp Block all local access
Which is odd, because 310.310.0.0/16 is my Pod CIDR, and those packets should really never show up at my firewall.
With that, my k8s control plane was up and running without further issue.
How not to migrate MONs
Do not follow the steps in this section. I will speculate a bit on what I did wrong, but I do not have another cluster to migrate to confirm what the right way would be.
This section is a cautionary tale, not a guide.
So let’s set the scene. At the beginning of this, I had three MON daemons running on the three old control plane nodes. Everything was fine. I planned to start with replacing two old MONs with two new ones, leaving one old MON available to the netbooting hosts with their old configuration.
So I started out with just replacing two of the old nodes with two new ones in the placement config for the MONs in the Rook Ceph cluster values.yaml file:
cephClusterSpec:
  placement:
    mon:
      nodeAffinity:
        requiredDuringSchedulingIgnoredDuringExecution:
          nodeSelectorTerms:
            - matchExpressions:
                - key: "homelab/role"
                  operator: In
                  values:
                    - "controller"
                - key: "kubernetes.io/hostname"
                  operator: In
                  values:
                    - "oldcp1"
                    - "newcp1"
                    - "newcp2"
Deploying this did not work. It left the two MONs for newcp1 and newcp2 in Pending state, because the one remaining MON was too few. I then tried to increase the number of MONs to five, with the three old nodes and the two new ones. That brought my heart to a standstill when this message showed up in the Rook operator’s logs:
2025-04-12 10:31:16.256775 I | ceph-spec: ceph-object-store-user-controller: CephCluster "k8s-rook" found but skipping reconcile since ceph health is &{Health:HEALTH_ERR Details:map[error:{Severity:Urgent Message:failed to get status. . [...]/src/mon/MonMap.h: In function 'void MonMap::add(const mon_i
nfo_t&)' thread 7f8ea6f2c640 time 2025-04-12T10:29:53.668780+0000
[...]/src/mon/MonMap.h: 221: FAILED ceph_assert(addr_mons.count(a) == 0)
Luckily for me, the operator checks the quorum before removing too many MONs, and so the cluster was not broken. I fixed this by going back to my original config, with three MONs placed on the three old control plane nodes. That still did not bring back the cluster, which kept showing the above error. I fixed that by editing the rook-ceph-mon-endpoints ConfigMap in the cluster namespace. Its data key looks something like this:
data:
  data: m=300.300.300.1:6789,k=300.300.300.2:6789,l=300.300.300.3:6789,n=300.300.300.4:6789,o=300.300.300.5:6789
  mapping: '{"node":{"k":{"Name":"oldcp1","Hostname":"oldcp1","Address":"300.300.300.1"},"l":{"Name":"oldcp2","Hostname":"oldcp2","Address":"300.300.300.2"},"m":{"Name":"oldcp3","Hostname":"oldcp3","Address":"300.300.300.3"},"n":{"Name":"newcp1","Hostname":"newcp1","Address":"300.300.300.4"},"o":{"Name":"newcp2","Hostname":"newcp1","Address":"300.300.300.5"}}}'
  maxMonId: "12"
  outOfQuorum: ""
This still had the new MONs in there, which did not work. After manually removing the entries for the MONs n and o, which were the new ones, and restarting the operator, everything came up fine again with the original three MONs on the old nodes.
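For completeness, that manual surgery amounted to something like this (the cluster namespace is mine; the operator namespace and Deployment name may differ depending on how Rook is deployed):
# Edit the MON endpoint bookkeeping and remove the entries for MONs n and o
kubectl -n rook-cluster edit configmap rook-ceph-mon-endpoints
# Then restart the Rook operator so it reconciles from the cleaned-up state
kubectl -n rook-ceph rollout restart deployment rook-ceph-operator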
So, on to attempt number two. Here I decided to go all in and immediately add all three new nodes, instead of just two. That was because I realized that I could replace all three MON addresses in the hardcoded netboot configs right away with the new MONs if I just went straight to six MONs, the three old and the three new ones. This would save me one reboot for all netbooting cluster nodes.
So then I configured six MONs, and instead of replacing MONs in the placement config, I just added the three new ones, so it now looked like this:
cephClusterSpec:
  mon:
    count: 6
  placement:
    mon:
      nodeAffinity:
        requiredDuringSchedulingIgnoredDuringExecution:
          nodeSelectorTerms:
            - matchExpressions:
                - key: "homelab/role"
                  operator: In
                  values:
                    - "controller"
                - key: "kubernetes.io/hostname"
                  operator: In
                  values:
                    - "oldcp1"
                    - "oldcp2"
                    - "oldcp3"
                    - "newcp1"
                    - "newcp2"
                    - "newcp3"
Applying this change worked without any issue whatsoever. The controller started three new MON deployments, and they all came up without any problem.
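A quick way to sanity-check this is asking Ceph itself, for example via the Rook toolbox (assuming it’s deployed under its default name):
# Should report six MONs, all of them in quorum
kubectl -n rook-cluster exec -it deploy/rook-ceph-tools -- ceph mon stat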
I then changed the hardcoded MON IPs everywhere to the IPs of the three new control plane nodes, and then rebooted the entire Homelab. Worked like a charm.
After that, the only thing remaining was to remove the old MONs. And here is where the horror really started. I can’t really put together what I did to create this situation, so take the following paragraphs with a grain of salt. I hope you can appreciate that I had other priorities than making good notes.
So I tried to get back to three MONs, but now running on the Pi 4 controller nodes, by going back to three MONs in the config and removing the three oldcp nodes from the nodeSelector.
This seemed to lead the operator into an endless loop, because for some reason it tried to stop one of the MON deployments on the new control plane nodes, even though those were not supposed to be removed.
And here, I made my mistake. I got impatient and manually deleted the k8s Deployments of the MONs I no longer needed. Or thought I no longer needed. And when that did not really help, I edited the MON map ConfigMap again and manually deleted the old MONs there as well.
The price for my impatience immediately showed up in the operator logs:
ceph-object-store-user-controller: CephCluster \"k8s-rook\" found but skipping reconcile since ceph health is &{Health:HEALTH_ERR Details:map[error:{Severity:Urgent Message:failed to get status. . timed out: exit status 1}] [...]}
The operator’s attempts to even just get the cluster status timed out. I confirmed that by trying to run ceph -s, to no avail. There were still three MONs running. But no quorum anymore. I had just nuked my storage cluster.
Or so I thought. Looking at the logs of the still running MONs, I saw this line:
e15 handle_auth_request failed to assign global_id
For some reason I can’t explain, I thought this might have to do with the old MONs still being configured somewhere. Searching the web did not really deliver any results. But I ended up on this Ceph docs page. It showed me a way to get the MON map when the Ceph client doesn’t work anymore:
kubectl exec -it -n rook-cluster rook-ceph-mon-n-f498b8448-dw55m -- bash
ceph-conf --name mon.n --show-config-value admin_socket
/var/run/ceph/ceph-mon.n.asok
With that information, I could dump the MON status info:
ceph --admin-daemon /var/run/ceph/ceph-mon.n.asok mon_status
"mons": [
{
"name": "k",
"addr": "300.300.300.1:6789/0",
"public_addr": "300.300.300.1:6789/0",
},
{
"name": "l",
"addr": "300.300.300.2:6789/0",
"public_addr": "300.300.2:6789/0",
},
{
"name": "m",
"addr": "300.300.300.3:6789/0",
"public_addr": "300.300.300.3:6789/0",
},
{
"name": "n",
"addr": "300.300.300.4:6789/0",
"public_addr": "300.300.300.4:6789/0",
},
{
"name": "o",
"addr": "300.300.300.5:6789/0",
"public_addr": "300.300.300.5:6789/0",
},
{
"name": "p",
"addr": "300.300.300.6:6789/0",
"public_addr": "300.300.300.6:6789/0",
}
]
I’ve removed a lot of additional information here, but the important part was: Yes, the old MONs were still in the MON map. So how about updating the map so it only contains the new MONs? Worth a try!
To do that, I needed to extract the actual MON map, in the correct format. But that’s, for some reason, only possible when the MON is stopped. And because we’re talking about Pods here, not baremetal deployments, I couldn’t just stop the daemon and still access its data. So I just ran the extraction anyway and had a closer look at the error message:
ceph-mon -i n --extract-monmap /tmp/monmap
2025-04-12T19:41:05.220+0000 ffffac4a5040 -1 rocksdb: IO error: While lock file: /var/lib/ceph/mon/ceph-n/store.db/LOCK: Resource temporarily unavailable
2025-04-12T19:41:05.220+0000 ffffac4a5040 -1 error opening mon data directory at '/var/lib/ceph/mon/ceph-n': (22) Invalid argument
So, I thought: How much worse could it possibly get? And manually removed /var/lib/ceph/mon/ceph-n/store.db/LOCK.
The MON didn’t seem to care and continued running, but now I had the MON map.
As I noted above: This is a cautionary tale. Not a how-to.
After I had the MON map, I needed to remove the three old MONs. For that, I was able to use the monmaptool:
monmaptool --rm k /tmp/monmap
monmaptool --rm l /tmp/monmap
monmaptool --rm m /tmp/monmap
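Before injecting it back, printing the edited map is a cheap way to confirm that only the three new MONs are left:
# List the MONs remaining in the edited map
monmaptool --print /tmp/monmap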
And then I just needed to inject the MON map again, which I could do while the MON was running:
ceph-mon -i n --inject-monmap /tmp/monmap
And then, after a restart of this utterly frankensteined MON…it came back up. And it didn’t throw the auth error anymore. And then one of the other running MONs also came up again. And then the state check errors stopped in the operator logs. And my ceph -s worked again. Much rejoicing was had. So much rejoicing.
I deleted the deployment of the last MON, as it wasn’t willing to come up again. And then the operator redeployed it, and it came up fine again.
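Rook names these Deployments after the MON’s letter, so the delete was just a one-liner (with p standing in for whichever MON is stuck):
# Delete the stuck MON Deployment and let the operator recreate it
kubectl -n rook-cluster delete deployment rook-ceph-mon-p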
Then I retreated to my fainting couch and contemplated my own hubris, stupidity and especially impatience.
But it was done. Ceph being the battle-tested piece of software it is, there were zero issues afterwards. The OSDs were happy almost the entire time and didn’t even need a restart.
If we could bottle the elation and relief I felt when the first MON started spewing its comfortably familiar log output again, we would have one hell of a drug on our hands. Bottled euphoria, pretty much.
Restoring it all from a blank Ceph cluster would have been a hell of a lot of work.
Stability problems
So I now had my control plane running on three Raspberry Pi 4 4GB. I had tried to gauge beforehand whether the Pis would have enough resources by giving the VMs that ran the control plane only four cores and 4GB of RAM, to keep them at least somewhat comparable to the Pis.
But that did not give me a realistic estimate of whether the Pis would be able to run the control plane. On the morning after I had finished the migration, I woke up to two of my Vault Pods requiring unsealing because they had been restarted. After some searching, I thought I had found the culprit in these error messages:
2025-04-13 10:16:28.000 "This node is becoming a follower within the cluster"
2025-04-13 10:16:28.000 "lost leadership, restarting kube-vip"
2025-04-13 10:16:27.372 "1 leaderelection.go:285] failed to renew lease kube-system/plndr-cp-lock: timed out waiting for the condition
2025-04-13 10:16:27.371 "1 leaderelection.go:332] error retrieving resource lock kube-system/plndr-cp-lock: Get \"https://kubernetes:6443/apis/coordination.k8s.io/v1/namespaces/kube-system/leases/plndr-cp-lock?timeout=10s\": context deadline exceeded (Client.Timeout exceeded while awaiting headers)"
After some checking I realized that, for some reason, my Ansible playbook hadn’t deployed kube-vip to two of my three Pi control plane nodes.
But that, sadly, wasn’t the real issue. In the following days, I regularly came home to find one or more of my Vault Pods having been restarted during the night.
After some intense log reading, I think I identified the problem: very regular timeouts in etcd. That’s Kubernetes’ distributed database, holding the cluster’s state. I regularly get spurious leader elections where the three nodes can’t even agree on what term it is. That then ultimately leads to timed-out requests from the kube-apiserver and restarts of kube-apiserver, kube-controller-manager and kube-scheduler. This isn’t really that bad - they seem to be perfectly able to come up again. But still.
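A good way to keep an eye on this is asking etcd directly about member health and leadership, for example like this (the certificate paths assume a kubeadm-style layout; etcdctl can also be run inside the etcd Pod instead):
# Query all etcd members for their status, including who currently leads
sudo etcdctl --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/server.crt \
  --key=/etc/kubernetes/pki/etcd/server.key \
  endpoint status --cluster -w table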
It’s also not a permanent situation. Yesterday, for example, I didn’t have any spurious restarts on any host for over 31 hours. Here’s an example of kube-apiserver complaining:
2025-04-17 11:18:28.292 status.go:71] apiserver received an error that is not an metav1.Status: &errors.errorString{s:\"http: Handler timeout\"}: http: Handler timeout
2025-04-17 11:18:28.292 writers.go:122] apiserver was unable to write a JSON response: http: Handler timeout
2025-04-17 11:18:28.291 status.go:71] apiserver received an error that is not an metav1.Status: context.deadlineExceededError{}: context deadline exceeded
And then there’s etcd:
2025-04-17 11:18:30.903 slow fdatasync took=1.59687293s expected-duration=1s
2025-04-17 11:18:30.771 request stats start time=2025-04-17T09:18:28.771395Z time spent=2.000537375s remote=127.0.0.1:58000 response type=/etcdserverpb.KV/Range request count=0 request size=18 response count=0 response size=0 request content="key:\"/registry/health\""
2025-04-17 11:18:30.771 duration=2.000309284s start=2025-04-17T09:18:28.771470Z end=2025-04-17T09:18:30.771780Z steps="[\"trace[1219274112] 'agreement among raft nodes before linearized reading' (duration: 2.00005686s)\"]" step_count=1
2025-04-17 11:18:30.771 apply request took too long took=2.000065119s expected-duration=100ms prefix="read-only range " request="key:\"/registry/health\" " response= error="context canceled"
2025-04-17 11:18:29.154 timed out sending read state timeout=1s
2025-04-17 11:18:28.750 request stats start time=2025-04-17T09:18:26.749079Z time spent=2.000959149s remote=127.0.0.1:57984 response type=/etcdserverpb.KV/Range request count=0 request size=18 response count=0 response size=0 request content="key:\"/registry/health\" "
2025-04-17 11:18:28.749 duration=2.000699837s start=2025-04-17T09:18:26.749169Z end=2025-04-17T09:18:28.749869Z steps="[\"trace[1722676343] 'agreement among raft nodes before linearized reading' (duration: 2.00044495s)\"]" step_count=1
2025-04-17 11:18:28.749 apply request took too long took=2.000456339s expected-duration=100ms prefix="read-only range " request="key:\"/registry/health\" " response= error="context canceled"
One really weird thing to note: the issues always seem to come at xx:18? The hour at which they happen varies, but it’s always around 18 minutes past the hour.
Just to illustrate how bad it sometimes gets, here are two attempts at getting node agreement before linearized reads, taking 18 and 19 seconds:
2025-04-17 11:18:32.278 duration=19.273386011s start=2025-04-17T09:18:13.004696Z end=2025-04-17T09:18:32.278082Z steps="[\"trace[416778461] 'agreement among raft nodes before linearized reading' (duration: 19.242077797s)\"]" step_count=1
2025-04-17 11:18:32.278 duration=18.583085717s start=2025-04-17T09:18:13.694856Z end=2025-04-17T09:18:32.277941Z steps="[\"trace[434574392] 'agreement among raft nodes before linearized reading' (duration: 18.548892027s)\"]" step_count=1
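Those slow fdatasync messages point squarely at disk latency, and the usual way to quantify that for etcd is a small fio run that mimics its WAL write pattern, roughly like this (run it on the disk etcd actually lives on, not on a tmpfs):
# Sequential small writes with an fdatasync after every write, like etcd's WAL.
# The interesting number is the fdatasync 99th percentile, which should stay
# below roughly 10ms for etcd to be comfortable.
sudo mkdir -p /var/lib/etcd-disk-test
sudo fio --rw=write --ioengine=sync --fdatasync=1 \
  --directory=/var/lib/etcd-disk-test --size=22m --bs=2300 --name=etcd-wal-check
sudo rm -rf /var/lib/etcd-disk-test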
On the positive side: the cluster itself didn’t really seem fazed by the restarts. It just trucked along. The only reason I had a problem with Vault was that I like having the unseal key stored only in my password manager, and not anywhere automatically accessible.
But this is obviously not a permanent state. I have already found that running journalctl -ef on one of the controller nodes pretty reliably brings down at least one of the kube components. Updating a more complex Helm chart like kube-prometheus-stack also does the trick pretty reliably.
For once, I’m decidedly not looking forward to the service updates I’ve got scheduled for tomorrow morning. Let’s see how that goes.
But the remediation has already been put into action: I’ve ordered three Raspberry Pi 5 8GB, plus 500 GB NVMe SSDs and NVMe HATs for the Pis. I’m assuming that those will cope a hell of a lot better with the I/O load and tight latency tolerances of the Kubernetes control plane.
Final thoughts
On Saturday night, after I had taken down the server which provided me with some additional capacity during the migration, I felt most excellent. I was starting to consider which Homelab project I would tackle next. There are so many to choose from. I was rather disappointed when I was greeted by the downed Vault Pods on Sunday morning.
But not to whine too much - this gave me an excellent reason to get started on the Pi 5. Plus, I also bought a 16 GB Pi 5 and a 1TB SSD. Those will serve me well for some future ideas I’ve got.
Plus, even though it’s currently a bit unstable: I’m done. 🥳 The migration is done. I’m now the proud owner of a Kubernetes cluster.
There is one more post to come in this series, with my final thoughts and stats on the migration. But first, I want to migrate the control plane nodes to the Pi 5s. And before that, I definitely want to upgrade the OS in the Homelab from Ubuntu 22.04 to 24.04, because that’s the first release with Pi 5 support, and I don’t want to have multiple Ubuntu versions in the Homelab.
Now please excuse me while I go sharpen my Yak shaver.