Wherein I try to draw a conclusion about my migration to k8s.
This is the final part of my k8s migration series.
After a total of 26 posts, this will be the last one in the migration series. On the evening of April 13th, after one year, three months and 26 days, I set the final task of my k8s migration plan to “Done”. I made the first commits for the migration on December 19th 2023, shortly after starting my Christmas vacation that year. It was the addition of the first VMs, for the control plane nodes. I had already done some experimentation in November, but I don’t count that as time spent on the migration.
Overall, I had defined 864 tasks for the migration, most of them during the initial planning phase.
Apropos the planning phase: How did that turn out? In my first migration post, I laid out in detail how I planned to proceed. And for the most part, I did follow that plan. The one thing I did not foresee was that k8s does not have a combined CronJob+DaemonSet kind of workload, meaning a run-to-completion workload that can be started on a schedule with an instance running on every machine. That was what I was doing with my backups in Nomad, but it isn’t possible out of the box in Kubernetes. This led me to the decision to put the migration on hold and implement my very own Kubernetes operator for orchestrating my backups. Apart from that sidetrack, most things went according to plan - except for the very last step, migrating the controllers. In short, the Pi 4s with USB-attached SSDs were too slow to handle the control plane. This will be remedied with some Pi 5s with attached NVMe SSDs next week, but I didn’t see any reason to postpone this post.
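To make that gap a bit more concrete before moving on: what the backup workload needs is, roughly, “on a schedule, run one Job to completion on every node”. Here is a minimal sketch of that idea in Python, using the official kubernetes client - the namespace, image and naming are placeholders, and my actual operator does quite a bit more (tracking completion, cleanup, per-node configuration and so on).

```python
# Rough sketch only, not my actual operator: on a schedule (e.g. triggered by
# cron or a CronJob running this script), create one run-to-completion Job per
# node, pinned to that node. Namespace and image are placeholders.
from kubernetes import client, config


def launch_backup_jobs(namespace: str = "backups") -> None:
    config.load_kube_config()  # use load_incluster_config() when running in-cluster
    core = client.CoreV1Api()
    batch = client.BatchV1Api()

    for node in core.list_node().items:
        node_name = node.metadata.name
        job = client.V1Job(
            api_version="batch/v1",
            kind="Job",
            metadata=client.V1ObjectMeta(generate_name=f"backup-{node_name}-"),
            spec=client.V1JobSpec(
                template=client.V1PodTemplateSpec(
                    spec=client.V1PodSpec(
                        restart_policy="Never",
                        node_name=node_name,  # pin this Job to exactly this node
                        containers=[
                            client.V1Container(
                                name="backup",
                                image="registry.example.com/backup:latest",  # placeholder
                            )
                        ],
                    )
                )
            ),
        )
        batch.create_namespaced_job(namespace=namespace, body=job)


if __name__ == "__main__":
    launch_backup_jobs()
```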
Nomad vs k8s
Let’s take a closer look at Nomad vs k8s, starting with the number of allocations I had in the Nomad cluster vs the number of Pods I have in the k8s cluster. The two are not exactly comparable, but close enough for a rough impression.
In the Nomad cluster, shortly before starting the migration in December, I had 57 allocations, while I currently have 193 running Pods in the k8s cluster. This is of course partially because I’m running more things in the k8s cluster than I ran in the Nomad cluster. For example, due to the Cilium Pod alone, each k8s host already has one more Pod than the Nomad hosts had allocations.
One big topic I’d like to call out is the comparative maturity of the ecosystems, meaning “how much ready-made stuff is available?”. For sure, the comparison is slightly unfair - Kubernetes runs on all of the large cloud providers, it is a full-blown CNCF project under the Linux Foundation, and it is used in many public and private clouds. Nomad, on the other hand, is supported and mainly developed by a single company. As far as I’m aware, there is no public offering of managed Nomad for customers to run their own workloads on. For the most part, it’s used in private cloud deployments only.
Take, as an example, Ceph. With Rook Ceph, there is a very good package for deploying a Ceph cluster into a Kubernetes cluster. There is nothing comparable for Nomad, at least to my knowledge. You can still deploy Ceph baremetal and then use the official Ceph CSI driver in Nomad to manage volumes, of course. But that’s not the same as a good piece of software allowing me to run the entire cluster inside Nomad.
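To illustrate what “running the entire cluster inside” looks like with Rook: the Ceph cluster itself becomes just another declarative resource that the operator reconciles. The following is a heavily simplified sketch via the Python client - the spec values are illustrative placeholders and nowhere near a complete production configuration:

```python
# Heavily simplified sketch: the whole Ceph cluster is a single custom resource
# that the Rook operator reconciles into MON/MGR/OSD deployments. The spec values
# here are illustrative placeholders, not my actual configuration.
from kubernetes import client, config

config.load_kube_config()
custom = client.CustomObjectsApi()

ceph_cluster = {
    "apiVersion": "ceph.rook.io/v1",
    "kind": "CephCluster",
    "metadata": {"name": "rook-ceph", "namespace": "rook-ceph"},
    "spec": {
        "cephVersion": {"image": "quay.io/ceph/ceph:v18"},  # pick a Ceph release
        "dataDirHostPath": "/var/lib/rook",
        "mon": {"count": 3},
        "storage": {"useAllNodes": True, "useAllDevices": True},
    },
}

custom.create_namespaced_custom_object(
    group="ceph.rook.io",
    version="v1",
    namespace="rook-ceph",
    plural="cephclusters",
    body=ceph_cluster,
)
```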
Then there’s just the sheer breadth of generic tooling support - stuff like external-secrets or external-dns, for example. Sure, Nomad has direct, and very good, support for Vault. But that’s not even close to the level of support for secret providers that external-secrets offers.
And finally, there’s Helm. Again, as far as I know, Nomad doesn’t have anything similar that’s equally widely used. At the beginning, I was a bit hesitant to use Helm charts at all. Instead, I wanted to write all of the manifests myself. I relented pretty quickly, at least for charts provided by the projects themselves. So I’m fine with using e.g. Gitea’s chart, because it is at least supported by the Gitea project. But I wouldn’t use a Gitea chart from a third party, because the project itself will make its release announcements for the methods they officially support, not for third-party Helm charts. So for each tool, I would have to read two sets of release notes - the ones for the app itself, and the ones for the Helm chart. Sure, I also need to do that for first-party charts, but at least there I can be reasonably sure that they got all the necessary adaptations right. In Nomad, on the other hand, I wrote every single job and volume file myself. This definitely fostered a better understanding of both the app I wanted to deploy and Nomad, but it does get a bit repetitive at some point.
Rook Ceph
I would like to concentrate a bit on Rook Ceph here. One thing I would like to highlight is that it worked really nicely for me, and I was able to reason pretty well about what the operator would do - for the most part. See the mishap with the controller migration for an example of how I completely screwed up and almost lost my storage cluster.
But what I’m still not certain about: Would I have been quite as comfortable with Rook Ceph if I hadn’t been running Ceph baremetal for a couple of years beforehand? I have been brooding about this question since I added the point to the notes for this blog post. But I got nowhere. I’d like to be the kind of person who can spew forth some nugget of wisdom, but I’m starting to get the feeling that I don’t really have that much to say about the migration…
One thing that did surprise me was the sheer number of auxiliary Pods Rook spins up. In total, the operator namespace runs 41 Pods in my cluster, and the cluster namespace runs another 28. I actually ended up considerably reducing the resource requests for several Pod types, because after setting up Rook, I pretty much ran out of resources on my initial small cluster.
Resource utilization
So what does the comparison of the resource consumption look like? I haven’t been able to come up with a general comparison that makes sense - there are more things running now, due to stuff like external-secrets or external-dns, which Nomad simply did not have. Overall, I’m happy to report that I’ve now got more resources available for workloads, for the simple reason that the Ceph hosts are now part of the cluster as well, which allows me to use any free resources on them too.
One thing we can look at is the control plane nodes, because those are basically doing the same thing in both clusters. Under Nomad, those nodes were running the control plane for the cluster, meaning one of each of these:
- Nomad server
- Consul server
- Vault server
- Ceph MON daemon
And it’s basically the same in the k8s cluster control plane:
- kube-apiserver
- kube-controller-manager
- kube-scheduler
- kube-vip
- Ceph MON daemon
- Vault Pod
Here are the CPU loads. As a reminder, the machine we’re talking about here is a Raspberry Pi 4 4GB, with a SATA SSD attached via USB. The first plot shows the load on an average day in 2023, before any k8s migration, and the second one an average day after it:

[Plot: CPU utilization by CPU state on one of my Pi 4 control plane nodes on an average day before the k8s migration.]

[Plot: CPU utilization by CPU state on one of my Pi 4 control plane nodes on an average day after the k8s migration.]
I believe the majority of this difference is not due to the k8s control plane being inherently less efficient. Instead I think it’s entirely due to operators. In the Nomad cluster, the only requests made to the control plane were the ones kicked off by me entering some command, and the normal chatter between the cluster servers and the clients on the workers. But in the k8s cluster, I’ve got a number of operators running which all use the k8s API, and hence need to make apiserver requests and ultimately etcd requests. Just off the top of my head:
- The Prometheus operator, probably running at least a watch on a number of resources
- The Cilium operator and the Cilium per-node Pods, which definitely contribute to the load
- The Rook operator, which needs to keep track of all the Ceph daemon deployments as well as PersistentVolumeClaims
- Traefik, which has to keep tabs on Ingresses as well as its own resources
- external-dns and external-secrets
- My own backup operator
- CloudNativePG, again with a number of deployments and own CRDs it needs to keep an eye on
I believe that all of these taken together put quite some load on the apiserver, and hence on etcd. And that in turn might be too much for the USB-attached SSDs on my control plane nodes. In contrast, the Nomad/Consul servers did not get this many requests all the time.
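For illustration, this is roughly what the basic watch pattern underlying all of those operators looks like, here sketched in Python with the official client. Real operators are typically written in Go with controller-runtime and add caching, resyncs and work queues on top, but the long-lived apiserver connection is the same idea:

```python
# The basic controller pattern: open a watch on a resource type and react to
# every change event. The stream keeps a long-lived HTTP connection to the
# apiserver open, which is where a lot of the constant background load comes from.
from kubernetes import client, config, watch

config.load_kube_config()
core = client.CoreV1Api()

w = watch.Watch()
# Streams every add/update/delete event for PersistentVolumeClaims cluster-wide,
# similar in spirit to what e.g. the Rook operator has to keep track of.
for event in w.stream(core.list_persistent_volume_claim_for_all_namespaces):
    pvc = event["object"]
    print(event["type"], pvc.metadata.namespace, pvc.metadata.name)
```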
Going incremental
The decision to do the migration slowly, with some extra capacity to run the two clusters side by side, was an unquestionable positive. Sure, it cost a bit more due to the increased electricity consumption, but I think it was worth it.
Going incrementally mostly afforded me one thing: the ability to do things properly right from the start. It allowed me time to start with the Rook cluster, instead of first migrating to k8s and then migrating the baremetal Ceph cluster to Rook. It also left me the time to write extensive notes and to write blog posts on any interesting pieces of the migration.
In addition, the experimental phase I did before even starting the migration was also a good idea in hindsight. It allowed me to get some basic setup going, especially exploring Helmfile. I promise, I will be writing a post about that at some point as well. 🙂 One thing though: I wish I had dug a bit deeper into the backups. I did have the backup setup on the agenda, but for some reason I looked at the CronJob and decided it did everything I needed. I only realized that it didn’t when I actually got to implementing the k8s backups. It would have been nicer to write the backup operator up front, instead of in the middle of the migration, because running everything twice - Nomad/Consul next to k8s, and baremetal Ceph next to Rook Ceph - was not actually that much fun.
Advantages gained
I’ve gained quite some advantages for my Homelab from the migration to k8s, apart from the original goal of moving away from HashiCorp’s tooling. The first thing I’d like to mention is how much I enjoy Kubernetes as a “platform”. I’ve now got a lot more things running on a common platform - Kubernetes - than I had before. My individual hosts contain a lot less configuration. It’s basically just the kubelet now, where before I needed Nomad and Consul agents which had to be manually configured, including generating tokens for each individual host.
In that same vein, I also like the fact that both Vault and Ceph are now running in Kubernetes instead of standalone. Don’t get me wrong, it doesn’t reduce the maintenance effort for either of them that much, but I still got to remove quite some Ansible code.
Another big one was virtual IPs. With my Nomad cluster, I had an “Ingress” host which ran things like FluentD and Traefik that machines from outside the cluster needed to access. And that host was fixed - it had all the firewall rules configured and so on. When that host was down, access to my Homelab services was down. But back then, I didn’t see any other way, although I could probably have done something with e.g. HAProxy or the like. With my k8s cluster, I no longer have that problem. I’m using Cilium’s BGP LoadBalancer functionality to provide routes to my different services via a virtual IP. So for example my Traefik ingress can now be deployed wherever, and Cilium updates the routes when the host changes.
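The service side of this is deliberately boring - it’s just an ordinary Service of type LoadBalancer, which Cilium then hands a virtual IP from a configured pool and announces via BGP. A rough sketch with the Python client, with placeholder names, selector and ports (the Cilium BGP peering configuration itself lives in separate CRDs that I’m not reproducing here):

```python
# Sketch of the Service side: a plain Service of type LoadBalancer. Cilium's LB
# IPAM assigns it a virtual IP from a configured pool and the BGP control plane
# announces a route to it, no matter which node the Pods land on. Names,
# selector and ports are placeholders.
from kubernetes import client, config

config.load_kube_config()
core = client.CoreV1Api()

service = client.V1Service(
    api_version="v1",
    kind="Service",
    metadata=client.V1ObjectMeta(name="traefik", namespace="ingress"),
    spec=client.V1ServiceSpec(
        type="LoadBalancer",
        selector={"app": "traefik"},
        ports=[client.V1ServicePort(name="websecure", port=443, target_port=8443)],
    ),
)
core.create_namespaced_service(namespace="ingress", body=service)
```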
Another one in the “quite nice” category is that I finally got rid of Docker in my Homelab. The daemon was just annoying me from time to time. For example, a couple of years ago there was a memory leak in the FluentD logging driver that persisted for several months. I’m now running cri-o as the CRI for Kubernetes, and it just feels a lot better. One of the big advantages is that I can configure pull-through caches not just for DockerHub, but for any registry, without having to muck around with image locations in manifests or Helm charts.
And the final advantage is that I’ve now got more things which I can control with versioned code. This is especially visible in Ceph. Here, I can now create S3 buckets via ObjectBucketClaims instead of doing it manually on the command line. The same goes, for example, for Ceph users or even CephFS volumes. And the Rook team is continually improving the Ceph API support too, for example with the addition of bucket policies for the ObjectBucketClaim.
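For illustration, an ObjectBucketClaim follows the same pattern as the CephCluster sketch above - the bucket becomes a declarative, versionable resource, and Rook’s provisioner creates it and drops the access credentials into a Secret and ConfigMap named after the claim. Names and the StorageClass are placeholders:

```python
# Sketch: an S3 bucket as a declarative, versionable resource. Rook's bucket
# provisioner creates the bucket and writes the credentials into a Secret and
# ConfigMap named after the claim. Names and StorageClass are placeholders.
from kubernetes import client, config

config.load_kube_config()
custom = client.CustomObjectsApi()

obc = {
    "apiVersion": "objectbucket.io/v1alpha1",
    "kind": "ObjectBucketClaim",
    "metadata": {"name": "my-app-bucket", "namespace": "my-app"},
    "spec": {
        "generateBucketName": "my-app",          # bucket name gets a random suffix
        "storageClassName": "rook-ceph-bucket",  # your object store's bucket StorageClass
    },
}

custom.create_namespaced_custom_object(
    group="objectbucket.io",
    version="v1alpha1",
    namespace="my-app",
    plural="objectbucketclaims",
    body=obc,
)
```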
Conclusion
I had fun. That’s really all there is to it, in the end, right? The best decision of the entire migration was to make it so I could do it incrementally. I never had any longer downtimes for any of the Homelab services I rely on. That in turn meant that I could do it at my own pace. If I didn’t feel like homelabbing on a weekend, I didn’t need to. The Homelab was always in a stable state I could leave it in. It was interesting to dive into this new (to me) technology and kick the tires, and I like what I ended up with.
The only two things which could have gone better were the backup situation for one, and the performance/stability problems with the control plane for another. It would have been more comfortable to have implemented the backup operator at the beginning, instead of interrupting the migration for a couple of months.
So what’s next? I will be starting another blog post right after this one where I detail some of the larger ideas I’ve got in mind. It would bloat this post a bit too much to detail them here.
But short-term, I will work on replacing my control plane nodes with Pi 5s with NVMe SSDs to hopefully fix the instability issues they’re currently suffering from. The last piece of hardware I was waiting for arrived today, and I will likely get to it next week, as there’s another long weekend in Germany. And then I will get stuck into all the small and medium-sized tasks that I’ve been postponing for the past 1.5 years. For example migrating from Gitea to Forgejo, adding SSO support to some more services, cleaning up my log parsing and deploying a few new services.
Finally, I’ve greatly enjoyed accompanying the migration with this series of blog posts. One thing I’ve learned is that it is easier and more fun to write a post about something right after the thing is done, instead of putting it on an ever-growing pile of posts to write at some point in the future.