Wherein I go over my future plans for the Homelab, now that the k8s migration is finally done.
So it’s done. The k8s migration is finally complete, and I can now get started with some other projects. Or, well, I can once I’ve updated my control plane Pis to Pi 5 with NVMe SSDs.
But what to do then? As it turns out, I’ve got a very full backlog. I’m decidedly not in danger of boredom.
Without further ado, here is a meandering tour through my Homelab project list.
Baremetal improvements
At the moment, all of the hosts in my Homelab run baremetal; I don't have any VMs. I've got both x86 hosts and a lot of Pis. I've also got hosts with and without local storage. And their management, especially the creation of new hosts, is not great at the moment.
In short, I take the current Ubuntu LTS and adapt the image for every single host. I'm using HashiCorp's Packer to generate the images, and the generation differs wildly between x86 and Pi hosts. For the x86 hosts, Packer runs the full Ubuntu installer in a QEMU VM, just in an automated way. For the Pis, I start from the preinstalled Ubuntu Pi images. In both cases I then apply an Ansible playbook which installs basic necessities, especially my management user with its SSH key and some preconditions for Ansible usage. Importantly, this playbook also configures the hostname. So I need to generate a fresh image for every new host, even though the only difference is the Linux hostname config.
What happens next depends on whether the host is netbooted or not. For netbooters, I put the new image onto a Ceph RBD to serve as the host's root disk and extract the boot partition to the NFS share that provides the boot content to the hosts themselves and to my netboot control host, which runs dnsmasq to make the files available to the netbooters.
For hosts with local storage, the approach is to stick in a USB stick, boot the host, mount an NFS share with the freshly generated image, and then dd that image onto the local disk.
All of the above is not really great, to be honest. There are a number of manual steps in there. The first thing I'd like to solve somehow is the "one image per host" part. That's only there because I need to bake the hostname into the image so the host gets the right one at first boot. But that shouldn't be necessary. I should really only need two images, one for Pis and one for x86 machines, and then do all the host-specific config via cloud-init. This might even include some parts of the config I'm currently doing via the Ansible playbook. For example, the creation of my management/Ansible user should also be possible with cloud-init.
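To make that a bit more concrete, here's a minimal sketch of how the per-host part could be rendered as cloud-init user-data, so the images themselves stay generic. The hostname, user name and key path in there are just placeholders, not my actual setup:

```python
#!/usr/bin/env python3
"""Render per-host cloud-init user-data so a single generic image can be reused.

A sketch only: the user name, key path and hostname handling are placeholders.
"""
import sys
import yaml  # PyYAML


def render_user_data(hostname: str, ssh_pubkey: str) -> str:
    config = {
        "hostname": hostname,
        "manage_etc_hosts": True,
        "users": [
            {
                "name": "mgmt",  # hypothetical management/Ansible user
                "groups": ["sudo"],
                "shell": "/bin/bash",
                "sudo": "ALL=(ALL) NOPASSWD:ALL",
                "ssh_authorized_keys": [ssh_pubkey],
            }
        ],
    }
    # cloud-init expects the file to start with the "#cloud-config" marker.
    return "#cloud-config\n" + yaml.safe_dump(config, sort_keys=False)


if __name__ == "__main__":
    hostname = sys.argv[1]
    with open("mgmt_key.pub") as f:  # placeholder key path
        pubkey = f.read().strip()
    print(render_user_data(hostname, pubkey))
```

The resulting user-data would then get handed to the host at first boot, instead of being baked into a per-host image.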
But even if that works, there's still the actual install of the image on the host, be it netbooted or local storage. And for this, I've been eyeing two tools for quite a while: Tinkerbell and Canonical's MaaS (Metal as a Service). Both of them offer some kind of management for baremetal machines, making use of DHCP and netboot. With this, I'd like to move my hosts a bit more in the direction of cattle.
Both Tinkerbell and MaaS are capable of automatically installing baremetal machines. There are two issues with that when it comes to my setup, both related to my netbooters. First, the Raspberry Pis. Those have a "bespoke" netboot process that doesn't really follow any standards, while both MaaS and Tinkerbell rely on a standard pre-boot environment. The Raspberry Pi can work with both tools - if you have an SD card to provide separate bootcode. In principle, you need to UEFI boot the Pi. And that, as said, works fine. But from everything I've read up to now, that approach always requires an SD card or some other local storage to work. It doesn't work via the Pi's native netboot process. But then again - there's a lot of open source stuff even way down in the Pi's boot process. And I think it could be very interesting to get really deep into the weeds here and implement an adapted Pi firmware myself. That would be a really large project, because I've got pretty much no idea about development that close to the metal. But it could be really interesting.
The second problem is with the root disk on my netbooting hosts. They’re Ceph RBDs, and I had to implement some special scripting in the initramfs to make those work as root disks. And I doubt that whatever OS install mechanisms Tinkerbell and MaaS support can actually handle Ceph RBDs as the install target. But that’s another thing to check. Both tools are open source, so perhaps I can hack something together.
With all of this, I might even make some proper contributions to open source again, if what I come up with is actually fit for wider consumption.
For all of this to work, I will also need some sort of separate host, because I don't think running the tools responsible for host definitions and configuration on the very k8s cluster that runs on the managed hosts is a great idea. For stuff like that, which should not rely on any other services in the Homelab, I've already got a host, my so-called "cluster master". It's a Pi 4 and currently runs my PowerDNS server and dnsmasq for supporting the netbooting hosts. But it's only a 4GB Pi 4, so I can't run too much on there. And I'd rather not run MaaS or Tinkerbell just in Docker containers, or even baremetal. Instead, I'd like to set up a small management k8s cluster, which I will also use to test some of the lighter/smaller k8s distributions, like k3s. It will definitely be a single-node cluster.
In addition to running Tinkerbell or MaaS on that cluster, I'd also like to get into Cluster API and ultimately see how I like Talos Linux. I initially bounced off of it because it doesn't allow SSH access to the host, but now that the number of things actually running baremetal is even smaller than before, I'm warming to the idea. And it supposedly supports the Pi 4 already, and they're working on Pi 5 support, from what I understand. So that could be really nice to look at.
Furthermore, I’ve also been looking at GitOps for the cluster with Flux or Argo, and both would need some sort of management cluster as well.
So I've got a lot of things to work on for the baremetal/hardware side of the Homelab. One important thing I also need to do is to figure out whether I want to continue with the Pi fleet at all. The advantage is that I'm getting a lot of physical hosts with a relatively small physical footprint and very low electricity consumption. But this also comes with downsides. One I've already mentioned above: Pis don't always do things the standard way. They also don't have much expandability. The standard advice today is to get some small form factor thin client from Dell, Lenovo or HP on the used market. They have a similar expandability problem, but at least they'd have a more standardized boot process, and I wouldn't need to worry about whether a given OS supports them - they're just UEFI machines. But they're also 10x the size of a Pi 4. And they aren't passively cooled, like all my Pis are. And my rack is sitting right next to my desk in my living room. If I go this route, I will have to look for models which let me put in a nice Noctua fan instead of using whatever is already in there. Following this path, I'd probably put LXD (or rather, Incus) on the machines and run everything in VMs, again using Ceph RBDs as their root disks, so the internal NVMe would only need to hold the underlying host system.
I’m honestly talking myself into going that route right now. What’s enticing me the most is honestly the return to something “standard”. That’s really tempting. Not having to think about whether my underlying machines can even support what I want to do with them would be nice.
But then again: The Pi 4s are still good. Sure, I have to replace the control plane Pi 4s with Pi 5s, but the worker nodes are still trucking along. And I would guess that they will keep working for my needs for another couple of years, at minimum. Replacing them now, just because I'd like to have something more standardized, would be a massive waste.
Networking
This is a really big one. At the moment, my entire network is 1 GbE. I've got a couple of hosts with 2.5 GbE cards, but all of my network infra is still only 1 GbE. I'd like to change that. There's really nothing to be done about the Pis; they're going to stay at 1 GbE. But, and this is the main thing, there are my OPNsense router and my Ceph hosts, as well as my desktop. It would be nice to have the Ceph hosts at 2.5 GbE or even more, so they could supply several 1 GbE connected devices at their full speed.
So some new hardware will be in order. Preferably, I'd like a single switch with mostly 1 GbE ports plus a few faster ones, and I'd also be happy with 1/2.5 GbE combo ports. But for some reason, ports that do both 1 GbE and 2.5 GbE don't seem to be widespread? I still remember that 10/100/1000 ports were definitely a thing for a long time. So I will most likely be looking for two switches instead: one with enough 1 GbE ports for my Pis and enough high-speed uplinks to connect to a second switch with 2.5 GbE ports, or perhaps even faster ones. I've still got a lot of free PCIe slots in my Ceph machines for a faster network card. I'm currently eyeing MikroTik for all my future networking hardware needs, mostly to buy from an EU manufacturer.
Another network appliance I'd like to upgrade is my current router. It has more than enough performance, and the advantage of being quiet. But it's also a mini PC, with the accompanying lack of expandability. It only has 1 GbE ports, so even if I upgrade the switches, the connection to the router would still be a significant bottleneck. For a replacement, I'd love something 1U I can mount into my rack. The main issue is of course the "1U" wish - because that, almost invariably, comes with fans. And small diameter fans at that. And as I've said above, the rack is sitting next to my desk, so a bit of quiet is appreciated. I'd like to look at the machines that OPNsense themselves offer, as the smaller ones are looking pretty sweet. Or I could go and see whether anyone has made a 1U machine which is passively cooled. I mean, even 1U in a rack should provide enough volume to put in enough metal to cool something reasonable in a passive setup. But yeah, without that, I will likely give a bit of money to OPNsense for one of their HW offerings. Hm, just while doing my final read-through of the post, I'm thinking that I might not necessarily need to replace my router HW at all. It has six 1 GbE ports, and I'm only using two of them at the moment. Why not just look into combining them via link aggregation? Sure, it would take up more ports in my future switch, but that might be acceptable if it means I can keep using the HW for longer.
Then there is another big elephant in the room - IPv6. I’m currently reading The TCP/IP Guide, and honestly, IPv6 sounds pretty interesting. And at this point at least, most hardware and software I’m using should be perfectly fine with it.
And finally, I'd like to fix two issues I've currently got with my networking setup. The first one has to do with using Cilium's LoadBalancer support via BGP. It gives me LoadBalancer functionality in my cluster by publishing routes for virtual IPs that point at the hosts running the service. There's just one issue with that: If anything in the Homelab subnet needs to access one of those LoadBalancer services, I end up with asymmetric routing. The packets coming from the requesting host go up to the router, because the LoadBalancer IPs are all in a separate subnet, so they need to be routed out of the Homelab subnet. But when the Pods send answers, those are not routed back via the same path. Those packets are sent via the k8s host's interface, because that's directly connected to the Homelab network. The main issue this introduces is for the stateful firewall I've got running on the router: it only ever sees one half of the TCP connection, not the other. By default, pf does not consider that a valid connection, so it will block packets trying to flow along it. I had to configure "sloppy state" for those firewall rules, which made it work, but it's still not great, because the first few packets flowing along the path still get blocked.
The second issue is about my external DNS. It is currently hosted with my domain registrar, Strato. Which is mostly fine; there's only one issue I have with Strato: It doesn't offer any sort of API for its DNS, besides some DynDNS support. So some things, like the DNS challenge to get a wildcard cert from Let's Encrypt, need manual intervention. Whenever I need to get a new cert, I have to log into the Web UI and change the TXT records to the new challenge values. And I'd like to fully automate that. One option is ServFail. A DNS network with a bash-based Web UI is right down my alley. But before I can do that, I will have to fix my mail delivery, because I currently depend on Strato's mail package, which in turn depends on your DNS being hosted by them - or on you entering the correct data into your own DNS server.
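Just to illustrate what I'm after: once the zone lives on something that exposes an API - a PowerDNS instance I control, for example - updating the ACME DNS-01 TXT record could look roughly like this sketch. The API endpoint, zone and key are placeholders, not my actual setup:

```python
#!/usr/bin/env python3
"""Set the ACME DNS-01 challenge TXT record via the PowerDNS HTTP API.

A sketch only: the API endpoint, zone name and API key are placeholders.
"""
import requests

PDNS_API = "http://pdns.example.internal:8081/api/v1/servers/localhost"
API_KEY = "changeme"  # placeholder; would normally come from a secret store
ZONE = "example.com."


def set_acme_challenge(token: str) -> None:
    rrset = {
        "rrsets": [
            {
                "name": f"_acme-challenge.{ZONE}",
                "type": "TXT",
                "ttl": 60,
                "changetype": "REPLACE",
                # TXT record contents need to be wrapped in double quotes.
                "records": [{"content": f'"{token}"', "disabled": False}],
            }
        ]
    }
    resp = requests.patch(
        f"{PDNS_API}/zones/{ZONE}",
        json=rrset,
        headers={"X-API-Key": API_KEY},
        timeout=10,
    )
    resp.raise_for_status()


if __name__ == "__main__":
    set_acme_challenge("dummy-challenge-token")
```

That's the kind of call the cert renewal tooling could then make on its own, instead of me clicking through a Web UI every couple of months.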
Speaking of mail, that is another big one I’d like to tackle at some point. Even though it’s currently pretty far down the list. I did buy Michael W Lucas' Run Your Own Mail Server a little while ago and plan to use it to set up my very own. Let’s see whether it’s really as simple as some people claim.
One important thing I need to do first though: Organizing a static IP.
Remote VPS as an entrypoint to the Homelab
At the moment, the entire Homelab actually runs at home. The DNS for this blog and other public things I host points to my Deutsche Telekom consumer VDSL connection. This has been working fine for all these years, but some things require a static IP, especially the aforementioned self-hosted mail server. I'm reasonably sure that mail sent from a residential IP will be blocked immediately. So the plan is to rent a small VPS with a static IP and then do the typical thing: create a WireGuard tunnel between that VPS and my Homelab. One other thing I plan to use that VPS for is to get an outside monitoring tool going, so I can actually get some indication of what's going on when the Homelab completely crashes. Right now, my Gatus monitoring is running in the k8s cluster it's monitoring. 😅
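Even something as simple as the following sketch, run from cron on the VPS, would already tell me more than I know today when everything at home goes dark. The checked URL and the notification endpoint are placeholders; realistically I'd probably just run a second, external Gatus instance out there instead:

```python
#!/usr/bin/env python3
"""Minimal external health check, meant to run on the VPS via cron.

A sketch only: the checked URL and the notification webhook are placeholders.
"""
import requests

CHECK_URL = "https://blog.example.com/"             # placeholder public endpoint
ALERT_WEBHOOK = "https://ntfy.example.net/homelab"  # placeholder notification target


def homelab_reachable() -> bool:
    try:
        resp = requests.get(CHECK_URL, timeout=10)
        return resp.status_code < 500
    except requests.RequestException:
        return False


if __name__ == "__main__":
    if not homelab_reachable():
        # Fire a simple notification; ntfy-style services accept a plain POST body.
        requests.post(ALERT_WEBHOOK, data="Homelab looks down from the outside!", timeout=10)
```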
Ceph
Next up is Ceph. As I've described in one of my previous posts, one of my HDDs regularly displays an appallingly low IOPS value. I need to figure out whether it's actually bad or whether there is something else wrong. But for that, I need to understand Ceph better. A lot better. There was also some weird behavior when I was moving around hosts after taking down the baremetal Ceph cluster, where the mClock scheduler was not using a disk's full capacity while backfilling.
All of these I’d like to investigate. For this, I will likely have to actually read up on the algorithms behind Ceph, including the papers on CRUSH for example. And then I might even dig into the code, because while the Ceph docs themselves are pretty good, I’d like to really understand what’s happening behind the curtain.
Related to the IOPS issues, I'm also considering adding a SATA SSD to all of my Ceph hosts to put the WAL and RocksDB on it, at least for the HDD OSDs. That should improve overall performance for operations on my HDD pool, by relieving each HDD of having to handle both the payload and the metadata IO. The main issue with that is that one of my storage hosts is an Odroid H4, and that only has two SATA power and data connectors, both already in use. So that one would need to be replaced by something else.
Finally, one of the things which has been annoying me for a while is the fact that I’m currently hardcoding the IPs for my Ceph MON daemons in several places. Most importantly, in the Ceph configs for my netbooting hosts. That has the effect that I can’t easily move the MONs around. But it now looks like Rook added functionality to put the MONs behind Kubernetes services. This would allow me to move them without having to constantly update the configs and reboot hosts. I still couldn’t have them move around freely, because they’re using the local disk to store their data, but still, not having to worry about their IPs would be nice.
Monitoring
My beloved graphs. There are going to be more of them. But first, I need to deploy Thanos for my Prometheus instance. Because that's currently got a 250 GiB persistent volume, and I will need to increase the size of that volume again this week, as it's at 94% full again. And no, I will not be contemplating reducing my retention period below five years, thank you very much. 😅 Thanos will let the metrics spill over into object storage on my HDD pool, and I will finally be free from needing to regularly increase the volume size.
Once that's accomplished, I want to get into gathering metrics from apps. Right now, I'm gathering host metrics as well as Ceph and k8s metrics, but that's pretty much it. There's a lot of apps running in the Homelab which also provide metrics, and I'd like to gather those too. And make pretty graphs of them. 🤓
One big one I’d like to tackle is my blog. Right now, I’ve got zero metrics there, besides the number of requests hitting it as part of my generic web server metrics gathering. But to be honest, I’d like to know more. Purely for the pretty graphs. I don’t want to track anyone or anything like that. Just some basic “How often did this article get clicked” graphs. So I might just go with some log analysis. But I’ve also been eyeing something like Plausible. It’s just because I really like a good dashboard. 🤓
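As a rough idea of the log-analysis route: a sketch like the following would already produce per-article hit counts from the web server's access log. The log path, the log format and the `/posts/` URL prefix are assumptions, not necessarily how my setup actually looks:

```python
#!/usr/bin/env python3
"""Count hits per blog article from a web server access log.

A sketch only: the log path, combined log format and /posts/ prefix are assumptions.
"""
import re
from collections import Counter
from pathlib import Path

LOG_FILE = Path("/var/log/nginx/access.log")  # placeholder path

# Matches the request part of a common/combined format log line: "GET /posts/... HTTP/1.1"
REQUEST_RE = re.compile(r'"(?:GET|HEAD) (?P<path>/posts/[^ ?"]+)')


def count_article_hits(log_path: Path) -> Counter:
    hits = Counter()
    with log_path.open() as log:
        for line in log:
            match = REQUEST_RE.search(line)
            if match:
                hits[match.group("path")] += 1
    return hits


if __name__ == "__main__":
    for path, count in count_article_hits(LOG_FILE).most_common(10):
        print(f"{count:6d}  {path}")
```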
And then there's the big elephant in my monitoring room. At the moment, I mostly only see issues when I'm actively looking at my Homelab dashboard. Or, you know, when I suddenly hear fans ramping up or HDDs start rattling like mad. 😅 I'd like to change that with some proper alerting. Perhaps even including push notifications to *waves arms* somewhere. At least for the most important stuff like SMART issues on my disks.
And finally, I’ve been thinking about a public dashboard. Much reduced compared to what I’ve got internally, but perhaps just something like Pod CPU usage, overall memory usage and stuff like that? I’m wondering whether other Homelabbers would be interested in something along those lines.
The k8s cluster
Only two short points here. One, I'd like to get back into GitOps with Flux or Argo. I explored it a bit in the past, but the fact that I'd basically need another cluster, or at least a separate Git forge/CI system, put me off. But with the plan to run a single-node management cluster in the future, it might be interesting to look at this again.
Second, I'd like to get something like Renovate going for my k8s apps. Just so I can have a list of updates, with links and everything, when Homelab Service Friday rolls around.
Backups
And last, as they so often are, backups. Here, again, I’d like to improve my metrics. Restic can produce quite a lot of them, and I’d like to gather those. Again, mostly because I like pretty graphs. 🤓 I even started implementing something a while ago, but never finished it. It’s a nice combination of implementing something in Python and Homelabbing, because I’ll likely use the Prometheus Push Gateway.
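To give an idea of what that could look like, here's a sketch that reads `restic stats --json` for one repository and pushes a couple of gauges to the Push Gateway. The repository label, the gateway address and the credentials handling are all placeholders:

```python
#!/usr/bin/env python3
"""Push restic repository stats to the Prometheus Push Gateway.

A sketch only: the gateway address and repository label are placeholders.
Assumes `restic` is on the PATH and the usual RESTIC_REPOSITORY /
RESTIC_PASSWORD (and S3 credential) env vars are already set.
"""
import json
import subprocess

from prometheus_client import CollectorRegistry, Gauge, push_to_gateway

PUSHGATEWAY = "pushgateway.example.internal:9091"  # placeholder
REPO_NAME = "nextcloud"  # hypothetical label for this repository


def restic_json(*args: str):
    out = subprocess.run(
        ["restic", *args, "--json"], check=True, capture_output=True, text=True
    ).stdout
    return json.loads(out)


def push_stats() -> None:
    registry = CollectorRegistry()
    size = Gauge("restic_repo_size_bytes", "Total restore size of the repository",
                 ["repo"], registry=registry)
    files = Gauge("restic_repo_file_count", "Number of files in the repository",
                  ["repo"], registry=registry)
    snaps = Gauge("restic_snapshot_count", "Number of snapshots in the repository",
                  ["repo"], registry=registry)

    stats = restic_json("stats")
    size.labels(repo=REPO_NAME).set(stats["total_size"])
    files.labels(repo=REPO_NAME).set(stats["total_file_count"])
    snaps.labels(repo=REPO_NAME).set(len(restic_json("snapshots")))

    push_to_gateway(PUSHGATEWAY, job="restic_backup", registry=registry)


if __name__ == "__main__":
    push_stats()
```

Run after each backup job, that would give me a nice per-repository time series to graph.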
The biggest issue with my backups at the moment is the complete lack of off-site backups. My backups currently consist of a battery of S3 buckets on my Ceph cluster, each of which holds the restic backup repository for one of my services. Then there’s a large external HDD onto which I rclone the most important ones daily. The biggest problem here is that said external HDD is sitting on the top shelf of the rack that also holds the other servers in the Homelab. So should anything physically happen to my Homelab, that second backup location is also going to be gone.
My idea is to pretty much take the content of the HDD and sync it to a Hetzner StorageBox. Or I could cut out the middleman and sync the important S3 buckets to Hetzner directly, so the external HDD and the off-site backups are independent of each other.
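The direct-sync variant could be as simple as a small script driving rclone, roughly like this sketch. The remote names and the bucket list are placeholders, and both remotes are assumed to already be configured in rclone.conf:

```python
#!/usr/bin/env python3
"""Sync selected restic S3 buckets to an off-site rclone remote.

A sketch only: remote names and bucket list are placeholders.
"""
import subprocess

SOURCE_REMOTE = "ceph-s3"           # hypothetical rclone remote for the Ceph RGW
DEST_REMOTE = "hetzner-storagebox"  # hypothetical rclone remote for the StorageBox

BUCKETS = [
    "restic-nextcloud",  # placeholder bucket names
    "restic-gitea",
]


def sync_bucket(bucket: str) -> None:
    # rclone handles the actual transfer; check=True makes failures loud.
    subprocess.run(
        [
            "rclone", "sync",
            f"{SOURCE_REMOTE}:{bucket}",
            f"{DEST_REMOTE}:offsite-backups/{bucket}",
            "--transfers", "4",
        ],
        check=True,
    )


if __name__ == "__main__":
    for bucket in BUCKETS:
        sync_bucket(bucket)
```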
What will I actually do next?
I hope you enjoyed this tour through my Homelab backlog. It was pretty nice to write a "stream of consciousness" post like this, as compared to my normal tutorial/here-is-what-I-did-and-why posts.
The last remaining question: What will I actually do next in the Homelab? The first step is going to be deploying the three Pi 5s with NVMe SSDs currently still strewn all over my table. Once that's done, next will very likely be the Thanos deployment. I'm getting a bit tired of regularly increasing the Prometheus PVC's size.
And then, the next big project will very likely be the baremetal deployment enhancement. I got myself a bit excited while writing about digging into the Pi bootloader and trying to get it to chainload iPXE.