I’ve had a problem for quite a while now. As a reminder, I’m booting eight Raspberry Pi CM4 and one Udoo x86 II completely diskless, using boot partitions on NFS, PXE netboot and the Pi’s netboot feature with root disks being supplied by Ceph RBD volumes. If you’re interested in the details, I’ve got an entire series on the setup, as well as a separate article describing the Udoo boot setup.
This worked very nicely for quite a while and did exactly what I wanted. But it has one problem that’s been eluding me for a long time now: The hosts don’t always come up again after a poweroff or a reboot.
My initial theory was that this was a problem with Ceph’s blocklisting. For Ceph RBDs, you can enable a feature called exclusive locks. This feature means that there is only one large lock for an entire volume, which is by default never given up. This improves performance slightly, because Ceph knows that no other client will suddenly want to mount that particular volume.
To protect those volumes from concurrent access, that lock needs to be acquired. But one of the standard problems with locking rears its head here as well: What happens when the lock holder crashes? For these cases, Ceph allows locks to be broken. This is a good mechanism to not perpetually block a volume.
But what happens when the original lock holder did not completely crash, but instead just hung for a while and then continues processing as if nothing happened? It wouldn’t know that the lock has been broken, and with exclusive locks enabled, it wouldn’t check whether it still held the lock. And why would it? From its point of view, nothing untoward happened.
So now you’d have two clients thinking they hold the lock. Bad things might happen. To prevent this, Ceph doesn’t just break the lock. It also blocklists the original lock holder for a while, so it can’t make any writes to the RBD volume.
This is a good approach I think. Unless you’re running a diskless setup like mine. As I noted, the hosts are completely diskless. They fetch an initramfs via netboot, launch into it, and inside the initramfs I mount the RBD root volume. I’m genuinely proud of this setup. Realizing that an initramfs was just an assortment of shell scripts was one of those great learning moments in my Homelab experience. But it has a downside, namely shutdowns. During shutdown, the root partition cannot be unmounted completely. As a consequence, I also can’t unmap the RBD volume. Shutdowns/reboots still work, though. But during the reboot, the attempt to map the RBD volume again will find that another client still holds the lock. Ceph will then break that lock, but in the course of doing so, it will also blocklist the previous client. Which has the same IP as the new client.
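To check whether a rebooted host really shows up on the blocklist, the list can be filtered by the host's IP. The following is a minimal sketch, not part of my actual setup. It assumes that `ceph osd blocklist ls` prints one entry per line in the form `<ip>:<port>/<nonce> <expiry>`, and the function name `blocklisted_for_ip` is made up for illustration. It only handles IPv4 addresses. Note that a match on the IP alone only shows that some client instance from that host was blocklisted, since each entry also carries a nonce.

```shell
#!/bin/sh
# Sketch: filter blocklist output for entries belonging to one IPv4 address.
# Assumption: each entry line looks like "<ip>:<port>/<nonce> <expiry>".

# Reads blocklist output on stdin, prints entries whose address starts
# with "<ip>:", and returns 0 only if at least one entry matched.
blocklisted_for_ip() {
    awk -v ip="$1" '
        # index(...) == 1 means the first field starts with "<ip>:",
        # so 10.0.0.21 does not accidentally match 110.0.0.21.
        index($1, ip ":") == 1 { print; found = 1 }
        END { exit !found }
    '
}

# Intended usage on a machine with Ceph admin credentials
# (10.0.0.21 is a placeholder address):
#   ceph osd blocklist ls | blocklisted_for_ip 10.0.0.21
```

The current lock holder of a volume can be listed separately with `rbd lock ls <pool>/<image>`, which makes it possible to compare the holder before and after a reboot.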
My working theory until recently was this: During boot, the host breaks the
Ceph lock, but in doing so, it puts itself onto Ceph’s blocklist. It then
fails to map the RBD volume and consequently fails to boot.
To fix the problem, I found that it works to issue a
ceph osd blocklist clear
command at the right time.
But I’ve always had a sliver of doubt about that theory, because at any given
reboot, e.g. during general system updates, only some netboot hosts, always a
minority, would fail. Most of them would reboot without issue, but all of them
would create blocklisting entries for themselves.
And I would then sit there and try to get the failed hosts to boot cleanly
by switching them off and on again while basically issuing
ceph osd blocklist clear
in an endless loop.
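For illustration, that manual loop could be sketched roughly like this. It is a bounded variant rather than a truly endless one, and the `CLEAR_CMD` variable and the function name are hypothetical additions so the loop can be dry-run without touching a cluster. Keep in mind that clearing the blocklist removes all entries, not just the ones for the rebooting host.

```shell
#!/bin/sh
# Sketch: clear the Ceph blocklist a fixed number of times with a pause
# in between. CLEAR_CMD is a hypothetical override for dry runs.
CLEAR_CMD="${CLEAR_CMD:-ceph osd blocklist clear}"

# $1: number of attempts, $2: seconds to wait between attempts
clear_blocklist_loop() {
    rounds="$1"
    pause="$2"
    i=0
    while [ "$i" -lt "$rounds" ]; do
        # Intentionally unquoted so the command string splits into words.
        $CLEAR_CMD || echo "clear attempt $((i + 1)) failed" >&2
        i=$((i + 1))
        # Skip the pause after the final attempt.
        if [ "$i" -lt "$rounds" ]; then
            sleep "$pause"
        fi
    done
    return 0
}

# Example: 30 attempts, 2 seconds apart, while power-cycling the host:
#   clear_blocklist_loop 30 2
```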
The nagging feeling in my head, for a while now, was that perhaps that was a bit of cargo culting. I was clearing the blocklist, and then the hosts booted again. But I always ignored the fact that sometimes, I had to go through the process multiple times before the machine finally came up again. So perhaps, it was just the repeated reboots which fixed whatever issue I have?
It doesn’t help that I don’t have a good way to look at the console output of my netbooting machines. Then there’s also the fact that I could never reproduce the problem with a fresh Pi not mounted in the rack but sitting on my desk with a screen attached.
I finally reached the level of annoyance which made me put away my Kubernetes project and get to the bottom of this issue. I set up a VM on my desktop to make debugging easier than it is with a Pi. I will write up that story soon, just take this from me for now: Keep your hands away from VirtualBox. 😠
With the VM setup, I got the same results as I got with a fresh Pi: Yes, the machine blocklisted a client during reboot, but still didn’t have a problem mapping the RBD volume and booting up properly. Just that now, I could watch the entire process. And there doesn’t seem to be anything wrong. As expected, I get the lock breaking, and then everything works as normal.
This has me pretty much convinced that something else is going wrong. So instead
of going with my original plan, which was just to add a
ceph osd blocklist clear
to the initramfs root mount script, I will first introduce some better logging.
This means implementing netconsole for all netbooting machines and then gathering those early boot logs in my FluentD instance.
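As a rough sketch of what that might involve: netconsole is configured through a `netconsole=` kernel parameter of the form `[src-port]@[src-ip]/[dev],[tgt-port]@<tgt-ip>/[tgt-macaddr]`. The helper below just assembles that string so it can be appended to each host's kernel command line. The function name, the addresses and the interface name are placeholders, and I'm assuming the netconsole driver is available early enough in boot and that something on the target is listening for UDP on the chosen port.

```shell
#!/bin/sh
# Sketch: build a netconsole= kernel parameter for one host.
# All addresses and the interface name used below are placeholders.

# $1: source IP of the booting host
# $2: network device to send from
# $3: IP of the log receiver
# $4: receiver UDP port (optional, defaults to netconsole's 6666)
# $5: receiver MAC address (optional, empty means broadcast)
netconsole_param() {
    src_ip="$1"
    dev="$2"
    tgt_ip="$3"
    tgt_port="${4:-6666}"
    tgt_mac="${5:-}"
    # The source port is left empty so the kernel default is used.
    printf 'netconsole=@%s/%s,%s@%s/%s\n' \
        "$src_ip" "$dev" "$tgt_port" "$tgt_ip" "$tgt_mac"
}

# Example:
#   netconsole_param 10.0.0.21 eth0 10.0.0.5
```

Since the messages arrive as plain UDP packets, the receiving side needs some kind of UDP listener in front of the log pipeline.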
My biggest fear is that once I’ve set that up, all my hosts will suddenly stop having the problem and I will never ever get those logs I need. 😅