Yesterday’s Homelab host update did not go as intended at all. I hit a kernel bug in the NFS code.

To describe the problem, I need to go into a bit of detail on my setup, so please bear with me.

I’ve got a fleet of 8 Raspberry Pi CM4s and a single Udoo x86 II forming the backbone of the compute in my Homelab. All of them netboot, with no per-host storage at all. To make host updates possible, including kernel updates, the boot files used for netbooting are separated per host, and each host’s files are mounted to that host’s /boot/firmware directory via NFS. It looks something like this:

ls -l /mnt/netboot/
drwxr-xr-x 3 root    root   95210641 Feb 17 12:10 abcd
drwxr-xr-x 3 root    root   95206027 Feb 17 12:45 efgh
drwxr-xr-x 3 root    root   95209268 Feb 17 11:49 ijkl
drwxr-xr-x 3 root    root   95212903 Dec 21 23:06 mnop
drwxr-xr-x 3 root    root   95208373 Feb 17 12:10 qrst
drwxr-xr-x 3 root    root   95211504 Feb 17 11:49 uvwx
drwxr-xr-x 3 root    root   94928358 Nov 26 15:56 xyz
ls -l /mnt/netboot/abcd/
total 92267
-rwxr-xr-x 1 root   root        1024 Jan 16  2023 README
-rwxr-xr-x 1 root   root       53004 Feb 16 22:47 bcm2711-rpi-cm4.dtb
-rw-r--r-- 1 root   root        4624 Feb 16 22:47 boot.scr
-rw-r--r-- 1 root   root       52476 Feb 16 22:47 bootcode.bin
-rwxr-xr-x 1 root   root         285 Nov 26 15:57 cmdline.txt
-rwxr-xr-x 1 root   root        1220 Jan 21  2023 config.txt
-rw-r--r-- 1 root   root        7265 Feb 16 22:47 fixup.dat
-rw-r--r-- 1 root   root        5400 Feb 16 22:47 fixup4.dat
-rw-r--r-- 1 root   root        3170 Feb 16 22:47 fixup4cd.dat
-rw-r--r-- 1 root   root        8382 Feb 16 22:47 fixup4db.dat
-rw-r--r-- 1 root   root        8386 Feb 16 22:47 fixup4x.dat
-rw-r--r-- 1 root   root        3170 Feb 16 22:47 fixup_cd.dat
-rw-r--r-- 1 root   root       10229 Feb 16 22:47 fixup_db.dat
-rw-r--r-- 1 root   root       10227 Feb 16 22:47 fixup_x.dat
-rw-r--r-- 1 root   root    59735369 Feb 17 01:01 initrd.img
drwxr-xr-x 2 root   root      738870 Feb 16 22:48 overlays
-rw-r--r-- 1 root   root     2974880 Feb 16 22:47 start.elf
-rw-r--r-- 1 root   root     2250656 Feb 16 22:47 start4.elf
-rw-r--r-- 1 root   root      805084 Feb 16 22:47 start4cd.elf
-rw-r--r-- 1 root   root     3746856 Feb 16 22:47 start4db.elf
-rw-r--r-- 1 root   root     2998120 Feb 16 22:47 start4x.elf
-rw-r--r-- 1 root   root      805084 Feb 16 22:47 start_cd.elf
-rw-r--r-- 1 root   root     4818728 Feb 16 22:47 start_db.elf
-rw-r--r-- 1 root   root     3721800 Feb 16 22:47 start_x.elf
-rw-r--r-- 1 root   root      607200 Feb 16 22:47 uboot_rpi_4.bin
-rw-r--r-- 1 root   root      592696 Feb 16 22:47 uboot_rpi_arm64.bin
-rw-r--r-- 1 root   root    10348331 Feb 17 01:01 vmlinuz
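On each host, its directory from that share is then mounted at /boot/firmware. As a rough sketch of what such a mount could look like in /etc/fstab (server name, export path, and options here are placeholders, not my actual config):

```
# Hypothetical fstab entry on host "abcd"; server and options are made up.
nfs-server:/srv/netboot/abcd  /boot/firmware  nfs  vers=4.2,hard  0  0
```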

This way, normal OS updates work seamlessly: the update copies the new kernel and related files to the NFS share, from which dnsmasq supplies them during netbooting. The netbooting is controlled from my cluster master host. That’s the one host in my setup which does not have any kind of HA setup. It’s also my one “can’t have dependencies” host: it’s where I run the things which everything else depends on, like my internal DNS server and my netboot setup. Let’s call this host spof, for no reason in particular. This host mounts the directory with all of the netboot directories for the different hosts, and dnsmasq then runs a TFTP server for the other hosts’ netbooting.
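On spof, the dnsmasq side of this is small; a minimal sketch of just the TFTP part (the tftp-root path is a placeholder, and the real config of course also carries the DHCP/PXE options):

```
# Hypothetical dnsmasq fragment: serve the mounted netboot tree over TFTP.
enable-tftp
tftp-root=/mnt/netboot
```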

That should be enough for here; if you’re interested in an in-depth description of the setup, have a look at the series of posts I wrote about it.

Anyway, last night I ran my regular update of all my Homelab hosts. At first, I thought all was well. But then the first two freshly updated hosts failed to reboot. After some investigation, I saw that the netboot directory was no longer mounted on the spof host. Trying to mount it manually, I was greeted with this error:

mount /mnt/netboot/
mount.nfs: Protocol not supported

Uh… huh? Going with -v wasn’t any more enlightening:

mount -v /mnt/netboot/
mount.nfs: timeout set for Sat Feb 17 12:16:44 2024
mount.nfs: trying text-based options 'timeo=900,vers=4.2,addr=,clientaddr='
mount.nfs: mount(2): Protocol not supported
mount.nfs: trying text-based options 'timeo=900,vers=4,minorversion=1,addr=,clientaddr='
mount.nfs: mount(2): Protocol not supported
mount.nfs: trying text-based options 'timeo=900,vers=4,addr=,clientaddr='
mount.nfs: mount(2): Protocol not supported
mount.nfs: trying text-based options 'timeo=900,addr='
mount.nfs: prog 100003, trying vers=3, prot=6
mount.nfs: trying prog 100003 vers 3 prot TCP port 2049
mount.nfs: portmap query retrying: RPC: Program/version mismatch
mount.nfs: prog 100003, trying vers=3, prot=17
mount.nfs: trying prog 100003 vers 3 prot UDP port 2049
mount.nfs: portmap query failed: RPC: Program/version mismatch
mount.nfs: Protocol not supported

So: either my Ceph NFS server or the client code suddenly forgot how NFS works. Interestingly, the problem only showed up for the NFS mount from my Ceph NFS server. Another NFS mount, supplied by the spof host itself for quick and easy file transfers in the Homelab, worked without issue. I also found that the netboot NFS mount still worked on the hosts which had not yet been updated.
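For the hosts where the mount still worked, the negotiated NFS version can be read off the vers= option in /proc/mounts. A minimal sketch, run here against a made-up sample line (server name and paths are placeholders; on a live host you’d grep /proc/mounts itself):

```shell
# Made-up /proc/mounts line from a host where the mount still worked;
# on a live host: grep netboot /proc/mounts
sample='ceph-nfs:/netboot /mnt/netboot nfs4 rw,relatime,vers=4.2,proto=tcp 0 0'
# Pull out the negotiated NFS version.
printf '%s\n' "$sample" | grep -o 'vers=[0-9.]*'   # → vers=4.2
```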

Some googling later, I hit this bug against NFS Ganesha, which is the NFS implementation used by Ceph. That pointed me to this fix in the kernel for an issue with NFS in relatively recent kernels. It looks like Ubuntu mistakenly backported the broken NFS commit, but not the fix for it, to their 5.15 kernels in Ubuntu 22.04. I saw the issue in the following kernels:

  • linux-image-5.15.0-1046-raspi on my Raspberry Pis
  • linux-image-5.15.0-94-generic on my x86_64 hosts

This bug report on Longhorn also serves as an indication that it was a bad backport: it was opened back in October, when the original bad commit entered the kernel, and then got another flurry of responses when the Ubuntu LTS kernel was updated.

In the end, I had no choice but to skip the kernel update on my machines. I pinned the kernel packages by running these commands:

ansible "x86hosts_group" -a "apt-mark hold linux-image-generic linux-headers-generic linux-generic"
ansible "raspihosts_group" -a "apt-mark hold linux-image-raspi linux-modules-extra-raspi"

After that, I could safely run apt upgrade on my hosts. For the two hosts I had already updated, I needed to go a bit further. The first task was to get the spof host to mount the netboot NFS volume again.

And I went about it in the most stupid way possible. Remember, that host is called spof for no particular reason. It’s central to my Homelab’s functioning. And I decided to use it as a testbed. So, smart guy that I am, I thought: okay, the kernel that’s booted is /boot/firmware/vmlinuz, so let’s just copy the old kernel from /boot there, reboot, and done!

Yeah, no. The flash-kernel tool is there for a reason: it does more than just copy the kernel and initrd from /boot to /boot/firmware on Pi hosts. So now I had an unbootable spof host. I had to pull it from the rack, including its SSD, so I could connect a screen and see what the actual problem was. The error message was that it couldn’t find the disk, for reasons I don’t understand. I ended up just copying the kernel and initrd from a different Pi’s /boot/firmware to spof’s /boot/firmware and got my host (and my DNS…) back.

The flash-kernel tool then had to be invoked on all the already-updated Pis to switch them back to the older kernel and initrd:

flash-kernel --force 5.15.0-1045-raspi
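The version string to pass is just the suffix of the matching vmlinuz file still present under /boot. A small sketch of picking the newest version that isn’t the known-bad one, demonstrated on a scratch directory with hard-coded example versions (on a real Pi this would scan /boot instead):

```shell
# Demonstrated on a scratch dir; on a real Pi, point this at /boot.
d="$(mktemp -d)"
touch "$d/vmlinuz-5.15.0-1045-raspi" "$d/vmlinuz-5.15.0-1046-raspi"
bad="5.15.0-1046-raspi"     # the release with the broken NFS backport
picked=""
for f in "$d"/vmlinuz-*; do
  v="${f##*/vmlinuz-}"      # strip the path and the "vmlinuz-" prefix
  if [ "$v" != "$bad" ]; then
    picked="$v"             # glob is sorted, so this keeps the newest good one
  fi
done
rm -r "$d"
echo "flash-kernel --force $picked"   # → flash-kernel --force 5.15.0-1045-raspi
```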

Let’s hope that this fills my quota of kernel bugs in the Homelab for the next 5 years or so. 😅