Yesterday’s Homelab host update did not go as intended at all. I hit a kernel bug in the NFS code.
To describe the problem, I need to go into a bit of detail on my setup, so please bear with me.
I’ve got a fleet of 8 Raspberry Pi CM4s and a single Udoo x86 II forming the
backbone of the compute in my Homelab. All of them netboot, with no
per-host storage at all. To make host updates, including kernel updates, possible,
the boot files used for netbooting are separated per host, and each host’s files
are mounted to that host’s /boot/firmware dir via NFS.
It looks something like this:
ls -l /mnt/netboot/
drwxr-xr-x 3 root root 95210641 Feb 17 12:10 abcd
drwxr-xr-x 3 root root 95206027 Feb 17 12:45 efgh
drwxr-xr-x 3 root root 95209268 Feb 17 11:49 ijkl
drwxr-xr-x 3 root root 95212903 Dec 21 23:06 mnop
drwxr-xr-x 3 root root 95208373 Feb 17 12:10 qrst
drwxr-xr-x 3 root root 95211504 Feb 17 11:49 uvwx
drwxr-xr-x 3 root root 94928358 Nov 26 15:56 xyz
ls -l /mnt/netboot/abcd/
total 92267
-rwxr-xr-x 1 root root 1024 Jan 16 2023 README
-rwxr-xr-x 1 root root 53004 Feb 16 22:47 bcm2711-rpi-cm4.dtb
-rw-r--r-- 1 root root 4624 Feb 16 22:47 boot.scr
-rw-r--r-- 1 root root 52476 Feb 16 22:47 bootcode.bin
-rwxr-xr-x 1 root root 285 Nov 26 15:57 cmdline.txt
-rwxr-xr-x 1 root root 1220 Jan 21 2023 config.txt
-rw-r--r-- 1 root root 7265 Feb 16 22:47 fixup.dat
-rw-r--r-- 1 root root 5400 Feb 16 22:47 fixup4.dat
-rw-r--r-- 1 root root 3170 Feb 16 22:47 fixup4cd.dat
-rw-r--r-- 1 root root 8382 Feb 16 22:47 fixup4db.dat
-rw-r--r-- 1 root root 8386 Feb 16 22:47 fixup4x.dat
-rw-r--r-- 1 root root 3170 Feb 16 22:47 fixup_cd.dat
-rw-r--r-- 1 root root 10229 Feb 16 22:47 fixup_db.dat
-rw-r--r-- 1 root root 10227 Feb 16 22:47 fixup_x.dat
-rw-r--r-- 1 root root 59735369 Feb 17 01:01 initrd.img
drwxr-xr-x 2 root root 738870 Feb 16 22:48 overlays
-rw-r--r-- 1 root root 2974880 Feb 16 22:47 start.elf
-rw-r--r-- 1 root root 2250656 Feb 16 22:47 start4.elf
-rw-r--r-- 1 root root 805084 Feb 16 22:47 start4cd.elf
-rw-r--r-- 1 root root 3746856 Feb 16 22:47 start4db.elf
-rw-r--r-- 1 root root 2998120 Feb 16 22:47 start4x.elf
-rw-r--r-- 1 root root 805084 Feb 16 22:47 start_cd.elf
-rw-r--r-- 1 root root 4818728 Feb 16 22:47 start_db.elf
-rw-r--r-- 1 root root 3721800 Feb 16 22:47 start_x.elf
-rw-r--r-- 1 root root 607200 Feb 16 22:47 uboot_rpi_4.bin
-rw-r--r-- 1 root root 592696 Feb 16 22:47 uboot_rpi_arm64.bin
-rw-r--r-- 1 root root 10348331 Feb 17 01:01 vmlinuz
This way, normal OS
updates work seamlessly, as the update copies the new kernel and such to the
NFS share from which dnsmasq supplies those files during netbooting.
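On each host itself, that per-host boot directory is just an ordinary NFS mount. As a minimal sketch, an /etc/fstab entry for it could look roughly like this (the server name and export path are made up for illustration, they depend on how the NFS export is actually set up):
# hypothetical fstab entry on host "abcd": its own netboot dir mounted at /boot/firmware
nfs.homelab.internal:/netboot/abcd  /boot/firmware  nfs4  defaults,_netdev  0  0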
The netbooting is controlled from my cluster master host. That’s the one host
in my setup which does not have any kind of HA setup. It’s also my one
“can’t have dependencies” host: it’s where I run the things which everything else
depends on, like my internal DNS server and my netboot setup.
Let’s call this host spof, for no reason in particular. This host mounts
the directory with all of the netboot directories for the different hosts. Then,
dnsmasq runs a TFTP server for the other hosts’ netbooting.
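For reference, the TFTP part of this boils down to a couple of dnsmasq options. A minimal sketch (the rest of my dnsmasq config, and how each host ends up picking its own subdirectory, is omitted here):
# serve the netboot directory tree via dnsmasq's built-in TFTP server
enable-tftp
tftp-root=/mnt/netboot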
That should be enough for here; if you’re interested in an in-depth description of the setup, have a look at the series of posts I wrote about it.
Anyway, last night I ran my regular update of all my Homelab hosts. At first,
I thought all was well. But then, the first two freshly updated hosts
failed to reboot. After some investigation, I saw that on the spof host, the
netboot directory was no longer mounted. Trying to mount it manually, I was
greeted with this error:
mount /mnt/netboot/
mount.nfs: Protocol not supported
Uh - huh? Going with -v
wasn’t any more enlightening:
mount -v /mnt/netboot/
mount.nfs: timeout set for Sat Feb 17 12:16:44 2024
mount.nfs: trying text-based options 'timeo=900,vers=4.2,addr=10.86.5.132,clientaddr=10.86.1.200'
mount.nfs: mount(2): Protocol not supported
mount.nfs: trying text-based options 'timeo=900,vers=4,minorversion=1,addr=10.86.5.132,clientaddr=10.86.1.200'
mount.nfs: mount(2): Protocol not supported
mount.nfs: trying text-based options 'timeo=900,vers=4,addr=10.86.5.132,clientaddr=10.86.1.200'
mount.nfs: mount(2): Protocol not supported
mount.nfs: trying text-based options 'timeo=900,addr=10.86.5.132'
mount.nfs: prog 100003, trying vers=3, prot=6
mount.nfs: trying 10.86.5.132 prog 100003 vers 3 prot TCP port 2049
mount.nfs: portmap query retrying: RPC: Program/version mismatch
mount.nfs: prog 100003, trying vers=3, prot=17
mount.nfs: trying 10.86.5.132 prog 100003 vers 3 prot UDP port 2049
mount.nfs: portmap query failed: RPC: Program/version mismatch
mount.nfs: Protocol not supported
So - either my Ceph NFS server or the client code suddenly forgot how NFS works.
Interestingly, this problem was only visible for the NFS mount from my Ceph NFS
server. Another NFS mount, which is supplied by the spof host itself for quick
and easy file transfers in the Homelab, worked without issue. I also found
that the netboot NFS mount still worked on the hosts which had not yet been
updated.
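That also gave me something to compare against. To see the difference between a working and a broken host, the running kernel and the NFS version a mount actually negotiated can be checked with standard tools, for example:
uname -r      # which kernel the host is actually running
nfsstat -m    # lists NFS mounts together with their negotiated version and options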
Some googling later, I hit this bug
against NFS Ganesha, which is the NFS implementation used by Ceph. That then
pointed me to this fix in the kernel
for an issue with NFS in relatively recent kernels. It looks like Ubuntu mistakenly
backported the broken NFS commit, but not the fix for it, to their 5.15 kernels
in Ubuntu 22.04. I saw the issue in the following kernels:
- linux-image-5.15.0-1046-raspi on my Raspberry Pis
- linux-image-5.15.0-94-generic on my x86_64 hosts
This bug report on Longhorn also serves as an indication that it was a bad backport. It was opened back in October, when the original bad commit entered the kernel, and then got another flurry of responses when the LTS Ubuntu kernel got updated.
In the end, I did not have any other choice but to skip the kernel update on my machines. I pinned the kernel packages by running these commands:
ansible "x86hosts_group" -a "apt-mark hold linux-image-generic linux-headers-generic linux-generic"
ansible "raspihosts_group" -a "apt-mark hold linux-image-raspi linux-modules-extra-raspi"
After that, I could safely run apt upgrade
on my hosts. For the two hosts
that I had already updated, I needed to go a bit further. The first issue was that
I needed to get the spof host to be able to mount the netboot NFS volume again.
And I went about it in the most stupid way possible. Remember, that host is called
spof for no particular reason. It’s central to my Homelab’s functioning. And
I decided to use it as a testbed. So, smart guy that I am, I thought: Okay, the
kernel that’s booted is in /boot/firmware/vmlinuz. So let’s just copy the old
kernel from /boot there, reboot, and done!
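For illustration, that attempt boiled down to something like this (kernel version written from memory, and, as the next paragraph shows, this is exactly what not to do):
# naive approach: copy kernel and initrd straight into the firmware dir
cp /boot/vmlinuz-5.15.0-1045-raspi /boot/firmware/vmlinuz
cp /boot/initrd.img-5.15.0-1045-raspi /boot/firmware/initrd.img
reboot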
Yeah, no. The flash-kernel tool is there for a reason. It evidently does more
than a plain copy when moving the kernel/initrd from /boot to /boot/firmware
on Pi hosts.
So now I had an unbootable spof host. I had to remove it from the rack mount,
including its SSD, so I could connect a screen and see what the actual problem
was. The error message was that it couldn’t find the disk, for reasons I don’t
understand. I ended up just copying the kernel and initrd from a different Pi’s
/boot/firmware to spof’s /boot/firmware and got my host (and DNS…) back.
The flash-kernel
tool then had to be invoked on all the already updated Pis
to switch them back to the older kernel and initrd:
flash-kernel --force 5.15.0-1045-raspi
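Once Ubuntu publishes a 5.15 kernel that also contains the backported fix, the holds can be lifted again the same way they were set:
ansible "x86hosts_group" -a "apt-mark unhold linux-image-generic linux-headers-generic linux-generic"
ansible "raspihosts_group" -a "apt-mark unhold linux-image-raspi linux-modules-extra-raspi"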
Let’s hope that this fills my quota of kernel bugs in the Homelab for the next 5 years or so. 😅