This is the third part of my Pi netboot series. You can find an overview and links to the other parts in the Pi netboot series overview article.

This, to me, is the most interesting article of the entire netboot Pi series. When I started setting up netbooting, I had no idea how it would work. I had a vague idea that there was a kernel command line parameter, but no idea where it was interpreted. Now I know that the early boot and the initramfs are not voodoo magic, but just that most mundane of Linux tech: shell scripts. šŸ˜… That was another magic Linux moment for me: Huh, it’s really that simple?

Special thanks for pretty much this entire article go to the GitHub user trickkiste, whose unmerged PR shows how to add support for netbooting from a Ceph RBD volume. The code I will show later is based on that PR, adapted a bit for booting Ubuntu on a Pi 4.

The goal of this article is to show you how to get Ubuntu (any Linux, really) to boot from something, anything, other than a local disk. It applies both to a completely diskless machine doing a PXE netboot and to machines like a Pi, which uses its SD card for the boot partition but stores the root partition somewhere else. The post will also not be Pi specific, because at this point we are out of the arcane Pi boot process and everything is just vanilla Linux.

Please note: All command examples and code in this post will concentrate on initramfs, not initrd. While all the concepts are the same, the commands required to handle them are different. In addition, as I’m using Ubuntu for my Pis, this will serve as the example system, but other Linux distributions probably have very similar looking scripts.

What does the initramfs do?

In the words of the kernel docs on the topic:

All 2.6 Linux kernels contain a gzipped “cpio” format archive, which is extracted into rootfs when the kernel boots up. After extracting, the kernel checks to see if rootfs contains a file “init”, and if so it executes it as PID 1. If found, this init process is responsible for bringing the system the rest of the way up, including locating and mounting the real root device (if any).

So in principle, the main use of the initramfs is to provide an environment for finding and mounting the root partition with a bit more tooling than the kernel itself has. It also looks like it was introduced as an early userspace to implement some special cases (e.g. NFS mounting or loading additional kernel modules) without having to implement all of them directly into the kernel, keeping both the code and the resulting image smaller.

In most implementations, the initramfs contains BusyBox tooling and a simple shell. On Ubuntu, that’s ash.

And this is exactly what we need: An environment in which we can mount a Ceph RBD volume as the root disk.

Basic initramfs-tools scripting

So what does the basic scripting look like? How does it work? I will explain what’s going on under the hood using the Ubuntu variant of the initramfs-tools package. This package is used on a number of distributions to make changing initramfs scripting a bit easier. My explanations will apply to both Debian and Ubuntu, and should also be useful for other distributions.

The scripting used in the initramfs can be found in /usr/share/initramfs-tools/.

As described above, the initramfs scripting is loaded by the kernel as the init process with PID 1. This is done by executing the file init at the root of the initramfs. This is a simple shell script, so everybody can read it. This was a really nice discovery for me, because it means I can go through it and understand what happens, and also adapt it easily.

Let’s start at the end. Being called as the init process by the kernel means that at some point, after the root disk is found and mounted, the initramfs init script needs to invoke the actual init program of the distribution. These days, that’s going to be systemd in most cases. And sure enough, that’s what we see at the end of the script:

exec run-init ${drop_caps} "${rootmnt}" "${init}" "$@" <"${rootmnt}/dev/console" >"${rootmnt}/dev/console" 2>&1

The run-init program which it execs into is a small klibc helper which runs the actual init program. The source can be found here.

The source code comment describes what it does:

/*
* 1. Delete all files in the initramfs;
* 2. Remounts /real-root onto the root filesystem;
* 3. Drops comma-separated list of capabilities;
* 4. Chroots;
* 5. Opens /dev/console;
* 6. Spawns the specified init program (with arguments.)
*
* With the -p option, it skips step 1 in order to allow the initramfs to
* be persisted into the running system.
*
* With the -n option, it skips steps 1, 2 and 6 and can be used to check
* whether the given root and init are likely to work.
*/

The init variable that is handed into run-init is just /sbin/init on an Ubuntu system, which in turn is a symlink pointing to /lib/systemd/systemd.

The first relevant thing the init script does is to parse the command line parameters. These are taken from the kernel command line, which can always be found in /proc/cmdline. The code partially looks like this, with a few uninteresting parameters filtered out:

for x in $(cat /proc/cmdline); do
	case $x in
	root=*)
		ROOT=${x#root=}
		if [ -z "${BOOT}" ] && [ "$ROOT" = "/dev/nfs" ]; then
			BOOT=nfs
		fi
		;;
	nfsroot=*)
		# shellcheck disable=SC2034
		NFSROOT="${x#nfsroot=}"
		;;
	boot=*)
		BOOT=${x#boot=}
		;;
	debug)
		debug=y
		quiet=n
		if [ -n "${netconsole}" ]; then
			log_output=/dev/kmsg
		else
			log_output=/run/initramfs/initramfs.debug
		fi
		set -x
		;;
	netconsole=*)
		netconsole=${x#netconsole=}
		[ "$debug" = "y" ] && log_output=/dev/kmsg
		;;
	esac
done

The root and nfsroot options provide the root disk/device to mount. The boot option and the BOOT environment variable will become important later, because the content of that variable determines which script initramfs uses to mount the root device.

The debug and netconsole options are interesting for debugging, especially in a netboot scenario where you don’t necessarily have the option of attaching a screen to your host.

The next interesting part of the script loads any additional kernel modules before the mount process for the root disk starts:

[ "$quiet" != "y" ] && log_begin_msg "Loading essential drivers"
[ -n "${netconsole}" ] && modprobe netconsole netconsole="${netconsole}"
load_modules
[ "$quiet" != "y" ] && log_end_msg

The load_modules function looks into the /conf/modules file and runs modprobe on each line which is not a comment. Which modules end up in /conf/modules is defined in /etc/initramfs-tools/modules when the initramfs is created.
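The heavy lifting there is just a read loop. Here is a tiny standalone sketch of the same comment-skipping logic, not the real load_modules implementation: the file path and module names are made up, and collecting names stands in for actually running modprobe.

```shell
# Hypothetical stand-in for /conf/modules, just for demonstration.
cat > /tmp/conf-modules <<'EOF'
# modules needed for netboot
rbd
af_packet
EOF

loaded=""
while read -r module args; do
	# skip empty lines and comments, like the real loop does
	case "$module" in
	""|\#*) continue ;;
	esac
	# the real script would run: modprobe "$module" $args
	loaded="$loaded $module"
done < /tmp/conf-modules

echo "would modprobe:$loaded"
```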

Next comes the sourcing of some root-device-specific scripting. Both the local-disk and nfs scripts are always sourced. The interesting part of this sourcing is this line, though:

. /scripts/${BOOT}

As I wrote above, the content of the boot command line option, stored in the BOOT variable, determines which script is used for booting. So when you implement your own mount type for the root disk, you can just place a script called something like my-boot-method into /etc/initramfs-tools/scripts/, and it will end up in the /scripts directory of the initramfs. Then, you add boot=my-boot-method to your kernel command line, and initramfs will source that script here.

Those scripts have a very specific content, demonstrated by the lines which follow the sourcing and which do the actual root mounting:

mount_top
mount_premount
mountroot

These three lines call the functions which are expected to mount the actual root device. These functions are overridden in each of the scripts, so that whatever script is named in BOOT provides the implementation for those three functions.
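To make this concrete, here is a minimal sketch of what such a script could look like. The name my-boot-method and the placeholder mount target are my own illustrative assumptions, not initramfs-tools code; the real init script sources the file and then calls the three functions itself.

```shell
# Sketch of /etc/initramfs-tools/scripts/my-boot-method (hypothetical name).

mount_top()
{
	# runs first; a good place for one-time setup
	:
}

mount_premount()
{
	# runs right before the root mount, e.g. to bring up networking
	:
}

mountroot()
{
	# must leave the root filesystem mounted at ${rootmnt};
	# a real implementation would map and mount an actual device here
	rootmnt=${rootmnt:-/tmp/fake-rootmnt}
	mkdir -p "${rootmnt}"
	echo "would mount root onto ${rootmnt}"
}

# the initramfs init script performs these calls after sourcing
mount_top
mount_premount
mountroot
```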

NFS as an example of how it works

Because NFS is implemented as a root disk option by default, I will use it as an example. And while I don’t understand it, I still recognize that not everybody wants to run a Ceph cluster. šŸ˜‰

The NFS root disk scripting can be found at /usr/share/initramfs-tools/scripts/nfs.

The important parts of the functionality are the following two functions:

nfs_mount_root()
{
        nfs_top

        # For DHCP
        modprobe af_packet

        wait_for_udev 10

        # Default delay is around 180s
        delay=${ROOTDELAY:-180}

        # loop until nfsmount succeeds
        nfs_mount_root_impl
        ret=$?
        nfs_retry_count=0
        while [ ${nfs_retry_count} -lt "${delay}" ] \
                && [ $ret -ne 0 ] ; do
                [ "$quiet" != "y" ] && log_begin_msg "Retrying nfs mount"
                sleep 1
                nfs_mount_root_impl
                ret=$?
                nfs_retry_count=$(( nfs_retry_count + 1 ))
                [ "$quiet" != "y" ] && log_end_msg
        done
}
nfs_mount_root_impl()
{
        configure_networking

        # get nfs root from dhcp
        if [ "${NFSROOT}" = "auto" ]; then
                # check if server ip is part of dhcp root-path
                if [ "${ROOTPATH#*:}" = "${ROOTPATH}" ]; then
                        NFSROOT=${ROOTSERVER}:${ROOTPATH}
                else
                        NFSROOT=${ROOTPATH}
                fi

        # nfsroot=[<server-ip>:]<root-dir>[,<nfs-options>]
        elif [ -n "${NFSROOT}" ]; then
                # nfs options are an optional arg
                if [ "${NFSROOT#*,}" != "${NFSROOT}" ]; then
                        NFSOPTS="-o ${NFSROOT#*,}"
                fi
                NFSROOT=${NFSROOT%%,*}
                if [ "${NFSROOT#*:}" = "$NFSROOT" ]; then
                        NFSROOT=${ROOTSERVER}:${NFSROOT}
                fi
        fi

        if [ -z "${NFSOPTS}" ]; then
                NFSOPTS="-o retrans=10"
        fi

        nfs_premount

        if [ "${readonly?}" = y ]; then
                roflag="-o ro"
        else
                roflag="-o rw"
        fi

        # shellcheck disable=SC2086
        nfsmount -o nolock ${roflag} ${NFSOPTS} "${NFSROOT}" "${rootmnt?}"
}

The nfs_mount_root function is called by the mountroot function mentioned in the main init script. It is mainly responsible for implementing a retry mechanism.

The actual mount happens in the nfs_mount_root_impl function. First, it reads the nfsroot kernel parameter. This has the format nfsroot=[<server-ip>:]<root-dir>[,<nfs-options>]. As an example, the option would be nfsroot=10.0.0.15:/serverroots/server1 if the NFS server was running on 10.0.0.15 and the root for the current server was /serverroots/server1 on that NFS server.
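The parsing itself is pure POSIX parameter expansion, so it can be tried standalone on an example value. The server IP, path, and options below are made up for demonstration:

```shell
# Mirror of the nfsroot parsing from nfs_mount_root_impl, on an example value.
NFSROOT="10.0.0.15:/serverroots/server1,ro,vers=4"
ROOTSERVER="10.0.0.15"   # would normally come from DHCP
NFSOPTS=""

# nfs options are an optional arg after the first comma
if [ "${NFSROOT#*,}" != "${NFSROOT}" ]; then
	NFSOPTS="-o ${NFSROOT#*,}"
fi
NFSROOT=${NFSROOT%%,*}
# prepend the DHCP-provided server if the value has no <server-ip>: part
if [ "${NFSROOT#*:}" = "$NFSROOT" ]; then
	NFSROOT=${ROOTSERVER}:${NFSROOT}
fi

echo "NFSROOT=${NFSROOT}"
echo "NFSOPTS=${NFSOPTS}"
```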

The nfsmount utility called at the end of the function to execute the mount is another small helper program similar to run-init in that it is only part of the initramfs and linked against klibc. Its source code can be found here.

And that’s it already. By setting the boot=nfs and nfsroot=... options on your kernel command line, you can boot with a root disk located on NFS. This functionality does not need to be implemented explicitly; it is already part of the kernel and the default initramfs-tools. This also already works fine with Raspberry Pi OS.

Booting from a Ceph RBD volume

Now finally to the reason we’re doing all of this: Booting not from NFS or a local disk, but from RBD. There are a number of details which need to be configured to actually get an initramfs which can boot into an RBD volume. Details on that will come in the next article of the series, in which I will present a HashiCorp Packer image and Ansible playbook to generate a Raspberry Pi image which netboots and uses an RBD volume as the root disk.

Here, I will concentrate only on the necessary initramfs scripting to get it working.

As noted above, new scripts/boot methods can just be dropped into /etc/initramfs-tools/scripts. All of the following code should be put into a file called rbd in that directory.

Most of the code for booting from an RBD volume comes from this pull request and has been lightly adapted by me.

As said before, the basic idea is that the init script sources the file in /scripts/$BOOT, overwriting three functions which it then calls. The most important of these is mountroot. In the RBD implementation, this function calls rbd_mount_root, which looks like this:

rbd_mount_root()
{

	export RBDROOT=

	# Parse command line options for rbdroot option
	for x in $(cat /proc/cmdline); do
		case $x in
		rbdroot=*)
			RBDROOT="${x#rbdroot=}"
			;;
		esac
	done

	rbd_top

	modprobe rbd
	# For DHCP
	modprobe af_packet

	wait_for_udev 10

	# Default delay is around 120s
	delay=${ROOTDELAY:-120}

	# loop until rbd map succeeds
	rbd_map_root_impl
	rbd_map_retry_count=0
	while [ ${rbd_map_retry_count} -lt "${delay}" ] \
		&& [ -z "$dev" ] ; do
		[ "$quiet" != "y" ] && log_begin_msg "Retrying rbd map"
		/bin/sleep 1
		rbd_map_root_impl
		rbd_map_retry_count=$(( ${rbd_map_retry_count} + 1 ))
		[ "$quiet" != "y" ] && log_end_msg
	done

	if [ -z "$dev" ] ; then
		echo "ERROR: RBD could not be mapped"
		return 1
	fi

	# loop until rbd mount succeeds
	rbd_mount_root_impl
	rbd_mount_retry_count=0
	while [ ${rbd_mount_retry_count} -lt "${delay}" ] \
		&& ! chroot "${rootmnt}" test -x "${init}" ; do
		[ "$quiet" != "y" ] && log_begin_msg "Retrying rbd mount"
		/bin/sleep 1
		rbd_mount_root_impl
		rbd_mount_retry_count=$(( ${rbd_mount_retry_count} + 1 ))
		[ "$quiet" != "y" ] && log_end_msg
	done
}

To begin with, the rbdroot option is read from the kernel command line and put into a variable. This variable will later be interpreted to provide the necessary credentials and configs to get the right RBD volume from the Ceph cluster. An example value would be rbdroot=10.86.5.105,10.86.5.102,10.86.5.104:AUTHX-USERNAME:AUTHX-PASSWORD:mypool:myvolume::_netdev,noatime.
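The colon-splitting of that value can be sketched in isolation. This is a simplified version of the loop in rbd_map_root_impl (it drops the snapshot handling) run on the made-up example value from above:

```shell
# Split an example rbdroot value on ":" the way rbd_map_root_impl does.
RBDROOT="10.86.5.105,10.86.5.102,10.86.5.104:AUTHX-USERNAME:AUTHX-PASSWORD:mypool:myvolume::_netdev,noatime"

i=1
OLD_IFS=${IFS}
IFS=":"
for arg in ${RBDROOT} ; do
	case ${i} in
		1) mons=$(echo ${arg} | tr ";" ":") ;;
		2) user=${arg} ;;
		3) key=${arg} ;;
		4) pool=${arg} ;;
		5) image=${arg%%@*} ;;   # simplified: ignores a possible @snapshot
		6) partition=${arg} ;;
		7) mountopts=${arg} ;;
	esac
	i=$((i + 1))
done
IFS=${OLD_IFS}

echo "mons=${mons} pool=${pool} image=${image} mountopts=${mountopts}"
```

Note that the empty field between myvolume and _netdev,noatime still counts, which is why partition ends up empty here.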

Next follows the loading of some necessary kernel modules which need to be available on the initramfs.

Then follows the actual mounting. For Ceph RBD volumes, mounting happens in two steps. The first one is to map the volume to the local host, which puts a device file under /dev. Then follows the actual mounting just like any other disk, depending only on the filesystem of the volume.

The first step, the mapping, is done in the rbd_map_root_impl function, which looks as follows:

rbd_map_root_impl()
{
	configure_networking

	# get rbd root from dhcp
	if [ "x${RBDROOT}" = "xauto" ]; then
		RBDROOT=${ROOTPATH}
	fi

	local mons user key pool image snap partition

	# rbdroot=<mons>:<user>:<key>:<pool>:<image>[@<snapshot>]:[<partition>]:[<mountopts>]
	if [ -n "${RBDROOT}" ]; then
		local i=1
		local OLD_IFS=${IFS}
		IFS=":"
		for arg in ${RBDROOT} ; do
			case ${i} in
				1)
					mons=$(echo ${arg} | tr ";" ":")
					;;
				2)
					user=${arg}
					;;
				3)
					key=${arg}
					;;
				4)
					pool=${arg}
					;;
				5)
					# image contains an @, i.e. a snapshot
					if [ ${arg#*@*} != ${arg} ] ; then
						image=${arg%%@*}
						snap=${arg##*@}
					else
						image=${arg}
						snap=""
					fi
					;;
				6)
					partition=${arg}
					;;
				7)
					mountopts=${arg}
					;;
			esac
			i=$((${i} + 1))
		done
		IFS=${OLD_IFS}
	fi

	# the kernel will reject writes to add if add_single_major exists
	local rbd_bus
	if [ -e /sys/bus/rbd/add_single_major ]; then
		rbd_bus=/sys/bus/rbd/add_single_major
	elif [ -e /sys/bus/rbd/add ]; then
		rbd_bus=/sys/bus/rbd/add
	else
		echo "ERROR: /sys/bus/rbd/add does not exist"
		return 1
	fi

	# tell the kernel rbd client to map the block device
	echo "${mons} name=${user},secret=${key} ${pool} ${image} ${snap}" > ${rbd_bus}
	# figure out where the block device appeared
	dev=$(ls /dev/rbd* | grep '/dev/rbd[0-9]*$' | tail -n 1)
	# add partition if set
	if [ -n "${partition}" ]; then
		dev=${dev}p${partition}
	fi
}

In the first part, the rbdroot kernel command line is parsed. Then follows the mapping of the volume to the local host. Here, instead of adding the CLI tools of Ceph with all of their dependencies to the initramfs, direct writing to the RBD kernel module’s /sys/bus/rbd file is done. As the final mapping step, the new device is stored in the dev variable.
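One detail worth calling out is the device detection: mapping creates whole-device nodes like /dev/rbd0 and /dev/rbd1, while partitions appear as /dev/rbd0p1, and the grep pattern keeps only the whole-device nodes with tail -n 1 picking the most recently listed one. A quick standalone check of that filter, with the device listing simulated instead of read from /dev:

```shell
# Simulate the /dev listing after mapping two volumes, one of them partitioned.
devices="/dev/rbd0
/dev/rbd0p1
/dev/rbd1"

# Same filter as in rbd_map_root_impl: whole-device nodes only, last one wins.
dev=$(echo "$devices" | grep '/dev/rbd[0-9]*$' | tail -n 1)
echo "mapped device: $dev"
```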

The last step of the process is the mount itself, which happens in the rbd_mount_root_impl function:

rbd_mount_root_impl()
{
	if [ "${readonly}" = y ]; then
		roflag="-r"
	else
		roflag="-w"
	fi

	if [ -n "$mountopts" ] ; then
		mountopts="-o $mountopts"
	fi

	# get the root filesystem type if not set
	if [ -z "${ROOTFSTYPE}" ]; then
		FSTYPE=$(get_fstype "${dev}")
	else
		FSTYPE=${ROOTFSTYPE}
	fi

	rbd_premount

	# mount the fs
	modprobe ${FSTYPE}
	echo "EXECUTING: \"mount -t ${FSTYPE} ${roflag},${mountopts} $dev ${rootmnt}\""
	mount -t ${FSTYPE} ${roflag} ${mountopts} $dev ${rootmnt}

}

This function just determines the filesystem type and then executes a normal mount call.

And with that, we are done. The RBD volume should now be mounted and ready to be used as the root device for our host.

Debugging

So how to debug your initramfs? Let’s start with where to find it. That should be pretty simple: It’s going to be in your /boot directory, right alongside the kernel image.

Manipulating an initramfs

While constructing an initramfs, especially for netbooting outside the default NFS option, it might be useful to be able to look at an initramfs’ content.

A good overview of how to unpack and repack an initramfs can be found in this blog article.

First, figure out what compression the initramfs uses by running:

file /path/to/initramfs

The following command will unpack an image:

gunzip -c /path/to/initramfs | cpio -i

This works for a gzip compressed image. If you have, for example, a zstd compressed image, just replace gunzip -c with zstdcat. Note that on recent Ubuntu releases the initramfs may actually be several concatenated cpio archives (e.g. CPU microcode followed by the main archive); in that case the unmkinitramfs tool shipped with initramfs-tools is the more robust way to unpack it.

To repackage the image after making your changes, run the commands in reverse:

find . | cpio -H newc -o | gzip -9 > /path/for/new/image

Getting logs from a remote boot

One important question when debugging: How do I see the console output of the initramfs scripting if I don’t have a monitor connected to the machine?

This can be accomplished with the netconsole kernel option and netcat.

The kernel docs for the netconsole feature can be found here.

An example netconsole= kernel command line parameter would look like this:

netconsole=4444@10.0.0.1/eth1,9353@10.0.0.2/12:34:56:78:9a:bc

The parameters are as follows:

  • 4444: The source port to use for sending (port on the host producing the logs)
  • 10.0.0.1: IP of the host producing the logs
  • /eth1: Name of the NIC to use (be aware that this might differ from the name in the booted system, as systemd/udev change the name for predictable NIC naming)
  • 9353@10.0.0.2: Target port and IP on the machine which is listening for the logs
  • /12:34:56:78:9a:bc: MAC address of the listening machine (find this with ip link)

On the receiver side, i.e. the machine where you want to receive the logs, you can use netcat: nc -u -l <port>, where <port> in the above example netconsole line would be 9353.

Closing

I hope that with the above article, I was able to generate the same “Huh, it’s that simple?” reaction and delight that rummaging through the initramfs created for me.

The next and last article in this series will give a short overview on how to create a HashiCorp Packer image and Ansible playbook for a netbooting Raspberry Pi.