This is the third part of my Pi netboot series. You can find an overview and links to the other parts in the Pi netboot series overview article.
This, to me, is the most interesting article of the entire netboot Pi series. When I started setting up netbooting, I had no idea how it would work. I had a vague idea that there was a kernel command line parameter, but no idea where it was interpreted. Now I know that the early boot and the initramfs are not voodoo magic, but just that most mundane of Linux tech: Shell scripts. 😅 That was another magic Linux moment for me: Huh, it’s really that simple?
Special thanks for pretty much this entire article go to the GitHub user trickkiste, whose unmerged PR shows how to add support for netbooting from a Ceph RBD volume. The code I will show later is based on that PR, adapted a bit for booting Ubuntu on a Pi 4.
The goal of this article is to show you how to get Ubuntu (any Linux, really) to boot from something, anything, other than a local disk. It applies both to a completely diskless machine doing PXE netboot and to a machine like a Pi, which uses its SD card for the boot partition but stores the root partition somewhere else. The post will also not be Pi specific, because at this point we are out of the arcane Pi boot process and everything is just vanilla Linux.
Please note: All command examples and code in this post will concentrate on initramfs, not initrd. While all the concepts are the same, the commands required to handle them are different. In addition, as I’m using Ubuntu for my Pis, this will serve as the example system, but other Linux distributions probably have very similar looking scripts.
What does the initramfs do?
In the words of the kernel docs on the topic:
All 2.6 Linux kernels contain a gzipped “cpio” format archive, which is extracted into rootfs when the kernel boots up. After extracting, the kernel checks to see if rootfs contains a file “init”, and if so it executes it as PID 1. If found, this init process is responsible for bringing the system the rest of the way up, including locating and mounting the real root device (if any).
So in principle, the main use of the initramfs is to provide an environment for finding and mounting the root partition with a bit more tooling than the kernel itself has. It also looks like it was introduced as an early userspace to implement some special cases (e.g. NFS mounting or loading additional kernel modules) without having to implement all of them directly into the kernel, keeping both the code and the resulting image smaller.
In most implementations, the initramfs contains BusyBox tooling and a simple shell. On Ubuntu, that’s ash.
And this is exactly what we need: An environment in which we can mount a Ceph RBD volume as the root disk.
Basic initramfs-tools scripting
So what does the basic scripting look like? How does it work? I will explain
what’s going on under the hood using the Ubuntu variant of the initramfs-tools
package. This package is used on a number of distributions to make changing
initramfs scripting a bit easier. My explanations will apply to both Debian
and Ubuntu, and should also be useful for other distributions.
The scripting used in the initramfs can be found in /usr/share/initramfs-tools/.
As described above, the initramfs scripting is loaded by the kernel as the init process with PID 1. This is done by executing the file init at the root of the initramfs. This is a simple shell script, so everybody can read it. This was a really nice discovery for me, because it means I can go through it and understand what happens, and also adapt it easily.
Let’s start at the end. Being called as the init process by the kernel means that at some point, after the root disk is found and mounted, the initramfs init script needs to invoke the actual init program of the distribution. These days, that’s going to be systemd in most cases.
And sure enough, that’s what we see at the end of the script:
exec run-init ${drop_caps} "${rootmnt}" "${init}" "$@" <"${rootmnt}/dev/console" >"${rootmnt}/dev/console" 2>&1
The run-init program which it execs into is a small kernel helper which runs the actual init program. The source can be found here. The source code comment describes what it does:
/*
* 1. Delete all files in the initramfs;
* 2. Remounts /real-root onto the root filesystem;
* 3. Drops comma-separated list of capabilities;
* 4. Chroots;
* 5. Opens /dev/console;
* 6. Spawns the specified init program (with arguments.)
*
* With the -p option, it skips step 1 in order to allow the initramfs to
* be persisted into the running system.
*
* With the -n option, it skips steps 1, 2 and 6 and can be used to check
* whether the given root and init are likely to work.
*/
The init variable that is handed into run-init is just /sbin/init on an Ubuntu system, which in turn is a symlink pointing to /lib/systemd/systemd.
The first relevant thing the init script does is to parse the command line parameters. These are taken from the kernel command line, which can always be found in /proc/cmdline. The code partially looks like this, with a few uninteresting parameters filtered out:
for x in $(cat /proc/cmdline); do
    case $x in
    root=*)
        ROOT=${x#root=}
        if [ -z "${BOOT}" ] && [ "$ROOT" = "/dev/nfs" ]; then
            BOOT=nfs
        fi
        ;;
    nfsroot=*)
        # shellcheck disable=SC2034
        NFSROOT="${x#nfsroot=}"
        ;;
    boot=*)
        BOOT=${x#boot=}
        ;;
    debug)
        debug=y
        quiet=n
        if [ -n "${netconsole}" ]; then
            log_output=/dev/kmsg
        else
            log_output=/run/initramfs/initramfs.debug
        fi
        set -x
        ;;
    netconsole=*)
        netconsole=${x#netconsole=}
        [ "$debug" = "y" ] && log_output=/dev/kmsg
        ;;
    esac
done
The root and nfsroot options provide the root disk/device to mount. The boot option and the BOOT environment variable will become important later, because the content of that variable determines which script the initramfs uses to mount the root device. The debug and netconsole options are interesting for debugging, especially in a netboot scenario where you don’t necessarily have the option of attaching a screen to your host.
The next interesting part of the script loads any additional kernel modules before the mount process for the root disk starts:
[ "$quiet" != "y" ] && log_begin_msg "Loading essential drivers"
[ -n "${netconsole}" ] && modprobe netconsole netconsole="${netconsole}"
load_modules
[ "$quiet" != "y" ] && log_end_msg
The load_modules function looks into the /conf/modules file and runs modprobe on each line which is not a comment. Which modules end up in the /conf/modules file is defined in /etc/initramfs-tools/modules when the initramfs is created.
Next comes the sourcing of root-device-specific scripting. Both the local disk and nfs scripts are always sourced. The interesting part of this sourcing is this line, though:
. /scripts/${BOOT}
As I wrote above, the content of the boot command line option, stored in the BOOT variable, determines which script is used for booting. So when you implement your own mount type for the root disk, you can just place a script called something like my-boot-method into /etc/initramfs-tools/scripts/, and it will end up in the /scripts directory of the initramfs. Then you add boot=my-boot-method to your kernel command line, and the initramfs will source that script here.
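As a rough sketch, such a script could look like this. The script name my-boot-method and the device path are made up for illustration; a real script would do its device discovery inside mountroot:

```shell
# Hypothetical /etc/initramfs-tools/scripts/my-boot-method (illustrative only).
# Pull in the initramfs helper functions when running inside an initramfs.
[ -r /scripts/functions ] && . /scripts/functions

mount_top()
{
    :   # hook that runs first; often empty
}

mount_premount()
{
    :   # hook that runs right before mountroot
}

mountroot()
{
    # locate/attach the root device here, then mount it on ${rootmnt}
    mount -o ro /dev/my-root-device "${rootmnt}"
}
```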
Those scripts have a very specific content, demonstrated by the lines which follow the sourcing and which do the actual root mounting:
mount_top
mount_premount
mountroot
These three lines call the functions which are expected to mount the actual root device. These functions are overridden in each of the scripts, so that whatever script is named in BOOT provides the implementation for those three functions.
NFS as an example of how it works
Because NFS is implemented as a root disk option by default, I will use it as an example. And while I don’t understand it, I still recognize that not everybody wants to run a Ceph cluster. 😉
The NFS root disk scripting can be found at /usr/share/initramfs-tools/scripts/nfs. The important parts of the functionality are the following two functions:
nfs_mount_root()
{
    nfs_top

    # For DHCP
    modprobe af_packet

    wait_for_udev 10

    # Default delay is around 180s
    delay=${ROOTDELAY:-180}

    # loop until nfsmount succeeds
    nfs_mount_root_impl
    ret=$?
    nfs_retry_count=0
    while [ ${nfs_retry_count} -lt "${delay}" ] \
        && [ $ret -ne 0 ] ; do
        [ "$quiet" != "y" ] && log_begin_msg "Retrying nfs mount"
        sleep 1
        nfs_mount_root_impl
        ret=$?
        nfs_retry_count=$(( nfs_retry_count + 1 ))
        [ "$quiet" != "y" ] && log_end_msg
    done
}

nfs_mount_root_impl()
{
    configure_networking

    # get nfs root from dhcp
    if [ "${NFSROOT}" = "auto" ]; then
        # check if server ip is part of dhcp root-path
        if [ "${ROOTPATH#*:}" = "${ROOTPATH}" ]; then
            NFSROOT=${ROOTSERVER}:${ROOTPATH}
        else
            NFSROOT=${ROOTPATH}
        fi
    # nfsroot=[<server-ip>:]<root-dir>[,<nfs-options>]
    elif [ -n "${NFSROOT}" ]; then
        # nfs options are an optional arg
        if [ "${NFSROOT#*,}" != "${NFSROOT}" ]; then
            NFSOPTS="-o ${NFSROOT#*,}"
        fi
        NFSROOT=${NFSROOT%%,*}
        if [ "${NFSROOT#*:}" = "$NFSROOT" ]; then
            NFSROOT=${ROOTSERVER}:${NFSROOT}
        fi
    fi

    if [ -z "${NFSOPTS}" ]; then
        NFSOPTS="-o retrans=10"
    fi

    nfs_premount

    if [ "${readonly?}" = y ]; then
        roflag="-o ro"
    else
        roflag="-o rw"
    fi

    # shellcheck disable=SC2086
    nfsmount -o nolock ${roflag} ${NFSOPTS} "${NFSROOT}" "${rootmnt?}"
}
The nfs_mount_root function is called by the mountroot function mentioned in the main init script. It is mainly responsible for implementing a retry mechanism.
The actual mount happens in the nfs_mount_root_impl function. First, it reads the nfsroot kernel parameter. This has the format nfsroot=[<server-ip>:]<root-dir>[,<nfs-options>]. As an example, the option would be nfsroot=10.0.0.15:/serverroots/server1 if the NFS server was running on 10.0.0.15 and the root for the current server was serverroots/server1 on that NFS server.
The nfsmount utility called at the end of the function to execute the mount is another small helper program, similar to run-init in that it is only part of the initramfs and linked against klibc. Its source code can be found here.
And that’s it already. By setting the boot=nfs and nfsroot=... options on your kernel command line, you can boot with a root disk located on NFS. This is functionality you don’t need to explicitly implement; it is already part of the kernel and the default initramfs-tools. This also already works fine with Raspberry Pi OS.
Booting from a Ceph RBD volume
Now finally to the reason we’re doing all of this: Booting not from NFS or a local disk, but from RBD. There are a number of details which need to be configured to actually get an initramfs which can boot into an RBD volume. Details on that will come in the next article of the series, in which I will present a HashiCorp Packer image and Ansible playbook to generate a Raspberry Pi image which netboots and uses an RBD volume as the root disk.
Here, I will concentrate only on the necessary initramfs scripting to get it working.
As noted above, new scripts/boot methods can just be dropped into /etc/initramfs-tools/scripts. All of the following code should be put into a file called rbd in that directory.
Most of the code for booting from an RBD volume comes from this pull request and has been lightly adapted by me.
As said before, the basic idea is that the init script sources the file in /scripts/$BOOT, which overrides three functions the init script then calls. The most important of these is mountroot. In the RBD implementation, this function calls rbd_mount_root, which looks like this:
rbd_mount_root()
{
    export RBDROOT=

    # Parse command line options for the rbdroot option
    for x in $(cat /proc/cmdline); do
        case $x in
        rbdroot=*)
            RBDROOT="${x#rbdroot=}"
            ;;
        esac
    done

    rbd_top

    modprobe rbd
    # For DHCP
    modprobe af_packet

    wait_for_udev 10

    # Default delay is around 120s
    delay=${ROOTDELAY:-120}

    # loop until rbd map succeeds
    rbd_map_root_impl
    rbd_map_retry_count=0
    while [ ${rbd_map_retry_count} -lt "${delay}" ] \
        && [ -z "$dev" ] ; do
        [ "$quiet" != "y" ] && log_begin_msg "Retrying rbd map"
        /bin/sleep 1
        rbd_map_root_impl
        rbd_map_retry_count=$(( rbd_map_retry_count + 1 ))
        [ "$quiet" != "y" ] && log_end_msg
    done

    if [ -z "$dev" ] ; then
        echo "ERROR: RBD could not be mapped"
        return 1
    fi

    # loop until rbd mount succeeds
    rbd_mount_root_impl
    rbd_mount_retry_count=0
    while [ ${rbd_mount_retry_count} -lt "${delay}" ] \
        && ! chroot "${rootmnt}" test -x "${init}" ; do
        [ "$quiet" != "y" ] && log_begin_msg "Retrying rbd mount"
        /bin/sleep 1
        rbd_mount_root_impl
        rbd_mount_retry_count=$(( rbd_mount_retry_count + 1 ))
        [ "$quiet" != "y" ] && log_end_msg
    done
}
To begin with, the rbdroot option is read from the kernel command line and put into a variable. This variable will later be interpreted to provide the necessary credentials and configs to get the right RBD volume from the Ceph cluster. An example value would be rbdroot=10.86.5.105,10.86.5.102,10.86.5.104:AUTHX-USERNAME:AUTHX-PASSWORD:mypool:myvolume::_netdev,noatime.
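To make the format more concrete, here is the colon-splitting from the mapping function below, applied to a made-up value. All credentials and names are placeholders; note that ";" inside the monitor list gets rewritten to ":", which presumably lets you specify monitor ports without clashing with the ":" field separator:

```shell
# Sketch: splitting an illustrative rbdroot value the way rbd_map_root_impl
# does. All credentials and names here are placeholders.
RBDROOT="10.86.5.105;6789,10.86.5.102;6789:myuser:MYSECRET:mypool:myvolume@snap1::noatime"

OLD_IFS=${IFS}
IFS=":"
i=1
for arg in ${RBDROOT}; do
    case ${i} in
    1) mons=$(echo "${arg}" | tr ";" ":") ;;   # restore ":" before the ports
    2) user=${arg} ;;
    3) key=${arg} ;;
    4) pool=${arg} ;;
    5) image=${arg%%@*}; snap=${arg##*@} ;;    # split off the snapshot
    6) partition=${arg} ;;
    7) mountopts=${arg} ;;
    esac
    i=$(( i + 1 ))
done
IFS=${OLD_IFS}

echo "${mons} ${pool}/${image}@${snap} opts=${mountopts}"
```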
Next follows the loading of some necessary kernel modules which need to be available on the initramfs.
Then follows the actual mounting. For Ceph RBD volumes, mounting happens in two steps. The first one is to map the volume to the local host, which puts a device file under /dev. Then follows the actual mounting, just like for any other disk, depending only on the filesystem of the volume.
The first step, the mapping, is done in the rbd_map_root_impl function, which looks as follows:
rbd_map_root_impl()
{
    configure_networking

    # get rbd root from dhcp
    if [ "x${RBDROOT}" = "xauto" ]; then
        RBDROOT=${ROOTPATH}
    fi

    local mons user key pool image snap partition

    # rbdroot=<mons>:<user>:<key>:<pool>:<image>[@<snapshot>]:[<partition>]:[<mountopts>]
    if [ -n "${RBDROOT}" ]; then
        local i=1
        local OLD_IFS=${IFS}
        IFS=":"
        for arg in ${RBDROOT} ; do
            case ${i} in
            1)
                mons=$(echo "${arg}" | tr ";" ":")
                ;;
            2)
                user=${arg}
                ;;
            3)
                key=${arg}
                ;;
            4)
                pool=${arg}
                ;;
            5)
                # image contains an @, i.e. a snapshot
                if [ "${arg#*@*}" != "${arg}" ] ; then
                    image=${arg%%@*}
                    snap=${arg##*@}
                else
                    image=${arg}
                    snap=""
                fi
                ;;
            6)
                partition=${arg}
                ;;
            7)
                mountopts=${arg}
                ;;
            esac
            i=$(( i + 1 ))
        done
        IFS=${OLD_IFS}
    fi

    # the kernel will reject writes to add if add_single_major exists
    local rbd_bus
    if [ -e /sys/bus/rbd/add_single_major ]; then
        rbd_bus=/sys/bus/rbd/add_single_major
    elif [ -e /sys/bus/rbd/add ]; then
        rbd_bus=/sys/bus/rbd/add
    else
        echo "ERROR: /sys/bus/rbd/add does not exist"
        return 1
    fi

    # tell the kernel rbd client to map the block device
    echo "${mons} name=${user},secret=${key} ${pool} ${image} ${snap}" > ${rbd_bus}

    # figure out where the block device appeared
    dev=$(ls /dev/rbd* | grep '/dev/rbd[0-9]*$' | tail -n 1)

    # add partition if set
    if [ -n "${partition}" ]; then
        dev=${dev}p${partition}
    fi
}
In the first part, the rbdroot kernel command line option is parsed. Then follows the mapping of the volume to the local host. Here, instead of adding the Ceph CLI tools with all of their dependencies to the initramfs, the script writes directly to the RBD kernel module’s /sys/bus/rbd interface. As the final mapping step, the new device is stored in the dev variable.
The last step of the process is the mount itself, which happens in the rbd_mount_root_impl function:
rbd_mount_root_impl()
{
    if [ "${readonly}" = y ]; then
        roflag="-r"
    else
        roflag="-w"
    fi

    if [ -n "$mountopts" ] ; then
        mountopts="-o $mountopts"
    fi

    # get the root filesystem type if not set
    if [ -z "${ROOTFSTYPE}" ]; then
        FSTYPE=$(get_fstype "${dev}")
    else
        FSTYPE=${ROOTFSTYPE}
    fi

    rbd_premount

    # mount the fs
    modprobe ${FSTYPE}
    echo "EXECUTING: \"mount -t ${FSTYPE} ${roflag} ${mountopts} $dev ${rootmnt}\""
    mount -t ${FSTYPE} ${roflag} ${mountopts} $dev ${rootmnt}
}
This function just determines the filesystem type and then executes a normal mount call.
And with that, we are done. The RBD volume should now be mounted and ready to be used as the root device for our host.
Debugging
So how do you debug your initramfs? Let’s start with where to find it. That should be pretty simple: it’s going to be in your /boot directory, right alongside the kernel image.
Manipulating an initramfs
While constructing an initramfs, especially for netbooting outside the default NFS option, it might be useful to be able to look at an initramfs’ content.
A good overview of how to unpack and repack an initramfs can be found in this blog article.
First, figure out what compression the initramfs uses by running:
file /path/to/initramfs
The following command will unpack an image:
gunzip -c /path/to/initramfs | cpio -i
This works for a gzip compressed image. If you have, for example, a zstd compressed image, just replace gunzip -c with zstdcat.
To repackage the image after making your changes, run the commands in reverse:
find . | cpio -H newc -o | gzip -9 > /path/for/new/image
Getting logs from a remote boot
One important question when debugging: How do I see the console output of the initramfs scripting if I don’t have a monitor connected to the machine?
This can be accomplished with the netconsole kernel option and netcat.
The kernel docs for the netconsole feature can be found here.
An example netconsole= kernel command line parameter would look like this:
netconsole=4444@10.0.0.1/eth1,9353@10.0.0.2/12:34:56:78:9a:bc
The parameters are as follows:
- 4444: The source port to use for sending (port on the host producing the logs)
- 10.0.0.1: IP of the host producing the logs
- /eth1: Name of the NIC to use (be aware that this might differ from the name in the booted system, as systemd/udev rename NICs for predictable naming)
- 9353@10.0.0.2: Target port and IP of the machine which is listening for the logs
- /12:34:56:78:9a:bc: MAC address of the listening machine (find this with ip link)
On the receiver side, on the machine where you want to receive the logs, you can use netcat: nc -u -l <port>, where <port> in the above example netconsole line would be 9353.
Closing
I hope that with the above article, I was able to generate the same “Huh, it’s that simple?” reaction and delight that rummaging through the initramfs created for me.
The next and last article in this series will give a short overview on how to create a HashiCorp Packer image and Ansible playbook for a netbooting Raspberry Pi.