In this post, I will describe how I deployed Tinkerbell into my k3s cluster and provisioned the first Ubuntu VM with it.
This is part 3 of my Tinkerbell series.
Deploying Tinkerbell
The first step is to deploy Tinkerbell into the k3s cluster I set up in the previous post. For this, I used the official Helm chart, which can be found here.
My values.yaml file looks like this:
publicIP: "203.0.113.200"
trustedProxies:
- "10.42.0.0/24"
artifactsFileServer: "http://203.0.113.200:7173"
deployment:
envs:
tinkController:
enableLeaderElection: false
smee:
dhcpMode: "proxy"
globals:
enableRufioController: false
enableSecondstar: false
logLevel: 3
init:
enabled: true
service:
lbClass: ""
optional:
hookos:
service:
lbClass: ""
kernelVersion: "both"
persistence:
existingClaim: "hookos-volume"
kubevip:
enabled: false
The first setting, publicIP, is the public IP under which Tinkerbell’s services will be available to other machines. It will be used in DHCP responses for the next-server option, in download URLs for iPXE scripts, and so forth. It will also be set as the loadBalancerIP in the Service manifest created by the chart. In my case, this is a VIP controlled by a kube-vip deployment I will go into more detail on later. The trustedProxies entry is just the Pod CIDR of my k3s cluster. The artifactsFileServer is the address for the HookOS artifacts, in this case the kernel and initrd. The Tinkerbell chart sets up a small Nginx deployment for this and automatically downloads the newest HookOS artifacts to it; this is configured under optional.hookos. I’m also disabling a few things because I don’t intend to use them. One of those is leader election for the Tink controller - as I will only run a single instance, it seems unnecessary. I disable Rufio and SecondStar as well. Rufio is a component for talking to the baseboard management controllers usually found on enterprise equipment. As I don’t have any such gear, it’s unnecessary. Finally, SecondStar is a serial-over-SSH service I also don’t need.
The dhcpMode of Smee, the DHCP and general netboot component of Tinkerbell, is more interesting. DHCP servers, especially those providing netboot options, sometimes need to coexist: one DHCP server does the general IP management, handing out dynamic and static IPs as well as options like NTP and DNS servers, while a second DHCP server only sends out the information necessary for PXE boot. Most normal DHCP servers can do the netboot part as well - I’m currently using Dnsmasq to boot my diskless machines, for example, while normal IP address management is done by the ISC DHCP server running on my OPNsense router.
Smee supports similar modes. It can either do all of the DHCP in one, handing out IPs and netboot information, or only hand out netboot info, or even do nothing with DHCP at all and only serve iPXE binaries and scripts. The different running modes are described in more detail here.
I’m using the proxy mode because I’ve already got a DHCP server handling address management, although I might change that for the actual production deployment. This is because I have to set the machine’s static IP in the Hardware manifest anyway, as I will explain later. And I just like the fact that static IPs would then finally be under version control. Right now, they’re just configured in the OPNsense UI.
The logLevel option is more important than it seems. Without it, Tinkerbell will keep a number of low-priority errors and warnings to itself. These are the kind of “error” which might appear during normal operation, like DHCP packets arriving for hosts which Tinkerbell doesn’t know about. But for me, having those hidden made debugging my setup quite a bit more difficult. I will talk about that in the next section.
I’m also disabling the kube-vip service that the chart can deploy, instead deploying a separate instance myself to have more control over it.
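With the values file written, installing the chart is a single Helm command. A minimal sketch - the chart reference and namespace here are assumptions, the chart’s README has the authoritative install instructions:
helm upgrade --install tinkerbell oci://ghcr.io/tinkerbell/charts/tinkerbell \
  --namespace tinkerbell --create-namespace \
  --values values.yaml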
Configuring Tinkerbell
The goal of my first tests was to get a feel for how Tinkerbell ticks. So I didn’t start out with trying to install an OS, but just wanted to see how the netboot and the Tinkerbell manifests work.
Before launching the VM, I created a couple of manifests for Tinkerbell. The core of Tinkerbell is the Workflow. It connects a Template containing actions to be executed with a Hardware representing a host. Here is my initial configuration:
apiVersion: tinkerbell.org/v1alpha1
kind: Hardware
metadata:
  name: test-vm
spec:
  disks:
    - device: /dev/sda
  interfaces:
    - dhcp:
        arch: x86_64
        hostname: test-vm
        mac: 10:66:6a:07:8d:0d
        name_servers:
          - 203.0.113.250
        uefi: true
      netboot:
        allowPXE: true
        allowWorkflow: true
---
apiVersion: tinkerbell.org/v1alpha1
kind: Template
metadata:
  name: test-template
spec:
  data: |
    name: test-template
    version: "0.1"
    global_timeout: 600
    tasks:
      - name: "os installation"
        worker: "{{`{{.machine_mac}}`}}"
        volumes:
          - /dev:/dev
          - /dev/console:/dev/console
        actions:
          - name: "echome"
            image: ghcr.io/jacobweinstock/waitdaemon:latest
            timeout: 600
            pid: host
            command:
              - echo "Hello, this is {{ .machine_mac }}"
              - echo "Ending script here"
            environment:
              IMAGE: alpine
              WAIT_SECONDS: 60
            volumes:
              - /var/run/docker.sock:/var/run/docker.sock
---
apiVersion: "tinkerbell.org/v1alpha1"
kind: Workflow
metadata:
  name: test-workflow
spec:
  templateRef: test-template
  hardwareRef: test-vm
  hardwareMap:
    machine_mac: 10:66:6a:07:8d:0d
Let’s start with the Hardware manifest. It defines both the characteristics of the machine and the configuration for it. It controls the DHCP as well as the netboot options, including whether the machine gets to PXE boot and whether it gets to run workflows. The Hardware object is documented in more detail here. The Hardware manifest has a lot more options, but for my tests, only these were relevant.
Next is the Template. This specifies the actions to be executed. In this particular example, I’m only running a few simple echo commands, as I was mostly interested in how the netboot works. These Templates are not supposed to be machine-specific, but are instead intended to be used by multiple workflows.
And finally, there’s the Workflow itself. It specifies a Hardware, meaning a host, and a Template to apply to that host.
The hardwareMap is a map of values to be made available in Templates - see my use of machine_mac in the Template to set the worker ID. One downside of Tinkerbell at the moment is that only the spec.disks value from the Hardware is available in Templates, but none of the other fields. That’s why I also had to add the machine_mac to the Workflow’s hardwareMap, instead of taking the value from the spec.interfaces[].dhcp section.
To summarize what this configuration is supposed to achieve: when Tinkerbell receives a DHCP request from a machine with the MAC address 10:66:6a:07:8d:0d, it will send it some netboot information, namely itself as the next-server option and an iPXE binary. That binary, once executed by the netbooting host, will fetch an iPXE script, again from Tinkerbell. That script will then download the kernel and initrd for HookOS from Tinkerbell’s Nginx deployment. Once those have booted, they will launch the Tink worker in Docker, which requests a workflow from Tinkerbell. It will get the echome action delivered and execute it. Right now, that only runs a couple of echo commands.
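Applying these manifests and watching the Workflow is plain kubectl. A small sketch - the namespace and file names are assumptions:
kubectl apply -n tinkerbell -f hardware.yaml -f template.yaml -f workflow.yaml
kubectl get workflow -n tinkerbell test-workflow --watch
The Workflow’s state column should move from pending through running to success as the machine netboots and the Tink worker reports back.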
But that did not work out as expected, at least initially.
DHCP problems
For my testing, I needed another VM. And it couldn’t use a normal image, because I wanted to ultimately install a fresh OS on it. Luckily, Incus supports the --empty parameter to create a VM with a root disk, but without writing an image to it. I created my test VM like this:
incus init test-vm --empty --vm -c limits.cpu=4 -c limits.memory=4GiB --profile base --profile disk-vms -d network,hwaddr="10:66:6a:07:8d:0d"
This command creates a VM with an empty 20 GB root disk. The VM also gets 4 GiB of RAM and 4 CPU cores. I’m also hardcoding the MAC address of the NIC. This was a later addition: I deleted the VM multiple times during testing, and having it get a new MAC on every recreation got annoying, because I had to update the static DHCP lease and the Tinkerbell config each time.
Then I launched the VM and saw - nothing. It tried to PXE boot, but did not get any netboot info, so I got dropped into a UEFI shell. I looked over my configuration, but couldn’t find anything wrong. So I ran a quick test to see whether hitting port 67 even made it into the Tinkerbell Pod:
echo "foo" | nc -u 203.0.113.200 67
And indeed, the packet seemed to reach Tinkerbell, as I saw this in the logs:
{"time":"2025-06-01T20:48:36.172709819Z","level":"0","caller":"smee/internal/dhcp/server/dhcp.go:62","msg":"error parsing DHCPv4 request","service":"smee","err":"buffer too short at position 4: have 0 bytes, want 4 bytes"}
I wasn’t sending a real DHCP message, so it was understandable that Tinkerbell didn’t know what to do with it. So in principle, the ServiceLB of k3s was working - but the actual DHCP packets from the test VM apparently weren’t getting through. Next, I ran tcpdump on the VM running Tinkerbell to see whether the DHCP packets even made it to the machine itself:
tcpdump: verbose output suppressed, use -v[v]... for full protocol decode
listening on enp5s0, link-type EN10MB (Ethernet), snapshot length 262144 bytes
23:02:42.984176 IP 0.0.0.0.bootpc > 255.255.255.255.bootps: BOOTP/DHCP, Request from 10:66:6a:07:8d:0d (oui Unknown), length 253
E...V...@.#..........D.C...4.....3.......................fj...................................................................................................................................................................................
..........................c.Sc5..9...7.....
23:02:42.984524 IP _gateway.bootps > 255.255.255.255.bootpc: BOOTP/DHCP, Reply, length 300
E..H.......B
V.......C.D.4.......3..........
V...........fj.............................................................................................................................................................................................................c.Sc5..6.
V..3....T........
V....
V.............................
23:02:46.363155 IP 0.0.0.0.bootpc > 255.255.255.255.bootps: BOOTP/DHCP, Request from 10:66:6a:07:8d:0d (oui Unknown), length 265
E..%V...@.#..........D.C...l.....3.......................fj...................................................................................................................................................................................
..........................c.Sc5..6.
V..2.
V..9...7.....
23:02:46.363507 IP _gateway.bootps > 255.255.255.255.bootpc: BOOTP/DHCP, Reply, length 300
E..H.......B
V.......C.D.4.......3..........
V...........fj.............................................................................................................................................................................................................c.Sc5..6.
V..3....P........
V....
V.............................
4 packets captured
4 packets received by filter
0 packets dropped by kernel
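The capture above showed the requests arriving on the node itself, so the next step was to capture inside the Tinkerbell Pod’s network namespace. On k3s, that can be done roughly like this - the container name is an assumption, and jq needs to be installed on the node:
# find the PID of Tinkerbell's container via k3s' bundled crictl
PID=$(k3s crictl inspect $(k3s crictl ps --name tinkerbell -q) | jq .info.pid)
# capture DHCP traffic inside the Pod's network namespace
nsenter -t "$PID" -n tcpdump -ni any port 67 or port 68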
That capture showed no DHCP packets arriving inside the Pod, even though they clearly reached the machine and the right interface. So I dug a bit deeper into k3s’ ServiceLB and what it actually does, and found this output in the logs:
kmaster logs -n kube-system svclb-tinkerbell-01c2218a-p69fs -c lb-udp-67
+ trap exit TERM INT
+ BIN_DIR=/usr/sbin
+ check_iptables_mode
+ set +e
+ lsmod
+ grep -qF nf_tables
+ '[' 0 '=' 0 ]
+ mode=nft
+ set -e
+ info 'nft mode detected'
+ set_nft
+ ln -sf xtables-nft-multi /usr/sbin/iptables
[INFO] nft mode detected
+ ln -sf xtables-nft-multi /usr/sbin/iptables-save
+ ln -sf xtables-nft-multi /usr/sbin/iptables-restore
+ ln -sf xtables-nft-multi /usr/sbin/ip6tables
+ start_proxy
+ echo 0.0.0.0/0
+ grep -Eq :
+ iptables -t filter -I FORWARD -s 0.0.0.0/0 -p UDP --dport 32562 -j ACCEPT
+ echo 203.0.113.200
+ grep -Eq :
+ cat /proc/sys/net/ipv4/ip_forward
+ '[' 1 '==' 1 ]
+ iptables -t filter -A FORWARD -d 203.0.113.200/32 -p UDP --dport 32562 -j DROP
+ iptables -t nat -I PREROUTING -p UDP --dport 67 -j DNAT --to 203.0.113.200:32562
+ iptables -t nat -I POSTROUTING -d 203.0.113.200/32 -p UDP -j MASQUERADE
+ '[' '!' -e /pause ]
+ mkfifo /pause
What I thought I could read out of that setup was that only packets directed at the exact IP of the host, 203.0.113.200, would be forwarded to the Tinkerbell Pod. But the initial DHCP discovery packets are of course sent to the broadcast address, as can be seen in the tcpdump output above. And so I thought that these packets were simply getting dropped, because they were not addressed to the unicast address of the host. But I’m no longer 100% sure about that, because in later testing, with kube-vip as the LoadBalancer instead of ServiceLB, I got a similar result - no reaction from Tinkerbell in the logs. But: I then figured out that I had the log level set too low.
But at this point, I still thought that ServiceLB was the problem. So I decided to disable it and instead deploy kube-vip. I’ve already got experience with it, as I’m using it as the VIP provider for the k8s API in my main cluster.
I deployed kube-vip with this DaemonSet:
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: kube-vip
spec:
  selector:
    matchLabels:
      name: kube-vip
  template:
    metadata:
      labels:
        name: kube-vip
    spec:
      hostNetwork: true
      serviceAccountName: kube-vip
      containers:
        - name: kube-vip
          image: ghcr.io/kube-vip/kube-vip:v0.9.1
          imagePullPolicy: IfNotPresent
          args:
            - manager
          env:
            - name: svc_enable
              value: "true"
            - name: vip_arp
              value: "true"
            - name: vip_leaderelection
              value: "false"
            - name: svc_election
              value: "false"
          securityContext:
            capabilities:
              add:
                - NET_ADMIN
                - NET_RAW
                - SYS_TIME
With this config, kube-vip will watch for LoadBalancer Services and announce their IP via ARP. I’ve disabled all leader elections, as this k3s cluster will only ever have a single node.
Kube-vip does not have any IPAM functionality; it relies either on annotations on the Service or on the loadBalancerIP setting. The Tinkerbell chart already sets the loadBalancerIP to the publicIP value from the values.yaml file, so I just relied on that.
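To check that kube-vip actually picked up the Service, it’s enough to look at the Service and probe the VIP from another machine on the LAN - the namespace is an assumption:
# the EXTERNAL-IP column should show the configured publicIP
kubectl get svc -n tinkerbell
# from another host on the same L2 segment: the VIP should answer with the k3s node's MAC
arping -c 3 203.0.113.200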
But that did not seem to fix my problem. There still wasn’t any reaction from Tinkerbell to the DHCP requests. That was when I finally realized that I had never increased Tinkerbell’s log level. 🤦 Once I did, I finally got some results:
{
  "time":"2025-06-07T22:04:38.322503545Z",
  "level":"-1",
  "caller":"smee/internal/dhcp/handler/proxy/proxy.go:211",
  "msg":"Ignoring packet",
  "service":"smee",
  "mac":"10:66:6a:07:8d:0d",
  "xid":"0xfd39e0af",
  "interface":"macvlan0",
  "error":"failed to convert hardware to DHCP data: no IP data"
}
I didn’t dig deeper into that error right away, but I did create this issue, requesting that the above message be logged at a higher priority so it appears with the standard logging setting. But it turned out that I had actually run into a bug: my Hardware manifest was okay, but Tinkerbell erroneously required some IP configuration. This has since been fixed.
First successful boot
And with that fix, I finally got my first successful netboot:
Screenshot of my first successful HookOS network boot.
So that was pretty nice to see. But there was something even better going on in the background. First of all, the two echo commands I had configured as tasks did run on boot. But the cool thing was how I was able to verify that: it turns out that Tinkerbell runs a syslog server and configures the in-memory HookOS to forward its logs there, and Tinkerbell then spits them out in its own logs. This is a really nice and convenient feature for seeing what’s happening on the remote machine.
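In practice, that means the remote machine’s output shows up right next to Tinkerbell’s own log lines, so following a provisioning run is just a matter of tailing the Pod logs - the deployment name and the exact log fields here are assumptions:
# follow Tinkerbell's logs and pick out the syslog messages forwarded by HookOS
kubectl logs -n tinkerbell deploy/tinkerbell -f | grep '"service":"syslog"'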
Side Quest: Generating an Ubuntu image
The obvious next step was to install an entire OS instead of just outputting some text. But for that, I first needed a new image. My current image pipeline produces individual images for each host, which is clumsy and should be unnecessary. Something like cloud-init should be able to do all of the initial setup I need to prepare for Ansible management. I did not want to just use Ubuntu’s cloud images, and instead create my own.
Initially, I looked at ubuntu-image. That’s the tool Canonical uses to produce the official Ubuntu images. But it went a bit too deep for me, and I wasn’t able to really grok how it worked. In addition, while the current image is for an x86 VM with a local disk, I would also need images for Raspberry Pis without any local storage. And those would definitely need some adaptations, as they need a special initramfs. It didn’t look like that would be easily possible with ubuntu-image, so I would have to use Packer/Ansible for those. In the end, I would have had different tools for different images, which I didn’t really like.
So I decided to stay with my Packer approach. One problem with my previous approach was that it rebooted the freshly installed image and ran Ansible against it. When using cloud-init, that reboot counts as the first boot, so the first boot after actually deploying the image to a machine would not run cloud-init again - but it should. So I looked for a way to disable provisioning entirely, and found it in this issue.
My HashiCorp Packer file looks like this:
locals {
  ubuntu-major = "24.04"
  ubuntu-minor = "2"
  ubuntu-arch  = "amd64"
  out_dir      = "ubuntu-base"
}
local "img-name" {
  expression = "ubuntu-base-${local.ubuntu-major}.${local.ubuntu-minor}-${local.ubuntu-arch}"
}
local "s3-access" {
  expression = vault("secret/s3-creds", "access")
  sensitive  = true
}
local "s3-secret" {
  expression = vault("secret/s3-creds", "secret")
  sensitive  = true
}
source "qemu" "ubuntu-base" {
  iso_url          = "https://releases.ubuntu.com/${local.ubuntu-major}/ubuntu-${local.ubuntu-major}.${local.ubuntu-minor}-live-server-${local.ubuntu-arch}.iso"
  iso_checksum     = "sha256:d6dab0c3a657988501b4bd76f1297c053df710e06e0c3aece60dead24f270b4d"
  output_directory = "ubuntu-base"
  shutdown_command = ""
  shutdown_timeout = "1h"
  disk_size        = "8G"
  cpus             = 6
  memory           = "4096"
  format           = "raw"
  accelerator      = "kvm"
  firmware         = "/usr/share/edk2-ovmf/OVMF_CODE.fd"
  net_device       = "virtio-net"
  disk_interface   = "virtio"
  communicator     = "none"
  vm_name          = "${local.img-name}"
  http_content = {
    "/user-data" = file("${path.root}/files/ubuntu-base-autoinstall")
    "/meta-data" = ""
  }
  boot_command = ["<wait>e<wait5>", "<down><wait><down><wait><down><wait2><end><wait5>", "<bs><bs><bs><bs><wait>autoinstall ds=nocloud-net\\;s=http://{{ .HTTPIP }}:{{ .HTTPPort }}/ ---<wait><f10>"]
}
build {
  name    = "ubuntu-base-${local.ubuntu-major}.${local.ubuntu-minor}-${local.ubuntu-arch}"
  sources = ["source.qemu.ubuntu-base"]
  post-processor "shell-local" {
    script = "${path.root}/scripts/s3-upload.sh"
    environment_vars = [
      "OUT_DIR=${abspath(local.out_dir)}",
      "OUT_NAME=${local.img-name}",
      "RCLONE_CONFIG_CEPHS3_PROVIDER=Ceph",
      "RCLONE_CONFIG_CEPHS3_TYPE=s3",
      "RCLONE_CONFIG_CEPHS3_ACCESS_KEY_ID=${local.s3-access}",
      "RCLONE_CONFIG_CEPHS3_SECRET_ACCESS_KEY=${local.s3-secret}",
      "RCLONE_CONFIG_CEPHS3_ENDPOINT=https://s3.example.com"
    ]
  }
}
This Packer file starts out by downloading the current Ubuntu 24.04.2 Server LTS install image. It then uses Packer’s Qemu plugin to launch a VM on the machine where the Packer build is executed.
The way the automation works is always pretty funny to me. See the boot_command parameter above: Packer just takes control of the keyboard and types in what you would type to run an Ubuntu autoinstall. The small HTTP server used to supply the user-data is automatically started by Packer and made available to the VM. This file uses Ubuntu’s autoinstall to automate the installation:
#cloud-config
autoinstall:
  version: 1
  identity:
    hostname: "ubuntu-base"
    password: "$6$exDY1mhS4KUYCE/2$zmn9ToZwTKLhCw.b4/b.ZRTIZM30JZ4QrOQ2aOXJ8yk96xpcCof0kxKwuX1kqLG/ygbJ1f8wxED22bTL4F46P0"
    username: ubuntu
  locale: en_US.UTF-8
  source:
    id: ubuntu-server-minimal
  storage:
    layout:
      name: direct
  ssh:
    install-server: true
  late-commands:
    - echo 'ubuntu ALL=(ALL) NOPASSWD:ALL' > /target/etc/sudoers.d/sysuser
  shutdown: poweroff
Not that much configuration is necessary here. I create the ubuntu user just as an escape hatch, so that when something goes wrong with later provisioning steps, I still have a way to get into the machine. It’s removed in the first steps of my Homelab Ansible playbook.
As I’ve noted above, I don’t need any additional customization here, the plan was to create a really generic and small image I could then customize once it was installed on a machine.
The last interesting part is the post-processor in the Packer file. Here, I wrote a little script that uploads the finished image to my S3 storage, so Tinkerbell has a place to install it from. This is what the s3-upload.sh script looks like:
#!/bin/sh
# abort early if rclone is missing
if ! command -v rclone > /dev/null; then
  echo "Command rclone not found, aborting."
  exit 1
fi
# Packer's shell-local post-processor passes the image location via OUT_DIR and OUT_NAME
image="${OUT_DIR}/${OUT_NAME}"
if [ ! -f "${image}" ]; then
  echo "Could not find image '${image}', aborting."
  exit 1
fi
echo "Copying ${image}..."
env
rclone copy "${image}" cephs3:public/images/ || exit 1
exit 0
It uses rclone to upload the image file to S3. One advantage of starting out with a generic image is that it doesn’t contain any secrets or credentials, so there’s no problem with putting it on an (internally) public S3 bucket.
The credentials for the S3 upload are taken from Vault via Packer’s Vault integration, in the s3-access and s3-secret variables at the beginning of the Packer file.
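For completeness, this is roughly how such a build is kicked off - Packer’s vault() function reads the standard Vault environment variables, and the Vault address and file name here are assumptions:
# credentials for the vault() calls in the Packer file
export VAULT_ADDR="https://vault.example.com"
export VAULT_TOKEN="$(cat ~/.vault-token)"
packer init .
packer build ubuntu-base.pkr.hcl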
Provisioning the VM via Tinkerbell
And now finally, I was ready to fully provision a VM with Tinkerbell. This requires an update of the Tinkerbell Template, which now looks like this:
apiVersion: tinkerbell.org/v1alpha1
kind: Template
metadata:
  name: test-template
spec:
  data: |
    name: test-template
    version: "0.1"
    global_timeout: 600
    tasks:
      - name: "os installation"
        worker: "{{`{{.machine_mac}}`}}"
        volumes:
          - /dev:/dev
          - /dev/console:/dev/console
        actions:
          - name: "install ubuntu"
            image: quay.io/tinkerbell/actions/image2disk:latest
            timeout: 900
            environment:
              IMG_URL: {{ .Values.images.ubuntuBaseAmd64 }}
              DEST_DISK: /dev/sda
              COMPRESSED: false
And that just worked, right out of the box. The Tinkerbell image2disk action downloaded the image from S3 and automatically put it onto the VM’s local disk. And just like that, I had a fully deployed VM, provisioned via Tinkerbell. 🎉
But not so fast. Of course, the first thing missing here was a proper cloud-init config to set up my standard Ansible user so I could run my standard playbook.
Cloud-init can download configuration for the initial boot from a cloud provider, codified in the user-data and vendor-data. It runs in several phases during boot: first, before the network is available, from local config files, and then, afterwards, from user-data provided e.g. by the cloud provider via an HTTP server. The user-data and vendor-data can also be provided entirely from local files. There’s a wide range of configuration that can be done via cloud-init, from creating local users and installing packages to configuring mounts and networking.
To supply this cloud-init data, Tinkerbell has the Tootles component. It implements AWS’ EC2 metadata service API, which is also supported by cloud-init. The metadata reported by Tootles for any given instance is supplied via the Hardware object:
apiVersion: tinkerbell.org/v1alpha1
kind: Hardware
metadata:
  name: test-vm
spec:
  metadata:
    instance:
      id: 10:66:6a:5a:91:8c
      ips:
        - address: 203.0.113.20
      allow_pxe: true
      hostname: test-vm
      operating_system:
        distro: "ubuntu"
        version: "24.04"
  disks:
    - device: /dev/sda
  interfaces:
    - dhcp:
        arch: x86_64
        hostname: test-vm
        mac: 10:66:6a:5a:91:8c
        ip:
          address: 203.0.113.20
          netmask: 255.255.255.0
        name_servers:
          - 10.86.25.254
        uefi: true
      netboot:
        allowPXE: true
        allowWorkflow: true
  userData: |
    #cloud-config
    packages:
      - openssh-server
      - python3
      - sudo
    ssh_pwauth: false
    disable_root: true
    allow_public_ssh_keys: false
    timezone: "Europe/Berlin"
    users:
      - name: ansible-user
        shell: /bin/bash
        ssh_authorized_keys:
          - from="192.0.2.100" ssh-ed25519 AAAAC3NzaC1lZDI1NTE5AAAAIOaxn8l16GNyBEgYzWO0BAko9fw8kkIq9tbels3hXdUt user@foo
        sudo: ALL=(ALL:ALL) ALL
    runcmd:
      - systemctl enable ssh.service
      - systemctl start ssh.service
    power_state:
      delay: 2
      timeout: 2
      mode: reboot
The first change necessary here is the added spec.interfaces[].dhcp.ip section. This is one of the suboptimal pieces of Tinkerbell. I’m not actually having Tinkerbell do the IPAM part of DHCP; that’s still left to my OPNsense router. But I still needed to specify the VM’s IP here, because the EC2 API, and thus Tootles, determines which metadata to return by the IP the request is coming from. So if you just request /2009-04-04/meta-data from any host, you won’t get a response - the request needs to come from an IP that has a matching Hardware object. Another downside is that the spec.metadata section needs to be defined manually; it’s not automatically derived from the rest of the Hardware object.
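This is easy to verify with curl; Tootles listens on port 7172 (the same port used in the cloud-init config further down) and only answers requests coming from an IP that matches a Hardware object:
# only works from a source IP that matches a Hardware object
curl http://203.0.113.200:7172/2009-04-04/meta-data/
curl http://203.0.113.200:7172/2009-04-04/user-data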
Then we come to the actually interesting part, the spec.userData. This is the cloud-init config returned to the machine upon request. As I’ve noted above, the main goal here is to configure the new machine so I can run my main Ansible playbook on it. I’m making sure that my Ansible user exists, has my SSH key and is in the sudoers file. In addition, I’m making sure that SSH is enabled and started, and then finally reboot the machine. The #cloud-config comment at the top is load-bearing, by the way - without it, cloud-init won’t accept the configuration.
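Because cloud-init silently ignores configuration it doesn’t accept, it’s worth validating the user-data before embedding it in the Hardware object. Newer cloud-init releases ship a schema validator; the file name here is an assumption:
# validate the user-data against cloud-init's schema before putting it into the manifest
cloud-init schema --config-file user-data.yaml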
So far so good, but this configuration still did not work. The central issue was that the machine did not end up with a working network configuration: ip addr showed the Ethernet interface being down. This confused me, because cloud-init’s documentation clearly states that, when no explicit network config is given, a default using DHCP on all interfaces is applied.
So I went searching. And that wasn’t easy, because it turns out that Ubuntu’s server-minimal install is so minimal that it even eschews vi and nano, so I had to look at files with cat. But I was finally able to find what I was looking for. In /etc/netplan/50-cloud-init.yaml, I found this:
network:
  version: 2
  ethernets:
    ens3:
      dhcp4: true
That file was created by the installer during the Packer install run. But of course, the NIC had a different name in that environment than it has on the final VM. To remedy this, I added another task to the Tinkerbell Template, removing the cloud-init config created by the installer so that the defaults apply:
- name: "remove installer network config"
image: quay.io/tinkerbell/actions/writefile:latest
timeout: 90
environment:
DEST_DISK: {{ `{{ formatPartition ( index .Hardware.Disks 0 ) 2 }}` }}
FS_TYPE: ext4
DEST_PATH: /etc/cloud/cloud.cfg.d/90-installer-network.cfg
UID: 0
GID: 0
MODE: 0600
DIRMODE: 0700
CONTENTS: |
# Removed during provisioning
This task is executed after the image has been dd’d onto the disk; it mounts the root partition and overwrites the file’s contents with just a comment.
But even after that, my cloud-init user config was still not being applied. After some more searching, I found the file /run/cloud-init/cloud-init-generator.log with the following content:
ds-identify rc=1
cloud-init is enabled but no datasource found, disabling
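For anyone debugging the same thing, cloud-init leaves a useful trail on the installed system:
# overall cloud-init state and the detected (or missing) datasource
cloud-init status --long
# why ds-identify did or did not pick a datasource
cat /run/cloud-init/ds-identify.log
cat /run/cloud-init/cloud-init-generator.log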
I could have avoided this problem by following Tinkerbell’s cloud-init docs. There, the example contains two more tasks:
- name: "add cloud-init config"
image: quay.io/tinkerbell/actions/writefile:latest
timeout: 90
environment:
DEST_DISK: {{ `{{ formatPartition ( index .Hardware.Disks 0 ) 2 }}` }}
DEST_PATH: /etc/cloud/cloud.cfg.d/10_tinkerbell.cfg
DIRMODE: "0700"
FS_TYPE: ext4
GID: "0"
MODE: "0600"
UID: "0"
CONTENTS: |
datasource:
Ec2:
metadata_urls: ["http://203.0.113.200:7172"]
strict_id: false
manage_etc_hosts: localhost
warnings:
dsid_missing_source: off
- name: "add cloud-init ds-identity"
image: quay.io/tinkerbell/actions/writefile:latest
timeout: 90
environment:
DEST_DISK: {{ `{{ formatPartition ( index .Hardware.Disks 0 ) 2 }}` }}
FS_TYPE: ext4
DEST_PATH: /etc/cloud/ds-identify.cfg
UID: 0
GID: 0
MODE: 0600
DIRMODE: 0700
CONTENTS: |
datasource: Ec2
The first task adds some basic cloud-init configuration, most importantly the URL of the metadata service. For most cloud providers, this is an IP hardcoded across their entire cloud, but here it is Tinkerbell’s public IP as configured in the Helm chart’s values.yaml. The other important setting is hardcoding the data source to Ec2, because cloud-init’s default search mechanism checks the aforementioned well-known IP, where it won’t find any metadata service in my Homelab.
With all of this configuration done, I was able to delete the VM one last time, reset the Workflow object of Tinkerbell, and recreate the VM. After a couple of minutes, I was greeted with a fully functional VM, ready for Ansible, with no further manual intervention from my side.
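For reference, each retry of the full provisioning loop ended up looking roughly like this - one way to do it, with the namespace and file name being assumptions:
# throw away the old VM and its disk
incus delete -f test-vm
# reset the Workflow by deleting and re-applying it
kubectl delete -n tinkerbell workflow test-workflow
kubectl apply -n tinkerbell -f workflow.yaml
# recreate the empty VM with the fixed MAC address and boot it
incus init test-vm --empty --vm -c limits.cpu=4 -c limits.memory=4GiB --profile base --profile disk-vms -d network,hwaddr="10:66:6a:07:8d:0d"
incus start test-vm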
Final thoughts
I really like what I’ve seen from Tinkerbell so far. I also like how well cloud-init works. Even if I don’t end up deploying Tinkerbell, I will likely change my new host setup to use a generic image and then do the customization with cloud-init.
The next steps will be the more complicated ones. There are two basic things I will need to figure out. First, how to boot Raspberry Pi 4 and 5 into iPXE so I can use Tinkerbell for provisioning them. From some initial research, it looks like that should be possible. The bigger issue might be diskless hosts. Sure, I can set up iPXE and provisioning - but the problem is then how to tell them to boot into their own system, instead of Tinkerbell’s provisioning, once they’ve been properly set up.
Let’s see how those next experiments turn out.