In this post, I will describe how I deployed Tinkerbell into my k3s cluster and provisioned the first Ubuntu VM with it.

This is part 3 of my Tinkerbell series.

Deploying Tinkerbell

The first step is to deploy Tinkerbell into the k3s cluster I set up in the previous post. For this, I used the official Helm chart, which can be found here.

My values.yaml file looks like this:

publicIP: "203.0.113.200"
trustedProxies:
  - "10.42.0.0/24"
artifactsFileServer: "http://203.0.113.200:7173"
deployment:
  envs:
    tinkController:
      enableLeaderElection: false
    smee:
      dhcpMode: "proxy"
    globals:
      enableRufioController: false
      enableSecondstar: false
      logLevel: 3
  init:
    enabled: true
service:
  lbClass: ""
optional:
  hookos:
    service:
      lbClass: ""
    kernelVersion: "both"
    persistence:
      existingClaim: "hookos-volume"
  kubevip:
    enabled: false

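With that values.yaml in place, the installation itself is a single Helm command. The chart reference below is from memory and the namespace is my choice, so double-check both against the chart's documentation:

helm upgrade --install tinkerbell oci://ghcr.io/tinkerbell/charts/tinkerbell \
  --namespace tinkerbell --create-namespace \
  --values values.yaml
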
The first setting, publicIP, is the public IP under which Tinkerbell’s services will be available to other machines. It is used in DHCP responses as the next-server address, in download URLs for iPXE scripts, and so forth. It is also set as the loadBalancerIP in the Service manifest created by the chart. In my case, this is a VIP controlled by a kube-vip deployment I will go into more detail on later. The trustedProxies entry is just the Pod CIDR of my k3s cluster. The artifactsFileServer is the address for the HookOS artifacts, in this case the kernel and initrd. The Tinkerbell chart sets up a small Nginx deployment for this and automatically downloads the newest HookOS artifacts to it. This is configured under optional.hookos.

I’m also disabling a few things because I don’t intend to use them. One of those is leader election for Tinkerbell - as I will only run a single instance, it seems unnecessary. I disable Rufio and SecondStar as well. Rufio is a component for talking to the baseboard management controllers usually found on enterprise equipment. As I don’t have any such gear, it’s unnecessary. Finally, SecondStar is a serial-over-SSH service I also don’t need.

The dhcpMode of Smee, the DHCP and general netboot component of Tinkerbell, is more interesting. DHCP servers, especially those providing netboot options, sometimes need to coexist: one DHCP server does the general IP management, handing out dynamic and static IPs as well as things like NTP and DNS servers, while a second DHCP server only sends out the DHCP information necessary for PXE boot. Most general-purpose DHCP servers can play that second role as well - I’m currently using Dnsmasq to netboot my diskless machines, for example, while normal IP address management is done by the ISC DHCP server running on my OPNsense router. Smee supports similar modes. It can either do all of the DHCP in one, handing out IPs and netboot information, or only hand out netboot info, or not touch DHCP at all and only serve iPXE binaries and scripts. The different running modes are described in more detail here. I’m using the proxy mode because I’ve already got a DHCP server handling address management, although I might change that for the actual production deployment. This is because I have to set the machine’s static IP in the Hardware manifest anyway, as I will explain later. And I just like the fact that static IPs would then finally be under version control. Right now, they’re just configured in the OPNsense UI.

The logLevel option is more important than it seems. Without it, Tinkerbell keeps a number of low-priority errors and warnings to itself. These are the kind of “error” that can appear during normal operation, like DHCP packets arriving for hosts Tinkerbell doesn’t know about. But for me, hiding them made debugging my setup a bit more difficult. I will talk about that in the next section.
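
With logLevel raised, watching what Smee decides to do with incoming DHCP requests is just a matter of tailing the logs. The release and namespace names below are assumptions based on my install, so adjust them to whatever Helm created:

kubectl logs -n tinkerbell deploy/tinkerbell -f | jq .   # jq just pretty-prints the JSON log lines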

I’m also disabling the kube-vip instance that the chart can deploy, and instead deploying a separate one to have more control over it.

Configuring Tinkerbell

The goal of my first tests was to get a feel for how Tinkerbell ticks. So I didn’t start out with trying to install an OS, but just wanted to see how the netboot and the Tinkerbell manifests work.

Before launching the VM, I created a couple of manifests for Tinkerbell. The core of Tinkerbell is the Workflow. It connects a Template containing actions to be executed with a Hardware representing a host. Here is my initial configuration:

apiVersion: tinkerbell.org/v1alpha1
kind: Hardware
metadata:
  name: test-vm
spec:
  disks:
  - device: /dev/sda
  interfaces:
  - dhcp:
      arch: x86_64
      hostname: test-vm
      mac: 10:66:6a:07:8d:0d
      name_servers:
      - 203.0.113.250
      uefi: true
    netboot:
      allowPXE: true
      allowWorkflow: true
---
apiVersion: tinkerbell.org/v1alpha1
kind: Template
metadata:
  name: test-template
spec:
  data: |
    name: test-template
    version: "0.1"
    global_timeout: 600
    tasks:
      - name: "os installation"
        worker: "{{`{{.machine_mac}}`}}"
        volumes:
          - /dev:/dev
          - /dev/console:/dev/console
        actions:
          - name: "echome"
            image: ghcr.io/jacobweinstock/waitdaemon:latest
            timeout: 600
            pid: host
            command:
              - echo "Hello, this is {{`{{.machine_mac}}`}}"
              - echo "Ending script here"
            environment:
              IMAGE: alpine
              WAIT_SECONDS: 60
            volumes:
              - /var/run/docker.sock:/var/run/docker.sock    
---
apiVersion: "tinkerbell.org/v1alpha1"
kind: Workflow
metadata:
  name: test-workflow
spec:
  templateRef: test-template
  hardwareRef: test-vm
  hardwareMap:
    machine_mac: 10:66:6a:07:8d:0d

Let’s start with the Hardware manifest. It defines both the characteristics of the machine and its configuration. This covers the DHCP as well as the netboot options, including whether the machine gets to PXE boot and whether it gets to run workflows. The Hardware object is documented in more detail here. It has a lot more options, but for my tests, only these were relevant.

Next is the Template. This specifies the actions to be executed. In this particular example, I’m only running a few simple echo commands, as I was mostly interested in how the netboot works. Templates are not supposed to be machine-specific; instead, they are intended to be reused by multiple Workflows.

And finally, there’s the Workflow itself. It specifies a Hardware, meaning a host, and a Template to apply to that host. The hardwareMap is a map of values to be made available in Templates - see my use of machine_mac in the Template to set the worker ID. One downside of Tinkerbell at the moment is that only the spec.disks value from the Hardware is available in Templates, but none of the other fields. That’s why I had to add machine_mac to the Workflow’s hardwareMap instead of taking the value from spec.interfaces[].dhcp.
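
Applying these is plain kubectl, and the Workflow object reports its state, which is handy for watching progress. The file names here are made up, and you may need to add -n for whatever namespace your Tinkerbell stack watches:

kubectl apply -f hardware.yaml -f template.yaml -f workflow.yaml
kubectl get workflows.tinkerbell.org test-workflow -o wide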

To summarize what this configuration is supposed to achieve: When Tinkerbell receives a DHCP request from a machine with the MAC address 10:66:6a:07:8d:0d, it will send it some netboot information, namely itself as the next server option and an iPXE binary. That binary will fetch an iPXE script when executed by the netbooting host, again from Tinkerbell. That script will then download the kernel and initrd for the HookOS from Tinkerbell’s Nginx deployment. When those are booted up, they will launch the Tink worker in Docker and request a workflow from Tinkerbell. It will get the echome action delivered and execute that. Right now, that only runs a couple of echo commands.

But that did not work out as expected, at least initially.

DHCP problems

For my testing, I needed another VM. And it couldn’t use a normal image, because I ultimately wanted to install a fresh OS on it. Luckily, Incus supports the --empty flag, which creates a VM with a root disk but without writing any image to it. I launched my test VM like this:

incus init test-vm --empty --vm -c limits.cpu=4 -c limits.memory=4GiB --profile base --profile disk-vms -d network,hwaddr="10:66:6a:07:8d:0d"

This command creates a VM with an empty 20 GB root disk. The VM also gets 4 GiB of RAM and 4 CPU cores. I’m also hardcoding the MAC address of the NIC. This was a later addition: I deleted the VM multiple times during testing, and it getting a new MAC on every recreation got annoying, because I had to change the static DHCP lease and the Tinkerbell config each time.
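
Since incus init only creates the VM, starting it and watching the boot is a separate step. Something like this should work (the VGA console needs a local SPICE viewer):

incus start test-vm
incus console test-vm --type=vga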

Then I launched the VM and saw - nothing. It tried to PXE boot, but did not get any netboot info, so I got dropped into a UEFI shell. I looked over my configuration, but couldn’t find anything wrong. So I ran a quick test to see whether packets sent to port 67 even made it into the Tinkerbell Pod:

echo "foo" | nc -u 203.0.113.200 67

And indeed, the packet seemed to reach Tinkerbell, as I saw this in the logs:

{"time":"2025-06-01T20:48:36.172709819Z","level":"0","caller":"smee/internal/dhcp/server/dhcp.go:62","msg":"error parsing DHCPv4 request","service":"smee","err":"buffer too short at position 4: have 0 bytes, want 4 bytes"}

I wasn’t sending an actual DHCP message, so it was understandable that Tinkerbell didn’t know what to do with it. So in principle, the ServiceLB of k3s was working. But the real DHCP packets did not get through. Next, I ran tcpdump on the VM running Tinkerbell to see whether the DHCP packets even made it to the machine itself:

tcpdump: verbose output suppressed, use -v[v]... for full protocol decode
listening on enp5s0, link-type EN10MB (Ethernet), snapshot length 262144 bytes
23:02:42.984176 IP 0.0.0.0.bootpc > 255.255.255.255.bootps: BOOTP/DHCP, Request from 10:66:6a:07:8d:0d (oui Unknown), length 253
E...V...@.#..........D.C...4.....3.......................fj...................................................................................................................................................................................
..........................c.Sc5..9...7.....
23:02:42.984524 IP _gateway.bootps > 255.255.255.255.bootpc: BOOTP/DHCP, Reply, length 300
E..H.......B
V.......C.D.4.......3..........
V...........fj.............................................................................................................................................................................................................c.Sc5..6.
V..3....T........
V....
V.............................
23:02:46.363155 IP 0.0.0.0.bootpc > 255.255.255.255.bootps: BOOTP/DHCP, Request from 10:66:6a:07:8d:0d (oui Unknown), length 265
E..%V...@.#..........D.C...l.....3.......................fj...................................................................................................................................................................................
..........................c.Sc5..6.
V..2.
V..9...7.....
23:02:46.363507 IP _gateway.bootps > 255.255.255.255.bootpc: BOOTP/DHCP, Reply, length 300
E..H.......B
V.......C.D.4.......3..........
V...........fj.............................................................................................................................................................................................................c.Sc5..6.
V..3....P........
V....
V.............................
4 packets captured
4 packets received by filter
0 packets dropped by kernel
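
For reference, the capture above came from an invocation along these lines; the second part shows how the same check can be done inside the Tinkerbell Pod’s network namespace from the k3s node. The container name filter and the availability of crictl and jq on the node are assumptions:

sudo tcpdump -i enp5s0 port 67 or port 68

# same capture, but inside the Pod's network namespace
# (on k3s, 'k3s crictl' also works if crictl isn't on the PATH)
CID=$(sudo crictl ps -q --name tinkerbell | head -n1)
PID=$(sudo crictl inspect "$CID" | jq -r '.info.pid')
sudo nsenter -t "$PID" -n -- tcpdump -ni any port 67 or port 68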

So yes, the packet at least arrived at the machine and on the right interface. Running tcpdump in the network namespace of the Tinkerbell Pod showed no packet arriving, though. So I dug a bit deeper into k3s’ ServiceLB and what it actually does, and found this output in the logs:

kmaster logs -n kube-system svclb-tinkerbell-01c2218a-p69fs -c lb-udp-67
+ trap exit TERM INT
+ BIN_DIR=/usr/sbin
+ check_iptables_mode
+ set +e
+ lsmod
+ grep -qF nf_tables
+ '[' 0 '=' 0 ]
+ mode=nft
+ set -e
+ info 'nft mode detected'
+ set_nft
+ ln -sf xtables-nft-multi /usr/sbin/iptables
[INFO]  nft mode detected
+ ln -sf xtables-nft-multi /usr/sbin/iptables-save
+ ln -sf xtables-nft-multi /usr/sbin/iptables-restore
+ ln -sf xtables-nft-multi /usr/sbin/ip6tables
+ start_proxy
+ echo 0.0.0.0/0
+ grep -Eq :
+ iptables -t filter -I FORWARD -s 0.0.0.0/0 -p UDP --dport 32562 -j ACCEPT
+ echo 203.0.113.200
+ grep -Eq :
+ cat /proc/sys/net/ipv4/ip_forward
+ '[' 1 '==' 1 ]
+ iptables -t filter -A FORWARD -d 203.0.113.200/32 -p UDP --dport 32562 -j DROP
+ iptables -t nat -I PREROUTING -p UDP --dport 67 -j DNAT --to 203.0.113.200:32562
+ iptables -t nat -I POSTROUTING -d 203.0.113.200/32 -p UDP -j MASQUERADE
+ '[' '!' -e /pause ]
+ mkfifo /pause

What I thought I could read out of that setup was that only packets directed to the exact IP of the host, 203.0.113.200, would be forwarded to the Tinkerbell Pod. But the initial DHCP discovery packets are of course sent to the broadcast address, as can be seen in the tcpdump above. So I assumed that these packets simply got dropped, because they were not addressed to the unicast address of the host. But I’m no longer 100% sure about that, because in later testing, with kube-vip as the LoadBalancer instead of ServiceLB, I got a similar result - no reaction from Tinkerbell in the logs. But: I then figured out that I had set the log level too low.

But at this point, I still thought that ServiceLB was the problem. So I decided to disable it and instead deploy kube-vip. I’ve already got experience with it, as I’m using it as the VIP provider for the k8s API in my main cluster.
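
For reference, disabling ServiceLB is a k3s server setting. On my single node, something like this should do it (a sketch; adjust to how your k3s is configured):

# add the servicelb component to the disable list in the k3s config
cat <<'EOF' | sudo tee -a /etc/rancher/k3s/config.yaml
disable:
  - servicelb
EOF
sudo systemctl restart k3s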

I deployed kube-vip with this DaemonSet:

apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: kube-vip
spec:
  selector:
    matchLabels:
      name: kube-vip
  template:
    metadata:
      labels:
        name: kube-vip
    spec:
      hostNetwork: true
      serviceAccountName: kube-vip
      containers:
        - name: kube-vip
          image: ghcr.io/kube-vip/kube-vip:v0.9.1
          imagePullPolicy: IfNotPresent
          args:
            - manager
          env:
            - name: svc_enable
              value: "true"
            - name: vip_arp
              value: "true"
            - name: vip_leaderelection
              value: "false"
            - name: svc_election
              value: "false"
          securityContext:
            capabilities:
              add:
              - NET_ADMIN
              - NET_RAW
              - SYS_TIME

With this config, kube-vip will watch for LoadBalancer services and announce their IPs via ARP. I’ve disabled all leader elections, as this k3s cluster will only ever have a single node. Kube-vip does not have any IPAM functionality; it relies either on annotations on the Service or on the loadBalancerIP setting. The Tinkerbell chart already sets the loadBalancerIP to the publicIP value from the values.yaml file, so I just relied on that.
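
To confirm that kube-vip picked up the Service and answers for the VIP, two quick checks are enough. The interface name on the client machine is an assumption:

# the Service should show the VIP as its external IP
kubectl get svc --all-namespaces | grep LoadBalancer
# from another machine on the LAN, the VIP should answer ARP requests
arping -c 3 -I eth0 203.0.113.200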

But that did not seem to fix my problem. There still wasn’t any reaction from Tinkerbell to the DHCP requests. That was when I finally realized that I had never increased Tinkerbell’s log level. 🤦 With that fixed, I finally got some results:

{
"time":"2025-06-07T22:04:38.322503545Z",
"level":"-1",
"caller":"smee/internal/dhcp/handler/proxy/proxy.go:211",
"msg":"Ignoring packet",
"service":"smee",
"mac":"10:66:6a:07:8d:0d",
"xid":"0xfd39e0af",
"interface":"macvlan0",
"error":"failed to convert hardware to DHCP data: no IP data"
}

I didn’t have time to dig deeper into that error at the time, but did create this issue, requesting that the above error message be increased in log level, so it appears with the standard logging setting. But it turned out that I had actually run into a bug. My Hardware manifest was okay, but Tinkerbell erroneously required some IP configuration. This has now been fixed.

First successful boot

And with that fix, I finally got my first successful netboot:

A screenshot of a Linux terminal. It shows the command prompt after a fresh boot. The initial text welcomes the user to HookOS, Tinkerbell's boot in-memory OS. The output also indicates that the OS is based on LinuxKit and the 5.10 kernel. Furthermore, it informs the user that the 'docker' command can be used to access tink's worker container.

Screenshot of my first successful HookOS network boot.

So that was pretty nice to see. But there was something even better going on in the background. First of all, the two echo commands I had configured to be run as tasks upon boot did run. But the cool thing was how I was able to verify that. It turns out that Tinkerbell launches a syslog server and configures the in-memory HookOS in such a way that it would forward the logs to Tinkerbell. And Tinkerbell then spits them out in its own logs. This is a really nice and convenient feature for seeing what’s happening on the remote machine.

Side Quest: Generating an Ubuntu image

The obvious next step was to install an entire OS instead of just outputting some text. But for that, I first needed a new image. My current image pipeline produces individual images for each host, which is clumsy and should be unnecessary. Something like cloud-init should be able to do all of the initial setup I need to prepare a host for Ansible management. I did not want to just use Ubuntu’s cloud images, but rather create my own.

Initially, I looked at ubuntu-image. That’s the tool Canonical uses to produce the official Ubuntu images. But it went a bit too deep for me, and I wasn’t able to really grok how it worked. In addition, while the current image was for an x86 VM with a local disk, I would also need images for Raspberry Pis without any local storage. And those would definitely need some adaptations, as they need a special initramfs. It didn’t look like that would be easily possible with ubuntu-image, so I would have had to use Packer/Ansible for those. In the end, I would have had different tools for different images, which I didn’t really like.

So I decided to stay with my Packer approach. One problem with my current setup was that it reboots the freshly installed image and runs Ansible on it. When using cloud-init, that reboot would count as the first boot, so the first boot after the image is actually deployed to a machine would not run cloud-init again - but it should. So I looked for a way to disable provisioning, and found it in this issue.

My HashiCorp Packer file looks like this:

locals {
  ubuntu-major = "24.04"
  ubuntu-minor = "2"
  ubuntu-arch = "amd64"
  out_dir = "ubuntu-base"
}

local "img-name" {
  expression = "ubuntu-base-${local.ubuntu-major}.${local.ubuntu-minor}-${local.ubuntu-arch}"
}

local "s3-access" {
  expression = vault("secret/s3-creds", "access")
  sensitive = true
}
local "s3-secret" {
  expression = vault("secret/s3-creds", "secret")
  sensitive = true
}

source "qemu" "ubuntu-base" {
  iso_url           = "https://releases.ubuntu.com/${local.ubuntu-major}/ubuntu-${local.ubuntu-major}.${local.ubuntu-minor}-live-server-${local.ubuntu-arch}.iso"
  iso_checksum      = "sha256:d6dab0c3a657988501b4bd76f1297c053df710e06e0c3aece60dead24f270b4d"
  output_directory  = "ubuntu-base"
  shutdown_command  = ""
  shutdown_timeout  = "1h"
  disk_size         = "8G"
  cpus              = 6
  memory            = "4096"
  format            = "raw"
  accelerator       = "kvm"
  firmware          = "/usr/share/edk2-ovmf/OVMF_CODE.fd"
  net_device        = "virtio-net"
  disk_interface    = "virtio"
  communicator      = "none"
  vm_name = "${local.img-name}"
  http_content      = {
    "/user-data" = file("${path.root}/files/ubuntu-base-autoinstall")
    "/meta-data" = ""
  }
  boot_command = ["<wait>e<wait5>", "<down><wait><down><wait><down><wait2><end><wait5>", "<bs><bs><bs><bs><wait>autoinstall ds=nocloud-net\\;s=http://{{ .HTTPIP }}:{{ .HTTPPort }}/ ---<wait><f10>"]
}

build {
  name = "ubuntu-base-${local.ubuntu-major}.${local.ubuntu-minor}-${local.ubuntu-arch}"
  sources = ["source.qemu.ubuntu-base"]

  post-processor "shell-local" {
    script = "${path.root}/scripts/s3-upload.sh"
    environment_vars = [
      "OUT_DIR=${abspath(local.out_dir)}",
      "OUT_NAME=${local.img-name}",
      "RCLONE_CONFIG_CEPHS3_PROVIDER=Ceph",
      "RCLONE_CONFIG_CEPHS3_TYPE=s3",
      "RCLONE_CONFIG_CEPHS3_ACCESS_KEY_ID=${local.s3-access}",
      "RCLONE_CONFIG_CEPHS3_SECRET_ACCESS_KEY=${local.s3-secret}",
      "RCLONE_CONFIG_CEPHS3_ENDPOINT=https://s3.example.com"
    ]
  }
}

This Packer file starts out by downloading the current Ubuntu 24.04.2 Server LTS install image. It then uses Packer’s QEMU plugin to launch a VM on the machine where the Packer build is executed. The way the automation works is always pretty funny to me. See the boot_command parameter above: Packer just takes control of the keyboard and types in what you’d type to run an Ubuntu autoinstall. The small HTTP server used to supply the user-data is automatically started by Packer and made available to the VM. This file uses Ubuntu’s autoinstall to automate the installation:

#cloud-config
autoinstall:
  version: 1
  identity:
    hostname: "ubuntu-base"
    password: "$6$exDY1mhS4KUYCE/2$zmn9ToZwTKLhCw.b4/b.ZRTIZM30JZ4QrOQ2aOXJ8yk96xpcCof0kxKwuX1kqLG/ygbJ1f8wxED22bTL4F46P0"
    username: ubuntu
  locale: en_US.UTF-8
  source:
    id: ubuntu-server-minimal
  storage:
    layout:
      name: direct
  ssh:
    install-server: true
  late-commands:
    - echo 'ubuntu ALL=(ALL) NOPASSWD:ALL' > /target/etc/sudoers.d/sysuser
  shutdown: poweroff

Not that much configuration is necessary here. I create the ubuntu user just as an escape hatch, so that when something goes wrong with later provisioning steps, I still have a way to get into the machine. It’s removed in the first steps of my Homelab Ansible playbook.

As I’ve noted above, I don’t need any additional customization here, the plan was to create a really generic and small image I could then customize once it was installed on a machine.

The last interesting part is the post-processor in the Packer file. Here, I wrote a little script that uploads the finished image to my S3 storage, so Tinkerbell has a place to install it from. This is what the s3-upload.sh script looks like:

#!/bin/sh

if ! command -v rclone > /dev/null; then
  echo "Command rclone not found, aborting."
  exit 1
fi

image="${OUT_DIR}/${OUT_NAME}"

if [ ! -f "${image}" ]; then
  echo "Could not find image '${image}', aborting."
  exit 1
fi
echo "Copying ${image}..."

# env  # debug leftover - this would also print the S3 credentials, so keep it commented out
rclone copy "${image}" cephs3:public/images/ || exit 1

exit 0

It uses rclone to upload the image file to S3. One advantage of starting out with a generic image is that it doesn’t contain any secrets or credentials, so there’s no problem with putting it on an (internally) public S3 bucket. The credentials for the S3 upload are taken from Vault via Packer’s Vault integration, in the s3-access and s3-secret locals at the beginning of the Packer file.
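
For completeness, running the build then looks roughly like this. The file name is made up, and VAULT_ADDR/VAULT_TOKEN have to be set for the vault() lookups to work:

export VAULT_ADDR="https://vault.example.com" VAULT_TOKEN="..."
packer build ubuntu-base.pkr.hcl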

Provisioning the VM via Tinkerbell

And now finally, I was ready to fully provision a VM with Tinkerbell. This requires an update of the Tinkerbell Template, which now looks like this:

apiVersion: tinkerbell.org/v1alpha1
kind: Template
metadata:
  name: test-template
spec:
  data: |
    name: test-template
    version: "0.1"
    global_timeout: 600
    tasks:
      - name: "os installation"
        worker: "{{`{{.machine_mac}}`}}"
        volumes:
          - /dev:/dev
          - /dev/console:/dev/console
        actions:
          - name: "install ubuntu"
            image: quay.io/tinkerbell/actions/image2disk:latest
            timeout: 900
            environment:
                IMG_URL: {{ .Values.images.ubuntuBaseAmd64 }}
                DEST_DISK: /dev/sda
                COMPRESSED: false    

And that just worked, right out of the box. The Tinkerbell image2disk action downloaded the image from S3 and automatically put it onto the VM’s local disk.

And just like that, I had a fully deployed VM, provisioned via Tinkerbell. 🎉

But not so fast. Of course, the first thing missing here was a proper cloud-init config to set up my standard Ansible user so I could run my standard playbook. Cloud-init can download configuration for the initial boot from a cloud provider, codified in the user-data and vendor-data. It runs in several phases during boot: first, before the network is available, using local config files, and then, once networking is up, using user-data provided e.g. by the cloud provider via an HTTP server. The user-data and vendor-data can also be provided entirely from local files. There’s a wide range of configuration that can be done via cloud-init, from creating local users and installing packages to configuring mounts and networking.

To supply this cloud-init data, Tinkerbell has the Tootles component. It implements AWS’ EC2 metadata service API, which is also supported by cloud-init. The metadata reported by Tootles for any given instance is supplied via the Hardware object:

apiVersion: tinkerbell.org/v1alpha1
kind: Hardware
metadata:
  name: test-vm
spec:
  metadata:
    instance:
      id: 10:66:6a:5a:91:8c
      ips:
        - address: 203.0.113.20
      allow_pxe: true
      hostname: test-vm
      operating_system:
        distro: "ubuntu"
        version: "24.04"
  disks:
  - device: /dev/sda
  interfaces:
  - dhcp:
      arch: x86_64
      hostname: test-vm
      mac: 10:66:6a:5a:91:8c
      ip:
        address: 203.0.113.20
        netmask: 255.255.255.0
      name_servers:
      - 10.86.25.254
      uefi: true
    netboot:
      allowPXE: true
      allowWorkflow: true
  userData: |
    #cloud-config
    packages:
      - openssh-server
      - python3
      - sudo
    ssh_pwauth: false
    disable_root: true
    allow_public_ssh_keys: false
    timezone: "Europe/Berlin"
    users:
      - name: ansible-user
        shell: /bin/bash
        ssh_authorized_keys:
          - from="192.0.2.100" ssh-ed25519 AAAAC3NzaC1lZDI1NTE5AAAAIOaxn8l16GNyBEgYzWO0BAko9fw8kkIq9tbels3hXdUt user@foo
        sudo: ALL=(ALL:ALL) ALL
    runcmd:
      - systemctl enable ssh.service
      - systemctl start ssh.service
    power_state:
      delay: 2
      timeout: 2
      mode: reboot    

The first change necessary here is to add the spec.interfaces[].dhcp.ip section. This is one of the suboptimal pieces of Tinkerbell. I’m not actually having Tinkerbell do the IPAM part of DHCP, that’s still left to my OPNsense router. But I still needed to specify the VM’s IP here, because the EC2 API, and thus Tootles, determines which metadata to return by the IP the request is coming from. So if you just do a request for /2009-04-04/meta-data from any host, you won’t get a response. The request needs to come from an IP which has a Hardware object. Another downside is that the spec.metadata section needs to be defined manually - it’s not automatically created from the rest of the Hardware object.
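
A quick way to check what Tootles serves for a given machine is to query it from that machine, since the source IP is what selects the Hardware object. Port 7172 is where Tootles is exposed in my setup; the paths are the standard EC2 metadata API ones, so this should work, but treat it as a sketch:

# run these from the machine whose Hardware object you want to inspect
curl http://203.0.113.200:7172/2009-04-04/meta-data
curl http://203.0.113.200:7172/2009-04-04/user-data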

Then we come to the actually interesting part, the spec.userData. This is the cloud-init config returned to the machine upon request. As I’ve noted above, the main goal here is to configure the new machine so I can run my main Ansible playbook on it. I’m making sure that my Ansible user exists, has my SSH key and is in the sudoers file. In addition, I’m making sure that SSH is started, and then finally reboot the machine. The #cloud-config comment on the first line is load-bearing, by the way - without it, cloud-init won’t accept the configuration.

So far so good, but this configuration still did not work. The central issue was that the machine did not have a proper networking config. The ip addr command showed the Ethernet interface as down. This confused me, because the cloud-init documentation clearly states that, when no explicit network config is given, a default using DHCP on all interfaces is applied.

So I went searching. And that wasn’t easy, because it turns out that Ubuntu’s server-minimal install is so minimal that it even eschews vi or nano. I had to look at files with cat. But I was finally able to find what I was looking for. In /etc/netplan/50-cloud-init.yaml, I found this:

network:
  version: 2
  ethernets:
    ens3:
      dhcp4: true

That file was created by the installer during the Packer install run. But of course, the NIC had a different name in that environment than it has on the final VM. To remedy this, I added another task to the Tinkerbell Template, removing the cloud-init config created by the installer so that the defaults apply:

- name: "remove installer network config"
  image: quay.io/tinkerbell/actions/writefile:latest
  timeout: 90
  environment:
    DEST_DISK: {{ `{{ formatPartition ( index .Hardware.Disks 0 ) 2 }}` }}
    FS_TYPE: ext4
    DEST_PATH: /etc/cloud/cloud.cfg.d/90-installer-network.cfg
    UID: 0
    GID: 0
    MODE: 0600
    DIRMODE: 0700
    CONTENTS: |
      # Removed during provisioning      

This task is executed after the image has been written to the disk; it mounts the root partition and overwrites the file’s content with just a comment.

But even after that, I was still not getting my cloud-init user config applied. After some more searching, I found the file /run/cloud-init/cloud-init-generator.log with the following content:

ds-identify rc=1
cloud-init is enabled but no datasource found, disabling

I could have avoided this problem by following Tinkerbell’s cloud-init docs. There, the example contains two more tasks:

- name: "add cloud-init config"
  image: quay.io/tinkerbell/actions/writefile:latest
  timeout: 90
  environment:
    DEST_DISK: {{ `{{ formatPartition ( index .Hardware.Disks 0 ) 2 }}` }}
    DEST_PATH: /etc/cloud/cloud.cfg.d/10_tinkerbell.cfg
    DIRMODE: "0700"
    FS_TYPE: ext4
    GID: "0"
    MODE: "0600"
    UID: "0"
    CONTENTS: |
      datasource:
        Ec2:
          metadata_urls: ["http://203.0.113.200:7172"]
          strict_id: false
      manage_etc_hosts: localhost
      warnings:
        dsid_missing_source: off      
- name: "add cloud-init ds-identity"
  image: quay.io/tinkerbell/actions/writefile:latest
  timeout: 90
  environment:
    DEST_DISK: {{ `{{ formatPartition ( index .Hardware.Disks 0 ) 2 }}` }}
    FS_TYPE: ext4
    DEST_PATH: /etc/cloud/ds-identify.cfg
    UID: 0
    GID: 0
    MODE: 0600
    DIRMODE: 0700
    CONTENTS: |
      datasource: Ec2      

The first task adds some basic cloud-init configuration, most importantly the URL for the metadata service. For most cloud providers, this is a hardcoded link-local IP (169.254.169.254) that is the same across their entire cloud, but here it will be Tinkerbell’s public IP as configured in the Helm chart’s values.yaml. The other important setting, in the second task, is hardcoding the data source to Ec2, because cloud-init’s default search mechanism checks the aforementioned default IP, where it won’t find any metadata service in my Homelab.
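
Once a machine has been provisioned with this in place, cloud-init itself can confirm that the Ec2 datasource was picked up. A quick sanity check on the new host (recent cloud-init versions):

cloud-init status --long
sudo cloud-init query userdata
cat /run/cloud-init/cloud-init-generator.log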

With all of this configuration done, I was able to delete the VM one last time, reset the Workflow object of Tinkerbell, and recreate the VM. After a couple of minutes, I was greeted with a fully functional VM, ready for Ansible, with no further manual intervention from my side.

Final thoughts

I really like what I’ve seen from Tinkerbell so far. I also like how well cloud-init works. Even if I don’t end up deploying Tinkerbell, I will likely change my new host setup to use a generic image and then do the customization with cloud-init.

The next steps will be the more complicated ones. There are two basic things I will need to figure out. First, how to boot Raspberry Pi 4 and 5 into iPXE so I can use Tinkerbell for provisioning them. From some initial research, it looks like that should be possible. The bigger issue might be diskless hosts. Sure, I can set up iPXE and provisioning - but the problem is then how to tell them to boot into their own system, instead of Tinkerbell’s provisioning, once they’ve been properly set up.

Let’s see how those next experiments turn out.