After setting up my lab environment in the previous article, I’ve now also set up the Kubernetes cluster itself, with kubeadm as the setup tool and Cilium as the CNI plugin for networking.

Here, I will describe why I chose the tools I did, and how I initialized the cluster, as well as how to remove the cluster when necessary.

Tool choices

Before setting up a cluster, several choices need to be made. The first one in the case of Kubernetes is which distribution to use.

The first one, and the one I chose, is “vanilla” k8s. This is the default distribution, with full support for all the related standards and functionality.

Another well-liked one is k3s, which bills itself as a lightweight distribution. Its most distinguishing feature seems to be that its control plane ships as a single binary, instead of the whole set of separate components you get with vanilla k8s. Also in contrast to k8s, it uses a simple SQLite database as the storage backend for cluster data, instead of a full etcd cluster. It also falls into the “opinionated” part of the tech spectrum: instead of making you choose things like the CNI and CRI, it already comes with options out of the box. Flannel is pre-chosen as the CNI plugin, while e.g. Traefik is already set up as an Ingress Controller.

If you want to go even further from vanilla, there are also things like Talos Linux. It’s an entire Linux distro made with only one goal: running a Kubernetes cluster. It doesn’t even allow you to SSH into it.

For now, I will stay with vanilla k8s, installed with kubeadm, simply because I like making the “vanilla” experience my first contact with a piece of tech. I also prefer having to make my own decisions on tools, because it forces me to inform myself about the alternatives. Once I’ve completed my current experimentation, I will likely at least take another look at Talos Linux. Its premise sounds quite interesting, especially with the declarative config files, but the “no SSH” part honestly feels somewhat weird to me.

The next choice to be made is the container runtime, which Kubernetes talks to through the CRI. The only thing I knew going into this was that I did not want to go with Docker. Too many bad experiences with memory leaks and other shenanigans with their daemon. After some research, my choice fell on CRI-O. To be honest, mostly because it bills itself as a container engine focused on use with Kubernetes.

Next is the CNI plugin. This is the piece of the Kubernetes stack which controls networking, most importantly inter-Pod networking. This is where I had the hardest time choosing. The websites of all of them are chock-full of buzzwords. eBPF! It’s better than sliced bread! 😒 In the end, my decision was between Cilium and Calico. The one feature I was really interested in and definitely wanted was Network Policies. Those allow defining rules for inter-Pod connectivity, so I can control which pods are allowed to talk to each other. I like having this for the sake of security, so that e.g. only the apps which actually need a DB can talk to the Postgres pod. In my current HashiCorp Nomad-based cluster, I’ve got something similar using Consul’s service mesh. One more thing I find pretty nice is that both Calico and Cilium support encryption. This was another reason why I started using Consul: it provides me with encrypted network traffic, without me having to set up TLS certs for each individual service. In the end, even after reading through most of the docs for both Calico and Cilium, I didn’t know which one to choose. So I did the obvious thing:

A picture of a 20-sided die with the number 16 on the upper face.

When in doubt, just ask Principal Lead Architect Dice for their opinion.

And that’s how I came to use Cilium as the CNI plugin in my cluster. 😅
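
Since Network Policies were the feature that tipped the scales for me, here is a minimal sketch of the kind of policy I have in mind: only pods carrying a specific label may reach the Postgres pod, and only on the Postgres port. All names and labels here are made up for the example, this is not a policy from my actual cluster:

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-apps-to-postgres
  namespace: databases
spec:
  # the policy applies to the Postgres pod(s)
  podSelector:
    matchLabels:
      app: postgres
  policyTypes:
    - Ingress
  ingress:
    # only pods labelled as needing the DB may connect, and only on 5432
    - from:
        - podSelector:
            matchLabels:
              homelab/needs-db: "true"
      ports:
        - protocol: TCP
          port: 5432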

Without further ado, let’s conjure ourselves a Kubernetes cluster. 🤓

Preparing the machines

Before we can actually call kubeadm init, we need to install the tools on the machines and do some additional config. For most of the setup, I followed the official Kubernetes kubeadm docs.

Before installing the tools, a couple of config options need to be set on the machines, defined here.

I configured the options using Ansible:

- name: load overlay kernel module
  tags:
    - kubernetes
    - kernel
  community.general.modprobe:
    name: overlay
    persistent: present
    state: present
- name: load br_netfilter kernel module
  tags:
    - kubernetes
    - kernel
  community.general.modprobe:
    name: br_netfilter
    persistent: present
    state: present
- name: enable ipv4 netfilter on bridge interfaces
  tags:
    - kubernetes
    - kernel
  ansible.posix.sysctl:
    name: net.bridge.bridge-nf-call-iptables
    value: 1
    state: present
- name: enable ipv6 netfilter on bridge interfaces
  tags:
    - kubernetes
    - kernel
  ansible.posix.sysctl:
    name: net.bridge.bridge-nf-call-ip6tables
    value: 1
    state: present
- name: enable ipv4 forwarding
  tags:
    - kubernetes
    - kernel
  ansible.posix.sysctl:
    name: net.ipv4.ip_forward
    value: 1
    state: present
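
To check that the modules and sysctls actually took effect, a quick manual verification on one of the machines can look like this (just a sanity check, not part of the playbook):

lsmod | grep -E 'overlay|br_netfilter'
sysctl net.bridge.bridge-nf-call-iptables net.bridge.bridge-nf-call-ip6tables net.ipv4.ip_forward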

This takes care of the pre-config. But if you’re using the Ubuntu cloud image, for example with LXD VMs, you also need to switch to a different kernel to be able to use Cilium as the CNI plugin later on.

Fun with Ubuntu’s cloud images

When I started up the cluster and went to install Cilium, its containers went into a crash loop, because the kernel in the Ubuntu cloud image, which is specialized for use with e.g. KVM, does not have all the required kernel modules available. For me, the kernel installed was linux-image-kvm. The Cilium docs have a page detailing the required kernel config. I initially thought: those requirements will be fulfilled by any decently current kernel. But I was wrong: the -kvm variant of Ubuntu’s kernel seems to be missing some of the required config options.

To fix this, I then needed to switch to the -generic kernel. Naively, I again thought: how difficult could it possibly be? And I just ran apt remove linux-image-5.15.0-1039-kvm. That did not have the hoped-for effect. Instead, it tried to remove that image and then install linux-image-unsigned-5.15.0-1039-kvm, which would not have been too useful. Finally, I followed this tutorial, but decided to install the -generic kernel instead of the -virtual one.
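
For reference, the sequence that should do the trick looks roughly like this. This is my reconstruction rather than a literal copy of the tutorial, and the exact kernel version will of course differ:

# install the -generic kernel first, so a bootable kernel is always present
sudo apt install linux-generic
# only then remove the -kvm meta package and the specific -kvm image
sudo apt remove linux-image-kvm linux-image-5.15.0-1039-kvm
sudo reboot
# after the reboot, this should report a -generic kernel
uname -r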

Installing and setting up CRI-O

As noted above, cri-o is my container runtime of choice. To install it, several additional APT repos need to be added. I will only show the Ansible setup for one of them here. First, we need the public keys for the repos, which I normally just download and then store in my Homelab repo. Before storing the keys, you should pipe them through gpg --dearmor.
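
Fetching and dearmoring such a key can look roughly like this. The URL follows the usual openSUSE OBS layout for the CRI-O repo configured below, so double-check it against the current install docs before using it:

curl -fsSL https://download.opensuse.org/repositories/devel:/kubic:/libcontainers:/stable:/cri-o:/1.27/xUbuntu_22.04/Release.key \
  | gpg --dearmor > libcontainers-crio-keyring.gpg
# the resulting file then goes into the Ansible role's files/ directory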

Setting up the keys then simply means copying them into the right directory, where apt can find them:

- name: add libcontainers cri-o repo key
  tags:
    - kubernetes
    - crio
  copy:
    src: libcontainers-crio-keyring.gpg
    dest: /usr/share/keyrings/libcontainers-crio-keyring.gpg
    owner: root
    group: root
    mode: 0644

Once that’s done, we can set up the actual repo:

- name: add libcontainers cri-o ubuntu repo
  tags:
    - kubernetes
    - crio
  apt_repository:
    repo: >
      deb [signed-by=/usr/share/keyrings/libcontainers-crio-keyring.gpg]
      https://download.opensuse.org/repositories/devel:/kubic:/libcontainers:/stable:/cri-o:/1.27/xUbuntu_22.04/ /
    state: present
    filename: libcontainers-crio
  register: libcontainers_ubuntu_repo
  when: ansible_facts['distribution'] == 'Ubuntu'

As you can see, I’m only adding the specific Ubuntu repo if the distribution actually is Ubuntu. Please note that I’ve only shown one additional repo, but there’s another one which needs to be added in a similar way.

Finally, we can install cri-o:

- name: create cri-o config dir
  tags:
    - kubernetes
    - crio
  ansible.builtin.file:
    path: /etc/crio
    state: directory
    owner: root
    group: root
    mode: '755'
- name: install cri-o config file
  tags:
    - kubernetes
    - crio
  ansible.builtin.copy:
    dest: /etc/crio/crio.conf
    src: crio.conf
    owner: root
    group: root
    mode: '644'
- name: install cri-o
  tags:
    - kubernetes
    - crio
  ansible.builtin.apt:
    name:
      - cri-o
      - cri-o-runc
    state: present
    install_recommends: false
- name: autostart cri-o
  tags:
    - crio
    - kubernetes
  ansible.builtin.systemd_service:
    name: crio
    enabled: true
    state: started

One weird thing which tripped me up in the beginning was that cri-o needs runc, but it doesn’t come with a dependency on cri-o-runc.

The config file I’m using for cri-o is pretty simple, as the defaults were mostly fine for me.

[crio.runtime]
cgroup_manager = "systemd"

So at least for now, I just make sure that systemd is set as the cgroup manager, as the container runtime and the kubelet need to use the same cgroup driver.
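
Whenever this file changes, cri-o needs a restart to pick it up. A quick manual check that the service came back up afterwards (an Ansible handler would of course be the nicer way to do this):

sudo systemctl restart crio
systemctl is-active crio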

Installing the Kubernetes tools

Finally, we need to install a couple of Kubernetes tools. First, similar to above, we need to add the Kubernetes APT repo at pkgs.k8s.io, together with its signing key.
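
As a sketch, the repo task looks a lot like the CRI-O one above. The keyring path and the v1.28 path segment are assumptions on my part; pkgs.k8s.io serves one repo per Kubernetes minor release, so the path has to match the version you actually want to install:

- name: add kubernetes apt repo
  tags:
    - kubernetes
  apt_repository:
    repo: >
      deb [signed-by=/usr/share/keyrings/kubernetes-apt-keyring.gpg]
      https://pkgs.k8s.io/core:/stable:/v1.28/deb/ /
    state: present
    filename: kubernetes
  when: ansible_facts['distribution'] == 'Ubuntu'

With the repo and its key (handled the same way as the CRI-O key above) in place, we can install the three necessary Kubernetes tools: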

- name: install kubernetes tools
  tags:
    - kubernetes
  ansible.builtin.apt:
    name:
      - kubelet
      - kubeadm
      - kubectl
    state: present
    install_recommends: false

In addition to installing the tools, they should also be pinned to their installed versions, as the Kubernetes tools should not be upgraded as a side effect of a routine system package update, but only as part of a deliberate cluster upgrade:

- name: pin kubelet version
  tags:
    - kubernetes
  dpkg_selections:
    name: kubelet
    selection: hold
- name: pin kubeadm version
  tags:
    - kubernetes
  dpkg_selections:
    name: kubeadm
    selection: hold
- name: pin kubectl version
  tags:
    - kubernetes
  dpkg_selections:
    name: kubectl
    selection: hold
- name: autostart kubelet
  tags:
    - kubelet
    - kubernetes
  ansible.builtin.systemd_service:
    name: kubelet
    enabled: true

At the end, I’m also making sure that the kubelet is auto-started.

Setting up kube-vip as a load balancer for the control plane

If you want an HA control plane with Kubernetes, you need a load balancer that distributes requests to the Kubernetes API endpoint across the three control plane instances.

Luckily, you don’t need to migrate your Homelab into a cloud to make this work. Through some helpful comments on the Fediverse, I was pointed towards the kube-vip app. It provides a virtual IP for the control plane, notably the Kubernetes API server. In my setup, I ran it as a static pod, as I liked the idea of tying it to the kubelet, instead of running it standalone.

To do so, I put the following static pod config file into /etc/kubernetes/manifests/kube-vip.yaml:

apiVersion: v1
kind: Pod
metadata:
  creationTimestamp: null
  name: kube-vip
  namespace: kube-system
spec:
  containers:
  - args:
    - manager
    env:
    - name: vip_arp
      value: "true"
    - name: bgp_enable
      value: "false"
    - name: port
      value: "6443"
    - name: vip_cidr
      value: "32"
    - name: cp_enable
      value: "true"
    - name: svc_enable
      value: "false"
    - name: cp_namespace
      value: kube-system
    - name: vip_ddns
      value: "false"
    - name: vip_leaderelection
      value: "true"
    - name: vip_leasename
      value: plndr-cp-lock
    - name: vip_leaseduration
      value: "5"
    - name: vip_renewdeadline
      value: "3"
    - name: vip_retryperiod
      value: "1"
    - name: address
      value: 10.12.0.100
    - name: prometheus_server
      value: :2112
    image: ghcr.io/kube-vip/kube-vip:v0.6.2
    imagePullPolicy: Always
    name: kube-vip
    resources: {}
    securityContext:
      capabilities:
        add:
        - NET_ADMIN
        - NET_RAW
    volumeMounts:
    - mountPath: /etc/kubernetes/admin.conf
      name: kubeconfig
  hostAliases:
  - hostnames:
    - kubernetes
    ip: 127.0.0.1
  hostNetwork: true
  volumes:
  - hostPath:
      path: /etc/kubernetes/admin.conf
    name: kubeconfig
status: {}

Most of these options should be relatively clear. More information can be found in the docs. Important to note are the vip_arp and bgp_enable options. These configure how the address is made known. Because I’m definitely not a networking wizard, I went with the simpler ARP based approach.

Also worth noting is that I disabled the svc_enable option, which can be switched on to allow kube-vip to act as a LoadBalancer for Kubernetes services of that type. To reduce initial complexity, I will be working with ClusterIP and NodePort services for now and look at LoadBalancer type services later again, including things like MetalLB.

The final and most important config is address. It determines which virtual IP address kube-vip will advertise. In my case, I also added a DNS name for that IP into my authoritative DNS server for easier access.

Kube-vip should be a static pod, so it can run (more or less) outside Kubernetes. In my setup, this is necessary because I will point kubeadm towards the virtual IP during the setup of the actual cluster, so kube-vip needs to work before the cluster is actually up and running.

Initializing the cluster

With all preparations finally complete, it’s time to get ourselves a Kubernetes cluster. As I’ve noted above, I’m using vanilla k8s with kubeadm. There are two ways to initialize the cluster and add additional nodes: via command line flags, or via a kubeadm init config file. I will be going with the config file approach, to be able to put the initialization under version control.

Generally speaking, command line flags and config files cannot be mixed at the moment.

The documentation for the init config file can be found here.

There is no default location for the config file, so I just put mine alongside all of the other Kubernetes configs under /etc/kubernetes.

And here is my init config:

apiVersion: kubeadm.k8s.io/v1beta3
kind: InitConfiguration
skipPhases:
  - "addon/kube-proxy"
nodeRegistration:
  kubeletExtraArgs:
{% if 'kube_ceph' in group_names %}
    node-labels: "homelab.role=ceph"
{% elif 'kube_controllers' in group_names %}
    node-labels: "homelab.role=controller"
{% elif 'kube_workers' in group_names %}
    node-labels: "homelab.role=worker"
{% endif %}
---
apiVersion: kubeadm.k8s.io/v1beta3
kind: ClusterConfiguration
networking:
  podSubnet: "10.20.0.0/16"
  serviceSubnet: "10.21.0.0/16"
controlPlaneEndpoint: "api.k8s.example.com:6443"
apiServer:
  timeoutForControlPlane: 4m0s
  extraArgs:
    authorization-mode: "Node,RBAC"
controllerManager:
  extraArgs:
    allocate-node-cidrs: "true"
clusterName: "exp-cluster"
---
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
cgroupDriver: "systemd"

For reference, you can see the defaults used without any flags or config files by running kubeadm config print init-defaults.

I’m actually not diverging very much from the defaults here. As you can see from the ifs in the kubeletExtraArgs, I’m using Ansible’s templating engine to assign roles to nodes via Kubernetes node labels. Furthermore, I’m also disabling the kube-proxy initialization phase. This is because I will be using Cilium as my Container Networking plugin, and it can already provide the proxy functionality, which is mostly concerned with handling Kubernetes Services. So I don’t want kubeadm to install kube-proxy on the nodes.

For the cluster itself, I’m also setting the service and pod CIDRs.

Important note: I’m using example values here, not my actual configs. If you see any weird inconsistencies between IP addresses or DNS names, please yell at me on Mastodon. 😉

The allocate-node-cidrs option for the controllerManager is recommended by Cilium.

Last but not least, the Kubernetes docs recommend setting the cgroupDriver explicitly, which I do in the KubeletConfiguration.

After this file has been defined, we can run the command to init the cluster on the first control plane node:

kubeadm init --upload-certs --config /etc/kubernetes/kube-init-config.yaml

Noteworthy here is the --upload-certs flag. Without it, the certs generated during initialization will not be stored inside the cluster. As a consequence, they won’t be usable for subsequently adding more control plane nodes, and another command would be needed to generate a new set and upload it before additional control plane nodes can join. By default, the uploaded certs only have a short TTL of two hours. So if you only plan to add more control plane nodes after that point, you can just as well skip the flag for now and upload a fresh set later.

After this first node has been initialized, the next step is to copy the admin kubeconfig to your workstation for use with kubectl. You can find it under /etc/kubernetes/admin.conf. To use this file, copy it to ~/.kube/config. And now you should be able to run your first command against your newly inaugurated Kubernetes cluster, e.g. kubectl get all -n kube-system.
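
For me, that boils down to a simple scp. The hostname is a placeholder, and if you already have a ~/.kube/config you will want to merge the two files instead of overwriting:

# copy the admin kubeconfig from the first control plane node
scp root@controller1:/etc/kubernetes/admin.conf ~/.kube/config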

Security note: This file contains the private key which gives you full access to the new Kubernetes cluster. Secure it appropriately. I will probably do another post once I’ve figured out what to do with it.

You will see a number of pods, mostly the Kubernetes control plane elements, namely etcd, kube-apiserver, kube-controller-manager and kube-scheduler. In addition, there should be a kube-vip instance. If the kubectl get command fails, first check whether kube-vip starts up correctly. The kubectl config file we copied from the initial cluster node to the workstation contains the address entered under the controlPlaneEndpoint in the kubeadm init config above. In my setup, that’s a DNS entry which points to the virtual IP managed by kube-vip.

You will also see that the coredns pods are currently still in the Pending state. That’s because CoreDNS, the default Kubernetes internal DNS server, only starts up once a Container Network Interface plugin has been installed. In our case that’s Cilium.

Installing Cilium

With the necessary preparations already detailed above, the only thing left to do to install Cilium is to run the install command:

cilium install --set ipam.mode=cluster-pool --set ipam.operator.clusterPoolIPv4PodCIDRList=10.20.0.0/16 --set kubeProxyReplacement=true --version 1.14.1 --set encryption.enabled=true --set encryption.type=wireguard

This command should be run on your workstation. Cilium will automatically use the kubectl config file in ~/.kube/config to contact the cluster and install itself.

The clusterPoolIPv4PodCIDRList is important here: while we already set the Pod address CIDR in the kubeadm init config file above, Cilium does not seem to pick that up and would instead use its internal default. In addition, I’m telling Cilium here that it should act as a replacement for kube-proxy. Finally, I’m enabling pod-to-pod encryption with WireGuard. This way, I don’t have to care about encrypting traffic between pods myself, e.g. by configuring all my services to use TLS.

If the install command fails and the Cilium pods do not come up, check to make sure that the preconditions I noted above are all fulfilled. You should now see a single cilium-operator and a cilium pod in Running state when you execute kubectl get pods -n kube-system. Furthermore, the CoreDNS pod should now also be in the Running state.

You can check whether everything went alright by executing cilium status on your workstation.
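
Besides cilium status, the Cilium CLI also comes with a connectivity test that deploys a handful of test workloads into their own namespace and checks pod-to-pod traffic. It takes a while to run, but it makes for a nice sanity check:

cilium status --wait
cilium connectivity test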

Joining remaining nodes to the cluster

For joining additional nodes, I went with a similar approach as for the cluster init, using a join configuration file.

apiVersion: kubeadm.k8s.io/v1beta3
kind: JoinConfiguration
nodeRegistration:
  kubeletExtraArgs:
{% if 'kube_ceph' in group_names %}
    node-labels: "homelab.role=ceph"
{% elif 'kube_controllers' in group_names %}
    node-labels: "homelab.role=controller"
{% elif 'kube_workers' in group_names %}
    node-labels: "homelab.role=worker"
{% endif %}
discovery:
  bootstrapToken:
    token: Token here
    apiServerEndpoint: api.k8s.example.com:6443
    caCertHashes:
      - "Cert Hash here"
{% if 'kube_controllers' in group_names %}
controlPlane:
  certificateKey: "Cert key here"
{% endif %}

This file is a bit more unwieldy than the init config, because it also needs to contain some secrets. This wouldn’t be a problem if those secrets were permanent: I could just store them in Vault. But they are pretty short-lived, so storing them and templating them into the file during Ansible deployments doesn’t really work. So I just input them into the file without committing the result to git.
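
With the secrets filled in, actually joining a node is then a single command pointing kubeadm at the rendered file. The path is simply where I keep the config, mirroring the init config above:

kubeadm join --config /etc/kubernetes/kube-join-config.yaml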

When you run the kubeadm init command, the output will look something like this, provided you supply the --upload-certs flag:

You can now join any number of control-plane node by running the following command on each as a root:
    kubeadm join 192.168.0.200:6443 --token 9vr73a.a8uxyaju799qwdjv --discovery-token-ca-cert-hash sha256:7c2e69131a36ae2a042a339b33381c6d0d43887e2de83720eff5359e26aec866 --control-plane --certificate-key f8902e114ef118304e561c3ecd4d0b543adc226b7a07f675f56564185ffe0c07

The important part you don’t get without the --upload-certs flag is the --certificate-key. This is required for new control plane nodes. The values in this message fit into the JoinConfiguration as follows:

  • discovery.bootstrapToken.token: --token value
  • discovery.bootstrapToken.caCertHashes: --discovery-token-ca-cert-hash value
  • controlPlane.certificateKey: --certificate-key value

A fully rendered version of the JoinConfiguration file above would look like this, using the values from the kubeadm init example output:

apiVersion: kubeadm.k8s.io/v1beta3
kind: JoinConfiguration
nodeRegistration:
  kubeletExtraArgs:
    node-labels: "homelab.role=controller"
discovery:
  bootstrapToken:
    token: "9vr73a.a8uxyaju799qwdjv"
    apiServerEndpoint: 192.168.0.200:6443
    caCertHashes:
      - "sha256:7c2e69131a36ae2a042a339b33381c6d0d43887e2de83720eff5359e26aec866"
controlPlane:
  certificateKey: "f8902e114ef118304e561c3ecd4d0b543adc226b7a07f675f56564185ffe0c07"

If some time has passed since running the kubeadm init command, the bootstrap token and the uploaded certs will have expired. You can recreate them by running the following command on a control plane node which has already been initialized:

kubeadm init phase upload-certs --upload-certs

The output created will be similar to this:

[upload-certs] Storing the certificates in Secret "kubeadm-certs" in the "kube-system" Namespace
[upload-certs] Using certificate key:
supersecretcertkey

The last line is the new value for certificateKey. The next step is generating a fresh bootstrap token, as that is invalidated after 24 hours:

kubeadm token create --certificate-key supersecretcertkey --print-join-command

This will output a fresh join command you can use to join additional control plane nodes to the cluster, or whose values you can enter into your JoinConfiguration file.

Finally, additional worker nodes can be joined in a similar manner. Simply remove the following lines from the JoinConfiguration:

controlPlane:
  certificateKey: "Cert key here"

Deleting a cluster

As I had to kill the cluster multiple times but did not want to completely reinstall the nodes every time, I also researched how to remove a cluster. The steps are as follows:

  • First remove the CNI plugin, with cilium uninstall in my case
  • Starting with the worker nodes, execute the following commands on each node:
    1. kubeadm reset
    2. rm -fr /etc/cni
    3. Reboot the machine (this is for undoing the networking changes of the CNI plugin)

It is important to note the order here: always start with the worker nodes before removing the control plane nodes.
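
Condensed into commands, that means running roughly the following on every node, workers first (kubeadm reset asks for confirmation unless you pass -f):

sudo kubeadm reset -f
sudo rm -rf /etc/cni
sudo reboot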

Final thoughts

First of all: Yay, Kubernetes Cluster! 🥳

This was a pretty vexing process. The research phase, before I set up a cluster for the first time, was considerably longer than for my current Nomad/Consul/Vault cluster. And I feel that that’s mostly due to the differences in the documentation. HashiCorp’s docs, especially their tutorials, are top notch for all three tools.

Sure, if you follow the instructions in the docs for Kubernetes and Cilium, you will relatively reliably end up with a working cluster. But it just feels like there are a lot more moving parts. And there are some decisions you need to make up front, like choosing a CNI plugin and a container engine.

Don’t misunderstand me: having that choice is great. As I mentioned above, I’m a fan of apps that don’t have opinions on everything, so I can make choices for myself. But I can do that in HashiCorp’s Nomad as well. There I even have greater choice, because I can decide per workload which container engine and which networking plugin I want to use.

On the bright side, at least for now I have not seen anything I would consider a show stopper for my migration to Kubernetes. As this article was a bit longer in the making, I’ve just finished setting up Traefik as the ingress controller a couple of days ago, and I’m now working on setting up a Rook Ceph cluster. Let’s see how this continues. 🙂

Last but not least, a comment mostly to myself: write setup articles closer to when the actual setup happens. I’m writing a lot of this over a month after I issued the (currently 😇) final kubeadm init. I’d made some notes on important things, but I had not thought of copying the outputs from the kubeadm init or kubeadm join commands to show what they’re supposed to look like. I also did not think of making a couple of notes on the initial output of some kubectl get commands during the setup phase to show what to expect, which I think would have been nice.

The next article in the series will be about day 1 operations, writing about how I plan to handle Kubernetes manifests for actual workloads in my setup.