Wherein I add the Kubernetes nodes to my host update Ansible playbook.
This is part eight of my k8s migration series.
With the number of hosts I’ve now got in my Homelab, I definitely need a better way to update them than manually SSH’ing into each one. So a while ago, I created an Ansible playbook to update all hosts in my Homelab. These updates are also one of the reasons I keep so many physical hosts, even if they’re individually relatively small: I want an environment where I can take down any given host for updates without anything at all breaking, and especially without having to take the entire lab down for a regular host update.
My node updates need to execute the following sequence:
- Drain all Pods from the node
- Run `apt update`
- Run `apt upgrade`
- Reboot the machine
- Uncordon the machine
- Run `apt autoremove`
I’ve got a couple of different classes of nodes in my Homelab, but I will concentrate only on those related to k8s in this post:
- Control plane nodes. These run the kubeadm control plane Pods and Ceph MONs.
- Ceph nodes. These run the Ceph OSDs providing storage to the Homelab and some other Ceph services.
- Worker nodes. These run my Kubernetes workloads.
All three classes make some alterations to the above sequence of steps. Each class of node has its own play, and the plays all run in sequence, not in parallel to each other, to ensure stability of the overall cluster. I’m reasonably sure that with some fancy footwork, I could probably run them in parallel as well. But the main goal of this setup is that I enter a single command, and then I can do something completely different, without having to babysit the update. If it takes an hour longer but I can just go and read something while it’s running, that’s an okay trade-off for me. Those of you following me on Mastodon can probably tell when my update Fridays are just by the volume of posts I make on those evenings. 😅
The first difference between the plays for each class of node is the parallelism inside the playbook. For this, I’m using Ansible’s Linear Strategy. Both the control plane and Ceph nodes run with `serial: 1`, to make sure there are always enough nodes up to keep the Homelab chugging along. The worker nodes, on the other hand, are allowed to run with `serial: 2`, updating two hosts in parallel, as I should have enough slack in the cluster to keep at least most things running even with two fewer nodes.
For draining the nodes, I initially used the `k8s_drain` module. But I had a problem with that one, namely getting `Too Many Requests` errors:
```
fatal: [node1 -> cnc]: FAILED! => {"changed": false, "msg": "Failed to delete pod rook-cluster/rook-ceph-osd-1-7977658495-nt6ps due to: Too Many Requests"}
```
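For reference, the module-based drain task looked roughly like this. This is a sketch rather than my exact task, assuming the documented `kubernetes.core.k8s_drain` options:

```yaml
- name: drain node
  delegate_to: cnc
  become_user: my_user
  kubernetes.core.k8s_drain:
    name: "{{ ansible_hostname }}"
    state: drain
    # Same behavior as the kubectl flags discussed below (option names assumed from the module docs)
    delete_options:
      force: true
      ignore_daemonsets: true
      delete_emptydir_data: true
```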
I didn’t always get the error; sometimes it just worked, and I’m not 100% sure what the trigger is. After spending quite a while googling, I’m still not sure where those errors are even coming from, whether it’s the kube-apiserver returning them or whether it has, for example, something to do with Pod Disruption Budgets. I then switched to executing `kubectl` via the `command` module, and that worked without issue. The task for draining a node looks like this:
```yaml
- name: drain node
  tags:
    - kubernetes
  delegate_to: cnc
  become_user: my_user
  command:
    argv:
      - /home/my_user/.local/bin/kubectl
      - drain
      - --delete-emptydir-data=true
      - --force=true
      - --ignore-daemonsets=true
      - "{{ ansible_hostname }}"
```
You need to supply the absolute path to the `kubectl` binary, as this runs the command directly, not inside a shell, so no PATH extensions and the like. I’m also delegating this task to my Command & Control host, which is the only machine with Kubernetes certs. The `--delete-emptydir-data=true` is needed because Cilium uses `emptyDir` for some temporary storage, and without it, the drain fails. The same is true for `--force`, which is necessary for the drain to go through on nodes with Pods that aren’t managed by a controller like a Deployment or StatefulSet. Finally, `--ignore-daemonsets` is necessary because DaemonSet-managed Pods, in my case for example the Fluentbit log shipper, can’t be evicted by a drain; the flag tells kubectl to skip them instead of aborting.
The rest of the play for my worker nodes looks like this:
```yaml
- hosts: kube_workers
  name: Update kubernetes worker nodes
  tags:
    - k8s-workers
  serial: 2
  strategy: linear
  pre_tasks:
  tasks:
    - name: drain node
      tags:
        - kubernetes
      delegate_to: cnc
      become_user: my_user
      command:
        argv:
          - /home/my_user/.local/bin/kubectl
          - drain
          - --delete-emptydir-data=true
          - --force=true
          - --ignore-daemonsets=true
          - "{{ ansible_hostname }}"
    - name: run apt upgrade
      tags:
        - apt
      apt:
        install_recommends: no
        update_cache: yes
        upgrade: yes
    - name: reboot machine
      tags:
        - reboot
      reboot:
    - name: wait for the machine to accept ansible commands again
      tags:
        - reboot
      wait_for_connection:
    - name: clear OSD blocklist
      become_user: my_user
      delegate_to: cnc
      tags:
        - ceph
      command: /home/my_user/.krew/bin/kubectl-rook_ceph --operator-namespace rook-ceph -n rook-cluster ceph osd blocklist clear
    - name: uncordon node
      tags:
        - kubernetes
      delegate_to: cnc
      become_user: my_user
      kubernetes.core.k8s_drain:
        name: "{{ ansible_hostname }}"
        state: uncordon
    - name: run autoremove
      tags:
        - apt
      apt:
        autoremove: true
    - name: pause for one minute
      tags:
        - kubernetes
      pause:
        minutes: 1
```
The `clear OSD blocklist` task clears Ceph’s client blocklist. Most of my worker nodes don’t have any storage of their own, and instead use netboot and a Ceph RBD volume for their root FS. And sometimes, Ceph puts those clients on a blocklist, as I’ve explained in more detail here. I’m also giving all my plays a pause at the end, to afford the cluster some time to settle again before the next batch of workers is taken down.
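If you want to look before you leap, the blocklist can also just be listed. This is a sketch, untested, using the same plugin path as above and Ceph’s `osd blocklist ls` subcommand:

```yaml
- name: show OSD blocklist
  delegate_to: cnc
  become_user: my_user
  tags:
    - ceph
  # Read-only: just prints the current blocklist entries, so never report a change
  command: /home/my_user/.krew/bin/kubectl-rook_ceph --operator-namespace rook-ceph -n rook-cluster ceph osd blocklist ls
  register: blocklist
  changed_when: false
```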
For the Ceph nodes, I need to get a little bit more involved. I start out with pre and post tasks, which set and later unset the Ceph `noout` flag. This flag tells Ceph to not be bothered when an OSD goes down. Without it, which is the default state, Ceph would start re-balancing data onto the still available OSDs once an OSD has been out of the cluster for some time. That’s useful for genuine failures, but during planned maintenance the `noout` flag can be used to tell Ceph that the OSD will be back shortly and it’s going to be okay.
```yaml
- name: set osd noout
  delegate_to: cnc
  become_user: my_user
  command: /home/my_user/.krew/bin/kubectl-rook_ceph --operator-namespace rook-ceph -n rook-cluster ceph osd set noout
```
As you can see here, I’m again delegating execution of the command to my C&C host, as no other host has the necessary k8s certs. This command also needs the absolute path to the binary. This time, that’s not `kubectl` itself, but the binary of the rook-ceph plugin. Normally I would call it with `kubectl rook-ceph ...`, but that does not work with the `command` module, so I give it the absolute path to the plugin binary instead.
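If I ever wanted a sanity check that the flag is actually set before the drain starts, a small extra task could do it. This is a sketch, untested; `ceph osd dump` includes a line listing the currently set cluster flags:

```yaml
- name: verify noout is set
  delegate_to: cnc
  become_user: my_user
  command: /home/my_user/.krew/bin/kubectl-rook_ceph --operator-namespace rook-ceph -n rook-cluster ceph osd dump
  register: osd_dump
  changed_when: false
  # The "flags" line of the dump should contain noout after the pre_task above ran
  failed_when: "'noout' not in osd_dump.stdout"
```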
The next extra, compared to the worker node play, is that I’m actively waiting for the Ceph OSDs to come back. This is important to make sure that I don’t start the updates of the next OSD node before the previous one is back up and running, because otherwise, bad things would happen. For one thing, I’ve got workloads already using the Ceph storage. For another, most of my worker nodes will use storage from Ceph for their root disks.
```yaml
- name: wait for OSDs to start
  delegate_to: cnc
  become_user: my_user
  tags:
    - ceph
  command: /home/my_user/.krew/bin/kubectl-rook_ceph --operator-namespace rook-ceph -n rook-cluster ceph osd status "{{ ansible_hostname }}"
  register: ceph_end
  until: '(ceph_end.stdout | regex_findall(".*,up.*", multiline=True) | list | length) == (ceph_end.stdout_lines | length - 1)'
  retries: 12
  delay: 10
```
This task waits for a maximum of 120 seconds for the node’s OSDs to come up.
The output of `ceph osd status` looks like this:
```
ID  HOST    USED  AVAIL  WR OPS  WR DATA  RD OPS  RD DATA  STATE
 0  nakith  191G  7260G    0        0       0        0     exists,up
 1  nakith  809M  1862G    4      11.1k     5       26     exists,up
 2  neper   391M   931G    0        0       1        0     exists,up
 3  neper   191G  3534G    0        0       0      14.2k   exists,up
```
That output is then parsed with a regex, and the number of lines with `up` in the state column is compared to the total number of output lines minus one, to account for the header line. With the example output above, four lines match `,up` and `stdout_lines` has five entries, so the condition holds and the task succeeds.
Just for completeness’ sake, here is the full Ceph play:
```yaml
- hosts: kube_ceph
  name: Update kubernetes Ceph nodes
  tags:
    - k8s-ceph
  serial: 1
  strategy: linear
  pre_tasks:
    - name: set osd noout
      delegate_to: cnc
      become_user: my_user
      command: /home/my_user/.krew/bin/kubectl-rook_ceph --operator-namespace rook-ceph -n rook-cluster ceph osd set noout
  tasks:
    - name: drain node
      tags:
        - kubernetes
        - ceph
      delegate_to: cnc
      become_user: my_user
      command:
        argv:
          - /home/my_user/.local/bin/kubectl
          - drain
          - --delete-emptydir-data=true
          - --force=true
          - --ignore-daemonsets=true
          - "{{ ansible_hostname }}"
    - name: run apt upgrade
      tags:
        - apt
      apt:
        install_recommends: no
        update_cache: yes
        upgrade: yes
    - name: reboot machine
      tags:
        - reboot
      reboot:
    - name: wait for the machine to accept ansible commands again
      tags:
        - reboot
      wait_for_connection:
    - name: uncordon node
      tags:
        - kubernetes
      delegate_to: cnc
      become_user: my_user
      kubernetes.core.k8s_drain:
        name: "{{ ansible_hostname }}"
        state: uncordon
    - name: wait for OSDs to start
      delegate_to: cnc
      become_user: my_user
      tags:
        - ceph
      command: /home/my_user/.krew/bin/kubectl-rook_ceph --operator-namespace rook-ceph -n rook-cluster ceph osd status "{{ ansible_hostname }}"
      register: ceph_end
      until: '(ceph_end.stdout | regex_findall(".*,up.*", multiline=True) | list | length) == (ceph_end.stdout_lines | length - 1)'
      retries: 12
      delay: 10
    - name: run autoremove
      tags:
        - apt
      apt:
        autoremove: true
    - name: pause for two minutes
      tags:
        - ceph
      pause:
        minutes: 2
  post_tasks:
    - name: unset osd noout
      delegate_to: cnc
      become_user: my_user
      command: /home/my_user/.krew/bin/kubectl-rook_ceph --operator-namespace rook-ceph -n rook-cluster ceph osd unset noout
```
And finally, the control plane nodes. The main addition here is that I’m using Ansible’s `wait_for` module to wait until the CP components are up again. Or, to be more precise, to wait until their ports are open, as I’m not really doing a readiness check. Here is the play:
```yaml
- hosts: kube_controllers
  name: Update k8s controller hosts
  tags:
    - k8s-controller
  serial: 1
  strategy: linear
  tasks:
    - name: drain node
      tags:
        - kubernetes
      delegate_to: cnc
      become_user: my_user
      command:
        argv:
          - /home/my_user/.local/bin/kubectl
          - drain
          - --delete-emptydir-data=true
          - --force=true
          - --ignore-daemonsets=true
          - "{{ ansible_hostname }}"
    - name: run apt upgrade
      tags:
        - apt
      apt:
        install_recommends: no
        update_cache: yes
        upgrade: yes
    - name: reboot machine
      tags:
        - reboot
      reboot:
    - name: wait for the machine to accept ansible commands again
      tags:
        - reboot
      wait_for_connection:
    - name: uncordon node
      tags:
        - kubernetes
      delegate_to: cnc
      become_user: my_user
      kubernetes.core.k8s_drain:
        name: "{{ ansible_hostname }}"
        state: uncordon
    - name: wait for kubelet to start
      tags:
        - kubernetes
      wait_for:
        host: "{{ ansible_default_ipv4.address }}"
        port: 10250
        sleep: 10
        state: started
    - name: wait for kube-apiserver to start
      tags:
        - kubernetes
      wait_for:
        host: "{{ ansible_default_ipv4.address }}"
        port: 6443
        sleep: 10
        state: started
    - name: wait for kube-vip to start
      tags:
        - kubernetes
      wait_for:
        host: "{{ ansible_default_ipv4.address }}"
        port: 2112
        sleep: 10
        state: started
    - name: wait for etcd to start
      tags:
        - kubernetes
      wait_for:
        host: "{{ ansible_default_ipv4.address }}"
        port: 2379
        sleep: 10
        state: started
    - name: wait for ceph mon to start
      tags:
        - ceph
      wait_for:
        host: "{{ ansible_default_ipv4.address }}"
        port: 6789
        sleep: 10
        state: started
    - name: run autoremove
      tags:
        - apt
      apt:
        autoremove: true
    - name: pause for one minute after controller update
      tags:
        - kubernetes
      pause:
        minutes: 1
```
The additional waits for the ports of all the CP components to accept connections are a bit of insurance, to make sure the node is fully up again. This could certainly be improved by checking the Pod status via `kubectl` instead, but this approach has served me well for about a year in my Nomad cluster, so it should be fine here as well.
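If I ever do make that improvement, a simpler variant of the idea, checking the node’s Ready condition instead of individual Pods, might look roughly like this. A sketch, untested, reusing the kubectl path and delegation from the drain task:

```yaml
- name: wait for node to report Ready
  tags:
    - kubernetes
  delegate_to: cnc
  become_user: my_user
  command:
    argv:
      - /home/my_user/.local/bin/kubectl
      - wait
      - --for=condition=Ready
      - --timeout=120s
      - "node/{{ ansible_hostname }}"
```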
And with that, I’ve finally got my Kubernetes nodes in the regular updates as well. It was really high time: I set the nodes up back on the 20th of December and haven’t updated them since. 😬