Wherein I add the Kubernetes nodes to my host update Ansible playbook.
This is part eight of my k8s migration series.
With the number of hosts I’ve now got in my Homelab, I definitely need a better way to update them than manually SSH’ing into each one. So a while ago, I created an Ansible playbook to update all hosts in my Homelab. These updates are also one of the reasons I keep so many physical hosts, even if they’re individually relatively small: I want an environment where I can take down any given host for updates without anything at all breaking, and especially without having to take the entire lab down for a regular host update.
My node updates need to execute the following sequence:
- Drain all Pods from the node
- Run `apt update`
- Run `apt upgrade`
- Reboot the machine
- Uncordon the machine
- Run `apt autoremove`
I’ve got a couple of different classes of nodes in my Homelab, but I will concentrate only on those related to k8s in this post:
- Control plane nodes. These run the kubeadm control plane Pods and Ceph MONs.
- Ceph nodes. These run the Ceph OSDs providing storage to the Homelab and some other Ceph services.
- Worker nodes. These run my Kubernetes workloads.
All three classes make some alterations to the above sequence of steps. Each class of node has its own play, and the plays all run in sequence, not in parallel to each other, to ensure stability of the overall cluster. I’m reasonably sure that with some fancy footwork, I could probably run them in parallel as well. But the main goal of this setup is that I enter a single command, and then I can do something completely different, without having to babysit the update. If it takes an hour longer but I can just go and read something while it’s running, that’s an okay trade-off for me. Those of you following me on Mastodon can probably tell when my update Fridays are just by the volume of posts I make on those evenings. 😅
The first difference between the plays for each class of node is the parallelism inside the playbook. For this, I’m using Ansible’s Linear Strategy. Both the control plane and Ceph nodes run with `serial: 1`, to make sure there are always enough nodes up to keep the Homelab chugging along. The worker nodes, on the other hand, are allowed to run with `serial: 2`, updating two hosts in parallel, as I should have enough slack in the cluster to keep at least most things running even with two fewer nodes.
For draining the nodes, I initially used the `k8s_drain` module. But I had a problem with that one, namely getting `Too Many Requests` errors:
```
fatal: [node1 -> cnc]: FAILED! => {"changed": false, "msg": "Failed to delete pod rook-cluster/rook-ceph-osd-1-7977658495-nt6ps due to: Too Many Requests"}
```
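For reference, the module-based drain task looked roughly like this. This is a sketch rather than my exact task, assuming the documented `kubernetes.core.k8s_drain` options:

```yaml
- name: drain node
  delegate_to: cnc
  become_user: my_user
  kubernetes.core.k8s_drain:
    name: "{{ ansible_hostname }}"
    state: drain
    # Same behavior as the kubectl flags discussed below (option names assumed from the module docs)
    delete_options:
      force: true
      ignore_daemonsets: true
      delete_emptydir_data: true
```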
I didn’t always get the error; sometimes it just worked, and I’m not 100% sure what the trigger is. After spending quite a while googling, I’m still not sure where those errors are even coming from, whether it’s the kube-apiserver returning them or whether it has, for example, something to do with Pod Disruption Budgets. I then switched to executing `kubectl` via the `command` module, and that worked without issue. The task for draining a node looks like this:
```yaml
- name: drain node
  tags:
    - kubernetes
  delegate_to: cnc
  become_user: my_user
  command:
    argv:
      - /home/my_user/.local/bin/kubectl
      - drain
      - --delete-emptydir-data=true
      - --force=true
      - --ignore-daemonsets=true
      - "{{ ansible_hostname }}"
```
You need to supply the absolute path to the `kubectl` binary, as this runs the command directly, not inside a shell, so no PATH extensions and the like. I’m also delegating this task to my Command & Control host, which is the only machine with Kubernetes certs. The `--delete-emptydir-data=true` is needed because Cilium uses `emptyDir` for some temporary storage, and without it, the drain fails. The same is true for `--force`, which is necessary for the drain to go through on nodes with Pods that aren’t managed by a controller like a Deployment or StatefulSet. Finally, `--ignore-daemonsets` is necessary because DaemonSet-managed Pods, in my case for example the Fluentbit log shipper, can’t be evicted by a drain; the flag tells kubectl to skip them instead of aborting.
The rest of the play for my worker nodes looks like this:
```yaml
- hosts: kube_workers
  name: Update kubernetes worker nodes
  tags:
    - k8s-workers
  serial: 2
  strategy: linear
  pre_tasks:
  tasks:
    - name: drain node
      tags:
        - kubernetes
      delegate_to: cnc
      become_user: my_user
      command:
        argv:
          - /home/my_user/.local/bin/kubectl
          - drain
          - --delete-emptydir-data=true
          - --force=true
          - --ignore-daemonsets=true
          - "{{ ansible_hostname }}"
    - name: run apt upgrade
      tags:
        - apt
      apt:
        install_recommends: no
        update_cache: yes
        upgrade: yes
    - name: reboot machine
      tags:
        - reboot
      reboot:
    - name: wait for the machine to accept ansible commands again
      tags:
        - reboot
      wait_for_connection:
    - name: clear OSD blocklist
      become_user: my_user
      delegate_to: cnc
      tags:
        - ceph
      command: /home/my_user/.krew/bin/kubectl-rook_ceph --operator-namespace rook-ceph -n rook-cluster ceph osd blocklist clear
    - name: uncordon node
      tags:
        - kubernetes
      delegate_to: cnc
      become_user: my_user
      kubernetes.core.k8s_drain:
        name: "{{ ansible_hostname }}"
        state: uncordon
    - name: run autoremove
      tags:
        - apt
      apt:
        autoremove: true
    - name: pause for one minute
      tags:
        - kubernetes
      pause:
        minutes: 1
```
The `clear OSD blocklist` task clears Ceph’s client blocklist. Most of my worker nodes don’t have any storage of their own, and instead use netboot and a Ceph RBD volume for their root FS. And sometimes, Ceph puts those clients on a blocklist, as I’ve explained in more detail here. I’m also giving all my plays a pause at the end, to afford the cluster some time to settle again before the next batch of workers is taken down.
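If you want to look before you leap, the blocklist can also just be listed. This is a sketch, untested, using the same plugin path as above and Ceph’s `osd blocklist ls` subcommand:

```yaml
- name: show OSD blocklist
  delegate_to: cnc
  become_user: my_user
  tags:
    - ceph
  # Read-only: just prints the current blocklist entries, so never report a change
  command: /home/my_user/.krew/bin/kubectl-rook_ceph --operator-namespace rook-ceph -n rook-cluster ceph osd blocklist ls
  register: blocklist
  changed_when: false
```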
For the Ceph nodes, I need to get a little bit more involved. I start out with pre and post tasks, which set and later unset the Ceph `noout` flag. This flag tells Ceph to not be bothered when an OSD goes down. Without it, which is the default state, Ceph would start re-balancing data onto the still available OSDs once an OSD has been out of the cluster for some time. That’s useful for genuine failures, but during planned maintenance the `noout` flag can be used to tell Ceph that the OSD will be back shortly and it’s going to be okay.
```yaml
- name: set osd noout
  delegate_to: cnc
  become_user: my_user
  command: /home/my_user/.krew/bin/kubectl-rook_ceph --operator-namespace rook-ceph -n rook-cluster ceph osd set noout
```
As you can see here, I’m again delegating execution of the command to my C&C host, as no other host has the necessary k8s certs. This command also needs the absolute path to the binary. This time, that’s not `kubectl` itself, but the binary of the rook-ceph plugin. Normally I would call it with `kubectl rook-ceph ...`, but that does not work with the `command` module, so I give it the absolute path to the plugin binary instead.
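If I ever wanted a sanity check that the flag is actually set before the drain starts, a small extra task could do it. This is a sketch, untested; `ceph osd dump` includes a line listing the currently set cluster flags:

```yaml
- name: verify noout is set
  delegate_to: cnc
  become_user: my_user
  command: /home/my_user/.krew/bin/kubectl-rook_ceph --operator-namespace rook-ceph -n rook-cluster ceph osd dump
  register: osd_dump
  changed_when: false
  # The "flags" line of the dump should contain noout after the pre_task above ran
  failed_when: "'noout' not in osd_dump.stdout"
```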
The next extra, compared to the worker node play, is that I’m actively waiting for the Ceph OSDs to come back. This is important to make sure that I don’t start the updates of the next OSD node before the previous one is back up and running, because otherwise, bad things would happen. For one thing, I’ve got workloads already using the Ceph storage. For another, most of my worker nodes will use storage from Ceph for their root disks.
```yaml
- name: wait for OSDs to start
  delegate_to: cnc
  become_user: my_user
  tags:
    - ceph
  command: /home/my_user/.krew/bin/kubectl-rook_ceph --operator-namespace rook-ceph -n rook-cluster ceph osd status "{{ ansible_hostname }}"
  register: ceph_end
  until: '(ceph_end.stdout | regex_findall(".*,up.*", multiline=True) | list | length) == (ceph_end.stdout_lines | length - 1)'
  retries: 12
  delay: 10
```
This task waits for a maximum of 120 seconds for the node’s OSDs to come up.
The output of `ceph osd status` looks like this:
```
ID  HOST    USED  AVAIL  WR OPS  WR DATA  RD OPS  RD DATA  STATE
 0  nakith  191G  7260G    0        0       0        0     exists,up
 1  nakith  809M  1862G    4      11.1k     5       26     exists,up
 2  neper   391M   931G    0        0       1        0     exists,up
 3  neper   191G  3534G    0        0       0      14.2k   exists,up
```
That output is then parsed with a regex, and the number of lines with `up` in the state column is compared to the total number of output lines minus one, to account for the header line. With the example output above, four lines match `,up` and `stdout_lines` has five entries, so the condition holds and the task succeeds.
Just for completeness’ sake, here is the full Ceph play:
```yaml
- hosts: kube_ceph
  name: Update kubernetes Ceph nodes
  tags:
    - k8s-ceph
  serial: 1
  strategy: linear
  pre_tasks:
    - name: set osd noout
      delegate_to: cnc
      become_user: my_user
      command: /home/my_user/.krew/bin/kubectl-rook_ceph --operator-namespace rook-ceph -n rook-cluster ceph osd set noout
  tasks:
    - name: drain node
      tags:
        - kubernetes
        - ceph
      delegate_to: cnc
      become_user: my_user
      command:
        argv:
          - /home/my_user/.local/bin/kubectl
          - drain
          - --delete-emptydir-data=true
          - --force=true
          - --ignore-daemonsets=true
          - "{{ ansible_hostname }}"
    - name: run apt upgrade
      tags:
        - apt
      apt:
        install_recommends: no
        update_cache: yes
        upgrade: yes
    - name: reboot machine
      tags:
        - reboot
      reboot:
    - name: wait for the machine to accept ansible commands again
      tags:
        - reboot
      wait_for_connection:
    - name: uncordon node
      tags:
        - kubernetes
      delegate_to: cnc
      become_user: my_user
      kubernetes.core.k8s_drain:
        name: "{{ ansible_hostname }}"
        state: uncordon
    - name: wait for OSDs to start
      delegate_to: cnc
      become_user: my_user
      tags:
        - ceph
      command: /home/my_user/.krew/bin/kubectl-rook_ceph --operator-namespace rook-ceph -n rook-cluster ceph osd status "{{ ansible_hostname }}"
      register: ceph_end
      until: '(ceph_end.stdout | regex_findall(".*,up.*", multiline=True) | list | length) == (ceph_end.stdout_lines | length - 1)'
      retries: 12
      delay: 10
    - name: run autoremove
      tags:
        - apt
      apt:
        autoremove: true
    - name: pause for two minutes
      tags:
        - ceph
      pause:
        minutes: 2
  post_tasks:
    - name: unset osd noout
      delegate_to: cnc
      become_user: my_user
      command: /home/my_user/.krew/bin/kubectl-rook_ceph --operator-namespace rook-ceph -n rook-cluster ceph osd unset noout
```
And finally, the control plane nodes. The main addition here is that I’m using Ansible’s `wait_for` module to wait until the CP components are up again. Or, to be more precise, to wait until their ports are open, as I’m not really doing a readiness check. Here is the play:
```yaml
- hosts: kube_controllers
  name: Update k8s controller hosts
  tags:
    - k8s-controller
  serial: 1
  strategy: linear
  tasks:
    - name: drain node
      tags:
        - kubernetes
      delegate_to: cnc
      become_user: my_user
      command:
        argv:
          - /home/my_user/.local/bin/kubectl
          - drain
          - --delete-emptydir-data=true
          - --force=true
          - --ignore-daemonsets=true
          - "{{ ansible_hostname }}"
    - name: run apt upgrade
      tags:
        - apt
      apt:
        install_recommends: no
        update_cache: yes
        upgrade: yes
    - name: reboot machine
      tags:
        - reboot
      reboot:
    - name: wait for the machine to accept ansible commands again
      tags:
        - reboot
      wait_for_connection:
    - name: uncordon node
      tags:
        - kubernetes
      delegate_to: cnc
      become_user: my_user
      kubernetes.core.k8s_drain:
        name: "{{ ansible_hostname }}"
        state: uncordon
    - name: wait for kubelet to start
      tags:
        - kubernetes
      wait_for:
        host: "{{ ansible_default_ipv4.address }}"
        port: 10250
        sleep: 10
        state: started
    - name: wait for kube-apiserver to start
      tags:
        - kubernetes
      wait_for:
        host: "{{ ansible_default_ipv4.address }}"
        port: 6443
        sleep: 10
        state: started
    - name: wait for kube-vip to start
      tags:
        - kubernetes
      wait_for:
        host: "{{ ansible_default_ipv4.address }}"
        port: 2112
        sleep: 10
        state: started
    - name: wait for etcd to start
      tags:
        - kubernetes
      wait_for:
        host: "{{ ansible_default_ipv4.address }}"
        port: 2379
        sleep: 10
        state: started
    - name: wait for ceph mon to start
      tags:
        - ceph
      wait_for:
        host: "{{ ansible_default_ipv4.address }}"
        port: 6789
        sleep: 10
        state: started
    - name: run autoremove
      tags:
        - apt
      apt:
        autoremove: true
    - name: pause for one minute after controller update
      tags:
        - kubernetes
      pause:
        minutes: 1
```
The additional waits for the ports of all the CP components to accept connections are a bit of insurance, to make sure the node is fully up again. This could certainly be improved by checking the Pod status via `kubectl` instead, but this approach has served me well for about a year in my Nomad cluster, so it should be fine here as well.
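If I ever do make that improvement, a simpler variant of the idea, checking the node’s Ready condition instead of individual Pods, might look roughly like this. A sketch, untested, reusing the kubectl path and delegation from the drain task:

```yaml
- name: wait for node to report Ready
  tags:
    - kubernetes
  delegate_to: cnc
  become_user: my_user
  command:
    argv:
      - /home/my_user/.local/bin/kubectl
      - wait
      - --for=condition=Ready
      - --timeout=120s
      - "node/{{ ansible_hostname }}"
```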
And with that, I’ve finally got my Kubernetes nodes in the regular updates as well. It was really high time: I set the nodes up back on the 20th of December and haven’t updated them since. 😬