Wherein I talk about updating my kubeadm Kubernetes cluster from 1.30 to 1.33 using Ansible.
I’ve been a bit lax on my Kubernetes cluster updates, and I was still running Kubernetes v1.30. I’m also currently on a mission to work through a number of the smaller tasks in my Homelab, paying down a bit of technical debt before tackling the next big projects.
I had already done one update in the past, from my initial Kubernetes 1.29 to 1.30, using an Ansible playbook I wrote to codify the kubeadm upgrade procedure. But I never wrote a proper post about it, which I’m now rectifying.
There were no really big problems - my cluster stayed up the entire time. But there were issues in all three of the updates which might be of interest to at least someone.
The kubeadm cluster update procedure and version skew
The update of a kubeadm cluster is relatively straightforward, but it does require some manual kubeadm actions directly on each node. The documentation can be found here.
Please note: Those instructions are versioned, and may change in the future compared to what I’m describing here. Please make sure you’re reading the instructions pertinent to the version you’re currently running.
The first thing to do is to read the release notes. These are very nicely prepared by the Kubernetes team here, sorted by major version. And I approve of them wholeheartedly. I’ve been known to rant a bit about release engineering and release notes, but there’s nothing to complain about when it comes to Kubernetes. Besides perhaps their length, but that’s to be expected in a project of Kubernetes’ size.
I did not find anything relevant or interesting to me directly in any of the releases, so I won’t go into detail about the changes.
One thing to note, which will bite me later, is the version skew policy. It describes the allowed skew between component versions, most importantly between the kubelet and the kube-apiserver said kubelet is talking to. Namely, the kubelet may lag behind the kube-apiserver by a few minor versions, but it must never be newer than the kube-apiserver. Meaning the kube-apiserver always needs to be updated first. More on this later, when I stumble over this policy.
Here is a short step-by-step of the kubeadm update process, always starting with the control plane nodes:
1. Update kubeadm to the new Kubernetes version
2. On the very first CP node, run kubeadm upgrade apply v1.31.11, for example
3. Then, update kubeadm on the other CP nodes and run kubeadm upgrade node
4. Only after step 3 is completed on all nodes, update the kubelet as well
Steps 1, 3 and 4 are repeated for all non-CP nodes as well. The order of steps
3 and 4 is important: kubeadm upgrade
needs to be run on all CP nodes before
any kubelet is updated. Or at least, that’s true on a High Availability cluster,
where the kube-apiservers are sitting behind a virtual IP. That’s because of
the version skew policy I mentioned above: The kubelet must never be newer than
the kube-apiserver it is talking to. Which makes some sense: The Kubernetes API
is the public API, with stability guarantees, backwards compatibility and such.
So it will likely be able to serve older kubelets just fine, as it will still
support the older APIs that kubelet depends on. But in the other direction, the
newer kubelet may access APIs which older kube-apiservers simply don’t serve
yet.
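For reference, the manual sequence on an Ubuntu node looks roughly like this. This is a sketch following the official kubeadm upgrade docs, with 1.31.11 as an example version:

# on the very first control plane node
apt-mark unhold kubeadm
apt-get update && apt-get install -y kubeadm='1.31.11-*'
apt-mark hold kubeadm
kubeadm upgrade plan            # preview of what the upgrade would do
kubeadm upgrade apply v1.31.11

# on every other control plane node (and later the workers):
# same kubeadm package update, then
kubeadm upgrade node

# only once kubeadm upgrade has run on all control plane nodes:
apt-mark unhold kubelet kubectl
apt-get install -y kubelet='1.31.11-*' kubectl='1.31.11-*'
apt-mark hold kubelet kubectl
systemctl daemon-reload && systemctl restart kubelet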
My cluster update Ansible playbook
As I tend to do, I created an Ansible playbook during the first update, so that I could do something else while the update runs fully automated. That did not work for any of the updates this time around, but I will go into more detail later.
Let’s start with the fact that I’m using Ubuntu Linux as my OS on all of my Homelab hosts, and I’m getting the Kubernetes components from the official apt repos provided by the Kubernetes project. I’m also using cri-o as my container runtime. Until recently, that was also hosted in the k8s.io repos, but has since moved to the openSUSE repos.
Before starting the first tasks, here is my group_vars/all.yml
file:
crio_version_prev: v1.30
kube_version_prev: v1.30
kube_version: v1.31
kube_version_full: 1.31.11
crio_version: v1.31
I’ve stored the versions here, instead of the defaults/main.yml
of the role
because I also use the versions in a few other places, mainly my deployment
roles for configuring new cluster nodes.
But enough prelude, here are the first few tasks from the tasks/main.yml
file:
- name: update kubernetes repo key
  copy:
    src: kubernetes-keyring.gpg
    dest: /usr/share/keyrings/kubernetes.gpg
    owner: root
    group: root
    mode: 0644
- name: remove old kubernetes deb repo
  apt_repository:
    repo: >
      deb [signed-by=/usr/share/keyrings/kubernetes.gpg]
      https://pkgs.k8s.io/core:/stable:/{{ kube_version_prev }}/deb/ /
    state: absent
    filename: kubernetes
  when: ansible_facts['distribution'] == 'Ubuntu'
- name: add kubernetes ubuntu repo
  apt_repository:
    repo: >
      deb [signed-by=/usr/share/keyrings/kubernetes.gpg]
      https://pkgs.k8s.io/core:/stable:/{{ kube_version }}/deb/ /
    state: present
    filename: kubernetes
  when: ansible_facts['distribution'] == 'Ubuntu'
- name: update apt after kubernetes repos changed
  apt:
    update_cache: yes
These deploy the apt key of the k8s.io repo for the main Kubernetes components, remove the repo of the previous version and add the repo of the new version. Finally, an apt cache update is executed to fetch the package lists from the new repo before running any install tasks.
One thing to note here is that I’m manually fetching the Kubernetes repo key and storing it in my Ansible repository, via this command:
curl -fsSL https://pkgs.k8s.io/core:/stable:/v1.31/deb/Release.key | gpg --dearmor -o roles/kube-common/files/kubernetes-keyring.gpg
The next step is updating the kubeadm version:
- name: unpin kubeadm version
  dpkg_selections:
    name: kubeadm
    selection: install
  when: update_kubeadm
- name: update kubeadm
  ansible.builtin.apt:
    name:
      - 'kubeadm={{ kube_version_full }}*'
    state: present
    install_recommends: false
  when: update_kubeadm
- name: pin kubeadm version
  dpkg_selections:
    name: kubeadm
    selection: hold
  when: update_kubeadm
The update_kubeadm variable is necessary because I’m running this role twice for control plane nodes: once to update only kubeadm on all CP nodes, and then again to run the kubelet update. That second run doesn’t need to run the kubeadm update again, hence the variable.
Next is the kubeadm upgrade
invocation, the main part of the cluster update:
- name: run kubeadm update
  command:
    cmd: "kubeadm upgrade node"
  when: not kube_first_node and update_kubeadm
- name: run kubeadm update
  command:
    cmd: "kubeadm upgrade apply -y v{{ kube_version_full }}"
  when: kube_first_node and update_kubeadm
There are two variants of this task, depending on whether kube_first_node is set or not. This is necessary because only the first CP node to be updated needs to run upgrade apply -y v<NEW_VERSION>. All other CP nodes and all non-CP nodes just run upgrade node. Again, this setup using variables is mostly because, in principle, the update steps are the same for all nodes in the cluster. So it made more sense to have one role where I can switch some tasks on and off, rather than having multiple roles which each repeat a lot of their respective tasks.
The kubeadm update includes updating the control plane components: kube-apiserver,
kube-controller-manager and kube-scheduler as well as etcd. All of these are
static Pods, whose definitions are controlled by kubeadm.
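The manifests for these static Pods live in /etc/kubernetes/manifests on each control plane node; a quick manual look there (not part of the playbook) shows whether kubeadm actually rewrote them:

ls -l /etc/kubernetes/manifests/
# kubeadm-managed manifests: etcd.yaml, kube-apiserver.yaml,
# kube-controller-manager.yaml, kube-scheduler.yaml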
The next step is updating the kubelet and kubectl on the nodes, which starts with draining the node:
- name: drain node
  tags:
    - kubernetes
    - ceph
  delegate_to: candc
  become_user: myuser
  command:
    argv:
      - kubectl
      - drain
      - --delete-emptydir-data=true
      - --force=true
      - --ignore-daemonsets=true
      - "{{ ansible_hostname }}"
  when: update_non_kubeadm
Here is the second variable I’m using to restrict which tasks of the role are
executed for a particular host, the update_non_kubeadm
variable. It indicates
that all tasks not related to the kubeadm update are to be executed.
This command is not issued on the node itself, but rather on my command and
control host, which also runs the Ansible playbook.
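As an aside, the drain could also be done with the kubernetes.core.k8s_drain module instead of shelling out to kubectl. A sketch of what that task might look like (untested; I’m sticking with the kubectl variant for now):

- name: drain node
  delegate_to: candc
  become_user: myuser
  kubernetes.core.k8s_drain:
    name: "{{ ansible_hostname }}"
    state: drain
    delete_options:
      # same behavior as the kubectl flags above
      ignore_daemonsets: true
      delete_emptydir_data: true
      force: true
  when: update_non_kubeadm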
Then comes the update of cri-o:
- name: remove previous kube cri-o repo
  apt_repository:
    repo: >
      deb [signed-by=/usr/share/keyrings/libcontainers-crio-keyring.gpg]
      https://download.opensuse.org/repositories/isv:/cri-o:/stable:/{{ crio_version_prev }}/deb/ /
    state: absent
    filename: libcontainers-crio
  when: ansible_facts['distribution'] == 'Ubuntu' and update_non_kubeadm
- name: add libcontainers cri-o repo key
  copy:
    src: libcontainers-crio-keyring.gpg
    dest: /usr/share/keyrings/libcontainers-crio-keyring.gpg
    owner: root
    group: root
    mode: 0644
  when: update_non_kubeadm
- name: add kube cri-o repo
  apt_repository:
    repo: >
      deb [signed-by=/usr/share/keyrings/libcontainers-crio-keyring.gpg]
      https://download.opensuse.org/repositories/isv:/cri-o:/stable:/{{ crio_version }}/deb/ /
    state: present
    filename: libcontainers-crio
  when: ansible_facts['distribution'] == 'Ubuntu' and update_non_kubeadm
- name: update apt after cri-o repos changed
  apt:
    update_cache: yes
  when: update_non_kubeadm
- name: update cri-o
  ansible.builtin.apt:
    name:
      - cri-o
      - cri-tools
    state: latest
    install_recommends: false
  when: update_non_kubeadm
- name: autostart cri-o
  ansible.builtin.systemd_service:
    name: crio
    enabled: true
    state: started
  when: update_non_kubeadm
This is similar to the initial Kubernetes repo setup. Please note that for versions 1.30 to 1.32, cri-o lived in the k8s.io repos, but it has since moved to the openSUSE repos.
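The libcontainers-crio-keyring.gpg file is produced the same way as the Kubernetes key above, with something along these lines (assuming the same role files directory as before):

curl -fsSL https://download.opensuse.org/repositories/isv:/cri-o:/stable:/v1.31/deb/Release.key | gpg --dearmor -o roles/kube-common/files/libcontainers-crio-keyring.gpg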
Once cri-o is updated, the last part of the role is updating kubectl and kubelet:
- name: unpin kubelet version
  dpkg_selections:
    name: kubelet
    selection: install
  when: update_non_kubeadm
- name: update kubelet
  ansible.builtin.apt:
    name:
      - 'kubelet={{ kube_version_full }}*'
    state: present
    install_recommends: false
  when: update_non_kubeadm
- name: pin kubelet version
  dpkg_selections:
    name: kubelet
    selection: hold
  when: update_non_kubeadm
- name: unpin kubectl version
  dpkg_selections:
    name: kubectl
    selection: install
  when: update_non_kubeadm
- name: update kubectl
  ansible.builtin.apt:
    name:
      - 'kubectl={{ kube_version_full }}*'
    state: present
    install_recommends: false
  when: update_non_kubeadm
- name: pin kubectl version
  dpkg_selections:
    name: kubectl
    selection: hold
  when: update_non_kubeadm
- name: restart kubelet
  systemd_service:
    name: kubelet
    daemon_reload: true
    state: restarted
  when: update_non_kubeadm
And finally, the node is uncordoned:
- name: uncordon node
  delegate_to: candc
  become_user: myuser
  kubernetes.core.k8s_drain:
    name: "{{ ansible_hostname }}"
    state: uncordon
  when: update_non_kubeadm
This task is again delegated to my command and control host, meaning it is not executed on the remote host by my Ansible user. Instead, for every host, it runs on a central machine which has the necessary permissions and kubeconfig to actually talk to the cluster.
The role I’ve described above is then used in a playbook running it against
the different groups of hosts in my Homelab. First is one of the control plane
hosts, running the required first kubeadm upgrade apply -y <NEW_KUBE_VERSION>
command, which only needs to be run on the first control plane node:
- hosts: firstcp
  name: Update first kubernetes controller kubeadm
  tags:
    - k8s-update-kubeadm-first
  serial: 1
  strategy: linear
  tasks:
    - name: include cluster upgrade role
      include_role:
        name: kube-cluster-upgrade
      vars:
        kube_first_node: true
        update_kubeadm: true
        update_non_kubeadm: false
    - name: pause for two minutes
      tags:
        - kubernetes
      pause:
        minutes: 2
Notably, this run gets the kube_first_node
variable set, but doesn’t run the
non-kubeadm updates, meaning the kubelet update, yet.
Next come the remaining control plane nodes:
- hosts: kube_controllers:!firstcp
  name: Update other kubernetes controllers kubeadm
  tags:
    - k8s-update-kubeadm
  serial: 1
  strategy: linear
  tasks:
    - name: include cluster upgrade role
      include_role:
        name: kube-cluster-upgrade
      vars:
        update_kubeadm: true
        update_non_kubeadm: false
    - name: pause for two minutes
      tags:
        - kubernetes
      pause:
        minutes: 2
These nodes don’t have kube_first_node set, so they execute the kubeadm upgrade node command. Here, too, update_non_kubeadm is false, meaning the kubelets are not updated yet. Without this, there’s a danger that an already-updated kubelet would talk to a kube-apiserver which hasn’t been updated yet, potentially leading to errors.
After the kubeadm update follows the kubelet update for the controller nodes:
- hosts: kube_controllers
  name: Update kubernetes controllers
  tags:
    - k8s-update-controllers
  serial: 1
  strategy: linear
  tasks:
    - name: include cluster upgrade role
      include_role:
        name: kube-cluster-upgrade
      vars:
        update_kubeadm: false
        update_non_kubeadm: true
    - name: wait for vault to be running
      tags:
        - kubernetes
      delegate_to: candc
      become_user: myuser
      kubernetes.core.k8s_info:
        kind: Pod
        namespace: vault
        label_selectors:
          - app.kubernetes.io/name=vault
          - app.kubernetes.io/instance=vault
        field_selectors:
          - "spec.nodeName={{ ansible_hostname }}"
        wait: true
        wait_condition:
          status: "True"
          type: "Ready"
        wait_sleep: 10
        wait_timeout: 300
      register: vault_pod_list
    - name: unseal vault prompt
      tags:
        - vault
      pause:
        echo: true
        prompt: "Please unseal vault: k exec -it -n vault {{ vault_pod_list.resources[0].metadata.name }} -- vault operator unseal"
    - name: pause for two minutes
      tags:
        - kubernetes
      pause:
        minutes: 2
This runs the role with update_kubeadm: false but update_non_kubeadm: true, so the kubeadm update is skipped, as it was already run in the previous play, and the kubelet is updated instead. This is safe to do now, because all kube-apiservers have been updated to the new version at this point.
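A quick manual check to keep an eye on the version skew between plays (not part of the playbook) is listing the kubelet versions the nodes report:

kubectl get nodes -o custom-columns=NAME:.metadata.name,KUBELET:.status.nodeInfo.kubeletVersion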
I’m running a two minute pause task at the end of each play, to give the cluster
a bit of time to start all Pods again.
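The fixed pause is admittedly a bit crude. A more targeted alternative would be waiting for the node to report Ready again, roughly like this (a sketch using kubernetes.core.k8s_info, not something I’m running at the moment):

- name: wait for node to be Ready again
  delegate_to: candc
  become_user: myuser
  kubernetes.core.k8s_info:
    kind: Node
    name: "{{ ansible_hostname }}"
    wait: true
    wait_condition:
      type: Ready
      status: "True"
    wait_sleep: 10
    wait_timeout: 300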
This kubelet update step also contains some handling of my Vault containers, which
are running on the control plane nodes. They need to be manually unsealed
when they’re restarted.
Next up are the Ceph nodes, which I do not throw together with the rest of the worker nodes as they need to be run one at a time, to prevent storage downtime.
- hosts: kube_ceph
  name: Update kubernetes Ceph nodes
  tags:
    - k8s-update-ceph
  serial: 1
  strategy: linear
  pre_tasks:
    - name: set osd noout
      delegate_to: candc
      become_user: myuser
      command: /home/myuser/.krew/bin/kubectl-rook_ceph --operator-namespace rook-ceph -n rook-cluster ceph osd set noout
  tasks:
    - name: include cluster upgrade role
      include_role:
        name: kube-cluster-upgrade
      vars:
        update_kubeadm: true
        update_non_kubeadm: true
    - name: wait for OSDs to start
      delegate_to: candc
      become_user: myuser
      tags:
        - ceph
      command: /home/myuser/.krew/bin/kubectl-rook_ceph --operator-namespace rook-ceph -n rook-cluster ceph osd status "{{ ansible_hostname }}" --format json
      register: ceph_end
      until: "(ceph_end.stdout | trim | from_json | community.general.json_query('OSDs[*].state') | select('contains', 'up') | length) == (ceph_end.stdout | trim | from_json | community.general.json_query('OSDs[*]') | length)"
      retries: 12
      delay: 10
    - name: pause for two minutes
      tags:
        - ceph
      pause:
        minutes: 2
  post_tasks:
    - name: unset osd noout
      delegate_to: candc
      become_user: myuser
      command: /home/myuser/.krew/bin/kubectl-rook_ceph --operator-namespace rook-ceph -n rook-cluster ceph osd unset noout
I’m also setting the noout
flag for Ceph. This ensures that Ceph doesn’t start
automatic rebalancing when the OSDs on the upgraded host temporarily go down.
In addition, I’m waiting for the OSDs on each host to be up again before continuing
to the next host, to prevent storage issues.
Last but not least are my worker nodes:
- hosts: kube_workers
  name: Update kubernetes worker nodes
  tags:
    - k8s-update-workers
  serial: 2
  strategy: linear
  pre_tasks:
  tasks:
    - name: include cluster upgrade role
      include_role:
        name: kube-cluster-upgrade
      vars:
        update_kubeadm: true
        update_non_kubeadm: true
    - name: pause for one minute
      tags:
        - kubernetes
      pause:
        minutes: 1
Nothing special about these. In contrast to all the other plays, I’m running two hosts through this one in parallel, because I currently have enough slack in the cluster to tolerate the loss of two workers.
So now let me tell you how that beautiful theory I laid out up to now actually worked in practice. 😁
A tale of three updates
I upgraded from Kubernetes 1.30 all the way to 1.33. None of the three went through without at least one issue.
Updating from 1.30 to 1.31
This one was the most complicated when it came to fixing the issue. I started it with the previous iteration of my update playbook, which still fully updated each control plane node in turn. So it first ran the kubeadm update on one node and then immediately followed that up with updating the kubelet on that same node. Right on the first node, I was greeted with these errors for a number of the Pods:
NAMESPACE NAME READY STATUS RESTARTS AGE
fluentbit fluentbit-fluent-bit-km8r7 0/1 CreateContainerConfigError 0 38m
kube-system cilium-98hzq 0/1 Init:CreateContainerConfigError 0 14m
kube-system cilium-envoy-tklh7 0/1 CreateContainerConfigError 0 40m
kube-system etcd-firstcp 1/1 Running 2 (35m ago) 35m
kube-system kube-apiserver-firstcp 1/1 Running 2 (35m ago) 35m
kube-system kube-controller-manager-firstcp 1/1 Running 0 35m
kube-system kube-scheduler-firstcp 1/1 Running 0 35m
kube-system kube-vip-firstcp 1/1 Running 0 35m
rook-ceph rook-ceph.cephfs.csi.ceph.com-nodeplugin-bnmsd 0/3 CreateContainerConfigError 0 38m
rook-ceph rook-ceph.rbd.csi.ceph.com-nodeplugin-hq82g 0/3 CreateContainerConfigError 0 38m
Note the error in the STATUS
of all of the non-kube Pods. I had never heard
of a CreateContainerConfigError
before, so I went to Google and found this issue. It identified the problem pretty clearly, and the Kubernetes maintainers helpfully pointed to the version skew policy.
After reading said policy multiple times, I finally realized what my error was and
updated my Ansible playbook to first update all kubeadm versions on all CP nodes
and only then start updating the kubelet. I got the error fixed by just running
the kubeadm update on the other two control plane nodes as well.
After that, the rest of the update went through without a hitch.
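An easy way to double-check that all kube-apiservers are actually on the new version before any kubelet gets touched is looking at the images of their static Pods (a manual check, not part of the playbook):

kubectl -n kube-system get pods -l component=kube-apiserver \
  -o custom-columns=NAME:.metadata.name,IMAGE:.spec.containers[0].image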
Updating from 1.31 to 1.32
In this one I stumbled over the fact that I hadn’t fully understood the release notes for 1.32, or rather their implications. Specifically, this point in the 1.32 release notes:
kubeadm: kubeadm upgrade node now supports addon and post-upgrade phases. Users can use kubeadm upgrade node phase addon to execute the addon upgrade, or use kubeadm upgrade node --skip-phases addon to skip the addon upgrade. If you were previously skipping an addon subphase on kubeadm init you should now skip the same addon when calling kubeadm upgrade apply and kubeadm upgrade node. Currently, the post-upgrade phase is no-op, and it is mainly used to handle some release-specific post-upgrade tasks.
So basically, addons, like kube-proxy for example, had been ignored during updates
up to this point. Which is why my updates worked up to now. But in 1.32,
the kubeadm upgrade
command gained the ability to also update addons. And
seemingly also deploy them if they’re not present, because I suddenly found
kube-proxy Pods on my nodes after the upgrade.
I did not use kube-proxy, because I was using Cilium’s kube-proxy replacement.
I had disabled kube-proxy in my InitConfiguration
like this:
skipPhases:
- "addon/kube-proxy"
But, the InitConfiguration isn’t read during updates, and it seems that kubeadm
doesn’t transfer this setting into the kubeadm-config
ConfigMap during cluster
creation. So kubeadm upgrade
didn’t have any idea that it should be skipping
the addon, and happily deployed it on my nodes.
Luckily for me, it didn’t seem to interfere with anything, and my cluster didn’t just collapse in on itself. I removed them all with the handy instructions from the Cilium docs:
kubectl -n kube-system delete ds kube-proxy
kubectl -n kube-system delete cm kube-proxy
To prevent any further issues, I edited the kubeadm-config ConfigMap, which can be viewed like this:
kubectl get -n kube-system configmaps kubeadm-config -o yaml
And added an entry proxy.disabled: true
to it. With this, the problem did not
occur again during the subsequent 1.33 update.
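For reference, the relevant part of the ClusterConfiguration stored in that ConfigMap then looks roughly like this (if I read the kubeadm v1beta4 config API correctly; all other fields stay as they are):

apiVersion: kubeadm.k8s.io/v1beta4
kind: ClusterConfiguration
# ... other fields unchanged ...
proxy:
  disabled: true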
Updating from 1.32 to 1.33
The last one. I was hoping it would go through without an issue, to at least have one successful update during which I could move away from the computer and read a bit, but no such luck.
During the update of the cri-o repository for 1.33, I got this error:
Failed to update apt cache: E:Failed to fetch https://pkgs.k8s.io/addons:/cri-o:/stable:/v1.33/deb/InRelease 403 Forbidden [IP: 3.167.227.100 443]
This was because cri-o’s repos moved from k8s.io to openSUSE, see for example this issue. The adaptation was pretty simple: I just needed to change the address in my playbook.
After that fix, the update ran through without any further issues and I was finally done. Cost me almost a day of work, but alas, most of the issues were of my own making.
Increased memory requests?
And finally for something amusing. When I looked at my Homelab dashboard on the morning after the upgrade, I found that the memory requests for my worker nodes were suddenly in the red, with almost 83% of available capacity used:
Resource usage the morning after the update. This shows the sum of resource requests on Pods divided by the overall resources of the group of nodes.
Thinking that the update must have changed something in how the memory utilization was computed, or that some Deployment had perhaps increased its memory requests after the update, I looked through my metrics, but wasn’t able to find anything.
After some additional checking, I finally found the issue in how I was computing the values for the metric:
(
  sum(
    kube_pod_container_resource_requests{resource="memory"}
    and
    on(pod) (kube_pod_status_phase{phase="Running"} == 1)
    unless
    on(node) (kube_node_spec_taint{})
  )
)
/
sum(
  (
    kube_node_status_capacity{resource="memory"}
    unless
    on(node) kube_node_spec_taint{}
  )
)
So I’m using kube_pod_container_resource_requests for the memory resource, but only for Pods on nodes which have no taint. That is then divided by the memory capacity of all nodes which don’t have a taint. I use the taint as the filter because it was readily available in the Prometheus data, and my worker nodes are the only ones without a taint applied to them, so it seemed like a convenient way to select them.
What I did not consider: There are a few non-catastrophic taints which Kubernetes applies automatically, in my case the disk pressure taint. This simply happened because the disks were getting a bit full on a few worker nodes due to the many node drains and subsequent rescheduling of Pods. So there were a lot more unused images lying around locally than is normally the case.
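One way to make the query more robust (a sketch I haven’t battle-tested yet) would be to ignore the node.kubernetes.io/* taints that Kubernetes adds and removes on its own, so that only the permanent taints exclude a node:

(
  sum(
    kube_pod_container_resource_requests{resource="memory"}
    and
    on(pod) (kube_pod_status_phase{phase="Running"} == 1)
    unless
    on(node) (kube_node_spec_taint{key!~"node.kubernetes.io/.*"})
  )
)
/
sum(
  (
    kube_node_status_capacity{resource="memory"}
    unless
    on(node) kube_node_spec_taint{key!~"node.kubernetes.io/.*"}
  )
)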
I was quite amused with myself when I realized that I had just spent half an hour staring at completely the wrong plots. 😁
And that’s it. Here’s to hoping that the next Kubernetes update is not interesting enough to blog about.