Wherein I talk about updating my kubeadm Kubernetes cluster from 1.30 to 1.33 using Ansible.

I’ve been a bit lax on my Kubernetes cluster updates, and I was still running Kubernetes v1.30. I’m also currently on a mission to knock out a number of smaller tasks in my Homelab, paying down a bit of technical debt before tackling the next big projects.

I had already done one update in the past, from my initial Kubernetes 1.29 to 1.30, using an Ansible playbook I wrote to codify the kubeadm upgrade procedure. But I never wrote a proper post about it, which I’m now rectifying.

There were no really big problems - my cluster stayed up the entire time. But there were issues in all three of the updates which might be of interest to at least someone.

The kubeadm cluster update procedure and version skew

The update of a kubeadm cluster is relatively straightforward, but it does require some manual kubeadm actions directly on each node. The documentation can be found here.

Please note: Those instructions are versioned, and may change in the future compared to what I’m describing here. Please make sure you’re reading the instructions pertinent to the version you’re currently running.

The first thing to do is to read the release notes. These are very nicely prepared by the Kubernetes team here, sorted by major version. And I approve of them wholeheartedly. I’ve been known to rant a bit about release engineering and release notes, but there’s nothing to complain about when it comes to Kubernetes. Besides perhaps their length, but that’s to be expected in a project of Kubernetes’ size.

I did not find anything relevant or interesting to me directly in any of the releases, so I won’t go into detail about the changes.

One thing to note, which will bite me later, is the version skew policy. It describes the allowed skew between versions, most importantly between the kubelet and the kube-apiserver said kubelet is talking to. Namely, the versions between the two can skew at most by a single minor version, and the kubelet must not be newer than the kube-apiserver. Meaning the kube-apiserver always needs to be updated first. More on this later, when I stumble over this policy.
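
As a quick sanity check before and during an update, the kubelet versions and the API server version can be compared directly; a minimal sketch:

# the VERSION column shows the kubelet version of each node
kubectl get nodes -o wide
# the Server Version line shows the kube-apiserver currently answering behind the VIP
kubectl version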

Here is a short step-by-step of the kubeadm update process, always starting with the control plane nodes:

  1. Update kubeadm to the new Kubernetes version
  2. On the very first CP node, run kubeadm upgrade apply v1.31.11, for example
  3. Then, update kubeadm on the other CP nodes and run kubeadm upgrade node
  4. Only after step 3 is completed on all CP nodes, update the kubelet as well

The same procedure, minus the upgrade apply step, is then repeated for all non-CP nodes. The order of steps 3 and 4 is important: kubeadm upgrade needs to be run on all CP nodes before any kubelet is updated. Or at least, that’s true on a High Availability cluster, where the kube-apiservers are sitting behind a virtual IP. That’s because of the version skew policy I mentioned above: The kubelet must never be newer than the kube-apiserver it is talking to. Which makes some sense: The Kubernetes API is the public API, with stability guarantees, backwards compatibility and such. So it will likely be able to serve older kubelets just fine, as it will still support the older APIs that kubelet depends on. But in the other direction, the newer kubelet may access APIs which older kube-apiservers simply don’t serve yet.
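
Spelled out as shell commands, the procedure on a single node looks roughly like this (a sketch along the lines of the kubeadm docs for my Ubuntu/apt setup, with the version pinning via apt-mark left out and <node> as a placeholder):

# 1. update kubeadm itself
apt-get update && apt-get install -y kubeadm='1.31.11-*'
# 2. only on the very first control plane node
kubeadm upgrade apply v1.31.11
# 3. on every other node, control plane and workers alike
kubeadm upgrade node
# 4. once all CP nodes are done: drain, update and restart the kubelet, uncordon
kubectl drain <node> --ignore-daemonsets --delete-emptydir-data
apt-get install -y kubelet='1.31.11-*' kubectl='1.31.11-*'
systemctl daemon-reload && systemctl restart kubelet
kubectl uncordon <node>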

My cluster update Ansible playbook

As I tend to do, I created an Ansible playbook during the first update, so that I could do something else while the update runs fully automated. That did not work for any of the updates this time around, but I will go into more detail later.

Let’s start with the fact that I’m using Ubuntu Linux as my OS on all of my Homelab hosts, and I’m getting the Kubernetes components from the official apt repos provided by the Kubernetes project. I’m also using cri-o as my container runtime. Until recently, that was also hosted in the k8s.io repos, but has since moved to the openSUSE repos.

Before starting the first tasks, here is my group_vars/all.yml file:

crio_version_prev: v1.30
kube_version_prev: v1.30
kube_version: v1.31
kube_version_full: 1.31.11
crio_version: v1.31

I’ve stored the versions here, instead of in the role’s defaults/main.yml, because I also use them in a few other places, mainly my deployment roles for configuring new cluster nodes.

But enough prelude, here are the first few tasks from the tasks/main.yml file:

- name: update kubernetes repo key
  copy:
    src: kubernetes-keyring.gpg
    dest: /usr/share/keyrings/kubernetes.gpg
    owner: root
    group: root
    mode: 0644
- name: remove old kubernetes deb repo
  apt_repository:
    repo: >
      deb [signed-by=/usr/share/keyrings/kubernetes.gpg]
      https://pkgs.k8s.io/core:/stable:/{{ kube_version_prev }}/deb/ /      
    state: absent
    filename: kubernetes
  when: ansible_facts['distribution'] == 'Ubuntu'
- name: add kubernetes ubuntu repo
  apt_repository:
    repo: >
      deb [signed-by=/usr/share/keyrings/kubernetes.gpg]
      https://pkgs.k8s.io/core:/stable:/{{ kube_version }}/deb/ /      
    state: present
    filename: kubernetes
  when: ansible_facts['distribution'] == 'Ubuntu'
- name: update apt after kubernetes repos changed
  apt:
    update_cache: yes

These tasks deploy the apt key of the k8s.io repo for the main Kubernetes components, remove the repo of the previous version and add the repo of the new version. Finally, an apt cache update is executed to fetch the package lists from the new repo before running any install tasks.

One thing to note here is that I’m manually fetching the Kubernetes repo key and storing it in my Ansible repository beforehand, via this command:

curl -fsSL https://pkgs.k8s.io/core:/stable:/v1.31/deb/Release.key | gpg --dearmor -o roles/kube-common/files/kubernetes-keyring.gpg

The next step is updating the kubeadm version:

- name: unpin kubeadm version
  dpkg_selections:
    name: kubeadm
    selection: install
  when: update_kubeadm
- name: update kubeadm
  ansible.builtin.apt:
    name:
      - 'kubeadm={{ kube_version_full }}*'
    state: present
    install_recommends: false
  when: update_kubeadm
- name: pin kubeadm version
  dpkg_selections:
    name: kubeadm
    selection: hold
  when: update_kubeadm

The update_kubeadm variable is necessary because I’m running this role twice for control plane nodes: once to update only kubeadm on all CP nodes, and then again to run the kubelet update. That second run doesn’t need to run the kubeadm update again, hence the variable.

Next is the kubeadm upgrade invocation, the main part of the cluster update:

- name: run kubeadm update
  command:
    cmd: "kubeadm upgrade node"
  when: not kube_first_node and update_kubeadm
- name: run kubeadm update
  command:
    cmd: "kubeadm upgrade apply -y v{{ kube_version_full }}"
  when: kube_first_node and update_kubeadm

There are two variants of this task, depending on whether kube_first_node is set or not. This is necessary because only the first CP node updated needs to run upgrade apply -y v<NEW_VERSION>. All other CP nodes and all non-CP nodes just run upgrade node. Again, this setup using variables is mostly because in principle, the update steps are the same for all nodes in the cluster. So it made more sense to have one role where I could switch some tasks on/off, rather than having multiple roles which each repeat a lot of their respective tasks. The kubeadm update includes updating the control plane components: kube-apiserver, kube-controller-manager and kube-scheduler, as well as etcd. All of these are static Pods, whose definitions are controlled by kubeadm.
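
The playbook doesn’t do this, but before kicking it off it can be worth letting kubeadm preview the upgrade manually on the first CP node; a small sketch:

# show the target version and the component versions kubeadm would install
kubeadm upgrade plan
# optionally, simulate the apply without changing any cluster state
kubeadm upgrade apply v1.31.11 --dry-run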

The next step is updating the kubelet and kubectl on the nodes, which starts with draining the node:

- name: drain node
  tags:
    - kubernetes
    - ceph
  delegate_to: candc
  become_user: myuser
  command:
    argv:
      - kubectl
      - drain
      - --delete-emptydir-data=true
      - --force=true
      - --ignore-daemonsets=true
      - "{{ ansible_hostname }}"
  when: update_non_kubeadm

Here is the second variable I’m using to restrict which tasks of the role are executed for a particular host: update_non_kubeadm. It indicates that all tasks not related to the kubeadm update should be executed. The drain command is not issued on the node itself, but rather on my command and control host, which also runs the Ansible playbook.
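
After the drain task, the node should be empty apart from DaemonSet Pods; a quick manual check from the same command and control host might look like this (not part of the playbook, <node-name> is a placeholder):

# the drained node should report "Ready,SchedulingDisabled"
kubectl get node <node-name>
# list whatever is still running on it
kubectl get pods -A --field-selector spec.nodeName=<node-name>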

Then comes the update of cri-o:

- name: remove previous kube cri-o repo
  apt_repository:
    repo: >
      deb [signed-by=/usr/share/keyrings/libcontainers-crio-keyring.gpg]
      https://download.opensuse.org/repositories/isv:/cri-o:/stable:/{{ crio_version_prev }}/deb/ /      
    state: absent
    filename: libcontainers-crio
  when: ansible_facts['distribution'] == 'Ubuntu' and update_non_kubeadm
- name: add libcontainers cri-o repo key
  copy:
    src: libcontainers-crio-keyring.gpg
    dest: /usr/share/keyrings/libcontainers-crio-keyring.gpg
    owner: root
    group: root
    mode: 0644
  when: update_non_kubeadm
- name: add kube cri-o repo
  apt_repository:
    repo: >
      deb [signed-by=/usr/share/keyrings/libcontainers-crio-keyring.gpg]
      https://download.opensuse.org/repositories/isv:/cri-o:/stable:/{{ crio_version }}/deb/ /      
    state: present
    filename: libcontainers-crio
  when: ansible_facts['distribution'] == 'Ubuntu' and update_non_kubeadm
- name: update apt after cri-o repos changed
  apt:
    update_cache: yes
  when: update_non_kubeadm
- name: update cri-o
  ansible.builtin.apt:
    name:
      - cri-o
      - cri-tools
    state: latest
    install_recommends: false
  when: update_non_kubeadm
- name: autostart cri-o
  ansible.builtin.systemd_service:
    name: crio
    enabled: true
    state: started
  when: update_non_kubeadm

This is similar to the initial Kubernetes repo setup. Please note that for versions 1.30 through 1.32, cri-o lived in the k8s.io repos, but was then moved to the openSUSE repos.

Once cri-o is updated, the last part of the role is updating kubectl and kubelet:

- name: unpin kubelet version
  dpkg_selections:
    name: kubelet
    selection: install
  when: update_non_kubeadm
- name: update kubelet
  ansible.builtin.apt:
    name:
      - 'kubelet={{ kube_version_full }}*'
    state: present
    install_recommends: false
  when: update_non_kubeadm
- name: pin kubelet version
  dpkg_selections:
    name: kubelet
    selection: hold
  when: update_non_kubeadm
- name: unpin kubectl version
  dpkg_selections:
    name: kubectl
    selection: install
  when: update_non_kubeadm
- name: update kubectl
  ansible.builtin.apt:
    name:
      - 'kubectl={{ kube_version_full }}*'
    state: present
    install_recommends: false
  when: update_non_kubeadm
- name: pin kubectl version
  dpkg_selections:
    name: kubectl
    selection: hold
  when: update_non_kubeadm
- name: restart kubelet
  systemd_service:
    name: kubelet
    daemon_reload: true
    state: restarted
  when: update_non_kubeadm

And finally, the node is uncordoned:

- name: uncordon node
  delegate_to: candc
  become_user: myuser
  kubernetes.core.k8s_drain:
    name: "{{ ansible_hostname }}"
    state: uncordon
  when: update_non_kubeadm

This task is again delegated to my command and control host, which means it is not executed on the remote host by my Ansible user. Instead, for every host, the Kubernetes API call is made from a central host which has the necessary permissions and credentials to actually talk to the cluster.

The role I’ve described above is then used in a playbook running it against the different groups of hosts in my Homelab. First is one of the control plane hosts, running the required first kubeadm upgrade apply -y <NEW_KUBE_VERSION> command, which only needs to be run on the first control plane node:

- hosts: firstcp
  name: Update first kubernetes controller kubeadm
  tags:
    - k8s-update-kubeadm-first
  serial: 1
  strategy: linear
  tasks:
    - name: include cluster upgrade role
      include_role:
        name: kube-cluster-upgrade
      vars:
        kube_first_node: true
        update_kubeadm: true
        update_non_kubeadm: false
    - name: pause for two minutes
      tags:
        - kubernetes
      pause:
        minutes: 2

Notably, this run gets the kube_first_node variable set, but doesn’t yet run the non-kubeadm updates, meaning the kubelet update. Next come the remaining control plane nodes:

- hosts: kube_controllers:!firstcp
  name: Update other kubernetes controllers kubeadm
  tags:
    - k8s-update-kubeadm
  serial: 1
  strategy: linear
  tasks:
    - name: include cluster upgrade role
      include_role:
        name: kube-cluster-upgrade
      vars:
        update_kubeadm: true
        update_non_kubeadm: false
    - name: pause for two minutes
      tags:
        - kubernetes
      pause:
        minutes: 2

These nodes don’t have kube_first_node set, so they execute the kubeadm upgrade node command. Here, too, update_non_kubeadm is false, meaning the kubelets are not updated yet. This is necessary because otherwise there would be a danger that an already-updated kubelet talks to a kube-apiserver which hasn’t been updated yet, potentially leading to errors.

After the kubeadm update follows the kubelet update for the controller nodes:

- hosts: kube_controllers
  name: Update kubernetes controllers
  tags:
    - k8s-update-controllers
  serial: 1
  strategy: linear
  tasks:
    - name: include cluster upgrade role
      include_role:
        name: kube-cluster-upgrade
      vars:
        update_kubeadm: false
        update_non_kubeadm: true
    - name: wait for vault to be running
      tags:
        - kubernetes
      delegate_to: candc
      become_user: myuser
      kubernetes.core.k8s_info:
        kind: Pod
        namespace: vault
        label_selectors:
          - app.kubernetes.io/name=vault
          - app.kubernetes.io/instance=vault
        field_selectors:
          - "spec.nodeName={{ ansible_hostname }}"
        wait: true
        wait_condition:
          status: "True"
          type: "Ready"
        wait_sleep: 10
        wait_timeout: 300
      register: vault_pod_list
    - name: unseal vault prompt
      tags:
        - vault
      pause:
        echo: true
        prompt: "Please unseal vault: k exec -it -n vault {{ vault_pod_list.resources[0].metadata.name }} -- vault operator unseal"
    - name: pause for two minutes
      tags:
        - kubernetes
      pause:
        minutes: 2

This runs the role with update_kubeadm: false but update_non_kubeadm: true, so the kubeadm update is skipped, as it was already run in the previous plays, and instead the kubelet is updated. This is safe to do now, because all kube-apiservers have been updated to the new version at this point. I’m running a two-minute pause task at the end of each play, to give the cluster a bit of time to start all Pods again. This kubelet update step also contains some handling for my Vault containers, which run on the control plane nodes and need to be manually unsealed whenever they’re restarted.
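
To confirm that the unseal actually worked before letting the play continue, Vault’s status can be checked from the same host; a small sketch, with <vault-pod> standing in for the Pod name from the wait task above:

# "Sealed" should read "false" once the unseal is complete
kubectl exec -it -n vault <vault-pod> -- vault status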

Next up are the Ceph nodes, which I don’t lump in with the rest of the worker nodes, as they need to be updated one at a time to prevent storage downtime:

- hosts: kube_ceph
  name: Update kubernetes Ceph nodes
  tags:
    - k8s-update-ceph
  serial: 1
  strategy: linear
  pre_tasks:
    - name: set osd noout
      delegate_to: candc
      become_user: myuser
      command: /home/myuser/.krew/bin/kubectl-rook_ceph --operator-namespace rook-ceph -n rook-cluster ceph osd set noout
  tasks:
    - name: include cluster upgrade role
      include_role:
        name: kube-cluster-upgrade
      vars:
        update_kubeadm: true
        update_non_kubeadm: true
    - name: wait for OSDs to start
      delegate_to: candc
      become_user: myuser
      tags:
        - ceph
      command: /home/myuser/.krew/bin/kubectl-rook_ceph --operator-namespace rook-ceph -n rook-cluster ceph osd status "{{ ansible_hostname }}" --format json
      register: ceph_end
      until: "(ceph_end.stdout | trim | from_json | community.general.json_query('OSDs[*].state') | select('contains', 'up') | length) == (ceph_end.stdout | trim | from_json | community.general.json_query('OSDs[*]') | length)"
      retries: 12
      delay: 10
    - name: pause for two minutes
      tags:
        - ceph
      pause:
        minutes: 2
  post_tasks:
    - name: unset osd noout
      delegate_to: candc
      become_user: myuser
      command: /home/myuser/.krew/bin/kubectl-rook_ceph --operator-namespace rook-ceph -n rook-cluster ceph osd unset noout

I’m also setting the noout flag for Ceph. This ensures that Ceph doesn’t start automatic rebalancing when the OSDs on the upgraded host temporarily go down. In addition, I’m waiting for the OSDs on each host to be up again before continuing to the next host, to prevent storage issues.
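
To double-check that the flag is really set while the play runs, the same rook-ceph krew plugin can be queried by hand; a quick sketch:

# the flags line of the OSD map should list "noout" for the duration of the update
/home/myuser/.krew/bin/kubectl-rook_ceph --operator-namespace rook-ceph -n rook-cluster ceph osd dump | grep flags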

Last but not least are my worker nodes:

- hosts: kube_workers
  name: Update kubernetes worker nodes
  tags:
    - k8s-update-workers
  serial: 2
  strategy: linear
  pre_tasks:
  tasks:
    - name: include cluster upgrade role
      include_role:
        name: kube-cluster-upgrade
      vars:
        update_kubeadm: true
        update_non_kubeadm: true
    - name: pause for one minute
      tags:
        - kubernetes
      pause:
        minutes: 1

Nothing special about these. In contrast to all the other plays, I’m running two hosts through this one in parallel, because I currently have enough slack in the cluster to tolerate the loss of two workers.

So now let me tell you how that beautiful theory I laid out up to now actually worked in practice. 😁

A tale of three updates

I upgraded from Kubernetes 1.30 all the way to 1.33. None of the three went through without at least one issue.

Updating from 1.30 to 1.31

This one was the most complicated when it came to fixing the issue. I started it with the previous iteration of my update playbook, which still fully updated each control plane node in turn. So it first ran the kubeadm update on one node and then immediately followed that up with updating the kubelet on that same node. Right on the first node, I was greeted with these errors for a number of the Pods:

NAMESPACE     NAME                                             READY   STATUS                            RESTARTS      AGE
fluentbit     fluentbit-fluent-bit-km8r7                       0/1     CreateContainerConfigError        0             38m
kube-system   cilium-98hzq                                     0/1     Init:CreateContainerConfigError   0             14m
kube-system   cilium-envoy-tklh7                               0/1     CreateContainerConfigError        0             40m
kube-system   etcd-firstcp                                     1/1     Running                           2 (35m ago)   35m
kube-system   kube-apiserver-firstcp                           1/1     Running                           2 (35m ago)   35m
kube-system   kube-controller-manager-firstcp                  1/1     Running                           0             35m
kube-system   kube-scheduler-firstcp                           1/1     Running                           0             35m
kube-system   kube-vip-firstcp                                 1/1     Running                           0             35m
rook-ceph     rook-ceph.cephfs.csi.ceph.com-nodeplugin-bnmsd   0/3     CreateContainerConfigError        0             38m
rook-ceph     rook-ceph.rbd.csi.ceph.com-nodeplugin-hq82g      0/3     CreateContainerConfigError        0             38m

Note the error in the STATUS column of all the Pods that aren’t kubeadm-managed control plane components. I had never heard of a CreateContainerConfigError before, so I went to Google and found this issue. It identified the problem pretty clearly, and the Kubernetes maintainers helpfully pointed to the version skew policy. After reading said policy multiple times, I finally realized what my error was and updated my Ansible playbook to first update kubeadm on all CP nodes and only then start updating the kubelets. The immediate fix was simply running the kubeadm update on the other two control plane nodes as well.
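
In hindsight, a quick way to see how far the control plane has progressed is to look at the image tags of the static kube-apiserver Pods, which kubeadm labels with their component name; a sketch:

# one line per control plane node with the kube-apiserver image (and thus version) it runs
kubectl get pods -n kube-system -l component=kube-apiserver \
  -o custom-columns='NODE:.spec.nodeName,IMAGE:.spec.containers[0].image'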

After that, the rest of the update went through without a hitch.

Updating from 1.31 to 1.32

In this one I stumbled over the fact that I hadn’t fully understood the release notes for 1.32, or rather their implications. Specifically, this point in the 1.32 release notes:

kubeadm: kubeadm upgrade node now supports addon and post-upgrade phases. Users can use kubeadm upgrade node phase addon to execute the addon upgrade, or use kubeadm upgrade node --skip-phases addon to skip the addon upgrade. If you were previously skipping an addon subphase on kubeadm init you should now skip the same addon when calling kubeadm upgrade apply and kubeadm upgrade node. Currently, the post-upgrade phase is no-op, and it is mainly used to handle some release-specific post-upgrade tasks.

So basically, addons like kube-proxy had been ignored during updates up to this point, which is why my previous updates worked. But in 1.32, the kubeadm upgrade command gained the ability to also update addons - and seemingly to deploy them if they’re not present, because I suddenly found kube-proxy Pods on my nodes after the upgrade.

I don’t use kube-proxy, because I’m running Cilium’s kube-proxy replacement, and I had disabled kube-proxy in my InitConfiguration like this:

skipPhases:
  - "addon/kube-proxy"

But the InitConfiguration isn’t read during updates, and it seems that kubeadm doesn’t transfer this setting into the kubeadm-config ConfigMap during cluster creation. So kubeadm upgrade had no idea that it should be skipping the addon, and happily deployed it on my nodes.

Luckily for me, it didn’t seem to interfere with anything, and my cluster didn’t just collapse in on itself. I removed them all with the handy instructions from the Cilium docs:

kubectl -n kube-system delete ds kube-proxy
kubectl -n kube-system delete cm kube-proxy

To prevent any further issues, I edited the kubeadm-config ConfigMap:

kubectl get -n kube-system configmaps kubeadm-config -o yaml

And added an entry proxy.disabled: true to the ClusterConfiguration stored in it. With this, the problem did not occur again during the subsequent 1.33 update.
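
For reference, the relevant part of the ClusterConfiguration stored in that ConfigMap then looks roughly like this (a sketch, assuming the v1beta4 kubeadm config format):

apiVersion: kubeadm.k8s.io/v1beta4
kind: ClusterConfiguration
# ... the rest of the existing cluster configuration ...
proxy:
  disabled: true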

Updating from 1.32 to 1.33

The last one. I was hoping it would go through without an issue, to at least have one successful update during which I could move away from the computer and read a bit, but no such luck.

During the update of the cri-o repository for 1.33, I got this error:

Failed to update apt cache: E:Failed to fetch https://pkgs.k8s.io/addons:/cri-o:/stable:/v1.33/deb/InRelease  403  Forbidden [IP: 3.167.227.100 443]

This was because cri-o’s repos moved from k8s.io to openSUSE; see for example this issue. The adaptation was pretty simple: I just needed to change the repo address in my playbook.
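
The playbook shown above already contains the new address; before the fix, the cri-o repo line still pointed at the old pkgs.k8s.io location. The change boils down to swapping the base URL in the repo definition:

# old, cri-o packages up to v1.32 (hosted on pkgs.k8s.io)
https://pkgs.k8s.io/addons:/cri-o:/stable:/{{ crio_version }}/deb/ /
# new, from v1.33 on (hosted on openSUSE's build service)
https://download.opensuse.org/repositories/isv:/cri-o:/stable:/{{ crio_version }}/deb/ /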

After that fix, the update ran through without any further issues and I was finally done. Cost me almost a day of work, but alas, most of the issues were of my own making.

Increased memory requests?

And finally for something amusing. When I looked at my Homelab dashboard on the morning after the upgrade, I found that the memory requests for my worker nodes were suddenly in the red, with almost 83% of available capacity used:

A screenshot of several Grafana gauge visualizations. They show the utilization of memory and CPU resource usage in my k8s cluster, as measured by looking at the total resource requests from all Pods in the cluster. There are three gauges, one for each of my node groups, 'Control Plane', 'Ceph' and 'Workers'. Interesting here are the values for the 'Workers' group, which show 72.5% for the CPU resource consumption and 82.8% for the memory resource consumption.

Resource usage the morning after the update. This shows the sum of resource requests on Pods divided by the overall resources of the group of nodes.

Normally, the memory utilization sits at around 60%.

Thinking that the update must have changed something in how the memory utilization was computed, or that some Deployment had increased its memory requests after the update, I looked through my metrics, but wasn’t able to find anything.

After some additional checking, I finally found the issue in how I was computing the values for the metric:

sum(
    kube_pod_container_resource_requests{resource="memory"}
  and on(pod)
    (kube_pod_status_phase{phase="Running"} == 1)
  unless on(node)
    kube_node_spec_taint{}
)
/
sum(
    kube_node_status_capacity{resource="memory"}
  unless on(node)
    kube_node_spec_taint{}
)

So I’m using kube_pod_container_resource_requests for the memory resource, but only for Pods running on nodes which don’t have any taint. Then I divide that by the memory capacity of all nodes which don’t have a taint. I filter on the taint because it’s readily available in the Prometheus data, and my worker nodes are normally the only ones without a taint applied to them, so it made sense to use it to single them out.

What I did not consider: there are a few non-catastrophic taints which Kubernetes applies automatically, in my case the disk-pressure taint. This simply happened because the disks were getting a bit full on a few worker nodes, due to the many node drains and subsequent reschedules of Pods. So there were a lot more unused images lying around locally than is normally the case.
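
Listing the taints directly would have made the culprit obvious right away; a quick sketch:

# shows every node together with the keys of its current taints;
# the transient one in my case was node.kubernetes.io/disk-pressure
kubectl get nodes -o custom-columns='NAME:.metadata.name,TAINTS:.spec.taints[*].key'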

I was quite amused with myself when I realized that I had just spent half an hour staring at completely the wrong plots. 😁

And that’s it. Here’s to hoping that the next Kubernetes update is not interesting enough to blog about.