This is the third part of my k8s migration series.
This time, I will be talking about using Cilium as the load balancer for my Kubernetes cluster with L2 announcements.
A couple of days ago, I was working on setting up my Traefik ingress for the
cluster. While doing so, I yet again had to do a couple of things that just felt
weird and hacky. The most prominent of those was using hostPort a lot when
setting up the pod.
In addition, I would also pin the Traefik pod to a specific host and provide a DNS entry for that host, all hardcoded.
All of this has a couple of downsides. First, if that ingress host running
Traefik is down, so is my entire cluster, at least as seen from the outside.
Second, the combination of hostPort and a fixed host also has a problem with
the RollingUpdate strategy. Because the ports and the host are fixed, Kubernetes
cannot start a fresh pod before the old pod has been killed.
More generally speaking, there’s also the fact that most examples and tutorials,
as well as most Helm chart defaults, assume that LoadBalancer type services
are available.
And with what?
Initially, I looked at two potential load balancer implementations. These were kube-vip and MetalLB. I was initially leaning towards kube-vip, if for no other reason than that I had kube-vip already running on my control plane nodes, providing the VIP for the k8s API endpoint.
But while researching, I found out that newer versions of Cilium also had load balancer functionality. Reading through it, it sounded like it had all the features I wanted. Its biggest advantage is the simple fact that it doesn’t need me to install any additional components into the Kubernetes cluster. It’s just a couple of configuration changes in Cilium, plus two more manifests.
Interlude: Migrating the Cilium install to Helm
Before I started, I decided to change my Cilium install approach. Up to now, I had Cilium installed via the Cilium CLI, as described in their Quick Start Guide.
There is one pretty big downside to this approach in my mind: It relies on manual invocations of a tool, with a specific set of parameters. It’s also not simple to put under version control properly. Sure, I could always create a bash script which contains the entire invocation with the right parameters, but that’s just not too nice.
So instead of having to document somewhere with which command line parameters I needed to invoke the Cilium CLI, I switched it all over to Helm and Helmfile, so now it’s treated like everything else in the cluster.
The migration was pretty painless, because in the background, the Cilium CLI already just calls Helm.
So for the migration, I first needed to get the translation of the command line parameters into the Helm values for my running install. That can be done with Helm like this:
helm get values cilium -n kube-system -o yaml
I then put those values into a values.yaml file for use with Helmfile.
The Helmfile config looks like this:
- name: cilium
- name: cilium
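A minimal sketch of how such a Helmfile can be laid out, for orientation only. The two name entries and the kube-system namespace are taken from this post; the repository URL is the upstream Cilium chart repository, while the values file path and the version hint are assumptions and not copied from my actual config:

```yaml
# helmfile.yaml (sketch, not the exact file from my setup)
repositories:
  - name: cilium
    url: https://helm.cilium.io/     # upstream Cilium Helm repository
releases:
  - name: cilium
    namespace: kube-system           # namespace of the existing Cilium install
    chart: cilium/cilium
    # version: pin this to the chart version of the running install
    values:
      - ./cilium.yaml                # assumed path to the values file
```

Keeping the release name and namespace identical to the existing install is what lets Helm treat this as an upgrade of the release the Cilium CLI created.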
The cilium.yaml values file looks like this:
With this config, there’s no redeployment necessary, because it is equivalent to what the Cilium CLI does.
Cilium L2 announcements setup
Cilium (and load balancers in general, it seems) have two modes for announcing IPs of services. The more complex one is the BGP mode. In this mode, Cilium would announce routes to the exposed services. This needs an environment where BGP is configured. I decided to skip this approach, as my network knowledge in general isn’t that great. I’ve only got a relatively hazy idea what the BGP protocol even does.
So I settled on the simpler approach, L2 Announcements. In this approach, all Cilium nodes in the cluster take part in a leader election for each of the services which should be exposed and receive a virtual IP. The node which wins the election then answers any ARP requests asking for the MAC address of the node with the service virtual IP. The node then regularly renews a lease in Kubernetes to signal to all other nodes in the cluster that it’s still there. If a lease isn’t renewed in a certain time frame, another node takes over the ARP announcements.
One consequence of this approach is the fact that this is not true load balancing. All traffic for a given service will always arrive at one specific node. According to the documentation, this is different when using the BGP approach, as that approach does provide true load balancing. But what the L2 announcements approach does provide is failover, and this is all that I really care about for my setup, at least for now.
The first step in enabling L2 announcements is to enable the Helm option:
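As a minimal sketch, assuming the key name from the upstream Cilium Helm chart, the addition to the values file looks roughly like this:

```yaml
# cilium.yaml (excerpt, sketch)
l2announcements:
  enabled: true   # turns on the L2 announcement feature in the agents
```

The Cilium documentation also lists kube-proxy replacement as a prerequisite for this feature, so it is worth checking that the existing values already enable it.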
Once that was done, I had the problem that nothing seemed to happen at all.
It turns out that the Helm options are written into a ConfigMap in the Cilium
Helm chart, which is then read by the Cilium pods. And the pods are not
restarted automatically. So to get the option to take any effect, I had to
run the following two commands after deploying the updated Helm chart:
kubectl rollout restart -n kube-system deployment cilium-operator
kubectl rollout restart daemonset -n kube-system cilium
Then the option was active. You can see the active options in the log output
of the cilium pods if you ever want to check what the pods are actually
running with.
If anybody out there has any idea what I might have done wrong, needing those
rollout restart calls, please ping me on Mastodon.
But still, nothing happens just from enabling the option. There are two manifests which need to be deployed.
Load balancer IP pools
First, a CiliumLoadBalancerIPPool manifest needs to be deployed. This manifest
controls the pools of IPs which are handed out to LoadBalancer type services.
In my setup, the manifest looks something like this:
- cidr: "10.86.5.80/28"
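For context, the cidr entry sits inside a manifest of roughly the following shape. This is a sketch, assuming the v2alpha1 API and the cidrs field name used by the Cilium release that introduced L2 announcements; the pool name is hypothetical:

```yaml
apiVersion: "cilium.io/v2alpha1"
kind: CiliumLoadBalancerIPPool
metadata:
  name: example-pool          # hypothetical name
spec:
  cidrs:                      # newer Cilium releases call this field "blocks"
    - cidr: "10.86.5.80/28"   # 16 addresses: 10.86.5.80 to 10.86.5.95
```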
It defines a relatively small IP range, as I don’t expect to expose too many services. Most of what I will expose will run through the ingress service. Documentation on the pools and additional options can be found here.
L2 announcement policies
The second piece of config is the configuration for which services should get
an IP and which nodes should do the L2 announcements. This is done via a
CiliumL2AnnouncementPolicy manifest, which is documented here.
For me, the config looks like this:
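A sketch of such a policy, assuming the v2alpha1 API. The policy name and both label keys are hypothetical stand-ins, not the labels from my cluster:

```yaml
apiVersion: "cilium.io/v2alpha1"
kind: CiliumL2AnnouncementPolicy
metadata:
  name: example-l2-policy                # hypothetical name
spec:
  nodeSelector:
    matchLabels:
      example.com/role: worker           # hypothetical label on the worker nodes
  serviceSelector:
    matchLabels:
      example.com/l2-announce: "true"    # hypothetical label on exposed services
  loadBalancerIPs: true                  # announce IPs handed out from the pool
```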
This restricts the announcements to only happen from my worker nodes, not from the control plane or Ceph nodes.
In addition, I’m adding a serviceSelector here, so that only certain services
get an IP and are announced. This is necessary due to this bug.
The bug leads to all services being considered for L2 announcements, regardless
of whether they are of type LoadBalancer or not. This doesn’t make much
sense at all, and also costs performance, which I will go into in a later section.
With all of that config done, let’s have a look at an example. I used the following deployment:
- name: echo-server
- name: http-port
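A sketch of what a Deployment with these two names can look like. The container name, port name, port 8080 and the testsetup namespace come from this post; the Deployment name, the app label and especially the image are placeholders:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: echo-server                 # hypothetical name
  namespace: testsetup
spec:
  replicas: 1
  selector:
    matchLabels:
      app: echo-server              # hypothetical label
  template:
    metadata:
      labels:
        app: echo-server
    spec:
      containers:
        - name: echo-server
          image: registry.example.com/echo-server:latest   # placeholder image
          ports:
            - name: http-port
              containerPort: 8080   # port the echo server listens on
```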
This is just a simple echo server which returns a bit of information on the HTTP request it received. Then this is the service for exposing that pod:
- name: http-port
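A sketch of a matching Service. The name, namespace, type and ports are taken from this post; the label, the pod selector and the hostname are hypothetical. The label stands in for whatever the serviceSelector of the announcement policy matches, and the annotation key is the standard external-dns one:

```yaml
apiVersion: v1
kind: Service
metadata:
  name: testsetup-service
  namespace: testsetup
  labels:
    example.com/l2-announce: "true"   # hypothetical label for the serviceSelector
  annotations:
    external-dns.alpha.kubernetes.io/hostname: echo.example.com   # hypothetical hostname
spec:
  type: LoadBalancer
  selector:
    app: echo-server                  # hypothetical pod label
  ports:
    - name: http-port
      protocol: TCP
      port: 80                        # port on the service IP
      targetPort: 8080                # port the echo server listens on in the pod
```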
As noted above, only services carrying the label matched by the serviceSelector
are handled by the Cilium L2 announcements. In addition, I’m supplying the
service with an external-dns hostname to get an automated DNS entry.
In short, any requests which reach the service IP on port 80 are forwarded
to port 8080 in the pod, which is where the echo-server is listening.
One very important thing to note: Use curl for testing! Ping won’t work,
as the service IP does not answer to ICMP echo requests.
When starting to debug, first check whether the service got an IP assigned:
kubectl get -n testsetup service
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
testsetup-service LoadBalancer 10.7.174.128 10.86.5.93 80:32206/TCP 14h
The important part here is the EXTERNAL-IP column.
Next, check whether a Kubernetes lease has been created for the service,
signaling that a node is announcing it:
kubectl get -n kube-system leases.coordination.k8s.io
NAME HOLDER AGE
cilium-l2announce-testsetup-testsetup-service sehith 13h
You can also use arping to check whether there’s anyone announcing the IP:
58 bytes from 00:16:3e:17:a4:31 (10.86.5.93): index=0 time=253.747 usec
Important to note: arping will only work from within the same subnet, as ARP
is a layer 2 protocol. Ask me how much time I spent trying to figure out why
I didn’t get an answer on an arping from a separate subnet. 😉
One last point I’ve got to bring up is the efficiency of Cilium’s L2 load balancer approach.
As noted, this bug made Cilium announce every service in my cluster initially,
not just the ones of type LoadBalancer.
This produced quite a high load increase on one of my control plane nodes.
The CPU load on this 4 core control node was increased by about 2% during the time where Cilium had to announce the 5 services I had defined in my cluster. This is most likely all API server/etcd load, as Cilium uses Kubernetes' leases functionality. For every L2 announcement, all nodes continuously check whether the current lease holder is still holding the lease, so that another node can take over if the one which previously did the announcement for the service failed for some reason.
This 2% load increase was from only five services with three nodes in the cluster. My cluster will very likely end up with 9 worker nodes in the end, and possibly more than 5 services. I really don’t like where that might lead.
I will have to keep my eye on this while I migrate more hosts and services over from Nomad. If it gets too bad, I will have to return to this topic and try out MetalLB, or potentially go ahead and have a look at BGP after all.