Wherein I talk about migrating from Cilium’s L2 announcements for
services to BGP.
BGP instead of L2 announcements?
In the last post, I described my setup to make
LoadBalancer type services
functional in my k8s Homelab with Cilium’s L2 Announcements
feature. While working on the next part of my Homelab, introducing Ingress with
Traefik, I ran into the issue that the source IP is not necessarily preserved
during in-cluster routing.
By default, packets which arrive on a node which doesn’t have a pod of the target service are forwarded to a node which has such a pod. During that forwarding, source NAT is applied to the packet, overwriting the source IP with the IP of the node where it originally arrived. This is also described in the Kubernetes docs.
This is true for both NodePort and
LoadBalancer services. I see this as a
problem specifically for Ingress proxies, as it prevents stuff like IP allow lists
and any other IP dependent functionality in the proxy. All packets would look
like they’re coming from a cluster node. With Cilium’s L2 announcements, they
would all have the source IP of the node which is currently announcing the
service IP.
This can be fixed with a config option on Kubernetes services, namely
externalTrafficPolicy: Local. This has the effect that packets are not
forwarded to another node if the one they arrive on doesn’t have a pod of the
target service. The default mode is
Cluster, where packets are forwarded to
other nodes, but with the downside of sNAT.
Now, at some point, while reading into L2 announcements and the
externalTrafficPolicy option, I read that the
Local setting doesn’t work properly with the ARP based
L2 announcements.
But now, I can’t find that anywhere anymore. 😔
This was my main trigger, but there are a couple of additional downsides of the L2 announcements feature. First, it produces a lot of load on the kube-apiserver. I went into a bit of detail in my previous post.
Then there’s the fact that with the L2 announcements feature, there’s also no
real load balancing. Due to how ARP works, there can only ever be one node which
announces the service IP, and so only that node will ever receive traffic for
that service.
Combined with what I previously wrote, this also means that if you want to have
a service with preserved source IPs and multiple pods, you’re out of luck. With
externalTrafficPolicy: Local, packets will never be forwarded to another node’s
pod, regardless of how many there are. The current announcer will have to carry
all of the load, and any other pods on other nodes will only ever be idle.
To be entirely honest, that’s not going to be too much of a problem in my Homelab. I’m currently running exactly no jobs with more than one replica. But hey, who knows? At some point, my writing might really take off and I might need three instances serving my blog. 😉
So instead of the ARP based L2 announcements, it’s now going to be Cilium’s beta BGP control plane feature.
I really don’t know enough about the protocol, so I’m not going to annoy you with my 1 day old half-knowledge here.
Suffice it to say that with BGP, routers can exchange routes, mostly telling their peers which networks they can reach.
In the Kubernetes
LoadBalancer application, Cilium will announce routes to
LoadBalancer service IPs through a group of cluster nodes.
A route announcement could look like this:
10.86.55.1/32 via 10.86.5.206
That would tell the peer that it can reach the service IP 10.86.55.1 via
the Kubernetes host 10.86.5.206. Here, the 10.86.5.206 host is hanging off
of a switch directly connected to my router, so the router already knows how
to reach it. With the above announcement, it now also knows to forward packets
for 10.86.55.1 to 10.86.5.206, where Cilium will then forward them to
a pod of the target service.
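To make the router's decision a bit more concrete, here is a minimal Python sketch of a longest-prefix-match lookup over a tiny routing table. It is only an illustration, not how OPNsense or FRRouting implement it. The /24 for the node subnet is an assumption (the post only shows single addresses from it), and the names ROUTES and next_hop are made up for this example.

```python
import ipaddress

# Hypothetical routing table. The 10.86.5.0/24 node subnet is an assumption;
# the /32 entry is the BGP announcement shown above.
ROUTES = [
    # (destination network, next hop; None means directly connected)
    (ipaddress.ip_network("10.86.5.0/24"), None),
    (ipaddress.ip_network("10.86.55.1/32"), ipaddress.ip_address("10.86.5.206")),
]


def next_hop(destination, routes=ROUTES):
    """Return where a packet for `destination` is sent.

    The most specific (longest prefix) matching route wins.
    """
    dest = ipaddress.ip_address(destination)
    matches = [(net, hop) for net, hop in routes if dest in net]
    if not matches:
        return "unreachable"
    _, hop = max(matches, key=lambda match: match[0].prefixlen)
    return "direct" if hop is None else str(hop)
```

With this table, a packet for the service IP 10.86.55.1 is handed to 10.86.5.206, while 10.86.5.206 itself is reached directly. Any other address in 10.86.55.0/24 has no route until Cilium announces it.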
One of the advantages over the Layer 2 ARP protocol used by L2 announcements is that a completely different, non-routable subnet can be used for the service IPs.
There are two parts to the setup: configuring the router and configuring Cilium.
One thing to decide on before continuing is the Autonomous System Number.
This number is an identifier for autonomous networks. Similar to IPs, there is
a range of ASNs for private usage which will never be handed out to the public
Internet. It is the range
64512–65534. For more info, have a look at the ASN
table on Wikipedia.
While you can use different ASNs for the router and Cilium, it is not necessary,
and I will continue with the same ASN,
64555, for both.
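If you want to double check an ASN before typing it into the router and Cilium, the range above is easy to encode. This is just a small sketch. The function name is made up, and it only covers the range mentioned in this post.

```python
# Private ASN range as given above: 64512 to 65534, both inclusive.
PRIVATE_ASN_RANGE = range(64512, 65535)


def is_private_asn(asn: int) -> bool:
    """Return True if `asn` lies in the private range mentioned in this post."""
    return asn in PRIVATE_ASN_RANGE
```

For the ASN used here, is_private_asn(64555) is True.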
The first step to using BGP is setting it up on the router. I’m using OPNsense here and will describe the setup. If you’re using a different router, you can adapt the instructions.
To set up BGP in the router, you need a piece of software which listens on port
179 by default, receiving route announcements from peers and sending route
announcements to them.
OPNsense uses a plugin which installs FRRouting,
which can also be used standalone if you are for example running a Linux host
as a router.
Once you’ve enabled BGP, you will need to add all the k8s nodes you would like to participate in BGP as peers to the router. At least in OPNsense, this means simply adding the node’s routable IP and the Cilium ASN as the node’s ASN.
One very important point that cost me quite some time: Don’t forget to make sure
that the Kubernetes cluster nodes participating in BGP can actually reach port
179/TCP on your router. I spent quite a while trying to figure out why my
router and Cilium wouldn’t peer. 😑
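A quick way to rule out the firewall is to try opening a TCP connection to port 179 from one of the Kubernetes nodes. Here is a minimal Python sketch using only the standard library. The function name is made up. It only tells you whether the TCP connection can be opened, not whether a BGP session would actually come up.

```python
import socket


def bgp_port_reachable(host: str, port: int = 179, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and unreachable hosts.
        return False


# Example, run on a Kubernetes node (10.86.5.254 is my router):
# print(bgp_port_reachable("10.86.5.254"))
```

If this returns False on a node while the BGP daemon on the router is running, the firewall rules are the first place I would look.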
For OPNsense, the first step is to install the
os-frr plugin, which is OPNsense’s way to install FRRouting.
Once that’s done, a new top level menu entry called
Routing will appear.
Note: This is not the
Then, enable the general routing functionality, which starts the necessary daemons:
Save after you’ve checked
Next, go to BGP and enable it there as well. In
BGP AS Number, enter the ASN
you chose from the private range.
As I don’t need OPNsense redistributing any routes, I’ve left the
route redistribution setting at Nothing selected. I’ve left the
Network field empty as well.
The next step is adding the neighbors. For each of the Kubernetes hosts which
should announce routes, click on the
+ in the bottom right corner of the
Neighbors tab and enter the following information:
- A description so you know which host it is. I’m just using the hostname
- Peer-IP: the IP of the Kubernetes host
- Remote AS: the ASN you chose from the private range
- Update-Source Interface: the interface from which the Kubernetes host is reachable
I left all the checkboxes unchecked, and did not set anything in the
Here I’ve got a question for my readers: Isn’t there a better way than adding every single Kubernetes worker host as a peer here? It just feels like unnecessary manual work, but I didn’t find any other info on it.
With all of that done, the router config is complete.
As noted above, don’t forget to open port
179/TCP on your firewall!
I encountered an error later, when I really started using the Cilium LB. I’ve described it in this post.
In short, if you have a situation like this:
- LoadBalancer service set up as described in this post
- Host in the same subnet as your Kubernetes nodes trying to use the LoadBalancer service
- LoadBalancer IPs assigned from a different subnet than those hosts
You will end up with asymmetric routing. Your packets from the host accessing the service will go through OPNsense, as the packets need to be routed. But the return path of the packets will be direct, as the k8s nodes and the host using the service are in the same subnet.
You will then need to do the following:
- Switch the “State Type” for all rules allowing access from the subnet to the LoadBalancer IPs to “sloppy state”, as OPNsense will only ever see one side of a connection attempt and consequently block the connection
- Create an OUTGOING firewall rule which allows the k8s subnet to access the LoadBalancer IP as well as an INCOMING rule. I’m not sure why this works right now, but it seems to be necessary, at least in my setup.
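The condition for the asymmetric routing described above can be written down as a small check. This is a sketch under assumptions: the 10.86.5.0/24 node subnet and the client address 10.86.5.50 are made up for illustration, and the function name is hypothetical. Only the LoadBalancer IP 10.86.55.5 comes from my setup.

```python
import ipaddress


def is_asymmetric(client_ip: str, node_subnet: str, lb_ip: str) -> bool:
    """Return True if requests pass the router but replies can bypass it.

    That is the case when the client shares a subnet with the k8s nodes
    (so replies go directly to it) while the LoadBalancer IP lives in a
    different subnet (so requests have to be routed).
    """
    subnet = ipaddress.ip_network(node_subnet)
    client_in_node_subnet = ipaddress.ip_address(client_ip) in subnet
    lb_in_node_subnet = ipaddress.ip_address(lb_ip) in subnet
    return client_in_node_subnet and not lb_in_node_subnet


# Hypothetical client next to the nodes: the router sees only one direction.
print(is_asymmetric("10.86.5.50", "10.86.5.0/24", "10.86.55.5"))
# Hypothetical client in another subnet: both directions pass the router.
print(is_asymmetric("10.86.1.10", "10.86.5.0/24", "10.86.55.5"))
```

Only clients for which this returns True need the sloppy state rules. Everything coming from other subnets is routed in both directions anyway.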
The documentation for the Cilium BGP feature can be found here.
The first step of the setup is enabling the BGP functionality. As I’m using
Helm to deploy Cilium, I’m enabling it in my Helm values.
Similar to the L2 announcement, the BGP functionality needs something which
hands out IP addresses to the
LoadBalancer services. This can be done with
Cilium’s Load Balancer IPAM.
As I’ve noted above, because BGP in contrast to L2 ARP announces routes, it is
easier to choose a CIDR which does not overlap with the subnet the Kubernetes nodes
are located in. In my case, the
CiliumLoadBalancerIPPool looks like this:
- cidr: "10.86.55.0/24"
I’ve chosen only a single
/24, as I don’t expect to ever reach 254 LoadBalancer
services. Most of my services will run through my Traefik Ingress instead of being
exposed directly.
The second part of the Cilium config is the BGP peering policy. It sets up the details of how to peer, what to announce and with whom the peering should happen.
For me, it looks like this:
- localASN: 64555
- peerAddress: '10.86.5.254/32'
A couple of things to note: There can be multiple neighbors that Cilium peers
with. In my case though, I’ve only got the one OPNsense router, which is
reachable at 10.86.5.254 from the Kubernetes nodes. I’m using the same ASN
as I used for the router’s BGP setup,
64555. I didn’t see any reason why
I should have different ASNs.
The nodeSelector ensures that only my worker nodes announce routes.
Important to note is also the
serviceSelector. A missing serviceSelector is
notably not an error. It just means that Cilium won’t announce any service routes.
If you’d like to, you can also have Cilium announce routes to the actual pods.
With my current k8s Homelab, I have configured my three worker nodes as neighbors in OPNsense. I’ve also got the following service running for my Ingress:
- name: secureweb
This is a simplified version of the service the Traefik Helm chart automatically
creates for me.
Important here are the
type: LoadBalancer and the
externalTrafficPolicy: Local settings.
It currently has the following IP:
kubectl get -n traefik-ingress service
NAME              TYPE           CLUSTER-IP     EXTERNAL-IP   PORT(S)         AGE
traefik-ingress   LoadBalancer   10.7.122.207   10.86.55.5    443:31512/TCP   32h
And here is the culmination of this entire article:
So there we are. There’s only one Traefik pod running for the moment. And it’s
running on the node with the IP
10.86.5.206. As I said in the beginning, with
externalTrafficPolicy: Local, only the nodes which host pods of a given
service announce routes to themselves. This prevents intra-cluster routing and
preserves the source IP.
I also had a trial with
externalTrafficPolicy: Cluster, and in that case all
three of my current cluster nodes announce the service IP to OPNsense.
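The difference between the two policies boils down to which nodes announce a route for the service IP. Here is a small Python sketch of that rule as I observed it. It is not Cilium's actual code. Only 10.86.5.206 is a real address from my setup; the other two node IPs and the function name are made up for illustration.

```python
def announcing_nodes(policy, bgp_nodes, pod_nodes):
    """Return the nodes which announce a route for a service IP.

    policy:    "Local" or "Cluster" (the service's externalTrafficPolicy)
    bgp_nodes: nodes participating in BGP
    pod_nodes: nodes currently running a pod of the service
    """
    if policy == "Local":
        # Only nodes which can serve the traffic themselves announce.
        return sorted(set(bgp_nodes) & set(pod_nodes))
    if policy == "Cluster":
        # Every BGP node announces and forwards internally if needed.
        return sorted(bgp_nodes)
    raise ValueError(f"unknown externalTrafficPolicy: {policy}")


# 10.86.5.207 and 10.86.5.208 are hypothetical; 10.86.5.206 runs Traefik.
workers = ["10.86.5.206", "10.86.5.207", "10.86.5.208"]
print(announcing_nodes("Local", workers, ["10.86.5.206"]))
print(announcing_nodes("Cluster", workers, ["10.86.5.206"]))
```

With Local, only the node running the Traefik pod shows up. With Cluster, all three workers do, which matches what I saw in OPNsense.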
Finally, another request to my readers: Do you have a favorite book about networking? I was initially completely lost (and as you see from my explanation of BGP, still mostly am) reading about BGP and even ARP when I was working on the L2 announcement. It’s the one big glaring hole in my Homelab knowledge. Took me ages to get started on VLANs as well, for example.
So if you’ve got a favorite book about current important networking tech and protocols, drop me a note on the Fediverse.