I updated my Nomad and Consul clusters recently. Consul in particular was a large jump, going from 1.13.5 to 1.15.1. After about three days, I suddenly started getting 500: Internal Server Error responses from a lot of my services.
In this article, I will go through the debugging process and explain a little bit about what Consul Connect is and how it works.
Why Consul Connect?
So Consul Connect, or Consul Service Mesh, creates a sort-of overlay network that connects services running on multiple machines in a secure manner.
It works like this: every service (in my case Nomad jobs, but it also works standalone) gets a sidecar running the Envoy proxy. Each service also gets its own network namespace, which is the only network the service container is connected to.
There are multiple Envoy proxy sidecars. One of them is always present in a mesh-enabled service: the Envoy proxy that accepts connections from the outside. But this proxy does not just allow any connection. Instead, it is protected with mTLS certificates, which are handed out by the Consul cluster. Without a fitting mTLS certificate, an incoming connection is rejected.
Whether a TLS cert is actually valid, and a connection is allowed to a service’s proxy, is defined by Consul intentions. Those allow or deny connections between services, based on service names.
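As a quick illustration (the service names web and db are placeholders, not from my setup), intentions can be managed with the consul intention subcommands, or via service-intentions config entries in newer setups:
# Allow connections from web to db
consul intention create -allow web db

# Explicitly deny everything else from reaching db
consul intention create -deny '*' db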
So let’s say we have a service FOO which wants to connect to a service BAR. In this case, an upstream would be defined in the Nomad job. This means that Consul will configure the local Envoy proxy of FOO with the IP and port of the Envoy proxy fronting the BAR service. In addition, the local port on which BAR should be reachable needs to be defined.
The Envoy proxy then opens a socket on that port in the local service network namespace and waits for connections. Once a connection is made, the data is encrypted with the mTLS certificates and forwarded to BAR’s Envoy proxy. There, it is decrypted and forwarded to the local port on which BAR is listening in its own network namespace.
This shows the three advantages of using Consul Connect service mesh:
- Connections are automatically encrypted - without having to configure TLS certs for every service.
- Consul’s service discovery is used, so no hardcoded IPs or ports are needed.
- You don’t just have your services’ ports dangling openly in your LAN. Since the connection is not made to the service directly but to its Envoy proxy, only connections from services permitted by the configured Consul intentions get through.
And the complexity of setting this up is not too high, at least after the initial setup. For example, my service config in the Nomad job file for Nextcloud looks like this:
network {
  mode = "bridge"

  port "health" {
    host_network = "local"
    to           = 80
  }
}

service {
  name = "nextcloud"
  port = 80

  connect {
    sidecar_service {
      proxy {
        upstreams {
          destination_name = "redis"
          local_bind_port  = 6379
        }

        upstreams {
          destination_name = "postgres"
          local_bind_port  = 5577
        }
      }
    }
  }

  tags = [
    "traefik.enable=true",
    "traefik.consulcatalog.connect=true",
    "traefik.http.routers.nextcloud.entrypoints=foobar",
    "traefik.http.routers.nextcloud.rule=Host(`my.very.own.cloud`)",
  ]

  check {
    type     = "http"
    interval = "30s"
    path     = "/"
    timeout  = "2s"
    port     = "health"
  }
}
Nextcloud has two upstream dependencies: Redis for caching and Postgres for the database. Redis is bound to its default port 6379 and Postgres to 5577, but only in the job group’s own network namespace. So Nextcloud is simply configured to connect to Redis under localhost:6379 and to Postgres under localhost:5577. The Envoy proxy then takes care of forwarding any connections to the Envoy proxies of these two services.
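For illustration, the corresponding task environment could look roughly like this. The variable names are those of the official Nextcloud Docker image and are an assumption on my part, not part of the job file above:
# Hypothetical env block of the Nextcloud task: the upstreams defined in the
# sidecar_service block above are reachable on localhost inside the group's namespace.
env {
  REDIS_HOST      = "localhost"
  REDIS_HOST_PORT = "6379"
  POSTGRES_HOST   = "localhost:5577"
}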
Another very nice thing is that Traefik already supports Consul Connect natively, so it doesn’t even need an Envoy proxy. It automatically discovers all services in the mesh, checks whether they have the traefik.enable=true tag and then uses the other traefik.* tags to automatically set up routes.
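For reference, this is roughly the static Traefik configuration that enables the native Connect support (a sketch, assuming Traefik 2.5 or later; option names are from the consulCatalog provider):
# Traefik static configuration (TOML) - join the mesh without an Envoy sidecar
[providers.consulCatalog]
  connectAware     = true   # talk to mesh services using Connect mTLS certs
  exposedByDefault = false  # only route services tagged traefik.enable=true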
But it’s not all milk and honey. As you can see, some acrobatics are needed for health checks to work. With a config like this, the only way to reach the service from the outside is via the Envoy proxy - and that only allows connections with the right mTLS client cert. But the local Consul agent needs to be able to connect to the service somehow to make sure it works. I do this by forwarding the service’s port to a localhost port and running the health check on that, which works reasonably well.
In addition, Envoy is mostly intended for Layer 7 proxying. As you can see in the example above, it does work with plain TCP as well - but it fails with TLS. So if I were to enable TLS connections in Postgres and Redis, it would not work anymore.
There is a somewhat similar problem with Traefik. It currently has a bug where it does not use the right certificates to talk to backend services via the mesh if the incoming connection is plain TLS instead of HTTPS.
The bug
Now, finally, on to what the main part of this post was supposed to be about…
As mentioned above, the communication, both for encryption and for authorization, is done via mTLS in the Consul Connect mesh. Those mTLS certs have a TTL of 72h by default. After that, Consul is supposed to automatically renew them.
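(For reference, this TTL comes from the Connect CA settings on the Consul servers. A sketch, assuming the built-in CA provider:)
# Excerpt from the Consul server configuration; 72h is the default leaf cert TTL
connect {
  enabled = true

  ca_config {
    leaf_cert_ttl = "72h"
  }
}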
But it doesn’t, in Consul v1.15.1.
At first, I had no idea what was going on - I was suddenly getting HTTP error code 500 almost everywhere. The health checks were almost all green, and there were no errors in the logs anywhere. I later realized that this was partially due to the fact that my health checks bypass the mesh and run on a local port, as shown above.
When it happened a second time, I realized that it was exactly 72 hours after I had last restarted all of my services. So I opened this GitHub issue, and the debugging started.
First, I used Consul’s own troubleshooting functionality, with the aptly named troubleshoot command.
This command, at least in my setup with separate network namespaces in Nomad jobs, needs to be run in the service container’s namespace. Linux has a handy tool to execute host binaries in a namespace’s context: the nsenter command.
To use it, first list your Docker containers with docker ps and note the container ID of the container you want to enter. Then you need the PID of said container, which you can get with this command:
docker inspect -f '{{.State.Pid}}' <CONTAINER_ID>
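If you only know the container’s name, the two steps can be combined into something like this (a sketch; it assumes the name filter matches exactly one container):
# Grab the PID of the container whose name matches "nextcloud"
# and run an arbitrary command inside its network namespace.
PID=$(docker inspect -f '{{.State.Pid}}' "$(docker ps -qf name=nextcloud)")
nsenter -t "$PID" -n <command>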
Then, you can run the actual troubleshooting command:
nsenter -t <PID> -n consul troubleshoot proxy -envoy-admin-endpoint=<ENVOY_ADMIN_ADDR> -upstream-ip=127.0.0.1
The ENVOY_ADMIN_ADDR is the IP and port of the Envoy proxy’s admin endpoint for the service. You can see it, for example, by running ss -tulnp in the service’s network namespace:
nsenter -t 1182230 -n ss -tulnp
Netid State Recv-Q Send-Q Local Address:Port Peer Address:Port Process
tcp LISTEN 0 4096 127.0.0.1:5577 0.0.0.0:* users:(("envoy",pid=1181034,fd=26))
tcp LISTEN 0 4096 127.0.0.1:6379 0.0.0.0:* users:(("envoy",pid=1181034,fd=25))
tcp LISTEN 0 1024 0.0.0.0:3000 0.0.0.0:* users:(("ruby",pid=1182550,fd=5),("ruby",pid=1182542,fd=5),("ruby",pid=1182256,fd=5))
tcp LISTEN 0 4096 127.0.0.2:19003 0.0.0.0:* users:(("envoy",pid=1181034,fd=14))
tcp LISTEN 0 4096 127.0.0.2:19004 0.0.0.0:* users:(("envoy",pid=1181041,fd=14))
tcp LISTEN 0 511 0.0.0.0:4000 0.0.0.0:* users:(("node",pid=1182257,fd=21))
tcp LISTEN 0 4096 0.0.0.0:20226 0.0.0.0:* users:(("envoy",pid=1181041,fd=33),("envoy",pid=1181041,fd=24))
This output is an example from my Mastodon Sidekiq container. It shows multiple Envoy processes, e.g. the upstreams supplying Redis and Postgres connectivity. The admin ports for the Envoy instances are the ones starting with 19xxx.
So for inspecting the Envoy proxy running under 127.0.0.2:19003, I would enter the following:
nsenter -t 1182230 -n consul troubleshoot proxy -envoy-admin-endpoint=127.0.0.2:19003 -upstream-ip=127.0.0.1
This will show certificate errors, like this one:
==> Validation
! Certificate chain is expired
-> Check the logs of the Consul agent configuring the local proxy and ensure XDS updates are being sent to the proxy
! Certificate chain is expired
-> Check the logs of the Consul agent configuring the local proxy and ensure XDS updates are being sent to the proxy
! Certificate chain is expired
-> Check the logs of the Consul agent configuring the local proxy and ensure XDS updates are being sent to the proxy
✓ Envoy has 0 rejected configurations
✓ Envoy has detected 1464 connection failure(s)
! No listener for upstream "127.0.0.1"
-> Check that your upstream service is registered with Consul
-> Make sure your upstream exists by running the `consul[-k8s] troubleshoot upstreams` command
-> If you are using transparent proxy for this upstream, ensure you have set up allow intentions to the upstream
-> Check the logs of the Consul agent configuring the local proxy to ensure XDS resources were sent by Consul
! No clusters found on route or listener
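Independently of the troubleshoot command, you can also ask Envoy itself about its certificates. This is a sketch, assuming the admin endpoint from above is reachable inside the namespace; Envoy’s /certs admin endpoint reports the loaded certificates along with their expiration:
# Query Envoy's admin API for the currently loaded certs; the JSON output
# contains expiration_time and days_until_expiration fields.
nsenter -t 1182230 -n curl -s http://127.0.0.2:19003/certs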
Another interesting thing you can do is to check the certs directly with OpenSSL. For this, you need a service’s outside listening port from Consul. Then you can run:
openssl s_client -showcerts -connect <ADDR>
This will produce an output similar to this:
CONNECTED(00000003)
Can't use SSL_get_servername
depth=0
verify error:num=20:unable to get local issuer certificate
verify return:1
depth=0
verify error:num=21:unable to verify the first certificate
verify return:1
depth=0
verify return:1
---
Certificate chain
0 s:
i:CN = REDACTED.consul
a:PKEY: id-ecPublicKey, 256 (bit); sigalg: RSA-SHA256
v:NotBefore: Mar 31 19:56:43 2023 GMT; NotAfter: Apr 3 19:56:43 2023 GMT
-----BEGIN CERTIFICATE-----
REDACTED
-----END CERTIFICATE-----
---
Server certificate
subject=
issuer=CN = REDACTED.consul
---
Acceptable client certificate CA names
CN = REDACTED.consul
Requested Signature Algorithms: ECDSA+SHA256:RSA-PSS+SHA256:RSA+SHA256:ECDSA+SHA384:RSA-PSS+SHA384:RSA+SHA384:RSA-PSS+SHA512:RSA+SHA512:RSA+SHA1
Shared Requested Signature Algorithms: ECDSA+SHA256:RSA-PSS+SHA256:RSA+SHA256:ECDSA+SHA384:RSA-PSS+SHA384:RSA+SHA384:RSA-PSS+SHA512:RSA+SHA512
Peer signing digest: SHA256
Peer signature type: ECDSA
Server Temp Key: X25519, 253 bits
---
SSL handshake has read 1392 bytes and written 403 bytes
Verification error: unable to verify the first certificate
---
New, TLSv1.3, Cipher is TLS_AES_256_GCM_SHA384
Server public key is 256 bit
Secure Renegotiation IS NOT supported
Compression: NONE
Expansion: NONE
No ALPN negotiated
Early data was not sent
Verify return code: 21 (unable to verify the first certificate)
---
4097EA98687F0000:error:0A00045C:SSL routines:ssl3_read_bytes:tlsv13 alert certificate required:../ssl/record/rec_layer_s3.c:1584:SSL alert number 116
The interesting line is this one:
v:NotBefore: Mar 31 19:56:43 2023 GMT; NotAfter: Apr 3 19:56:43 2023 GMT
It shows the cert’s validity period and, in particular, its expiration date.
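In case you are wondering where to get the <ADDR> for the openssl command: the sidecar proxies register themselves in the Consul catalog as <service>-sidecar-proxy, so something like this should print the mesh-facing address and port (a sketch; the service name is just an example, and ServiceAddress may be empty if the node address is used instead):
# Look up the sidecar proxy of the "nextcloud" service in the Consul catalog
curl -s http://127.0.0.1:8500/v1/catalog/service/nextcloud-sidecar-proxy \
  | jq '.[0] | "\(.ServiceAddress):\(.ServicePort)"'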
After some investigation in the ticket, it turned out that it wasn’t just me, but that a recent change in local caching behavior had prevented automatic renewals from being communicated to the Envoy proxies.
Which yet again proves: Caching is one of the two hard things in software engineering. The other two being “naming things” and “off by one errors”.
Luckily, the issue seems to have been found, and a new version, 1.15.2, was released on Friday.
I immediately did the update and found something nice: this time around, I did not stop all Nomad jobs in the cluster, but instead just updated the Consul version. And it worked without issue. So I will be able to do future Consul agent updates without full job restarts.
After the update, all services got new certificates. I’m not sure whether that’s an indication that the update fixed the issue, or whether they just got the new certs due to the Consul restart.
At the time of writing, the expiration date for the certs is tomorrow, April 3rd.
Let’s see what happens.