Wherein I replace Uptime Kuma on Nomad with Gatus on Kubernetes.
This is part 22 of my k8s migration series.
For my service monitoring needs, I’ve been using Uptime Kuma for a couple of years now. Please have a look at the repo’s README for a couple of screenshots; I completely forgot to take some before shutting my instance down. 🤦 My main use for it was as a platform to monitor the services, not so much as a dashboard. To that end, I gathered Uptime Kuma’s data from the integrated Prometheus exporter and displayed it on my Grafana Homelab dashboard.
I had two methods for monitoring services. The main one was checking their domains via Consul’s DNS. Because all my services’ health checks in the Nomad/Consul setup were done by Consul anyway, this was a pretty nice method. When a service failed its health check, Consul would remove it from its DNS and the Uptime Kuma check would start failing.
But this approach wasn’t really enough - for example, Mastodon’s service might very well be up and healthy, but I might have screwed up the Traefik configuration, meaning my dashboards were green, but Mastodon would still be unreachable. So I slowly switched to HTTP and raw TCP socket checks to make sure that the services were actually reachable, and not just healthy.
There were always two things I didn’t like about Uptime Kuma. First, it requires storage, because it keeps its data in an SQLite database. Second, the configuration can only be done via the web UI and is then stored in that database. So no versioning of the config. And I’ve become very fond of having my Homelab configs under version control over the years.
So when it came to planning the k8s migration, I looked around and was pointed to Gatus, I think by this video from Techno Tim on YouTube. It has two advantages over Uptime Kuma, namely that it does not need any storage and that it is entirely configured via a YAML file. Of course, the fact that it can run without storage also means that after a restart, the history is gone. But this is fine for me, because I don’t need a history, as I’m sending the data to Prometheus anyway. This is not to say that Gatus doesn’t support persistence. It can be run with a PostgreSQL or SQLite database. But I don’t need any persistence in my setup.
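For completeness, if I ever did want the history to survive restarts, the storage block in the config is where that would go. A minimal sketch for the SQLite variant, where the path is just an example and would need a volume behind it:

storage:
  type: sqlite
  # example path, would need a PersistentVolume mounted at /data
  path: /data/data.db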
Setup
As Gatus doesn’t have any dependencies, I can get right into the Deployment:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: gatus
spec:
  replicas: 1
  selector:
    matchLabels:
      homelab/app: gatus
  strategy:
    type: "Recreate"
  template:
    metadata:
      labels:
        homelab/app: gatus
      annotations:
        checksum/config: {{ include (print $.Template.BasePath "/gatus-config.yaml") . | sha256sum }}
    spec:
      automountServiceAccountToken: false
      securityContext:
        fsGroup: 1000
        sysctls:
          - name: net.ipv4.ping_group_range
            value: "0 65536"
      containers:
        - name: gatus
          image: twinproduction/gatus:{{ .Values.appVersion }}
          securityContext:
            capabilities:
              add:
                - CAP_NET_RAW
          volumeMounts:
            - name: config
              mountPath: /config
              readOnly: true
          resources:
            requests:
              cpu: 250m
              memory: 100Mi
          env:
            - name: GATUS_LOG_LEVEL
              value: "DEBUG"
          livenessProbe:
            httpGet:
              port: {{ .Values.port }}
              path: "/health"
            initialDelaySeconds: 15
            periodSeconds: 30
          ports:
            - name: gatus-http
              containerPort: {{ .Values.port }}
              protocol: TCP
      volumes:
        - name: config
          configMap:
            name: gatus-conf
There’s not much interesting to say about the Deployment; it’s pretty much the standard Deployment in my Homelab. With one exception: the CAP_NET_RAW capability I’m adding to the container, and the sysctls setting:
securityContext:
  fsGroup: 1000
  sysctls:
    - name: net.ipv4.ping_group_range
      value: "0 65536"
[...]
securityContext:
  capabilities:
    add:
      - CAP_NET_RAW
These are needed because I use pings to determine whether a host is up or not. When I initially ran without those configs, I got the following:
2025/03/08 16:09:38 [watchdog.execute] Monitored group=Hosts; endpoint=Host: Foobar; key=hosts_host:-Foobar; success=false; errors=0; duration=0s; body=
Not too helpful, but it indicated that the host foobar was not returning the pings. But I knew the host was up, and I knew I was able to ping it from the host running the Gatus pod. After some searching, I found this issue and the explanation that running ping requires some privileges. Normally those are granted by setting the setuid bit on the ping executable, which is owned by root. But here, the ping is executed through a Go library, not by running the ping executable. And because the container doesn’t run as root, the Gatus process simply doesn’t have enough privileges to ping anything.
On a lower level, ping uses raw network sockets, which are privileged in the Linux kernel. The sysctls setting was proposed as a solution in the issue I linked above, but that alone did not work for me; I also had to add the CAP_NET_RAW capability. Still better than running the container in fully privileged mode.
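One more note on the Deployment: the config volume comes from a ConfigMap called gatus-conf, and the checksum/config annotation on the pod template hashes the rendered config so that the pod gets recreated whenever the config changes. A minimal sketch of what that ConfigMap could look like, wrapping the Gatus config I’ll go through in the next section:

apiVersion: v1
kind: ConfigMap
metadata:
  name: gatus-conf
data:
  # Gatus looks for config.yaml (or config.yml) under /config by default,
  # which is where the Deployment mounts this ConfigMap
  config.yaml: |
    metrics: true
    storage:
      type: memory
    # web, ui and endpoints config follow here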
Configuration
Gatus is configured via a YAML file. The common part of my config looks like this:
metrics: true
storage:
  type: memory
web:
  port: {{ .Values.port }}
ui:
  title: "Meiers Homelab"
  description: "Monitoring for Meiers Homelab"
Again, nothing really noteworthy: this enables the memory storage type and the metrics endpoint, which exposes Prometheus metrics for every endpoint at /metrics.
Then come the endpoints, which is Gatus’ name for “things to monitor”. I will show a couple of examples for the different things I monitor, starting with the host monitoring via ping:
- name: "Host: Foobar"
group: "Hosts"
url: "icmp://foobar.home"
interval: 5m
conditions:
- "[CONNECTED] == true"
This config sends a ping to foobar.home every five minutes and registers the check as successful if it receives a reply. It also puts the check into the Hosts group. Groups are where Gatus is a bit less flexible than Uptime Kuma, which allowed creating individual dashboards.
Next, I’m using TCP socket connections to check whether my Ceph MON daemons are up, at least insofar as they accept connections:
- name: "Ceph: Mon Baz"
group: "Ceph"
url: "tcp://baz.home:6789"
interval: 2m
conditions:
- "[CONNECTED] == true"
This check tries to establish a TCP connection to the host:port given in the URL. I also wanted to configure a check on the health of the Ceph cluster overall. Ceph’s MGR/dashboard module supplies one at /api/health, several with different levels of detail, even. And Gatus itself allows you to check a lot of different things in the body of the response received by an HTTP check. But the issue here was that Gatus doesn’t support simple basic auth for monitored endpoints, and Ceph only allows authenticated access to its HTTP API, including the health endpoint.
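Just to illustrate what I was after: had the health endpoint been reachable without auth, a check along these lines would have done the job. This is only a sketch; the port, path and exact JSON field are assumptions, not something I actually got running:

- name: "Ceph: Cluster health"
  group: "Ceph"
  # hypothetical: assumes the MGR dashboard listens on baz.home:8443 and that
  # /api/health/minimal would answer unauthenticated with {"health": {"status": "HEALTH_OK"}, ...}
  url: "https://baz.home:8443/api/health/minimal"
  interval: 5m
  conditions:
    - "[STATUS] == 200"
    - "[BODY].health.status == HEALTH_OK"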
As a short aside, I’m still a bit torn on authenticated health endpoints. I think that they should definitely be an option - if you’ve got auth infrastructure for everything anyway, there’s not much cost for setting your monitoring up with a valid token. But in a Homelab, it gets really annoying really fast. On the other hand, any unauthenticated endpoint is a potential entryway into your app. So I understand putting that behind auth. But I would like it to be optional, please. Give me an option to say “Yes, authenticate everything - besides the health API”. Sure, I could set up OAuth2 for the Ceph API and then configure Gatus to use it, but that seems just a bit too much hassle, considering that I’m already getting the health status via Prometheus scraping anyway.
Okay, the next example is an HTTP check on my Consul server:
- name: "Cluster: Consul"
group: "Cluster"
url: "https://consul.example.com:8501/v1/status/leader"
method: "GET"
interval: 2m
conditions:
- "[STATUS] == 200"
client:
insecure: true
The insecure: true
option is required here, because the Consul server uses my
internal CA, and providing the CA certs to Gatus was just a bit too much hassle,
especially for a service I will be taking down soon anyway.
Next up, checking whether my internal authoritative DNS server is working:
- name: "Infra: DNS Bar"
group: "Infra"
url: "bar.home:53"
interval: 2m
dns:
query-name: "ingress.example.com"
query-type: "A"
conditions:
- "[BODY] == 300.300.300.1"
- "[DNS_RCODE] == NOERROR"
This check makes a DNS request for ingress.example.com to bar.home and then checks that the response is the correct IP and that there was no error. I’m running this check with the IP of my ingress, because that IP doesn’t change, and the ingress is probably the most stable component in my setup.
Last but not least, here is the config for checking how long a cert is going to be valid:
- name: "Infra: mei-home.net cert"
group: "Infra"
url: "https://blog.mei-home.net"
interval: 12h
conditions:
- "[CERTIFICATE_EXPIRATION] > 72h"
This one uses my blog to check whether my cert for mei-home.net is still valid for at least three days.
And this is what the web UI looks like:

[Screenshot: Gatus dashboard with all groups collapsed.]

[Screenshot: Expanded service in Gatus’ web UI.]
I don’t foresee visiting this page too often, as I will mostly get the information from the Grafana dashboard I will describe in the next section.
Metrics and Grafana
Gatus provides metrics in Prometheus format at the /metrics
endpoint:
# HELP gatus_results_certificate_expiration_seconds Number of seconds until the certificate expires
# TYPE gatus_results_certificate_expiration_seconds gauge
gatus_results_certificate_expiration_seconds{group="Infra",key="infra_infra:-mei-home-net-cert",name="Infra: mei-home.net cert",type="HTTP"} 3.276935592538658e+06
# HELP gatus_results_endpoint_success Displays whether or not the endpoint was a success
# TYPE gatus_results_endpoint_success gauge
gatus_results_endpoint_success{group="Hosts",key="hosts_host:-foobar",name="Host: Foobar",type="ICMP"} 1
Armed with this information, I set up a new static scrape for my Prometheus deployment:
apiVersion: monitoring.coreos.com/v1alpha1
kind: ScrapeConfig
metadata:
  name: scraping-gatus
  labels:
    prometheus: scrape-gatus
spec:
  staticConfigs:
    - labels:
        job: gatus
      targets:
        - "gatus.gatus.svc.cluster.local:8080"
  metricsPath: "/metrics"
  scheme: HTTP
  scrapeInterval: "1m"
  metricRelabelings:
    - action: drop
      sourceLabels: ["__name__"]
      regex: 'go_.*'
    - action: drop
      sourceLabels: ["__name__"]
      regex: 'promhttp_.*'
    - action: drop
      sourceLabels: ["__name__"]
      regex: 'process_.*'
Nothing special to see, besides filtering out the Go runtime and process metrics I never look at anyway.
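One thing the ScrapeConfig implies but I haven’t shown: the target gatus.gatus.svc.cluster.local:8080 is a plain ClusterIP Service named gatus in the gatus namespace, pointing at the container port from the Deployment. A minimal sketch, assuming .Values.port is 8080:

apiVersion: v1
kind: Service
metadata:
  name: gatus
  namespace: gatus
spec:
  selector:
    homelab/app: gatus
  ports:
    - name: gatus-http
      port: 8080
      # routes to the named container port from the Deployment above
      targetPort: gatus-http
      protocol: TCP

The prometheus: scrape-gatus label on the ScrapeConfig is what the Prometheus operator’s scrapeConfigSelector can match on, so that this object actually gets picked up; the exact selector depends on how the Prometheus CR is configured.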
Finally, I use that data in a Grafana state timeline visualization:

[Screenshot: Service uptime panel in my Homelab dashboard.]
The panel is driven by this Prometheus query:
gatus_results_endpoint_success
Yupp, as simple as that.
In addition, I’m using Gatus’ certificate expiry metrics to drive a stat panel:

[Screenshot: Stat panel for my public cert expiry.]
gatus_results_certificate_expiration_seconds{name="Infra: mei-home.net cert"}
Conclusion
And this concludes the Uptime Kuma to Gatus switch post. It also marks the end of phase 1 of the Nomad to k8s migration. Uptime Kuma was the last service left on Nomad; after it, only infrastructure jobs like CSI plugins and a Traefik ingress were still running. I would say that in total, this first phase of setting up the k8s cluster itself, Rook Ceph, and migrating all services over took me about six months. I got started in earnest towards Christmas 2023 and then worked away at it until about April, when I was rudely interrupted by my backup setup not being viable for k8s. I then finally got back into it a couple of months ago, at the beginning of 2025.
The next steps will be completely decommissioning the Nomad cluster and migrating the baremetal Ceph hosts over to the Rook Ceph cluster. The work is pretty mechanical at the moment, with all of the cleanups, so the next blog post might take a while. I mean, unless something explodes in my face in an amusing way. 😅 Although I might hold a wake for my HashiCorp Nomad cluster once I’ve fully taken it down.