Wherein I replace Uptime Kuma on Nomad with Gatus on Kubernetes.

This is part 22 of my k8s migration series.

For my service monitoring needs, I’ve been using Uptime Kuma for a couple of years now. Please have a look at the repo’s README for a couple of screenshots; I completely forgot to make some before taking my instance down. 🤦 My main use for it was as a platform to monitor the services, not so much as a dashboard. To that end, I gathered Uptime Kuma’s data from the integrated Prometheus exporter and displayed it on my Grafana Homelab dashboard.

I had two methods for monitoring services. The main one was checking their domains via Consul’s DNS. Because all my services’ health checks in the Nomad/Consul setup were done by Consul anyway, this was a pretty nice method. When a service failed its health check, Consul would remove it from its DNS, and the Uptime Kuma check would start failing.

But this approach wasn’t really enough - for example, Mastodon’s service might very well be up and healthy, but I might have screwed up the Traefik configuration, meaning my dashboards were green while Mastodon was still unreachable. So I slowly switched to HTTP and raw TCP socket checks to make sure that the services were actually reachable, and not just healthy.

There were always two things I didn’t like about Uptime Kuma. First, it requires some storage, because it stores its data in an SQLite database. Second, the configuration can only be done via the web UI and is then stored in the database. So no versioning of the config. And I’ve become very fond of having my Homelab configs under version control over the years.

So when it came to planning the k8s migration, I looked around and was pointed to Gatus, I think by this video from Techno Tim on YouTube. It has two advantages over Uptime Kuma, namely that it does not need any storage and that it is entirely configured via a YAML file. Of course, the fact that it can run without storage also means that after a restart, the history is gone. But this is fine for me, because I don’t need a history, as I’m sending the data to Prometheus anyway. This is not to say that Gatus doesn’t support persistence - it can be run with a PostgreSQL or SQLite database. But I don’t need any persistence in my setup.

Setup

As Gatus doesn’t have any dependencies, I can get right into the Deployment:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: gatus
spec:
  replicas: 1
  selector:
    matchLabels:
      homelab/app: gatus
  strategy:
    type: "Recreate"
  template:
    metadata:
      labels:
        homelab/app: gatus
      annotations:
        # Roll the Pod whenever the Gatus config changes
        checksum/config: {{ include (print $.Template.BasePath "/gatus-config.yaml") . | sha256sum }}
    spec:
      automountServiceAccountToken: false
      securityContext:
        fsGroup: 1000
        # Allows unprivileged processes to open ICMP sockets, see below
        sysctls:
          - name: net.ipv4.ping_group_range
            value: 0 65536
      containers:
        - name: gatus
          image: twinproduction/gatus:{{ .Values.appVersion }}
          securityContext:
            # Needed for the ICMP (ping) checks, see below
            capabilities:
              add:
                - CAP_NET_RAW
          volumeMounts:
            - name: config
              mountPath: /config
              readOnly: true
          resources:
            requests:
              cpu: 250m
              memory: 100Mi
          env:
            - name: GATUS_LOG_LEVEL
              value: "DEBUG"
          livenessProbe:
            httpGet:
              port: {{ .Values.port }}
              path: "/health"
            initialDelaySeconds: 15
            periodSeconds: 30
          ports:
            - name: gatus-http
              containerPort: {{ .Values.port }}
              protocol: TCP
      volumes:
        - name: config
          configMap:
            name: gatus-conf

There’s not much of interest to say about the Deployment; it’s pretty much the standard Deployment in my Homelab, with one exception: the CAP_NET_RAW capability I’m adding to the container, and the sysctls setting:

      securityContext:
        fsGroup: 1000
        sysctls:
          - name: net.ipv4.ping_group_range
            value: 0 65536
[...]
          securityContext:
            capabilities:
              add:
                - CAP_NET_RAW

Both settings are due to my use of pings to determine whether a host is up or not. When I initially ran without them, I got the following:

2025/03/08 16:09:38 [watchdog.execute] Monitored group=Hosts; endpoint=Host: Foobar; key=hosts_host:-Foobar; success=false; errors=0; duration=0s; body=

Not too helpful, but it indicated that the host foobar was not answering the pings. But I knew the host was up, and I knew I was able to ping it from the node running the Gatus pod. After some searching, I found this issue, with the explanation that running ping requires certain privileges. Normally, that’s handled by setting the setuid bit on the ping executable, which is owned by root. But here, the ping is executed through a Go library, not by running the ping executable. And because the container doesn’t run as root, the Gatus process simply doesn’t have enough privileges to ping anything. On a lower level, ping uses raw network sockets, which are privileged in the Linux kernel. The sysctls setting was proposed as a solution in the issue I linked above, but setting only that did not work for me; I had to add the CAP_NET_RAW capability as well. Still better than running the container in fully privileged mode.
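One thing I haven’t shown is the Service sitting in front of the Deployment, because there’s nothing special about it. Going by the Prometheus scrape target further down, it boils down to roughly this sketch:

apiVersion: v1
kind: Service
metadata:
  name: gatus
spec:
  selector:
    homelab/app: gatus
  ports:
    - name: gatus-http
      # 8080 in my case, matching the scrape target in the metrics section
      port: {{ .Values.port }}
      targetPort: gatus-http
      protocol: TCP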

Configuration

Gatus allows configuration via a YAML file. The common part of my config looks like this:

metrics: true
storage:
  type: memory
web:
  port: {{ .Values.port }}
ui:
  title: "Meiers Homelab"
  description: "Monitoring for Meiers Homelab"

Again, nothing really noteworthy: this enables the in-memory storage type and the metrics endpoint, which exposes Prometheus metrics for every endpoint at /metrics.
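For completeness: this config ends up in the gatus-conf ConfigMap that the Deployment mounts at /config, where Gatus looks for config.yaml by default. A minimal sketch of what that looks like:

apiVersion: v1
kind: ConfigMap
metadata:
  name: gatus-conf
data:
  config.yaml: |
    metrics: true
    storage:
      type: memory
    # ...the rest of the config, including the endpoints shown below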

Then come the endpoints, which is Gatus’ name for “things to monitor”. I will show a couple of examples for the different things I monitor, starting with the host monitoring via ping:

- name: "Host: Foobar"
  group: "Hosts"
  url: "icmp://foobar.home"
  interval: 5m
  conditions:
    - "[CONNECTED] == true"

This config sends a ping to foobar.home every five minutes and registers the check as successful if it receives a reply. It also puts the check into the Hosts group. This is one area where Gatus is a bit less flexible than Uptime Kuma, which allowed creating individual dashboards.

Next, I’m using TCP socket connections to check whether my Ceph MON daemons are up, at least insofar as they accept connections:

- name: "Ceph: Mon Baz"
  group: "Ceph"
  url: "tcp://baz.home:6789"
  interval: 2m
  conditions:
    - "[CONNECTED] == true"

This check tries to establish a TCP connection to the host:port given in the URL. I also wanted to configure a check on the overall health of the Ceph cluster. Ceph’s MGR/dashboard module supplies one at /api/health - several, in fact, with different levels of detail. And Gatus itself allows you to check a lot of different things in the body of the response received by an HTTP check. But the issue here was that Gatus doesn’t support simple basic auth for monitored endpoints, and Ceph only allows authenticated access to its HTTP API, including the health endpoint.
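Just to illustrate what I was going for: without the auth requirement, the check might have looked something like this. Take it as a sketch - the port and the response format are assumptions on my part:

- name: "Ceph: Cluster Health"
  group: "Ceph"
  url: "https://baz.home:8443/api/health/minimal"
  interval: 5m
  conditions:
    - "[STATUS] == 200"
    - "[BODY].health.status == HEALTH_OK"

In reality, all this check gets back is a 401.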

As a short aside, I’m still a bit torn on authenticated health endpoints. I think they should definitely be an option - if you’ve got auth infrastructure for everything anyway, there’s not much cost to setting your monitoring up with a valid token. But in a Homelab, it gets really annoying really fast. On the other hand, any unauthenticated endpoint is a potential entryway into your app, so I understand putting that behind auth. But I would like it to be optional, please. Give me an option to say “Yes, authenticate everything - except the health API”. Sure, I could set up OAuth2 for the Ceph API and then configure Gatus to use it, but that seems like just a bit too much hassle, considering that I’m already getting the health status via Prometheus scraping anyway.

Okay, the next example is an HTTP check on my Consul server:

- name: "Cluster: Consul"
  group: "Cluster"
  url: "https://consul.example.com:8501/v1/status/leader"
  method: "GET"
  interval: 2m
  conditions:
    - "[STATUS] == 200"
  client:
    insecure: true

The insecure: true option is required here because the Consul server uses my internal CA, and providing the CA certs to Gatus was just a bit too much hassle, especially for a service I will be taking down soon anyway.
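For the record, said hassle would have looked roughly like this: mounting a CA bundle over the container’s trust store, for example from a ConfigMap. Note that homelab-ca is a made-up name here, and the bundle would need to contain the public CAs as well as my internal one, or the checks against publicly signed certs would start failing instead:

          volumeMounts:
            - name: ca-certs
              # one of the standard locations Go checks for system roots
              mountPath: /etc/ssl/certs/ca-certificates.crt
              subPath: ca-certificates.crt
              readOnly: true
      volumes:
        - name: ca-certs
          configMap:
            name: homelab-ca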

Next up, checking whether my internal authoritative DNS server is working:

- name: "Infra: DNS Bar"
  group: "Infra"
  url: "bar.home:53"
  interval: 2m
  dns:
    query-name: "ingress.example.com"
    query-type: "A"
  conditions:
    - "[BODY] == 300.300.300.1"
    - "[DNS_RCODE] == NOERROR"

This check makes a DNS request for ingress.example.com to bar.home and then verifies that the response contains the correct IP and that there was no error. I’m running this check against the IP of my ingress because it doesn’t change, and the ingress is probably the most stable component in my setup.

Last but not least, here is the config for checking how long a cert is going to be valid:

- name: "Infra: mei-home.net cert"
  group: "Infra"
  url: "https://blog.mei-home.net"
  interval: 12h
  conditions:
    - "[CERTIFICATE_EXPIRATION] > 72h"

This one uses my blog to check whether my cert for mei-home.net is still valid for at least three days.

And this is what the web UI looks like:

A screenshot of Gatus' main dashboard. It's headed 'Health Status' and shows several groups as collapsed lists. Each group has a name, in this case 'Ceph', 'Cluster', 'Hosts', 'Infra', 'K8s' and 'Services'. To the right of each group's name is a green check mark indicating the group's current status, which turns into a red X if any of the checks in that group fails.

Gatus dashboard with all groups collapsed.

Each individual check is then shown like this when the group is expanded:
A screenshot of an expanded check in Gatus' dashboard. It shows the name of the check at the top and then a row of green check marks below, one for each recent execution of the check. To the right, it also shows the average duration of the check, 41 ms in this case for the blog.mei-home.net check. At the very left and very right, the execution times of the oldest and newest checks are shown, respectively.

Expanded service in Gatus’ web UI.

I don’t foresee visiting this page too often, as I will mostly get the information from the Grafana dashboard I will describe in the next section.

Metrics and Grafana

Gatus provides metrics in Prometheus format at the /metrics endpoint:

# HELP gatus_results_certificate_expiration_seconds Number of seconds until the certificate expires
# TYPE gatus_results_certificate_expiration_seconds gauge
gatus_results_certificate_expiration_seconds{group="Infra",key="infra_infra:-mei-home-net-cert",name="Infra: mei-home.net cert",type="HTTP"} 3.276935592538658e+06
# HELP gatus_results_endpoint_success Displays whether or not the endpoint was a success
# TYPE gatus_results_endpoint_success gauge
gatus_results_endpoint_success{group="Hosts",key="hosts_host:-foobar",name="Host: Foobar",type="ICMP"} 1

Armed with this information, I set up a new static scrape for my Prometheus deployment:

apiVersion: monitoring.coreos.com/v1alpha1
kind: ScrapeConfig
metadata:
  name: scraping-gatus
  labels:
    prometheus: scrape-gatus
spec:
  staticConfigs:
    - labels:
        job: gatus
      targets:
        - "gatus.gatus.svc.cluster.local:8080"
  metricsPath: "/metrics"
  scheme: HTTP
  scrapeInterval: "1m"
  metricRelabelings:
    - action: drop
      sourceLabels: ["__name__"]
      regex: 'go_.*'
    - action: drop
      sourceLabels: ["__name__"]
      regex: 'promhttp_.*'
    - action: drop
      sourceLabels: ["__name__"]
      regex: 'process_.*'

Nothing special to see, besides dropping the Go runtime and process metrics I never look at anyway.
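Relabeling regexes are anchored RE2 and support alternation, so the three drop rules could also be collapsed into a single one, if you prefer:

  metricRelabelings:
    - action: drop
      sourceLabels: ["__name__"]
      regex: 'go_.*|promhttp_.*|process_.*'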

Finally, I use that data in a Grafana state timeline visualization:

A screenshot of a Grafana state timeline panel. On the left, it shows a number of service names, like 'Gitea' or 'Jellyfin'. To the right of each service name is a mostly green line, for some services interrupted by short intervals of red.

Service uptime panel in my Homelab dashboard.

The panel is driven by this Prometheus query:

gatus_results_endpoint_success

Yupp, as simple as that. In addition, I’m using Gatus’ certificate expiry metrics to drive a stat panel:

A screenshot of a Grafana stat panel. It is headed 'Cert Valid for' and currently shows '5.42 weeks' in green.

Stat panel for my public cert expiry.

It is driven by this PromQL query:

gatus_results_certificate_expiration_seconds{name="Infra: mei-home.net cert"}

Conclusion

And this concludes the Uptime Kuma to Gatus switch. This post also marks the end of phase 1 of the Nomad to k8s migration. Uptime Kuma was the last service left on Nomad; after it, only infrastructure jobs like CSI plugins and a Traefik ingress were still running. I would say that in total, this first phase of setting up the k8s cluster itself and Rook Ceph, and migrating all services over, cost me about six months. I got started in earnest towards Christmas 2023 and worked away at it until about April 2024, when I was rudely interrupted by my backup setup not being viable for k8s. I then finally got back into it a couple of months ago, at the beginning of 2025.

The next steps will be completely decommissioning the Nomad cluster and migrating the bare-metal Ceph hosts over to the Rook Ceph cluster. That work is pretty mechanical at the moment, with all of the cleanups, so the next blog post might take a while. I mean, unless something explodes in my face in an amusing way. 😅 Although I might hold a wake for my HashiCorp Nomad cluster once I’ve fully taken it down.