<?xml version="1.0" encoding="utf-8" standalone="yes"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:content="http://purl.org/rss/1.0/modules/content/">
  <channel>
    <title>UptimeKuma on ln --help</title>
    <link>https://blog.mei-home.net/tags/uptimekuma/</link>
    <description>Recent content in UptimeKuma on ln --help</description>
    <generator>Hugo -- 0.147.2</generator>
    <language>en</language>
    <lastBuildDate>Wed, 12 Mar 2025 22:40:24 +0100</lastBuildDate>
    <atom:link href="https://blog.mei-home.net/tags/uptimekuma/index.xml" rel="self" type="application/rss+xml" />
    <item>
      <title>Nomad to k8s, Part 21: Replacing Uptime Kuma with Gatus</title>
      <link>https://blog.mei-home.net/posts/k8s-migration-21-gatus/</link>
      <pubDate>Wed, 12 Mar 2025 22:40:24 +0100</pubDate>
      <guid>https://blog.mei-home.net/posts/k8s-migration-21-gatus/</guid>
      <description>Replacing Uptime Kuma on Nomad with Gatus on k8s</description>
      <content:encoded><![CDATA[<p>Wherein I replace Uptime Kuma on Nomad with Gatus on Kubernetes.</p>
<p>This is part 21 of my <a href="https://blog.mei-home.net/tags/k8s-migration/">k8s migration series</a>.</p>
<p>For my service monitoring needs, I&rsquo;ve been using <a href="https://github.com/louislam/uptime-kuma">Uptime Kuma</a>
for a couple of years now. Please have a look at the repo&rsquo;s README for a couple
of screenshots, as I completely forgot to take some before taking my instance down. &#x1f926;
My main use for it was as a platform to monitor the
services, not so much as a dashboard. To that end, I gathered Uptime Kuma&rsquo;s data
from the integrated Prometheus exporter and displayed it on my Grafana Homelab
dashboard.</p>
<p>I had two methods for monitoring services. The main one was checking their
domains via <a href="https://developer.hashicorp.com/consul/docs/services/discovery/dns-overview">Consul&rsquo;s DNS</a>.
Because all of my services&rsquo; health checks in the Nomad/Consul setup were done by
Consul anyway, this was a pretty nice method. When a service failed its health
check, Consul would remove it from its DNS and the Uptime Kuma check would start
failing.</p>
<p>But this approach wasn&rsquo;t really enough - for example, Mastodon&rsquo;s service might
very well be up and healthy, but I might have screwed up the Traefik configuration,
meaning my dashboards were green, but Mastodon would still be unreachable. So
I slowly switched to HTTP and raw TCP socket checks to make sure that the
services were actually reachable, and not just healthy.</p>
<p>There were always two things which I didn&rsquo;t like about Uptime Kuma. First, it
requires some storage, because it stores its data in an SQLite database. Second,
the configuration can only be done via the web UI and is then stored into the
database. So no versioning of the config. And I&rsquo;ve become very fond of having
my Homelab configs under version control over the years.</p>
<p>So when it came to planning the k8s migration, I looked around and was pointed
to <a href="https://github.com/TwiN/gatus">Gatus</a>, I think by <a href="https://www.youtube.com/watch?v=LeZQjWlDUHs">this video</a> from <a href="https://www.youtube.com/@TechnoTim">Techno Tim</a>
on YouTube. It has two advantages over Uptime Kuma, namely that it does not
need any storage and that it is entirely configured via a YAML file. Of course,
the fact that it can run without storage also means that after a restart, the
history is gone. But this is fine for me, because I don&rsquo;t need a history, as
I&rsquo;m sending the data to Prometheus anyway. This is not to say that Gatus
doesn&rsquo;t support persistence. It can be run with a PostgreSQL or SQLite database.
But I don&rsquo;t need any persistence in my setup.</p>
<h2 id="setup">Setup</h2>
<p>As Gatus doesn&rsquo;t have any dependencies, I can get right into the Deployment:</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-yaml" data-lang="yaml"><span style="display:flex;"><span><span style="color:#f92672">apiVersion</span>: <span style="color:#ae81ff">apps/v1</span>
</span></span><span style="display:flex;"><span><span style="color:#f92672">kind</span>: <span style="color:#ae81ff">Deployment</span>
</span></span><span style="display:flex;"><span><span style="color:#f92672">metadata</span>:
</span></span><span style="display:flex;"><span>  <span style="color:#f92672">name</span>: <span style="color:#ae81ff">gatus</span>
</span></span><span style="display:flex;"><span><span style="color:#f92672">spec</span>:
</span></span><span style="display:flex;"><span>  <span style="color:#f92672">replicas</span>: <span style="color:#ae81ff">1</span>
</span></span><span style="display:flex;"><span>  <span style="color:#f92672">selector</span>:
</span></span><span style="display:flex;"><span>    <span style="color:#f92672">matchLabels</span>:
</span></span><span style="display:flex;"><span>      <span style="color:#f92672">homelab/app</span>: <span style="color:#ae81ff">gatus</span>
</span></span><span style="display:flex;"><span>  <span style="color:#f92672">strategy</span>:
</span></span><span style="display:flex;"><span>    <span style="color:#f92672">type</span>: <span style="color:#e6db74">&#34;Recreate&#34;</span>
</span></span><span style="display:flex;"><span>  <span style="color:#f92672">template</span>:
</span></span><span style="display:flex;"><span>    <span style="color:#f92672">metadata</span>:
</span></span><span style="display:flex;"><span>      <span style="color:#f92672">labels</span>:
</span></span><span style="display:flex;"><span>        <span style="color:#f92672">homelab/app</span>: <span style="color:#ae81ff">gatus</span>
</span></span><span style="display:flex;"><span>      <span style="color:#f92672">annotations</span>:
</span></span><span style="display:flex;"><span>        <span style="color:#f92672">checksum/config</span>: {{ <span style="color:#ae81ff">include (print $.Template.BasePath &#34;/gatus-config.yaml&#34;) . | sha256sum }}</span>
</span></span><span style="display:flex;"><span>    <span style="color:#f92672">spec</span>:
</span></span><span style="display:flex;"><span>      <span style="color:#f92672">automountServiceAccountToken</span>: <span style="color:#66d9ef">false</span>
</span></span><span style="display:flex;"><span>      <span style="color:#f92672">securityContext</span>:
</span></span><span style="display:flex;"><span>        <span style="color:#f92672">fsGroup</span>: <span style="color:#ae81ff">1000</span>
</span></span><span style="display:flex;"><span>        <span style="color:#f92672">sysctls</span>:
</span></span><span style="display:flex;"><span>          - <span style="color:#f92672">name</span>: <span style="color:#ae81ff">net.ipv4.ping_group_range</span>
</span></span><span style="display:flex;"><span>            <span style="color:#f92672">value</span>: <span style="color:#ae81ff">0</span> <span style="color:#ae81ff">65536</span>
</span></span><span style="display:flex;"><span>      <span style="color:#f92672">containers</span>:
</span></span><span style="display:flex;"><span>        - <span style="color:#f92672">name</span>: <span style="color:#ae81ff">gatus</span>
</span></span><span style="display:flex;"><span>          <span style="color:#f92672">image</span>: <span style="color:#ae81ff">twinproduction/gatus:{{ .Values.appVersion }}</span>
</span></span><span style="display:flex;"><span>          <span style="color:#f92672">securityContext</span>:
</span></span><span style="display:flex;"><span>            <span style="color:#f92672">capabilities</span>:
</span></span><span style="display:flex;"><span>              <span style="color:#f92672">add</span>:
</span></span><span style="display:flex;"><span>                - <span style="color:#ae81ff">CAP_NET_RAW</span>
</span></span><span style="display:flex;"><span>          <span style="color:#f92672">volumeMounts</span>:
</span></span><span style="display:flex;"><span>            - <span style="color:#f92672">name</span>: <span style="color:#ae81ff">config</span>
</span></span><span style="display:flex;"><span>              <span style="color:#f92672">mountPath</span>: <span style="color:#ae81ff">/config</span>
</span></span><span style="display:flex;"><span>              <span style="color:#f92672">readOnly</span>: <span style="color:#66d9ef">true</span>
</span></span><span style="display:flex;"><span>          <span style="color:#f92672">resources</span>:
</span></span><span style="display:flex;"><span>            <span style="color:#f92672">requests</span>:
</span></span><span style="display:flex;"><span>              <span style="color:#f92672">cpu</span>: <span style="color:#ae81ff">250m</span>
</span></span><span style="display:flex;"><span>              <span style="color:#f92672">memory</span>: <span style="color:#ae81ff">100Mi</span>
</span></span><span style="display:flex;"><span>          <span style="color:#f92672">env</span>:
</span></span><span style="display:flex;"><span>            - <span style="color:#f92672">name</span>: <span style="color:#ae81ff">GATUS_LOG_LEVEL</span>
</span></span><span style="display:flex;"><span>              <span style="color:#f92672">value</span>: <span style="color:#e6db74">&#34;DEBUG&#34;</span>
</span></span><span style="display:flex;"><span>          <span style="color:#f92672">livenessProbe</span>:
</span></span><span style="display:flex;"><span>            <span style="color:#f92672">httpGet</span>:
</span></span><span style="display:flex;"><span>              <span style="color:#f92672">port</span>: {{ <span style="color:#ae81ff">.Values.port }}</span>
</span></span><span style="display:flex;"><span>              <span style="color:#f92672">path</span>: <span style="color:#e6db74">&#34;/health&#34;</span>
</span></span><span style="display:flex;"><span>            <span style="color:#f92672">initialDelaySeconds</span>: <span style="color:#ae81ff">15</span>
</span></span><span style="display:flex;"><span>            <span style="color:#f92672">periodSeconds</span>: <span style="color:#ae81ff">30</span>
</span></span><span style="display:flex;"><span>          <span style="color:#f92672">ports</span>:
</span></span><span style="display:flex;"><span>            - <span style="color:#f92672">name</span>: <span style="color:#ae81ff">gatus-http</span>
</span></span><span style="display:flex;"><span>              <span style="color:#f92672">containerPort</span>: {{ <span style="color:#ae81ff">.Values.port }}</span>
</span></span><span style="display:flex;"><span>              <span style="color:#f92672">protocol</span>: <span style="color:#ae81ff">TCP</span>
</span></span><span style="display:flex;"><span>      <span style="color:#f92672">volumes</span>:
</span></span><span style="display:flex;"><span>        - <span style="color:#f92672">name</span>: <span style="color:#ae81ff">config</span>
</span></span><span style="display:flex;"><span>          <span style="color:#f92672">configMap</span>:
</span></span><span style="display:flex;"><span>            <span style="color:#f92672">name</span>: <span style="color:#ae81ff">gatus-conf</span>
</span></span></code></pre></div><p>There&rsquo;s not much interesting to say about the Deployment, it&rsquo;s pretty much
the standard Deployment in my Homelab, with two exceptions: the <code>CAP_NET_RAW</code>
capability I&rsquo;m adding to the container, and the <code>sysctls</code> setting:</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-yaml" data-lang="yaml"><span style="display:flex;"><span>      <span style="color:#f92672">securityContext</span>:
</span></span><span style="display:flex;"><span>        <span style="color:#f92672">fsGroup</span>: <span style="color:#ae81ff">1000</span>
</span></span><span style="display:flex;"><span>        <span style="color:#f92672">sysctls</span>:
</span></span><span style="display:flex;"><span>          - <span style="color:#f92672">name</span>: <span style="color:#ae81ff">net.ipv4.ping_group_range</span>
</span></span><span style="display:flex;"><span>            <span style="color:#f92672">value</span>: <span style="color:#ae81ff">0</span> <span style="color:#ae81ff">65536</span>
</span></span><span style="display:flex;"><span>[<span style="color:#ae81ff">...]</span>
</span></span><span style="display:flex;"><span>          <span style="color:#f92672">securityContext</span>:
</span></span><span style="display:flex;"><span>            <span style="color:#f92672">capabilities</span>:
</span></span><span style="display:flex;"><span>              <span style="color:#f92672">add</span>:
</span></span><span style="display:flex;"><span>                - <span style="color:#ae81ff">CAP_NET_RAW</span>
</span></span></code></pre></div><p>These are due to my usage of pings to determine whether a host is up or not.
When initially running without those configs, I got the following:</p>
<pre tabindex="0"><code>2025/03/08 16:09:38 [watchdog.execute] Monitored group=Hosts; endpoint=Host: Foobar; key=hosts_host:-Foobar; success=false; errors=0; duration=0s; body=
</code></pre><p>Not too helpful, but it indicated that the host <code>foobar</code> was not returning the
pings. But I knew the host was up, and I knew I was able to ping it from the
host running the Gatus pod. After some searching, I found <a href="https://github.com/TwiN/gatus/issues/697">this issue</a>,
and the explanation that sending pings requires some privileges. For the <code>ping</code>
executable, this is traditionally handled by setting the setuid bit on the binary, which is owned by <code>root</code>.
But here, the ping is executed through a Go library, not by running the <code>ping</code>
executable. And because the container doesn&rsquo;t run as root, the Gatus process just
doesn&rsquo;t have enough privileges to ping anything.
On a lower level, <code>ping</code> uses raw network sockets, which are privileged in the
Linux kernel.
The <code>sysctls</code> setting was proposed as a solution in the issue I linked above,
but only setting that did not work for me. I had to add the <code>CAP_NET_RAW</code>
capability. Still better than running the container in fully privileged mode.</p>
<h2 id="configuration">Configuration</h2>
<p>Gatus allows configuration via a YAML file. The common part of my config
looks like this:</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-yaml" data-lang="yaml"><span style="display:flex;"><span><span style="color:#f92672">metrics</span>: <span style="color:#66d9ef">true</span>
</span></span><span style="display:flex;"><span><span style="color:#f92672">storage</span>:
</span></span><span style="display:flex;"><span>  <span style="color:#f92672">type</span>: <span style="color:#ae81ff">memory</span>
</span></span><span style="display:flex;"><span><span style="color:#f92672">web</span>:
</span></span><span style="display:flex;"><span>  <span style="color:#f92672">port</span>: {{ <span style="color:#ae81ff">.Values.port }}</span>
</span></span><span style="display:flex;"><span><span style="color:#f92672">ui</span>:
</span></span><span style="display:flex;"><span>  <span style="color:#f92672">title</span>: <span style="color:#e6db74">&#34;Meiers Homelab&#34;</span>
</span></span><span style="display:flex;"><span>  <span style="color:#f92672">description</span>: <span style="color:#e6db74">&#34;Monitoring for Meiers Homelab&#34;</span>
</span></span></code></pre></div><p>Again, nothing really noteworthy, enabling the <code>memory</code> storage type and the
metrics endpoint, which exposes Prometheus metrics for every endpoint at <code>/metrics</code>.</p>
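<p>For context, a minimal sketch of how this config could be wrapped in the <code>gatus-conf</code> ConfigMap
mounted by the Deployment above. The <code>config.yaml</code> key name is an assumption, chosen because
Gatus looks for its config at <code>/config/config.yaml</code> in the container by default. The actual
template may look different.</p>
<pre tabindex="0"><code class="language-yaml" data-lang="yaml">apiVersion: v1
kind: ConfigMap
metadata:
  name: gatus-conf
data:
  # Assumed key name: with the ConfigMap mounted at /config,
  # this file ends up at /config/config.yaml.
  config.yaml: |
    metrics: true
    storage:
      type: memory
    web:
      port: {{ .Values.port }}
    # The rest of the Gatus config follows here.
</code></pre>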
<p>Then come the endpoints, which is Gatus&rsquo; name for &ldquo;things to monitor&rdquo;.
I will show a couple of examples for the different things I monitor, starting
with the host monitoring via ping:</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-yaml" data-lang="yaml"><span style="display:flex;"><span>- <span style="color:#f92672">name</span>: <span style="color:#e6db74">&#34;Host: Foobar&#34;</span>
</span></span><span style="display:flex;"><span>  <span style="color:#f92672">group</span>: <span style="color:#e6db74">&#34;Hosts&#34;</span>
</span></span><span style="display:flex;"><span>  <span style="color:#f92672">url</span>: <span style="color:#e6db74">&#34;icmp://foobar.home&#34;</span>
</span></span><span style="display:flex;"><span>  <span style="color:#f92672">interval</span>: <span style="color:#ae81ff">5m</span>
</span></span><span style="display:flex;"><span>  <span style="color:#f92672">conditions</span>:
</span></span><span style="display:flex;"><span>    - <span style="color:#e6db74">&#34;[CONNECTED] == true&#34;</span>
</span></span></code></pre></div><p>This config sends a ping to <code>foobar.home</code> every five minutes and registers the
check as successful if it receives a reply.
It also puts the check into the <code>Hosts</code> group. This is where Gatus is a bit less
flexible than Uptime Kuma, which allows creating individual dashboards.</p>
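<p>In the config file, checks like this one are entries of the top-level <code>endpoints</code> list.
A short sketch with two hosts, where the second host, <code>barfoo.home</code>, is a made-up name for
illustration only. Because both entries use the same <code>group</code> value, they should show up under
the same heading in the web UI:</p>
<pre tabindex="0"><code class="language-yaml" data-lang="yaml">endpoints:
  - name: &#34;Host: Foobar&#34;
    group: &#34;Hosts&#34;
    url: &#34;icmp://foobar.home&#34;
    interval: 5m
    conditions:
      - &#34;[CONNECTED] == true&#34;
  # Hypothetical second host in the same group.
  - name: &#34;Host: Barfoo&#34;
    group: &#34;Hosts&#34;
    url: &#34;icmp://barfoo.home&#34;
    interval: 5m
    conditions:
      - &#34;[CONNECTED] == true&#34;
</code></pre>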
<p>Next, I&rsquo;m using TCP socket connections to check whether my Ceph MON daemons
are up, at least insofar as they accept connections:</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-yaml" data-lang="yaml"><span style="display:flex;"><span>- <span style="color:#f92672">name</span>: <span style="color:#e6db74">&#34;Ceph: Mon Baz&#34;</span>
</span></span><span style="display:flex;"><span>  <span style="color:#f92672">group</span>: <span style="color:#e6db74">&#34;Ceph&#34;</span>
</span></span><span style="display:flex;"><span>  <span style="color:#f92672">url</span>: <span style="color:#e6db74">&#34;tcp://baz.home:6789&#34;</span>
</span></span><span style="display:flex;"><span>  <span style="color:#f92672">interval</span>: <span style="color:#ae81ff">2m</span>
</span></span><span style="display:flex;"><span>  <span style="color:#f92672">conditions</span>:
</span></span><span style="display:flex;"><span>    - <span style="color:#e6db74">&#34;[CONNECTED] == true&#34;</span>
</span></span></code></pre></div><p>This check tries to establish a TCP connection to the host:port given in the URL.
I also wanted to configure a check on the health of the Ceph cluster
overall. Ceph&rsquo;s MGR/dashboard module supplies one at <a href="https://docs.ceph.com/en/reef/mgr/ceph_api/#health">/api/health</a>,
or rather a few different ones with different levels of detail. And Gatus itself allows you
to check a lot of different things in the body of the response received by an
HTTP check. But the issue here was that Gatus doesn&rsquo;t support simple basic auth
for monitored endpoints, and Ceph itself only allows authenticated access to the
HTTP API, including the health endpoint.</p>
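<p>To illustrate what such body checks can look like, here is a sketch of an HTTP endpoint with
conditions on a JSON response. The service, the URL and the <code>status</code> field are all hypothetical,
and the thresholds are arbitrary. The <code>[BODY].status</code> placeholder is Gatus&rsquo; JSONPath-style
syntax for reading a field from the response body:</p>
<pre tabindex="0"><code class="language-yaml" data-lang="yaml"># Hypothetical service, not part of my actual config.
- name: &#34;Services: Example&#34;
  group: &#34;Services&#34;
  url: &#34;https://service.example.com/health&#34;
  interval: 2m
  conditions:
    - &#34;[STATUS] == 200&#34;
    # Assumes a JSON body along the lines of {&#34;status&#34;: &#34;UP&#34;}.
    - &#34;[BODY].status == UP&#34;
    # Response time in milliseconds.
    - &#34;[RESPONSE_TIME] &lt; 500&#34;
</code></pre>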
<p>As a short aside, I&rsquo;m still a bit torn on authenticated health endpoints. I think
that they should definitely be an option - if you&rsquo;ve got auth infrastructure for
everything anyway, there&rsquo;s not much cost for setting your monitoring up with a
valid token. But in a Homelab, it gets really annoying really fast. On the other
hand, any unauthenticated endpoint is a potential entryway into your app. So I
understand putting that behind auth. But I would like it to be optional, please.
Give me an option to say &ldquo;Yes, authenticate everything - besides the health API&rdquo;.
Sure, I could set up OAuth2 for the Ceph API and then configure Gatus to use it,
but that seems just a bit too much hassle, considering that I&rsquo;m already getting
the health status via Prometheus scraping anyway.</p>
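<p>For reference, Gatus&rsquo; OAuth2 support is configured per endpoint in the <code>client</code> section and
uses the client credentials flow. A rough sketch, in which the hostnames, the API path, the client ID,
the scope and the environment variable are all placeholders. Whether the Ceph API can actually be put
behind such a setup is not verified here:</p>
<pre tabindex="0"><code class="language-yaml" data-lang="yaml"># Hypothetical check, every value below is a placeholder.
- name: &#34;Ceph: Health&#34;
  group: &#34;Ceph&#34;
  url: &#34;https://ceph.example.com/api/health/minimal&#34;
  interval: 2m
  client:
    oauth2:
      token-url: &#34;https://auth.example.com/token&#34;
      client-id: &#34;gatus&#34;
      # Gatus substitutes environment variables in its config file.
      client-secret: &#34;${CEPH_OAUTH_SECRET}&#34;
      scopes: [&#34;health&#34;]
  conditions:
    - &#34;[STATUS] == 200&#34;
</code></pre>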
<p>Okay, next example is an HTTP check on my Consul server:</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-yaml" data-lang="yaml"><span style="display:flex;"><span>- <span style="color:#f92672">name</span>: <span style="color:#e6db74">&#34;Cluster: Consul&#34;</span>
</span></span><span style="display:flex;"><span>  <span style="color:#f92672">group</span>: <span style="color:#e6db74">&#34;Cluster&#34;</span>
</span></span><span style="display:flex;"><span>  <span style="color:#f92672">url</span>: <span style="color:#e6db74">&#34;https://consul.example.com:8501/v1/status/leader&#34;</span>
</span></span><span style="display:flex;"><span>  <span style="color:#f92672">method</span>: <span style="color:#e6db74">&#34;GET&#34;</span>
</span></span><span style="display:flex;"><span>  <span style="color:#f92672">interval</span>: <span style="color:#ae81ff">2m</span>
</span></span><span style="display:flex;"><span>  <span style="color:#f92672">conditions</span>:
</span></span><span style="display:flex;"><span>    - <span style="color:#e6db74">&#34;[STATUS] == 200&#34;</span>
</span></span><span style="display:flex;"><span>  <span style="color:#f92672">client</span>:
</span></span><span style="display:flex;"><span>    <span style="color:#f92672">insecure</span>: <span style="color:#66d9ef">true</span>
</span></span></code></pre></div><p>The <code>insecure: true</code> option is required here, because the Consul server uses my
internal CA, and providing the CA certs to Gatus was just a bit too much hassle,
especially for a service I will be taking down soon anyway.</p>
<p>Next up, checking whether my internal authoritative DNS server is working:</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-yaml" data-lang="yaml"><span style="display:flex;"><span>- <span style="color:#f92672">name</span>: <span style="color:#e6db74">&#34;Infra: DNS Bar&#34;</span>
</span></span><span style="display:flex;"><span>  <span style="color:#f92672">group</span>: <span style="color:#e6db74">&#34;Infra&#34;</span>
</span></span><span style="display:flex;"><span>  <span style="color:#f92672">url</span>: <span style="color:#e6db74">&#34;bar.home:53&#34;</span>
</span></span><span style="display:flex;"><span>  <span style="color:#f92672">interval</span>: <span style="color:#ae81ff">2m</span>
</span></span><span style="display:flex;"><span>  <span style="color:#f92672">dns</span>:
</span></span><span style="display:flex;"><span>    <span style="color:#f92672">query-name</span>: <span style="color:#e6db74">&#34;ingress.example.com&#34;</span>
</span></span><span style="display:flex;"><span>    <span style="color:#f92672">query-type</span>: <span style="color:#e6db74">&#34;A&#34;</span>
</span></span><span style="display:flex;"><span>  <span style="color:#f92672">conditions</span>:
</span></span><span style="display:flex;"><span>    - <span style="color:#e6db74">&#34;[BODY] == 300.300.300.1&#34;</span>
</span></span><span style="display:flex;"><span>    - <span style="color:#e6db74">&#34;[DNS_RCODE] == NOERROR&#34;</span>
</span></span></code></pre></div><p>This check makes a DNS request for <code>ingress.example.com</code> to <code>bar.home</code> and then
checks that the response is the correct IP, and that there was no error.
I&rsquo;m running this check with the IP of my ingress, because it&rsquo;s a stable IP that
doesn&rsquo;t change, and the ingress is probably the most stable component in my
setup.</p>
<p>Last but not least, here is the config for checking how long a cert is going
to be valid:</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-yaml" data-lang="yaml"><span style="display:flex;"><span>- <span style="color:#f92672">name</span>: <span style="color:#e6db74">&#34;Infra: mei-home.net cert&#34;</span>
</span></span><span style="display:flex;"><span>  <span style="color:#f92672">group</span>: <span style="color:#e6db74">&#34;Infra&#34;</span>
</span></span><span style="display:flex;"><span>  <span style="color:#f92672">url</span>: <span style="color:#e6db74">&#34;https://blog.mei-home.net&#34;</span>
</span></span><span style="display:flex;"><span>  <span style="color:#f92672">interval</span>: <span style="color:#ae81ff">12h</span>
</span></span><span style="display:flex;"><span>  <span style="color:#f92672">conditions</span>:
</span></span><span style="display:flex;"><span>    - <span style="color:#e6db74">&#34;[CERTIFICATE_EXPIRATION] &gt; 72h&#34;</span>
</span></span></code></pre></div><p>This one uses my blog to check whether my cert for mei-home.net is still valid
for at least three days.</p>
<p>And this is what the web UI looks like:
<figure>
    <img loading="lazy" src="gatus-groups.png"
         alt="A screenshot of Gatus&#39; main dashboard. It&#39;s headed &#39;Health Status&#39; and shows several groups as collapsed lists. Each group has a name, in this case &#39;Ceph&#39;, &#39;Cluster&#39;, &#39;Hosts&#39;, &#39;Infra&#39;, &#39;K8s&#39; and &#39;Services&#39;. To the right of the name of each group is a green check mark indicating the group&#39;s current status, which turns into a red X if any of the checks in that group fails."/> <figcaption>
            <p>Gatus dashboard with all groups collapsed.</p>
        </figcaption>
</figure>

Each individual check is then shown like this when the group is expanded:
<figure>
    <img loading="lazy" src="gatus-service.png"
         alt="A screenshot of an expanded check in Gatus&#39; dashboard. It shows the name of the check at the top and then a row of green check marks below that, one for each recent execution of the check. To the right, it also shows the average duration of the check, 41 ms in this case for the blog.mei-home.net check. To the very left and very right, the execution time of the oldest and newest check is shown, respectively."/> <figcaption>
            <p>Expanded service in Gatus&rsquo; web UI.</p>
        </figcaption>
</figure>
</p>
<p>I don&rsquo;t foresee visiting this page too often, as I will mostly get the information
from the Grafana dashboard I will describe in the next section.</p>
<h2 id="metrics-and-grafana">Metrics and Grafana</h2>
<p>Gatus provides metrics in Prometheus format at the <code>/metrics</code> endpoint:</p>
<pre tabindex="0"><code># HELP gatus_results_certificate_expiration_seconds Number of seconds until the certificate expires
# TYPE gatus_results_certificate_expiration_seconds gauge
gatus_results_certificate_expiration_seconds{group=&#34;Infra&#34;,key=&#34;infra_infra:-mei-home-net-cert&#34;,name=&#34;Infra: mei-home.net cert&#34;,type=&#34;HTTP&#34;} 3.276935592538658e+06
# HELP gatus_results_endpoint_success Displays whether or not the endpoint was a success
# TYPE gatus_results_endpoint_success gauge
gatus_results_endpoint_success{group=&#34;Hosts&#34;,key=&#34;hosts_host:-foobar&#34;,name=&#34;Host: Foobar&#34;,type=&#34;ICMP&#34;} 1
</code></pre><p>Armed with this information, I set up a new static scrape for my <a href="https://blog.mei-home.net/posts/k8s-migration-9-prometheus/">Prometheus deployment</a>:</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-yaml" data-lang="yaml"><span style="display:flex;"><span><span style="color:#f92672">apiVersion</span>: <span style="color:#ae81ff">monitoring.coreos.com/v1alpha1</span>
</span></span><span style="display:flex;"><span><span style="color:#f92672">kind</span>: <span style="color:#ae81ff">ScrapeConfig</span>
</span></span><span style="display:flex;"><span><span style="color:#f92672">metadata</span>:
</span></span><span style="display:flex;"><span>  <span style="color:#f92672">name</span>: <span style="color:#ae81ff">scraping-gatus</span>
</span></span><span style="display:flex;"><span>  <span style="color:#f92672">labels</span>:
</span></span><span style="display:flex;"><span>    <span style="color:#f92672">prometheus</span>: <span style="color:#ae81ff">scrape-gatus</span>
</span></span><span style="display:flex;"><span><span style="color:#f92672">spec</span>:
</span></span><span style="display:flex;"><span>  <span style="color:#f92672">staticConfigs</span>:
</span></span><span style="display:flex;"><span>    - <span style="color:#f92672">labels</span>:
</span></span><span style="display:flex;"><span>        <span style="color:#f92672">job</span>: <span style="color:#ae81ff">gatus</span>
</span></span><span style="display:flex;"><span>      <span style="color:#f92672">targets</span>:
</span></span><span style="display:flex;"><span>        - <span style="color:#e6db74">&#34;gatus.gatus.svc.cluster.local:8080&#34;</span>
</span></span><span style="display:flex;"><span>  <span style="color:#f92672">metricsPath</span>: <span style="color:#e6db74">&#34;/metrics&#34;</span>
</span></span><span style="display:flex;"><span>  <span style="color:#f92672">scheme</span>: <span style="color:#ae81ff">HTTP</span>
</span></span><span style="display:flex;"><span>  <span style="color:#f92672">scrapeInterval</span>: <span style="color:#e6db74">&#34;1m&#34;</span>
</span></span><span style="display:flex;"><span>  <span style="color:#f92672">metricRelabelings</span>:
</span></span><span style="display:flex;"><span>    - <span style="color:#f92672">action</span>: <span style="color:#ae81ff">drop</span>
</span></span><span style="display:flex;"><span>      <span style="color:#f92672">sourceLabels</span>: [<span style="color:#e6db74">&#34;__name__&#34;</span>]
</span></span><span style="display:flex;"><span>      <span style="color:#f92672">regex</span>: <span style="color:#e6db74">&#39;go_.*&#39;</span>
</span></span><span style="display:flex;"><span>    - <span style="color:#f92672">action</span>: <span style="color:#ae81ff">drop</span>
</span></span><span style="display:flex;"><span>      <span style="color:#f92672">sourceLabels</span>: [<span style="color:#e6db74">&#34;__name__&#34;</span>]
</span></span><span style="display:flex;"><span>      <span style="color:#f92672">regex</span>: <span style="color:#e6db74">&#39;promhttp_.*&#39;</span>
</span></span><span style="display:flex;"><span>    - <span style="color:#f92672">action</span>: <span style="color:#ae81ff">drop</span>
</span></span><span style="display:flex;"><span>      <span style="color:#f92672">sourceLabels</span>: [<span style="color:#e6db74">&#34;__name__&#34;</span>]
</span></span><span style="display:flex;"><span>      <span style="color:#f92672">regex</span>: <span style="color:#e6db74">&#39;process_.*&#39;</span>
</span></span></code></pre></div><p>Nothing special to see, besides filtering out some app metrics I never look at
anyway.</p>
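<p>The static target <code>gatus.gatus.svc.cluster.local:8080</code> implies a Service named <code>gatus</code> in a
<code>gatus</code> namespace. A minimal sketch of what such a Service could look like, reusing the label and
port name from the Deployment. The namespace and the port number are inferred from the scrape target,
so treat them as assumptions:</p>
<pre tabindex="0"><code class="language-yaml" data-lang="yaml">apiVersion: v1
kind: Service
metadata:
  name: gatus
  # Inferred from the DNS name in the scrape target.
  namespace: gatus
spec:
  selector:
    homelab/app: gatus
  ports:
    - name: gatus-http
      # Assumed from the scrape target.
      port: 8080
      # Refers to the named container port in the Deployment.
      targetPort: gatus-http
      protocol: TCP
</code></pre>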
<p>Finally, I use that data in a <a href="https://grafana.com/docs/grafana/latest/panels-visualizations/visualizations/state-timeline/">Grafana state timeline visualization</a>:
<figure>
    <img loading="lazy" src="state-timeline.png"
         alt="A screenshot of a Grafana state timeline panel. On the left, it shows a number of service names, like &#39;Gitea&#39; or &#39;Jellyfin&#39;. To the right of each service name is a mostly green line, for some services interrupted by short intervals of red. "/> <figcaption>
            <p>Service uptime panel in my Homelab dashboard.</p>
        </figcaption>
</figure>
</p>
<p>The panel is driven by this Prometheus query:</p>
<pre tabindex="0"><code>gatus_results_endpoint_success
</code></pre><p>Yup, as simple as that.
In addition, I&rsquo;m using Gatus&rsquo; certificate expiry metrics to drive a <a href="https://grafana.com/docs/grafana/latest/panels-visualizations/visualizations/stat/">stat panel</a>:
<figure>
    <img loading="lazy" src="cert-expiry.png"
         alt="A screenshot of a Grafana stat panel. It is headed &#39;Cert Valid for&#39; and currently shows &#39;5.42 weeks&#39; in green."/> <figcaption>
            <p>Stat panel for my public cert expiry.</p>
        </figcaption>
</figure>

It is driven by this PromQL query:</p>
<pre tabindex="0"><code>gatus_results_certificate_expiration_seconds{name=&#34;Infra: mei-home.net cert&#34;}
</code></pre><h2 id="conclusion">Conclusion</h2>
<p>And this concludes the Uptime Kuma to Gatus switch post. It also marks
the end of phase 1 of the Nomad to k8s migration. Uptime Kuma was the last service
left on Nomad. With it gone, only infrastructure jobs like CSI plugins and
a Traefik ingress are still running there.
I would say in total, this first phase of setting up the k8s cluster itself,
Rook Ceph and migrating all services over cost me about six months or so. I got
started in earnest towards Christmas 2023, and then worked away at it until about
April, when I was rudely interrupted by my backup setup not being viable for
k8s. I then finally got back into it a couple of months ago, in the beginning
of 2025.</p>
<p>The next steps will be completely decommissioning the Nomad cluster and migrating
the baremetal Ceph hosts over to the Rook Ceph cluster. The work is pretty
mechanical at the moment, with all of the cleanups, so the next blog post might
take a while. I mean, unless something explodes in my face in an amusing way. &#x1f605; Although I might hold a wake for my HashiCorp Nomad cluster once
I&rsquo;ve fully taken it down.</p>
]]></content:encoded>
    </item>
  </channel>
</rss>
