With release 4.4.0, Mastodon introduced a Prometheus exporter. In this post, I will configure it and show the data it provides.

With the new release, Mastodon provides metrics from Ruby and Sidekiq. I’ve attached examples for both to this post; see here for Ruby and here for Sidekiq.

The information is not actually that interesting; it’s mostly generic process data. But I did find at least the Sidekiq data worth gathering. It should provide an interesting look over time into my usage of Mastodon, and perhaps even into the activity in the Fediverse (or at least the part I’m connected to) overall.

I’m running Mastodon via the official Helm chart, so I enabled the metrics exporters via the values.yaml file like this:

mastodon:
  metrics:
    statsd:
      exporter:
        enabled: false
    prometheus:
      enabled: true
      sidekiq:
        detailed: true

As I’ve noted above, I didn’t find the Ruby data interesting at all, so I did not enable the detailed data for that.

Enabling the Prometheus exporter adds containers running the exporter to the Sidekiq and Web Pods. Both listen on port 9394 by default. Note that these ports are not exposed through any Service.
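
If you want to quickly check that the exporters respond before setting up scraping, a port-forward to one of the Pods works; the namespace and Pod name below are placeholders from my setup:

kubectl -n mastodon port-forward pod/mastodon-sidekiq-all-queues-xxxx 9394:9394
# in a second terminal
curl -s http://localhost:9394/metrics | head -n 20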

To instruct my Prometheus instance to scrape the endpoints, I created a PodMonitor like this:

apiVersion: monitoring.coreos.com/v1
kind: PodMonitor
metadata:
  name: sidekiq-metrics
  labels:
    {{- range $label, $value := .Values.commonLabels }}
    {{ $label }}: {{ $value | quote }}
    {{- end }}
spec:
  selector:
    matchLabels:
      app.kubernetes.io/component: sidekiq-all-queues
      app.kubernetes.io/instance: mastodon
      app.kubernetes.io/name: mastodon
      homelab/part-of: mastodon
  podMetricsEndpoints:
    - port: prometheus
      path: /metrics
      scheme: http
      interval: 1m
      metricRelabelings:
        - sourceLabels:
            - "__name__"
          action: drop
          regex: collector_.*
        - sourceLabels:
            - "__name__"
          action: drop
          regex: heap_.*
        - sourceLabels:
            - "__name__"
          action: drop
          regex: rss
        - sourceLabels:
            - "__name__"
          action: drop
          regex: malloc_increase_bytes_limit
        - sourceLabels:
            - "__name__"
          action: drop
          regex: oldmalloc_increase_bytes_limit
        - sourceLabels:
            - "__name__"
          action: drop
          regex: major_gc_ops_total
        - sourceLabels:
            - "__name__"
          action: drop
          regex: minor_gc_ops_total
        - sourceLabels:
            - "__name__"
          action: drop
          regex: allocated_objects_total
        - sourceLabels:
            - "__name__"
          action: drop
          regex: sidekiq_job_duration_seconds.*
        - sourceLabels:
            - "__name__"
          action: drop
          regex: active_record_connection_pool.*

Nothing really special about it, besides dropping a number of metrics I did not find interesting right at ingestion time.
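
As an aside, the long list of individual drop rules could also be written as a single rule, since Prometheus relabel regexes support alternation. This is just an equivalent sketch, not what I actually deployed:

      metricRelabelings:
        - sourceLabels:
            - "__name__"
          action: drop
          regex: (collector_.*|heap_.*|rss|malloc_increase_bytes_limit|oldmalloc_increase_bytes_limit|major_gc_ops_total|minor_gc_ops_total|allocated_objects_total|sidekiq_job_duration_seconds.*|active_record_connection_pool.*)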

One note: If you’ve got network policies in use, make sure that your Prometheus instance can actually reach the Mastodon Pods.
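
A minimal sketch of such a policy, assuming Prometheus runs in a monitoring namespace and reusing the labels from the PodMonitor selector above (adjust both to your setup):

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-prometheus-scrape
spec:
  # applies to all Mastodon Pods in the release's namespace
  podSelector:
    matchLabels:
      app.kubernetes.io/name: mastodon
  ingress:
    - from:
        # assumption: Prometheus lives in a namespace called "monitoring"
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: monitoring
      ports:
        - protocol: TCP
          port: 9394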

Next, I went to my Grafana instance and created a few panels in a fresh dashboard to show the interesting data. I started with a couple of stats panels:

A screenshot of multiple Grafana stats panels. The first one is for 'Dead Jobs', showing that right now, there are 1196 of them. Next come the failed jobs, with 57k jobs, and then the big one: 3.55 million processed jobs. That’s followed by the retry queue with 96 entries and the scheduled queue with 54 entries.

The overview stats panels in my Mastodon dashboard.
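
The stats panels are simple instant queries on the global Sidekiq gauges the exporter publishes. As a sketch, assuming the gauge names used by the prometheus_exporter gem (double-check them against the Sidekiq example linked above):

# dead, retry and scheduled set sizes
sidekiq_stats_dead_size
sidekiq_stats_retry_size
sidekiq_stats_scheduled_size
# cumulative failed and processed job counts
sidekiq_stats_failed
sidekiq_stats_processed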

Then I’ve also got two time series panels, starting with the total jobs by type:

A screenshot of a Grafana time series panel, showing six hours from 16:00 to 22:00. The legend shows a number of different job types from Mastodon's system, like the FetchReplyWorker or the RefollowWorker. The plot is stacked and hovers around 100 jobs on average. But around 16:50, 19:10, 20:28, 20:40, 21:00, 21:45, and 22:00, there are peaks of 300 to 500 jobs, driven by the ActivityPub::DeliveryWorker.

The jobs run during the given period, by type.

This plot shows the increase in jobs over the given period. It nicely shows the times when I made or boosted a post today. So this plot alone was already worth it. 🙂
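
The query behind this panel is, roughly, a per-worker increase over the panel interval. A sketch, assuming the per-job counter that the detailed Sidekiq metrics expose (verify the exact metric and label names against the Sidekiq example linked above):

# jobs executed per worker class in each interval, stacked in the panel
sum by (job_name) (
  increase(sidekiq_jobs_total[$__interval])
)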

Next, I’ve also got a plot for the failed jobs:

A screenshot of a Grafana time series panel, showing the failed jobs by type. It shows none for a lot of the time, but again, whenever I post or boost something, around the same times as the previous plot, the failed jobs shoot up, albeit only to around 22.

Failed jobs during the same time frame.
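
The failed-jobs panel follows the same pattern, just with the failure counter (again assuming the prometheus_exporter naming):

# failed jobs per worker class in each interval
sum by (job_name) (
  increase(sidekiq_failed_jobs_total[$__interval])
)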

I would have wished for a bit more info, to be honest. At least the general instance information available in Mastodon’s admin dashboard would have been nice to have as metrics.

But this is enough for now, and it’s going to be interesting to see how the daily jobs develop in the future.