I had to clean up my Prometheus data, and it got pretty darned close there.

When it comes to my metrics, I’m very much a data hoarder. Metrics gathering was what got me into Homelabbing as a hobby, instead of just a means to an end. Telegraf/Influx/Grafana were the first new services on my Homeserver in about five years. And I really do like looking at my dashboards, including looking at past data. My retention period currently is five years. And I’m already pretty sure that when I come up to those five years for the initial data, I will just extend that to 10 years. 😅

But back in the beginning of June, I hit 98.5% utilization on the Ceph RBD which was housing my Prometheus TSDB. The volume has 100 GB of available space. And it was full. I migrated Prometheus to my new k8s cluster back in March. In the same setup, I also deployed kube-prometheus-stack to gather the data from my k8s cluster. At the same time, I’m still gathering data from my Nomad/Consul cluster. I’m also gathering data from two Ceph clusters, one being my old baremetal cluster and one being my new Ceph Rook cluster. Since the k8s migration started, I’m also gathering data for eight additional hosts, plus the data from the k8s cluster itself. That additional data has had quite an impact. Here is the growth of the utilization of the volume storing the Prometheus TSDB:

A Grafana time series plot. It shows time, starting from 3rd of March through to June 26th on the x axis, and volume utilization in percent on the y axis. The curve starts at around 50% in the beginning of March and very consistently rises up to about 100% by June eighth. On June eighth, there is a strong drop to 72%, then some jitter and a final drop to 69% around June tenth. After that, the curve steadily grows again, although at a lower rate than before.

Storage volume utilization of the TSDB

The initial 50%/50GB was the data I had gathered since around February 2021, which was when I initially switched from Influx to Prometheus. The next 50% then came in less than three months. It was clear that I had to slow down the rate of growth.

And so I embarked on a project to drop uninteresting metrics from scraping and to delete the data for those metrics already in Prometheus. My main guidance for this action was this blog post.

Figuring out what to keep and what to drop

The first step was to take inventory. Tanmay describes a method to find the most costly metrics in their post, but I wasn’t really after costly metrics, just ones I knew I was not interested in now and would probably never be interested in in the future. This took some thinking, because what if I suddenly realized a metric was super interesting, but didn’t have five years’ worth of data on it?!

The horror! 😱

But before I could start, I had to port-forward the Prometheus port to my local machine, because Prometheus is not directly accessible outside the cluster:

kubectl port-forward -n monitoring prometheus-monitoring-kube-prometheus-prometheus-0 28015:9090

First step: getting the names of all metrics currently in the TSDB:

curl http://localhost:28015/api/v1/label/__name__/values | tr ',' '\n' | tr -d '"' > all.txt

This produces a list with all of the metric names, like this:

apiserver_client_certificate_expiration_seconds_bucket
apiserver_client_certificate_expiration_seconds_count
apiserver_client_certificate_expiration_seconds_sum
apiserver_current_inflight_requests
apiserver_current_inqueue_requests

This resulted in a grand total of 3455 metrics. And remember, those are entire metrics - not unique permutations of labels!
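
As an aside, the tr pipeline above leaves the JSON envelope of the API response on the first and last lines of the file. If jq happens to be available, a slightly cleaner variant might look like this:

# Same inventory as above, letting jq strip the JSON envelope instead of tr.
curl -s http://localhost:28015/api/v1/label/__name__/values | jq -r '.data[]' > all.txt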

Because I didn’t have any good way to make a decision just from the series name, I went through each and every one of them and plonked them into Grafana’s explore tab and looked at them.

In the end, I had 1279 metrics I wanted to keep, and 3176 I wanted to drop. But before I went and deleted the series from Prometheus, I had to stop Prometheus from ingesting them. The overwhelming majority of metrics I wanted to drop came from my Ceph clusters and the k8s cluster itself.
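
For the bookkeeping, something like comm can split the inventory into the two lists. all.txt is the inventory from above; keep.txt and drop.txt are just the names I’d use for the curated lists:

# comm needs both inputs sorted.
sort all.txt -o all.txt
sort keep.txt -o keep.txt
# Everything in the inventory that is not on the keep list becomes the drop list.
comm -23 all.txt keep.txt > drop.txt
wc -l keep.txt drop.txt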

To drop a metric, you can use Prometheus’ metric relabeling with the drop action. Say you want to drop all VolumeAttachment metrics; you could add the following list item to the metricRelabelings of the corresponding ServiceMonitor:

- action: drop
  regex: 'kube_volumeattachment_.*'
  sourceLabels: [__name__]

It’s important to note that this does not stop Prometheus from scraping the metric. The metric just isn’t written to the database, and is instead dropped.

I could comfortably add all the k8s metrics I wanted to drop to the values.yaml for the kube-prometheus-stack Helm chart. The only thing to look out for is that there is a separate ServiceMonitor for each individual k8s component, e.g. kube-scheduler, apiserver and etcd all have their own. To figure out which ServiceMonitor a specific metric drop belongs in, it helps to look at the job label, which names the k8s component that produced the metric. But watch out! Some metrics are actually produced by more than one component, so check all the label values, not just the most recently scraped one. Also look out for the names of the metrics. For example, just because a metric starts with apiserver_ doesn’t necessarily mean that it is produced by the apiserver and hence should be dropped there.
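
To double-check which jobs actually expose a given metric before writing the drop rule, the label values API with a series selector helps. The metric name here is just a stand-in:

# List every scrape job that currently produces this metric, so the drop rule
# ends up in the right ServiceMonitor.
curl -sG http://localhost:28015/api/v1/label/job/values \
  --data-urlencode 'match[]=apiserver_request_total'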

As I mentioned above, Ceph was the other big source of drop-able metrics. But here, I hit a severe disappointment: as nice as Rook is, it doesn’t currently support configuring the ServiceMonitors it creates, at all. So while I could drop the unwanted metrics for my baremetal cluster, the same metrics are still gathered for the Ceph Rook cluster. I will have to look into that in the near future.

Overall, this was a pretty time-consuming procedure, but at least at the end I was pretty happy with the number of metrics I was able to drop.

I’ve just done another very “rough and ready” check, and it seems that my dropping of metrics did not actually slow the growth down that much. Before I started dropping all those metrics, the disk usage was growing at about 600 MB per 24h. Now it’s at approximately 500 MB per 24h. Not really that much gain, to be honest.
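
For anyone wanting to reproduce such a number, a query along these lines should do. The metric and label names are assumptions based on the kubelet volume stats that kube-prometheus-stack scrapes by default:

# Rough daily TSDB growth: the 24h increase of the volume's used bytes,
# averaged over the last week.
curl -sG http://localhost:28015/api/v1/query \
  --data-urlencode 'query=avg_over_time(delta(kubelet_volume_stats_used_bytes{namespace="monitoring",persistentvolumeclaim=~"prometheus-.+"}[24h])[7d:1h])'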

Deleting unwanted metrics

Deleting unwanted data in Prometheus is a two-step process. First, the unwanted series need to be marked for deletion via the delete_series endpoint. Once that’s done, the data needs to actually be removed from disk via the clean_tombstones endpoint. Both are part of Prometheus’ TSDB admin API, which has to be explicitly enabled, either with the --web.enable-admin-api flag or, in the kube-prometheus-stack values, with enableAdminAPI: true in the prometheusSpec.
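
Whether the admin API is actually enabled can be checked quickly via the flags endpoint:

# The admin endpoints only work when Prometheus runs with --web.enable-admin-api.
curl -s http://localhost:28015/api/v1/status/flags | grep -o '"web.enable-admin-api":"[a-z]*"'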

Because I already had the metrics to be deleted nicely listed in a file with one metric name per line, I wrote a quick bash script to automate the deletion:

#!/bin/bash

PROMETHEUS_SERVER="http://localhost:28015"
# File with one metric name per line, passed as the first argument.
metrics_list="${1}"

# Mark every metric in the list for deletion via the TSDB admin API.
while read -r metric; do
  echo "$metric"
  curl -X POST -g "${PROMETHEUS_SERVER}/api/v1/admin/tsdb/delete_series?match[]=${metric}"
done < "${metrics_list}"

# Clean up the tombstones, which actually removes the marked data from disk.
curl -X POST -g "${PROMETHEUS_SERVER}/api/v1/admin/tsdb/clean_tombstones"

It takes the aforementioned list and calls the series deletion API for each of the lines in that file, assuming that each line only contains the name of a metric to be deleted. Once that’s done, it calls the tombstone cleaning API, which really deletes the series from the TSDB.
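
With the script saved as, say, delete-metrics.sh and the drop list in drop.txt, the invocation is simply:

# drop.txt: one metric name per line, as produced above.
chmod +x delete-metrics.sh
./delete-metrics.sh drop.txt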

The first part of the script ran through perfectly fine for me. But the cleaning of the tombstones failed:

"status":"error","errorType":"internal","error":"clean tombstones: /prometheus/01HSKJ3DX27WGS1NZTS0F87K6V: 2 errors: preallocate: no space left on device; sync /prometheus/01HZW9FS0W6JT1R0YJ1ZTH332N.tmp-for-creation/chunks/000011: file a

And no, I did not accidentally cut off the message there - that’s all the Prom API sent. 😁 So I left it too late. Prometheus didn’t even have enough space left on the volume to execute the tombstone deletion.

Prometheus stores data on disk in chunks, each containing all the data for a specific time period. When you delete data, it has to open every chunk which contains data from the metric to be removed, strip that data out and then write out a new chunk. And it did not have enough space left to write the new chunks while the old ones had not been deleted yet. In the end I capitulated and increased the size of the Prometheus volume again, by another 20 GB. That was enough. This is one of the reasons why I like to use S3 whenever I can: when I run out of space there, I just need to throw in another disk, no mucking about with volume sizes.

Increasing the volume size was not entirely simple, because the PVC is controlled not by a manually created manifest or even a Helm chart, but is instead created by the Prometheus operator when running the kube-prometheus-stack. As a consequence, updating the size has to follow a specific process, which is documented here. That was a pretty humiliating defeat. 😦
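
For reference, that process boils down to roughly the following. The object names are what I’d expect from the pod name used for the port-forward earlier, and 120Gi is simply the original 100 GB plus the extra 20 GB - verify the names with kubectl -n monitoring get pvc,statefulset before copying anything:

# 1. Bump the size in the chart's storageSpec, then patch the existing PVC
#    (the storage class has to allow volume expansion):
kubectl -n monitoring patch pvc \
  prometheus-monitoring-kube-prometheus-prometheus-db-prometheus-monitoring-kube-prometheus-prometheus-0 \
  --patch '{"spec": {"resources": {"requests": {"storage": "120Gi"}}}}'

# 2. Delete the StatefulSet while orphaning its pods, so the operator recreates
#    it with the new volume size:
kubectl -n monitoring delete statefulset \
  prometheus-monitoring-kube-prometheus-prometheus --cascade=orphan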

The tombstone cleanup is also pretty compute- and memory-intensive. Here’s the CPU consumption of the Prometheus container:

A Grafana time series plot. It shows the time from 20:00 on one day to 12:30 on the next. On the Y axis it shows CPU utilization. There is only one curve, which starts rather constant around 0.14 and then goes up to 0.6 around 20:30. It stays there until 21:30, when it goes up to its maximum of 0.85. It stays there for 1.5h and then goes down to 0.4 for 30 minutes, until going up to 0.82 again. The utilization drops again to fluctuating between 0.53 and 0.66 at 00:30 and roughly stays there. At 12:30 the next day, it finally goes back to the initial 0.14.

CPU utilization of the Prometheus container during tombstone cleaning.

A Grafana time series plot. It shows the time from 20:00 on one day to 12:30 on the next, similar to the previous plot. On the Y axis, this one shows the memory consumption in Gigabytes. There is only one curve, which starts out at around 500 MB and then goes up to around 850 MB at 20:30. It stays around that level until 00:30, when it goes up to 900 MB. Over the next couple of hours, the memory consumption constantly increases until it reaches its 1.6 GB peak around 10:00 the next morning.

Memory utilization of the Prometheus container during tombstone cleaning.

Future approaches

All the above being said, the entire action was not really much of a success. The percentage disk utilization plot I showed in the beginning looks rather impressive, dropping the disk utilization by about 30%. But quite frankly: that’s mostly due to the 20 GB increase in the volume size, not because of my cleanup. A more realistic picture is this plot of the bytes used on the Prometheus volume up to today:

A Grafana time series plot. It shows the time from March 31st to June 28th. The curve grows consistently from 68 GiB to 104 GiB until June 9th, where it drops to 87 GiB. Then it rises again constantly until it reaches 95 GiB on June 28th.

Used bytes on the Prometheus volume in GiB.

This rate of growth isn’t really sustainable in the current setup. I will keep running out of space on the volume. I could of course radically cut down on the retention period or on the metrics scraped, but I really don’t want to.

So in the near future, I will have to have a look at Thanos. It is already supported in kube-prometheus-stack, and it can provide long term storage in S3, which I prefer over disk volumes.

But that project has to wait until the k8s migration is done. I hope that I won’t have to repeat this action before that’s completed.