At the time of writing, I have 328 GiB of Prometheus data. When it all started, I had about 250 GiB. I could stop gathering more data whenever I like. 😅

So I’ve got a lot of Prometheus data. Especially since I started the Kubernetes cluster - or rather, since I started scraping it - I had to regularly increase the size of the storage volume for Prometheus. This might very well be due to my 5 year retention. But part of it, as it turned out later, was because some of the things I was scraping had a 10s scrape interval configured.

So where’s all the data coming from? There are currently 21 hosts with the standard node exporter running. Then there’s the Kubernetes scraping I’m doing with kube-prometheus-stack. That gathers a lot of metrics for every single container I’ve got running. I don’t know how many containers that is right now, but it’s at least 196, because that’s the number of Pods currently running. Then there’s also my Ceph cluster. And a few more bits and bobs, but I doubt that they contribute very much.

Here’s the problem in a single plot:

A screenshot of a Grafana time series plot. It shows the size of my Prometheus volume, starting in March 2024 until today. It starts out slightly below 60 GB and grows constantly from there. Shortly after reaching 100 GB in mid-May 2024, it goes down by about 20 GB, but continues growing linearly after that. To give a sense of the growth rate: the volume grew by about 13 GB in July 2024. Around the beginning of 2025 the growth seems to accelerate, with +24 GB in April 2025. On May 5th, the size fell off a cliff, down to 1.62 GB.

Size of my Prometheus volume.

So I was getting a bit tired of regularly having to increase the size of my Prometheus volume. It highlights the utter ridiculousness of the amount of data gathering I’m doing. 😁 I needed a solution. I considered reducing my metrics gathering considerably. The counterpoint: But pretty graphs! So another solution needed to be found.

Enter Thanos. Two things drew me to it. First and foremost, it promised to allow me to dump my metrics data into an S3 bucket. Which is great, because I would not have to worry about volume size increases anymore. The next time I run out of storage for the metrics, I would be running out of storage, period. And thanks to Ceph, I would just need to throw in an additional disk somewhere should that ever happen. That alone would already be a great advantage over my current setup. But Thanos also supports downsampling of data. While I do intentionally keep all data for five years right now, I don’t really need that data in full precision. So this would allow me to reduce my storage usage, without having to drop data entirely. I will even end up with more retention than before, just not in full precision.

How Thanos works

Thanos works with multiple components as follows:

A diagram showing how Thanos works. On the left side are a couple of squares labeled Kubernetes, Ceph and hosts, representing the targets that Prometheus, represented by another square, scrapes. The Kubernetes, Ceph, hosts and Prometheus boxes are white. Prometheus, in turn, is connected with an arrow labeled 'Gathers Blocks From' to a block called 'Thanos Sidecar'. This sidecar then has another arrow indicating that it uploads blocks to S3. A separate square labeled 'Thanos Compactor' is only connected to S3, with an arrow labeled 'Compact Blocks'. Then there's the 'Thanos Querier' block, connected with arrows labeled 'queries' to the 'Thanos Sidecar' and 'S3'. And finally, another block in white labeled 'Grafana' has an arrow towards the 'Thanos Querier' labeled 'queries'.

Overview of Thanos

The components marked in white in the diagram are the original components of my metrics setup, while the new Thanos components are kept in blue. Thanos starts out taking the uncompacted blocks from Prometheus’ storage via the Thanos Sidecar, uploading them unchanged to S3. Once the blocks are uploaded, they are downloaded again by the Compactor, whose main job is to compact the blocks, similar to what Prometheus would normally do itself.

Queries against this storage are done by the Querier. It is not only connected to the S3 bucket and able to query the blocks there via range requests, but also to the Sidecar. This is necessary because Prometheus (by default) only creates a new actual block every two hours. Before that, newly scraped metrics are kept in the head block. So to get the most recent data, the Querier needs to go to the Sidecar. For queries over longer intervals, the Querier is able to combine data from multiple sources.

And finally, Grafana is no longer pointed at Prometheus, but instead at the Thanos Querier. There’s also an additional component, the Thanos Query Frontend, that does query distribution and caching. But to be honest, it doesn’t look like I need it right now.

Thanos setup

The first step to complete was setting up an S3 bucket, which I did via my Rook Ceph cluster:

apiVersion: objectbucket.io/v1alpha1
kind: ObjectBucketClaim
metadata:
  name: bucket-thanos
spec:
  bucketName: {{ .Values.bucketName }}
  storageClassName: rgw-bulk

Next step is the setup of the Sidecar. As I’m running kube-prometheus-stack for my monitoring stack and that already provides Thanos integration, I used that. There are a number of changes necessary in the Prometheus part of the values.yaml file. First, prometheus.prometheusSpec.disableCompaction: true needs to be set. That completely removes Prometheus’ own compaction, which is necessary so Thanos can take over compaction duties. Then I also set some Thanos related options:

prometheus:
  thanosService:
    enabled: true
  thanosServiceMonitor:
    enabled: false
  thanosServiceExternal:
    enabled: false
  thanosIngress:
    enabled: false
  prometheusSpec:
    disableCompaction: true
    thanos:
      objectStorageConfig:
        existingSecret:
          name: thanos-objectstore-config
          key: "bucket.yml"
      logFormat: json
      additionalArgs:
        - name: shipper.upload-compacted

I didn’t need any ingress to the Sidecar, so I disabled it. As is my habit, I also disabled Thanos’ own metric gathering, at least until I could get around to properly setting it up and creating some dashboards.

The thanos: section provides the configuration for the Sidecar. It’s part of the Prometheus Operator config, so it’s not the kube-prometheus-stack which adds the Thanos Sidecar, but the Prometheus Operator. The content of the prometheusSpec.thanos key is copied verbatim into the thanos section of the PrometheusSpec for the Prom operator, so any other options from that section can also be added here.

The shipper.upload-compacted flag for the Thanos Sidecar is required so that it uploads already compacted blocks to the S3 bucket. Without this option, the Sidecar will only ever touch uncompacted blocks. As I wanted to move my entire metrics history to S3, I enabled the option.

The main problem during the setup, as seems to happen so often, was how to effectively use the S3 config and credentials so helpfully provided by Rook in the form of a Secret and a ConfigMap. There are two ways of providing the bucket config to Thanos; both need a Thanos-specific config file, either supplied as an actual file or passed verbatim as a string to a command line flag.

Because the ability to provide environment variables to the Sidecar was completely missing, I opted again for my external-secrets Kubernetes Store approach to providing the S3 credentials. For details, see this post. But external-secrets does not allow taking some values from a ConfigMap for the template, so while I could provide the credentials from the Secret generated by Rook, I couldn’t use the configs from the ConfigMap it also creates and had to hardcode them:

apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
  name: thanos-objectstore-config
spec:
  refreshInterval: "10m"
  secretStoreRef:
    name: monitoring-secrets-store
    kind: SecretStore
  target:
    name: thanos-objectstore-config
    template:
      data:
        bucket.yml: |
          type: S3
          config:
            bucket: thanos
            endpoint: rook-ceph-rgw-rgw-bulk.example.svc:80
            disable_dualstack: true
            aws_sdk_auth: false
            access_key: {{ `{{ .AWS_ACCESS_KEY_ID }}` }}
            secret_key: {{ `{{ .AWS_SECRET_ACCESS_KEY }}` }}
            insecure: true
            bucket_lookup_type: path          
  dataFrom:
    - extract:
        key: bucket-thanos

Once I deployed this configuration, the Sidecar immediately started uploading the older, already compacted blocks. For the about 250 GB worth of metrics data I had at that point, it took about 4.5h to upload everything.

Next, the deployment of the other Thanos components. I decided to deploy them into my monitoring namespace, similar to kube-prometheus-stack, because it allowed me to share the configs and Secret between the Sidecar and the other components.

The first Thanos standalone component I deployed was the Thanos Store. It serves as a backend for the Querier, downloading and supplying blocks from the S3 bucket.

Before deploying the Store, I had to define a cache config. This particular cache is for the indexes of Prometheus blocks. I decided on using Redis for this, as I’ve already got an instance running anyway. My configuration looks like this:

apiVersion: v1
kind: ConfigMap
metadata:
  name: redis-cache-conf
data:
  redis-cache.yaml: |
    type: REDIS
    config:
      addr: redis.redis.svc.cluster.local:6379
      tls_enabled: false
      cache_size: 256MB
      max_async_buffer_size: 100000    

One really annoying thing: The unit of the cache_size doesn’t seem to be documented anywhere. So I went spelunking a little bit. First, I looked up the cache_size in the Thanos repo:

	clientOpts := rueidis.ClientOption{
		InitAddress:       strings.Split(config.Addr, ","),
		ShuffleInit:       true,
		Username:          config.Username,
		Password:          config.Password,
		SelectDB:          config.DB,
		CacheSizeEachConn: int(config.CacheSize),
		Dialer:            net.Dialer{Timeout: config.DialTimeout},
		ConnWriteTimeout:  config.WriteTimeout,
		DisableCache:      clientSideCacheDisabled,
		TLSConfig:         tlsConfig,
	}

The rueidis package name of that struct led me to the repo of the Redis Go client:

// CacheStoreOption will be passed to NewCacheStoreFn
type CacheStoreOption struct {
	// CacheSizeEachConn is redis client side cache size that bind to each TCP connection to a single redis instance.
	// The default is DefaultCacheBytes.
	CacheSizeEachConn int
}

And then I finally found the default value and figured out what the unit was here:

const (
	// DefaultCacheBytes is the default value of ClientOption.CacheSizeEachConn, which is 128 MiB
	DefaultCacheBytes = 128 * (1 << 20)

And all of that sleuthing just to realize that the value takes any unit I want. 🤦

Ah well. At least I got to look at some Go code again. All of that said, the Store also needs a bit of local disk space as scratch space for temporarily downloading chunks or indexes. I gave it a 5 GiB volume, and that has been more than enough in the couple of weeks the setup has been running.
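
The claim itself is nothing special; a minimal sketch, with the storage class name as a stand-in for whatever RBD-backed class is in use:

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: thanos-store-volume
spec:
  accessModes:
    - ReadWriteOnce
  # Placeholder: whatever RBD-backed storage class the cluster provides
  storageClassName: rbd-bulk
  resources:
    requests:
      storage: 5Gi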

The deployment of the store then looks like this:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: thanos-store
spec:
  replicas: 1
  selector:
    matchLabels:
      homelab/app: thanos-store
  strategy:
    type: "Recreate"
  template:
    metadata:
      labels:
        homelab/app: thanos-store
      annotations:
        checksum/redis-conf: {{ include (print $.Template.BasePath "/redis-cache-config.yaml") . | sha256sum }}
    spec:
      automountServiceAccountToken: false
      securityContext:
        fsGroup: 1000
      containers:
        - name: thanos-store
          image: quay.io/thanos/thanos:{{ .Values.appVersion }}
          args:
            - store
            - --cache-index-header
            - --chunk-pool-size=1GB
            - --data-dir={{ .Values.store.cacheDir }}
            - --index-cache.config-file=/homelab/thanos-store/configs/redis-cache.yaml
            - --log.format={{ .Values.logFormat }}
            - --log.level={{ .Values.logLevel }}
            - --objstore.config-file=/homelab/thanos-store/configs/bucket-config.yml
            - --web.disable
            - --grpc-address=0.0.0.0:{{ .Values.ports.grpcPort }}
            - --http-address=0.0.0.0:{{ .Values.ports.httpPort }}
          volumeMounts:
            - name: cache
              mountPath: {{ .Values.store.cacheDir }}
            - name: thanos-configs
              mountPath: /homelab/thanos-store/configs
              readOnly: true
          resources:
            requests:
              cpu: 200m
              memory: 1500Mi
            limits:
              memory: 1500Mi
          livenessProbe:
            failureThreshold: 8
            httpGet:
              path: /-/healthy
              port: {{ .Values.ports.httpPort }}
              scheme: HTTP
            periodSeconds: 30
            timeoutSeconds: 1
          readinessProbe:
            failureThreshold: 20
            httpGet:
              path: /-/ready
              port: {{ .Values.ports.httpPort }}
              scheme: HTTP
            periodSeconds: 5
          ports:
            - name: store-http
              containerPort: {{ .Values.ports.httpPort }}
              protocol: TCP
            - name: store-grpc
              containerPort: {{ .Values.ports.grpcPort }}
              protocol: TCP
      volumes:
        - name: cache
          persistentVolumeClaim:
            claimName: thanos-store-volume
        - name: thanos-configs
          projected:
            sources:
              - secret:
                  name: thanos-objectstore-config
                  items:
                    - key: "bucket.yml"
                      path: bucket-config.yml
              - configMap:
                  name: redis-cache-conf
                  items:
                    - key: "redis-cache.yaml"
                      path: redis-cache.yaml

Nothing special to see here, so let’s move on to the next piece of the puzzle. At this point, the upload of the older blocks was done, so I went into the config and removed the shipper.upload-compacted flag from the additionalArgs of the Thanos Sidecar config. I still left my 5 year retention in Prometheus for now, because I hadn’t tested anything related to Thanos yet.

That’s coming now, with the deployment of the Querier. That’s the component that connects to a number of metrics stores - Thanos Store instances or Thanos Sidecars on multiple Prometheus instances - and queries them for data. It implements the Prometheus query language, so it’s fully compatible with frontends like Grafana.

It’s also not very complicated; it doesn’t need any local storage, for example:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: thanos-querier
spec:
  replicas: 1
  selector:
    matchLabels:
      homelab/app: thanos-querier
  strategy:
    type: "Recreate"
  template:
    metadata:
      labels:
        homelab/app: thanos-querier
    spec:
      automountServiceAccountToken: false
      securityContext:
        fsGroup: 1000
      containers:
        - name: thanos-querier
          image: quay.io/thanos/thanos:{{ .Values.appVersion }}
          args:
            - query
            - --query.auto-downsampling
            - --log.format={{ .Values.logFormat }}
            - --log.level={{ .Values.logLevel }}
            - --grpc-address=0.0.0.0:{{ .Values.ports.grpcPort }}
            - --http-address=0.0.0.0:{{ .Values.ports.httpPort }}
            - --endpoint=dnssrv+_thanos-store-grpc._tcp.thanos-store.monitoring.svc
            - --endpoint=dnssrv+_grpc._tcp.monitoring-kube-prometheus-thanos-discovery.monitoring.svc
          resources:
            requests:
              cpu: 200m
              memory: 512Mi
          livenessProbe:
            failureThreshold: 8
            httpGet:
              path: /-/healthy
              port: {{ .Values.ports.httpPort }}
              scheme: HTTP
            periodSeconds: 30
            timeoutSeconds: 1
          readinessProbe:
            failureThreshold: 20
            httpGet:
              path: /-/ready
              port: {{ .Values.ports.httpPort }}
              scheme: HTTP
            periodSeconds: 5
          ports:
            - name: querier-http
              containerPort: {{ .Values.ports.httpPort }}
              protocol: TCP
            - name: querier-grpc
              containerPort: {{ .Values.ports.grpcPort }}
              protocol: TCP

The only thing to note is the configuration of the endpoints. There are a number of options; I decided to configure them via DNS, using the records for the Store and Sidecar Services, which worked nicely. Note that I was even able to use SRV queries, so I didn’t need to hardcode the ports either. Honestly, more things ought to support SRV queries. The two --endpoint flags tell the Querier to request data from my Sidecar and Thanos Store deployments. This means that for older data, the Querier will take it from the Store, while for the most recent data (the past 2h at most, in my setup) it will go to the Sidecar, which in turn queries Prometheus itself.
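
For reference, the Service backing that first SRV record is nothing fancy. A sketch of it, mirroring the Compactor Service shown further down, with the gRPC port named so that it matches the _thanos-store-grpc._tcp SRV name:

apiVersion: v1
kind: Service
metadata:
  name: thanos-store
spec:
  type: ClusterIP
  selector:
    homelab/app: thanos-store
  ports:
    # The port name is what ends up in the SRV record:
    # _thanos-store-grpc._tcp.thanos-store.monitoring.svc
    - name: thanos-store-grpc
      port: {{ .Values.ports.grpcPort }}
      targetPort: store-grpc
      protocol: TCP
    - name: thanos-store-http
      port: {{ .Values.ports.httpPort }}
      targetPort: store-http
      protocol: TCP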

The last Thanos component to be deployed was the Compactor. Its job is to take the raw blocks uploaded to the bucket by the Sidecar and compact them. That doesn’t reduce the size of the actual samples at all, because they’re all kept, but it does reduce the size of the index, as duplicate entries for the same series in different blocks can be combined. As an example of the current situation, I’ve got a couple of 2h blocks where the index takes up around 14.2 MiB. But the already compacted 8h block right before those has an index of 29.2 MiB. Without compaction, the indexes of the four 2h blocks making up that 8h block would take a total of 4 x 14.2 = 56.8 MiB, instead of 29.2 MiB.

The Compactor does not need to be kept running all the time; it could be deployed as a CronJob, as it can just do its thing and then shut down again. It only interacts with the rest of the system by downloading blocks from the S3 bucket, working on them, uploading the result and possibly deleting some now-unneeded blocks. But I decided to run it as a Deployment, because I figured that I would need to keep the resources free for its regular run anyway, so why not just keep it running?
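
For anyone who prefers the CronJob route, a rough sketch of how that could look - the schedule is a placeholder, and concurrencyPolicy: Forbid matters because only one Compactor may ever work on a given bucket at a time:

apiVersion: batch/v1
kind: CronJob
metadata:
  name: thanos-compactor
spec:
  # Placeholder schedule: run every six hours
  schedule: "0 */6 * * *"
  # There must never be more than one Compactor working on the same bucket
  concurrencyPolicy: Forbid
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: Never
          containers:
            - name: thanos-compactor
              image: quay.io/thanos/thanos:{{ .Values.appVersion }}
              args:
                # Without --wait, the Compactor runs once and then exits
                - compact
                - --retention.resolution-raw=2y
                - --retention.resolution-5m=5y
                - --retention.resolution-1h=0d
                - --data-dir=/scratch
                - --objstore.config-file=/configs/bucket.yml
              volumeMounts:
                - name: scratch
                  mountPath: /scratch
                - name: objectstore-conf
                  mountPath: /configs
                  readOnly: true
          volumes:
            - name: scratch
              persistentVolumeClaim:
                claimName: thanos-compactor-volume
            - name: objectstore-conf
              secret:
                secretName: thanos-objectstore-config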

This is what the deployment looks like:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: thanos-compactor
spec:
  replicas: 1
  selector:
    matchLabels:
      homelab/app: thanos-compactor
  strategy:
    type: "Recreate"
  template:
    metadata:
      labels:
        homelab/app: thanos-compactor
      annotations:
        checksum/objectstore-conf: {{ include (print $.Template.BasePath "/thanos-objectstore-config.yaml") . | sha256sum }}
    spec:
      automountServiceAccountToken: false
      securityContext:
        fsGroup: 1000
      containers:
        - name: thanos-compactor
          image: quay.io/thanos/thanos:{{ .Values.appVersion }}
          args:
            - compact
            - --wait
            - --wait-interval=30m
            - --retention.resolution-1h=0d
            - --retention.resolution-5m=5y
            - --retention.resolution-raw=2y
            - --data-dir={{ .Values.compactor.scratchDir }}
            - --log.format={{ .Values.logFormat }}
            - --log.level={{ .Values.logLevel }}
            - --objstore.config-file={{ .Values.compactor.configDir }}/bucket.yml
            - --http-address=0.0.0.0:{{ .Values.ports.httpPort }}
            - --disable-admin-operations
          volumeMounts:
            - name: scratch
              mountPath: {{ .Values.compactor.scratchDir }}
            - name: objectstore-conf
              mountPath: {{ .Values.compactor.configDir }}
              readOnly: true
          resources:
            requests:
              cpu: 500m
              memory: 1500Mi
            limits:
              memory: 1500Mi
          livenessProbe:
            failureThreshold: 8
            httpGet:
              path: /-/healthy
              port: {{ .Values.ports.httpPort }}
              scheme: HTTP
            periodSeconds: 30
            timeoutSeconds: 1
          readinessProbe:
            failureThreshold: 20
            httpGet:
              path: /-/ready
              port: {{ .Values.ports.httpPort }}
              scheme: HTTP
            periodSeconds: 5
          ports:
            - name: compactor-http
              containerPort: {{ .Values.ports.httpPort }}
              protocol: TCP
      volumes:
        - name: scratch
          persistentVolumeClaim:
            claimName: thanos-compactor-volume
        - name: objectstore-conf
          secret:
            secretName: "thanos-objectstore-config"

The interesting part here is the command line flags. The --wait and --wait-interval flags configure the Compactor to keep running and to execute its tasks every 30 minutes. I’m also configuring retention. On Prometheus itself, I’ve got five years worth of retention configured. That limit will only be reached in February next year, as I initially set up Prometheus in 2021. But as I’ve noted above, I’ve gathered quite a lot of data over the years. And I wanted to at least reduce it a little bit.

What I figured was that I probably didn’t need raw precision data for the whole five years. So I decided to set --retention.resolution-raw to two years. This means that all data will be retained at full precision for two years. That should be enough, even for a graph connoisseur like myself. Most of the time when I look at older data, I don’t look at it closely zoomed in, but rather at very long time frames for a metric. I then set the 5 minute precision retention to my previous five years, which is still a lot of precision for such a long time frame. Finally, I indulged a little bit and set the 1 hour precision to never be deleted, so I will always have at least that precision available for any data I ever gathered.

One thing has to be clear with this downsampling: The way I’ve configured it, it will not actually reduce the overall size of the TSDB. Quite to the contrary, it will increase the size. Because, at least for the first two years, I’m keeping the full precision data, and I’m also adding two more blocks for every raw precision block. One at 5m precision, and one at 1h.

I started out at around 250 GB worth of metrics data. After the downsampling ran through, I ended up with about 343 GB. Well, it’s good that reducing the size was not a goal of the entire exercise. 😅

Running on a Raspberry Pi 4 worker node, working off of a Ceph RBD backed by a SATA SSD, the downsampling of the data since February 2021 took about 19.5h in total. That’s the overall time for computing the blocks for both 1h and 5m precision.

Before continuing to the Grafana configuration, I would like to highlight another nice feature of Thanos, the block viewer:

A screenshot of Thanos block viewer web UI. It displays information on the TSDB blocks currently in the S3 bucket. At the bottom is a timeline from 2021-02-05 to 2025-05-05. Above it are multiple rows of blocks in different colors. From the start to June 2024, they all have the same size, representing 20 day blocks. There are three rows, representing the raw precision, 1h precision and 5m precision blocks. After that, there's another triplet of rows going up to mid-April 2025 in different colors. There's also a set of shorter blocks, representing 7 days, in December 2024, followed again by longer 20 day blocks. Coming closer to today, the 20 day blocks are first replaced with 7 day blocks, then 2 day blocks, then 8h blocks and finally 2h blocks for the most recent ones. To the right of this graph, an information window is showing some info about the selected block. It contains the block's start and end time. Here, September 20 2024 8 PM to October 11 2024 2 AM. It shows that the block contains over 250k series, with over 7 billion samples. The total size is given as 8.31 GiB, of which 93%, or 7.81 GiB, are the Chunks with the samples and 515.95 MiB the index. It also shows that the ingestion rate for that block is about 420.27 MiB per day. Finally, it shows the Resolution as '0', meaning raw data, the level is 6, meaning it has been compacted 6 times, and then the source is given as the sidecar, meaning this block was uploaded directly from Prometheus and not further compacted by Thanos.

A screenshot of the TSDB in the bucket, taken shortly after downsampling was done.

This is a really nice tool for looking at the general data about the TSDB. It allowed me to get an overview of how the ingestion rate has increased. I will go into details later, but first here is the configuration to get this view, starting with a Service for the compactor:

apiVersion: v1
kind: Service
metadata:
  name: thanos-compactor
spec:
  type: ClusterIP
  selector:
    homelab/app: thanos-compactor
  ports:
    - name: thanos-compactor-http
      port: {{ .Values.ports.httpPort }}
      targetPort: compactor-http
      protocol: TCP

Then I added the following IngressRoute to make it available via my Traefik ingress:

apiVersion: traefik.io/v1alpha1
kind: IngressRoute
metadata:
  name: thanos-block-viewer
  annotations:
    external-dns.alpha.kubernetes.io/hostname: "block-viewer.example.com"
    external-dns.alpha.kubernetes.io/target: "ingress.example.com"
spec:
  entryPoints:
    - secureweb
  routes:
    - kind: Rule
      match: Host(`block-viewer.example.com`)
      services:
        - kind: Service
          name: thanos-compactor
          namespace: monitoring
          port: thanos-compactor-http
          scheme: http

Another important setting is the --disable-admin-operations flag on the Compactor container. This disables some write operations you could do via the web UI, like marking a block for deletion or marking a block as not to be compacted. Because there’s no authentication of any kind available, I disabled these functions.
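
Should I ever want those admin operations reachable after all, one option would be to put Traefik’s basicAuth middleware in front of the route. A rough sketch - the Middleware name and the Secret holding the htpasswd-style users entry are made up:

apiVersion: traefik.io/v1alpha1
kind: Middleware
metadata:
  name: thanos-block-viewer-auth
spec:
  basicAuth:
    # Secret with an htpasswd-formatted 'users' key, created separately
    secret: thanos-block-viewer-users

The route in the IngressRoute above would then additionally reference it via a middlewares entry.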

Configuring Grafana

Initially, I configured a separate data source for Thanos via the values.yaml file of the kube-prometheus-stack chart, like this:

grafana:
  datasources:
    datasource.yaml:
      apiVersion: 1
      editable: false
      datasources:
        - name: thanos
          type: prometheus
          access: proxy
          url: http://thanos-querier.monitoring.svc.cluster.local:10902
          isDefault: false
          prometheusType: Thanos

With that, I was able to verify that I could query all of the data, so I then reconfigured Prometheus to only have a 24h retention:

prometheus:
  prometheusSpec:
    retention: 24h

Prometheus then dutifully removed all of the old blocks in very short order, reducing the size of the TSDB to only about 1.5 GiB. I had wanted to reduce the size of the volume as well, but found that while volume sizes can be increased in Kubernetes, reducing them is currently not supported. I will have to create a fresh volume and copy the data over, and decided to put that off to another day. But I was able to free the space in the Ceph cluster by running this command on the host where the Prometheus Pod was running:

fstrim /var/lib/kubelet/pods/82278fd5-0903-4bdc-b128-562028e435bd/volume-subpaths/pvc-7f0e51e6-40c4-4880-8b52-169c5d1fcdef/prometheus/2

With that, Linux discards unused storage in a filesystem, and in the case of a Ceph RBD, it frees up space in the cluster because RBDs are sparse by default. That took about 20 minutes to run through.

But now back to Grafana. While querying Thanos did work, I happened to have kubectl logs open on the Thanos Store when I went to Grafana’s Drilldown page, and the logs were flooded with messages like this:

{
    "caller":"memcached.go:175",
    "err":"the async buffer is full",
    "level":"error",
    "msg":"failed to cache series in memcached",
    "ts":"2025-05-03T22:00:11.260316949Z"
}

And when I say “flooded”, I mean flooded:

A screenshot of a Grafana log volume graph. It shows very large log amounts up to 5k log events per second, with some multi-second periods of constant 2k logs per second.

There were a lot of logs.

Digging a little bit, I found this issue, which noted that the problem was the max_async_buffer_size setting in the cache config. I bumped it up to 100000 and the errors mostly went away.

Now I needed to switch all of my dashboards over to the Thanos data source, instead of going directly to Prometheus. The issue: I had the “Prometheus” data source configured in all of my Grafana panels, and I did not want to go over all of them and change them to the Thanos source.

I ended up just replacing the original Prometheus source with Thanos in the data source config. For that, I first had to disable the default source that the kube-prometheus-stack Helm chart configures:

grafana:
  sidecar:
    datasources:
      defaultDatasourceEnabled: false
      isDefaultDatasource: false

Next, I added Thanos as a data source called “Prometheus”:

grafana:
  datasources:
    datasource.yaml:
      datasources:
        - name: Prometheus
          uid: prometheus
          type: prometheus
          access: proxy
          url: http://thanos-querier.monitoring.svc.cluster.local:10902
          isDefault: true
          prometheusType: Thanos
          jsonData:
            customQueryParameters: "max_source_resolution=auto"
      deleteDatasources:
        - name: thanos
          orgId: 1

I also used the deleteDatasources entry to have the Grafana provisioning functionality, documented here, remove my temporary Thanos source.

This had the desired effect, and I was able to query all of the data through Thanos without having to go into every panel and change the data source.

Then, as intended, the Thanos retention removed all raw precision data older than two years. I then wanted to make sure that everything still worked. And I was pretty shocked to see that the answer seemed to be “no”. Here is a plot of the node_load1 metric in May 2023:

A screenshot of a Grafana time series plot. What the plot shows is not important here. The important part is that the beginning of the plot, from May 1st to about the middle of May 3rd, the plot is empty. Data only starts after that.

Data for normal queries only starts in the middle of May 3rd, even though I definitely had data blocks, both 5m and 1h precision, right back to February 2021.

That was rather shocking. After a quick check, I found that I definitely had data, both in 5m and 1h downsampled form, right back to February 2021. But for some reason, Grafana didn’t show it. I tried a couple of my dashboards and, interestingly, found that on some of them data was indeed shown - namely when the panel was only half width, instead of spanning the entire dashboard. Funnily enough, I also found that making the browser window smaller would bring the data back as well. After some more trial and error, I found that this was due to the step width changing when the panel gets smaller. There are fewer pixels per data point available, and so Grafana increases the distance between the data points it requests from the source.

Initially, I thought that just adding --query.auto-downsampling to the command line flags of the Querier would fix the problem, because that was what showed up in several similar issues reported in the Thanos bug tracker and around the wider Internet. But it had no effect at all. There is even a Grafana issue about this, but it was rejected.

I finally found a serviceable workaround in this issue. There is seemingly no good way to make use of a single data source and have that handle downsampled data. It simply doesn’t work. But what does work is creating a second data source, pointing to the same Thanos, and setting max_source_resolution=5m for that source:

grafana:
  datasources:
    datasource.yaml:
      datasources:
        - name: Thanos-5m
          uid: thanos5m
          type: prometheus
          access: proxy
          url: http://thanos-querier.monitoring.svc.cluster.local:10902
          isDefault: false
          prometheusType: Thanos
          jsonData:
            customQueryParameters: "max_source_resolution=5m"

And that solved the issue. Using that data source, I’m getting data past the end of the raw resolution data in the TSDB without having to configure anything else. And because I only very occasionally, and then very intentionally, look at data older than a year or so, I don’t have a problem with having to explicitly set a different data source.

I would have really liked it if this were handled automatically, either by Grafana or by the Querier, but that just doesn’t seem to be how it works.

Reducing Prometheus metrics ingestion

With the Thanos block viewer on hand, I was finally able to dig a little bit deeper into why I had to increase the size of the Prometheus volume so often. Going back to my oldest raw precision block from May 2023, before the k8s migration, I saw that that 20 day block had a size of almost exactly 1 GiB, with 49.47 MiB of data per day. Then looking at the most recent 20 day block, from March/April 2025, that block had a size of 13.61 GiB with 688 MiB per day. Here is a table with a few milestone blocks:

| End Date   | Duration | Size      | Daily   |
|------------|----------|-----------|---------|
| 2023-05-23 | 20d      | 1 GiB     | 49 MiB  |
| 2024-02-02 | 20d      | 1.25 GiB  | 62 MiB  |
| 2024-03-22 | 20d      | 7.15 GiB  | 361 MiB |
| 2024-09-20 | 20d      | 8.51 GiB  | 430 MiB |
| 2024-12-31 | 20d      | 7.84 GiB  | 396 MiB |
| 2025-03-01 | 20d      | 10.81 GiB | 546 MiB |
| 2025-04-11 | 20d      | 13.61 GiB | 688 MiB |
| 2025-05-01 | 7d       | 5.10 GiB  | 773 MiB |
| 2025-05-06 | 2d       | 1.63 GiB  | 834 MiB |
| 2025-05-12 | 2d       | 1.42 GiB  | 724 MiB |
| 2025-05-16 | 2d       | 935 MiB   | 467 MiB |

It’s pretty clear that the massive jump is coming from the k8s scraping, as I enabled that in March 2024. For the rest of 2024, the daily intake was reasonably stable though. But then, starting in 2025, it increases pretty seriously again. I’m pretty sure that’s because for most of the latter half of 2024, I was working on my backup operator implementation, so there wasn’t much change in the k8s cluster, and most apps were still running in the Nomad cluster. Then, starting in 2025, I began to migrate the rest of the services over, which seemingly increased the amount of scraped data by a lot. This makes some sense, considering that some of my largest series are probably the per-container metrics scraped from the kubelet.

As you can see from the last couple of entries, I made some progress in reducing the ingest over the last couple of days. I would have loved to show you some Prometheus ingest plots, but sadly I only realized too late that Prometheus provides ingest metrics. 😞

For analyzing the data and finding metrics to cut down, I went looking directly at the most recent 20 day block, with data ending on 2025-04-11, and opened it with promtool. This works without the block sitting in any special directory structure; it doesn’t need to be part of a Prometheus TSDB. I just downloaded it from the S3 bucket with s3cmd.

Then I launched promtool like this:

promtool tsdb analyze ./ 01J3CD4846QQYQEJ3XN7VZ5NMH/

Here, 01J3CD4846QQYQEJ3XN7VZ5NMH is the name of the block to be analyzed, in my case the newest block of full 20 day size. The result looks like this:

Block ID: 01JRJ8EHHW12VY9SZX5Z45SSQV
Duration: 485h59m59.948s
Total Series: 367092
Label names: 226
Postings (unique label pairs): 20435
Postings entries (total label pairs): 4101019

Label pairs most involved in churning:
105190 service=monitoring-kube-prometheus-kubelet
105190 endpoint=https-metrics
105190 job=kubelet
102018 metrics_path=/metrics/cadvisor
51674 namespace=vault
50203 container=vault
49917 image=docker.io/hashicorp/vault:1.18.5
43187 job=kube-state-metrics
43187 service=monitoring-kube-state-metrics
43187 endpoint=http
35765 service=kubernetes
35765 namespace=default
35765 job=apiserver
35765 endpoint=https
29513 container=kube-state-metrics
22812 namespace=backups
21820 namespace=kube-system
21620 instance=10.8.11.250:8080
21566 instance=10.8.12.213:8080
20230 namespace=rook-ceph

Label names most involved in churning:
201049 __name__
201049 instance
201049 job
194577 service
194053 endpoint
194053 namespace
155528 pod
143798 container
108242 node
105190 metrics_path
102043 id
94318 name
75452 image
46727 device
38871 uid
32460 le
28929 scope
21394 resource
20692 verb
17317 version

Most common label pairs:
145767 endpoint=https-metrics
145767 job=kubelet
145767 service=monitoring-kube-prometheus-kubelet
128459 service=kubernetes
128459 endpoint=https
128459 namespace=default
128459 job=apiserver
123738 metrics_path=/metrics/cadvisor
62525 component=apiserver
57546 service=monitoring-kube-state-metrics
57546 endpoint=http
57546 job=kube-state-metrics
52926 version=v1
52036 namespace=vault
50810 instance=10.86.5.202:6443
50337 container=vault
50020 image=docker.io/hashicorp/vault:1.18.5
49771 namespace=kube-system
39454 container=kube-state-metrics
33481 scope=cluster

Label names with highest cumulative label value length:
636231 id
252234 name
224352 container_id
136756 mountpoint
41804 __name__
25744 uid
23777 pod
14309 owner_name
14243 created_by_name
12707 device
11585 job_name
11220 type
11194 image_id
7383 resource
5504 csi_volume_handle
4755 client
4465 image
4209 pod_ip
4209 ip
4107 interface

Highest cardinality labels:
3814 id
3528 name
3116 container_id
1316 __name__
1185 mountpoint
718 uid
688 pod
566 device
417 pod_ip
417 ip
354 type
349 owner_name
343 created_by_name
317 client
315 resource
280 interface
216 job_name
188 le
182 container
156 kind

Highest cardinality metric names:
31872 etcd_request_duration_seconds_bucket
25920 apiserver_request_duration_seconds_bucket
21736 apiserver_request_sli_duration_seconds_bucket
15232 container_memory_failures_total
10208 apiserver_request_body_size_bytes_bucket
8092 container_blkio_device_usage_total
6968 apiserver_response_sizes_bucket
6734 container_fs_reads_total
6734 container_fs_writes_total
4590 kube_pod_status_phase
4590 kube_pod_status_reason
4036 container_fs_reads_bytes_total
4036 kubernetes_feature_enabled
4036 container_fs_writes_bytes_total
3808 container_memory_kernel_usage
3808 container_memory_failcnt
3808 container_memory_rss
3808 container_memory_max_usage_bytes
3808 container_oom_events_total
3808 container_memory_working_set_bytes

Before I go any deeper, one glaring omission in the output that has me a little bit confused: There is no indicator of the actual number of samples in a metric or series. So you get a lot of information about labels and series, but nothing about the samples besides the initial total number of samples in the block.

So let’s look at the information we get there. First the typical metadata, like how long the block is and what the oldest and newest timestamps contained in it are. One headline number is the count of 367092 series. Let me briefly explain the difference between a series and a metric. Let’s take container_fs_reads_total as an example. This is a metric - a certain value, gathered from potentially multiple targets, which has certain labels. A series is then one explicit permutation of those labels’ values. For example like this:

container_fs_reads_total{
    container="POD",
    device="/dev/nvme0n1",
    endpoint="https-metrics",
    instance="300.300.300.1:10250",
    job="kubelet",
    metrics_path="/metrics/cadvisor",
    namespace="rook-cluster",
    node="mynode1",
    pod="rook-ceph-osd-2-85b8f48c47-p24kc",
    prometheus="monitoring/monitoring-kube-prometheus-prometheus",
    prometheus_replica="prometheus-monitoring-kube-prometheus-prometheus-0",
    service="monitoring-kube-prometheus-kubelet"
}

This is one single series of the container_fs_reads_total metric - one specific combination of label values. From what I understand, these series are the basis of Prometheus’ TSDB storage architecture. Having more or fewer samples per series doesn’t make much of a difference for Prometheus, but having many more series per metric tends to get expensive, leading to a cardinality problem and significantly increased computational and memory requirements. That’s why the output of promtool focuses on labels and their cardinality, not on the number of samples.

I started out looking at the Label names with highest cumulative label value length section. If I interpret it right, this is the total length of all values for that particular label. I then went into Grafana’s explore tab and started, well, exploring. Take the first label, id. Concatenating all values of that label produces 636k characters. I then chose a random one of the values, which makes Grafana show you all the metrics using that label+value combination:

A screenshot of Grafana's explore tab, with the metrics browser open. It shows several selection fields, one of them for labels. The label 'id' is chosen, showing that it has 1900 values. A list of those values is also shown, where a random one is currently chosen, starting with '/kubepods.slice/kubepods-burst...'. On the left side is another list with all of the metrics which have that label+value combination. All of them starting with 'container_', and then a lot of different container metrics like 'fs_writes_total'.

An example exploration of the ‘id’ label.

Note that the label has 1900 values. I then chose a random one of the metrics to figure out where it’s coming from and whether the label is necessary for uniqueness. Here is an example:

container_fs_inodes_total{
    container="install-cni-binaries",
    device="/dev/sda2",
    endpoint="https-metrics",

    id="/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-podcd726cab_879b_4b26_9916_278220f88d5b.slice/crio-3e855297d10572a5e369c72b4194911528becb34ec905da058ca78df3a3286ca.scope",

    image="quay.io/cilium/cilium@sha256:1782794aeac951af139315c10eff34050aa7579c12827ee9ec376bb719b82873",
    instance="300.300.300.2:10250",
    job="kubelet",
    metrics_path="/metrics/cadvisor",
    name="k8s_install-cni-binaries_cilium-bs8mb_kube-system_cd726cab-879b-4b26-9916-278220f88d5b_1",
    namespace="kube-system",
    node="control-plane1",
    pod="cilium-bs8mb",
    prometheus="monitoring/monitoring-kube-prometheus-prometheus",
    prometheus_replica="prometheus-monitoring-kube-prometheus-prometheus-0",
    service="monitoring-kube-prometheus-kubelet"
}

Just looking at the value for the id label, the problem is immediately clear: That value is not just extremely long, but it’s probably also unique. A restart of the container might already result in a new one, a restart of the Pod definitely would. But the value is also unnecessary to guarantee uniqueness of the series. That’s already guaranteed by the pod plus container labels. And the same is true for the name label, which has 1902 values and similarly looks like it might be randomly generated. It too should be covered by the pod plus container label combination when it comes to uniqueness. So I decided to completely drop both labels. Also note the job and metrics_path labels. Those indicate where the metric is coming from, namely the kubelet’s cAdvisor scrape, which can be configured from the kube-prometheus-stack values.yaml.

Worth noting here: The chart has different configs for the different metrics paths the kubelet offers, which is why looking at the metrics_path label is also important. I dropped the labels via this config:

kubelet:
  serviceMonitor:
    cAdvisorMetricRelabelings:
      - action: labeldrop
        regex: id
      - action: labeldrop
        regex: name

With that, those two labels will be dropped completely before the samples are ingested. This will not have an immediate effect on the size of new blocks, because we’re not gathering fewer samples - we’re getting fewer series for all of the cAdvisor metrics. This will have a larger impact once different blocks are compacted into larger ones, because the compaction does not actually remove any samples, but it is able to deduplicate the series in the index. For example, with those two labels still in there, I would have larger daily blocks on days with my regular host or service updates. During both of those maintenance actions, I’m restarting and rescheduling Pods, which would lead to both labels changing - so I would suddenly have two series in the same block for what is pretty much the same Pod, just because the name and id labels changed after the node hosting the Pod was restarted.

I went through the rest of the list in that section and applied different actions. They ranged from leaving a label untouched, like the pod label, because it’s needed for uniqueness, to dropping entire metrics. The mountpoint label, for example, is only used in the node_filesystem_device_error and node_filesystem_readonly metrics, neither of which is particularly interesting, so I dropped them in node_exporter, where they’re coming from.
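
The same could also be achieved at scrape time instead of in the exporter. A sketch of an equivalent metric_relabel_configs rule for a plain node exporter scrape job - the job name and target are placeholders:

scrape_configs:
  - job_name: node
    static_configs:
      - targets:
          - somehost.example.com:9100
    metric_relabel_configs:
      # Drop the two mountpoint-heavy metrics before they are ingested
      - action: drop
        source_labels: [__name__]
        regex: node_filesystem_device_error|node_filesystem_readonly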

I then went through the Highest cardinality metric names section and dropped a lot of the metrics in there, because they just didn’t look very interesting.

See, I’m perfectly capable of even dropping entire metrics. I’m a responsible adult! 🥹

But one value in the cardinality section deserves a shout out: etcd_request_duration_seconds_bucket. That metric is just humongous. It produces a total of 45k series. That’s how many unique label combinations it had seen. That comes from the fact that that metric has labels for 24 histogram buckets times 6 HTTP operations times 317 different Kubernetes object kinds. Wow.
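
A metric like that is also a candidate for dropping at scrape time. A hedged sketch of how that could look via the kube-prometheus-stack values, assuming the kubeApiServer.serviceMonitor.metricRelabelings key is where the apiserver scrape is configured:

kubeApiServer:
  serviceMonitor:
    metricRelabelings:
      # Drop the enormous etcd request duration histogram entirely
      - action: drop
        sourceLabels: [__name__]
        regex: etcd_request_duration_seconds_bucket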

One mistake I made during the configuration, which I only found just now: As I wrote above, the labeldrop action belongs in the metric relabelings. I had initially put it into the (target) relabelings config, but that does not work.
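
To make the difference concrete, here’s roughly where the rule does and doesn’t belong in the kubelet section of the values.yaml - the commented-out key is the target relabelings variant that doesn’t do what I wanted:

kubelet:
  serviceMonitor:
    # Target relabelings run before the scrape, against target labels only,
    # so a labeldrop for 'id' or 'name' has nothing to act on here:
    # cAdvisorRelabelings:
    #   - action: labeldrop
    #     regex: id
    # Metric relabelings run on every scraped sample, which is where
    # the 'id' and 'name' labels actually exist, so this is the right spot:
    cAdvisorMetricRelabelings:
      - action: labeldrop
        regex: id
      - action: labeldrop
        regex: name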

Those initial fixes and cleanups were done last weekend. I did not see much of a drop in the overall size of the gathered metrics, so I dug in a little more. That was when I saw that all those cAdvisor metrics - which, at 196 Pods and at least as many containers, were definitely contributing the most - were scraped at 10 second intervals. Which is ridiculous. I increased the interval to 30s, via this setting in the kube-prometheus-stack chart:

kubelet:
  serviceMonitor:
    cAdvisorInterval: 30s

Then I had the idea of checking whether any other ServiceMonitors were also configured with too short of a scrape interval, and I discovered that Ceph was also doing 10s by default. I was able to change that in the cluster chart like this:

monitoring:
  interval: 30s

One thing I found interesting was that reducing the cAdvisor scrape frequency measurably dropped Prometheus’ CPU usage as well:

A screenshot of a Grafana time series plot. It shows relatively consistent CPU usage by Prometheus, fluctuating between 0.13 and 0.2, with very occasional spikes to 0.5. The utilization markedly dropped around 21:38, to fluctuate between 0.075 and 0.15.

CPU usage of the Prometheus container. I increased the cAdvisor scrape interval to 30s around 21:38.

Finally, a small misguided adventure in metrics reduction. I saw that for the container metrics, there were entries for each Pod that had POD as their container name. I surmised that those were the metrics for the pause container Kubernetes uses to hold the networking namespace, and I thought I could drop them to reduce the ingest a bit further. But it turned out that yes, this container is actually important, because it is where all the networking metrics for a Pod are reported. So I had to revert the drop.
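
For reference, the since-reverted drop looked roughly like this, again in the cAdvisor metric relabelings:

kubelet:
  serviceMonitor:
    cAdvisorMetricRelabelings:
      # Don't do this: the 'POD' pause container is where a Pod's network
      # metrics are reported, so dropping it loses those series as well
      - action: drop
        sourceLabels: [container]
        regex: POD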

Conclusion

I’m just going to stop here, short of the 30 minute reading time limit. 😉

This project was a very enjoyable success. I of course always welcome any chance to look at my metrics. And the main goal was reached to its fullest: I’ve now got my metrics in an S3 bucket, and will never have to increase a volume size again. The downsampling was a really nice bonus: somewhat smaller storage requirements after two years, and with the 1h precision I can just keep the metrics indefinitely. The only thing I would wish for is that the Thanos Querier automatically queried the next lower precision when it doesn’t find any raw precision data.

I was also quite happy that this project had me learn a bit more about how Prometheus stores its data, and it was another welcome trigger to reduce the metrics ingestion at least a little bit more.

This project has again shown me that I should get a move on and start scraping more of my services, instead of mostly scraping Kubernetes, Ceph and my hosts. I would have really loved to show some plots of Prometheus’ own metrics illustrating the effects of my metric reduction attempts.

Finally, the Thanos block viewer again demonstrates a principle I read in The Art of Unix Programming many years ago: The Rule of Transparency. It’s always a good idea to make your program’s inner workings transparent, and the block viewer was genuinely helpful.

So what’s next? I decided to continue going down my “smaller project” list before starting something big and completely new again. So the next thing will likely be the migration from Gitea to Forgejo, simply because that’s next on the list of Homelab things to do.