At the time of writing, I have 328 GiB of Prometheus data. When it all started, I had about 250 GiB. I could stop gathering more data whenever I like. 😅
So I’ve got a lot of Prometheus data. Especially since I started the Kubernetes cluster - or rather, since I started scraping it - I had to regularly increase the size of the storage volume for Prometheus. This might very well be due to my 5 year retention. But part of it, as it will turn out later, was because some of the things I was scraping had a 10s scrape interval configured.
So where’s all the data coming from? There are currently 21 hosts with the standard node exporter running. Then there’s the Kubernetes scraping I’m doing with kube-prometheus-stack. That gathers a lot of metrics for every single container I’ve got running. I don’t know how many those are right now, but at least 196, because that’s the number of Pods which are currently running. Then there’s also my Ceph cluster. And a few more bits and bobs, but I doubt that they contribute very much.
Here’s the problem in a single plot: Size of my Prometheus volume.
So I was getting a bit tired of regularly having to increase the size of my Prometheus volume. It highlights the utter ridiculousness of the amount of data gathering I’m doing. 😁 I needed a solution. I considered drastically reducing my metrics gathering. The counterpoint: But pretty graphs! So another solution needed to be found.
Enter Thanos. Two things drew me to it. First and foremost, it promised to allow me to dump my metrics data into an S3 bucket. Which is great, because I would not have to worry about volume size increases anymore. The next time I run out of storage for the metrics, I would be running out of storage, period. And thanks to Ceph, I would just need to throw in an additional disk somewhere should that ever happen. That alone would already be a great advantage over my current setup. But Thanos also supports downsampling of data. While I do intentionally keep all data for five years right now, I don’t really need that data in full precision. So this would allow me to reduce my storage usage without having to drop data entirely. I will even end up with more retention than before, just not in full precision.
How Thanos works
Thanos works with multiple components as follows: Overview of Thanos
The components marked in white in the diagram are the original components of my metrics setup, while the new Thanos components are kept in blue. Thanos starts out taking the uncompacted blocks from Prometheus’ storage via the Thanos Sidecar, uploading them unchanged to S3. Once the blocks are uploaded, they are downloaded again by the Compactor, whose main job is to compact the blocks, similar to what Prometheus would do.
Queries against this storage are done by the Querier. It is not only connected to the S3 bucket and able to query the blocks there via range requests, but also to the Sidecar. This is necessary because Prometheus (by default) only creates a new actual block every two hours. Before that, newly scraped metrics are kept in the head block. So to get the most recent data, the Querier needs to go to the Sidecar. For queries over longer intervals, the Querier is able to combine data from multiple sources.
And finally, Grafana is no longer pointed at Prometheus, but instead at the Thanos Querier. There’s also an additional component, the Thanos Query Frontend, that does query distribution and caching. But to be honest, it doesn’t look like I need it right now.
Thanos setup
The first step to complete was setting up an S3 bucket, which I did via my Rook Ceph cluster:
apiVersion: objectbucket.io/v1alpha1
kind: ObjectBucketClaim
metadata:
name: bucket-thanos
spec:
bucketName: {{ .Values.bucketName }}
storageClassName: rgw-bulk
Next step is the setup of the Sidecar. As I’m running kube-prometheus-stack
for my monitoring stack and that already provides Thanos integration, I used that.
There are a number of changes necessary in the Prometheus part of the values.yaml
file. First, prometheus.prometheusSpec.disableCompaction: true
needs to be set.
That completely disables Prometheus’ own compaction, which is necessary so Thanos
can take over compaction duties. Then I also set some Thanos related options:
prometheus:
thanosService:
enabled: true
thanosServiceMonitor:
enabled: false
thanosServiceExternal:
enabled: false
thanosIngress:
enabled: false
prometheusSpec:
disableCompaction: true
thanos:
objectStorageConfig:
existingSecret:
name: thanos-objectstore-config
key: "bucket.yml"
logFormat: json
additionalArgs:
- name: shipper.upload-compacted
I didn’t need any ingress to the Sidecar, so I disabled it. As is my habit, I also disabled Thanos’ own metric gathering, at least until I could get around to properly setting it up and creating some dashboards.
The thanos:
section provides the configuration for the Sidecar. It’s part of
the Prometheus Operator config, so it’s not the kube-prometheus-stack which adds
the Thanos Sidecar, but the Prometheus Operator. The content of the prometheusSpec.thanos
key is
copied verbatim into the thanos section
of the PrometheusSpec for the Prom operator, so any other options from that
section can also be added here.
The shipper.upload-compacted
flag for the Thanos Sidecar is required so that
it uploads already compacted blocks to the S3 bucket. Without this option, the
Sidecar will only ever touch uncompacted blocks. As I wanted to move my entire
metrics history to S3, I enabled the option.
The main problem during the setup, as seems to happen so often, was how to effectively use the S3 config and credentials so helpfully provided by Rook in the form of a Secret and a ConfigMap. There are two ways of providing the bucket config to Thanos, both need a Thanos-specific config file, either supplied as an actual file or by providing the file content as a verbatim string parameter to a command line flag.
Because the ability to provide environment variables to the Sidecar was completely missing, I opted again for my external-secrets Kubernetes Store approach to providing the S3 credentials. For details, see this post. But external-secrets does not allow taking some values from a ConfigMap for the template, so while I could provide the credentials from the Secret generated by Rook, I couldn’t use the configs from the ConfigMap it also creates and had to hardcode them:
apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
name: thanos-objectstore-config
spec:
refreshInterval: "10m"
secretStoreRef:
name: monitoring-secrets-store
kind: SecretStore
target:
name: thanos-objectstore-config
template:
data:
bucket.yml: |
type: S3
config:
bucket: thanos
endpoint: rook-ceph-rgw-rgw-bulk.example.svc:80
disable_dualstack: true
aws_sdk_auth: false
access_key: {{ `{{ .AWS_ACCESS_KEY_ID }}` }}
secret_key: {{ `{{ .AWS_SECRET_ACCESS_KEY }}` }}
insecure: true
bucket_lookup_type: path
dataFrom:
- extract:
key: bucket-thanos
Once I deployed this configuration, the Sidecar immediately started uploading the older, already compacted blocks. For the about 250 GB worth of metrics data I had at that point, it took about 4.5h to upload everything.
Next, the deployment of the other Thanos components. I decided to deploy them
into my monitoring
namespace, similar to kube-prometheus-stack, because it
allowed me to share the configs and Secret between the Sidecar and the other
components.
The first Thanos standalone component I deployed was the Thanos Store. It serves as a backend for the Querier, downloading and supplying blocks from the S3 bucket.
Before deploying the Store, I had to define a cache config. This particular cache is for the indexes of Prometheus blocks. I decided on using Redis for this, as I’ve already got an instance running anyway. My configuration looks like this:
apiVersion: v1
kind: ConfigMap
metadata:
name: redis-cache-conf
data:
redis-cache.yaml: |
type: REDIS
config:
addr: redis.redis.svc.cluster.local:6379
tls_enabled: false
cache_size: 256MB
max_async_buffer_size: 100000
One really annoying thing: The unit of the cache_size
doesn’t seem to be
documented anywhere. So I went spelunking a little bit. First, I looked up the
cache_size
in the Thanos repo:
clientOpts := rueidis.ClientOption{
InitAddress: strings.Split(config.Addr, ","),
ShuffleInit: true,
Username: config.Username,
Password: config.Password,
SelectDB: config.DB,
CacheSizeEachConn: int(config.CacheSize),
Dialer: net.Dialer{Timeout: config.DialTimeout},
ConnWriteTimeout: config.WriteTimeout,
DisableCache: clientSideCacheDisabled,
TLSConfig: tlsConfig,
}
Through the rueidis
in the package name of that struct, I landed on the repo
of the Redis Go client:
// CacheStoreOption will be passed to NewCacheStoreFn
type CacheStoreOption struct {
// CacheSizeEachConn is redis client side cache size that bind to each TCP connection to a single redis instance.
// The default is DefaultCacheBytes.
CacheSizeEachConn int
}
And then I finally found the default value and figured out what the unit was here:
const (
// DefaultCacheBytes is the default value of ClientOption.CacheSizeEachConn, which is 128 MiB
DefaultCacheBytes = 128 * (1 << 20)
And all of that sleuthing just to realize that the value takes any unit I want. 🤦
Ah well. At least I got to look at some Go code again. All of that said, the Store also needs a bit of local disk space, as a scratch space for temporarily downloading chunks or indexes. I gave it a 5 GiB volume, and that has been more than enough the last couple of weeks since I’ve had the setup running.
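The Deployment below references a PVC named thanos-store-volume which I’m not showing; a minimal sketch of what such a claim could look like, with the storageClassName being a placeholder rather than my actual class:
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: thanos-store-volume
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 5Gi
  # Placeholder: use whatever RBD-backed StorageClass the cluster provides.
  storageClassName: rbd-bulk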
The deployment of the store then looks like this:
apiVersion: apps/v1
kind: Deployment
metadata:
name: thanos-store
spec:
replicas: 1
selector:
matchLabels:
homelab/app: thanos-store
strategy:
type: "Recreate"
template:
metadata:
labels:
homelab/app: thanos-store
annotations:
checksum/redis-conf: {{ include (print $.Template.BasePath "/redis-cache-config.yaml") . | sha256sum }}
spec:
automountServiceAccountToken: false
securityContext:
fsGroup: 1000
containers:
- name: thanos-store
image: quay.io/thanos/thanos:{{ .Values.appVersion }}
args:
- store
- --cache-index-header
- --chunk-pool-size=1GB
- --data-dir={{ .Values.store.cacheDir }}
- --index-cache.config-file=/homelab/thanos-store/configs/redis-cache.yaml
- --log.format={{ .Values.logFormat }}
- --log.level={{ .Values.logLevel }}
- --objstore.config-file=/homelab/thanos-store/configs/bucket-config.yml
- --web.disable
- --grpc-address=0.0.0.0:{{ .Values.ports.grpcPort }}
- --http-address=0.0.0.0:{{ .Values.ports.httpPort }}
volumeMounts:
- name: cache
mountPath: {{ .Values.store.cacheDir }}
- name: thanos-configs
mountPath: /homelab/thanos-store/configs
readOnly: true
resources:
requests:
cpu: 200m
memory: 1500Mi
limits:
memory: 1500Mi
livenessProbe:
failureThreshold: 8
httpGet:
path: /-/healthy
port: {{ .Values.ports.httpPort }}
scheme: HTTP
periodSeconds: 30
timeoutSeconds: 1
readinessProbe:
failureThreshold: 20
httpGet:
path: /-/ready
port: {{ .Values.ports.httpPort }}
scheme: HTTP
periodSeconds: 5
ports:
- name: store-http
containerPort: {{ .Values.ports.httpPort }}
protocol: TCP
- name: store-grpc
containerPort: {{ .Values.ports.grpcPort }}
protocol: TCP
volumes:
- name: cache
persistentVolumeClaim:
claimName: thanos-store-volume
- name: thanos-configs
projected:
sources:
- secret:
name: thanos-objectstore-config
items:
- key: "bucket.yml"
path: bucket-config.yml
- configMap:
name: redis-cache-conf
items:
- key: "redis-cache.yaml"
path: redis-cache.yaml
Nothing special to see here, so let’s move on to the next piece of the puzzle.
At this point, the upload of the older blocks was done, so I went into the config
and removed the shipper.upload-compacted
flag from the additionalArgs
of the
Thanos Sidecar config. I still left my 5 year retention in Prometheus for now,
because I hadn’t tested anything related to Thanos yet.
That’s coming now, with the deployment of the Querier. That’s the component that connects to a number of metrics stores, which can be Thanos Store instances or Thanos Sidecars on multiple Prometheus instances, and queries them for data. It implements the Prometheus query language, so it’s fully compatible with frontends like Grafana.
It’s also not very complicated, as it doesn’t need any local storage for example:
apiVersion: apps/v1
kind: Deployment
metadata:
name: thanos-querier
spec:
replicas: 1
selector:
matchLabels:
homelab/app: thanos-querier
strategy:
type: "Recreate"
template:
metadata:
labels:
homelab/app: thanos-querier
spec:
automountServiceAccountToken: false
securityContext:
fsGroup: 1000
containers:
- name: thanos-querier
image: quay.io/thanos/thanos:{{ .Values.appVersion }}
args:
- query
- --query.auto-downsampling
- --log.format={{ .Values.logFormat }}
- --log.level={{ .Values.logLevel }}
- --grpc-address=0.0.0.0:{{ .Values.ports.grpcPort }}
- --http-address=0.0.0.0:{{ .Values.ports.httpPort }}
- --endpoint=dnssrv+_thanos-store-grpc._tcp.thanos-store.monitoring.svc
- --endpoint=dnssrv+_grpc._tcp.monitoring-kube-prometheus-thanos-discovery.monitoring.svc
resources:
requests:
cpu: 200m
memory: 512Mi
livenessProbe:
failureThreshold: 8
httpGet:
path: /-/healthy
port: {{ .Values.ports.httpPort }}
scheme: HTTP
periodSeconds: 30
timeoutSeconds: 1
readinessProbe:
failureThreshold: 20
httpGet:
path: /-/ready
port: {{ .Values.ports.httpPort }}
scheme: HTTP
periodSeconds: 5
ports:
- name: querier-http
containerPort: {{ .Values.ports.httpPort }}
protocol: TCP
- name: querier-grpc
containerPort: {{ .Values.ports.grpcPort }}
protocol: TCP
The only thing to note is the configuration of the endpoints. There are a number
of options. I decided to configure them via DNS and using the records for the
Store and Sidecar services, which worked nicely. Note that I was even able to
use SRV queries, so I didn’t need to hardcode the ports either. Honestly, more
things ought to support SRV queries. The two --endpoint
flags tell the Querier
to request data from my Sidecar and Thanos Store deployments.
This means that for older data, the Querier will take it from the Store, and for
the most current data (the past 2h max in my config) it will go to the Sidecar,
which in turn will query Prometheus itself.
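As a quick sanity check, the Querier speaks the standard Prometheus HTTP API, so it can be queried directly. A hedged example from inside the cluster, using the service name and port from my setup:
# Instant query against the Thanos Querier; dedup=true asks it to deduplicate
# series coming from multiple sources (here: Store and Sidecar).
curl -s 'http://thanos-querier.monitoring.svc.cluster.local:10902/api/v1/query' \
  --data-urlencode 'query=up' \
  --data-urlencode 'dedup=true'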
The last Thanos component to be deployed was the Compactor. Its job is to take the raw blocks uploaded to the bucket by the Sidecar and compact them. That doesn’t reduce the size of the actual samples at all, because they’re all kept, but it does reduce the size of the index, as duplicate entries for the same series in different blocks could be combined. As an example of the current situation, I’ve got a couple of 2h blocks where the index takes up around 14.2 MiB. But the 8h, already compacted block right before those has an index of 29.2 MiB. Without compaction, the four 2h blocks making up the 8h block would take a total of 4x14.2=56.8 MiB, instead of 29.2 MiB.
The Compactor does not need to be kept running all of the time; it could be deployed as a CronJob, as it can just do its thing and then shut down again. It only interacts with the rest of the system by downloading blocks from the S3 bucket, working on them, uploading the result and possibly deleting some now unneeded blocks. But I decided to run it as a Deployment, because I figure that I would need to keep the resources free for its regular runs anyway, so why not just keep it running?
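For completeness, here is a hedged sketch of what the CronJob variant could have looked like; the schedule and mount paths are made up, and the rest just mirrors the Deployment below:
apiVersion: batch/v1
kind: CronJob
metadata:
  name: thanos-compactor
spec:
  schedule: "0 */6 * * *"   # example value: run every six hours
  concurrencyPolicy: Forbid # never let two compactions touch the same bucket
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: OnFailure
          containers:
            - name: thanos-compactor
              image: quay.io/thanos/thanos:{{ .Values.appVersion }}
              args:
                - compact
                # no --wait flag: compact once, then exit
                - --data-dir=/scratch
                - --objstore.config-file=/etc/thanos/bucket.yml
              volumeMounts:
                - name: scratch
                  mountPath: /scratch
                - name: objectstore-conf
                  mountPath: /etc/thanos
                  readOnly: true
          volumes:
            - name: scratch
              emptyDir: {}
            - name: objectstore-conf
              secret:
                secretName: thanos-objectstore-config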
This is what the deployment looks like:
apiVersion: apps/v1
kind: Deployment
metadata:
name: thanos-compactor
spec:
replicas: 1
selector:
matchLabels:
homelab/app: thanos-compactor
strategy:
type: "Recreate"
template:
metadata:
labels:
homelab/app: thanos-compactor
annotations:
checksum/objectstore-conf: {{ include (print $.Template.BasePath "/thanos-objectstore-config.yaml") . | sha256sum }}
spec:
automountServiceAccountToken: false
securityContext:
fsGroup: 1000
containers:
- name: thanos-compactor
image: quay.io/thanos/thanos:{{ .Values.appVersion }}
args:
- compact
- --wait
- --wait-interval=30m
- --retention.resolution-1h=0d
- --retention.resolution-5m=5y
- --retention.resolution-raw=2y
- --data-dir={{ .Values.compactor.scratchDir }}
- --log.format={{ .Values.logFormat }}
- --log.level={{ .Values.logLevel }}
- --objstore.config-file={{ .Values.compactor.configDir }}/bucket.yml
- --http-address=0.0.0.0:{{ .Values.ports.httpPort }}
- --disable-admin-operations
volumeMounts:
- name: scratch
mountPath: {{ .Values.compactor.scratchDir }}
- name: objectstore-conf
mountPath: {{ .Values.compactor.configDir }}
readOnly: true
resources:
requests:
cpu: 500m
memory: 1500Mi
limits:
memory: 1500Mi
livenessProbe:
failureThreshold: 8
httpGet:
path: /-/healthy
port: {{ .Values.ports.httpPort }}
scheme: HTTP
periodSeconds: 30
timeoutSeconds: 1
readinessProbe:
failureThreshold: 20
httpGet:
path: /-/ready
port: {{ .Values.ports.httpPort }}
scheme: HTTP
periodSeconds: 5
ports:
- name: compactor-http
containerPort: {{ .Values.ports.httpPort }}
protocol: TCP
volumes:
- name: scratch
persistentVolumeClaim:
claimName: thanos-compactor-volume
- name: objectstore-conf
secret:
secretName: "thanos-objectstore-config"
The interesting parts here are the command line flags. The --wait
and --wait-interval
flags configure the Compactor to keep running, and to execute its tasks every
30 minutes. I’m also configuring retention. On Prometheus itself, I’ve got five
years worth of retention configured. The end of that time will only be met in
February next year, as I initially set up Prometheus in 2021. But as I’ve noted
above, I’ve gathered quite a lot of data over the years. And I wanted to at least
reduce it a little bit.
What I figured was that I probably didn’t need raw precision data for the whole
five years. So I decided to set the --retention.resolution-raw
to two years.
This means that all data will be retained at full precision for two years. That
should be enough, even for a graph connoisseur like myself. Most of the time when
I look at older data I don’t look at it closely zoomed in, but rather I look at
very long time frames for a metric. I then configured the 5 minute precision
retention to my previous five years, which is still a lot of precision for a long
time frame. Finally, I indulged a little bit and set the 1 hour precision to
never be deleted, so I will always have at least that precision available for
any data I ever gathered.
One thing has to be clear with this downsampling: The way I’ve configured it, it will not actually reduce the overall size of the TSDB. Quite to the contrary, it will increase the size. Because, at least for the first two years, I’m keeping the full precision data, and I’m also adding two more blocks for every raw precision block. One at 5m precision, and one at 1h.
I started out at around 250 GB worth of metrics data. After the downsampling ran through, I ended up with about 343 GB. Well, it’s good that reducing the size was not a goal of the entire exercise. 😅
Running on a Raspberry Pi 4 worker node, working off of a Ceph RBD backed by a SATA SSD, the downsampling of the data since February 2021 took about 19.5h in total. That’s the overall time for computing the blocks for both 1h and 5m precision.
Before continuing to the Grafana configuration, I would like to highlight another
nice feature of Thanos, the block viewer: A screenshot of the TSDB in the bucket, taken shortly after downsampling was done.
This is a really nice tool for looking at the general data about the TSDB. It allowed me to get an overview of how the ingestion rate has increased. I will go into details later, but first here is the configuration to get this view, starting with a Service for the compactor:
apiVersion: v1
kind: Service
metadata:
name: thanos-compactor
spec:
type: ClusterIP
selector:
homelab/app: thanos-compactor
ports:
- name: thanos-compactor-http
port: {{ .Values.ports.httpPort }}
targetPort: compactor-http
protocol: TCP
Then I added the following Ingress to make it available via my Traefik ingress:
apiVersion: traefik.io/v1alpha1
kind: IngressRoute
metadata:
name: thanos-block-viewer
annotations:
external-dns.alpha.kubernetes.io/hostname: "block-viewer.example.com"
external-dns.alpha.kubernetes.io/target: "ingress.example.com"
spec:
entryPoints:
- secureweb
routes:
- kind: Rule
match: Host(`block-viewer.example.com`)
services:
- kind: Service
name: thanos-compactor
namespace: monitoring
port: thanos-compactor-http
scheme: http
Another important setting is the --disable-admin-operations
flag on the Compactor
container. This disables some write operations you could do via the web UI, like
marking a block for deletion or marking a block as not to be compacted. Because
there’s no authentication of any kind available, I disabled these functions.
Configuring Grafana
Initially, I configured a separate data source for Thanos via the values.yaml
file of the kube-prometheus-stack chart, like this:
grafana:
datasources:
datasource.yaml:
apiVersion: 1
editable: false
datasources:
- name: thanos
type: prometheus
access: proxy
url: http://thanos-querier.monitoring.svc.cluster.local:10902
isDefault: false
prometheusType: Thanos
With that, I was able to verify that I could query all of the data, so I then reconfigured Prometheus to only have a 24h retention:
prometheus:
prometheusSpec:
retention: 24h
Prometheus then dutifully removed all of the old blocks in very short order, reducing the size of the TSDB to only about 1.5 GiB. I had wanted to reduce the size of the volume as well, but found that while volume sizes could be increased in Kubernetes, a reduction in size is currently not implemented. I will have to create a fresh volume and copy the data around, and decided to put that off to another day. But I was able to free the space in the Ceph cluster by running this command on the host where the Prometheus Pod was running:
fstrim /var/lib/kubelet/pods/82278fd5-0903-4bdc-b128-562028e435bd/volume-subpaths/pvc-7f0e51e6-40c4-4880-8b52-169c5d1fcdef/prometheus/2
With that, Linux discards unused storage in a filesystem, and in the case of a Ceph RBD, it frees up space in the cluster because RBDs are sparse by default. That took about 20 minutes to run through.
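A hedged helper for anyone wanting to reproduce this: as far as I can tell, the two UUID-like path components are the Pod’s UID and the name of the bound PersistentVolume, and both can be looked up instead of guessed:
# Pod UID of the Prometheus Pod (the first path component under /var/lib/kubelet/pods/).
kubectl -n monitoring get pod prometheus-monitoring-kube-prometheus-prometheus-0 \
  -o jsonpath='{.metadata.uid}{"\n"}'
# PVC to PersistentVolume mapping; the PV name is the pvc-... directory in the path.
kubectl -n monitoring get pvc \
  -o custom-columns=PVC:.metadata.name,PV:.spec.volumeName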
But now back to Grafana. While the querying of Thanos did work, I happened to
have a kubectl logs
run on the Thanos Store logs open when I went to Grafana’s
Drilldown page,
and the logs were flooded with messages like this:
{
"caller":"memcached.go:175",
"err":"the async buffer is full",
"level":"error",
"msg":"failed to cache series in memcached",
"ts":"2025-05-03T22:00:11.260316949Z"
}
And when I say “flooded”, I mean flooded: There were a lot of logs.
Digging a little bit, I found this issue,
which noted that the problem was the max_async_buffer_size
setting in the cache
config. I bumped it up to 100000
and the error went mostly away.
Now I needed to switch all of my dashboards over to using the Thanos data source, instead of going directly to Prometheus. The issue was: I had the “Prometheus” data source configured in all of my Grafana panels, and I did not want to go over all of them and change them to the Thanos source.
I ended up just replacing the original Prometheus source with Thanos in the data source config. For that, I first had to disable the default source that the kube-prometheus-stack Helm chart configures:
grafana:
sidecar:
datasources:
defaultDatasourceEnabled: false
isDefaultDatasource: false
Next, I added Thanos as a data source called “Prometheus”:
grafana:
datasources:
datasource.yaml:
datasources:
- name: Prometheus
uid: prometheus
type: prometheus
access: proxy
url: http://thanos-querier.monitoring.svc.cluster.local:10902
isDefault: true
prometheusType: Thanos
jsonData:
customQueryParameters: "max_source_resolution=auto"
deleteDatasources:
- name: thanos
orgId: 1
I also used the deleteDatasources
entry to have the Grafana provisioning
functionality, documented here,
remove my temporary Thanos source.
This had the desired effect, and I was able to query all of the data through Thanos without having to go into every panel and change the data source.
Then, as intended, the Thanos retention removed all raw precision data older than
two years. I then wanted to make sure that everything still worked. And I was
pretty shocked to see that the answer seemed to be “no”. Here is a plot over the
node_load1{}
metric in May 2023:
Data for normal queries only starts in the middle of May 3rd, even though I definitely had data blocks, both 5m and 1h precision, right back to February 2021.
Initially, I thought just adding the --query.auto-downsampling
to the command
line flags of the Querier would already fix the problem, because that was what
showed up in several similar issues reported in the Thanos bug tracker and the
wider Internet. But that had no effect at all. There is even a Grafana issue,
but this was rejected.
I finally found a serviceable workaround in this issue.
There is seemingly no good way to make use of a single data source and have that
handle downsampled data. It simply doesn’t work. But what does work is creating
a second data source, pointing to the same Thanos, and setting max_source_resolution=5m
for that source:
grafana:
datasources:
datasource.yaml:
datasources:
- name: Thanos-5m
uid: thanos5m
type: prometheus
access: proxy
url: http://thanos-querier.monitoring.svc.cluster.local:10902
isDefault: false
prometheusType: Thanos
jsonData:
customQueryParameters: "max_source_resolution=5m"
And that solved the issue. Using that data source, I’m getting data past the end of the raw resolution data in the TSDB without having to configure anything else. And because I only very occasionally, and then very intentionally, look at data older than a year or so, I don’t have a problem with having to explicitly set a different data source.
I would have really liked if this was handled automatically, either by Grafana or by the Querier, but that just doesn’t seem to be how it works.
Reducing Prometheus metrics ingestion
With the Thanos block viewer on hand, I was finally able to dig a little bit deeper into why I had to increase the size of the Prometheus volume so often. Going back to my oldest raw precision block from May 2023, before the k8s migration, I saw that that 20 day block had a size of almost exactly 1 GiB, with 49.47 MiB of data per day. Then looking at the most recent 20 day block, from March/April 2025, that block had a size of 13.61 GiB with 688 MiB per day. Here is a table with a few milestone blocks:
End Date | Duration | Size | Daily |
---|---|---|---|
2023-05-23 | 20d | 1 GiB | 49 MiB |
2024-02-02 | 20d | 1.25 GiB | 62 MiB |
2024-03-22 | 20d | 7.15 GiB | 361 MiB |
2024-09-20 | 20d | 8.51 GiB | 430 MiB |
2024-12-31 | 20d | 7.84 GiB | 396 MiB |
2025-03-01 | 20d | 10.81 GiB | 546 MiB |
2025-04-11 | 20d | 13.61 GiB | 688 MiB |
2025-05-01 | 7d | 5.10 GiB | 773 MiB |
2025-05-06 | 2d | 1.63 GiB | 834 MiB |
2025-05-12 | 2d | 1.42 GiB | 724 MiB |
2025-05-16 | 2d | 935 MiB | 467 MiB |
It’s pretty clear that the massive jump is coming from the k8s scraping, as I enabled that in March 2024. For the rest of 2024, the daily intake was reasonably stable though. But then, starting in 2025, it increases pretty seriously again. I’m pretty sure that’s because for most of the latter half of 2024, I was working on my backup operator implementation, so there wasn’t much change in the k8s cluster, and most apps were still running in the Nomad cluster. Then, starting in 2025, I began to migrate the rest of the services over, which seemingly increased the amount of scraped data by a lot. This makes some sense, considering that some of my largest series are probably the per-container metrics scraped from the kubelet.
As you can see with the last couple of entries, I made some progress in reducing the ingest in the last couple of days. I would have loved to show you some Prometheus ingest plots, but sadly I only realized that Prometheus provides ingest metrics too late. 😞
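For the record, Prometheus exposes its own ingestion metrics, so the plots I’m missing would have been built on queries along these lines (a sketch using standard Prometheus self-monitoring metrics):
# Samples ingested per second, and the number of active series in the head block.
rate(prometheus_tsdb_head_samples_appended_total[5m])
prometheus_tsdb_head_series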
For analyzing the data and finding metrics to cut down, I went looking
directly at the most recent 20 day block with data ending on 2025-04-11. I then
opened it with the promtool
. This works without having the block in any special
directory structure; it doesn’t need to sit in a Prometheus TSDB. I just downloaded
it from the S3 bucket with s3cmd.
Then I launched promtool
like this:
promtool tsdb analyze ./ 01J3CD4846QQYQEJ3XN7VZ5NMH/
Here, 01J3CD4846QQYQEJ3XN7VZ5NMH
is the name of the block to be analyzed,
in my case the newest block of full 20 day size. The result looks like this:
Block ID: 01JRJ8EHHW12VY9SZX5Z45SSQV
Duration: 485h59m59.948s
Total Series: 367092
Label names: 226
Postings (unique label pairs): 20435
Postings entries (total label pairs): 4101019
Label pairs most involved in churning:
105190 service=monitoring-kube-prometheus-kubelet
105190 endpoint=https-metrics
105190 job=kubelet
102018 metrics_path=/metrics/cadvisor
51674 namespace=vault
50203 container=vault
49917 image=docker.io/hashicorp/vault:1.18.5
43187 job=kube-state-metrics
43187 service=monitoring-kube-state-metrics
43187 endpoint=http
35765 service=kubernetes
35765 namespace=default
35765 job=apiserver
35765 endpoint=https
29513 container=kube-state-metrics
22812 namespace=backups
21820 namespace=kube-system
21620 instance=10.8.11.250:8080
21566 instance=10.8.12.213:8080
20230 namespace=rook-ceph
Label names most involved in churning:
201049 __name__
201049 instance
201049 job
194577 service
194053 endpoint
194053 namespace
155528 pod
143798 container
108242 node
105190 metrics_path
102043 id
94318 name
75452 image
46727 device
38871 uid
32460 le
28929 scope
21394 resource
20692 verb
17317 version
Most common label pairs:
145767 endpoint=https-metrics
145767 job=kubelet
145767 service=monitoring-kube-prometheus-kubelet
128459 service=kubernetes
128459 endpoint=https
128459 namespace=default
128459 job=apiserver
123738 metrics_path=/metrics/cadvisor
62525 component=apiserver
57546 service=monitoring-kube-state-metrics
57546 endpoint=http
57546 job=kube-state-metrics
52926 version=v1
52036 namespace=vault
50810 instance=10.86.5.202:6443
50337 container=vault
50020 image=docker.io/hashicorp/vault:1.18.5
49771 namespace=kube-system
39454 container=kube-state-metrics
33481 scope=cluster
Label names with highest cumulative label value length:
636231 id
252234 name
224352 container_id
136756 mountpoint
41804 __name__
25744 uid
23777 pod
14309 owner_name
14243 created_by_name
12707 device
11585 job_name
11220 type
11194 image_id
7383 resource
5504 csi_volume_handle
4755 client
4465 image
4209 pod_ip
4209 ip
4107 interface
Highest cardinality labels:
3814 id
3528 name
3116 container_id
1316 __name__
1185 mountpoint
718 uid
688 pod
566 device
417 pod_ip
417 ip
354 type
349 owner_name
343 created_by_name
317 client
315 resource
280 interface
216 job_name
188 le
182 container
156 kind
Highest cardinality metric names:
31872 etcd_request_duration_seconds_bucket
25920 apiserver_request_duration_seconds_bucket
21736 apiserver_request_sli_duration_seconds_bucket
15232 container_memory_failures_total
10208 apiserver_request_body_size_bytes_bucket
8092 container_blkio_device_usage_total
6968 apiserver_response_sizes_bucket
6734 container_fs_reads_total
6734 container_fs_writes_total
4590 kube_pod_status_phase
4590 kube_pod_status_reason
4036 container_fs_reads_bytes_total
4036 kubernetes_feature_enabled
4036 container_fs_writes_bytes_total
3808 container_memory_kernel_usage
3808 container_memory_failcnt
3808 container_memory_rss
3808 container_memory_max_usage_bytes
3808 container_oom_events_total
3808 container_memory_working_set_bytes
Before I go any deeper, one glaring omission in the output that has me a little bit confused: There is no indicator of the actual number of samples in a metric or series. So you get a lot of information about labels and series, but nothing about the samples besides the initial total number of samples in the block.
So let’s look at the information we get there. First the typical metadata, like
how long the block is and what the oldest and newest timestamps contained in it
are. One headline number is the count of 367092 series. Let me briefly explain
the difference between a series and a metric. Let’s take as an example
container_fs_reads_total
. This is a metric - a certain value, gathered from
potentially multiple targets, which has certain labels. A series is then
one explicit permutation of those labels’ values. For example, like this:
container_fs_reads_total{
container="POD",
device="/dev/nvme0n1",
endpoint="https-metrics",
instance="300.300.300.1:10250",
job="kubelet",
metrics_path="/metrics/cadvisor",
namespace="rook-cluster",
node="mynode1",
pod="rook-ceph-osd-2-85b8f48c47-p24kc",
prometheus="monitoring/monitoring-kube-prometheus-prometheus",
prometheus_replica="prometheus-monitoring-kube-prometheus-prometheus-0",
service="monitoring-kube-prometheus-kubelet"
}
This is one single series of the container_fs_reads_total
metric - one specific
combination of label values. From what I understand, these series are the basis
for Prometheus’ TSDB storage architecture. Having more or fewer samples per series
doesn’t make much of a difference for Prometheus, but having many more series
per metric tends to get expensive, leading to a cardinality problem for Prometheus
and significantly increasing computational and memory requirements. That’s why
the output of promtool
fixates on labels and their cardinality, not the number
of samples.
I started out looking at the Label names with highest cumulative label value length
section. If I interpret it right, this is the total length of all values for
that particular label. I then went into Grafana’s explore tab and started, well,
exploring. Take the first label, id. Concatenating all values of that label
produces 636k characters. I then chose a random one of the values, which makes
Grafana show you all the metrics using that label+value combination:
container_fs_inodes_total{
container="install-cni-binaries",
device="/dev/sda2",
endpoint="https-metrics",
id="/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-podcd726cab_879b_4b26_9916_278220f88d5b.slice/crio-3e855297d10572a5e369c72b4194911528becb34ec905da058ca78df3a3286ca.scope",
image="quay.io/cilium/cilium@sha256:1782794aeac951af139315c10eff34050aa7579c12827ee9ec376bb719b82873",
instance="300.300.300.2:10250",
job="kubelet",
metrics_path="/metrics/cadvisor",
name="k8s_install-cni-binaries_cilium-bs8mb_kube-system_cd726cab-879b-4b26-9916-278220f88d5b_1",
namespace="kube-system",
node="control-plane1",
pod="cilium-bs8mb",
prometheus="monitoring/monitoring-kube-prometheus-prometheus",
prometheus_replica="prometheus-monitoring-kube-prometheus-prometheus-0",
service="monitoring-kube-prometheus-kubelet"
}
Just looking at the value for the id
label, the problem is immediately clear:
That value is not just extremely long, but it’s probably also unique. A
restart of the container might already result in a new one, a restart of the
Pod definitely would. But the value is also unnecessary to guarantee uniqueness
of the series.
That’s already guaranteed by the pod
plus container
labels. And the same is
true for the name
label, which has 1902 values and similarly looks like it
might be randomly generated. And it too should be covered by the pod
plus
container
label combination when it comes to uniqueness. So I decided to
completely drop both of those labels. Note the job
and the metrics_path
labels. Those indicate where the metric is coming from, namely the kubelet’s
cAdvisor scrape. Those can be configured from the kube-prometheus-stack values.yaml
.
Worth noting here: The chart has different configs for the different metrics paths
the kubelet offers, hence why looking at the metrics_path
label is also important.
I dropped the labels via this config:
kubelet:
serviceMonitor:
cAdvisorMetricRelabelings:
- action: labeldrop
regex: id
- action: labeldrop
regex: name
With that, those two labels will be dropped completely before the samples are
ingested. This will not have an immediate effect on the size of new blocks. That’s
because we’re not gathering fewer samples. We’re instead getting fewer series for
all of the cAdvisor metrics. This will have a larger impact once different blocks
are compacted into larger blocks. The compaction does not actually remove
any samples, but it is able to deduplicate the series in the index. For example,
with those two labels still in there, I would have larger daily blocks on days
with my regular host or service updates. During both of those maintenance actions,
I’m restarting and rescheduling Pods, which would lead to both labels changing -
so I would suddenly have two series in the same block for what is pretty much the
same Pod, because the name
and id
labels would change after a restart of the
node hosting the Pod.
I went through the rest of the list in the section and applied different actions,
ranging from leaving the label untouched, like the pod
label, because it’s needed
for uniqueness, to dropping the entire metric using the label, like the mountpoint
,
which is only used in the node_filesystem_device_error
and node_filesystem_readonly
metrics, neither of which is particularly interesting, so I dropped them in
node_exporter, where they’re coming from.
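To illustrate the same effect on the Prometheus side, here is a hedged sketch of how such a drop could look in a plain scrape config; the job name and target are made up, only the two metric names are from my actual setup:
scrape_configs:
  - job_name: node
    static_configs:
      - targets: ["host1.example.com:9100"]
    metric_relabel_configs:
      # Drop the two mountpoint-heavy metrics before they are ingested.
      - source_labels: [__name__]
        regex: node_filesystem_device_error|node_filesystem_readonly
        action: drop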
I then went through the Highest cardinality metric names
section, and dropped a
lot of the metrics in there because they just didn’t look very interesting.
See, I’m perfectly capable of even dropping entire metrics. I’m a responsible adult! 🥹
But one value in the cardinality section deserves a shout out: etcd_request_duration_seconds_bucket
.
That metric is just humongous. It produces a total of 45k series. That’s how
many unique label combinations it had seen. That comes from the fact that that
metric has labels for 24 histogram buckets times 6 HTTP operations times 317
different Kubernetes object kinds. Wow.
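If I wanted to drop that one metric at ingestion time, the apiserver ServiceMonitor config in kube-prometheus-stack should be the right place; a hedged sketch, with the key path written from memory, so double-check it against the chart’s values:
kubeApiServer:
  serviceMonitor:
    metricRelabelings:
      # Drop the enormous etcd request duration histogram entirely.
      - action: drop
        sourceLabels: [__name__]
        regex: etcd_request_duration_seconds_bucket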
One mistake I made during the configuration that I found just now was that,
as I wrote above, the labeldrop
actions belong in the metric relabelings
.
I had put them into the relabelings
config, but that does not work.
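For contrast, here is a hedged sketch on a generic ServiceMonitor (name and selector are hypothetical): relabelings run against the target before the scrape, while metricRelabelings run against the scraped samples, which is why a labeldrop only has an effect in the latter:
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: example-app
spec:
  selector:
    matchLabels:
      app: example-app
  endpoints:
    - port: metrics
      # Target-level relabeling: runs before the scrape, cannot touch sample labels.
      relabelings: []
      # Sample-level relabeling: this is where a labeldrop actually belongs.
      metricRelabelings:
        - action: labeldrop
          regex: id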
Those initial fixes and cleanups were done last weekend. I did not see much of a drop in the overall size of the gathered metrics, so I dug in a little bit more. Which was when I saw that all those cAdvisor metrics - which, at 196 Pods and at least as many containers, were definitely contributing the most - were scraped at 10 second intervals. Which is ridiculous. I increased the interval to 30s, via this setting in the kube-prometheus-stack chart:
kubelet:
serviceMonitor:
cAdvisorInterval: 30s
Then I had the idea of checking whether any other ServiceMonitors were also configured with too short of a scrape interval, and I discovered that Ceph was also doing 10s by default. I was able to change that in the cluster chart like this:
monitoring:
interval: 30s
One thing I found interesting was that increasing the cAdvisor scrape interval
measurably dropped Prometheus’ CPU usage as well: CPU usage of the Prometheus container. The cAdvisor scrape interval was changed to 30s around 21:38.
Finally, a small misguided adventure in metrics reduction. I saw that for the container
metrics, there were entries for each Pod that had POD
as their container name.
I surmised that those were the metrics for the container Kubernetes uses to hold
the networking namespace, and I thought I could drop it to reduce ingest a bit
further. But it turned out that yes, this container was actually important,
because it is where all the networking metrics for a Pod are reported. So I had
to revert the drop.
Conclusion
I’m just going to stop here, short of the 30 minute reading time limit. 😉
This project was a very enjoyable success. I of course always welcome any chance to look at my metrics. And the main goal was reached to its fullest: I’ve now got my metrics in an S3 bucket, and will never have to increase a volume size again. The downsampling was a really nice bonus: somewhat smaller storage requirements after two years, and thanks to the 1h precision I can just keep the metrics around indefinitely. The only thing I would wish for is that the Thanos Querier would automatically query the next lower precision if it doesn’t find any raw precision data.
I was also quite happy that this project had me learn a bit more about how Prometheus stores its data, and it was another welcome trigger to reduce the metrics ingestion at least a little bit more.
This project has again shown me that I should get a move on and start scraping more of my services, instead of mostly scraping Kubernetes, Ceph and my hosts. I would have really loved to show some plots from scraped Prometheus metrics on the effects of my metric reduction attempts.
Finally, the Thanos block viewer again demonstrates a principle I read in The Art of Unix Programming many years ago: The Rule of Transparency. It’s always a good idea to make your program’s inner workings transparent, and the block viewer was genuinely helpful.
So what’s next? I decided to continue going down my “smaller project” list before starting something big and completely new again. So the next thing will likely be the migration from Gitea to Forgejo, simply because that’s next on the list of Homelab things to do.