TSDB storage use constantly increasing


Rufus Schäfing

Mar 2, 2021, 12:07:56 PM3/2/21
to Prometheus Users
Hey everyone!

We're running a fairly large (at least for me) Prometheus-monitored environment. At the time of writing, 'prometheus_tsdb_storage_blocks_bytes' reports 6.24 TiB of storage used.

That alone wouldn't be a huge problem, I guess: calculating an estimate for the storage needed from our scrape sizes and retention time (26w), our figure seems to be in the right ballpark.
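For reference, the usual back-of-the-envelope estimate looks like the sketch below. The numbers are assumptions for illustration (a ~4.5M series head, a 30s scrape interval, and the roughly 1-2 bytes per compressed sample the Prometheus storage docs suggest), not measured figures from this setup:

```python
# Rough Prometheus disk-usage estimate. All inputs are assumptions for
# illustration; Prometheus' storage docs suggest ~1-2 bytes/sample after
# compression, and we use the pessimistic 2 bytes here.
RETENTION_SECONDS = 26 * 7 * 24 * 3600   # 26 weeks of retention
ACTIVE_SERIES = 4_500_000                # assumed head-series count
SCRAPE_INTERVAL = 30                     # seconds (assumed)
BYTES_PER_SAMPLE = 2                     # pessimistic compression estimate

samples_per_second = ACTIVE_SERIES / SCRAPE_INTERVAL
needed_bytes = RETENTION_SECONDS * samples_per_second * BYTES_PER_SAMPLE
print(f"{needed_bytes / 2**40:.1f} TiB")  # ~4.3 TiB
```

Under these assumptions the estimate lands in the same ballpark as the reported 6.24 TiB, which matches the "figure seems to be in the ballpark" observation above.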

Our main issue is that storage consumption keeps increasing, and I can't figure out exactly why or how to deal with it.

Looking at the 1-week deriv of used block bytes I get around 600 KiB/s, which works out to roughly 48 GiB/d - that seems quite excessive to me. Every ~20 days around 500 GiB are shaved off, but overall growth exceeds shrinkage.
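For what it's worth, that unit conversion checks out (a quick sanity check, nothing Prometheus-specific):

```python
# Convert the observed growth rate from KiB/s to GiB/day.
growth_kib_per_s = 600
growth_gib_per_day = growth_kib_per_s * 1024 * 86400 / 2**30
print(f"{growth_gib_per_day:.1f} GiB/day")  # ~49.4 GiB/day
```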

Our current Prometheus has data from the beginning of December until now, and it looks like it's been this way at least since then.

Now, I've been graphing different metrics pertaining to storage use, TSDB behaviour, target count and churn rate.

In terms of churn, I've noticed that the 'Kubernetes cAdvisor' job seems to be generating new series all the time. I'm not sure how to interpret the numbers, but compared to other jobs it's definitely noticeable (around 1%-1.5% of all head series are created by cAdvisor). This also makes sense, since containers and everything around them keep changing.

If there is a connection between churn and storage use, I don't quite get it. From reading the TSDB documentation I assume that more new series lead to more series metadata in the blocks and worse compression performance?

I'd imagine that a big part of the storage is used for actual sample data, though, and not so much for labels and such. Please enlighten me if I'm misunderstanding this point!
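To put a purely hypothetical number on that intuition: even under aggressive churn, per-series index data tends to be small next to the sample data. A sketch, where both the churn rate and the ~1 KiB of index/metadata per series per block are assumed figures, not measurements from this setup:

```python
# Hypothetical churn-overhead estimate. Both inputs are assumptions for
# illustration - neither is a measured figure from this environment.
new_series_per_day = 500_000          # assumed churn rate
index_bytes_per_series = 1024         # assumed per-block index cost per series
index_gib_per_day = new_series_per_day * index_bytes_per_series / 2**30
print(f"{index_gib_per_day:.2f} GiB/day of extra index data")  # ~0.48 GiB/day
```

So even under these assumptions, churn-driven index overhead would be well below the observed ~48 GiB/d growth, supporting the idea that samples dominate.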

Over the course of 3 months, the number of head series has been quite stable, between 4.2 million and 4.7 million. The last month's numbers are also lower than January's - seemingly decreasing.

In a 12-hour window, the head series count currently follows a sawtooth pattern between ~4.2 million and ~4.5 million.

The one metric I've found to correlate with our storage use is the number of loaded TSDB blocks, which has definitely been trending upwards since at least the beginning of December.
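For context, a rough expectation of the steady-state block count can be sketched from the documented defaults: the maximum block duration defaults to 10% of the retention time, capped at 31 days, so the number of fully-compacted blocks should level off rather than grow indefinitely (retention value assumed to match the 26w above):

```python
# Expected steady-state count of fully-compacted TSDB blocks, assuming the
# default max block duration of min(10% of retention, 31 days).
retention_days = 26 * 7                          # 26 weeks = 182 days
max_block_days = min(retention_days * 0.10, 31)  # default cap behaviour
full_blocks = retention_days / max_block_days
print(f"max block: {max_block_days:.1f}d, ~{full_blocks:.0f} full-size blocks")
```

That suggests roughly ten full-size blocks plus a handful of smaller, not-yet-compacted recent ones, so a loaded-block count that keeps climbing for months does look unusual.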

The inner workings concerning chunks, blocks and compaction are quite mysterious to me at this point, so I cannot really come up with an explanation for this either.


I think/hope that's everything from my side. I also hope I've explained everything in an understandable way - I'm quite new to these kinds of setups and performance issues, and English is only my second language.

I don't expect anyone to solve my problem (though that would be great, of course). I'd definitely love to hear some opinions or insights from someone more experienced. Good resources or ideas for learning how to debug/analyze this would be appreciated as well!


My thanks in advance to anyone reading this!

- Rufus

Jiacai Liu

Mar 3, 2021, 5:07:05 AM3/3/21
to Rufus Schäfing, promethe...@googlegroups.com
Compared with samples, series metadata is negligible in size.
How old will your queries be? The full 26w? If your queries only care
about recent data, maybe you can move old data to remote storage, which
is cheap. Remote storage can also be queried, but with worse
performance.
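For what it's worth, that setup is configured in prometheus.yml roughly like this (a minimal sketch; both endpoint URLs are placeholders):

```yaml
# Minimal remote read/write sketch - the endpoint URLs are placeholders.
remote_write:
  - url: "http://remote-storage.example.com/api/v1/write"

remote_read:
  - url: "http://remote-storage.example.com/api/v1/read"
```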

Julien Pivotto

Mar 3, 2021, 5:10:51 AM3/3/21
to Rufus Schäfing, Prometheus Users
If you could share a screenshot of the benchmark dashboard:
https://grafana.com/grafana/dashboards/12054
and more details about the setup (Prometheus version, command-line
flags), it would help us better understand your exact situation.

Thanks!




--
Julien Pivotto
@roidelapluie

Stuart Clark

Mar 3, 2021, 9:13:33 AM3/3/21
to Jiacai Liu, Rufus Schäfing, promethe...@googlegroups.com
On 03/03/2021 10:06, Jiacai Liu wrote:
> Compared with samples, series metadata is negligible in size.
> How old will your queries be? The full 26w? If your queries only care about recent
> data, maybe you can move old data to remote storage, which is cheap.
> Remote storage can also be queried, but with worse performance.
Depending on the remote-write destination used, the amount of storage
needed could be a lot larger than it would be if the data were kept
directly in Prometheus.

--
Stuart Clark

Jiacai Liu

Mar 3, 2021, 10:07:12 PM3/3/21
to Stuart Clark, Rufus Schäfing, promethe...@googlegroups.com
Of course, it depends on what kind of storage you use for legacy
metrics. FYI, I use HDDs for remote storage and SSDs for local
storage.