WAL size estimate

Florin Andrei

Feb 21, 2020, 4:45:30 PM
to Prometheus Users
To store Prometheus data, I have a volume with a given, fixed size. I need to make sure usage level stays below 100%.

I've already set storage.tsdb.retention.time, but this is not always very effective. Sometimes large data spikes will fill up the volume. I cannot reduce the retention time too much.

I want to also use storage.tsdb.retention.size. But the problem is, this doesn't take into account the WAL file size.

Is there a way to estimate the size of WAL data? Are there any guidelines for fitting the retention size in a given volume size?

Thanks!

Dhiman Barman

Feb 21, 2020, 4:59:50 PM
to Florin Andrei, Prometheus Users

In 2.15.0
[ENHANCEMENT] TSDB: WAL size is now used for size based retention calculation. 



--
You received this message because you are subscribed to the Google Groups "Prometheus Users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to prometheus-use...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/prometheus-users/f6eb1c19-5721-4d84-8b50-0f2699ff3901%40googlegroups.com.

Florin Andrei

Feb 21, 2020, 5:04:57 PM
to Prometheus Users
Awesome! Thank you, got it, will upgrade.



Andrey Kezikov

Feb 21, 2020, 5:13:26 PM
to Prometheus Users
In most cases WAL growth is fairly linear, as long as you are not adding exporters dynamically. So you can measure the increase per hour and use that as your number.
Also, this article may be somewhat out of date for current versions, but it gives some insight into what causes the WAL to grow: https://www.robustperception.io/how-much-ram-does-prometheus-2-x-need-for-cardinality-and-ingestion
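
The per-hour measurement suggested above can be sketched as follows. This is a minimal illustration, not a Prometheus tool: it assumes the WAL lives in a directory you can read (the path in the usage note is hypothetical), and the function names are made up for this example. It sums the file sizes twice and scales the difference to an hourly rate.

```python
import os
import time


def dir_size_bytes(path):
    """Return the total size of all regular files under path."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            try:
                total += os.path.getsize(os.path.join(root, name))
            except OSError:
                # A segment can be deleted while we are walking; skip it.
                pass
    return total


def growth_per_hour(size_start, size_end, elapsed_seconds):
    """Convert two size samples into a bytes-per-hour growth rate."""
    if elapsed_seconds <= 0:
        raise ValueError("elapsed_seconds must be positive")
    return (size_end - size_start) * 3600.0 / elapsed_seconds


def sample_wal_growth(wal_dir, interval_seconds=900):
    """Sample wal_dir twice, interval_seconds apart, and return bytes/hour."""
    start = dir_size_bytes(wal_dir)
    time.sleep(interval_seconds)
    end = dir_size_bytes(wal_dir)
    return growth_per_hour(start, end, interval_seconds)
```

For example, `sample_wal_growth("/var/lib/prometheus/wal")` (hypothetical path) blocks for 15 minutes and returns the observed rate. A sample that happens to span a WAL truncation will show a misleadingly low or negative rate, so it is worth taking several samples.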

To reduce it, you can decrease the retention time. Prometheus runs compaction jobs that compact WAL data older than 1/10 of the retention time.

Another workaround is to adjust the --storage.tsdb.min-block-duration and --storage.tsdb.max-block-duration options.
Example: --storage.tsdb.min-block-duration=30m --storage.tsdb.max-block-duration=2h will cause compactions to run every 30 minutes for WAL files older than 2 hours. This also consumes fewer resources because of the smaller chunks.

NOTE: although I see more and more people tweaking these options, the Prometheus developers strongly recommend against using them, as they are meant for internal performance testing, and non-default block sizes can cause performance issues.

Ben Kochie

Feb 22, 2020, 4:45:00 AM
to Florin Andrei, Prometheus Users
You can also enable compression on the WAL with --storage.tsdb.wal-compression.


Florin Andrei

Feb 24, 2020, 2:37:16 PM
to Prometheus Users
How often is the size limit checked? Is there a way to estimate when the next check will happen?



Andrey Kezikov

Feb 24, 2020, 3:43:08 PM
to Prometheus Users
The default value is 2h, so we can assume the check happens every two hours if the min block parameter hasn't been changed.

Florin Andrei

Feb 24, 2020, 4:09:42 PM
to Prometheus Users
Andrey,

Got it, and I just saw Prometheus doing that on a test instance.

How much retention size overhead would it be prudent to leave for housekeeping? On a 100 GB volume dedicated to Prometheus data, what's the best way to estimate the maximum safe value for storage.tsdb.retention.size?

There are two factors here I am concerned with:
- if the limit is only checked once in a while, it may go over it as new data keeps being written
- once every 2 hours, as it's moving WAL files and checkpoints around, it may temporarily surpass the size limit

Andrey Kezikov

Feb 24, 2020, 5:04:58 PM
to Prometheus Users
It depends heavily on your data ingestion rate.
The best way would be to check the average growth of the WAL files per 15 minutes (S). From that, you have to allow 8*S for the 2h delay between checks, plus an additional 8*S of overhead, because WAL growth is not paused during compactions and the temporary compacted data is kept on the same storage. Deletion of the WAL files is the last step, and it happens only if all previous steps succeeded.

That's only for the WAL; you should also consider how much historical data you have to keep. That should be a bit easier to calculate, because you should already have data blocks covering similar time periods, from which you can get the amount ingested per hour.

Considering all of the above, you can calculate what to set for storage.tsdb.retention.size, leaving no less than about 30% of the storage free for cases when compactions get stuck and the WAL is not compacted in time and keeps growing.
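
The arithmetic above can be sketched as follows. This is one conservative reading of the advice, not an official formula: how the 8*S + 8*S WAL allowance and the 30% free-space margin combine is an assumption, as are the function names and the sample growth figure. Since 2.15.0 counts the WAL toward size-based retention (per the earlier reply), subtracting the WAL allowance may be stricter than necessary.

```python
GIB = 1024 ** 3


def wal_headroom_bytes(s_bytes, intervals_between_checks=8):
    """WAL allowance: growth between checks, plus the same again
    for growth and temporary data while a compaction is running."""
    return 2 * intervals_between_checks * s_bytes


def max_retention_size_bytes(volume_bytes, s_bytes,
                             intervals_between_checks=8,
                             free_fraction=0.30):
    """Conservative upper bound for storage.tsdb.retention.size.

    Assumption: keep free_fraction of the volume free, then also
    subtract the WAL allowance from what remains.
    """
    usable = volume_bytes * (1.0 - free_fraction)
    limit = usable - wal_headroom_bytes(s_bytes, intervals_between_checks)
    if limit <= 0:
        raise ValueError("volume too small for this WAL growth rate")
    return int(limit)


if __name__ == "__main__":
    # Hypothetical input: 100 GiB volume, WAL growing 0.5 GiB per 15 minutes.
    limit = max_retention_size_bytes(100 * GIB, GIB // 2)
    print(f"retention.size upper bound: {limit / GIB:.1f} GiB")
```

With those hypothetical inputs the WAL allowance is 16 * 0.5 GiB = 8 GiB, and the bound comes out at roughly 62 GiB. If the relevant window is longer than 2 hours, pass a larger `intervals_between_checks` (the number of 15-minute intervals in that window).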


Please correct me if I'm wrong, but my understanding is that retention.size only governs deletion of DB files, not the triggering of WAL compactions. If so, you have to calculate not for the 2 hours between checks, but for 1/10 of the retention period, as Prometheus only starts compacting WAL data older than that.