tsdb size and retention


s.krist...@dotbydot.gr

Sep 28, 2018, 6:22:06 AM
to Prometheus Users
Hi

My name is Stratos and I have been using Prometheus in our infrastructure for a couple of years now.
I want to migrate our infrastructure from Prometheus 1.6 to 2.4, so I have run a couple of tests to evaluate the new version.

My lab was very simple: a single wmi exporter with a 60s scrape interval and 1h retention.

There are two observations I would like to share and get feedback/help/opinions on.

1. Even though I configured 1h retention, I can see data in the Graph for the last 6 hours. I understand that each data block keeps 2 hours of data, but is this behavior normal? Should I consider 6 hours the minimum retention?

2. The wal folder keeps growing and I cannot estimate the data size. It grows by about 1MB every hour. Is it configurable? How much extra storage should I budget for the WAL?

Thank you in advance for your support

Stratos

Ben Kochie

Sep 28, 2018, 9:26:28 AM
to s.krist...@dotbydot.gr, Prometheus Users
On Fri, Sep 28, 2018 at 12:22 PM <s.krist...@dotbydot.gr> wrote:
Hi

My name is Stratos and I have been using Prometheus in our infrastructure for a couple of years now.
I want to migrate our infrastructure from Prometheus 1.6 to 2.4, so I have run a couple of tests to evaluate the new version.

My lab was very simple: a single wmi exporter with a 60s scrape interval and 1h retention.

There are two observations I would like to share and get feedback/help/opinions on.

1. Even though I configured 1h retention, I can see data in the Graph for the last 6 hours. I understand that each data block keeps 2 hours of data, but is this behavior normal? Should I consider 6 hours the minimum retention?

Prometheus 2.x uses a database that compacts data in 2-hour windows, so the minimum retention comes from deleting whole time windows of data, rather than truncating files per metric like Prometheus 1.x did.

Yes, 6 hours is probably the practical minimum, based on how long the compaction and cleanup lifecycles take.
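For illustration, a minimal sketch of the relevant 2.x startup flags (the paths here are placeholders, not taken from this thread):

./prometheus \
  --config.file=prometheus.yml \
  --storage.tsdb.path=/path/to/data \
  --storage.tsdb.retention=6h

A 2-hour block is only deleted once it falls entirely outside the retention window and a compaction/cleanup cycle runs, which is why the queryable history can exceed the configured retention by a few hours.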
 

2. The wal folder keeps growing and I cannot estimate the data size. It grows by about 1MB every hour. Is it configurable? How much extra storage should I budget for the WAL?

It's a write-ahead log, which is not compressed the way the compacted block directories are. Its size is driven by the rate of ingested data, and it gets cleaned up every 2 hours.
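If you want to see the ingestion rate that drives the WAL size, Prometheus exposes it about itself; a quick sketch against the HTTP API (localhost:9090 is an assumption about where your server listens):

curl -s -G 'http://localhost:9090/api/v1/query' \
  --data-urlencode 'query=rate(prometheus_tsdb_head_samples_appended_total[5m])'

The same expression can be pasted into the expression browser; it reports samples appended per second over the last five minutes.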
 

Thank you in advance for your support

Stratos


s.krist...@dotbydot.gr

Sep 28, 2018, 9:59:47 AM
to Prometheus Users
Ben, thank you very much.

I will change the retention time in my lab to 6h and check the results.

Regarding the WAL size... it is not cleaned up every 2h. It has kept growing all week.

maint@Prometheus2:/$ cd /srv/data
maint@Prometheus2:/srv/data$ du --max-depth=1 --human-readable .
85M     ./wal
340K    ./01CRG47WCZEDT6NY911P1XQZ4M
316K    ./01CRFXC52G28AHG08QA29Y0YDK
86M     .

I suppose that if this were a version bug, it would have been mentioned by others as well...?

The problem is that I am trying to estimate the required disk size for this scope: 1500 wmi exporters with a 60s scrape interval and 744 hours of retention.
If the WAL requires 20x the actual data size, I have a huge problem...

Any ideas?

s.krist...@dotbydot.gr

Oct 1, 2018, 5:48:31 AM
to Prometheus Users
Hi again.

I have changed the retention to 6 hours and now I have metrics for 11-12 hours! Also, the wal folder is still growing...
In my lab I now have only 1 wmi exporter with a 60s scrape interval and 6h retention.
The target is to estimate the required disk size for an infrastructure with 1500 wmi exporters and 30 days of retention.

Is anybody already running version 2.4 who can confirm proper WAL and retention behavior, or explain the behavior of my lab?

maint@Prometheus2:/srv/data$ du --max-depth=1 --human-readable .
54M     ./wal
360K    ./01CRQDPHMT8C9ZQKXCX06VPDFJ
336K    ./01CRPS3B48MHF9VQSP3N7RGPHN
336K    ./01CRPZZ2BW1K9K56NDHCFBGGYQ
336K    ./01CRQ6TWZ7CBMY97APPPBYFPT7
344K    ./01CRPJ7KVRNJK1KX8P2S6JM1FP
56M     .

Goutham Veeramachaneni

Oct 1, 2018, 6:05:51 AM
to s.krist...@dotbydot.gr, Prometheus Users
Hi,

We had problems with Windows that were fixed in 2.4.2. Could you try that? Also, could you share the logs?

WAL should be quite small, and the fact that it's taking up so much is a bug. Could you please upgrade to 2.4.2 and let us know?

Thanks,
Goutham.

s.krist...@dotbydot.gr

Oct 1, 2018, 7:15:45 AM
to Prometheus Users
Hi Goutham

I already upgraded to 2.4.2 last week.
maint@Prometheus2:/opt/prometheus/prometheus-2.4.2.linux-amd64$ ./prometheus --version
prometheus, version 2.4.2 (branch: HEAD, revision: c305ffaa092e94e9d2dbbddf8226c4813b1190a0)
  build user:       root@dcde2b74c858
  build date:       20180921-07:22:29
  go version:       go1.10.3

Do you want me to upgrade the wmi exporter as well? I am using version 0.3.3 (0.4.3 is the latest).
I changed --log.level to debug an hour ago. If you need more logs, I will provide them.

I am at your disposal for further info/actions

Stratos

Today's logs...

level=info ts=2018-10-01T01:00:13.802585127Z caller=compact.go:398 component=tsdb msg="write block" mint=1538344800000 maxt=1538352000000 ulid=01CRPJ7KVRNJK1KX8P2S6JM1FP
level=info ts=2018-10-01T01:00:13.815205774Z caller=head.go:446 component=tsdb msg="head GC completed" duration=2.251838ms
level=info ts=2018-10-01T03:00:13.81135973Z caller=compact.go:398 component=tsdb msg="write block" mint=1538352000000 maxt=1538359200000 ulid=01CRPS3B48MHF9VQSP3N7RGPHN
level=info ts=2018-10-01T03:00:13.823618893Z caller=head.go:446 component=tsdb msg="head GC completed" duration=2.047968ms
level=info ts=2018-10-01T05:00:13.804314662Z caller=compact.go:398 component=tsdb msg="write block" mint=1538359200000 maxt=1538366400000 ulid=01CRPZZ2BW1K9K56NDHCFBGGYQ
level=info ts=2018-10-01T05:00:13.816976411Z caller=head.go:446 component=tsdb msg="head GC completed" duration=2.180046ms
level=info ts=2018-10-01T07:00:17.31755449Z caller=compact.go:398 component=tsdb msg="write block" mint=1538366400000 maxt=1538373600000 ulid=01CRQ6TWZ7CBMY97APPPBYFPT7
level=info ts=2018-10-01T07:00:17.329985682Z caller=head.go:446 component=tsdb msg="head GC completed" duration=2.280076ms
level=info ts=2018-10-01T09:00:14.676870685Z caller=compact.go:398 component=tsdb msg="write block" mint=1538373600000 maxt=1538380800000 ulid=01CRQDPHMT8C9ZQKXCX06VPDFJ
level=info ts=2018-10-01T09:00:14.68941656Z caller=head.go:446 component=tsdb msg="head GC completed" duration=2.218501ms
level=warn ts=2018-10-01T10:18:41.10118061Z caller=main.go:398 msg="Received SIGTERM, exiting gracefully..."
level=info ts=2018-10-01T10:18:41.101255678Z caller=main.go:423 msg="Stopping scrape discovery manager..."
level=info ts=2018-10-01T10:18:41.101266155Z caller=main.go:437 msg="Stopping notify discovery manager..."
level=info ts=2018-10-01T10:18:41.101271981Z caller=main.go:459 msg="Stopping scrape manager..."
level=info ts=2018-10-01T10:18:41.101295137Z caller=main.go:419 msg="Scrape discovery manager stopped"
level=info ts=2018-10-01T10:18:41.101322177Z caller=main.go:433 msg="Notify discovery manager stopped"
level=info ts=2018-10-01T10:18:41.101340384Z caller=manager.go:638 component="rule manager" msg="Stopping rule manager..."
level=info ts=2018-10-01T10:18:41.101351042Z caller=manager.go:644 component="rule manager" msg="Rule manager stopped"
level=info ts=2018-10-01T10:18:41.101430887Z caller=main.go:453 msg="Scrape manager stopped"
level=info ts=2018-10-01T10:18:41.133235899Z caller=notifier.go:512 component=notifier msg="Stopping notification manager..."
level=info ts=2018-10-01T10:18:41.133378708Z caller=main.go:608 msg="Notifier manager stopped"
level=info ts=2018-10-01T10:18:41.133389796Z caller=main.go:620 msg="See you next time!"
level=info ts=2018-10-01T10:18:42.158147949Z caller=main.go:238 msg="Starting Prometheus" version="(version=2.4.2, branch=HEAD, revision=c305ffaa092e94e9d2dbbddf8226c4813b1190a0)"
level=info ts=2018-10-01T10:18:42.15834582Z caller=main.go:239 build_context="(go=go1.10.3, user=root@dcde2b74c858, date=20180921-07:22:29)"
level=info ts=2018-10-01T10:18:42.158433936Z caller=main.go:240 host_details="(Linux 4.4.0-31-generic #50-Ubuntu SMP Wed Jul 13 00:07:12 UTC 2016 x86_64 Prometheus2 (none))"
level=info ts=2018-10-01T10:18:42.158509254Z caller=main.go:241 fd_limits="(soft=1024, hard=4096)"
level=info ts=2018-10-01T10:18:42.15857173Z caller=main.go:242 vm_limits="(soft=unlimited, hard=unlimited)"
level=info ts=2018-10-01T10:18:42.158948285Z caller=main.go:554 msg="Starting TSDB ..."
level=info ts=2018-10-01T10:18:42.159090449Z caller=repair.go:35 component=tsdb msg="found healthy block" mint=1538344800000 maxt=1538352000000 ulid=01CRPJ7KVRNJK1KX8P2S6JM1FP
level=info ts=2018-10-01T10:18:42.15912734Z caller=repair.go:35 component=tsdb msg="found healthy block" mint=1538352000000 maxt=1538359200000 ulid=01CRPS3B48MHF9VQSP3N7RGPHN
level=info ts=2018-10-01T10:18:42.159149073Z caller=repair.go:35 component=tsdb msg="found healthy block" mint=1538359200000 maxt=1538366400000 ulid=01CRPZZ2BW1K9K56NDHCFBGGYQ
level=info ts=2018-10-01T10:18:42.159174607Z caller=repair.go:35 component=tsdb msg="found healthy block" mint=1538366400000 maxt=1538373600000 ulid=01CRQ6TWZ7CBMY97APPPBYFPT7
level=info ts=2018-10-01T10:18:42.1591963Z caller=repair.go:35 component=tsdb msg="found healthy block" mint=1538373600000 maxt=1538380800000 ulid=01CRQDPHMT8C9ZQKXCX06VPDFJ
level=info ts=2018-10-01T10:18:42.182553081Z caller=web.go:397 component=web msg="Start listening for connections" address=0.0.0.0:9090
level=warn ts=2018-10-01T10:18:42.960028991Z caller=head.go:371 component=tsdb msg="unknown series references" count=2751
level=info ts=2018-10-01T10:18:42.96615735Z caller=main.go:564 msg="TSDB started"
level=info ts=2018-10-01T10:18:42.966256138Z caller=main.go:624 msg="Loading configuration file" filename=/opt/prometheus/prometheus-2.4.2.linux-amd64/prometheus.yml
level=debug ts=2018-10-01T10:18:42.96693538Z caller=manager.go:158 component="discovery manager scrape" msg="Starting provider" provider=string/0 subs=[wmi_tests]
level=debug ts=2018-10-01T10:18:42.967037357Z caller=manager.go:158 component="discovery manager notify" msg="Starting provider" provider=string/0 subs=[defc8654e1f8780ce80f9010266ce592]
level=info ts=2018-10-01T10:18:42.967110026Z caller=main.go:650 msg="Completed loading of configuration file" filename=/opt/prometheus/prometheus-2.4.2.linux-amd64/prometheus.yml
level=info ts=2018-10-01T10:18:42.967171739Z caller=main.go:523 msg="Server is ready to receive web requests."
level=debug ts=2018-10-01T10:18:42.967259429Z caller=manager.go:180 component="discovery manager scrape" msg="discoverer channel closed, sending the last update" provider=string/0
level=debug ts=2018-10-01T10:18:42.96732882Z caller=manager.go:180 component="discovery manager notify" msg="discoverer channel closed, sending the last update" provider=string/0
level=debug ts=2018-10-01T10:18:42.967386752Z caller=manager.go:183 component="discovery manager notify" msg="discoverer exited" provider=string/0
level=debug ts=2018-10-01T10:18:42.967570963Z caller=manager.go:183 component="discovery manager scrape" msg="discoverer exited" provider=string/0
level=info ts=2018-10-01T11:00:36.159643482Z caller=compact.go:398 component=tsdb msg="write block" mint=1538380800000 maxt=1538388000000 ulid=01CRQMJXRK05SBD784BDTDNJK9
level=info ts=2018-10-01T11:00:36.201536879Z caller=head.go:446 component=tsdb msg="head GC completed" duration=20.002795ms

s.krist...@dotbydot.gr

Oct 5, 2018, 6:25:37 AM
to Prometheus Users
Hi 

I upgraded to version 2.4.3 yesterday and added 9 more Windows exporters.

So now my evaluation setup is 10 wmi exporters, a 1m scrape interval, and 6h retention.

I can still see data for 12h (2x the retention time), and the wal folder has kept growing for the last 24h.

maint@Prometheus2:/srv/data$ du --max-depth=1 --human-readable .
4.2M    ./01CS0VT6G21BJ16YXAD5TS69X2
4.3M    ./01CS12NXRXB18K1TV8JHACS2DA
234M    ./wal
4.3M    ./01CS19HN2KS739B69WM9K2ZYMP
4.2M    ./01CS1GDD2KH9ET4T3BAJ02ECP6
5.2M    ./01CS1Q94B5MTMV5TH3GYQTV2TP
256M    .

Even if there is no clear bug in my setup or in Prometheus 2.4.x... any ideas for what to try next?
Do you believe that if I set up a longer retention and/or more targets, the WAL behavior will become normal?
Or is there a walkthrough to reduce the WAL size with a cron job or something?
Any suggestions are welcome!

Stratos

Ben Kochie

Oct 5, 2018, 10:22:18 AM
to s.krist...@dotbydot.gr, Prometheus Users
The WAL will be larger, as it's an uncompressed data stream.

Can you provide some more info:

* ls -l wal/
* ls -l 01*/meta.json

Also, it would be useful to see the contents of the meta.json files.

One thing you can do is lock the block duration:

./prometheus --storage.tsdb.min-block-duration=2h --storage.tsdb.max-block-duration=2h

For a 1-minute scrape interval, 2 hours is the optimal block duration.
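Once blocks are written with those flags, each one's meta.json can be used to confirm it really spans 2 hours; a sketch assuming jq is available (it is not part of Prometheus):

jq '(.maxTime - .minTime) / 3600000' 01*/meta.json    # prints each block's length in hours, expected: 2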

s.krist...@dotbydot.gr

Oct 8, 2018, 7:55:29 AM
to Prometheus Users
Hi everybody.

With Prometheus 2.4.3, wmi exporter 0.4.3, and this setup:
10 wmi targets, 6h retention, a 1m scrape interval, and the min/max block durations suggested by Ben, it seems that the WAL size grows up to 550 MB and then drops back down to 300 MB.
The cycle is about 25-30 hours... and I can still see data for 12h in the Graph.

One way or another I will test with more targets and longer retention to see how it goes... maybe the issue comes down to the minimum sizes inherent in such a small setup.
Thank you for your help... I will report any further observations.

Stratos


maint@Prometheus2:/srv/data$ ls -l wal/
total 286384
-rw-r--r-- 1 root root 134217728 Oct  5 03:29 00000000
-rw-r--r-- 1 root root 134217728 Oct  5 15:32 00000001
-rw-r--r-- 1 root root  24810985 Oct  5 17:55 00000002
maint@Prometheus2:/srv/data$ ls -l 01*/meta.json
-rw-r--r-- 1 root root 281 Oct  5 08:00 01CS19HN2KS739B69WM9K2ZYMP/meta.json
-rw-r--r-- 1 root root 281 Oct  5 10:00 01CS1GDD2KH9ET4T3BAJ02ECP6/meta.json
-rw-r--r-- 1 root root 281 Oct  5 12:00 01CS1Q94B5MTMV5TH3GYQTV2TP/meta.json
-rw-r--r-- 1 root root 281 Oct  5 14:00 01CS1Y4V3V1QFPDC263KW29G1X/meta.json
-rw-r--r-- 1 root root 281 Oct  5 16:00 01CS250KDW9NCR5FGM1DJPXHJ1/meta.json
maint@Prometheus2:/srv/data$ cat ./01CS19HN2KS739B69WM9K2ZYMP/meta.json
{
        "ulid": "01CS19HN2KS739B69WM9K2ZYMP",
        "minTime": 1538704800000,
        "maxTime": 1538712000000,
        "stats": {
                "numSamples": 1767413,
                "numSeries": 27470,
                "numChunks": 27470
        },
        "compaction": {
                "level": 1,
                "sources": [
                        "01CS19HN2KS739B69WM9K2ZYMP"
                ]
        },
        "version": 1
}
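As a rough sanity check on those stats (assuming the 10-target setup described above, and jq installed, which is not part of Prometheus):

jq '.stats.numSeries / 10' 01CS19HN2KS739B69WM9K2ZYMP/meta.json     # ~2747 series per wmi target
jq '.stats.numSamples / 7200' 01CS19HN2KS739B69WM9K2ZYMP/meta.json  # ~245 samples ingested per second over the 2h block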

s.krist...@dotbydot.gr

Oct 11, 2018, 6:42:27 AM
to Prometheus Users
prometheus2, ver 2.4

              Setup 1     Setup 2     Setup 3
Interval      60s         60s         60s
Retention     6h          6h          24h
Targets       1 wmi       10 wmi      10 wmi
Data (MB)     3.2-5       14          65
WAL (MB)      135++       200-550     280-550

Ben Kochie

Oct 11, 2018, 7:24:13 AM
to s.krist...@dotbydot.gr, Prometheus Users
The good news here is that everything seems to be working as intended. Retention in Prometheus is not designed to be a strict hard cut-off; it's a background garbage collection.

Some example servers I have:
* 1800 samples/sec, 56 targets, WAL 350MiB
* 2400 samples/sec, 38 targets, WAL 550MiB
* 56000 samples/sec, 1000 targets, WAL 10GiB

Basically, the WAL scales with the samples-per-second rate. For very small servers, there is a bit of a minimum due to the 128MiB size-based WAL segment rotation.
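To turn that into a capacity estimate, the usual back-of-the-envelope formula is retention_seconds * ingested_samples_per_second * bytes_per_sample, with compacted blocks averaging roughly 1-2 bytes per sample. A sketch with assumed numbers (1500 targets taken as 150x the ~245 samples/sec measured in the 10-target lab above, 30 days of retention, 2 bytes per sample):

echo '150 * 245 * 30 * 86400 * 2' | bc    # ~190 GB of compacted block data

Scaling the example servers above linearly, ~37k samples/sec would also put the WAL somewhere in the 6-7 GiB range, small compared to the block data.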
