Prometheus 2.0 start time with moderate load


Dan Simone

Nov 21, 2017, 9:27:24 PM
to Prometheus Users
Hi,

I'm trying to get a sense of what is "normal" startup behavior for Prometheus 2.0. I've set up an experiment with a large number of metrics on pretty beefy machines, and Prometheus takes 15-30 minutes, depending on the machine specs, to fully restart with the existing data set (that is, until "TSDB started" appears in the logs and queries are accepted). Here are some details:

Metrics generated:
- 5M - 10M distinct metrics getting scraped every 15 seconds
- Running for 72 hours.
- Data generated in local storage: ~100GB

Machine 1:
- 28GB memory, 4 CPUs
- 33 minutes to restart Prometheus
- Top output:
PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
27303 root      20   0 71.044g 0.026t   4292 D  42.5 96.9   2811:58 prometheus

Machine 2:
- 60GB memory, 4 CPUs
- 17 minutes to restart Prometheus using an exact clone of the above data set
- NVMe disk for storage.tsdb.path
- Top output:
PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
20714 root      20   0  0.100t 0.046t 0.011t S 100.0 79.5 211:37.08 prometheus

I've done similar experiments with Prometheus 1.8, with 1.5M metrics generated over the course of several days, and never saw the startup time get higher than a couple of minutes. I realize that Prometheus 2.0 loads more metrics in memory in order to be less disk-intensive, which probably accounts for this. But have others been seeing similar results, and is this expected? I am trying to gauge whether to move to Prometheus 2.0, and whether it will be acceptable to tolerate that kind of downtime whenever I need to restart Prometheus.

Thanks in advance for any help on this,

Dan

aaro...@gmail.com

Nov 27, 2017, 3:52:59 PM
to Prometheus Users
I bumped into a similar issue while trying Prometheus 2.0. In general, I like the 2.0 features, but I hesitate to upgrade my current system because the longer restart time scares me off. Is the long start time expected behavior?

Dan Simone

Nov 28, 2017, 1:32:59 PM
to Prometheus Users
Some additional data here, from the non-NVMe machine:

Starting from a Prometheus data directory with no /wal, but only blocks:
* 3GB, 1 block - 8 seconds to load
* 10.7GB, 2 blocks - 55 seconds to load
* 18.5GB, 3 blocks - 109 seconds to load
* 41.5GB, 4 blocks - 174 seconds to load
* 94.5GB, 7 blocks - 220 seconds to load

When a 90GB /wal directory is present, the last experiment above takes 1260 seconds instead of 220. So the bulk of the startup time appears to be spent dealing with /wal.
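
As a rough back-of-the-envelope check on the numbers above, here is a small Python sketch. It only re-does the arithmetic on the reported measurements; it assumes the extra time in the run with /wal is attributable entirely to the WAL, which these experiments suggest but do not strictly prove. The variable and function names are made up for illustration.

```python
# Block-only load times reported above: (data size in GB, block count, seconds).
measurements = [
    (3.0, 1, 8),
    (10.7, 2, 55),
    (18.5, 3, 109),
    (41.5, 4, 174),
    (94.5, 7, 220),
]


def load_rate(size_gb, seconds):
    """Effective load rate in GB per second for one measurement."""
    return size_gb / seconds


for size_gb, blocks, seconds in measurements:
    rate = load_rate(size_gb, seconds)
    print(f"{size_gb:5.1f} GB, {blocks} block(s): {seconds:4d} s ({rate:.3f} GB/s)")

# Last experiment with and without the 90GB /wal directory.
WAL_SIZE_GB = 90
TOTAL_WITH_WAL_S = 1260
BLOCKS_ONLY_S = 220

# Assumption: all of the additional time is spent on the WAL.
wal_seconds = TOTAL_WITH_WAL_S - BLOCKS_ONLY_S
wal_share = wal_seconds / TOTAL_WITH_WAL_S
wal_rate = WAL_SIZE_GB / wal_seconds

print(f"Time attributed to /wal: {wal_seconds} s ({wal_share:.1%} of startup)")
print(f"Effective /wal rate: {wal_rate:.3f} GB/s")
```

Under that assumption, the WAL accounts for 1040 of the 1260 seconds, and the 90GB of WAL is processed several times more slowly per GB than the 94.5GB of blocks.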

isha girdhar

Nov 29, 2017, 3:59:06 AM
to Prometheus Users
We are facing a similar thing. We have already moved to Prometheus 2.0, but the restart time is definitely longer than with 1.8. Not sure if that's expected behaviour.