Prometheus crashes when the number of unique metrics increases


Rupesh Tripathi

Dec 5, 2019, 2:34:05 AM
to Prometheus Users
Hello Folks,

I performed some load/stress tests on Prometheus; please find the details and outcomes below. I observed that the Prometheus docker container abruptly disappeared/crashed in a few instances. Can someone please help explain what the limitations of Prometheus are in terms of the number of unique metrics with high-cardinality data?

Steps:

  1. Start Avalanche to produce unique metrics:
    1. docker run -d --net=host quay.io/freshtracks.io/avalanche --metric-count=1000 --series-count=1000 --port=9001
    2. This creates 1000 unique metric names, each with 1000 unique label values, i.e. 1000 * 1000 = 1,000,000 unique series overall.
    3. Avalanche listens on port 9001 by default, but this can be changed via the --port flag.
  2. Start a Prometheus docker container pointing at :9001/metrics to scrape Avalanche (a minimal example config is sketched below):
    1. docker run -d --net=host --rm -v $(pwd)/prometheus0_eu2.yml:/etc/prometheus/prometheus.yml -v $(pwd)/prometheus0_eu2_data:/prometheus -u root --name prometheus-0-eu2 prom/prometheus --config.file=/etc/prometheus/prometheus.yml --storage.tsdb.path=/prometheus --web.listen-address=:9092 --web.enable-lifecycle --web.enable-admin-api
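
The contents of prometheus0_eu2.yml were not shared here; a minimal sketch that would scrape Avalanche could look like this (the scrape interval and job name are assumptions):

  global:
    scrape_interval: 15s

  scrape_configs:
    - job_name: 'avalanche'
      static_configs:
        - targets: ['localhost:9001']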

Number of unique metrics | Duration | CPU usage (avg) | Memory usage | Issues/outcome
1000 * 1000 = 1,000,000 (1000 unique metric names, each with 1000 unique label values) | 10-15 minutes | 15-20% | 95-99% | The Prometheus container crashes and stops abruptly after 15-30 minutes, most likely due to running out of memory.
100 * 1000 = 100,000 (100 unique metric names, each with 1000 unique label values) | 1-1.5 hours | 15-20% | 90-99% (starts increasing slowly and grows past 90% after an hour) | The Prometheus container stops abruptly after running for 3-4 hours.
100 * 100 = 10,000 (100 unique metric names, each with 100 unique label values) | 2 days | 5-7% | 25-28% | The Prometheus service continues to run without any issues.



Stuart Clark

Dec 5, 2019, 2:57:40 AM
to Rupesh Tripathi, Prometheus Users
Limits will very much depend on the resources available to Prometheus, as well as what other non-scrape-related work is happening (e.g. queries and rules).

In general high cardinality is not recommended, so try to adjust the metric sources to remove or reduce it (e.g. don't have a metric which exposes user identifiers or other unique items in labels).
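
If a metric source cannot be changed, scrape-time relabelling can drop the offending metrics before they are stored. A sketch (the job, target, and metric name myapp_user_sessions_total are made-up examples):

  scrape_configs:
    - job_name: 'myapp'
      static_configs:
        - targets: ['localhost:8080']
      metric_relabel_configs:
        # Drop a metric that carries per-user labels before ingestion
        - source_labels: [__name__]
          regex: 'myapp_user_sessions_total'
          action: drop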

Prometheus and the metrics being stored should be carefully considered alongside whatever event store solutions you might use (eg. Elasticsearch, Loki or Splunk) where you can find the details once informed by the metrics.

Equally you may want to look at the number and location of the servers themselves. It is generally recommended for Prometheus servers to live within the same failure domain (e.g. data centre) as the services being monitored, as well as being sharded where sensible.

While very rough memory and disk usage can be calculated from the number of time series being scraped and the scrape frequency, you really need to test real-world usage to take all the factors into account.
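
As a rough illustration (commonly cited ballpark figures, not measurements from this thread): 1,000,000 active series scraped every 15s is about 67,000 samples/s; at roughly 1-2 bytes per sample on disk that works out to something like 6-12 GB/day, and the in-memory head typically needs a few kilobytes per active series, i.e. several GB for 1M series, before churn multiplies the active series count.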


Rupesh Tripathi

Dec 5, 2019, 3:45:27 AM
to Prometheus Users
I completely agree that Prometheus is not the right tool for data with high cardinality; I was just curious to understand what the reason for the high memory usage and crashes could be. Shouldn't it write more to disk and keep only a limited set of data in memory? Are there any configurations for how often it should write to disk, and can we edit those configurations?

Thanks,
Rupesh

Aliaksandr Valialkin

Dec 5, 2019, 2:09:55 PM
to Rupesh Tripathi, Prometheus Users
Note that `avalanche` introduces a high churn rate for time series, i.e. old time series are constantly substituted by new ones every `--series-interval` seconds. The default value for `--series-interval` is 60 seconds, i.e. new time series are created every 60 seconds. So for the `--metric-count=1000 --series-count=1000` case, `avalanche` introduces 1M new time series every minute; in 30 minutes Prometheus scrapes 30M distinct time series from `avalanche`. See also the `--metric-interval` command-line flag, which has almost the same meaning as `--series-interval`.
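
One way to watch this churn directly is to query Prometheus' own TSDB metrics (a sketch, assuming the :9092 listen address from the setup above):

  # Number of active series currently held in the TSDB head
  curl -sg 'http://localhost:9092/api/v1/query?query=prometheus_tsdb_head_series'

  # Per-second rate of new series creation, i.e. the churn rate
  curl -sg 'http://localhost:9092/api/v1/query?query=rate(prometheus_tsdb_head_series_created_total[5m])'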

BTW, how much RAM is available for your Prometheus setup?



--
Best Regards,

Aliaksandr


Rajesh Reddy Nachireddi

Jan 28, 2020, 11:18:08 AM
to Aliaksandr Valialkin, Rupesh Tripathi, Prometheus Users
Hi Aliaksandr,

We have seen the same issue; we are using 128 GB of RAM. Could you point us to documentation which talks about --metric-interval?

We are also hitting other issues with label and metric cardinality: we have labels with 8k cardinality each and metrics with 1.4 million series each, and Prometheus keeps crashing. Could you provide the label and metric cardinality numbers at which Prometheus gets killed with OOM issues?

Is there a way we can avoid this cardinality crashing the Prometheus instance?

Thanks,
Rajesh

Aliaksandr Valialkin

Jan 28, 2020, 6:02:47 PM
to Rajesh Reddy Nachireddi, Rupesh Tripathi, Prometheus Users
Hi Rajesh,

On Tue, Jan 28, 2020 at 6:18 PM Rajesh Reddy Nachireddi <rajesh...@gmail.com> wrote:
Hi Aliaksandr,

we have seen the same issue - we are using 128 GB of RAM. Could you point us to documentation which talks about --metric-interval?

Just pass `--help` to avalanche in order to see docs for all the command-line flags including `--metric-interval`:

./avalanche --help
usage: avalanche [<flags>]

avalanche - metrics test server

Flags:
  --help                         Show context-sensitive help (also try --help-long and --help-man).
  --metric-count=500             Number of metrics to serve.
  --label-count=10               Number of labels per-metric.
  --series-count=10              Number of series per-metric.
  --metricname-length=5          Modify length of metric names.
  --labelname-length=5           Modify length of label names.
  --value-interval=30            Change series values every {interval} seconds.
  --series-interval=60           Change series_id label values every {interval} seconds.
  --metric-interval=120          Change __name__ label values every {interval} seconds.
  --port=9001                    Port to serve at
  --remote-url=REMOTE-URL        URL to send samples via remote_write API.
  --remote-pprof-urls=REMOTE-PPROF-URLS ...  
                                 a list of urls to download pprofs during the remote write: --remote-pprof-urls=http://127.0.0.1:10902/debug/pprof/heap
                                 --remote-pprof-urls=http://127.0.0.1:10902/debug/pprof/profile
  --remote-pprof-interval=REMOTE-PPROF-INTERVAL  
                                 how often to download pprof profiles.When not provided it will download a profile once before the end of the test.
  --remote-batch-size=2000       how many samples to send with each remote_write API request.
  --remote-requests-count=100    how many requests to send in total to the remote_write API.
  --remote-write-interval=100ms  delay between each remote write request.
  --version                      Show application version.



--
Best Regards,

Aliaksandr

Rajesh Reddy Nachireddi

Jan 31, 2020, 12:32:58 AM
to Aliaksandr Valialkin, Rupesh Tripathi, Prometheus Users
Thanks Aliaksandr.

Could you please provide any benchmarking results on how much label cardinality per label, total number of unique labels and their size, and metric cardinality can be supported by 128 GB?

Thanks,
Rajesh

Aliaksandr Valialkin

Jan 31, 2020, 8:11:27 AM
to Rajesh Reddy Nachireddi, Rupesh Tripathi, Prometheus Users

Paul Dubuc

Jun 22, 2020, 2:12:29 PM
to Prometheus Users
I have a question about this for some clarification. How does '--metric-interval' affect the cardinality of the metric data? Does it have the same multiplier effect that '--series-interval' does, in that every 120 seconds (by default) a whole new set of unique metrics is generated on top of those generated at the '--series-interval'?

Also, do you know if setting these intervals to 0 would keep the changes from taking effect?

Thanks.

Aliaksandr Valialkin

Jun 22, 2020, 4:47:16 PM
to Paul Dubuc, Prometheus Users
On Mon, Jun 22, 2020 at 9:12 PM Paul Dubuc <goo...@paul.dubuc.org> wrote:
I have a question about this for some clarification. How does '--metric-interval' affect the cardinality of the metric data? Does it have the same multiplier effect that '--series-interval' does, in that every 120 seconds (by default) a whole new set of unique metrics is generated on top of those generated at the '--series-interval'?

Yes.
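
(Illustrative arithmetic based on the numbers earlier in the thread: with the defaults --series-interval=60 and --metric-interval=120, a --metric-count=1000 --series-count=1000 run creates roughly 1,000,000 new series every minute from series_id churn alone, and additionally replaces every metric name every 2 minutes, so a 30-minute test exposes Prometheus to on the order of 30 million distinct series.)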
 

Also, do you know if setting these intervals to 0 would keep the changes from taking effect?

No. Just set `--metric-interval` to a value exceeding your test duration if you need to suppress the churn rate. Something like the following should work: --metric-interval=100000
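
For example, reusing the invocation from earlier in the thread (a sketch; the interval values just need to exceed the test duration in seconds):

  docker run -d --net=host quay.io/freshtracks.io/avalanche \
    --metric-count=1000 --series-count=1000 \
    --series-interval=100000 --metric-interval=100000 \
    --port=9001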
 


--
Best Regards,

Aliaksandr Valialkin, CTO VictoriaMetrics