Memory and CPU usage of prometheus


wangchao...@gmail.com

Aug 13, 2018, 9:32:59 AM
to Prometheus Users
I found today that Prometheus consumes a lot of memory (avg 1.75 GB) and CPU (avg 24.28%). There are two Prometheus instances: one is the local Prometheus, the other is the remote Prometheus. The local Prometheus scrapes metrics from different endpoints inside a Kubernetes cluster, while the remote Prometheus gets metrics from the local Prometheus periodically (scrape_interval is 20 seconds).

It's the local Prometheus that is consuming lots of CPU and memory. The retention configured for the local Prometheus is 10 minutes:
--storage.tsdb.retention=10m

I am thinking about how to decrease the memory and CPU usage of the local Prometheus. Since the remote Prometheus gets metrics from the local Prometheus once every 20 seconds, could we configure a smaller retention value (e.g. 2 minutes) for the local Prometheus so as to reduce the size of the in-memory cache? Does that make sense?

Does anyone have any ideas on how to reduce the CPU usage?

Thanks!



Simon Pasquier

Aug 13, 2018, 10:26:05 AM
to wangchao...@gmail.com, Prometheus Users
Decreasing the retention period to less than 6 hours isn't recommended. Also, there's no support right now for a "storage-less" mode (I think there's an issue somewhere, but it isn't a high priority for the project).
Also, memory usage depends on the number of scraped targets/metrics, so without knowing the numbers it's hard to tell whether the usage you're seeing is expected or not.
When you say "the remote prometheus gets metrics from the local prometheus periodically", do you mean that you federate all metrics?
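A quick way to get those numbers is to query the local Prometheus about itself; for example (illustrative queries, not the only ones):

    prometheus_tsdb_head_series                              # series currently held in memory
    rate(prometheus_tsdb_head_samples_appended_total[5m])    # samples ingested per second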



wangchao...@gmail.com

Aug 13, 2018, 11:08:13 AM
to Prometheus Users
Thanks for the response.

Yes, the remote/central Prometheus federates all metrics.
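For reference, the federation job on the central Prometheus looks roughly like this (job name and target address are placeholders, not the exact config):

    scrape_configs:
      - job_name: 'federate'
        scrape_interval: 20s
        honor_labels: true
        metrics_path: '/federate'
        params:
          'match[]':
            - '{__name__=~".+"}'
        static_configs:
          - targets:
            - 'local-prometheus:9090'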

Since the central Prometheus has a longer retention (30 days), can we reduce the retention of the local Prometheus so as to reduce its memory usage?

Actually I deployed the following third-party services in my Kubernetes cluster, and there are 10+ custom metrics as well. I am not sure how much memory I should configure for the local Prometheus (see the query sketch after the list):
1. cadvisor
2. kube-state-metrics
3. node-exporter
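For what it's worth, a per-job series count gives a rough idea of where the memory goes; the query below is just a sketch (job label values depend on the scrape config):

    count by (job) ({__name__=~".+"})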

Ben Kochie

Aug 13, 2018, 11:19:01 AM
to 汪超, Prometheus Users
Federation is not meant to be an all-metrics replication method to a central Prometheus. If you are looking to "forward only", you will want to look into using something like Cortex or Thanos.
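If you go that route, the forwarding is done with remote_write on the local Prometheus, roughly like this (the URL is a placeholder for whatever receiver you run):

    remote_write:
      - url: 'http://remote-receiver.example.com/api/prom/push'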

The retention time on the local Prometheus server doesn't have a direct impact on memory use. Memory and CPU use on an individual Prometheus server depend on ingestion and queries.

AFAIK, federating all metrics is probably going to make memory use worse.

wangchao...@gmail.com

Aug 13, 2018, 8:58:56 PM
to Prometheus Users
Thanks for the response. 

So it seems that the only way to reduce the memory and CPU usage of the local Prometheus is to scrape less frequently, i.e. increase the scrape_interval of both the local Prometheus and the central Prometheus?

Currently the scrape_interval of the local Prometheus is 15 seconds, while that of the central Prometheus is 20 seconds. What's the best practice for configuring the two values?
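For context, the intervals are just the scrape_interval settings in the two configs; a minimal sketch of the local side (the central side sets 20s either globally or on its federation job):

    global:
      scrape_interval: 15s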

Since Grafana is integrated with the central Prometheus, we have to make sure the central Prometheus has all the metrics available.


Ben Kochie

Aug 14, 2018, 5:38:55 AM
to 汪超, Prometheus Users
No. In order to reduce memory use, stop the central Prometheus from scraping all metrics. Federation is not meant to pull all metrics.
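If you keep federation at all, restrict the match[] selector to a small set of aggregated series (e.g. series produced by recording rules) instead of everything; in the federation job that would look something like this (the metric name pattern is a placeholder):

    params:
      'match[]':
        - '{__name__=~"job:.*"}'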

It is better to have Grafana talk directly to the local Prometheus.
