Optimal solution for storing 3 years of data from 300 hosts in a Prometheus server


Puneet Singh

Feb 20, 2024, 12:24:04 PM
to Prometheus Users
Hi All,
I am planning to store 3 years of data from 300 servers in a single Prometheus server. The data will primarily consist of default exporter metrics, and the server has 500 GB of memory and 80 cores.

I'd like to ensure that my solution is optimal in terms of resource utilization, query performance, and scalability. Is there a general recommendation for the amount of resources I should have for this setup?

Regards,
Puneet



Puneet Singh

Feb 20, 2024, 12:25:08 PM
to Prometheus Users
By resources I meant the number of servers, RAM per server, and cores per server.

Regards,
Puneet

Ben Kochie

Feb 20, 2024, 12:47:19 PM
to Puneet Singh, Prometheus Users
Prometheus needs a minimum of about 4KiB per "active series". Retention policy doesn't affect the memory usage very much.

If you have 10,000 metrics per server and 300 servers, that's 3 million series.

3 million * 4KiB = 11.4GiB of memory.

Of course, you will also need some page cache and such. Usually 2x is more than good enough, but it depends on your query load. So 20-30GiB of memory should be enough.

But of course, it highly depends on how many metrics per server you have.
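If you want to check real numbers rather than guess, a couple of queries against Prometheus's own metrics can help. This is only a sketch: the job="prometheus" label and the example node exporter instance below are assumptions to adjust for your setup.

  # Total active series currently in the TSDB head:
  prometheus_tsdb_head_series

  # Series contributed by a single node exporter target (example labels):
  count({job="node", instance="host1:9100"})

  # Rough resident memory per active series on an existing server,
  # assuming one Prometheus that scrapes itself with job="prometheus":
  process_resident_memory_bytes{job="prometheus"}
    / scalar(prometheus_tsdb_head_series{job="prometheus"})

Multiplying the per-target series count by 300 targets gives a more grounded estimate than assuming a flat 10,000 metrics per server.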


Puneet Singh

Feb 20, 2024, 2:09:24 PM
to Prometheus Users
Hi Ben,
Thank you for the response. So for scraping and storing/writing the data in this setup, Prometheus would need ~30 GiB of memory.

I have a question about two data query scenarios.
Scenario 1: We run a query to get 3 years of load via Grafana, such as:
node_load15{instance="$node",job="$job"}[$__rate_interval]

Scenario 2: We query 3 years' worth of system load for a single server from Grafana, which uses the following query:
avg_over_time(node_load15{instance="$node",job="$job"}[$__rate_interval]) * 100 / on(instance) group_left sum by (instance)(irate(node_cpu_seconds_total{instance="$node",job="$job"}[$__rate_interval]))

This involves 2 series, plus 3 functions (avg_over_time, sum by, irate) and a division operation.

Is there a way to get a rough estimate of the amount of CPU and RAM required for the queries in scenario 1 and scenario 2?

Regards,
Puneet

Chris Siebenmann

Feb 20, 2024, 2:46:34 PM
to Puneet Singh, Prometheus Users, Chris Siebenmann
> I am planning to store 3 years of data from 300 servers in a single
> Prometheus server. The data will primarily consist of default exporter
> metrics and the server has 500G memory and 80 cores.

We currently scrape metrics from 908 different sources (from
'count(up)'), 153 of which are the Prometheus (Unix/Linux) host agent on
servers here (the rest are a combination of additional agents and
Blackbox checks). We're currently running at a typical ingestion rate of
73,000 samples a second (some of those additional agents generate a lot
of sample points due to copious histograms) and have around 1.4 million
active series (taken from 'prometheus_tsdb_head_series'). Our current
retention goes back to November of 2018, when we took our Prometheus
setup into production.

We're doing all of this on a 1U server with a six-core Xeon E-2226G CPU,
64 GB of RAM, and a mirrored pair of 20 TB HDDs. The server is not
particularly busy; it runs about 4% CPU utilization and under
1 Mbyte/sec of both network traffic and disk writes. Querying (in
Prometheus) for the three-year average node_load15 across all of the
servers briefly took the system to 12% CPU usage and almost 80% disk
utilization to read the data (with apparently negligible additional
memory usage); this will vary with the query.
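For reference, that query would be something along the lines of the following (a sketch, since the exact expression isn't given here):

  avg_over_time(node_load15[3y])

which reads three years of samples for every host's node_load15 series in a single query.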

If you want to make very long historical queries, you will need to
increase various internal safety limits in Prometheus (and possibly also
query time limits in Grafana), but the server you're describing should
be more than able to handle this.
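As a concrete illustration, the limits in question are Prometheus startup flags such as the ones below; the specific values are assumptions to tune, not recommendations from this thread:

  prometheus \
    --storage.tsdb.retention.time=3y \
    --query.timeout=5m \
    --query.max-samples=100000000

--query.timeout defaults to 2m and --query.max-samples to 50 million, both of which a three-year range query can easily hit. On the Grafana side, the Prometheus data source also has its own query timeout setting that may need raising.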

- cks

Ben Kochie

Feb 21, 2024, 1:48:30 AM
to Puneet Singh, Prometheus Users
Again, it depends; I don't know what you want to collect and store. It could be enough, or it could not be. Only you can do this capacity planning, with knowledge of your internal requirements.

On Tue, Feb 20, 2024 at 8:09 PM Puneet Singh <singh.p...@gmail.com> wrote:
> Is there a way to get a rough estimate of the amount of CPU and RAM
> required for the queries in scenario 1 and scenario 2?
Again, it depends on exactly how many series and samples you need to load. Figure maybe 20 MiB of memory per series with 15s scrape intervals.

That query is a bit nonsensical. Load average is not really a useful metric to look at, and I don't understand why you're dividing it by CPU seconds.

For things that you want to graph over long periods of time, you can use recording rules to generate pre-computed data that is easier to query.

 


Puneet Singh

Feb 22, 2024, 2:44:18 PM
to Prometheus Users
Thank you Chris,
That information about your setup's resource usage was very helpful.

Hi Ben,
The query I had shared comes from the default exporter dashboard in Grafana.
It is supposed to give a measure of the % load on a server.
Here, data will be read for 2 series, and along with that there are the irate and avg_over_time functions and a division operator.
The scrape interval is 20s. So if I query data for the last 3 years, can I assume that this might take ~40 MiB of memory per series, and that these operators will not add any significant memory consumption?



Regards
Puneet

Ben Kochie

Feb 22, 2024, 3:01:44 PM
to Puneet Singh, Prometheus Users
Load average is not really "load" in the way you're thinking, so that query is not going to work the way you expect.

You probably want CPU utilization.

Something like: avg without (cpu,mode) (1-rate(node_cpu_seconds_total{...}[$__rate_interval]))

Again, it's recommended to create recording rules for these kinds of queries.
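A minimal recording-rule sketch for that kind of expression (the group name, rule name, and 5m window are illustrative choices, not something prescribed in this thread):

  groups:
    - name: node_cpu
      rules:
        - record: instance:node_cpu_utilisation:rate5m
          expr: |
            avg without (cpu, mode) (1 - rate(node_cpu_seconds_total{mode="idle"}[5m]))

Dashboards can then graph instance:node_cpu_utilisation:rate5m directly, which is far cheaper over a three-year range than recomputing rate() on the raw node_cpu_seconds_total samples.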
