Prometheus self scrape job taking too much memory

Yaron B

Aug 23, 2021, 5:23:33 AM
to Prometheus Users
Hi,

We are facing an issue with Prometheus server memory usage.
When starting the server, it uses around 30GB of RAM, even with no jobs configured other than the self-scrape job.
The attached image shows the heap usage for the prometheus job.
Is there a way to reduce this? When we add our Kubernetes scrape job we hit our node limit and get OOMKilled.

Please advise.

Screen Shot 2021-08-23 at 12.20.15.png

Yaron B

Aug 23, 2021, 6:23:47 AM
to Prometheus Users
I am attaching heap.svg, in case someone can help me figure out what is using the memory.
On Monday, August 23, 2021 at 12:23:33 UTC+3, Yaron B wrote:
heap.svg

Stuart Clark

Aug 23, 2021, 6:29:59 AM
to Yaron B, Prometheus Users

So at the moment it isn't scraping anything other than itself via the /metrics endpoint?

Is this a brand new service (i.e. no existing data stored on disk)?

Is there anything querying the server (e.g. Grafana dashboards, etc.)?

-- 
Stuart Clark

Yaron B

Aug 23, 2021, 6:35:18 AM
to Prometheus Users
At the moment we have added some scrape jobs, which bumped memory usage from around 30GB to 40GB, but we are not sure why the self-scrape alone takes so much RAM.
It's not a new deployment; we had noticed it was using a lot of memory, but it didn't crash on us, so we let it run.
Today it crashed, as you can see in the attached image: memory usage skyrocketed to 60GB. We then disabled jobs until the server stopped crashing, but it is still using more memory than it did over the last 15 days.

On Monday, August 23, 2021 at 13:29:59 UTC+3, Stuart Clark wrote:
Screen Shot 2021-08-23 at 13.33.21.png

Yaron B

Aug 23, 2021, 7:55:52 AM
to Prometheus Users
Can anyone tell from this image why the server is using so much memory?
production-prometheus-server-869bffc459-r92nh                     1186m        54937Mi
That's crazy!
On Monday, August 23, 2021 at 13:35:18 UTC+3, Yaron B wrote:
heap3.svg

Ben Kochie

Aug 23, 2021, 7:58:36 AM
to Yaron B, Prometheus Users
Prometheus needs memory to buffer incoming data before writing it to disk. The more you scrape, the more it needs.

You can see a summary of this information at prometheus:9090/tsdb-status


Yaron B

Aug 23, 2021, 8:18:12 AM
to Prometheus Users
That makes sense, but if I look at the numbers at the URL you gave me:
Number of Series: 2514033
Number of Chunks: 3098707
Number of Label Pairs: 1088507
and plug them into a memory calculator I found, it shows much less RAM than what I am using now.

Do you see any number here that should be a red flag for me? Something that is not right?
On Monday, August 23, 2021 at 14:58:36 UTC+3, sup...@gmail.com wrote:
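For a sanity check, the back-of-envelope arithmetic those calculators do can be sketched like this. The bytes-per-series figure is an assumption (community rules of thumb range from roughly 1 to 8 KiB per active series), and real usage also depends on churn, label sizes, and concurrent queries, so treat the result as an order-of-magnitude check only:

```python
# Rough TSDB head-memory estimate from the /tsdb-status figures above.
# The 8 KiB-per-series figure is an assumption, not an official number.
KIB = 1024

def estimate_head_bytes(num_series, bytes_per_series=8 * KIB):
    """Crude upper-end estimate of head memory for active series."""
    return num_series * bytes_per_series

series = 2_514_033  # "Number of Series" from /tsdb-status
est = estimate_head_bytes(series)
print(f"~{est / 2**30:.1f} GiB estimated for the TSDB head")
# prints: ~19.2 GiB estimated for the TSDB head
```

Even at the generous end of the rule of thumb this lands well under the ~55GiB observed, which is consistent with something beyond steady-state series tracking (e.g. churn or a cardinality spike) driving the usage.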

Ben Kochie

Aug 23, 2021, 9:25:21 AM
to Yaron B, Prometheus Users
That seems about right for that many series. Kubernetes usage includes a lot of label data/cardinality, which requires extra memory to track.

How big is your cluster in terms of total memory for all nodes?

Yaron B

Aug 23, 2021, 10:12:56 AM
to Prometheus Users
We have around 50 nodes with 64GB of RAM each.

By the way, we found that our backend had added a metric that spammed Prometheus until it crashed :)
They removed the metric and the server seems stable now.
It is still using around 30GB of RAM, but at least it is not crashing.

On Monday, August 23, 2021 at 16:25:21 UTC+3, sup...@gmail.com wrote:
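A runaway metric like that usually shows up as a cardinality spike in one metric name. A standard PromQL query along these lines, run in the expression browser, is commonly used to spot the worst offenders before they cause a crash:

```promql
# Top 10 metric names by number of active series
topk(10, count by (__name__) ({__name__=~".+"}))
```

The per-metric breakdown on the /tsdb-status page shows similar information without the cost of evaluating a match-everything selector.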

Ben Kochie

Aug 23, 2021, 3:27:57 PM
to Yaron B, Prometheus Users
50 nodes at 64Gi is 3200Gi of memory. Using 30Gi is 0.9% of the cluster. This is a little high, but not out of bounds for a normal deployment.

I would recommend starting to consider sharding by Kubernetes namespace. This is what we're working on to keep a single service's namespace from blowing up the cluster monitoring too badly.
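A minimal sketch of what namespace sharding can look like in prometheus.yml: each Prometheus shard keeps only the targets in the namespaces assigned to it via a `keep` relabel rule (the namespace names here are placeholders, not from this thread):

```yaml
scrape_configs:
  - job_name: kubernetes-pods
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      # Keep only targets in the namespaces assigned to this shard.
      # "team-a" and "team-b" are hypothetical namespace names.
      - source_labels: [__meta_kubernetes_namespace]
        regex: (team-a|team-b)
        action: keep
```

Each shard then holds only its own namespaces' series, so one noisy namespace can at worst take down its own shard rather than all cluster monitoring.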

Yaron B

Aug 24, 2021, 3:48:20 AM
to Prometheus Users
Thanks!

On Monday, August 23, 2021 at 22:27:57 UTC+3, sup...@gmail.com wrote: