Prometheus self scrape job taking too much memory

Yaron B

Aug 23, 2021, 5:23:33 AM
to Prometheus Users
Hi,

We are facing an issue with Prometheus server memory usage.
When starting the server, it uses around 30GB of RAM, even with no jobs configured other than the self-scrape job.
The attached image shows the heap usage for the prometheus job.
Is there a way to reduce this? When we add our Kubernetes scrape job we hit our node limit and get OOMKilled.

Please advise.

Screen Shot 2021-08-23 at 12.20.15.png

Yaron B

Aug 23, 2021, 6:23:47 AM
to Prometheus Users
I am attaching heap.svg, in case someone can help me figure out what is using the memory.
On Monday, August 23, 2021 at 12:23:33 UTC+3, Yaron B wrote:
heap.svg

Stuart Clark

Aug 23, 2021, 6:29:59 AM
to Yaron B, Prometheus Users

So at the moment it isn't scraping anything other than itself via the /metrics endpoint?

Is this a brand new service (i.e. no existing data stored on disk)?

Is there anything querying the server (e.g. Grafana dashboards, etc.)?

-- 
Stuart Clark

Yaron B

Aug 23, 2021, 6:35:18 AM
to Prometheus Users
At the moment we have added some scrape jobs, which bumped memory usage from around 30GB to 40GB, but we are not sure why the self-scrape alone takes so much RAM.
It's not a new deployment; we had noticed it was using a lot of memory, but it didn't crash on us, so we let it run.
Today it crashed, as you can see in the attached image: memory usage skyrocketed to 60GB. We then disabled jobs until the server stopped crashing, but it is still using more memory than it did over the last 15 days.

On Monday, August 23, 2021 at 13:29:59 UTC+3, Stuart Clark wrote:
Screen Shot 2021-08-23 at 13.33.21.png

Yaron B

Aug 23, 2021, 7:55:52 AM
to Prometheus Users
Can anyone tell from this image why the server is using so much memory?
production-prometheus-server-869bffc459-r92nh                     1186m        54937Mi
That's crazy!
On Monday, August 23, 2021 at 13:35:18 UTC+3, Yaron B wrote:
heap3.svg

Ben Kochie

Aug 23, 2021, 7:58:36 AM
to Yaron B, Prometheus Users
Prometheus needs memory to buffer incoming data before writing it to disk. The more you scrape, the more it needs.

You can see a summary of this information at prometheus:9090/tsdb-status


Yaron B

Aug 23, 2021, 8:18:12 AM
to Prometheus Users
That makes sense, but if I look at the numbers at the URL you gave me:
Number of Series: 2514033
Number of Chunks: 3098707
Number of Label Pairs: 1088507
and plug them into a memory calculator I found, it shows much less RAM than what I am using now.

Do you see any number here that should be a red flag for me? Something that is not right?
On Monday, August 23, 2021 at 14:58:36 UTC+3, sup...@gmail.com wrote:
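For a sanity check, the back-of-envelope arithmetic those calculators do can be sketched like this. The bytes-per-series figure is an assumption (community rules of thumb range from roughly 1 to 8 KiB per active series), and real usage also depends on churn, label sizes, and concurrent queries, so treat the result as an order-of-magnitude check only:

```python
# Rough TSDB head-memory estimate from the /tsdb-status figures above.
# The 8 KiB-per-series figure is an assumption, not an official number.
KIB = 1024

def estimate_head_bytes(num_series, bytes_per_series=8 * KIB):
    """Crude upper-end estimate of head memory for active series."""
    return num_series * bytes_per_series

series = 2_514_033  # "Number of Series" from /tsdb-status
est = estimate_head_bytes(series)
print(f"~{est / 2**30:.1f} GiB estimated for the TSDB head")
# prints: ~19.2 GiB estimated for the TSDB head
```

Even at the generous end of the rule of thumb this lands well under the ~55GiB observed, which is consistent with something beyond steady-state series tracking (e.g. churn or a cardinality spike) driving the usage.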

Ben Kochie

Aug 23, 2021, 9:25:21 AM
to Yaron B, Prometheus Users
That seems about right for that many series. Kubernetes usage includes a lot of label data/cardinality, which requires extra memory to track.

How big is your cluster in terms of total memory for all nodes?

Yaron B

Aug 23, 2021, 10:12:56 AM
to Prometheus Users
We have around 50 nodes with 64GB of RAM each.

By the way, we found that our backend had added a metric that spammed Prometheus until it crashed :)
They removed the metric and the server seems stable now.
It is still using around 30GB of RAM, but at least it is not crashing.

On Monday, August 23, 2021 at 16:25:21 UTC+3, sup...@gmail.com wrote:
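A runaway metric like that usually shows up as a cardinality spike in one metric name. A standard PromQL query along these lines, run in the expression browser, is commonly used to spot the worst offenders before they cause a crash:

```promql
# Top 10 metric names by number of active series
topk(10, count by (__name__) ({__name__=~".+"}))
```

The per-metric breakdown on the /tsdb-status page shows similar information without the cost of evaluating a match-everything selector.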

Ben Kochie

Aug 23, 2021, 3:27:57 PM
to Yaron B, Prometheus Users
50 nodes at 64Gi is 3200Gi of memory. Using 30Gi is 0.9% of the cluster. This is a little high, but not out of bounds for a normal deployment.

I would recommend starting to consider sharding by Kubernetes namespace. This is what we're working on to keep a single service's namespace from blowing up the cluster monitoring too badly.
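A minimal sketch of what namespace sharding can look like in prometheus.yml: each Prometheus shard keeps only the targets in the namespaces assigned to it via a `keep` relabel rule (the namespace names here are placeholders, not from this thread):

```yaml
scrape_configs:
  - job_name: kubernetes-pods
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      # Keep only targets in the namespaces assigned to this shard.
      # "team-a" and "team-b" are hypothetical namespace names.
      - source_labels: [__meta_kubernetes_namespace]
        regex: (team-a|team-b)
        action: keep
```

Each shard then holds only its own namespaces' series, so one noisy namespace can at worst take down its own shard rather than all cluster monitoring.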

Yaron B

Aug 24, 2021, 3:48:20 AM
to Prometheus Users
Thanks!

On Monday, August 23, 2021 at 22:27:57 UTC+3, sup...@gmail.com wrote: