Prometheus not able to scale vertically due to lock contention


Dhruv Patel

Mar 4, 2021, 5:10:00 PM3/4/21
to Prometheus Users
Hi Folks,
  We are seeing an issue in our current Prometheus setup where we are not able to ingest beyond 22 million metrics/min. We have run several load tests at 25 million, 29 million and 35 million, but the ingestion rate remains constant at around the same 22 million metrics/min. Moreover, we are also seeing that our CPU usage is around 70% and more than 50% of memory is still available. Looking at this, it feels like we are not hitting resource limits but rather some kind of lock contention.
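
For context on how the metrics/min numbers can be measured, one option is Prometheus's own samples-appended counter. This is just a sketch; the localhost:9090 address is a placeholder for our server, and our exact dashboards/queries are not shown here:

  # samples ingested per minute, averaged over the last 5 minutes (PromQL via the HTTP API)
  curl -s 'http://localhost:9090/api/v1/query' \
    --data-urlencode 'query=rate(prometheus_tsdb_head_samples_appended_total[5m]) * 60'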

Prometheus Version: 2.9.1
Host Shape: x7-enclave-104 (a bare-metal host with 104 processor units). More info is in the attached screenshots.
Memory Info: 
                       total        used        free         shared  buff/cache   available
Mem:           754G         88G        528G         67M        136G        719G
Swap:          1.0G           0B           1.0G
Total:           755G          88G        529G

We also ran profiling during our load tests at 20 million, 22 million and 25 million and saw an increase in the time spent in runtime.mallocgc, which in turn leads to increased time in runtime.futex. Somehow we are not able to figure out what could be causing the lock contention. I have attached our profiling results at the different load test levels in case that is useful. Any ideas on what could be causing the high time spent in runtime.mallocgc?
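
In case it helps reproduce: profiles like the 50-second ones attached below can be captured from Prometheus's built-in pprof endpoint. A sketch (localhost:9090 is a placeholder for our host):

  # capture a 50s CPU profile from the running Prometheus and open the interactive pprof viewer
  go tool pprof -seconds 50 http://localhost:9090/debug/pprof/profile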


22_million_cpu_profile_50secs.png
Memory Info.png
25_million_cpu_profile_50secs.png
20_million_cpu_profile_50secs.png
CPU Info.png

Julien Pivotto

Mar 4, 2021, 5:13:45 PM3/4/21
to Dhruv Patel, Prometheus Users
On 04 Mar 14:09, Dhruv Patel wrote:
> Hi Folks,
> We are seeing an issue in our current Prometheus setup where we are not
> able to ingest beyond 22 million metrics/min. We have run several load tests
> at 25 million, 29 million and 35 million, but the ingestion rate remains
> constant at around the same 22 million metrics/min. Moreover, we are also
> seeing that our CPU usage is around 70% and more than 50% of memory is
> still available. Looking at this, it feels like we are not hitting resource
> limits but rather some kind of lock contention.
>
> *Prometheus Version:* 2.9.1

Your Prometheus version is pretty old (2019). Could you run your benchmarks
again with a recent release?

Thanks

--
Julien Pivotto
@roidelapluie

Ben Kochie

Mar 5, 2021, 3:10:49 AM3/5/21
to Nguyen Phu Quy, Prometheus Users
Please do not post new issues to other people's threads.

On Fri, Mar 5, 2021 at 8:47 AM Nguyen Phu Quy <nguyenph...@gmail.com> wrote:
Sorry, I have a problem after installing Prometheus. I get the following warning:

Warning: Error fetching server time: Detected 18681.458999872208 seconds time difference between your browser and the server. Prometheus relies on accurate time and time drift might cause unexpected query results.

Can anyone explain this and help me?

Thank you very much.
On Friday, March 5, 2021 at 05:13:45 UTC+7, Julien Pivotto wrote:

Aliaksandr Valialkin

Mar 9, 2021, 2:22:39 PM3/9/21
to Dhruv Patel, Prometheus Users
Prometheus is written in Go. The runtime.mallocgc function is called every time Prometheus allocates a new object, and it looks like Prometheus 2.9.1 allocates a lot during the load test. runtime.futex is used internally by the Go runtime during object allocation and the subsequent deallocation (i.e. garbage collection). It looks like the Go runtime that Prometheus 2.9.1 was built with isn't well optimized for programs with frequent object allocations running on systems with many CPU cores.

This should be improved in Go 1.15, whose release notes say "Allocation of small objects now performs much better at high core counts, and has lower worst-case latency". So it is recommended to repeat the load test on the latest available Prometheus release, which should be built with at least Go 1.15 - see https://github.com/prometheus/prometheus/releases .
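
A quick way to confirm which Go version a given Prometheus binary was built with (a sketch; the binary path and server address are placeholders):

  # the build info printed by the binary includes the Go version
  ./prometheus --version

  # the same information is exposed as a metric with a goversion label
  curl -s 'http://localhost:9090/api/v1/query' --data-urlencode 'query=prometheus_build_info'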

Additionally, you can run the load test on VictoriaMetrics and compare its scalability with Prometheus. See https://victoriametrics.github.io/#how-to-scrape-prometheus-exporters-such-as-node-exporter .
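
If you try that, single-node VictoriaMetrics can reuse a Prometheus-compatible scrape config. A minimal sketch, assuming the config file is named prometheus.yml and the release binary is victoria-metrics-prod:

  # run single-node VictoriaMetrics and let it scrape targets from the Prometheus-style config
  ./victoria-metrics-prod -promscrape.config=prometheus.yml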
 




--
Best Regards,

Aliaksandr Valialkin, CTO VictoriaMetrics

Dhruv Patel

Mar 14, 2021, 9:52:15 PM3/14/21
to Prometheus Users
Thanks for the help, Aliaksandr and Julien. I upgraded to the latest Prometheus (2.25.0, built with Go 1.15.8) and am seeing a huge performance improvement.