Prometheus query_range is slow


亢哲

Apr 1, 2022, 6:01:54 AM
to Prometheus Users
I call Prometheus's query_range API between the 15th and 25th second of every minute, and the number of requests is about 200,000 to 300,000 per minute.
The API sometimes returns data very slowly because CPU usage reaches 100% on a machine with 4 cores and 8 GB of RAM.


Here are some typical query logs:
{"error":"query was canceled in expression evaluation","httpRequest":{"clientIP":"172.17.0.42","method":"GET","path":"/api/v1/query_range"},"params":{"end":"2022-03-30T09:35:59.000Z","query":"jvm_threads_current{appName=\"sunfire-selfmonitor-reduce.url4625\", }","start":"2022-03-30T09:35:00.000Z","step":60},"stats":{"timings":{"evalTotalTime":5.672762408,"resultSortTime":0,"queryPreparationTime":5.464064693,"innerEvalTime":0,"execQueueTime":0.000004868,"execTotalTime":5.995925647}},"ts":"2022-03-30T09:35:22.686Z"}
{"error":"query was canceled in expression evaluation","httpRequest":{"clientIP":"172.17.0.126","method":"GET","path":"/api/v1/query_range"},"params":{"end":"2022-03-30T09:38:59.000Z","query":"jvm_memory_pool_bytes_max{appName=\"sunfire-selfmonitor-reduce.url1927\", pool=~\"Code Cache\"}","start":"2022-03-30T09:38:00.000Z","step":60},"stats":{"timings":{"evalTotalTime":3.5984797520000003,"resultSortTime":0,"queryPreparationTime":2.605931976,"innerEvalTime":0,"execQueueTime":0.911022524,"execTotalTime":5.196457487}},"ts":"2022-03-30T09:38:25.080Z"}
{"error":"query was canceled in expression evaluation","httpRequest":{"clientIP":"172.17.0.126","method":"GET","path":"/api/v1/query_range"},"params":{"end":"2022-03-30T09:38:59.000Z","query":"jvm_buffer_pool_used_bytes{appName=\"sunfire-selfmonitor-reduce.url132\", pool=~\"mapped\"} / jvm_buffer_pool_capacity_bytes{appName=\"sunfire-selfmonitor-reduce.url132\", pool=~\"mapped\"}","start":"2022-03-30T09:38:00.000Z","step":60},"stats":{"timings":{"evalTotalTime":3.5998245989999997,"resultSortTime":0,"queryPreparationTime":2.607460066,"innerEvalTime":0,"execQueueTime":0.910834909,"execTotalTime":5.20090463}},"ts":"2022-03-30T09:38:25.081Z"}
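
One way to read entries like these is to compare queryPreparationTime with execTotalTime. Below is a minimal Python sketch, assuming the query log holds one JSON object per line. The helper name prep_share is made up for illustration, and the sample lines are cut down to the timing fields, with values copied from the first two entries above.

```python
import json

# Two log entries reduced to their timing fields; the numbers come from the log above.
SAMPLE_LOG = """\
{"stats":{"timings":{"evalTotalTime":5.672762408,"queryPreparationTime":5.464064693,"execQueueTime":0.000004868,"execTotalTime":5.995925647}}}
{"stats":{"timings":{"evalTotalTime":3.5984797520000003,"queryPreparationTime":2.605931976,"execQueueTime":0.911022524,"execTotalTime":5.196457487}}}
"""


def prep_share(line: str) -> float:
    """Return queryPreparationTime as a fraction of execTotalTime for one log line."""
    timings = json.loads(line)["stats"]["timings"]
    return timings["queryPreparationTime"] / timings["execTotalTime"]


# Skip blank lines so a trailing newline in the log does not break parsing.
shares = [prep_share(line) for line in SAMPLE_LOG.splitlines() if line.strip()]
for share in shares:
    print(f"preparation share of total: {share:.0%}")
```

For these two samples, preparation accounts for roughly 91% and 50% of the total time, and innerEvalTime is 0 in every entry shown.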


 - '--storage.tsdb.retention.time=5m' 
 - '--storage.tsdb.max-block-duration=5m' 
 - '--storage.tsdb.min-block-duration=5m'
I use these three parameters because I want to reduce memory and disk usage as much as possible (this should keep only 5 minutes of data in memory, right?).
I don't know whether these settings are the cause of the high CPU consumption; I see that the official documentation does not recommend setting them. If I shouldn't use these parameters, is there another way to keep only 5-10 minutes of data in memory and on disk?

Brian Candler

Apr 1, 2022, 10:22:50 AM
to Prometheus Users
> the number of requests is about 200,000 to 300,000 every minute

What do you mean by that? Do you mean you are hitting the query API with 200K to 300K queries per minute?! Can you describe the application that requires that? It could be that Prometheus is not well suited to such use.

I would also say your chosen tsdb parameters are likely to make performance *worse* overall. To work efficiently, Prometheus depends on being able to aggregate data in memory, organize it in columnar format (i.e. values for the same timeseries are stored next to each other) and delta-compress it. Tiny block sizes will defeat that.
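
To illustrate why long runs of samples from the same series compress well, here is a toy Python sketch of delta-of-delta encoding applied to timestamps. It is a simplification for illustration only, not Prometheus's actual chunk encoder, and the 15-second scrape interval is an assumption.

```python
def delta_of_delta(timestamps: list[int]) -> list[int]:
    """Encode timestamps as: first value, first delta, then the change in each delta."""
    if len(timestamps) < 2:
        return list(timestamps)
    deltas = [b - a for a, b in zip(timestamps, timestamps[1:])]
    changes = [b - a for a, b in zip(deltas, deltas[1:])]
    return [timestamps[0], deltas[0], *changes]


# Assumed: one sample every 15 s, timestamps in milliseconds.
start = 1648632900000  # 2022-03-30T09:35:00Z
scrapes = [start + i * 15_000 for i in range(8)]
encoded = delta_of_delta(scrapes)
print(encoded)
```

With regular scrapes, every entry after the first two is zero, which is cheap to store. The longer the run of samples from one series, the better the fixed cost of the first value and first delta is amortized.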

> then I wonder if I just want to keep 5-10 minutes of data in memory and on disk, is there a way to do it?

Disk is cheap, so disk space is generally not a problem with any realistic workload.  Why not keep 12 hours of data, or 24 hours of data, on disk?  Prometheus uses mmap() so disk and RAM are really just two sides of the same thing.

If you cannot afford to keep 24 hours of data on disk, then this implies that your *ingestion* rate is also extremely high (i.e. the number of new metric data points you add every minute).
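
A rough sizing sketch in Python, using the rule of thumb from the Prometheus storage documentation (retention in seconds × ingested samples per second × bytes per sample, at roughly 1 to 2 bytes per sample). The ingestion rate below is a made-up figure for illustration, not something reported in this thread.

```python
def disk_bytes(retention_seconds: int, samples_per_second: float,
               bytes_per_sample: float = 2.0) -> float:
    """Estimate on-disk size using the upper end of the 1-2 bytes/sample rule of thumb."""
    return retention_seconds * samples_per_second * bytes_per_sample


ASSUMED_SAMPLES_PER_SECOND = 100_000  # hypothetical ingestion rate

for hours in (12, 24):
    size = disk_bytes(hours * 3600, ASSUMED_SAMPLES_PER_SECOND)
    print(f"{hours} h retention: about {size / 1e9:.1f} GB")
```

The real ingestion rate can be read from Prometheus itself, for example with rate(prometheus_tsdb_head_samples_appended_total[5m]).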

> appName=\"sunfire-selfmonitor-reduce.url1927\",
> appName=\"sunfire-selfmonitor-reduce.url132\"

There is a hint here that you may also be suffering from a cardinality explosion, i.e. too many distinct timeseries.  Every time you change the value of any label, you are creating a whole new timeseries.
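
A small Python sketch of why this matters: each distinct combination of label values is its own series, so the number of series per metric can grow as the product of the number of values per label. The counts below are invented for illustration; the appName suffixes in the logs only hint at how many values exist.

```python
import math


def max_series(values_per_label: dict[str, int]) -> int:
    """Upper bound on series for one metric if every label combination occurs."""
    return math.prod(values_per_label.values())


# Hypothetical counts: 5000 appName values and 4 pool values for a single metric.
total = max_series({"appName": 5000, "pool": 4})
print(total)
```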

Go to the Prometheus web interface, select Status > TSDB Status, and then post what you see.  The "Head Stats" section is the most important, but the others are useful too.
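
The same information is also available over HTTP at /api/v1/status/tsdb. Below is a minimal Python sketch, assuming Prometheus listens on localhost:9090 (adjust the URL) and runs a version recent enough to include headStats in the response. The function names are made up for illustration.

```python
import json
from urllib.request import urlopen

PROM_URL = "http://localhost:9090"  # assumption: adjust to your server


def fetch_tsdb_status(base_url: str = PROM_URL) -> dict:
    """Fetch the TSDB status and return the 'data' part of the response."""
    with urlopen(f"{base_url}/api/v1/status/tsdb", timeout=10) as resp:
        return json.load(resp)["data"]


def summarize(status: dict, top: int = 5) -> list[str]:
    """Return the head series count and the metrics with the most series."""
    head = status.get("headStats", {})
    lines = [f"head series: {head.get('numSeries')}"]
    for entry in status.get("seriesCountByMetricName", [])[:top]:
        lines.append(f"{entry['name']}: {entry['value']} series")
    return lines


# Against a live server: print("\n".join(summarize(fetch_tsdb_status())))
```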