Prometheus Slow Performance / Prometheus Performance Tuning


Midhun K

Oct 27, 2020, 10:42:27 PM
to Prometheus Users
Hello Guys,

What is the problem?
I'm facing slow Grafana dashboard performance, with Prometheus as the data store, and I need to debug and understand where the bottleneck/slowness comes from.

What have I tried to improve performance?
1. Tried Trickster as a caching/accelerator layer between Prometheus and Grafana.
2. Increased some query parameter limits (see the sketch after this list for a way to verify the new values are in effect):

         --query.max-concurrency=20
                                 Maximum number of queries executed concurrently.
         --query.max-samples=50000000
                                 Maximum number of samples a single query can load into memory.

   These helped reduce connection timeout issues but did not help with the slow performance.
3. Checked system resource usage - it is good enough to handle the queries.
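
For reference, a minimal sketch I use (assuming Prometheus is reachable at http://localhost:9090 and the Python requests library is installed) to read the effective flag values back from the /api/v1/status/flags endpoint and confirm the raised limits are actually applied:

    # Read the effective command-line flags back from Prometheus to confirm
    # that --query.max-concurrency / --query.max-samples are really applied.
    # Assumes Prometheus listens on http://localhost:9090 (adjust for your
    # Docker Compose network) and that `requests` is installed.
    import requests

    PROM_URL = "http://localhost:9090"

    resp = requests.get(f"{PROM_URL}/api/v1/status/flags", timeout=10)
    resp.raise_for_status()
    flags = resp.json()["data"]

    for name in ("query.max-concurrency", "query.max-samples", "query.timeout"):
        print(f"{name} = {flags.get(name)}")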


What do I need to know?
1. I want to understand more about the timing stats below, which can be fetched from the Prometheus query logs (evalTotalTime, execQueueTime, execTotalTime, innerEvalTime, queryPreparationTime, resultSortTime) - see the sketch after this list for one way to pull the same timings from the HTTP API.
 
    "stats": {
        "timings": {
            "evalTotalTime": 0.000447452,
            "execQueueTime": 7.599e-06,
            "execTotalTime": 0.000461232,
            "innerEvalTime": 0.000427033,
            "queryPreparationTime": 1.4177e-05,
            "resultSortTime": 6.48e-07
        }
2. We're using Prometheus widely but have been unable to find a useful resource on performance tuning, so please flood this email chain with tunable options and ideas to improve Prometheus query performance, and guide me on anything I can do to narrow down the exact area contributing to the slowness.
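
For reference, a minimal sketch of how I pull the same timing breakdown straight from the HTTP query API by passing the stats parameter (assuming Prometheus at http://localhost:9090, the Python requests library, and an illustrative PromQL expression - substitute one of the slow dashboard queries):

    # Fetch query timing stats from the Prometheus HTTP API by passing a
    # non-empty `stats` parameter; the response then includes the same
    # "timings" object as the query log entry.
    # Assumes Prometheus at http://localhost:9090 and `requests` installed;
    # the PromQL expression below is only an illustration.
    import requests

    PROM_URL = "http://localhost:9090"
    QUERY = 'sum(rate(prometheus_http_requests_total[5m]))'

    resp = requests.get(
        f"{PROM_URL}/api/v1/query",
        params={"query": QUERY, "stats": "all"},
        timeout=120,
    )
    resp.raise_for_status()
    timings = resp.json()["data"]["stats"]["timings"]

    # Print the slowest phases first to see where the time goes.
    for name, seconds in sorted(timings.items(), key=lambda kv: kv[1], reverse=True):
        print(f"{name:>22}: {seconds * 1000:.3f} ms")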


Stack Details
OS: CentOS 7
Version: Prometheus 2.20
Deployment: Docker Compose stack (Prometheus, Grafana, Trickster)

Midhun K

Oct 27, 2020, 11:23:38 PM
to Prometheus Users
+ Adding some additional points.

If prometheus_engine_queries is greater than prometheus_engine_queries_concurrent_max, it means that some queries are queued. The queue time is part of the two-minute default timeout. 

We analysed the maximum query rate from our dashboards: it is between 30 and 40, while our default value was 20. That causes some timeouts and slowness because requests get queued, so we have now increased the limit to 60. (This will use somewhat more resources, but that is fine given our system specification.)
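
A minimal sketch of the check we run (same assumptions as above: Prometheus at http://localhost:9090 and the Python requests library), comparing prometheus_engine_queries with prometheus_engine_queries_concurrent_max via instant queries to see whether queries are currently being queued:

    # Compare the number of in-flight queries with the configured concurrency
    # limit; if queries exceed the limit, requests are being queued.
    # Assumes Prometheus at http://localhost:9090 and `requests` installed.
    import requests

    PROM_URL = "http://localhost:9090"

    def instant_value(expr):
        """Run an instant query and return the first sample value as a float."""
        resp = requests.get(f"{PROM_URL}/api/v1/query", params={"query": expr}, timeout=10)
        resp.raise_for_status()
        result = resp.json()["data"]["result"]
        return float(result[0]["value"][1]) if result else 0.0

    running = instant_value("prometheus_engine_queries")
    limit = instant_value("prometheus_engine_queries_concurrent_max")

    print(f"engine queries: {running:.0f}, concurrency limit: {limit:.0f}")
    if running > limit:
        print("queries are being queued; consider raising --query.max-concurrency")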
[Attachments: max.png, max_limit.png]