Problem interpreting histogram_quantile

yopt...@tripactions.com

May 15, 2019, 6:10:22 PM
to Prometheus Users
Hi there,

I've recently started exposing Prometheus metrics from my application and most things work as expected, but I'm confused by the results generated by histogram_quantile.

I think some examples will illustrate the issue:

Average latency plotted for a single API endpoint (4 different instances) barely exceeds 1.0s:
Query: 
rate(sanic_request_latency_sec_sum{http_status="200",endpoint="/api/flight/rank_flights"}[5m]) / rate(sanic_request_latency_sec_count{http_status="200",endpoint="/api/flight/rank_flights"}[5m])


Screenshot 2019-05-15 at 14.24.26.png


Then the 95th percentile for the same endpoint is mostly above 2.0s:

Query:

histogram_quantile(0.95, sum(rate(sanic_request_latency_sec_bucket{endpoint="/api/flight/rank_flights",http_status="200"}[5m])) by (le))


Screenshot 2019-05-15 at 14.24.38.png

It doesn't make sense to me that the 95th percentile is that high. Sure, the latency graph is averaged, but I've cross-checked with our logs in Kibana and no request takes longer than 1.1s.


Some context:
I'm new to Prometheus and it's entirely possible that I'm getting things completely wrong, but I just can't wrap my head around this.

Any help would be appreciated!

Youri

Brian Brazil

May 15, 2019, 6:28:22 PM
to yopt...@tripactions.com, Prometheus Users
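The relevant bucket boundaries here are 1.0 and 2.5. histogram_quantile interpolates linearly within a bucket, so if there are lots of requests in the 1.0-1.1s range, they all land in the 1.0-2.5 bucket and the interpolation will place the estimate nearer to 2.5 than to 1.0.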
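
To make the interpolation concrete, here is a minimal Python sketch of the linear interpolation that histogram_quantile is described as doing above. It is a simplified illustration, not the Prometheus source. The bucket counts are hypothetical (100 requests, 40 of them between 1.0s and 1.1s), and the function name and data layout are invented for the example.

```python
import math


def histogram_quantile(q, buckets):
    """Estimate the q-quantile from cumulative histogram buckets.

    buckets: list of (upper_bound, cumulative_count) pairs sorted by
    upper_bound, ending with the +Inf bucket.
    """
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total  # how many observations lie at or below the quantile
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound):
                # Quantile falls in the +Inf bucket: report the highest finite bound.
                return prev_bound
            # Assume observations are spread evenly inside the bucket.
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return math.nan


# Hypothetical counts: 60 requests finished within 1.0s, the other 40 took
# between 1.0s and 1.1s, so they all sit in the (1.0, 2.5] bucket.
buckets = [(0.5, 10), (1.0, 60), (2.5, 100), (5.0, 100), (math.inf, 100)]
estimate = histogram_quantile(0.95, buckets)
print(f"estimated p95: {estimate:.4f}s")
```

With these made-up counts the estimate comes out at roughly 2.31s even though no request took longer than 1.1s, because the histogram only knows that 40 requests landed somewhere between 1.0s and 2.5s.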

Brian

yopt...@tripactions.com

May 15, 2019, 7:48:59 PM
to Prometheus Users
Hi Brian,

Thanks a lot! I guess I did not fully understand how this worked. I should change my buckets to better reflect the ranges I'm expecting to see!
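
As a rough way to check a candidate bucket layout, the sketch below looks up which bucket a given latency falls into, since the quantile estimate can land anywhere inside that bucket. The two bucket lists and the helper name are assumptions made for illustration. How custom bounds are passed to the client depends on the instrumentation library in use.

```python
import math
from bisect import bisect_left


def enclosing_bucket(value, upper_bounds):
    """Return (lower, upper) bounds of the bucket that contains value.

    A bucket with upper bound `le` counts observations <= le, so the
    enclosing bucket is the first bound that is >= value.
    """
    bounds = sorted(upper_bounds) + [math.inf]
    i = bisect_left(bounds, value)
    lower = bounds[i - 1] if i > 0 else 0.0
    return lower, bounds[i]


# Assumed layouts: a coarse one with the 1.0 and 2.5 boundaries mentioned
# in the thread, and a hypothetical finer one around the expected 1s range.
coarse = [0.1, 0.25, 0.5, 0.75, 1.0, 2.5, 5.0, 10.0]
finer = [0.25, 0.5, 0.75, 1.0, 1.1, 1.25, 1.5, 2.0]

print(enclosing_bucket(1.05, coarse))
print(enclosing_bucket(1.05, finer))
```

A 1.05s request falls between 1.0 and 2.5 in the coarse layout but between 1.0 and 1.1 in the finer one, so the interpolation error for requests in that range shrinks from up to 1.5s to at most 0.1s.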

Youri