Problem interpreting histogram_quantile

yopt...@tripactions.com

May 15, 2019, 6:10:22 PM

to Prometheus Users

Hi there,

I've recently started exposing Prometheus metrics from my application and most things work as expected, but I'm confused by the results generated by histogram_quantile.

I think some examples will illustrate the issue:

Latency plotted for a single API endpoint (4 different instances), barely exceeds 1.0s:

Query:

rate(sanic_request_latency_sec_sum{http_status="200",endpoint="/api/flight/rank_flights"}[5m]) / rate(sanic_request_latency_sec_count{http_status="200",endpoint="/api/flight/rank_flights"}[5m])

[Screenshot: average latency graph, 2019-05-15 at 14.24.26]

Then, 95th percentile for the same endpoint is mostly above 2.0s:

Query:

histogram_quantile(0.95, sum(rate(sanic_request_latency_sec_bucket{endpoint="/api/flight/rank_flights",http_status="200"}[5m])) by (le))

[Screenshot: 95th percentile graph, 2019-05-15 at 14.24.38]

It doesn't make sense to me that the 95th percentile is that high. Sure, the latency graph is averaged, but I've cross-checked with our logs in Kibana and no request takes longer than 1.1s.

Some context:

- As we're still testing this service, the request rate is very low, around 10 requests per minute.
- I'm running Gunicorn with Sanic and sanic_prometheus.
- The histogram relies on the default buckets as defined by the Prometheus Python client: https://github.com/prometheus/client_python/blob/master/prometheus_client/metrics.py#L473

I'm new to Prometheus and it's entirely possible that I'm getting things completely wrong, but I just can't wrap my head around this.

Any help would be appreciated!

Youri

Brian Brazil

May 15, 2019, 6:28:22 PM

to yopt...@tripactions.com, Prometheus Users

On Wed, 15 May 2019 at 23:10, <yopt...@tripactions.com> wrote:

It doesn't make sense to me that the 95th percentile is that high. Sure, the latency graph is averaged, but I've cross checked with our logs in Kibana and there is no request that takes longer than 1.1s.

The relevant buckets here are 1.0 and 2.5. histogram_quantile interpolates linearly within a bucket, so if there are lots of requests in the 1.0-1.1s range, they all land in the 2.5 bucket and the interpolation will place them nearer to 2.5 than to 1.0.
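To make the interpolation concrete, here is a minimal Python sketch of the idea behind histogram_quantile. It is not the Prometheus implementation: it works on plain cumulative counts rather than rates, and the bucket counts are made up for illustration (50 requests in the window, 30 of them at or below 1.0s and the remaining 20 between 1.0s and 1.1s).

```python
import math


def histogram_quantile(q, buckets):
    """Estimate quantile q from (upper_bound, cumulative_count) pairs.

    `buckets` must be sorted by upper bound and end with the +Inf bucket.
    This is a simplified sketch of linear interpolation within a bucket.
    """
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for upper, count in buckets:
        if count >= rank:
            if math.isinf(upper):
                # No finite upper bound to interpolate towards.
                return prev_bound
            # Assume observations are spread evenly across the bucket.
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (upper - prev_bound) * fraction
        prev_bound, prev_count = upper, count
    return prev_bound


# Hypothetical cumulative counts for the buckets around the observed latencies.
example_buckets = [
    (0.75, 5),
    (1.0, 30),       # 30 requests took at most 1.0s
    (2.5, 50),       # the other 20 took 1.0-1.1s, but only "<= 2.5" is recorded
    (5.0, 50),
    (math.inf, 50),
]

estimate = histogram_quantile(0.95, example_buckets)
print(round(estimate, 4))  # roughly 2.31, although no request exceeded 1.1s
```

With these assumed counts the estimate lands well above 2s, simply because the histogram cannot tell where inside the 1.0 to 2.5 bucket the observations fell.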

Brian

--
You received this message because you are subscribed to the Google Groups "Prometheus Users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to prometheus-use...@googlegroups.com.
To post to this group, send email to promethe...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/prometheus-users/4b0a8ede-2e00-4459-b950-b9a5173b953e%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

--

Brian Brazil

www.robustperception.io

yopt...@tripactions.com

May 15, 2019, 7:48:59 PM

to Prometheus Users

Hi Brian,

Thanks a lot! I guess I did not fully understand how this worked. I should change my buckets to better reflect the ranges I'm expecting to see!
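As a rough sketch of what changing the buckets buys, the snippet below looks up which bucket a 1.05s request falls into. DEFAULT_BUCKETS mirrors the defaults of the Prometheus Python client; CUSTOM_BUCKETS and the helper enclosing_bucket are hypothetical names and values chosen only for illustration. The Python client's Histogram accepts a buckets argument for this; how sanic_prometheus exposes it is not covered here.

```python
import bisect
import math

# Default bucket upper bounds of the Prometheus Python client.
DEFAULT_BUCKETS = (0.005, 0.01, 0.025, 0.05, 0.075, 0.1, 0.25, 0.5,
                   0.75, 1.0, 2.5, 5.0, 7.5, 10.0, math.inf)

# Hypothetical bounds with finer resolution around the expected ~1s latency.
CUSTOM_BUCKETS = (0.1, 0.25, 0.5, 0.75, 0.9, 1.0, 1.1, 1.25, 1.5,
                  2.0, 2.5, 5.0, math.inf)


def enclosing_bucket(value, bounds):
    """Return (lower, upper) such that lower < value <= upper."""
    i = bisect.bisect_left(bounds, value)
    lower = bounds[i - 1] if i > 0 else 0.0
    return lower, bounds[i]


# A 1.05s request is only known to lie somewhere in a 1.5s-wide bucket by default...
print(enclosing_bucket(1.05, DEFAULT_BUCKETS))  # (1.0, 2.5)
# ...but in a 0.1s-wide bucket with the finer bounds.
print(enclosing_bucket(1.05, CUSTOM_BUCKETS))   # (1.0, 1.1)
```

The quantile estimate can only be wrong by as much as the width of the bucket it falls in, so narrower buckets around the expected range keep the estimate close to the real latencies.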

Youri
