Julius Volz wrote:
[dd]
> >
> > The query `app1_response_duration_bucket{le="0.75"}` will return a
> > list of endpoints which have responded faster than 0.75s.
> >
>
> This is not quite correct - this query gives you the le="0.75" bucket
> counter for *all* endpoints,
OK, I stand corrected.
> and the value of each bucket counter tells you
> how many requests that endpoint has handled that completed within 0.75s
> since the exposing process started tracking things.
What if I want to see how many requests each endpoint has handled that
DID NOT complete within 0.75s since the exposing process started
tracking things?
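
If I understand the cumulative buckets correctly, a rough sketch of what I
am after would be the total count minus the le="0.75" bucket. This assumes
the histogram also exposes the usual `_count` series next to the buckets
and that 0.75 is one of the configured bucket boundaries; the 5m window in
the second query is an arbitrary choice of mine. These are two separate
queries:

```promql
# Requests per series that took longer than 0.75s since the process started:
# all observations minus those counted in the le="0.75" bucket.
  app1_response_duration_count
- ignoring(le)
  app1_response_duration_bucket{le="0.75"}

# Same idea, restricted to roughly the last 5 minutes.
  increase(app1_response_duration_count[5m])
- ignoring(le)
  increase(app1_response_duration_bucket{le="0.75"}[5m])
```

The `ignoring(le)` is there because the bucket series carry an `le` label
and the `_count` series do not, so the two sides would not match otherwise.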
>
>
> > How do I invert the "le" and find the endpoints slower than "le"?
> >
>
> Hmm, histograms are usually used to tell you about the *distribution* of
> request latencies to a given endpoint (or other label combination). So it's
> unclear what you mean by an endpoint being slower than some "le" value.
Please see above.
> Do you want to find out whether some endpoint has handled any requests *at
> all* that took longer than some duration? Or only if that happened in the
> last X amount of time?
Yes, I think I can put it like this. I would like to be informed if any
endpoint has become "slow" and the details may vary.
> Or only if a certain percentage of requests were too
> slow?
>
> One thing people frequently do is to calculate percentiles / quantiles from
> a histogram, for example:
>
> histogram_quantile(0.9, rate(app1_response_duration_bucket[5m]))
>
> ...would tell you the approximated 90th percentile latency in seconds as
> averaged over a moving 5-minute window for a given label combination, which
> you can then combine with a filter operator to find slow endpoints (e.g.
> "... > 10" would give you those endpoints that have a 90th percentile
> latency above 10s).
I've tried to graph "histogram_quantile(0.9, rate(app1_response_duration_bucket[5m])) > 3"
but the result is very hard to interpret visually. It makes almost no sense.
It's slightly more understandable as a table/list.
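
My guess is that the graph looks broken because the `> 3` filter drops
every sample at or below 3, so each series only shows up as disconnected
fragments. A sketch of what might graph better, assuming the endpoints are
distinguished by a label called `endpoint` (that label name is my
assumption, not something taken from the app). Again, two separate queries:

```promql
# Graph the unfiltered 90th percentile per endpoint and read the 3s
# threshold off the y-axis instead of filtering on it.
histogram_quantile(
  0.9,
  sum by (le, endpoint) (rate(app1_response_duration_bucket[5m]))
)

# Alternative: keep every series, but plot 1 while it is above 3s and 0 otherwise.
histogram_quantile(
  0.9,
  sum by (le, endpoint) (rate(app1_response_duration_bucket[5m]))
) > bool 3
```

The filtered form with plain `> 3` would then be kept for the table view or
an alert, where only the offending endpoints matter.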