Extracting long queries from multiple histograms


Victor Sudakov

Apr 19, 2022, 2:12:12 PM
to promethe...@googlegroups.com
Dear Colleagues,

There is a web app which exports its metrics as multiple histograms,
one histogram per Web endpoint. So each set of histogram data is also
labelled by the {endpoint} label. There are about 50 endpoints so
about 50 histograms.

I would like to detect and graph slow endpoints, that is I would like
to know the value of {endpoint} when its {le} is over 1s or something
like that.

Can you please help with a relevant PromQL query and an idea how to
represent it in Grafana?

I don't actually want 50 heatmaps; there must be a clever way to get
an overview of all the slow endpoints, or all the endpoints with a
particular status code, etc.

--
Victor Sudakov VAS4-RIPE
http://vas.tomsk.ru/
2:5005/49@fidonet

Victor Sudakov

Apr 20, 2022, 4:25:32 PM
to promethe...@googlegroups.com
Victor Sudakov wrote:
>
> There is a web app which exports its metrics as multiple histograms,
> one histogram per Web endpoint. So each set of histogram data is also
> labelled by the {endpoint} label. There are about 50 endpoints so
> about 50 histograms.
>
> I would like to detect and graph slow endpoints, that is I would like
> to know the value of {endpoint} when its {le} is over 1s or something
> like that.
>
> Can you please help with a relevant PromQL query and an idea how to
> represent it in Grafana?
>
> I don't actually want 50 heatmaps; there must be a clever way to get
> an overview of all the slow endpoints, or all the endpoints with a
> particular status code, etc.

An example. The PromQL query
`app1_response_duration_bucket{external_endpoint="http://YY/XX",status_code="200",method="GET"}`
produces a histogram.

The PromQL query
`app1_response_duration_bucket{external_endpoint="http://YY/XX",status_code="200",method="POST"}`
produces another histogram.

The query `app1_response_duration_bucket{le="0.75"}` will return a
list of endpoints which have responded faster than 0.75s.

How do I invert the "le" and find the endpoints slower than "le"?

Julius Volz

Apr 21, 2022, 1:21:20 PM
to Victor Sudakov, Prometheus Users
On Wed, Apr 20, 2022 at 10:25 PM Victor Sudakov <v...@sibptus.ru> wrote:
Victor Sudakov wrote:
>
> There is a web app which exports its metrics as multiple histograms,
> one histogram per Web endpoint. So each set of histogram data is also
> labelled by the {endpoint} label. There are about 50 endpoints so
> about 50 histograms.
>
> I would like to detect and graph slow endpoints, that is I would like
> to know the value of {endpoint} when its {le} is over 1s or something
> like that.
>
> Can you please help with a relevant PromQL query and an idea how to
> represent it in Grafana?
>
> I don't actually want 50 heatmaps; there must be a clever way to get
> an overview of all the slow endpoints, or all the endpoints with a
> particular status code, etc.

An example. The PromQL query
`app1_response_duration_bucket{external_endpoint="http://YY/XX",status_code="200",method="GET"}`
produces a histogram.

The PromQL query
`app1_response_duration_bucket{external_endpoint="http://YY/XX",status_code="200",method="POST"}`
produces another histogram.

The query `app1_response_duration_bucket{le="0.75"}` will return a
list of endpoints which have responded faster than 0.75s.

This is not quite correct - this query gives you the le="0.75" bucket counter for *all* endpoints, and the value of each bucket counter tells you how many requests that endpoint has handled that completed within 0.75s since the exposing process started tracking things.
 
How do I invert the "le" and find the endpoints slower than "le"?

Hmm, histograms are usually used to tell you about the *distribution* of request latencies to a given endpoint (or other label combination). So it's unclear what you mean with an endpoint being slower than some "le" value. Do you want to find out whether some endpoint has handled any requests *at all* that took longer than some duration? Or only if that happened in the last X amount of time? Or only if a certain percentage of requests were too slow?

One thing people frequently do is to calculate percentiles / quantiles from a histogram, for example:

    histogram_quantile(0.9, rate(app1_response_duration_bucket[5m]))

...would tell you the approximated 90th percentile latency in seconds as averaged over a moving 5-minute window for a given label combination, which you can then combine with a filter operator to find slow endpoints (e.g. "... > 10" would give you those endpoints that have a 90th percentile latency above 10s).
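
For example, a sketch that combines the two (the sum by (le, external_endpoint) aggregation and the 1s threshold are just examples, assuming the external_endpoint label from your earlier message; le has to be kept in the aggregation because histogram_quantile needs it):

    histogram_quantile(
      0.9,
      sum by (le, external_endpoint) (rate(app1_response_duration_bucket[5m]))
    ) > 1

This returns one series per endpoint whose 90th percentile latency over the last 5 minutes is above 1s.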

See also https://prometheus.io/docs/practices/histograms/ for more details on using histograms.

Regards,
Julius

--
Julius Volz
PromLabs - promlabs.com

Victor Sudakov

Apr 21, 2022, 6:51:00 PM
to promethe...@googlegroups.com
Julius Volz wrote:

[dd]
> >
> > The query `app1_response_duration_bucket{le="0.75"}` will return a
> > list of endpoints which have responded faster than 0.75s.
> >
>
> This is not quite correct - this query gives you the le="0.75" bucket
> counter for *all* endpoints,

OK, I stand corrected.

> and the value of each bucket counter tells you
> how many requests that endpoint has handled that completed within 0.75s
> since the exposing process started tracking things.

What if I want to see how many requests each endpoint has handled that
DID NOT complete within 0.75s since the exposing process started
tracking things?
>
>
> > How do I invert the "le" and find the endpoints slower than "le"?
> >
>
> Hmm, histograms are usually used to tell you about the *distribution* of
> request latencies to a given endpoint (or other label combination). So it's
> unclear what you mean with an endpoint being slower than some "le" value.

Please see above.

> Do you want to find out whether some endpoint has handled any requests *at
> all* that took longer than some duration? Or only if that happened in the
> last X amount of time?

Yes, I think I can put it like this. I would like to be informed if any
endpoint has become "slow" and the details may vary.


> Or only if a certain percentage of requests were too
> slow?
>
> One thing people frequently do is to calculate percentiles / quantiles from
> a histogram, for example:
>
> histogram_quantile(0.9, rate(app1_response_duration_bucket[5m]))
>
> ...would tell you the approximated 90th percentile latency in seconds as
> averaged over a moving 5-minute window for a given label combination, which
> you can then combine with a filter operator to find slow endpoints (e.g.
> "... > 10" would give you those endpoints that have a 90th percentile
> latency above 10s).

I've tried to graph "histogram_quantile(0.9, rate(app1_response_duration_bucket[5m])) > 3"
but the result is very hard to interpret visually. It almost makes no sense.

It's slightly more understandable as a table/list.

Julius Volz

Apr 22, 2022, 5:23:26 AM
to Victor Sudakov, Prometheus Users
On Fri, Apr 22, 2022 at 12:50 AM Victor Sudakov <v...@sibptus.ru> wrote:
Julius Volz wrote:

[dd]
> >
> > The query `app1_response_duration_bucket{le="0.75"}` will return a
> > list of endpoints which have responded faster than 0.75s.
> >
>
> This is not quite correct - this query gives you the le="0.75" bucket
> counter for *all* endpoints,

OK, I stand corrected.

> and the value of each bucket counter tells you
> how many requests that endpoint has handled that completed within 0.75s
> since the exposing process started tracking things.

What if I want to see how many requests each endpoint has handled that
DID NOT complete within 0.75s since the exposing process started
tracking things?

Then you could subtract the le="0.75" bucket from the total count (which is available both in the _bucket{le="+Inf"} series and in the _count series of the histogram):

----------
  app1_response_duration_bucket{le="+Inf"}
- ignoring(le)
  app1_response_duration_bucket{le="0.75"}
----------
 
The "ignoring(le)" tells the binary operator to ignore the "le" label for vector matching, since it will always be different on both sides.

And then you could also add a filter to only show outputs with >0 requests:

----------
  app1_response_duration_bucket{le="+Inf"}
- ignoring(le)
  app1_response_duration_bucket{le="0.75"}
 > 0
----------

BUT: It's important to note that operating on a raw histogram counter like this is not recommended, as the counts will totally depend on when the process started handling & tracking requests (e.g. 5 minutes ago vs. 2 months ago). You most likely will want to at least wrap rate() or increase() around the histogram counters to only consider the behavior of the histogram counters within a defined time range like the last 5 minutes, last 1h, etc.:

----------
  rate(app1_response_duration_bucket{le="+Inf"}[5m])
- ignoring(le)
  rate(app1_response_duration_bucket{le="0.75"}[5m])
 > 0
----------

The above would give you the per-second rate of slow requests for any endpoints that received any slow requests within the last 5m. Use increase() instead of rate() if you want absolute vs. per-second numbers.
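
If you want the absolute numbers aggregated per endpoint, a sketch could look like this (the external_endpoint label is the one from your earlier example and the 1h window is arbitrary; summing by (external_endpoint) drops the le label, so ignoring(le) is no longer needed):

----------
  sum by (external_endpoint) (increase(app1_response_duration_bucket{le="+Inf"}[1h]))
-
  sum by (external_endpoint) (increase(app1_response_duration_bucket{le="0.75"}[1h]))
 > 0
----------

That would show, per endpoint, roughly how many requests in the last hour took longer than 0.75s, for endpoints that had any.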

>
>
> > How do I invert the "le" and find the endpoints slower than "le"?
> >
>
> Hmm, histograms are usually used to tell you about the *distribution* of
> request latencies to a given endpoint (or other label combination). So it's
> unclear what you mean with an endpoint being slower than some "le" value.

Please see above.

> Do you want to find out whether some endpoint has handled any requests *at
> all* that took longer than some duration? Or only if that happened in the
> last X amount of time?

Yes, I think I can put it like this. I would like to be informed if any
endpoint has become "slow" and the details may vary.


> Or only if a certain percentage of requests were too
> slow?
>
> One thing people frequently do is to calculate percentiles / quantiles from
> a histogram, for example:
>
>     histogram_quantile(0.9, rate(app1_response_duration_bucket[5m]))
>
> ...would tell you the approximated 90th percentile latency in seconds as
> averaged over a moving 5-minute window for a given label combination, which
> you can then combine with a filter operator to find slow endpoints (e.g.
> "... > 10" would give you those endpoints that have a 90th percentile
> latency above 10s).

I've tried to graph "histogram_quantile(0.9, rate(app1_response_duration_bucket[5m])) > 3"
but the result is very hard to interpret visually. It almost makes no sense.

It's slightly more understandable as a table/list.

Yes, queries that include filtering constructs usually look weird in graphs because the filter criterion might be true at some time steps in the graph but not at others, so you can get graphs with many short intermittent series. Filters are more commonly used for alerting / table queries.
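
For a table panel or an alert expression, a ratio-based sketch along the lines of the "certain percentage of requests were too slow" idea above could look like this (the 0.75s bucket, the 90% threshold and the 5m window are arbitrary; external_endpoint is the label from your example):

----------
  sum by (external_endpoint) (rate(app1_response_duration_bucket{le="0.75"}[5m]))
/
  sum by (external_endpoint) (rate(app1_response_duration_bucket{le="+Inf"}[5m]))
 < 0.9
----------

It lists only the endpoints where fewer than 90% of requests over the last 5 minutes completed within 0.75s; endpoints with no recent traffic drop out because the ratio is NaN.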
 
--
Victor Sudakov VAS4-RIPE
http://vas.tomsk.ru/
2:5005/49@fidonet
