histogram_quantile always returns the lowest bucket's upper bound even if all observations fall in higher buckets


Matteo Olivi

Apr 19, 2020, 2:01:57 PM
to Prometheus Users
Hello,
I have instrumented a Go application with Prometheus' client_golang v0.9.2, and among its metrics is the following histogram vector:

attachmentCreateToLocalIfcHistograms := prometheus.NewHistogramVec(
    prometheus.HistogramOpts{
        Namespace:   metricsNamespace,
        Subsystem:   metricsSubsystem,
        Name:        "attachment_create_to_local_ifc_latency_seconds",
        Help:        "Seconds from attachment creation to finished creating local interface",
        Buckets:     []float64{-0.125, 0, 0.125, 0.25, 0.5, 1, 2, 4, 8, 16, 32, 64, 128, 256},
        ConstLabels: map[string]string{"node": node},
    },
    []string{naWaitedForIPAMLabel, naWaitedForLocalCALabel})
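
Observations on this vector are made roughly along these lines (a sketch rather than the actual call site; latencySeconds is a stand-in for the measured duration, and I'm assuming naWaitedForIPAMLabel / naWaitedForLocalCALabel hold the label names shown in the query below):

// Hypothetical call site: record one latency, in seconds, for the
// label combination used in the experiment described below.
attachmentCreateToLocalIfcHistograms.With(prometheus.Labels{
    naWaitedForIPAMLabel:    "false",
    naWaitedForLocalCALabel: "true",
}).Observe(latencySeconds)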

I am displaying the 99th percentile (over 1 minute) of the histogram in Grafana with the following query:

histogram_quantile(0.99, rate(kos_agent_attachment_create_to_local_ifc_latency_seconds_bucket{waited_for_ipam="false", waited_for_local_ca="true"}[1m]))

where "kos" is the metrics namespace and "agent" is the metrics subsystem.

Notice that the expression picks only one of the histograms in the vector (the one with labels waited_for_ipam="false", waited_for_local_ca="true"); that's because
I'm running experiments where I generate load for the application, and currently that label combination is the only one that gets used.

I repeated the same experiment multiple times, and Grafana always displays the value -0.125, which is the upper bound of the lowest bucket. Via log statements, however,
I observed that there is always a total of 30 observations (because of how the experiment is built) and that all of them have a value higher than 10. The same value
is returned if I execute the aforementioned query directly in the Prometheus expression browser.

Also, I'm not sure whether it is relevant, but all 30 observations always take place almost at the same time (within the same second).

The Prometheus binary is v2.6.1 and runs in a Kubernetes Pod (as does the instrumented application).

How is it possible that histogram_quantile always returns -0.125?

Thanks,
Matteo. 

Bjoern Rabenstein

Apr 22, 2020, 11:03:22 AM
to Matteo Olivi, Prometheus Users
On 19.04.20 11:01, Matteo Olivi wrote:
> I have instrumented a Go application with Prometheus' client_golang v 0.9.2,

Probably unrelated to your problem, but that version is more than 16
months old. You might consider upgrading to the most recent version.

> histogram_quantile(0.99, rate
> (kos_agent_attachment_create_to_local_ifc_latency_seconds_bucket
> {waited_for_ipam="false", waited_for_local_ca="true"}[1m]))

My first intuition is that `[1m]` is way too short. You need a scrape
interval of 15s or less to make this a meaningful duration for the
`rate`. And even then, the observations you want to see have to happen
in this 1m timeframe.

What happens if you use 10m or 1h?
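
For example, the same query as yours with a wider range (metric name and labels unchanged):

histogram_quantile(0.99, rate(kos_agent_attachment_create_to_local_ifc_latency_seconds_bucket{waited_for_ipam="false", waited_for_local_ca="true"}[10m]))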

_However_, if really no observations happen in the 1m range, then the
result should be NaN rather than the lowest bucket. The latter would
happen if the only bucket that gets an increase is the lowest one.

To debug that, you could check the result without the
`histogram_quantile`, i.e. run the following:

rate(kos_agent_attachment_create_to_local_ifc_latency_seconds_bucket{waited_for_ipam="false",waited_for_local_ca="true"}[1m])

Then check which buckets have a value at all, which have a value of
zero, and which have a higher one.

> Prometheus binary is v2.6.1 and runs within a Kubernetes Pod (like the
> instrumented application).

Again probably unrelated, but it's highly recommended to upgrade to a
more recent version.

--
Björn Rabenstein
[PGP-ID] 0x851C3DA17D748D03
[email] bjo...@rabenste.in

Matteo Olivi

May 6, 2020, 5:49:07 AM
to Prometheus Users
The problem was that the scrape interval was 10 seconds but all the observations happened during a 1-second burst,
and before that burst the histogram child for the {waited_for_ipam="false",waited_for_local_ca="true"} labels did not exist.
This means that the first time the metric got scraped it already had all of its observations; rate() had no earlier sample
of that series to compare against, so it completely missed the increment.

One hack is to create the child for the {waited_for_ipam="false",waited_for_local_ca="true"} labels as early as possible with a call to .With(labels),
and to make sure there is at least one scrape before the burst that adds all the observations (this can be achieved, for instance,
with a sleep at least as long as the scrape interval), as sketched below.
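
A minimal sketch of that workaround, reusing the HistogramVec from my first message and assuming naWaitedForIPAMLabel and naWaitedForLocalCALabel hold the label names used in the query (the sleep duration is illustrative):

// Touch the child series up front so it is exported with zero
// observations on the first scrape; the later burst then shows up
// as an increase that rate() can see.
attachmentCreateToLocalIfcHistograms.With(prometheus.Labels{
    naWaitedForIPAMLabel:    "false",
    naWaitedForLocalCALabel: "true",
})

// Illustrative: wait at least one scrape interval (10s in my setup)
// before starting the burst, so Prometheus scrapes the zero-valued
// series at least once.
time.Sleep(11 * time.Second)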

Matteo.