Measuring occurrences where a certain threshold was exceeded


Chris Featherstone

Mar 26, 2020, 1:47:04 PM
to Prometheus Users
I have this query

quantile_over_time(0.90, replication_read_duration_seconds{job="heartbeat-read"}[5m]) < .005 != bool 1

I am trying to measure the times when my duration is greater than 5ms and then report a percentage. Something like: Over the last 5 minutes, 99.9% of requests were below 5ms.

My thought was to create a recording rule with the above query, and then run avg_over_time against my recording rule. The issue I have is that the 'bool 1' only stores a 1 when the condition is true. It doesn't store a 0 when the condition is false. So when I take the average of the recording rule metric, the average is always 1.

Am I approaching this incorrectly?

Chris

Brian Candler

Mar 26, 2020, 2:08:09 PM
to Prometheus Users
On Thursday, 26 March 2020 17:47:04 UTC, Chris Featherstone wrote:
I have this query

quantile_over_time(0.90, replication_read_duration_seconds{job="heartbeat-read"}[5m]) < .005 != bool 1

Obviously any value which is less than 0.005 is not equal to 1, so this will always return 1 or nothing.

It sounds like what you're trying to do here is:

quantile_over_time(0.90, replication_read_duration_seconds{job="heartbeat-read"}[5m]) < bool .005
 
which will return 0 or 1.
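
You could then put that 0/1 expression in a recording rule and average it over time. A rough sketch, using heartbeat:read_p90_under_5ms:bool as a hypothetical name for the recorded expression:

avg_over_time(heartbeat:read_p90_under_5ms:bool[5m])

which gives the fraction of samples in the window where the p90 was under 5ms.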

But I don't think this will solve your problem very well:


I am trying to measure the times when my duration is greater than 5ms and then report a percentage. Something like: Over the last 5 minutes, 99.9% of requests were below 5ms.

replication_read_duration_seconds is a gauge? How often does it change?

If you want to report that 999 in 1000 requests were below 5ms, then you need at least 1000 samples, and if that's over a 5 minute period you must be scraping more than 3 times per second. That's not really how Prometheus is supposed to be used.

It sounds like what you really want is to collect these events in a histogram, then report on the histogram.  But that means changing how you collect the data in the first place.

As a simple way to think about a histogram, imagine you have two counters:
- A counts the total events
- B counts only the events with latency < 5ms

If you take the increase of B over 5 minutes, divided by the increase in A over 5 minutes, that gives you the fraction you're looking for.
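
In PromQL terms, assuming the events are exposed as a Prometheus histogram (the series names below are what a histogram for this metric would typically look like, not something you have today):

increase(replication_read_duration_seconds_bucket{le="0.005"}[5m]) / ignoring(le) increase(replication_read_duration_seconds_count[5m])

Here the le="0.005" bucket plays the role of counter B, the _count series plays the role of counter A, and ignoring(le) is needed so the two series match despite the extra le label on the bucket.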

Chris Featherstone

Mar 26, 2020, 3:17:18 PM
to Prometheus Users
Thanks Brian,

'replication_read_duration_seconds' is a gauge and it updates every time this exporter is hit by Prometheus (30s right now). I was trying to see if I could somehow make our current metrics work, but it's pretty clear that I need the total count (and histogram).

The current metrics only report the duration for the individual scrape. So to get anything meaningful I need to know the results of every attempt. Kind of like two buckets: 0-5ms and 5ms-Inf.

Brian Candler

Mar 26, 2020, 3:57:43 PM
to Prometheus Users
On Thursday, 26 March 2020 19:17:18 UTC, Chris Featherstone wrote:
Thanks Brian,

'replication_read_duration_seconds' is a gauge and it updates every time this exporter is hit by Prometheus (30s right now).

Giving the value of the most recent replication event presumably.  Do the replications themselves only happen once per 30s, or more often?

If they only happen every 30s then you can work along the lines of the formula you were building, but the 5-minute average will only have a resolution of 30/300 = 10%.
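
For example, with one sample every 30s there are only ten samples in a 5-minute window; if nine of them are 1 and one is 0, the average is 0.9, and there is no way for such a window to express something like 99.9%.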

 
I was trying to see if I could somehow make our current metrics work, but it's pretty clear that I need the total count (and histogram).

The current metrics only report the duration for the individual scrape. So to get anything meaningful I need to know the results of every attempt. Kind of like two buckets: 0-5ms and 5ms-Inf.

Yep, that's almost exactly how a histogram works. The standard in Prometheus is that the buckets are cumulative (each one includes everything below it), so you'll have 0-5ms and 0-Inf, but the principle is the same.

(1) replication_read_duration_seconds_bucket{le="0.005"}
(2) replication_read_duration_seconds_bucket{le="+Inf"}
(3) replication_read_duration_seconds_sum
(4) replication_read_duration_seconds_count

(3) together with (4) allows the average to be calculated. (2) will always be the same as (4).
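
For example, the average duration over the last 5 minutes would be:

increase(replication_read_duration_seconds_sum[5m]) / increase(replication_read_duration_seconds_count[5m])

(again assuming those are the names your histogram ends up with).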