Hi folks,
I've written a blackbox_exporter-like process which exposes a probe_duration_seconds gauge. It is scraped every 15s. Given a latency threshold, I'd like to create recording rules that count the number of bad events and the number of total events over the last 1m, in order to derive a latency SLI. Obtaining the total number of events is straightforward:
count_over_time(probe_duration_seconds[1m])
However, I'm not sure how to filter a range vector and count the remaining samples. My first thought was a subquery (assuming an arbitrary 2s threshold for a bad event):
sum_over_time((probe_duration_seconds > bool 2)[1m:15s])
During a complete outage, I've found this returns 5, whereas, due to alignment, the total-events query will almost always return 4, resulting in a negative ratio for the SLI. Using clamp_min() to fix this seems like a hack, as does using a subquery in a recording rule.
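For reference, the clamp_min() workaround I'm referring to would look something like this (assuming the SLI is computed as 1 - bad/total; the 2s threshold is arbitrary):

```
clamp_min(
  1 - (
      sum_over_time((probe_duration_seconds > bool 2)[1m:15s])
    /
      count_over_time(probe_duration_seconds[1m])
  ),
  0
)
```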
Is there a better way than evaluating probe_duration_seconds > bool 2 in a separate rule group with a 15s interval, then applying sum_over_time() to the resulting series every 1m over the past 1m? A completely different option would be to have the exporter expose booleans and sum them over time, but it would be great to keep the thresholds within the Prometheus config.
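To make the first option concrete, here's a sketch of the two rule groups I have in mind (the recorded series names are placeholders I've made up):

```yaml
groups:
  # Evaluate the threshold at scrape resolution, so the
  # resulting series can be aggregated without a subquery.
  - name: probe_bad_events
    interval: 15s
    rules:
      - record: probe_duration:bad
        expr: probe_duration_seconds > bool 2
  # Aggregate bad and total events over the SLI window.
  - name: probe_sli
    interval: 1m
    rules:
      - record: probe_duration:bad:sum_1m
        expr: sum_over_time(probe_duration:bad[1m])
      - record: probe_duration:total:count_1m
        expr: count_over_time(probe_duration_seconds[1m])
```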
Many thanks,
George