Hi folks,
I've written a blackbox_exporter-like process which exposes a probe_duration_seconds gauge. It is scraped every 15s. Given a latency threshold, I'd like to create recording rules that count the number of bad events and the number of total events over the last 1m, in order to derive a latency SLI. Obtaining the total number of events is straightforward:
count_over_time(probe_duration_seconds[1m])
However, I'm not sure how to filter a range vector and count the remaining samples. My first thought was a subquery (assuming an arbitrary 2s threshold for a bad event):
sum_over_time((probe_duration_seconds > bool 2)[1m:15s])
During a complete outage, I've found this returns 5, whereas, due to alignment, the total-events query will almost always return 4, resulting in a negative ratio for the SLI. Using clamp_min() to fix this seems like a hack, as does using a subquery in a recording rule.
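For reference, the clamp_min() workaround I'm referring to would look something like this (assuming the SLI is computed as 1 - bad/total; the 2s threshold is arbitrary):

```
clamp_min(
  1 - (
      sum_over_time((probe_duration_seconds > bool 2)[1m:15s])
    /
      count_over_time(probe_duration_seconds[1m])
  ),
  0
)
```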
Is there a better way than evaluating probe_duration_seconds > bool 2 in a separate rule group with a 15s interval, then applying sum_over_time() to the resulting series every 1m over the past 1m? A completely different option would be to have the exporter expose booleans and sum them over time, but it would be great to keep the thresholds within the Prometheus config.
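To make the first option concrete, here's a sketch of the two rule groups I have in mind (the recorded series names are placeholders I've made up):

```yaml
groups:
  # Evaluate the threshold at scrape resolution, so the
  # resulting series can be aggregated without a subquery.
  - name: probe_bad_events
    interval: 15s
    rules:
      - record: probe_duration:bad
        expr: probe_duration_seconds > bool 2
  # Aggregate bad and total events over the SLI window.
  - name: probe_sli
    interval: 1m
    rules:
      - record: probe_duration:bad:sum_1m
        expr: sum_over_time(probe_duration:bad[1m])
      - record: probe_duration:total:count_1m
        expr: count_over_time(probe_duration_seconds[1m])
```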
Many thanks,
George