Error budget burndown chart


Vikas Kumar

Oct 7, 2020, 2:46:58 AM
to Prometheus Users
I'm trying to create an SLO dashboard using Prometheus and Grafana. One chart I want to add is an error budget burndown over the last X days (it starts at 100% and then burns down), something similar to:

[Attached image: slo-reset-button.png]
I can do it using a custom script and two different Prometheus queries.

 - Fetch the total number of requests in the last X days (TOTAL) [Query 1]
 - Allowed bad requests = TOTAL * (100 - SLO) / 100, with SLO given as a percentage
 - Fetch the number of bad requests in each window (say 1 hr) in the last X days [Query 2]
 - Calculate the % of budget remaining after each window and plot it
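
For reference, a minimal Python sketch of the script steps above. The function name `budget_burndown` and its parameters are hypothetical, the inputs stand in for the results of Query 1 and Query 2, and the SLO is assumed to be a percentage such as 99.9.

```python
def budget_burndown(total_requests, slo_percent, bad_per_window):
    """Return the % of error budget remaining after each window.

    total_requests: result of Query 1 (requests in the last X days).
    slo_percent:    SLO as a percentage (assumption), e.g. 99.9.
    bad_per_window: results of Query 2, one bad-request count per window.
    """
    # Allowed bad requests = TOTAL * (100 - SLO) / 100
    allowed_bad = total_requests * (100 - slo_percent) / 100
    if allowed_bad <= 0:
        raise ValueError("no error budget: SLO is 100% or there were no requests")

    remaining = []
    burned = 0
    for bad in bad_per_window:
        burned += bad  # cumulative bad requests so far
        remaining.append(100 - 100 * burned / allowed_bad)
    return remaining


if __name__ == "__main__":
    # 1000 requests at a 99% SLO allow 10 bad requests in total.
    print(budget_burndown(1000, 99, [2, 3, 5]))
```

The series goes negative once the budget is overspent, which may or may not be what you want to show on the chart.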

Is there a way to do it using a single Prometheus query and then plot it in Grafana?

Thanks

Juan Bran

Nov 2, 2020, 5:55:28 PM
to Prometheus Users
Wouldn't something like this work for you?

100 - sum (100 * sum_over_time(myservice_error_ratio[7d]) / (0.001 * 604800))

Note: the sum is not necessary and is merely there to obfuscate my test set's labels.

0.001 - your error budget as a ratio (a 99.9% SLO), either specified by hand or encoded as a recording rule
604800 - your reporting period in seconds (7 days), either specified by hand or populated by your dashboard
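
The arithmetic of the query can be sketched in Python as below. The function name `budget_remaining_percent` is hypothetical. One assumption worth stating: dividing the summed ratios by the period in seconds only yields an average error ratio if the series has one sample per second, so with a different sample interval the divisor would need to be the number of samples instead.

```python
def budget_remaining_percent(error_ratios, error_budget, period_seconds):
    """Mirror of: 100 - 100 * sum_over_time(ratio[period]) / (budget * period_s).

    error_ratios:   samples of the error ratio over the reporting period
                    (assumed here to be one sample per second).
    error_budget:   allowed error ratio, e.g. 0.001 for a 99.9% SLO.
    period_seconds: length of the reporting period in seconds.
    """
    consumed = 100 * sum(error_ratios) / (error_budget * period_seconds)
    return 100 - consumed


if __name__ == "__main__":
    # Four one-second samples at a 0.25 error ratio against a 0.5 budget:
    # half of the budget is consumed.
    print(budget_remaining_percent([0.25] * 4, 0.5, 4))
```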



If you are using Grafana, you can encode the range vector and reporting period with Grafana's built-in variables. These are populated by the time picker, so this may not be ideal for all cases.

100 - sum (100 * sum_over_time(myservice_error_ratio[$__range]) / (0.001 * $__range_s))

-Juan

Juan Bran

Nov 2, 2020, 5:57:38 PM
to Prometheus Users
Example of the above query in action with a test set. The top graph is the error ratio and the graph below is the error budget utilization over time. My attachment didn't seem to go through with the first reply.

[Attached image: Selection_386.png]