Prometheus alerting rules test for counters


Debashish Ghosh

Mar 8, 2020, 8:07:44 PM
to Prometheus Users
Hi,
   I have a few alerts created for some counter time series in Prometheus. I went through the basic alerting test examples on the Prometheus website, but they don't seem to work well with the counters I use for alerting. I use expressions on counters such as increase(), rate(), and sum(), and want to have test rules for these. I have attached my alerts file as well as my test file.
I am trying the most basic test of the counter custom_message_volume_endpoint_organization_total, where I set all the values to 0 so that my alert with expr - increase(custom_message_volume_endpoint_organization_total[2m]) == 0 stays zero for 15 minutes and then returns the alert. But it keeps returning blank.
Can you please help me with this?

Also I have one question regarding the difference between interval and evaluation_interval in the test file. Are they the same, and if not, what is the difference? I now understand the meaning of eval_time.

Thanks
Debashish
interop_alert_rule_test.yml
Prometheus_alerting_rules.yml

Brian Candler

Mar 9, 2020, 4:26:00 AM
to Prometheus Users
The great thing about Prometheus alerting rules is that you can just enter them into the GUI as normal queries.  If the graph is blank, there's no alert.  If it's non-blank (i.e. there are time series visible) then those are the time series which would trigger an alert.  This makes them easy to debug.
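
For example, pasting your alert expression straight into the expression browser (http://localhost:9090/graph on a default setup) shows at a glance which series, if any, would fire:

    increase(custom_message_volume_endpoint_organization_total[2m]) == 0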

Regarding your tests: firstly the test references "prometheus_alerting_rules.yml", but your file was called "Prometheus_alerting_rules.yml"

After fixing that, your alert fires if you set the evaluation time to 16m in the test.

('for: 15m' means the alert must have been firing continuously for 15 minutes, i.e. from t=X to t=X+15 inclusive; however, because you're checking for an increase over a time window, it doesn't start to fire until you have two data points, i.e. it starts at t=1 not t=0)

Then the only remaining problem is label mismatch:

Unit Testing:  interop_alert_rule_test.yml
  FAILED:
    alertname:NoMessageForAnOrganizationMod, time:16m0s,
        exp:"[Labels:{alertname=\"NoMessageForAnOrganizationMod\", instance=\"localhost:9090\", job=\"prometheus\", severity=\"moderate\"} Annotations:{summary=\"localhost:9090 of job prometheus shows no Messages for organization {{ $labels.custom_message_volume_endpoint_organization_total}}\"}]",
        got:"[Labels:{alertname=\"NoMessageForAnOrganizationMod\", instance=\"localhost:9090\", job=\"prometheus\", severity=\"moderate\"} Annotations:{summary=\"localhost:9090 of job prometheus shows no Messages for organization  \"}]"

I'm not sure what you're actually trying to put in the annotation, but if you want the value of the metric then you get it using $value not $labels.metric_name, e.g.

      summary: '{{ $labels.instance }} of job {{ $labels.job }} shows no Messages for organization {{ $value }} '

Your test must match the actual value returned, not the template string, e.g.

                  exp_annotations:
                      summary: "localhost:9090 of job prometheus shows no Messages for organization 0 "

Brian Candler

Mar 9, 2020, 5:07:09 AM
to Prometheus Users
BTW, I think that rule would be more robust against missing values by using

expr: increase(metric_name[15m]) == 0

instead of using "for:".  If you use "for:" then the condition must be true for every single evaluation, and a single missed sample may reset the alert.
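
As a rough sketch, using the metric, alert name and labels from your files (so treat the details as assumptions on my part), the rule would become:

    - alert: NoMessageForAnOrganizationMod
      # no "for:" needed - the 15-minute window lives in the range selector itself
      expr: increase(custom_message_volume_endpoint_organization_total[15m]) == 0
      labels:
        severity: moderate
      annotations:
        summary: '{{ $labels.instance }} of job {{ $labels.job }} shows no Messages for organization {{ $value }} '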

Debashish Ghosh

Mar 9, 2020, 10:47:28 AM
to Brian Candler, Prometheus Users
Thanks Brian, that answers most of my questions. Regarding using [15m] in the increase(): we purposely kept it at [2m] with a 15-minute "for:", since we really want the condition to be continuously true the whole time before triggering an alert, as opposed to being true only once.


Brian Candler

Mar 9, 2020, 11:18:48 AM
to Prometheus Users
On Monday, 9 March 2020 14:47:28 UTC, Debashish Ghosh wrote:
Thanks Brian, that answers most of my questions. Regarding using [15m] in the increase(): we purposely kept it at [2m] with a 15-minute "for:", since we really want the condition to be continuously true the whole time before triggering an alert, as opposed to being true only once.


Yes, but think about it.  You are evaluating the rule every minute.

In one case you are saying:

b-a == 0  (*)
c-b == 0
d-c == 0
... must be true 15 times in a row

What I'm recommending is you do the rate over 15 minutes, which means

q-a == 0

You can still evaluate this rule every 1 minute, and it will first trigger once the counter has been flat for 15 minutes.

I think you can see that in both cases, the counter must be continuously non-incrementing over 15 minutes to alert.  However, the second formulation is more stable in the face of any missed data collection.  metric[2m] will return no value if there are not two points within a 2 minute window.

(*) It's not exactly "b-a==0", because rate() or increase() will skip cases where the counter resets.

Debashish Ghosh

Mar 9, 2020, 3:41:55 PM
to Prometheus Users


This makes perfect sense for alerting. This is really helpful. I was just reusing the queries I have for my Grafana dashboard, where I really wanted plots that are more fine-grained, evaluated every 2 minutes.
I have another metric regarding an SLA that needs to be 99.95% or above. I am using the formula 100-(((30*24*60*60) - increase(process_uptime_seconds{job="Interop-InboundApi"}[30d]))/(30*24*60*60))*100, which means that the missing time (the total number of seconds in 30 days minus the number of seconds the server was up in the last 30 days) should be less than 0.05%. I am having difficulty writing a test for this since I see that it doesn't allow '1d' as an interval. Should I use something like 24*60m instead of 1d?
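
For reference, as an alerting rule this would be roughly the sketch below (the alert name is just a placeholder I made up):

    - alert: UptimeSLABreached   # placeholder name
      # percentage of the last 30 days the process was up; algebraically the same
      # as increase(process_uptime_seconds[30d]) / (30*24*60*60) * 100
      expr: 100 - (((30*24*60*60) - increase(process_uptime_seconds{job="Interop-InboundApi"}[30d])) / (30*24*60*60)) * 100 < 99.95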

I have a similar problem for the latency SLA. I am using a histogram for that and am trying to get the percentage of messages below the 1-second bucket. I am using the formula below:
sum(rate(http_server_requests_seconds_bucket{le="1.0",uri="/inboundapi/message/v2"}[30d])) by (job) / sum(rate(http_server_requests_seconds_count{uri="/inboundapi/message/v2"}[30d])) by (job) * 100
To test this too I need to use days in the interval.
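
As a rule, that one would be roughly the sketch below (again the alert name is a placeholder, and the 99.95 threshold is only an example; substitute the actual latency SLO):

    - alert: LatencySLABreached   # placeholder name
      # percentage of requests to the endpoint that completed within the <=1s bucket
      # over the last 30 days, per job
      expr: |
        sum(rate(http_server_requests_seconds_bucket{le="1.0",uri="/inboundapi/message/v2"}[30d])) by (job)
          / sum(rate(http_server_requests_seconds_count{uri="/inboundapi/message/v2"}[30d])) by (job)
          * 100 < 99.95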

Let me know your thoughts .

Thanks
Debashish
