Alertmanager sends an alert if a single failed poll occurs inside the 5m period

Steven Relf

Dec 19, 2024, 7:48:34 AM

to Prometheus Users

Hey,

Having an interesting issue with Prometheus and Alertmanager. I'm 99% sure it's a config issue, but I'm having a hard time figuring it out.

We have a group of polls that use the blackbox exporter to ping some endpoints. Each endpoint is pinged once every 30 seconds.

The rule looks like this:

- name: blackbox.rules.icmpFailed
  rules:
    - alert: BlackboxIcmpFailed
      expr: probe_icmp_duration_seconds == 0
      for: 5m
      labels:
        severity: critical
      annotations:
        summary: Ping to Device Failed.

And our Alertmanager config looks like this:

spec:
  route:
    groupBy: [ 'instance','severity' ]
    groupWait: 30s
    groupInterval: 5m
    repeatInterval: 12h

Now here is what I am seeing.

If we have a single ping failure, an alert message is sent to Slack, and it then clears on the next 5 minute cycle.

I thought having "for: 5m" should mean that an alert is ONLY sent if the condition has been true for 5 consecutive minutes. As you can imagine, this leads to lots of angst :D

Any ideas?

Brian Candler

Dec 19, 2024, 8:33:25 AM

to Prometheus Users

As a starting point, put the expression "probe_icmp_duration_seconds == 0" into the expression browser in the Prometheus web UI, and zoom in on the time range where the alert fired. What do you see?

One possible issue is if the timeseries appears and disappears; 5 minutes happens to be the default staleness interval (lookback-delta). Another problem is that from a single scrape, probe_icmp_duration_seconds has multiple values with different labels:

probe_icmp_duration_seconds{phase="resolve"} 1.5765e-05
probe_icmp_duration_seconds{phase="rtt"} 0
probe_icmp_duration_seconds{phase="setup"} 8.742e-05

For both these reasons, it would be safer to use probe_success == 0 as your alerting expression. If the problem still exists with that, it should be easier to debug.
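A minimal sketch of what such a rule could look like, reusing the "for" duration, labels, and annotation from the original rule. The group and alert names are made up for illustration, and the top-level "groups:" key assumes a plain Prometheus rule file rather than whatever wrapper your deployment uses:

```yaml
groups:
  - name: blackbox.rules.probeFailed   # hypothetical group name
    rules:
      - alert: BlackboxProbeFailed     # hypothetical alert name
        # probe_success is a single series per probed target (no "phase"
        # label), and the blackbox exporter sets it to 1 or 0 on each probe.
        expr: probe_success == 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: Ping to Device Failed.
```

Because there is only one probe_success series per target, the expression cannot match on one phase while the others are non-zero, which makes the alert's pending and firing transitions easier to follow in the expression browser.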

Also, does the rule group have a non-default rule evaluation interval, or have you globally set the default evaluation interval?
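For reference, these are the two places where the evaluation interval can be set in a plain Prometheus setup. The values below are illustrative assumptions, not taken from the configuration in this thread:

```yaml
# prometheus.yml: default evaluation interval for all rule groups.
# If unset, Prometheus uses 1m.
global:
  evaluation_interval: 1m        # illustrative value
---
# Rule file: a per-group "interval" overrides the global default
# for that group only.
groups:
  - name: blackbox.rules.icmpFailed
    interval: 30s                # illustrative value
    rules:
      - alert: BlackboxIcmpFailed
        expr: probe_icmp_duration_seconds == 0
        for: 5m
```

The "for" clause is checked once per evaluation, so the interval in effect determines how often the rule gets a chance to see the condition go away and reset the pending state.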

Steven Relf

Dec 19, 2024, 8:49:49 AM

to Brian Candler, Prometheus Users

Brian, you sir are a gentleman and a scholar.

I have removed that rule, as I already have the other rule you suggested in place, so there is no need for the bouncy one.

Rgds

Steve.
