Hey,
Having an interesting issue with Prometheus and Alertmanager. I'm 99% sure it's a config issue, but I'm having a hard time figuring it out.
We have a group of polls that use the blackbox exporter to ping some endpoints. It pings once every 30 seconds.
The rule looks like this:

- name: blackbox.rules.icmpFailed
  rules:
    - alert: BlackboxIcmpFailed
      expr: probe_icmp_duration_seconds == 0
      for: 5m
      labels:
        severity: critical
      annotations:
        summary: Ping to Device Failed.
And our Alertmanager config looks like this:

spec:
  route:
    groupBy: [ 'instance','severity' ]
    groupWait: 30s
    groupInterval: 5m
    repeatInterval: 12h
Now here is what I am seeing.
If we have a single ping failure, an alert message is sent to Slack, and it clears immediately on the next 5-minute cycle.
I thought having "for: 5m" should mean that an alert is ONLY sent if the condition has held for 5 consecutive minutes. As you can imagine, this leads to lots of angst :D
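To pin down what I expect from "for: 5m", here is a minimal sketch of a promtool rule unit test. Treat it as an illustration under assumptions, not our real setup: the file name blackbox.rules.yml, the instance value 10.0.0.1, and the sample values are made up, and it assumes the rule group above is saved as a plain rules file under a top-level "groups:" key. The first case feeds one failed ping among healthy ones and expects no alert; the second feeds a sustained failure and expects the alert to be firing.

```yaml
# Hypothetical test file, e.g. blackbox.rules.test.yml
rule_files:
  - blackbox.rules.yml        # assumed file name holding the rule group above

evaluation_interval: 30s      # assumed to match the 30s probe interval

tests:
  # Case 1: a single failed ping (one 0 sample) surrounded by healthy samples.
  - interval: 30s
    input_series:
      - series: 'probe_icmp_duration_seconds{instance="10.0.0.1"}'
        values: '0.01 0 0.01+0x20'
    alert_rule_test:
      - eval_time: 6m
        alertname: BlackboxIcmpFailed
        exp_alerts: []        # expectation: nothing firing

  # Case 2: the ping fails continuously from t=30s onwards.
  - interval: 30s
    input_series:
      - series: 'probe_icmp_duration_seconds{instance="10.0.0.1"}'
        values: '0.01 0+0x20'
    alert_rule_test:
      - eval_time: 6m
        alertname: BlackboxIcmpFailed
        exp_alerts:
          - exp_labels:
              severity: critical
              instance: 10.0.0.1
            exp_annotations:
              summary: Ping to Device Failed.
```

This would be run with "promtool test rules blackbox.rules.test.yml". If both cases pass, the rule itself behaves the way I described, and the stray notifications would have to come from somewhere else (the real series, or the routing).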
Any ideas?