Alarm Refresh

wei zhang

Aug 30, 2021, 10:49:26 PM

to Prometheus Users

What did you do?
I run Prometheus. Because of resource problems, some metrics are constantly at or above their alert thresholds.

What did you expect to see?
I expected those alerts to keep firing without interruption.

What did you see instead? Under which circumstances?
After a while, some of the alerts resolved and then fired again a few minutes later. Throughout this period, the metrics stayed at or above the alert threshold.

Environment
A multi-node federated cluster

Prometheus version:
2.15

Brian Candler

Aug 31, 2021, 3:39:25 AM

to Prometheus Users

If an alert goes away, even for one rule evaluation cycle, it's immediately resolved. I'm guessing this is what has happened here. You can prove it by entering the alerting expression in the expression browser in the Prometheus web UI, graphing it over the time when this was happening, and seeing if the alert value goes away briefly.

Personally I would love to see alerts go into a "resolving" state so that alerts which are mostly "fail" with occasional "success" or "don't know" don't keep re-alerting. There is some discussion here:

https://github.com/prometheus/alertmanager/issues/204

(although if the feature were implemented as I just described, then it would be implemented in Prometheus rather than Alertmanager)

For now, it's up to you to write more complex alerting rules using history, such as (avg|sum|count|min|max)_over_time with a range vector, so that the alerts stay firing.
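
As a rough sketch of that approach, a rule along these lines might work. The group name, alert name, metric name (my_resource_usage_ratio), threshold, and window lengths are all made up for illustration and are not from the original post; substitute your own expression.

```yaml
groups:
  - name: example-resource-alerts        # hypothetical group name
    rules:
      - alert: ResourceUsageHigh         # hypothetical alert name
        # max_over_time looks at every sample in the last 10 minutes, so the
        # expression keeps returning a result even if the metric briefly dips
        # below the threshold or one evaluation finds no fresh sample.
        # my_resource_usage_ratio is an assumed metric, 0.9 an assumed threshold.
        expr: max_over_time(my_resource_usage_ratio[10m]) > 0.9
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Resource usage exceeded 90% within the last 10 minutes"
```

The trade-off is that the alert also keeps firing for up to the length of the window (10 minutes here) after the underlying problem has really cleared, so pick a window just long enough to cover the gaps you see when graphing the expression.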
