Alertmanager: flapping on resolve when running multiple prometheus replicas

39 views

Skip to first unread message

T.G.

unread,

Sep 1, 2021, 5:25:20 AM9/1/21

to Prometheus Users

For redundancy, we are running two replicas of Prometheus which then alerts to a HA Alertmanager cluster.

However, we noticed that we experienced flapping when an alert would resolve. This is what it looks like from the point of vue of a receiver:

* Alert is resolved

* Alert is reopened immediatly

* Alert is resolved immediatly

Usually this will last < 30 seconds

We are pretty sure this is due to the alert being evaluated as resolved by one of the replicas but not by the other.

We tried to increase the group interval but it seems that a solved status is forwarded immediately to receivers, independently of that

Is there a setting in Alertmanager we can use to prevent that? Here are the settings we have I think might be relevant:

- Prometheus scrape interval: 15s

- Prometheus evaluation interval: 1m
- Alertmanager group_wait: 20s
- Alertmanager group_interval: 1m