Question regarding Load-balanced Alertmanager Clusters

Matt Miller

Dec 3, 2021, 2:07:57 PM
to Prometheus Users
Hi,

In the documentation under the HA section, it mentions:
"It's important not to load balance traffic between Prometheus and its Alertmanagers, but instead, point Prometheus to a list of all Alertmanagers."

I'm curious if this is strictly for high availability and network partitioning concerns, or if there is a more functional reason that every Alertmanager member needs to receive the alerts. 

What prompted this question is that in our three-member HA Alertmanager cluster, which we've been sending alerts to via a load balancer (from multiple Prometheus instances), we've observed that the copies of a single given alert stored on each cluster member can have drastically different endsAt times (one to two minutes apart). We believe this may be contributing to randomly flapping alerts that Prometheus indicates have been firing the entire time.
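
One way to see this (a sketch; the hostname is a placeholder, and jq is only used for readability) is to query each member's API directly and compare the stored alerts:

  curl -s http://alertmanager-1:9093/api/v2/alerts | jq '.[] | {labels, startsAt, endsAt}'

Running the same query against each of the three members lets you compare the endsAt values side by side.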

Thanks.

Brian Brazil

Dec 3, 2021, 4:22:52 PM
to Matt Miller, Prometheus Users
On Fri, 3 Dec 2021 at 19:08, Matt Miller <msm...@gmail.com> wrote:
Hi,

In the documentation under the HA section, it mentions:
"It's important not to load balance traffic between Prometheus and its Alertmanagers, but instead, point Prometheus to a list of all Alertmanagers."

I'm curious if this is strictly for high availability and network partitioning concerns, or if there is a more functional reason that every Alertmanager member needs to receive the alerts. 

Both are true.
 

What prompted this question is that in our three-member HA Alertmanager cluster, which we've been sending alerts to via a load balancer (from multiple Prometheus instances), we've observed that the copies of a single given alert stored on each cluster member can have drastically different endsAt times (one to two minutes apart). We believe this may be contributing to randomly flapping alerts that Prometheus indicates have been firing the entire time.

This is what can happen if you don't follow the documentation. Alerts are not passed between Alertmanagers; you need to send them to all of them.
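
For example, a minimal sketch of that setup (hostnames and ports here are placeholders) lists every Alertmanager directly in the Prometheus configuration instead of a load balancer address:

  # prometheus.yml (sketch only; substitute your own Alertmanager targets)
  alerting:
    alertmanagers:
      - static_configs:
          - targets:
              - "alertmanager-1:9093"
              - "alertmanager-2:9093"
              - "alertmanager-3:9093"

Prometheus then delivers every alert to all three members itself.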

Brian
 

Thanks.


Matt Miller

Dec 3, 2021, 4:38:46 PM
to Prometheus Users
Thanks Brian,

From all of the documentation I found, it was unclear that alerts are not also gossiped; thanks for the clarification.

Brian Candler

Dec 4, 2021, 1:17:07 PM
to Prometheus Users
Just to note what it says here:

It's important not to load balance traffic between Prometheus and its Alertmanagers, but instead, point Prometheus to a list of all Alertmanagers.

Matthias Rampke

Dec 4, 2021, 7:52:21 PM
to Brian Candler, Prometheus Users
The technical reason for this admonition lies in how the Prometheus and Alertmanager combination implements high-availability notifications.

The design goal is to send a notification in all possible circumstances, and *if possible* only send one.

By spraying alerts to the list of all Alertmanager instances, each of these *can* send the notification even if Alertmanager clustering is completely broken, for example due to network partitions, misconfiguration, or some Alertmanager instances being unable to send out the notification.

Worst case, you get multiple notifications, one from each Alertmanager. Some downstream services, like PagerDuty, will do their own deduplication, so you may not even notice. In other cases, like Slack or email, you get multiple notifications, but that's much better than none!

Every time Prometheus evaluates an alert rule, and finds it to be firing, it will send an event to every Alertmanager it knows about, with an endsAt time a few minutes into the future. As this goes on, the updated endsAt keeps being a few minutes away.
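
Concretely (a rough sketch; the labels, timestamps, and hostname are made up, and the exact endsAt horizon depends on the Prometheus version and its resend settings), each such event is a POST to every configured Alertmanager along these lines:

  curl -X POST http://alertmanager-1:9093/api/v2/alerts \
    -H 'Content-Type: application/json' \
    -d '[
          {
            "labels": {"alertname": "HighErrorRate", "severity": "page"},
            "annotations": {"summary": "Error rate above threshold"},
            "startsAt": "2021-12-03T14:00:00Z",
            "endsAt": "2021-12-03T14:04:00Z"
          }
        ]'

If an Alertmanager stops receiving refreshed events for an alert, it considers the alert resolved once that endsAt passes.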

Each Alertmanager individually will determine what notifications (firing or resolved) should be sent. When clustering works, Alertmanagers will communicate which notifications have already been sent, so you only get one of each in the happy case.
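
For reference, "clustering" here is the peer mesh configured with the Alertmanager --cluster.* flags, for example (peer addresses are placeholders):

  alertmanager \
    --config.file=alertmanager.yml \
    --cluster.listen-address=0.0.0.0:9094 \
    --cluster.peer=alertmanager-2:9094 \
    --cluster.peer=alertmanager-3:9094

What is gossiped over that mesh is notification state and silences, not the alerts themselves.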

If you add a load balancer, only one Alertmanager will know that this alert even happened, and if for some reason it can't reach you, you may never know there was a problem.

This is somewhat mitigated in your case because Prometheus sends a new event on every rule evaluation cycle. Eventually these events will reach every Alertmanager instance at random, but not necessarily in time to prevent the previous event from timing out. These differing timeouts are what you have observed as different endsAt times.

So the underlying reason is as you say – high availability and network partitioning. The architecture to achieve that, with Prometheus repeatedly sending short-term events, means that randomly load balancing these to only one of the Alertmanager instances will lead to weird effects including spurious "resolved" notifications.

/MR


On Sat, Dec 4, 2021, 19:17 Brian Candler <b.ca...@pobox.com> wrote:
Just to note what it says here:

It's important not to load balance traffic between Prometheus and its Alertmanagers, but instead, point Prometheus to a list of all Alertmanagers.


Matt Miller

Dec 4, 2021, 8:08:26 PM
to Prometheus Users
Thank you for the further clarification. I think the crux of my issue was (wrongly) assuming that the documentation was telling me not to use a load balancer only because of HA/network-partitioning concerns, not because full Alertmanager cluster state isn't gossiped. I may try to put up a PR for the docs on Monday to clarify this; it would have saved us a bit of time debugging.