Hello!
I'm wondering how to correctly use Alertmanager at scale.
I have 10 regions. In each region, a pair of Prometheus servers scrapes exactly the same set of applications (which are also local, located in that region).
Then, each region has a pair of HA Alertmanagers, which gossip with each other.
Each Prometheus is connected to the 2 Alertmanagers of its region.
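For context, this is how I double-check that wiring on each Prometheus, using its /api/v1/alertmanagers endpoint (the hostnames below are placeholders for my real instances):

```python
# Sanity check of the wiring: ask each regional Prometheus which
# Alertmanagers it is actually configured to send alerts to.
import json
import urllib.request

PROMETHEUS_SERVERS = [
    "http://prometheus-0.region-a.example:9090",  # placeholder hostnames
    "http://prometheus-1.region-a.example:9090",
]

for prom in PROMETHEUS_SERVERS:
    with urllib.request.urlopen(f"{prom}/api/v1/alertmanagers") as resp:
        data = json.load(resp)
    active = [am["url"] for am in data["data"]["activeAlertmanagers"]]
    # Each Prometheus should list exactly the 2 Alertmanagers of its region.
    print(prom, "->", active)
```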
In order to benefit from a global metrics view + object storage, we are using Thanos.
It works great.
But with that kind of architecture, how am I supposed to silence an alert?
I want silences to be propagated to all Alertmanagers worldwide. But if they are split into 10 clusters of 2 members each, this doesn't happen automatically.
How am I supposed to use the silencing system at scale? I can't afford to create a silence only in the one region where the alert is firing, because then I have no global view of all my silences and I may forget where they live. It becomes hard to manage, and sometimes I want to mute something globally, across several regions.
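To make the problem concrete: without global propagation, the only option I see is to script the creation of the same silence against one Alertmanager per region, roughly like this (hostnames are placeholders, and it assumes the v2 silences API is reachable):

```python
# Create the same silence in every regional Alertmanager cluster by hand,
# via the v2 API. Gossip inside each regional pair then propagates the
# silence to the second instance of that pair.
import json
import urllib.request
from datetime import datetime, timedelta, timezone

REGIONAL_ALERTMANAGERS = [
    "http://alertmanager-0.region-a.example:9093",  # placeholder hostnames,
    "http://alertmanager-0.region-b.example:9093",  # one per region
    # ... 10 entries in total
]

now = datetime.now(timezone.utc)
silence = {
    "matchers": [
        {"name": "alertname", "value": "HighLatency", "isRegex": False},
    ],
    "startsAt": now.isoformat(),
    "endsAt": (now + timedelta(hours=2)).isoformat(),
    "createdBy": "ops",
    "comment": "planned maintenance, muted in all regions",
}

for am in REGIONAL_ALERTMANAGERS:
    req = urllib.request.Request(
        f"{am}/api/v2/silences",
        data=json.dumps(silence).encode(),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        # Each regional cluster returns its own, independent silence ID,
        # which is exactly the bookkeeping I would like to avoid.
        print(am, json.load(resp)["silenceID"])
```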
The memberlist library used by Alertmanager seems to have been designed exactly for this: exchanging information between many nodes of a large cluster while keeping good performance.
So I then tried to connect all 20 Alertmanagers to the same gossip cluster, the goal being to make them propagate their silences automatically.
While doing so, I made sure that each pair of Prometheus servers remains connected ONLY to the 2 Alertmanagers of its own region.
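Concretely, each Alertmanager is started with a --cluster.peer flag for every other instance; I generate the flag list with something along these lines (the naming scheme is made up, 9094 is the default gossip port):

```python
# Generate the --cluster.peer flags so that every Alertmanager joins the
# same memberlist/gossip cluster (10 regions x 2 instances = 20 peers).
REGIONS = [f"region-{i:02d}" for i in range(10)]  # placeholder region names

peers = [
    f"alertmanager-{n}.{region}.example:9094"
    for region in REGIONS
    for n in range(2)
]

flags = [f"--cluster.peer={peer}" for peer in peers]
print("\n".join(flags))
```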
=> It works well and it does what I want:
- Silences are propagated everywhere
- Alerts are gossiped to all nodes, but the other regions never act on an alert they receive only via gossip and not from a Prometheus.
(If I understand correctly, an Alertmanager will never take responsibility for notifying about an alert it has not received directly from a Prometheus.)
But then I noticed that in the Alertmanager implementation there is a timer that depends on each node's index position in the memberlist cluster: an Alertmanager receiving an alert from Prometheus will wait 5s times its index in the cluster before notifying.
It means that if one region's Alertmanagers have the two highest indexes in the cluster, I introduce a delay of up to 19 × 5s = 95s before the notification can be sent.
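Back-of-the-envelope, using the 5s interval I observed (which I believe corresponds to the cluster peer timeout setting), the wait per position looks like this:

```python
# The notification wait grows linearly with the node's position in the
# memberlist cluster. With 20 members and a 5s interval, the last node
# waits ~95s before it is allowed to notify.
PEER_TIMEOUT_S = 5   # interval I observed per position
CLUSTER_SIZE = 20    # 10 regions x 2 Alertmanagers

for position in range(CLUSTER_SIZE):
    wait_s = position * PEER_TIMEOUT_S
    print(f"position {position:2d}: waits {wait_s:3d}s before notifying")
```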
In the official README of the GitHub project, it is clearly stated:
Important: Do not load balance traffic between Prometheus and its
Alertmanagers, but instead point Prometheus to a list of all
Alertmanagers. The Alertmanager implementation expects all alerts to be
sent to all Alertmanagers to ensure high availability.
Do you have advice on how to handle "silencing at scale" with Alertmanager?
Usually, we say that Prometheus does not handle scale (beyond one node) because it focuses on doing its job correctly and very efficiently (a single Prometheus can ingest millions of samples and be very good at it).
That's why responsibilities are split across separate tools, and Thanos/Cortex can come to the rescue in that case.
But in Thanos, I see no component designed to make Alertmanager scalable.
Connecting ALL 20 Prometheus servers to ALL 20 Alertmanagers seems a bit overkill to me.
I think it would make the setup less robust: I would be more exposed to network partitions, and therefore more likely to see alert deduplication fail (i.e. a higher probability of being notified twice for the same alert because a partition occurred somewhere).
Is it a good idea to connect all the Alertmanagers of the different regions to the same memberlist cluster, while at the same time keeping only the 2 regional Prometheus servers connected to each Alertmanager?
Thank you for your advice!
Regards