HA alertmanager clusters may merge into one if they run in the same flat network

Андрей Еньшин

Sep 9, 2021, 4:33:45 AM
to Prometheus Developers
Hi prometheus folks,

I have a question about alertmanager.

Here is a one-year-old issue about several HA Alertmanager clusters merging into one big cluster over time: https://github.com/prometheus/alertmanager/issues/2250

I managed to reproduce it on my local k8s kind cluster. There seems to be a small discrepancy between the list of peers reported by the gossip library and the list of peers from the Alertmanager config file.

We can work around it with a k8s network policy. However, a more proper fix would be on the Alertmanager side: keep an eye on the number of peers and compare it with the desired number. If the state is unexpected, clear the peer table, redo the DNS resolution, and form a new peer table. Maybe there is a better solution. What do you think?
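
To make the idea concrete, here is a rough Go sketch of the check I have in mind. It is not Alertmanager or memberlist code: `unexpectedPeers`, `decide`, `PeerAction` and the example addresses are all hypothetical names made up for illustration. It assumes we already have the member list from gossip, the freshly resolved DNS addresses, and a desired peer count. When DNS itself does not return the desired count, the sketch simply leaves the table alone, which is only one possible choice.

```go
package main

import (
	"fmt"
	"sort"
)

// PeerAction is a hypothetical result type: what the reconciler would do.
type PeerAction int

const (
	Keep    PeerAction = iota // membership looks as expected, do nothing
	Rebuild                   // clear the peer table and rejoin from DNS
)

// unexpectedPeers returns, in sorted order, the gossip members that are
// absent from the freshly resolved DNS set.
func unexpectedPeers(gossip, resolved []string) []string {
	want := make(map[string]bool, len(resolved))
	for _, addr := range resolved {
		want[addr] = true
	}
	var extra []string
	for _, addr := range gossip {
		if !want[addr] {
			extra = append(extra, addr)
		}
	}
	sort.Strings(extra)
	return extra
}

// decide compares the gossip view with the desired state.
// Assumption: if DNS does not return the desired number of peers,
// rebuilding from DNS would not help, so the table is kept.
func decide(gossip, resolved []string, desired int) PeerAction {
	if len(resolved) != desired {
		return Keep
	}
	if len(gossip) != desired || len(unexpectedPeers(gossip, resolved)) > 0 {
		return Rebuild
	}
	return Keep
}

func main() {
	// Example addresses are invented; the third one belongs to a foreign cluster.
	gossip := []string{"10.0.0.1:9094", "10.0.0.2:9094", "10.0.1.7:9094"}
	resolved := []string{"10.0.0.1:9094", "10.0.0.2:9094"}

	fmt.Println("unexpected peers:", unexpectedPeers(gossip, resolved))
	fmt.Println("rebuild needed:", decide(gossip, resolved, 2) == Rebuild)
}
```

A real implementation would also have to decide how often to run this check and how to avoid flapping while pods restart.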

I could probably even open a PR if we can agree on a way to fix it and someone can support me with review : )

Matthias Rampke

Sep 20, 2021, 5:11:26 AM
to Андрей Еньшин, Prometheus Developers
What should happen if the DNS resolution does not result in the expected number of peers either? How would a deliberate shrinking or growing of a cluster work?

Another solution I have seen (e.g. in Cassandra) is to have a cluster identity, such as a cluster name. Instances would refuse to talk to other instances if they announce the wrong cluster name.

There could be a default cluster name (or a special case for when it's empty), so that it doesn't change anything for single-cluster use cases. It should also support the transition from older versions, or no cluster name, to a named cluster, with a rolling restart.

/MR

Андрей Еньшин

Nov 5, 2021, 10:16:24 AM
to Prometheus Developers
A cluster ID seems to be a good solution. Another option is to use mTLS for gossip communication, which is a bit harder.
