Hi folks,
I'm running 2 Alertmanagers (AM), version 0.15.2, in HA mode across 2 separate Kubernetes clusters with a firewall in between.
So AM 1 is in cluster 1 with an external IP, AM 2 in cluster 2 with an external IP.
It worked quite well in v0.14.x, but I'm struggling since the cluster/mesh library was changed.
The logs show constant flapping, and quite a lot of different source ports are being used to probe the other AM:
level=debug ts=2018-09-03T15:39:01.357299612Z caller=delegate.go:209 component=cluster received=NotifyJoin node=01CNEVYV8WWV5QTE5FMPFYYRSH addr=AM1_IP:8001
level=debug ts=2018-09-03T15:39:01.357340819Z caller=cluster.go:417 component=cluster msg="peer rejoined" peer=01CNEVYV8WWV5QTE5FMPFYYRSH
level=debug ts=2018-09-03T15:39:06.40030404Z caller=cluster.go:287 component=cluster memberlist="2018/09/03 15:39:06 [DEBUG] memberlist: Failed ping: 01CNEVYV8WWV5QTE5FMPFYYRSH (timeout reached)\n"
level=debug ts=2018-09-03T15:39:07.107914474Z caller=cluster.go:287 component=cluster memberlist="2018/09/03 15:39:07 [DEBUG] memberlist: Stream connection from=AM2_IP:40644\n"
level=debug ts=2018-09-03T15:39:07.40015538Z caller=cluster.go:287 component=cluster memberlist="2018/09/03 15:39:07 [WARN] memberlist: Was able to connect to 01CNEVYV8WWV5QTE5FMPFYYRSH but other probes failed, network may be misconfigured\n"
level=debug ts=2018-09-03T15:39:12.422832049Z caller=cluster.go:287 component=cluster memberlist="2018/09/03 15:39:12 [ERR] memberlist: Failed fallback ping: write tcp AM2_IP:33836->AM1_IP:8001: i/o timeout\n"
level=debug ts=2018-09-03T15:39:12.42289798Z caller=cluster.go:287 component=cluster memberlist="2018/09/03 15:39:12 [INFO] memberlist: Suspect 01CNEVYV8WWV5QTE5FMPFYYRSH has failed, no acks received\n"
level=debug ts=2018-09-03T15:39:14.107856472Z caller=cluster.go:287 component=cluster memberlist="2018/09/03 15:39:14 [DEBUG] memberlist: Stream connection from=AM2_IP:41956\n"
level=debug ts=2018-09-03T15:39:16.423064557Z caller=cluster.go:287 component=cluster memberlist="2018/09/03 15:39:16 [INFO] memberlist: Marking 01CNEVYV8WWV5QTE5FMPFYYRSH as failed, suspect timeout reached (0 peer confirmations)\n"
level=debug ts=2018-09-03T15:39:16.423137193Z caller=delegate.go:215 component=cluster received=NotifyLeave node=01CNEVYV8WWV5QTE5FMPFYYRSH addr=AM1_IP:8001
level=debug ts=2018-09-03T15:39:16.423153606Z caller=cluster.go:439 component=cluster msg="peer left" peer=01CNEVYV8WWV5QTE5FMPFYYRSH
level=debug ts=2018-09-03T15:39:18.400277226Z caller=cluster.go:287 component=cluster memberlist="2018/09/03 15:39:18 [DEBUG] memberlist: Failed ping: 01CNEVYV8WWV5QTE5FMPFYYRSH (timeout reached)\n"
level=debug ts=2018-09-03T15:39:19.400186537Z caller=cluster.go:287 component=cluster memberlist="2018/09/03 15:39:19 [WARN] memberlist: Was able to connect to 01CNEVYV8WWV5QTE5FMPFYYRSH but other probes failed, network may be misconfigured\n"
level=debug ts=2018-09-03T15:39:20.107980379Z caller=cluster.go:287 component=cluster memberlist="2018/09/03 15:39:20 [DEBUG] memberlist: Stream connection from=AM2_IP:46858\n"
level=debug ts=2018-09-03T15:39:21.127934103Z caller=cluster.go:287 component=cluster memberlist="2018/09/03 15:39:21 [DEBUG] memberlist: Stream connection from=AM2_IP:47076\n"
level=debug ts=2018-09-03T15:39:21.313821061Z caller=cluster.go:287 component=cluster memberlist="2018/09/03 15:39:21 [WARN] memberlist: Refuting a suspect message (from: 01CNGCPBWMP8SSXJY9W3VWBW5Z)\n"
level=debug ts=2018-09-03T15:39:23.505397942Z caller=cluster.go:287 component=cluster memberlist="2018/09/03 15:39:23 [DEBUG] memberlist: Initiating push/pull sync with: AM1_IP:8001\n"
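To put a number on the flapping, the join/leave events in a log excerpt like the one above can be tallied per peer. This is only a minimal sketch: the helper name `count_membership_events` is made up for illustration, and it assumes the `received=NotifyJoin` / `received=NotifyLeave` fields appear exactly as in the lines above.

```python
import re
from collections import Counter

# Matches the delegate.go lines shown above, e.g.
#   ... received=NotifyJoin node=01CNEV... addr=AM1_IP:8001
EVENT_RE = re.compile(r"received=(NotifyJoin|NotifyLeave)\s+node=(\S+)")


def count_membership_events(lines):
    """Count NotifyJoin/NotifyLeave events per node ID.

    Returns a Counter keyed by (node_id, event_name). Lines that do not
    contain a join/leave event are ignored.
    """
    counts = Counter()
    for line in lines:
        match = EVENT_RE.search(line)
        if match:
            event, node = match.groups()
            counts[(node, event)] += 1
    return counts
```

Feeding it the full debug log (for example `count_membership_events(open("am.log"))`, where `am.log` is a hypothetical file name) gives a rough idea of how often each peer joins and leaves.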
The documentation is not very explicit here either, so I was wondering which TCP/UDP ports are used.
Alertmanager deployment specs are here (1).
AM 1 is pointed at AM 2 and vice versa:
...
--cluster.listen-address=:8001
--cluster.advertise-address=<AM2_IP>:8001
--cluster.peer=<AM2_IP>:8001
...
So obviously port 8001 is required.
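A quick way to confirm that the TCP side of port 8001 is open through the firewall is a plain connect test run from the other cluster. The sketch below is only an assumption-laden example: `tcp_reachable` is a hypothetical helper, the host is a placeholder, and it checks TCP only, so it says nothing about whether UDP traffic gets through.

```python
import socket


def tcp_reachable(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout.

    This only verifies the TCP handshake; it does not test UDP at all.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and unreachable hosts.
        return False
```

For example, `tcp_reachable("<AM1_IP>", 8001)` run from the AM 2 pod (with the placeholder replaced by the real address) should return `True` if the firewall lets TCP through.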
What am I missing?
Thanks a lot for the help!
---