Alertmanager cluster ports

Arno Uhlig

Sep 3, 2018, 12:12:44 PM

to Prometheus Users

Hi folks,

I'm running 2 Alertmanagers (AM), version 0.15.2, in HA mode in 2 separate Kubernetes clusters with a firewall in between.

So AM 1 is in cluster 1 with an external IP, AM 2 in cluster 2 with an external IP.

It worked quite well in v0.14.x, but I'm struggling since the cluster/mesh library was changed.

The logs show constant flapping and quite a lot of ports being used to probe the other AM:

level=debug ts=2018-09-03T15:39:01.357299612Z caller=delegate.go:209 component=cluster received=NotifyJoin node=01CNEVYV8WWV5QTE5FMPFYYRSH addr=AM1_IP:8001
level=debug ts=2018-09-03T15:39:01.357340819Z caller=cluster.go:417 component=cluster msg="peer rejoined" peer=01CNEVYV8WWV5QTE5FMPFYYRSH
level=debug ts=2018-09-03T15:39:06.40030404Z caller=cluster.go:287 component=cluster memberlist="2018/09/03 15:39:06 [DEBUG] memberlist: Failed ping: 01CNEVYV8WWV5QTE5FMPFYYRSH (timeout reached)\n"
level=debug ts=2018-09-03T15:39:07.107914474Z caller=cluster.go:287 component=cluster memberlist="2018/09/03 15:39:07 [DEBUG] memberlist: Stream connection from=AM2_IP:40644\n"
level=debug ts=2018-09-03T15:39:07.40015538Z caller=cluster.go:287 component=cluster memberlist="2018/09/03 15:39:07 [WARN] memberlist: Was able to connect to 01CNEVYV8WWV5QTE5FMPFYYRSH but other probes failed, network may be misconfigured\n"
level=debug ts=2018-09-03T15:39:12.422832049Z caller=cluster.go:287 component=cluster memberlist="2018/09/03 15:39:12 [ERR] memberlist: Failed fallback ping: write tcp AM2_IP:33836->AM1_IP:8001: i/o timeout\n"
level=debug ts=2018-09-03T15:39:12.42289798Z caller=cluster.go:287 component=cluster memberlist="2018/09/03 15:39:12 [INFO] memberlist: Suspect 01CNEVYV8WWV5QTE5FMPFYYRSH has failed, no acks received\n"
level=debug ts=2018-09-03T15:39:14.107856472Z caller=cluster.go:287 component=cluster memberlist="2018/09/03 15:39:14 [DEBUG] memberlist: Stream connection from=AM2_IP:41956\n"
level=debug ts=2018-09-03T15:39:16.423064557Z caller=cluster.go:287 component=cluster memberlist="2018/09/03 15:39:16 [INFO] memberlist: Marking 01CNEVYV8WWV5QTE5FMPFYYRSH as failed, suspect timeout reached (0 peer confirmations)\n"
level=debug ts=2018-09-03T15:39:16.423137193Z caller=delegate.go:215 component=cluster received=NotifyLeave node=01CNEVYV8WWV5QTE5FMPFYYRSH addr=AM1_IP:8001
level=debug ts=2018-09-03T15:39:16.423153606Z caller=cluster.go:439 component=cluster msg="peer left" peer=01CNEVYV8WWV5QTE5FMPFYYRSH
level=debug ts=2018-09-03T15:39:18.400277226Z caller=cluster.go:287 component=cluster memberlist="2018/09/03 15:39:18 [DEBUG] memberlist: Failed ping: 01CNEVYV8WWV5QTE5FMPFYYRSH (timeout reached)\n"
level=debug ts=2018-09-03T15:39:19.400186537Z caller=cluster.go:287 component=cluster memberlist="2018/09/03 15:39:19 [WARN] memberlist: Was able to connect to 01CNEVYV8WWV5QTE5FMPFYYRSH but other probes failed, network may be misconfigured\n"
level=debug ts=2018-09-03T15:39:20.107980379Z caller=cluster.go:287 component=cluster memberlist="2018/09/03 15:39:20 [DEBUG] memberlist: Stream connection from=AM2_IP:46858\n"
level=debug ts=2018-09-03T15:39:21.127934103Z caller=cluster.go:287 component=cluster memberlist="2018/09/03 15:39:21 [DEBUG] memberlist: Stream connection from=AM2_IP:47076\n"
level=debug ts=2018-09-03T15:39:21.313821061Z caller=cluster.go:287 component=cluster memberlist="2018/09/03 15:39:21 [WARN] memberlist: Refuting a suspect message (from: 01CNGCPBWMP8SSXJY9W3VWBW5Z)\n"
level=debug ts=2018-09-03T15:39:23.505397942Z caller=cluster.go:287 component=cluster memberlist="2018/09/03 15:39:23 [DEBUG] memberlist: Initiating push/pull sync with: AM1_IP:8001\n"
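The join/leave cycle in an excerpt like this can be counted with a small script. The following is only a minimal sketch: it assumes the debug log format shown above (`received=NotifyJoin` / `received=NotifyLeave` followed by `node=...`), and the function names are hypothetical helpers, not part of Alertmanager.

```python
import re
from collections import Counter

# Matches the cluster delegate events seen in the debug log above, e.g.
# "received=NotifyJoin node=01CNEVYV8WWV5QTE5FMPFYYRSH addr=AM1_IP:8001".
EVENT_RE = re.compile(r"received=(NotifyJoin|NotifyLeave) node=(\S+)")


def count_membership_events(lines):
    """Return {node_id: Counter} with join/leave counts per node."""
    events = {}
    for line in lines:
        match = EVENT_RE.search(line)
        if match:
            kind, node = match.groups()
            events.setdefault(node, Counter())[kind] += 1
    return events


def flapping_nodes(lines, threshold=1):
    """Return nodes that both joined and left at least `threshold` times."""
    return sorted(
        node
        for node, counts in count_membership_events(lines).items()
        if min(counts["NotifyJoin"], counts["NotifyLeave"]) >= threshold
    )
```

Feeding the Alertmanager log lines to `flapping_nodes` returns the peer IDs that keep rejoining and leaving, which is the pattern described above.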

The documentation is also not very explicit here, so I was wondering which TCP/UDP ports are used.

Alertmanager deployment specs are here (1).

AM 1 is pointed to AM 2 and vice-versa:

...
--cluster.listen-address=:8001
--cluster.advertise-address=<AM2_IP>:8001
--cluster.peer=<AM2_IP>:8001
...

So obviously port 8001 is required.

What am I missing?

Thanks a lot for the help!

---

(1) Deployment spec: https://github.com/sapcc/helm-charts/blob/master/global/prometheus-alertmanager/templates/deployment.yaml

Simon Pasquier

Sep 4, 2018, 4:18:04 AM

to Arno Uhlig, Prometheus Users

The cluster library uses UDP and TCP so you need to open the 8001 port for both protocols.
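One way to check both protocols through a firewall is sketched below. It is only an illustration under stated assumptions: the function names are hypothetical, and the UDP check needs a temporary echo listener on the far side (here `udp_echo_once`), because Alertmanager itself is not expected to answer arbitrary datagrams. To test the real cluster port this way, the echo listener would have to be bound to it while Alertmanager is stopped.

```python
import socket


def tcp_port_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port can be established."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def udp_echo_ok(host, port, payload=b"ping", timeout=3.0):
    """Return True if a datagram sent to host:port is echoed back.

    Assumption: a temporary echo listener is running on the far side.
    A timeout means the datagram or its reply was dropped on the way.
    """
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        try:
            sock.sendto(payload, (host, port))
            data, _ = sock.recvfrom(1024)
        except OSError:
            return False
        return data == payload


def udp_echo_once(sock):
    """Far-side helper: echo a single datagram back to its sender."""
    data, addr = sock.recvfrom(1024)
    sock.sendto(data, addr)
```

If `tcp_port_open` succeeds while `udp_echo_ok` times out against the same port, that would point at UDP being filtered somewhere between the two hosts.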

--
You received this message because you are subscribed to the Google Groups "Prometheus Users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to prometheus-users+unsubscribe@googlegroups.com.
To post to this group, send email to prometheus-users@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/prometheus-users/a02fd9eb-7f37-401c-8dc5-0ea98a823bca%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Arno Uhlig

Sep 18, 2018, 8:17:15 AM

to Prometheus Users

Hi Simon,

Thanks for confirming.

After double-checking with colleagues from the network team, it turns out the firewall was blocking UDP traffic -.-


Arno Uhlig

Sep 26, 2018, 8:10:31 AM

to Prometheus Users

While the initial sync between the Alertmanagers is successful, I'm seeing Alertmanager #1 trying to ping Alertmanager #2 on a port other than the configured one:

...
--cluster.listen-address=:8001
--cluster.advertise-address=<AM2_IP>:8001
--cluster.peer=<AM2_IP>:8001
...

The log shows port :47512 is used. Sometimes also :48xyz | :49xyz | :50xyz.

level=debug ts=2018-09-26T11:56:32.140721252Z caller=cluster.go:287 component=cluster memberlist="2018/09/26 11:56:32 [ERR] memberlist: Failed fallback ping: write tcp <IP AM 1>:47512-><IP AM 2>:8001: i/o timeout\n"

Shouldn't the Alertmanagers just talk to each other on port 8001?

On Tuesday, September 4, 2018 at 10:18:04 UTC+2, Simon Pasquier wrote:


Brian Brazil

Sep 26, 2018, 8:19:42 AM

to Arno Uhlig, Prometheus Users

On Wed, 26 Sep 2018 at 13:10, Arno Uhlig <arno....@gmail.com> wrote:

While the initial sync of the alertmanager is successful I'm seeing the alertmanager #1 trying to ping the alertmanager #2 on another port than configured:
...
--cluster.listen-address=:8001
--cluster.advertise-address=<AM2_IP>:8001
--cluster.peer=<AM2_IP>:8001
...

The log shows port :47512 is used. Sometimes also :48xyz | :49xyz | :50xyz.

level=debug ts=2018-09-26T11:56:32.140721252Z caller=cluster.go:287 component=cluster memberlist="2018/09/26 11:56:32 [ERR] memberlist: Failed fallback ping: write tcp <IP AM 1>:47512-><IP AM 2>:8001: i/o timeout\n"

Shouldn't the alertmanagers just talk to each other on port :8001 ?

47512 here is the TCP source port; the connection is going to port 8001. Source ports are usually randomly assigned, so this port number is normal.
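The difference between source and destination port can be seen with a minimal local sketch. It assumes nothing about Alertmanager: a throwaway listener on 127.0.0.1 stands in for the peer on port 8001, and `show_ports` is a hypothetical helper name.

```python
import socket


def show_ports():
    """Connect to a local listener and return (source_port, destination_port)."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as listener:
        # Port 0 lets the OS pick a free port; it plays the role of 8001 here.
        listener.bind(("127.0.0.1", 0))
        listener.listen(1)
        dest_port = listener.getsockname()[1]
        with socket.create_connection(("127.0.0.1", dest_port)) as client:
            # The client side never chose a port: the OS assigned one.
            source_port = client.getsockname()[1]
            peer_port = client.getpeername()[1]
    # The remote end of the client socket is the listener's port.
    assert peer_port == dest_port
    return source_port, dest_port


if __name__ == "__main__":
    src, dst = show_ports()
    print(f"write tcp 127.0.0.1:{src}->127.0.0.1:{dst}")
```

The printed line has the same shape as the `write tcp <IP AM 1>:47512-><IP AM 2>:8001` part of the log message: the left-hand port changes from run to run, while the right-hand port is the one the listener is bound to.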

Brian


--

Brian Brazil

www.robustperception.io
