Hello all,
We've recently configured our Alertmanager instances for HA, following the documented setup:
- 3 instances, running as a Kubernetes StatefulSet;
- both TCP and UDP opened for the HA cluster port (8002):
  ports:
  - containerPort: 8001
    name: service
    protocol: TCP
  - containerPort: 8002
    name: ha-tcp
    protocol: TCP
  - containerPort: 8002
    name: ha-udp
    protocol: UDP
- all 3 instances point to instance 0 for clustering; I assumed there wouldn't be a problem with instance 0 pointing to itself (see the sketches right after this list):
  spec:
    containers:
    - args:
      # ...
      - --cluster.peer=testprom-am-0.testprom-am.default.svc.cluster.local:8002
      image: quay.io/prometheus/alertmanager:v0.23.0
- Prometheus points to the 3 Alertmanager instances:
  alertmanagers:
  - static_configs:
    - targets:
      - testprom-am-0.testprom-am.default.svc.cluster.local:8001
      - testprom-am-1.testprom-am.default.svc.cluster.local:8001
      - testprom-am-2.testprom-am.default.svc.cluster.local:8001
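In case it's relevant: the pod DNS names above resolve via the StatefulSet's governing headless Service. A minimal sketch of what that Service looks like on our side (assuming it's named testprom-am and exposes the same ports as the pods; the selector/labels are placeholders):

  # Sketch only: headless Service assumed to govern the StatefulSet;
  # name and ports taken from the snippets above, selector is a placeholder.
  apiVersion: v1
  kind: Service
  metadata:
    name: testprom-am
  spec:
    clusterIP: None        # headless, so testprom-am-N.testprom-am... resolves per pod
    selector:
      app: testprom-am     # placeholder label
    ports:
    - name: service
      port: 8001
      protocol: TCP
    - name: ha-tcp
      port: 8002
      protocol: TCP
    - name: ha-udp
      port: 8002
      protocol: UDP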
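One more thing I wasn't sure about: should each instance list every peer rather than only instance 0? As far as I can tell --cluster.peer can be repeated, so that would look roughly like this (a sketch only, we haven't tried it; other args elided as above):

  # Sketch only: every replica listing all three peers instead of just instance 0.
  - args:
    # ...
    - --cluster.peer=testprom-am-0.testprom-am.default.svc.cluster.local:8002
    - --cluster.peer=testprom-am-1.testprom-am.default.svc.cluster.local:8002
    - --cluster.peer=testprom-am-2.testprom-am.default.svc.cluster.local:8002
    image: quay.io/prometheus/alertmanager:v0.23.0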
However, despite all of that, we keep seeing log entries like the following rather often (e.g. 124 of them within 30 minutes):
level=debug ts=2022-08-04T12:03:19.284Z caller=cluster.go:329 component=cluster memberlist="2022/08/04 12:03:19 [DEBUG] memberlist: Failed ping: 01G9M3WYRFHA0DCCWRVERYJX2A (timeout reached)\n"
Is that something to worry about? Is there anything more that needs to be configured with regard to HA?
With one particular exception, alerts seem to work just fine: it's when we do a rolling upgrade of the Kubernetes cluster that previously fired alerts suddenly fire again. Any idea what could be causing that?
Many thanks,
Ionel