Hello all,
We've recently configured our Alertmanager instances for HA, following the documented setup:
- 3 instances, running as a Kubernetes StatefulSet;
- both TCP and UDP opened for the HA cluster port (8002):
  ports:
  - containerPort: 8001
    name: service
    protocol: TCP
  - containerPort: 8002
    name: ha-tcp
    protocol: TCP
  - containerPort: 8002
    name: ha-udp
    protocol: UDP
- all 3 instances point to instance 0 for clustering; I assumed there wouldn't be a problem with instance 0 pointing to itself (see the sketches right after this list):
  spec:
    containers:
    - args:
      # ...
      - --cluster.peer=testprom-am-0.testprom-am.default.svc.cluster.local:8002
      image: quay.io/prometheus/alertmanager:v0.23.0
- Prometheus points to the 3 Alertmanager instances:
  alertmanagers:
  - static_configs:
    - targets:
      - testprom-am-0.testprom-am.default.svc.cluster.local:8001
      - testprom-am-1.testprom-am.default.svc.cluster.local:8001
      - testprom-am-2.testprom-am.default.svc.cluster.local:8001
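In case it's relevant: the pod DNS names above resolve via the StatefulSet's governing headless Service. A minimal sketch of what that Service looks like on our side (assuming it's named testprom-am and exposes the same ports as the pods; the selector/labels are placeholders):

  # Sketch only: headless Service assumed to govern the StatefulSet;
  # name and ports taken from the snippets above, selector is a placeholder.
  apiVersion: v1
  kind: Service
  metadata:
    name: testprom-am
  spec:
    clusterIP: None        # headless, so testprom-am-N.testprom-am... resolves per pod
    selector:
      app: testprom-am     # placeholder label
    ports:
    - name: service
      port: 8001
      protocol: TCP
    - name: ha-tcp
      port: 8002
      protocol: TCP
    - name: ha-udp
      port: 8002
      protocol: UDP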
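One more thing I wasn't sure about: should each instance list every peer rather than only instance 0? As far as I can tell --cluster.peer can be repeated, so that would look roughly like this (a sketch only, we haven't tried it; other args elided as above):

  # Sketch only: every replica listing all three peers instead of just instance 0.
  - args:
    # ...
    - --cluster.peer=testprom-am-0.testprom-am.default.svc.cluster.local:8002
    - --cluster.peer=testprom-am-1.testprom-am.default.svc.cluster.local:8002
    - --cluster.peer=testprom-am-2.testprom-am.default.svc.cluster.local:8002
    image: quay.io/prometheus/alertmanager:v0.23.0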
However, despite all of that, we keep seeing log entries like the following rather often (e.g. 124 of them within 30 minutes):
level=debug ts=2022-08-04T12:03:19.284Z caller=cluster.go:329 component=cluster memberlist="2022/08/04 12:03:19 [DEBUG] memberlist: Failed ping: 01G9M3WYRFHA0DCCWRVERYJX2A (timeout reached)\n"
Is that something to worry about? Is there anything more that needs to be configured with regard to HA?
With one particular exception, alerts seem to work just fine: it's when we do a rolling upgrade of the Kubernetes cluster that previously fired alerts suddenly fire again. Any idea what could be causing that?
Many thanks,
Ionel