Failed ping in HA alertmanagers


Ionel Sirbu

Aug 5, 2022, 6:38:06 AM
to Prometheus Users
Hello all,

We've recently configured our alertmanagers to be HA as per the specs:
- 3 instances, using a Kubernetes StatefulSet;
- both TCP & UDP open for the HA cluster port (the matching headless Service is sketched after this list):

    ports:
    - containerPort: 8001
      name: service
      protocol: TCP
    - containerPort: 8002
      name: ha-tcp
      protocol: TCP
    - containerPort: 8002
      name: ha-udp
      protocol: UDP


- all 3 instances point to instance 0 for clustering (I assumed there wouldn't be a problem with instance 0 pointing to itself; a fully meshed variant is sketched after this list):

spec:
  containers:
  - args:
    # ...
    - --cluster.peer=testprom-am-0.testprom-am.default.svc.cluster.local:8002
    image: quay.io/prometheus/alertmanager:v0.23.0

- prometheus points to the 3 alertmanager instances:

  alertmanagers:
    - static_configs:
      - targets:
        - testprom-am-0.testprom-am.default.svc.cluster.local:8001
        - testprom-am-1.testprom-am.default.svc.cluster.local:8001
        - testprom-am-2.testprom-am.default.svc.cluster.local:8001
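
For completeness, the headless Service backing those DNS names looks roughly like this (a sketch based on our naming; the selector label is an assumption on my part, and I've mirrored both protocols on 8002 just like the pod ports):

    apiVersion: v1
    kind: Service
    metadata:
      name: testprom-am
    spec:
      clusterIP: None        # headless, so each pod gets a stable DNS name
      selector:
        app: testprom-am     # assumed pod label; ours matches the statefulset pods
      ports:
      - name: service
        port: 8001
        protocol: TCP
      - name: ha-tcp
        port: 8002
        protocol: TCP
      - name: ha-udp
        port: 8002
        protocol: UDP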

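For reference, this is roughly what I understand the fully meshed variant would look like, with every instance listing all three peers (a sketch using our naming; --cluster.peer is repeatable, and as far as I can tell an instance listing itself is harmless):

    spec:
      containers:
      - args:
        # ...
        - --cluster.listen-address=0.0.0.0:8002
        - --cluster.peer=testprom-am-0.testprom-am.default.svc.cluster.local:8002
        - --cluster.peer=testprom-am-1.testprom-am.default.svc.cluster.local:8002
        - --cluster.peer=testprom-am-2.testprom-am.default.svc.cluster.local:8002
        image: quay.io/prometheus/alertmanager:v0.23.0
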

Despite all that, however, we keep getting errors like the following rather often (e.g. 124 occurrences within 30 minutes):

level=debug ts=2022-08-04T12:03:19.284Z caller=cluster.go:329 component=cluster memberlist="2022/08/04 12:03:19 [DEBUG] memberlist: Failed ping: 01G9M3WYRFHA0DCCWRVERYJX2A (timeout reached)\n"

Is that something to worry about? Is there anything more that needs to be configured with regard to HA?
Apart from one particular case, alerts seem to work just fine: it's when we do a rolling upgrade of the Kubernetes cluster that previous alerts suddenly fire again. Any idea what could be causing that?
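
In case it matters, we haven't overridden any of the gossip/probe timings, so (if I'm reading the v0.23.0 flag defaults correctly) we're effectively running with:

    --cluster.gossip-interval=200ms
    --cluster.pushpull-interval=1m0s
    --cluster.probe-timeout=500ms
    --cluster.probe-interval=1s
    --cluster.peer-timeout=15s

If I understand memberlist correctly, the "Failed ping" line just means a peer didn't answer a direct probe within the probe timeout, so bumping --cluster.probe-timeout might be one knob to try.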

Many thanks,
Ionel

Ionel Sirbu

Aug 23, 2022, 4:46:47 AM
to Prometheus Users
Any thoughts on this, anyone?