Alertmanager gossip DNS issue

Povilas Versockas

Jun 25, 2018, 3:43:01 AM
to Prometheus Developers
Hey,

I have the following issue running a clustered Alertmanager:

When I restart all of the Alertmanager instances in Kubernetes, gossip "islands" of 1 instance sometimes form. They aren't always islands of 1: I've also seen cases where 2 instances were connected and 1 was left out.

I think the issue is that Kubernetes DNS may still contain the old Alertmanager instance IPs on startup (when `Join()` happens), but none of the new instance IPs.
Because DNS is not empty at that point, `resolvePeers` with `waitIfEmpty=true` returns right away with the stale addresses, so none of the Alertmanagers actually connect to each other.

Here are some logs/debug info:

All Alertmanager metrics endpoints show: `alertmanager_cluster_members 1`
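For anyone who wants to check for islands automatically, a rough sketch of such a check follows. It assumes the gauge is exposed without labels, exactly as shown above, and it compares against the `replicas: 3` from the Deployment in this post. `clusterMembers` is a hypothetical helper, not part of Alertmanager.

```go
package main

import (
	"bufio"
	"fmt"
	"strconv"
	"strings"
)

// clusterMembers extracts the alertmanager_cluster_members gauge from a
// metrics body in Prometheus text format. ok is false if the metric is
// missing or its value cannot be parsed.
func clusterMembers(body string) (n int, ok bool) {
	sc := bufio.NewScanner(strings.NewReader(body))
	for sc.Scan() {
		fields := strings.Fields(sc.Text())
		if len(fields) >= 2 && fields[0] == "alertmanager_cluster_members" {
			v, err := strconv.ParseFloat(fields[1], 64)
			if err != nil {
				return 0, false
			}
			return int(v), true
		}
	}
	return 0, false
}

func main() {
	// In practice the body would be fetched from each instance's /metrics.
	body := "# TYPE alertmanager_cluster_members gauge\nalertmanager_cluster_members 1\n"
	const replicas = 3 // from the Deployment in this post
	if n, ok := clusterMembers(body); ok && n < replicas {
		fmt.Printf("possible island: %d of %d members\n", n, replicas)
	}
}
```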

logs of alertmanager1:

```
level=info ts=2018-06-21T14:35:44.824688253Z caller=main.go:141 build_context="(go=go1.10, user=root@f278953f13ef, date=20180323-13:05:10)"
level=warn ts=2018-06-21T14:36:04.840918631Z caller=cluster.go:129 component=cluster msg="failed to join cluster" err="2 errors occurred:\n\n* Failed to join 10.2.19.164: dial tcp 10.2.19.164:8001: i/o timeout\n* Failed to join 10.2.43.52: dial tcp 10.2.43.52:8001: i/o timeout"
level=info ts=2018-06-21T14:36:04.841778884Z caller=cluster.go:249 component=cluster msg="Waiting for gossip to settle..." interval=2s
level=info ts=2018-06-21T14:36:04.84242773Z caller=main.go:270 msg="Loading configuration file" file=/etc/alertmanager/config.yml
level=info ts=2018-06-21T14:36:04.849504109Z caller=main.go:346 msg=Listening address=0.0.0.0:9093
level=info ts=2018-06-21T14:36:06.84218093Z caller=cluster.go:274 component=cluster msg="gossip not settled" polls=0 before=0 now=1 elapsed=2.000165572s
level=info ts=2018-06-21T14:36:14.843072999Z caller=cluster.go:266 component=cluster msg="gossip settled; proceeding" elapsed=10.001056858s
level=info ts=2018-06-21T14:51:04.842216499Z caller=nflog.go:313 component=nflog msg="Running maintenance"
level=info ts=2018-06-21T14:51:04.842784642Z caller=silence.go:252 component=silences msg="Running maintenance"
level=info ts=2018-06-21T14:51:04.84482936Z caller=silence.go:254 component=silences msg="Maintenance done" duration=2.046436ms size=0
level=info ts=2018-06-21T14:51:04.844844053Z caller=nflog.go:315 component=nflog msg="Maintenance done" duration=2.631928ms size=6305
```

alertmanager2:
```
level=info ts=2018-06-21T14:35:44.824589916Z caller=main.go:140 msg="Starting Alertmanager" version="(version=0.15.0-rc.1, branch=HEAD, revision=acb111e812530bec1ac6d908bc14725793e07cf3)"
level=info ts=2018-06-21T14:35:44.824688253Z caller=main.go:141 build_context="(go=go1.10, user=root@f278953f13ef, date=20180323-13:05:10)"
level=warn ts=2018-06-21T14:36:04.840918631Z caller=cluster.go:129 component=cluster msg="failed to join cluster" err="2 errors occurred:\n\n* Failed to join 10.2.19.164: dial tcp 10.2.19.164:8001: i/o timeout\n* Failed to join 10.2.43.52: dial tcp 10.2.43.52:8001: i/o timeout"
level=info ts=2018-06-21T14:36:04.841778884Z caller=cluster.go:249 component=cluster msg="Waiting for gossip to settle..." interval=2s
level=info ts=2018-06-21T14:36:04.84242773Z caller=main.go:270 msg="Loading configuration file" file=/etc/alertmanager/config.yml
level=info ts=2018-06-21T14:36:04.849504109Z caller=main.go:346 msg=Listening address=0.0.0.0:9093
level=info ts=2018-06-21T14:36:06.84218093Z caller=cluster.go:274 component=cluster msg="gossip not settled" polls=0 before=0 now=1 elapsed=2.000165572s
level=info ts=2018-06-21T14:36:14.843072999Z caller=cluster.go:266 component=cluster msg="gossip settled; proceeding" elapsed=10.001056858s
level=info ts=2018-06-21T14:51:04.842216499Z caller=nflog.go:313 component=nflog msg="Running maintenance"
level=info ts=2018-06-21T14:51:04.842784642Z caller=silence.go:252 component=silences msg="Running maintenance"
level=info ts=2018-06-21T14:51:04.84482936Z caller=silence.go:254 component=silences msg="Maintenance done" duration=2.046436ms size=0
level=info ts=2018-06-21T14:51:04.844844053Z caller=nflog.go:315 component=nflog msg="Maintenance done" duration=2.631928ms size=6305
```

my k8s config:
```
apiVersion: v1
kind: Service
metadata:
  labels:
    name: alertmanager-peers
  name: alertmanager-peers
  namespace: sys-mon
spec:
  clusterIP: None
  ports:
  - name: cluster
    protocol: TCP
    port: 8001
    targetPort: cluster
  selector:
    app: alertmanager
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: alertmanager
  namespace: sys-mon
spec:
  replicas: 3
  selector:
    matchLabels:
      app: alertmanager
  template:
    metadata:
      name: alertmanager
      labels:
        app: alertmanager
    spec:
      containers:
      - name: alertmanager
        image: prom/alertmanager:v0.15.0-rc.1
        args:
          - --config.file=/etc/alertmanager/config.yml
          - --web.listen-address=0.0.0.0:9093
          - --storage.path=/alertmanager
          - --web.external-url=https://alertmanager.dev.uw.systems
          - --cluster.listen-address=0.0.0.0:8001
          - --cluster.peer=alertmanager-peers.sys-mon:8001
...
```

To reproduce, I ran Alertmanager in Kubernetes behind a headless service and force-deleted all pods with `kubectl delete po --force --grace-period=0 -l app=alertmanager`.

I've also written a fix that adds a periodic job for DNS refresh in https://github.com/prometheus/alertmanager/pull/1428. After applying the change and setting the refresh interval to ~5 minutes, I can see that the nodes rejoined after 5 minutes via the refresh job, since the refresh counter increased:
```
alertmanager_cluster_refresh_total 2
```

We also have the same issue in Thanos (https://github.com/improbable-eng/thanos/pull/383 and https://github.com/improbable-eng/thanos/issues/372), which we will hopefully fix soon.

Can someone take a look at the PR / issue? What should I do next? I think it shouldn't be that hard to write a test for this case using the https://github.com/miekg/dns library or something similar, but I'm not sure whether it's OK to add another dependency to Alertmanager.

