When I restart all of the Alertmanager instances in Kubernetes, gossip "islands" of 1 instance sometimes form. The islands are not always of size 1: I have also seen cases where 2 instances were connected and 1 was left outside.
I think the issue is that on startup (when `Join()` happens) Kubernetes DNS may still contain old Alertmanager instance IPs but none of the new instance IPs.
Because DNS is not empty at that point, `resolvePeers` with `waitIfEmpty=true` returns right away, so none of the Alertmanagers actually connect to each other.
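The suspected race can be illustrated with a minimal Go sketch. This is not Alertmanager's actual code: `lookupFunc`, `resolveOnce`, `reachable`, and the `10.2.50.1` address are hypothetical names and values made up for illustration. It only assumes that resolution retries while the DNS answer is empty and accepts the first non-empty answer.

```go
package main

import "fmt"

// lookupFunc stands in for a DNS query against the headless service.
// Hypothetical helper, not part of Alertmanager.
type lookupFunc func() []string

// resolveOnce mimics the behaviour described above: it retries only while
// the answer is empty (when waitIfEmpty is set) and returns the first
// non-empty answer without checking whether those addresses are alive.
func resolveOnce(lookup lookupFunc, waitIfEmpty bool, maxTries int) []string {
	for i := 0; i < maxTries; i++ {
		ips := lookup()
		if len(ips) > 0 || !waitIfEmpty {
			return ips
		}
	}
	return nil
}

// reachable filters the resolved addresses down to those of live pods.
func reachable(resolved []string, live map[string]bool) []string {
	var out []string
	for _, ip := range resolved {
		if live[ip] {
			out = append(out, ip)
		}
	}
	return out
}

func main() {
	// DNS still lists the deleted pods (IPs from the join errors in the logs).
	staleDNS := func() []string {
		return []string{"10.2.19.164:8001", "10.2.43.52:8001"}
	}
	// Hypothetical address of a freshly started pod, not yet in DNS.
	live := map[string]bool{"10.2.50.1:8001": true}

	resolved := resolveOnce(staleDNS, true, 3)
	fmt.Println("resolved:", resolved)
	fmt.Println("reachable peers:", len(reachable(resolved, live)))
}
```

Under these assumptions the non-empty but stale answer is accepted on the first try, every join attempt targets a dead address, and each instance ends up alone.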
Here are some logs/debug info:
All Alertmanager metrics endpoints show: `alertmanager_cluster_members 1`
logs of alertmanager1:
```
level=info ts=2018-06-21T14:35:44.824688253Z caller=main.go:141 build_context="(go=go1.10, user=root@f278953f13ef, date=20180323-13:05:10)"
level=warn ts=2018-06-21T14:36:04.840918631Z caller=cluster.go:129 component=cluster msg="failed to join cluster" err="2 errors occurred:\n\n* Failed to join 10.2.19.164: dial tcp 10.2.19.164:8001: i/o timeout\n* Failed to join 10.2.43.52: dial tcp 10.2.43.52:8001: i/o timeout"
level=info ts=2018-06-21T14:36:04.841778884Z caller=cluster.go:249 component=cluster msg="Waiting for gossip to settle..." interval=2s
level=info ts=2018-06-21T14:36:04.84242773Z caller=main.go:270 msg="Loading configuration file" file=/etc/alertmanager/config.yml
level=info ts=2018-06-21T14:36:04.849504109Z caller=main.go:346 msg=Listening address=0.0.0.0:9093
level=info ts=2018-06-21T14:36:06.84218093Z caller=cluster.go:274 component=cluster msg="gossip not settled" polls=0 before=0 now=1 elapsed=2.000165572s
level=info ts=2018-06-21T14:36:14.843072999Z caller=cluster.go:266 component=cluster msg="gossip settled; proceeding" elapsed=10.001056858s
level=info ts=2018-06-21T14:51:04.842216499Z caller=nflog.go:313 component=nflog msg="Running maintenance"
level=info ts=2018-06-21T14:51:04.842784642Z caller=silence.go:252 component=silences msg="Running maintenance"
level=info ts=2018-06-21T14:51:04.84482936Z caller=silence.go:254 component=silences msg="Maintenance done" duration=2.046436ms size=0
level=info ts=2018-06-21T14:51:04.844844053Z caller=nflog.go:315 component=nflog msg="Maintenance done" duration=2.631928ms size=6305
```
alertmanager2:
```
level=info ts=2018-06-21T14:35:44.824589916Z caller=main.go:140 msg="Starting Alertmanager" version="(version=0.15.0-rc.1, branch=HEAD, revision=acb111e812530bec1ac6d908bc14725793e07cf3)"
level=info ts=2018-06-21T14:35:44.824688253Z caller=main.go:141 build_context="(go=go1.10, user=root@f278953f13ef, date=20180323-13:05:10)"
level=warn ts=2018-06-21T14:36:04.840918631Z caller=cluster.go:129 component=cluster msg="failed to join cluster" err="2 errors occurred:\n\n* Failed to join 10.2.19.164: dial
level=info ts=2018-06-21T14:36:04.841778884Z caller=cluster.go:249 component=cluster msg="Waiting for gossip to settle..." interval=2s
level=info ts=2018-06-21T14:36:04.84242773Z caller=main.go:270 msg="Loading configuration file" file=/etc/alertmanager/config.yml
level=info ts=2018-06-21T14:36:04.849504109Z caller=main.go:346 msg=Listening address=0.0.0.0:9093
level=info ts=2018-06-21T14:36:06.84218093Z caller=cluster.go:274 component=cluster msg="gossip not settled" polls=0 before=0 now=1 elapsed=2.000165572s
level=info ts=2018-06-21T14:36:14.843072999Z caller=cluster.go:266 component=cluster msg="gossip settled; proceeding" elapsed=10.001056858s
level=info ts=2018-06-21T14:51:04.842216499Z caller=nflog.go:313 component=nflog msg="Running maintenance"
level=info ts=2018-06-21T14:51:04.842784642Z caller=silence.go:252 component=silences msg="Running maintenance"
level=info ts=2018-06-21T14:51:04.84482936Z caller=silence.go:254 component=silences msg="Maintenance done" duration=2.046436ms size=0
level=info ts=2018-06-21T14:51:04.844844053Z caller=nflog.go:315 component=nflog msg="Maintenance done" duration=2.631928ms size=6305
```
my k8s config:
```
apiVersion: v1
kind: Service
metadata:
  labels:
    name: alertmanager-peers
  name: alertmanager-peers
  namespace: sys-mon
spec:
  clusterIP: None
  ports:
  - name: cluster
    protocol: TCP
    port: 8001
    targetPort: cluster
  selector:
    app: alertmanager
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: alertmanager
  namespace: sys-mon
spec:
  replicas: 3
  selector:
    matchLabels:
      app: alertmanager
  template:
    metadata:
      name: alertmanager
      labels:
        app: alertmanager
    spec:
      containers:
      - name: alertmanager
        image: prom/alertmanager:v0.15.0-rc.1
        args:
        - --config.file=/etc/alertmanager/config.yml
        - --storage.path=/alertmanager
        - --cluster.peer=alertmanager-peers.sys-mon:8001
        ...
```
To reproduce, I ran Alertmanager in Kubernetes with a headless service and deleted all pods with `kubectl delete po --force --grace-period=0 -l app=alertmanager`.
I've also written a fix that adds a periodic job for DNS refresh in https://github.com/prometheus/alertmanager/pull/1428. After applying the change and setting the refresh interval to ~5 minutes, I can see that the nodes rejoined after 5 minutes via the refresh job, as the refresh counter increased:
```
alertmanager_cluster_refresh_total 2
```
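For readers who want the idea without opening the PR, here is a rough Go sketch of a periodic refresh loop. It is not the code from the PR: `refresh`, the `resolve` and `join` callbacks, the `members` map, and the `10.2.50.x` addresses are all hypothetical, and the interval is shortened to milliseconds so the example finishes quickly.

```go
package main

import (
	"fmt"
	"time"
)

// refresh performs one round: re-resolve the peer addresses and try to join
// every address that is not already a known member. It returns how many
// peers were joined in this round. Hypothetical helper for illustration.
func refresh(resolve func() []string, members map[string]bool, join func(addr string) error) int {
	joined := 0
	for _, addr := range resolve() {
		if members[addr] {
			continue // already connected
		}
		if err := join(addr); err != nil {
			continue // stale or unreachable address; retry on the next tick
		}
		members[addr] = true
		joined++
	}
	return joined
}

func main() {
	// This instance starts as an island: it only knows itself.
	members := map[string]bool{"10.2.50.1:8001": true}

	// By now DNS has caught up and lists the new pods (made-up addresses).
	resolve := func() []string {
		return []string{"10.2.50.1:8001", "10.2.50.2:8001", "10.2.50.3:8001"}
	}
	join := func(addr string) error { return nil } // pretend every join succeeds

	// The issue used an interval of roughly 5 minutes.
	ticker := time.NewTicker(10 * time.Millisecond)
	defer ticker.Stop()

	for i := 1; i <= 2; i++ {
		<-ticker.C
		n := refresh(resolve, members, join)
		fmt.Printf("refresh %d: joined %d, members=%d\n", i, n, len(members))
	}
}
```

The point of the design is that a single failed `Join()` at startup is no longer final: once DNS lists the new pods, the next tick picks them up.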