If a peer is stopped and then started on a new IP address without leaving the cluster, the peers list and the catalog get out of sync: the peers list contains both the old and the new IP address, while the catalog contains only the new one. The existing members keep trying to contact the old address. The only recovery at this point seems to be editing the raft/peers.json file.
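For reference, a hand-edited raft/peers.json in the legacy JSON format is just an array of "ip:port" strings listing the current server addresses. This is only an illustrative sketch using the addresses from my reproduction below, with node2's stale address replaced by its new one (the servers would need to be stopped before editing):

```json
[
  "172.17.0.36:8300",
  "172.17.0.40:8300",
  "172.17.0.38:8300"
]
```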
The context here is Docker: when a container is stopped and started again, Docker assigns the restarted container a new IP address.
Here is what I am doing to reproduce:
$ docker run -d --name node1 -h node1 progrium/consul -server -bootstrap-expect 3
$ docker run -d --name node2 --link node1:node1 -h node2 progrium/consul -server -join node1
$ docker run -d --name node3 --link node1:node1 -h node3 progrium/consul -server -join node1
$ docker run -d --name node4 --link node1:node1 -p 8400:8400 -p 8500:8500 -p 8600:53/udp -h node4 progrium/consul -join node1
$ http GET localhost:8500/v1/catalog/nodes
HTTP/1.1 200 OK
Content-Length: 165
Content-Type: application/json
Date: Thu, 06 Nov 2014 15:50:56 GMT
X-Consul-Index: 9
X-Consul-Knownleader: true
X-Consul-Lastcontact: 0
[
  {
    "Address": "172.17.0.36",
    "Node": "node1"
  },
  {
    "Address": "172.17.0.37",
    "Node": "node2"
  },
  {
    "Address": "172.17.0.38",
    "Node": "node3"
  },
  {
    "Address": "172.17.0.39",
    "Node": "node4"
  }
]
$ http GET localhost:8500/v1/status/peers
HTTP/1.1 200 OK
Content-Length: 58
Content-Type: application/json
Date: Thu, 06 Nov 2014 15:51:17 GMT
[
]
$ docker stop node2
$ docker start node2
Now the leader continuously logs errors trying to contact the old address:
2014/11/06 15:52:40 [WARN] raft: Failed to contact 172.17.0.37:8300 in 14.711937062s
$ http GET localhost:8500/v1/catalog/nodes
HTTP/1.1 200 OK
Content-Length: 165
Content-Type: application/json
Date: Thu, 06 Nov 2014 15:53:28 GMT
X-Consul-Index: 21
X-Consul-Knownleader: true
X-Consul-Lastcontact: 0
[
  {
    "Address": "172.17.0.36",
    "Node": "node1"
  },
  {
    "Address": "172.17.0.40",    <---- IP address of node2 has been updated
    "Node": "node2"
  },
  {
    "Address": "172.17.0.38",
    "Node": "node3"
  },
  {
    "Address": "172.17.0.39",
    "Node": "node4"
  }
]
$ http GET localhost:8500/v1/status/peers
HTTP/1.1 200 OK
Content-Length: 77
Content-Type: application/json
Date: Thu, 06 Nov 2014 15:54:06 GMT
[
]
Can (or should) Consul detect that a peer with the same node name as a prior peer is joining from a different IP address, and remove the old entry for that name from the peers list?
Regards,
Raman