While testing a CoreOS cluster with three nodes, after successfully adding and removing a few additional nodes, I ran into the following problem, presumably caused by a race condition during the etcd leader election.
Checking the new leader gives:
> $ curl -L http://127.0.0.1:4001/v2/stats/leader
> {"errorCode":300,"message":"Raft Internal Error","index":629006}
Journalctl for each machine in the cluster gives:
> $ journalctl -r -u etcd
> -- Logs begin at Wed 2014-11-12 15:09:01 UTC, end at Mon 2014-11-24 10:47:34 UTC. --
> Nov 24 10:47:34 node-1 etcd[56576]: [etcd] Nov 24 10:47:34.307 INFO | 965d12d38a4a4b2c807bd232fb7b0db7: term #5221 started.
> Nov 24 10:47:34 node-1 etcd[56576]: [etcd] Nov 24 10:47:34.306 INFO | 965d12d38a4a4b2c807bd232fb7b0db7: state changed from 'candidate' to 'follower'.
> Nov 24 10:47:33 node-1 etcd[56576]: [etcd] Nov 24 10:47:33.098 INFO | 965d12d38a4a4b2c807bd232fb7b0db7: state changed from 'follower' to 'candidate'.
> Nov 24 10:47:32 node-1 etcd[56576]: [etcd] Nov 24 10:47:32.081 INFO | 965d12d38a4a4b2c807bd232fb7b0db7: term #5219 started.
> Nov 24 10:47:32 node-1 etcd[56576]: [etcd] Nov 24 10:47:32.081 INFO | 965d12d38a4a4b2c807bd232fb7b0db7: state changed from 'candidate' to 'follower'.
> Nov 24 10:47:31 node-1 etcd[56576]: [etcd] Nov 24 10:47:31.962 INFO | 965d12d38a4a4b2c807bd232fb7b0db7: state changed from 'follower' to 'candidate'.
And listing the machines with fleet fails:
> $ fleetctl list-machines
> 2014/11/24 10:56:19 INFO client.go:278: Failed getting response from http://127.0.0.1:4001/: dial tcp 127.0.0.1:4001: connection refused
> 2014/11/24 10:56:19 ERROR client.go:200: Unable to get result for {Get /_coreos.com/fleet/machines}, retrying in 100ms
> 2014/11/24 10:56:19 INFO client.go:278: Failed getting response from http://127.0.0.1:4001/: dial tcp 127.0.0.1:4001: connection refused
> 2014/11/24 10:56:19 ERROR client.go:200: Unable to get result for {Get /_coreos.com/fleet/machines}, retrying in 200ms
> 2014/11/24 10:56:19 INFO client.go:278: Failed getting response from http://127.0.0.1:4001/: dial tcp 127.0.0.1:4001: connection refused
Listing the machines in the cluster gives:
> $ curl -L http://127.0.0.1:7001/v2/admin/machines
> [{"name":"","state":"follower","clientURL":"http://100.72.62.35:4001","peerURL":"http://100.72.62.35:7001"},
> {"name":"555cca74216644fea48990673b3d539c","state":"follower","clientURL":"http://100.72.62.59:4001","peerURL":"http://100.72.62.59:7001"},
> {"name":"965d12d38a4a4b2c807bd232fb7b0db7","state":"follower","clientURL":"http://100.72.20.153:4001","peerURL":"http://100.72.20.153:7001"},
> {"name":"a1b566dedb194c259f7eb2ffde5595b1","state":"follower","clientURL":"http://100.72.62.2:4001","peerURL":"http://100.72.62.2:7001"},
> {"name":"a45efba827754b5f93c38b751a0ae273","state":"follower","clientURL":"http://100.72.62.31:4001","peerURL":"http://100.72.62.31:7001"},
> {"name":"d041738235a9483cb814d37ca7fa4b6d","state":"follower","clientURL":"http://100.72.20.18:4001","peerURL":"http://100.72.20.18:7001"}]
but only three machines are currently running. I tried adding more machines to reach quorum, to no avail. I'm running the following version:
> $ etcdctl -v
> etcdctl version 0.4.6
for which, as mentioned at https://coreos.com/docs/distributed-configuration/etcd-api/#cluster-config, the leader module that could be used to force a leader has been removed. The worst part is that, without quorum, I cannot remove the machines that are no longer running from the machine list, for example with:
> $ curl -L -XDELETE http://127.0.0.1:7001/v2/admin/machines/2abbf47a9e644bc69652a986d796d7a6
which has no effect. Is there any way to save the cluster?
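For reference, here is the quorum arithmetic as I understand it, as a rough sketch. It assumes that all six entries in the machine list count as voting peers and that a leader needs a strict majority of them, as in Raft generally; the function names `quorum` and `has_quorum` are made up for illustration and are not part of etcd.

```python
# Sketch of the majority rule used by Raft-style leader election.
# Assumption: every registered peer counts toward the cluster size,
# whether or not it is currently running.

def quorum(registered_peers: int) -> int:
    """Smallest number of peers that forms a strict majority."""
    return registered_peers // 2 + 1


def has_quorum(registered_peers: int, running_peers: int) -> bool:
    """True if enough peers are up to elect a leader."""
    return running_peers >= quorum(registered_peers)


registered = 6  # entries in the /v2/admin/machines listing
running = 3     # machines that are actually up

print(quorum(registered))               # 4
print(has_quorum(registered, running))  # False
```

If that assumption holds, four of the six registered peers would have to be up, which would be consistent with the three running nodes cycling between 'follower' and 'candidate' without ever electing a leader.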