How to solve race condition in etcd leader election?

Enrico Massa

Nov 24, 2014, 10:50:57 AM

to etcd...@googlegroups.com

I recently posted a question on Stack Overflow (http://stackoverflow.com/questions/27104721/how-to-solve-race-condition-in-etcd-leader-election) about the above problem. I decided to post it here as well for additional visibility; apologies for the duplication.

While testing a CoreOS cluster with three nodes, after successfully adding and removing a few additional nodes, I encountered the following problem, supposedly due to a race condition during the etcd election process.

Checking the new leader gives:

> $ curl -L http://127.0.0.1:4001/v2/stats/leader
> {"errorCode":300,"message":"Raft Internal Error","index":629006}

Journalctl for each machine in the cluster gives:

> $ journalctl -r -u etcd
> -- Logs begin at Wed 2014-11-12 15:09:01 UTC, end at Mon 2014-11-24 10:47:34 UTC. --
> Nov 24 10:47:34 node-1 etcd[56576]: [etcd] Nov 24 10:47:34.307 INFO      | 965d12d38a4a4b2c807bd232fb7b0db7: term #5221 started.
> Nov 24 10:47:34 node-1 etcd[56576]: [etcd] Nov 24 10:47:34.306 INFO      | 965d12d38a4a4b2c807bd232fb7b0db7: state changed from 'candidate' to 'follower'.
> Nov 24 10:47:33 node-1 etcd[56576]: [etcd] Nov 24 10:47:33.098 INFO      | 965d12d38a4a4b2c807bd232fb7b0db7: state changed from 'follower' to 'candidate'.
> Nov 24 10:47:32 node-1 etcd[56576]: [etcd] Nov 24 10:47:32.081 INFO      | 965d12d38a4a4b2c807bd232fb7b0db7: term #5219 started.
> Nov 24 10:47:32 node-1 etcd[56576]: [etcd] Nov 24 10:47:32.081 INFO      | 965d12d38a4a4b2c807bd232fb7b0db7: state changed from 'candidate' to 'follower'.
> Nov 24 10:47:31 node-1 etcd[56576]: [etcd] Nov 24 10:47:31.962 INFO      | 965d12d38a4a4b2c807bd232fb7b0db7: state changed from 'follower' to 'candidate'.

And listing the machines with fleet fails:

> $ fleetctl list-machines
> 2014/11/24 10:56:19 INFO client.go:278: Failed getting response from http://127.0.0.1:4001/: dial tcp 127.0.0.1:4001: connection refused
> 2014/11/24 10:56:19 ERROR client.go:200: Unable to get result for {Get /_coreos.com/fleet/machines}, retrying in 100ms
> 2014/11/24 10:56:19 INFO client.go:278: Failed getting response from http://127.0.0.1:4001/: dial tcp 127.0.0.1:4001: connection refused
> 2014/11/24 10:56:19 ERROR client.go:200: Unable to get result for {Get /_coreos.com/fleet/machines}, retrying in 200ms
> 2014/11/24 10:56:19 INFO client.go:278: Failed getting response from http://127.0.0.1:4001/: dial tcp 127.0.0.1:4001: connection refused

Listing the machines in the cluster gives:

> $ curl -L http://127.0.0.1:7001/v2/admin/machines
> [{"name":"
> ","state":"follower","clientURL":"http://100.72.62.35:4001","peerURL":"http://100.72.62.35:7001"},{"name":"555cca74216644fea48990673b3d539c","state":"follower","clientURL":"http://100.72.62.59:4001","peerURL":"http://100.72.62.59:7001"},{"name":"965d12d38a4a4b2c807bd232fb7b0db7","state":"follower","clientURL":"http://100.72.20.153:4001","peerURL":"http://100.72.20.153:7001"},{"name":"a1b566dedb194c259f7eb2ffde5595b1","state":"follower","clientURL":"http://100.72.62.2:4001","peerURL":"http://100.72.62.2:7001"},{"name":"a45efba827754b5f93c38b751a0ae273","state":"follower","clientURL":"http://100.72.62.31:4001","peerURL":"http://100.72.62.31:7001"},{"name":"d041738235a9483cb814d37ca7fa4b6d","state":"follower","clientURL":"http://100.72.20.18:4001","peerURL":"http://100.72.20.18:7001"}]

but only three machines are currently running. I tried adding more machines to reach quorum, to no avail. I'm running the following version:

> $ etcdctl -v  
> etcdctl version 0.4.6

for which, as mentioned here https://coreos.com/docs/distributed-configuration/etcd-api/#cluster-config, the leader module for forcing a leader has been removed. The ugly part is that, since there is no quorum, I can't remove the machines that are no longer running from the list, for example with:

> $ curl -L -XDELETE http://127.0.0.1:7001/v2/admin/machines/2abbf47a9e644bc69652a986d796d7a6

which has no effect. Is there any way to save the cluster?
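For readers who want to spot stale members in a listing like the one above, here is a minimal Python sketch. It assumes the JSON shape shown in the `/v2/admin/machines` output and the DELETE endpoint quoted in this message; the member names, addresses, and helper functions are hypothetical. It only builds the URLs and sends nothing, and, as noted above, such a DELETE has no effect once quorum is lost.

```python
import json
from urllib.parse import quote

# Admin endpoint as quoted in the message above (etcd 0.4.x peer port).
ADMIN_BASE = "http://127.0.0.1:7001/v2/admin/machines"


def stale_members(listing_json, live_peer_urls):
    """Return names of registered members whose peerURL is not live."""
    members = json.loads(listing_json)
    return [m["name"] for m in members if m["peerURL"] not in live_peer_urls]


def delete_urls(names):
    """Build the DELETE URL for each member name (nothing is sent)."""
    return [f"{ADMIN_BASE}/{quote(name, safe='')}" for name in names]


if __name__ == "__main__":
    # Hypothetical, shortened listing in the same shape as the output above.
    listing = json.dumps([
        {"name": "node-a", "state": "follower",
         "clientURL": "http://10.0.0.1:4001", "peerURL": "http://10.0.0.1:7001"},
        {"name": "node-b", "state": "follower",
         "clientURL": "http://10.0.0.2:4001", "peerURL": "http://10.0.0.2:7001"},
    ])
    live = {"http://10.0.0.1:7001"}  # peers known to be running (assumed)
    for url in delete_urls(stale_members(listing, live)):
        print("would DELETE", url)
```

Running such a check before destroying a VM, while the cluster still has a leader, is the point at which the removal request can actually be committed.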

Brandon Philips

Nov 24, 2014, 1:25:46 PM

to Enrico Massa, etcd...@googlegroups.com

Hello Enrico-

On Mon, Nov 24, 2014 at 7:50 AM, Enrico Massa <enr...@antix.mobi> wrote:
> I recently posted a question on stack overflow
> (http://stackoverflow.com/questions/27104721/how-to-solve-race-condition-in-etcd-leader-election)
> for the above problem. I decided to post it in here for additional
> visibility, apologies for the duplication.

Yes, please don't use Stack Overflow. It is just one more place for us
to check on top of the mailing list, IRC and GitHub issues.

> While testing a Core Os cluster with three nodes, after successfully adding
> and removing few additional nodes, I encountered the following problem,
> supposedly due to a race condition during the election process for etcd.

I am guessing that you started some VMs, let them join the cluster, and
then destroyed them? If so, what happened is that you kept adding
machines to your cluster without removing them, up to the point where
you lost quorum. Notice that you have 6 members, so a majority of 4 is
required, but only 3 are running. If you are destroying machines you
must remove them from the cluster first; otherwise you risk losing
quorum, which is what happened here. Once quorum is lost, the cluster
can't make any decisions because of the risk of split-brain.
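The quorum arithmetic behind this can be sketched in a few lines of Python. This assumes the usual Raft majority rule (more than half of the registered members must be reachable); the function names are illustrative and not part of etcd.

```python
def quorum_size(members):
    """Smallest majority of a cluster with `members` registered members."""
    if members < 1:
        raise ValueError("a cluster needs at least one member")
    return members // 2 + 1


def has_quorum(members, running):
    """True if the running members still form a majority."""
    return running >= quorum_size(members)


if __name__ == "__main__":
    # Figures from this thread: 6 registered members, 3 still running.
    print(quorum_size(6))    # 4
    print(has_quorum(6, 3))  # False
    # The original cluster: 3 registered members, all running.
    print(has_quorum(3, 3))  # True
```

Because destroyed machines stay registered until they are removed, every node that joins and then disappears raises the majority needed without adding a voter, which is how three healthy nodes end up unable to elect a leader.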

In the current 0.4 versions of etcd there isn't anything that can be
safely done to repair this situation without loading all of the data
into a new cluster. In 0.5 (currently in alpha) we provide tools to
recover from this situation [1].

[1] https://github.com/coreos/etcd/blob/master/Documentation/0.5/admin_guide.md#disaster-recovery

Hope that helps,

Brandon

Enrico Massa

Nov 24, 2014, 3:39:00 PM

to etcd...@googlegroups.com, enr...@antix.mobi

Hi Brandon,

Many thanks for the information, I really appreciate your reply.

As you guessed I added new VMs to the cluster and destroyed them without removing them first from the cluster.

It was a test cluster, so I have already built a new one, but I was interested to know whether there was a way to deal with the situation. I'm glad this has been taken care of in version 0.5.

CoreOS is great, keep it up!

Best,

Enrico
