Restarting one of the etcd members triggers a leader election


Maciej Borsz

Feb 15, 2018, 5:15:44 AM
to etcd...@googlegroups.com, Joe Betz, Xiang Li, Wojciech Tyczynski, GKE Regional Clusters
Hi,

I was investigating why we have so many leader elections in our etcd clusters and found out that when we restart one of the etcd members, it pretty often triggers a leader election in the cluster.

It surprised me, because it happens before the election timeout has elapsed since process startup.

1. Why is advanceTicksForElection used in restartNode()?

The fact that a node is restarted doesn't mean that it is joining a new cluster, and I think this feature is mostly about the new-cluster case, right?

2. Why is advanceTicksForElection used in startNode()?
Even calling this from startNode() is not obvious to me; if I understand correctly, it is also called when an etcd node is recreated, in which case the node can still be joining an existing cluster.

3. Is this feature important? Can we simply revert that commit?
4. Can we partially revert that commit (at least the restartNode case, which hurts us the most)?
5. Any other suggestions for how to mitigate this issue?

Thanks,
Maciej



Gyuho Lee

Feb 15, 2018, 3:44:06 PM
to etcd-dev

2. Why is advanceTicksForElection used in startNode()?

As I understand it, it expedites leader election for multi-data-center
deployments with larger election timeouts. Otherwise, the cluster
would suffer from a slow start (e.g. if the election timeout is 5s, a starting
node without "advance tick" would have to wait 5s before it could start an election).

1. Why is advanceTicksForElection used in restartNode()?

For the same reason as above. All nodes can restart during upgrades,
and as long as the leader still holds its lease, a restarted follower
won't trigger a leader election.

I was investigating why we have so many leader elections in our etcd clusters and found out that when we restart one of the etcd members, it pretty often triggers a leader election in the cluster.

Do we have server logs around the leader elections? Even if a follower restarts
with advanced ticks, it won't always trigger an election (before the election
timeout the follower may receive a heartbeat from the leader and reset its
election ticks).

Another way to improve the robustness of leader election is Pre-Vote,
where a candidate asks its peers whether its log is up-to-date enough to get
their votes. This is already implemented in our Raft package, but not in etcd.

Maciej Borsz

Feb 16, 2018, 7:15:40 AM
to etcd-dev
Thanks for the response! :)


On Thursday, February 15, 2018 at 9:44:06 PM UTC+1, Gyuho Lee wrote:

2. Why is advanceTicksForElection used in startNode()?

As I understand it, it expedites leader election for multi-data-center
deployments with larger election timeouts. Otherwise, the cluster
would suffer from a slow start (e.g. if the election timeout is 5s, a starting
node without "advance tick" would have to wait 5s before it could start an election).

I see that case; it makes sense to me.
 

1. Why is advanceTicksForElection used in restartNode()?

For the same reason as above. All nodes can restart during upgrades,
and as long as the leader still holds its lease, a restarted follower
won't trigger a leader election.

You mean, as long as the leader still holds its lease AND the follower receives a heartbeat before a single tick (heartbeat) elapses, right?


I was investigating why we have so many leader elections in our etcd clusters and found out that when we restart one of the etcd members, it pretty often triggers a leader election in the cluster.

Do we have server logs around the leader elections? Even if a follower restarts
with advanced ticks, it won't always trigger an election (before the election
timeout the follower may receive a heartbeat from the leader and reset its
election ticks).
 
If I understand correctly, advanceTicksForElection reduces the time the follower waits from a full 'election timeout' to roughly a single 'heartbeat' tick (more or less, as raft adds some randomness to the timeout in https://github.com/coreos/etcd/blob/release-3.1/raft/raft.go#L1210).
A leader election doesn't happen every time, but often enough for us to notice.
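
To make that arithmetic concrete, here is a rough standalone sketch in Go (not etcd code; the tick count is made up): raft picks a randomized timeout in [electionTicks, 2*electionTicks-1], so after electionTicks-1 ticks have been fast-forwarded, the restarted follower campaigns after somewhere between 1 and electionTicks remaining ticks.

package main

import (
	"fmt"
	"math/rand"
)

func main() {
	const electionTicks = 10           // e.g. --election-timeout=1000ms with --heartbeat-interval=100ms
	const advanced = electionTicks - 1 // ticks fast-forwarded on restart

	for i := 0; i < 5; i++ {
		// raft randomizes the timeout as electionTimeout + rand.Intn(electionTimeout)
		randomized := electionTicks + rand.Intn(electionTicks)
		remaining := randomized - advanced // ticks left before this follower campaigns
		fmt.Printf("randomized=%d ticks, remaining after restart=%d ticks\n", randomized, remaining)
	}
}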

In our case, we don't care too much about the initial delay on cluster creation, but we do care about etcd cluster availability.
To achieve high availability, we do gradual upgrades of all nodes (i.e. we restart nodes one by one), so that the etcd cluster should work ~all the time (minus some time for leadership transfer).

I believe that in our case this advanceTicksForElection feature hurts us more than it helps, so I'm thinking about disabling it for our deployment.
How about making this feature flag-controlled?
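
One possible shape for that, as a hedged sketch (the type and field names below are hypothetical, not an existing etcd option):

package etcdserver

import "github.com/coreos/etcd/raft"

// ServerConfigSketch is a hypothetical illustration of gating the tick
// fast-forwarding behind a configuration option, so deployments that do
// rolling restarts can opt out and keep the full election timeout after a restart.
type ServerConfigSketch struct {
	ElectionTicks int
	// AdvanceTicksOnStart: if false, skip advanceTicksForElection in
	// startNode()/restartNode(); a restarted follower then has the whole
	// election timeout to hear from the existing leader.
	AdvanceTicksOnStart bool
}

func maybeAdvanceTicks(cfg ServerConfigSketch, n raft.Node) {
	if !cfg.AdvanceTicksOnStart {
		return
	}
	for i := 0; i < cfg.ElectionTicks-1; i++ {
		n.Tick()
	}
}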

I'm keen to hear your opinion about this.

Thanks,
Maciej

Gyuho Lee

Feb 16, 2018, 5:51:09 PM
to etcd-dev
 
You mean, as long as the leader still holds its lease AND the follower receives a heartbeat before a single tick (heartbeat) elapses, right?
 
A node restarts as a follower, with advanceTicksForElection leaving it
only about one more tick before it starts campaigning. Then there are two cases:

1. The restarted follower receives a heartbeat from the leader before the last tick elapses.
Then the follower resets the election ticks that had been fast-forwarded by
advanceTicksForElection. Restarting this node doesn't trigger a leader
election; a follower receiving heartbeats implies there is an acting leader.

2. The restarted follower has not received a heartbeat from the leader, and the last tick
elapses, so it times out and starts a campaign. The follower becomes a candidate
and requests votes from its peers.

2-1. The restarted follower was not the leader and there is a leader in the cluster
  - The current leader still holds its lease, so it ignores the vote requests.

2-2. The restarted follower was the leader and there is no acting leader
  - Whichever candidate gets a majority of votes becomes the new leader.

To summarize, there are two cases where it could trigger leader elections:

1. the restarted node was the leader, so losing the leader triggered an election.
2. restarting a follower node made the quorum inactive from the leader's viewpoint
  - e.g. restarting 1 node out of a 2-node cluster, or 2 out of 3 nodes being unavailable

To achieve high availability, we do gradual upgrades of all nodes (i.e. we restart nodes one by one), so that the etcd cluster should work ~all the time (minus some time for leadership transfer).

Now I can see that disabling it would improve the availability of the follower in case 2.
It would widen the time window in which the follower can receive heartbeats from the leader.
 
3. Is this feature important? Can we simply revert that commit?

I am fine with making it configurable (let's move this discussion to GitHub).

I still believe advancing ticks would trigger a campaign in the follower node
but wouldn't take down the existing leader. Did the leader ever step down
because of a campaign triggered by advanced ticks + election timeout
on a follower? I might be missing some edge cases.

Gyuho Lee

Feb 16, 2018, 6:05:46 PM
to etcd-dev
I've created an issue on GitHub: https://github.com/coreos/etcd/issues/9333.

Maciej Borsz

Feb 19, 2018, 7:49:54 AM
to gyu...@gmail.com, etcd...@googlegroups.com, Joe Betz, Wojciech Tyczynski
On Sat, Feb 17, 2018 at 12:05 AM Gyuho Lee <gyu...@gmail.com> wrote:
I've created an issue on GitHub: https://github.com/coreos/etcd/issues/9333.

On Friday, February 16, 2018 at 2:51:09 PM UTC-8, Gyuho Lee wrote:
 
You mean, as long as the leader still holds its lease AND the follower receives a heartbeat before a single tick (heartbeat) elapses, right?
 
A node restarts as a follower, with advanceTicksForElection leaving it
only about one more tick before it starts campaigning. Then there are two cases:

1. The restarted follower receives a heartbeat from the leader before the last tick elapses.
Then the follower resets the election ticks that had been fast-forwarded by
advanceTicksForElection. Restarting this node doesn't trigger a leader
election; a follower receiving heartbeats implies there is an acting leader.

2. The restarted follower has not received a heartbeat from the leader, and the last tick
elapses, so it times out and starts a campaign. The follower becomes a candidate
and requests votes from its peers.

2-1. The restarted follower was not the leader and there is a leader in the cluster
  - The current leader still holds its lease, so it ignores the vote requests.

Hmm... maybe I'm misunderstanding the log entries, but here the leader does drop leadership in that case:
2018-02-11 23:38:16.450210 I | raft: f813a994322b2b40 [term: 9] received a MsgAppResp message with higher term from 9937f2a5f3634020 [term: 10]
2018-02-11 23:38:16.450261 I | raft: f813a994322b2b40 became follower at term 10
2018-02-11 23:38:16.450274 I | raft: raft.node: f813a994322b2b40 changed leader from f813a994322b2b40 to 9937f2a5f3634020 at term 10
 
where f813a994322b2b40 is the current leader and 9937f2a5f3634020 is the member that restarted.
Is this a bug?

Interestingly, the third member ignores such vote requests:
2018-02-11 23:38:16.430232 I | raft: ad782d6b7abde5c3 [logterm: 9, index: 2853, vote: f813a994322b2b40] ignored MsgVote from 9937f2a5f3634020 [logterm: 9, index: 2364] at term 9: lease is not expired (remaining ticks: 10)
 

2-2. The restarted follower was the leader and there is no acting leader
  - Whichever candidate gets a majority of votes becomes the new leader.

To summarize, there are two cases where it could trigger leader elections:

1. the restarted node was the leader, so losing the leader triggered an election.
2. restarting a follower node made the quorum inactive from the leader's viewpoint
  - e.g. restarting 1 node out of a 2-node cluster, or 2 out of 3 nodes being unavailable

To achieve high availability, we do gradual upgrades of all nodes (i.e. we restart nodes one by one), so that the etcd cluster should work ~all the time (minus some time for leadership transfer).

Now I can see that disabling it would improve the availability of the follower in case 2.
It would widen the time window in which the follower can receive heartbeats from the leader.
 
3. Is this feature important? Can we simply revert that commit?

I am fine with making it configurable (let's move this discussion to GitHub).

I still believe advancing ticks would trigger a campaign in the follower node
but wouldn't take down the existing leader. Did the leader ever step down
because of a campaign triggered by advanced ticks + election timeout
on a follower? I might be missing some edge cases.

I believe this is what happens; see the log entries above.

One more interesting thing I found: in the logs of the restarted member, the election starts before the 'peer X became active' and 'established a TCP streaming connection with peer X' log entries:

2018-02-11 23:38:15.892374 I | etcdmain: etcd Version: 3.1.11
(...)
2018-02-11 23:38:16.163995 I | rafthttp: started streaming with peer ad782d6b7abde5c3 (writer)
2018-02-11 23:38:16.164045 I | rafthttp: started streaming with peer ad782d6b7abde5c3 (writer)
2018-02-11 23:38:16.164142 I | rafthttp: started streaming with peer ad782d6b7abde5c3 (stream MsgApp v2 reader)
2018-02-11 23:38:16.192182 I | rafthttp: started streaming with peer ad782d6b7abde5c3 (stream Message reader)
2018-02-11 23:38:16.220700 I | rafthttp: started streaming with peer f813a994322b2b40 (writer)
2018-02-11 23:38:16.220760 I | rafthttp: started streaming with peer f813a994322b2b40 (writer)
2018-02-11 23:38:16.220806 I | rafthttp: started streaming with peer f813a994322b2b40 (stream MsgApp v2 reader)
2018-02-11 23:38:16.249231 I | rafthttp: started streaming with peer f813a994322b2b40 (stream Message reader)
2018-02-11 23:38:16.311931 I | raft: 9937f2a5f3634020 is starting a new election at term 9
2018-02-11 23:38:16.312034 I | raft: 9937f2a5f3634020 became candidate at term 10
2018-02-11 23:38:16.312155 I | raft: 9937f2a5f3634020 received MsgVoteResp from 9937f2a5f3634020 at term 10
2018-02-11 23:38:16.312176 I | raft: 9937f2a5f3634020 [logterm: 9, index: 2364] sent MsgVote request to f813a994322b2b40 at term 10
2018-02-11 23:38:16.312191 I | raft: 9937f2a5f3634020 [logterm: 9, index: 2364] sent MsgVote request to ad782d6b7abde5c3 at term 10
2018-02-11 23:38:16.366186 I | rafthttp: peer f813a994322b2b40 became active
2018-02-11 23:38:16.366257 I | rafthttp: established a TCP streaming connection with peer f813a994322b2b40 (stream Message writer)
2018-02-11 23:38:16.366418 I | rafthttp: established a TCP streaming connection with peer f813a994322b2b40 (stream MsgApp v2 writer)
2018-02-11 23:38:16.369265 I | rafthttp: peer ad782d6b7abde5c3 became active
2018-02-11 23:38:16.369301 I | rafthttp: established a TCP streaming connection with peer ad782d6b7abde5c3 (stream Message writer)
2018-02-11 23:38:16.369448 I | rafthttp: established a TCP streaming connection with peer ad782d6b7abde5c3 (stream MsgApp v2 writer)
2018-02-11 23:38:16.425179 I | rafthttp: established a TCP streaming connection with peer ad782d6b7abde5c3 (stream MsgApp v2 reader)
2018-02-11 23:38:16.434434 I | rafthttp: established a TCP streaming connection with peer ad782d6b7abde5c3 (stream Message reader)
2018-02-11 23:38:16.453593 I | rafthttp: established a TCP streaming connection with peer f813a994322b2b40 (stream MsgApp v2 reader)
2018-02-11 23:38:16.496026 I | rafthttp: established a TCP streaming connection with peer f813a994322b2b40 (stream Message reader)

Does it mean that the election started before we had even initialized connections to the peers?
Shouldn't we wait for the connections (which ones?) to be initialized before we start measuring the election timeout?

I really appreciate your help,
Maciej
 


Gyuho Lee

Feb 23, 2018, 6:22:33 PM
to etcd-dev
Thanks for the logs.

I was able to find the root cause of these disruptions in our Raft package.
The solution is to enable Raft Pre-Vote in the etcd layer. It won't solve all
the problems of disruptive servers, but it should improve leader election
robustness when an etcd server rejoins with advanced ticks (and thus a higher
term).
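
For reference, the Pre-Vote switch already exists on the raft package's Config; here is a minimal sketch in Go of turning it on when building a raft node (illustrative only, not the actual etcdserver wiring that the issue will track):

package example

import "github.com/coreos/etcd/raft"

// startNodeWithPreVote shows the raft-level knob. With PreVote enabled, a
// would-be candidate first asks its peers whether it could win an election
// before incrementing its term, so a member that rejoins with advanced ticks
// no longer bumps the term and forces the current leader to step down.
func startNodeWithPreVote(id uint64, peers []raft.Peer) raft.Node {
	c := &raft.Config{
		ID:              id,
		ElectionTick:    10,
		HeartbeatTick:   1,
		Storage:         raft.NewMemoryStorage(),
		MaxSizePerMsg:   1024 * 1024,
		MaxInflightMsgs: 256,
		PreVote:         true, // enable Raft Pre-Vote
	}
	return raft.StartNode(c, peers)
}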
