leader election questions

PatC

unread,

Jun 9, 2015, 4:55:13 PM6/9/15

to raft...@googlegroups.com

Hi,

Could someone clarify some questions around leader election for me?

1. Is it ok to only grant votes only as a candidate? This is just to mitigate the problem of new candidates disrupting the existing majority. As an alternative to the pre-vote idea. Or is it important that votes be granted even while in a follower state?

2. After granting a vote, is there some period of time where the member stops granting further votes? I can imagine a member receiving two request votes with larger and different terms and immediately granting both of them. This would immediately cause two leaders to exist.

3. I have an async implementation and wondering if the following is a reasonable optimization:

When a request vote is received and the sender has an old log, respond with granted=false.

Otherwise don't respond at all.

When a granted=false reply is received, go to follower mode since it's impossible to ever be a leader

thanks

Kijana Woodard

unread,

Jun 9, 2015, 5:31:13 PM6/9/15

to PatC, raft...@googlegroups.com

2. One vote per term. The leader in the [now] older term will quickly discover that it's no longer the leader and step down. The higher term leader sending heartbeats is one mechanism for that to happen. IIRC, the higher term requestVote will cause the older term leader to step down [or never ascend depending on timing].

3. Don't respond at all? When the sender has an up to date or newer log, never send granted=true? Not sure I follow.

--
You received this message because you are subscribed to the Google Groups "raft-dev" group.
To unsubscribe from this group and stop receiving emails from it, send an email to raft-dev+u...@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Oren Eini (Ayende Rahien)

unread,

Jun 9, 2015, 7:08:50 PM6/9/15

to PatC, raft...@googlegroups.com

1) No, if you wait until you get to candidate status, you'll have a lot more conflicted elections. What you can do is refuse to grant a vote if you got a recent heartbeat from the leader. That wait a single disconnected node can't interrupt operations.

Note that this also require pre voting feature, otherwise on join it will force an election.

2) That can happen quite easily, if you have two candidates that want to vote, and they failed, then fail again, etc. You can end up with a permenant hang elections.

3) Do NOT ever not respond. You want to send a reply back (maybe "don't bother talking to me again" flag), if only so you can trace it so operations can tell what is going on

Hibernating Rhinos Ltd

Oren Eini l CEO l Mobile: + 972-52-548-6969

Office: +972-4-622-7811 l Fax: +972-153-4-622-7811

--

jordan.h...@gmail.com

unread,

Jun 9, 2015, 7:28:40 PM6/9/15

to Oren Eini (Ayende Rahien), raft...@googlegroups.com

On Jun 9, 2015, at 4:08 PM, Oren Eini (Ayende Rahien) <aye...@ayende.com> wrote:

1) No, if you wait until you get to candidate status, you'll have a lot more conflicted elections. What you can do is refuse to grant a vote if you got a recent heartbeat from the leader. That wait a single disconnected node can't interrupt operations.
Note that this also require pre voting feature, otherwise on join it will force an election.

2) That can happen quite easily, if you have two candidates that want to vote, and they failed, then fail again, etc. You can end up with a permenant hang elections.

I wouldn't say this is permanent. In my experience prior to implementing the pre-vote protocol, multiple candidates could simply force some arbitrary number of rounds of elections, but sufficient randomization of election timeouts was still enough to ensure one candidate eventually won an election before the other could block it. The biggest thing the pre-vote protocol gets you is more predictable elections through less rounds.

On a side note, to the OP's point it is possible for two leaders to exist in some scenarios, but even so only one will be able to commit operations and the other will eventually step down. For instance, a simple scenario is a partition wherein the leader is partitioned from the rest of the cluster, so a new leader is elected on the quorum side. A good Raft implementation will have the first leader step down once it can no longer contact a majority of the cluster so that clients can connect to the new leader, but it's not invalid for the leader to persist. Once the partition is healed, the new leader will simply force the old leader to step down.

Oren Eini (Ayende Rahien)

unread,

Jun 9, 2015, 7:30:37 PM6/9/15

to jordan.h...@gmail.com, raft...@googlegroups.com

The main reason we implemented pre vote wasn't to get more predictable elections. It was to avoid a disconnected node reconnecting to the cluster and forcing us to go to election.

PatC

unread,

Jun 9, 2015, 7:50:13 PM6/9/15

to raft...@googlegroups.com, pat...@gmail.com

On Tuesday, June 9, 2015 at 2:31:13 PM UTC-7, Kijana Woodard wrote:

2. One vote per term. The leader in the [now] older term will quickly discover that it's no longer the leader and step down. The higher term leader sending heartbeats is one mechanism for that to happen. IIRC, the higher term requestVote will cause the older term leader to step down [or never ascend depending on timing].

Yes, one vote per term. I guess after granting candidate for term 5, it would actually be better to immediately grant candidate for term 6, so it can become leader faster and override the leader at term 5.

3. Don't respond at all? When the sender has an up to date or newer log, never send granted=true? Not sure I follow.

Sorry, I meant to say:

If a request vote is received with a later log, respond with granted=true

If a request vote is received with an old log, respond with granted=false.

Otherwise if the log is the same but the sender's term is lower, don't reply.

When a granted=false reply is received, go to follower mode since it's impossible to ever be a leader.

This seems like an unworkable idea, as I try to explain it.

The case I haven't quite figured out is this one.

Suppose a candidate with an old log, sends out a request vote with the highest term.

Then according to a raft rule, all receivers with a lower term will become followers.

Which doesn't seem right because the candidate has an out-of-date log and can never become leader.

Of course, the followers will then become candidates at some point and then the rightful leader will be re-elected.

Is there a way to tell that out-of-date candidate to just become a follower?

Does the pre-vote solve this?

In the case of a pre-vote, if the candidate's current term is bigger than everyone else's, does he win the prevote?

Or do you ignore the current terms in the pre-vote and vote just based on the log state?

Oren Eini (Ayende Rahien)

unread,

Jun 9, 2015, 7:52:24 PM6/9/15

to PatC, raft...@googlegroups.com

That is what pre vote does, yes

Hibernating Rhinos Ltd

Oren Eini l CEO l Mobile: + 972-52-548-6969

Office: +972-4-622-7811 l Fax: +972-153-4-622-7811

PatC

unread,

Jun 10, 2015, 1:46:43 PM6/10/15

to raft...@googlegroups.com, pat...@gmail.com

thanks for all the help so far.

Do I understand this correctly?

When a node receives a RequestVote with a newer term and an older log, the node should

1. update currentTerm

2. reject the vote

3. become a follower

What would break if I made a minor adjustment in this case? If the node is a leader, do steps 1 and 2 and skip 3, avoid becoming a follower.

thanks

Oren Eini (Ayende Rahien)

unread,

Jun 11, 2015, 2:30:29 AM6/11/15

to PatC, raft...@googlegroups.com

If its log is more up to date than the candidate, it should reject the vote, yes.

And you can't assume that the vote will go to the leader only

PatC

unread,

Jun 11, 2015, 3:02:36 AM6/11/15

to raft...@googlegroups.com, pat...@gmail.com

With pre-voting, having a term greater than the leader is much more rarer than without pre-voting.

And so I'm now ok letting a seemingly perfectly fine leader revert to follower status, just because it probably handles some unusual edge case.

thanks for the info.

Diego Ongaro

unread,

Jun 11, 2015, 7:21:36 PM6/11/15

to PatC, raft...@googlegroups.com

I think you guys worked it out, but I want to make sure you're not doing this:

When a node receives a RequestVote with a newer term and an older log, the node should
1. update currentTerm
2. reject the vote
3. become a follower

What would break if I made a minor adjustment in this case? If the node is a leader, do steps 1 and 2 and skip 3, avoid becoming a follower.

If you're a leader in a particular term and you change your term, you better not stay as a leader! That'd be a great way to get multiple leaders in the current term.

When a granted=false reply is received, go to follower mode since it's impossible to ever be a leader.

That might have an adverse effect on your election performance. In normal Raft, if a candidate's log is good enough to get votes from a majority, even if it doesn't have the best log around, it can still become leader.

-Diego

PatC

unread,

Jun 11, 2015, 9:53:10 PM6/11/15

to raft...@googlegroups.com, pat...@gmail.com

thank god, no :) I'm on to the next questionable variant.

The reason for not taking the stock implementation as described in the summary is because of disruptions. When we tried the basic set of rules, the system was too eager to start an unnecessary election. We're looking for ways to reduce such disruptions because failover is costly.

The variant we're trying now is:

1. fixed timeout that converts non-candidates to candidates.

2. candidates sends out pre-votes at random intervals. If a pre-vote succeeds, increment the term and immediately send out a request vote.

3. only candidates respond to request votes and can grant or deny votes

4. any servers will become followers if term > currentTerm for all messages (even pre-votes)

When there's a functioning majority, restarted and reattached servers don't appear to cause a new election.

BTW, It feels like the choice of the basic raft rules was designed to minimize the number of rules so it could fit on a page, at the expense of needless elections. And I think that's a fine choice. However, it would be nice to have an official "extension rule set" version that is designed to reduce needless elections. That is, instead of some ideas written in text form, actual set of rules and variable names that implement the extension, just like the condensed summary.

cheers

Oren Eini (Ayende Rahien)

unread,

Jun 12, 2015, 2:23:12 AM6/12/15

to PatC, raft...@googlegroups.com

PatC,

See what happens when you take down your leader.

Especially with cluster size of 5 or 7.

Because of 1, all nodes will become candidate immediately. They will all vote for themselves, then ask all others to vote for them, which won't happen because they already voted.

Then you would go into the hung election recovery, one of them would timeout first ,and succeed in getting the election.

So you probably would have 2 or more election cycles, where as if you had randomize election timeout and followers voting, you have much greater chance for the system to handle this in a single election

Also, if a pre vote force me to become follower, what happens when a node is disconnected, increment its term for pre vote, then reconnected, and ask for pre vote?

You force election this way

PatC

unread,

Jun 12, 2015, 10:11:54 AM6/12/15

to raft...@googlegroups.com, pat...@gmail.com

Hi,

candidates do wait a random time before sending out a pre-vote. That hasn't gone away (see #2). The first one to become a candidate will wait a random time and then send out a vote. The second one to become a candidate will grant the vote immediately, since it will most likely be waiting for it's random period before sending out a vote.

Nodes are not allowed to increment it's term for pre-votes. Only if it wins the pre-vote is it then allowed to increment it's term. So there's no way an election can be forced.

cheers

Kijana Woodard

unread,

Jun 12, 2015, 10:27:07 AM6/12/15

to PatC, raft...@googlegroups.com

"To begin an election, a follower increments its current

term and transitions to candidate state. It then votes for

itself and issues RequestVote RPCs..." http://ramcloud.stanford.edu/raft.pdf p.6

The first thing a candidate does is vote for itself, precluding granting a vote.

Reading through the post, I still don't understand why the stock algorithm is worse than moving into uncharted territory. If you were writing a new paper, sure, but it doesn't strike me as a good idea for an implementation.

In your first post, you were trying to do away with the pre-vote. Now it looks like you're fine with a pre-vote, so why keep the "only candidates can vote" idea?

As a side note, even as an independent concept focusing on the English words, of course candidates wouldn't vote for other candidates. What politician would do that? ;-0

Archie Cobbs

unread,

Jun 12, 2015, 12:02:29 PM6/12/15

to raft...@googlegroups.com

Something about this whole conversation is bugging me. I think it's the continual "tweaking" that's going on without stepping back and looking at the big picture - coming up with a rigorous approach.

I'm thinking out loud here....

First, what are we trying to accomplish?

Let's assume the problem we are trying to solve is the disruption in this specific scenario: a node gets disconnected and then reconnected. For simplicity, let's assume the node's ability to communicate is "all or none" (e.g., Ethernet cable falls out, is plugged back in; backhoe severs fiber, crew repairs fiber). I think this is a fair model for the way these failures commonly happen in the real world but am not an expert on that....

If this happens, the node will become a candidate, try repeated failed elections, and then when it's reconnected it's higher term will force the current leader to be deposed, causing disruption for everyone.

Goal: what's the simplest solution to this specific problem?

First a couple of observations...

What occurs in this situation is a duration of time in which the node cannot communicate with any other nodes. Call this condition isolatedness.
It would be easy for a node to detect entering or leaving the Isolatedness condition, give or take a minimum election timeout:

When normal follower election timeout expires, detect entering the Isolatedness condition and start sending "pings" to all other nodes
If zero nodes have responded to a "ping" within a minimum election timeout, continue in the Isolatedness condition
If some node responds to a "ping" within a minimum election timeout, or any other Raft message is received, exit the Isolatedness condition

Assume the "pings" are used merely to detect communication ability and otherwise have no affect on the Raft state of pinged nodes (pings should have sequence numbers so we can discard delayed responses).

Now suppose we add the following:

Every time a follower node enters Isolatedness it stops all Raft activity, essentially playing dead... except for the "ping" probes.
Every time leaves isolatedness, it goes back to being a follower but resets its election timer.

This would essentially be adding a new Raft state Isolated. There would be two new state transitions (in both directions) between Follower and Isolated, for #1 and #2 above.

Key observation: from the point of view of Raft state, a node entering and leaving the Isolated behaves exactly like a node that was powered down and then restarted.

Since we know Raft is all rigorous and proven correct in that case, and we are not changing any node's Raft state in any other way, we inherit that correctness even with this new state added. QED :)

But we have also solved the stated problem: because the node resets its election timer when leaving the isolated state, it will hear from the current leader (if still there), and since it has not advanced it term, quietly rejoin the cluster.

This solution is also robust in the face of rapid disconnect/reconnect transitions, because the minimum election timeout delay provides for hysteresis.

From what I can tell this is the simplest solution that is provably correct to the specific problem stated.

-Archie

Kijana Woodard

unread,

Jun 12, 2015, 12:41:19 PM6/12/15

to Archie Cobbs, raft...@googlegroups.com

I think you just described the stock pre vote mechanism.

From: Archie Cobbs
Sent: ‎6/‎12/‎2015 11:02 AM
To: raft...@googlegroups.com
Subject: Re: leader election questions

Archie Cobbs

unread,

Jun 12, 2015, 12:48:03 PM6/12/15

to raft...@googlegroups.com, archie...@gmail.com

On Friday, June 12, 2015 at 11:41:19 AM UTC-5, Kijana Woodard wrote:

I think you just described the stock pre vote mechanism.

I know this is similar to that, but I think it's simpler and easier to prove correct. E.g. there is no vote required, pre- or otherwise.

-Archie

patrick chan

unread,

Jun 12, 2015, 1:40:19 PM6/12/15

to Kijana Woodard, raft...@googlegroups.com

Hi Kijana,

pre-votes only solve half of the problem. The other half is the minimum election timeout idea. Without it, a candidate with current logs will force an election. If you're interested in reducing disruptions (i.e. needless elections), you really need both.

There was also one other desire - decouple the election timeout from the random period that request votes are sent out. In stock raft, they're the same. We need to be conservative with the election timeout (i.e. 5+ seconds) because our network connectivity is sometimes flakey, but be aggressive with the request vote period (i.e. 150ms).

We can satisfy all the requirements with two election timeouts - one for candidates and one for non-candidates:

1. random non-candidate election timeout T1

2. random candidate election timeout T2

3. followers reply to request votes only if no heartbeat has been heard within the minimum T1

But after looking at all the new additions, we just moved things around a little and settled on something that should be equivalent but we thought a little easier to explain, code up, and observe in production:

1. fixed non-candidate election timeout T1

2. random vote period T2

3. only candidates respond to request votes

We're hoping with these changes, no server that is not already part of a functioning majority can ever cause a disruption. But we're a little worried we might have introduced a case where the candidate can never join a functioning majority.

---

Candidates in raft are reluctant leaders. They'll become leaders only if no-one else wants to. And that's why leaders eagerly step down when a noisy candidate shows up :)

cheers

PatC

unread,

Jun 12, 2015, 2:43:06 PM6/12/15

to raft...@googlegroups.com

I'm not liking all the tweaking that's going on either. It doesn't take much to break safety.

Stepping back is a good idea. But I wouldn't focus on that particular scenario. I think the focus should be minimum disruptability. Several ideas are mentioned in the dissertation and actually recommended for a real deployment. But we're all here implementing them as tweaks.

The stock raft paper demonstrates a really useful algorithm with the least amount of rules. But some of us are finding out there needs to be a little more for it to be practical. In a way, I feel that Raft also suffers from the same statement it makes about multi-paxos. Re-phrased from the dissertation:

....One reason is that there is no widely agreed-upon algorithm for Minimum Disruptive Raft...sketched possible approaches....but these differ from others...details have not been published.

I think what's needed is another version of the condensed summary that folds in pre-votes and minimum election timeout and just lays it all out in a "understandable" form. Should it be a different message or a flag in request vote? Some extra rules for different roles. etc. I would also throw in the no-op issue as well since it's important in practice. E.g. should we always insert a no-op right after an election? etc.

Cheers

Archie Cobbs

unread,

Jun 12, 2015, 2:59:11 PM6/12/15

to raft...@googlegroups.com

On Friday, June 12, 2015 at 1:43:06 PM UTC-5, PatC wrote:

I'm not liking all the tweaking that's going on either. It doesn't take much to break safety.

Stepping back is a good idea. But I wouldn't focus on that particular scenario. I think the focus should be minimum disruptability. Several ideas are mentioned in the dissertation and actually recommended for a real deployment. But we're all here implementing them as tweaks.

The stock raft paper demonstrates a really useful algorithm with the least amount of rules. But some of us are finding out there needs to be a little more for it to be practical. In a way, I feel that Raft also suffers from the same statement it makes about multi-paxos. Re-phrased from the dissertation:
....One reason is that there is no widely agreed-upon algorithm for Minimum Disruptive Raft...sketched possible approaches....but these differ from others...details have not been published.

I think what's needed is another version of the condensed summary that folds in pre-votes and minimum election timeout and just lays it all out in a "understandable" form. Should it be a different message or a flag in request vote? Some extra rules for different roles. etc. I would also throw in the no-op issue as well since it's important in practice. E.g. should we always insert a no-op right after an election? etc.

Well said. I agree with all that, with one comment: before you can focus on "minimum disruptability" you have to precisely define what that means. Otherwise it becomes a wild goose chase of "oh yeah, what if this happens, or wait, what if that happens" which leads to all the tweaks.

There is an infinite taxonomy of possible network failures. Unless you precisely state exactly which failure scenarios you are targeting and declare all others as non-goals, you will not know when to stop tweaking.

In practice, probably only 1% of network failure scenarios will account for 99% of actual network failures. Spending effort on all the other odd-ball 99% that will never happen is a waste of time.

My earlier idea works reliably (only) for "all or none" disconnection failures. I'm guessing that type of failure qualifies as one of those special 1% that is actually likely to happen.

I'm sure there are others, but I don't know what they are.

What other network failures exist that people are concerned about, and what evidence and/or experience backs up the assertion that they are actually likely to happen in real life?

Once we have all the failure scenarios we care about inventoried, then (and only then) it will be time to figure out the best way to address them.

-Archie

Oren Eini (Ayende Rahien)

unread,

Jun 12, 2015, 3:49:08 PM6/12/15

to patrick chan, Kijana Woodard, raft...@googlegroups.com

I really don't like the "non candidate" election mode.

In my impl, we have the standard rules, with the following timeouts.

Election timeout - default to 2 seconds (in practice, 1.5 - 2.5 seconds randomized timeout)

Heartbeat timeout - default to 200 ms.

Vote allowed after heartbeat missed - default to 1 second.

Follower that gets a RequestVote will reject it if it got a heartbeat from the leader within 1 second.

That prevent distruptive elections.

Kijana Woodard

unread,

Jun 12, 2015, 3:49:43 PM6/12/15

to Archie Cobbs, raft...@googlegroups.com

"I'm not liking all the tweaking that's going on either. It doesn't take much to break safety."

Completely agree.

"pre-votes only solve half of the problem. The other half is the minimum election timeout idea. Without it, a candidate with current logs will force an election."

Why do you say that?

From the full dissertation, page 42*. [emphasis mine]

"Raft’s solution uses heartbeats to determine when a valid leader exists. In Raft, a leader is

considered active if it is able to maintain heartbeats to its followers (otherwise, another server will

start an election). Thus, servers should not be able to disrupt a leader whose cluster is receiving

heartbeats. We modify the RequestVote RPC to achieve this: if a server receives a RequestVote

request within the minimum election timeout of hearing from a current leader, it does not update its

term or grant its vote."

Within the minimum is key there. A follower with a leader rejects pre-vote & RequestVote RPCs. If the minimum election timeout has passed, the follower can now grant. The follower has [probably] not converted to candidate [randomized to election timeout x 2].

Restated on page 137.

"In the Pre-Vote algorithm, a candidate

only increments its term if it first learns from a majority of the cluster that they would be willing

to grant the candidate their votes (if the candidate’s log is sufficiently up-to-date, and the voters

have not received heartbeats from a valid leader for at least a baseline election timeout)."

...

"The Pre-Vote algorithm solves the issue of a partitioned server disrupting the cluster when it

rejoins. While a server is partitioned, it won’t be able to increment its term, since it can’t receive

permission from a majority of the cluster. Then, when it rejoins the cluster, it still won’t be able

to increment its term, since the other servers will have been receiving regular heartbeats from the

leader. Once the server receives a heartbeat from the leader itself, it will return to the follower state

(in the same term)."

* the Answer to the Ultimate Question of Life, the Universe and Everything. ;-)

--

Oren Eini (Ayende Rahien)

unread,

Jun 12, 2015, 3:49:52 PM6/12/15

to Archie Cobbs, raft...@googlegroups.com

You assume that a single node would be split from the network.

But what about two nodes being split in a 5 node cluster?

Hibernating Rhinos Ltd

Oren Eini l CEO l Mobile: + 972-52-548-6969

Office: +972-4-622-7811 l Fax: +972-153-4-622-7811

--

Archie Cobbs

unread,

Jun 12, 2015, 3:57:53 PM6/12/15

to raft...@googlegroups.com, archie...@gmail.com

On Friday, June 12, 2015 at 2:49:52 PM UTC-5, Ayende Rahien wrote:

You assume that a single node would be split from the network.
But what about two nodes being split in a 5 node cluster?

Good point: it's probably fair to say that this is another network failure scenario that is likely to actually occur in real life.

So let's add that scenario to the list of ones we care about.

Actually, we can just generalize the previous "completely disconnected" condition to "disconnected from a majority".

So redefine the state of being "Isolated" as: not able to successfully ping a majority of cluster nodes (majority includes yourself).

Then that should cover the 2/5 case as well.

-Archie

Oren Eini (Ayende Rahien)

unread,

Jun 12, 2015, 3:59:34 PM6/12/15

to raft...@googlegroups.com, Archie Cobbs

But you just introduced another state, and for what?

What does this get you?

Hibernating Rhinos Ltd

Oren Eini l CEO l Mobile: + 972-52-548-6969

Office: +972-4-622-7811 l Fax: +972-153-4-622-7811

Archie Cobbs

unread,

Jun 12, 2015, 4:10:35 PM6/12/15

to raft...@googlegroups.com, archie...@gmail.com

On Friday, June 12, 2015 at 2:59:34 PM UTC-5, Ayende Rahien wrote:

But you just introduced another state, and for what?
What does this get you?

What you get is the following: if a node becomes disconnected (where disconnected means unable to communicate with a majority of the cluster), and then some arbitrarily long amount of time passes, and then the node becomes reconnected, then that node will not disrupt the cluster by forcing a new election; instead the node will silently rejoin the cluster without disruption.

Moreover, an easy argument shows that with this change you still preserve all of Raft safety: from a global Raft state point of view, the client is behaves exactly as if you had powered it down and then started it back up, which we already know is safe.

-Archie

Archie Cobbs

unread,

Jun 12, 2015, 4:12:48 PM6/12/15

to raft...@googlegroups.com

In particular, this also solves a problem that I've actually seen myself: a node starts up before its network is fully functional. Then when the network finally turns on, boom, the node disrupts the cluster. That problem would go away, for the same reason.

-Archie

Oren Eini (Ayende Rahien)

unread,

Jun 12, 2015, 4:21:12 PM6/12/15

to raft...@googlegroups.com

Or, use pre votes, and the issue is gone

Hibernating Rhinos Ltd

Oren Eini l CEO l Mobile: + 972-52-548-6969

Office: +972-4-622-7811 l Fax: +972-153-4-622-7811

--

Kijana Woodard

unread,

Jun 12, 2015, 4:22:21 PM6/12/15

to raft...@googlegroups.com

I'm assuming there's a reply coming to my most recent post with the quotes from the dissertation...or did that get lost in all the settings changes?

patrick chan

unread,

Jun 13, 2015, 12:10:12 PM6/13/15

to raft...@googlegroups.com

Hi,

I read and re-read your response. Sorry I don't get your point.

I claim you need to implement both pre-votes and minimum election timeout if you want to reduce disruptions. One isn't enough.

Then you ask "why do I say that?" and quote those parts from the dissertation that prompted me to say exactly that. So you seem to be both disagreeing and agreeing with me at the same time :) Can you clarify?

thanks

patrick chan

unread,

Jun 13, 2015, 12:15:16 PM6/13/15

to Archie Cobbs, raft...@googlegroups.com

I think I would try to define "disrupt" as causing an election that didn't have to happen. Then you can apply all your scenarios and see if it would cause a needless election.

cheers

--

Kijana Woodard

unread,

Jun 13, 2015, 12:41:20 PM6/13/15

to raft...@googlegroups.com

I confused the matter by quoting without attribution. IIRC, my first response was to you and the second was to Archie.

My quotes from the dissertation was to assert that pre vote and a minimum election timeout are part of, well, the dissertation.

What are we changing?

From: patrick chan
Sent: ‎6/‎13/‎2015 11:10 AM

To: raft...@googlegroups.com
Subject: Re: leader election questions

--

Diego Ongaro

unread,

Jun 17, 2015, 3:35:27 AM6/17/15

to raft...@googlegroups.com, Heidi Howard

I'm bringing some of the "is it ok for a leader to reject pre-vote rpc?" discussion into here, since I couldn't see how the two threads differed. I'm sorry my email got long, but I wanted to address several of the points in these two lengthy threads. If you're replying to a small piece of this that is relatively independent of the rest and likely to generate more discussion, consider starting a new thread. Just try not to split the same discussion topic into multiple concurrent threads again.

Timeouts

To avoid any more confusion, let me define minimum election timeout or baseline election timeout (I think I've use both terms interchangeably). If an election timeout is chosen from the random range A to B, the minimum election timeout is A.

In a typical Raft implementation (am I allowed to claim this? I have no data on what's typical), election timeouts are chosen from the uniform random distribution between T and 2*T, where T is a configuration setting. Every time the start-new-election timer is set (as follower or as candidate), it chooses a random value from T to 2*T. The minimum election timeout is then simply T. The heartbeat time is typically T/2.

Heidi Howard and her collaborators have explored splitting the election timeout into the follower timeout (follower-to-candidate) and candidate timeout (candidate-to-candidate) as Patrick is suggesting. They've published some simulation results on the impact this has. See the ARC and Raft Refloated papers listed on the Raft website: https://raftconsensus.github.io/#pubs .

We (John and I) chose to unify the two timeouts in the Raft paper/dissertation to make things a little simpler. Personally, I worry about the risks of using a smaller candidate timeout: when you're already in a split vote, things are looking dim; is that the right time to switch to more aggressive timeouts? Obviously that's coming from Diego intuition, not data about what's good in the real world.

Avoiding Disruptions

First off, if you're doing membership changes, servers shouldn't update their terms based on RequestVote requests for a minimum (baseline) election timeout after receiving a heartbeat. I consider that tweak required if you support membership changes. All the other approaches I know about don't quite handle all cases of disruptive servers during and after membership changes. (You can make a best effort to tell a server it's removed from the cluster, but that might fail, and you can't retry forever.)

Otherwise, if you're worried about disruptions of various sorts, Archie's right: we need a failure model. What types of failures do we care about? The goal is that any election-free failure behavior must not cause elections (under various timing assumptions), and behavior that falls outside of that model may cause elections.

In the core Raft algorithm, basically any server or long-ish link failure may cause an election (leader upon failure and rejoin, follower upon rejoin). Depending on how frequently you send/retransmit heartbeats, the loss of a single heartbeat might be considered election-free behavior, but if basically anything else goes wrong, you might (likely) see an election.

Adding the pre-vote phase extends the set of election-free behaviors. With pre-vote (including the minimum election timeout restriction), a minority of followers disconnecting and rejoining is election-free. They can't increment their terms while they're disconnected from the majority because of the pre-vote check. Once they do regain communication with a majority, they're not able to disrupt the leader (minimum election timeout on majority of servers that have been getting heartbeats prevents candidate from increasing its term). And they'll get a heartbeat within a timeout period of regaining connectivity with the leader. It turns out that's not the only behavior that's election-free in pre-vote, however; keep reading.

Archie suggests pinging a majority of nodes as a simpler alternative to the pre-vote; I'll call this majority-ping. I defined pre-vote in the dissertation as follows: return success "if the candidate's log is sufficiently up-to-date, and the voter [has] not received heartbeats from a valid leader for at least a baseline election timeout". In comparison, majority-ping just returns success (any time there is network connectivity).

So now we should ask: how does the set of election-free behaviors for pre-vote compare to the set of election-free behaviors for majority-ping? I believe majority-ping also handles a minority of followers disconnecting and rejoining, but are there election-free behaviors in pre-vote that are not election-free in majority-ping?

Consider a cluster where the leader can't communicate with one follower but that follower can communicate with everyone else (bidirectionally). In pre-vote, the others deny that follower the ability to increment its term, and no disruption occurs. In majority-ping, however, that follower would be allowed to increment its term and hold an election (now called 'candidate'). The other followers will ignore the candidate, since they're receiving heartbeats from the leader. However, once that network link regains connectivity, the leader will send a heartbeat to the candidate, and its heartbeat response will cause a disruption.

What, precisely, are the set of election-free behaviors in pre-vote and majority-ping? I think that's still an open question. When it came to writing my dissertation, I knew pre-vote would help in some important cases. I still don't know whether there are other important cases it does not help with. The harder part is that I still don't know what all the important cases are (what is the minimum election-free set that we'd be happy with?). Is majority-ping recommendable? Is it good enough? Is the additional if statement in pre-vote worth the cost to avoid that class of disruptions? I don't know.

I showed above how the timeout component of the pre-vote check may be valuable (it extends the election-free set of behaviors to include a temporary link failure between just the leader and a minority of followers). What about the up-to-date log component of the check? Any distinguishing behaviors there might be pretty contrived. But it's there in part for symmetry with the RequestVote processing: they're the same conditions as whether a server would grant its vote, hence the name pre-vote, so I guess it felt like a reasonable thing to propose. It also might help election performance, since no votes will be cast on a candidate that can't win.

In the case of LogCabin, I have no current plans to implement pre-vote. That means that any flaky server can cause elections. Does that bother you? In a three-server cluster, there's a 1 in 3 chance that the flaky server is the leader, and no Raft tweak will prevent leader election then. The client has to be able to tolerate outages during elections. If it can tolerate those with frequency F but not with frequency 3F or 5F, are you so sure F is ok? We'd like to have very robust systems, certainly, but we should view that from a whole-system perspective, including the clients. I know that adding pre-vote could reduce the number of elections in LogCabin, but having an unnecessary election every now and then can help ensure that clients are written to tolerate election-length outages. Of course, if I had data showing that pre-vote would help reduce overall complexity or provide a meaningfully better level of service, I'd make it a priority.

Patrick's critique that "....One reason is that there is no widely agreed-upon algorithm for Minimum Disruptive Raft...sketched possible approaches....but these differ from others...details have not been published." is a little harsh but also contains some truth. It's underspecified in my dissertation because I didn't have the answers, and I still don't. What are the election-free behaviors in Minimum Disruptive Raft, and who should implement it versus regular Raft? Where's the data? Where's the proof? It's going to take some more time until we get there, and I chose to graduate before then (how selfish, I know). It's a question I think we'll be better able to answer after a few years of experience running Raft-based systems in the real world.

"Should it be a different message or a flag in request vote?" Relatedly, should there be a fourth state between follower and candidate during the pre-vote? (Maybe call it "presumptive candidate" https://en.wikipedia.org/wiki/Candidate#Presumptive_candidate or probably just "pre-candidate"?) I didn't want to overspecify this in my dissertation without having implemented it (beyond a hacky simulation). I suspect you'd get better code reuse by sticking with RequestVote and distinguishing candidates and pre-candidates with a boolean flag, but maybe it'd be cleaner to have separate messages and distinct candidate and pre-candidate states. The implementation language may play a role here, as its type system may or may not lend itself to code reuse across different message types. If any of you have opinions after implementing pre-vote and feel like you understand the design considerations, I'd encourage you to write up the discussion and maybe propose some specifics, including naming conventions. Maybe we'd end up with a couple of reasonable options and some shared vocabulary.

Minimum election timeout, no-op flag, leader step down timer: those probably should be in the condensed summary, but they're also not strictly needed. If you implemented a system with static membership, where all read-only queries are committed as log entries, and clients time out when a leader doesn't respond in a timely manner, then you don't need any of that. Maybe we were sweeping complexity under the rug, but these things are all technically optional. If only we had A4 paper.

In summary: Maybe fewer of you should be implementing pre-vote. Those of you that do should have a solid understanding of why you're doing it. And sorry if my dissertation leaves you hanging there, but it doesn't have to be the final word on this subject. At least we have a common starting point (regular Raft), and we can explore the motivation for and the design trade-offs of a Minimally Disruptive Raft over time together.

-Diego

Archie Cobbs

unread,

Jun 17, 2015, 12:16:57 PM6/17/15

to raft...@googlegroups.com, heidiho...@gmail.com

On Wednesday, June 17, 2015 at 2:35:27 AM UTC-5, Diego Ongaro wrote:

I'm bringing some of the "is it ok for a leader to reject pre-vote rpc?" discussion into here, since I couldn't see how the two threads differed.

Hi Diego,

Thanks for taking the time to add your usual clear-headed thinking to the discussion.

I don't have anything to add, other than I thought it might be worthwhile to more precisely describe the "majority ping" algorithm that I ended up using. As you correctly point it, on the spectrum of (a) more simplicity and having fewer election-free behaviors vs. (b) more complexity and more election-free behaviors, this is on the (a) end. I simply haven't seen or done the research that would justify moving further toward the other direction (e.g., how likely asynchronous network failures are in real life, etc.).

This is a simple as I could get it: for "majority ping", define a new "pinger" Raft state. This state has an election timer just like follower and candidate do.

If a follower's election timer fires, the follower becomes a pinger:

pInger immediately sends one ping request to each other node in the cluster
pinger restarts the election timer (for simplicity, can use the same randomized interval)

If the election timer fires while a pinger:

If a response to the most recent ping request was received from a majority of nodes, transition to candidate

Note, we always guarantee that the leader has at least a minimum election timeout in which to contact us before we become a candidate

Otherwise, rinse & repeat: send out another round of pings and restart the election timer

If an AppendEntries or RequestVote is received, immediately revert to follower, restart election timer, and then handle normally
If a PingRequest is received in any state (follower, leader, candidate, pinger), reply with ping response.

Note it is not necessary to include current term in either ping request or response.

Hand-wavy proof of correctness:

The ping messages do not change any existing state on other nodes, so their behavior does not change and is therefore still safe
From the outside world's perspective, a follower behaving according to this modified algorithm behaves like a traditional follower that is temporarily shutdown for the duration of time that it is in the pinger state, so its modified behavior should also be safe

Oren Eini (Ayende Rahien)

unread,

Jun 17, 2015, 3:53:59 PM6/17/15

to raft...@googlegroups.com, heidiho...@gmail.com

Archie,

But a scenario like the one Diego described, where you have a node that can't talk to the leader, but can talk to the rest of the cluster...

A (leader) , B & C followers

Then we have:

A <-////-> B

A <-----> C

B <-----> C

B doesn't get a notice, does get a majority ping (itself and C), then force an election.

Hibernating Rhinos Ltd

Oren Eini l CEO l Mobile: + 972-52-548-6969

Office: +972-4-622-7811 l Fax: +972-153-4-622-7811

Archie Cobbs

unread,

Jun 17, 2015, 3:59:45 PM6/17/15

to raft...@googlegroups.com, heidiho...@gmail.com

On Wednesday, June 17, 2015 at 2:53:59 PM UTC-5, Ayende Rahien wrote:

But a scenario like the one Diego described, where you have a node that can't talk to the leader, but can talk to the rest of the cluster...

...is not handled by "ping majority". That's an intentional trade-off between complexity vs. % coverage of possible failure scenarios.

Ben Darnell

unread,

Jun 17, 2015, 4:21:02 PM6/17/15

to raft...@googlegroups.com, Heidi Howard

On Wed, Jun 17, 2015 at 3:35 AM, Diego Ongaro <onga...@gmail.com> wrote:

Avoiding Disruptions

First off, if you're doing membership changes, servers shouldn't update their terms based on RequestVote requests for a minimum (baseline) election timeout after receiving a heartbeat. I consider that tweak required if you support membership changes. All the other approaches I know about don't quite handle all cases of disruptive servers during and after membership changes. (You can make a best effort to tell a server it's removed from the cluster, but that might fail, and you can't retry forever.)

I see how this avoids disruptions, but doesn't it cause other problems? Suppose node C is partitioned away and increments its term, so its term is now ahead of the majority A and B. If A and B refuse to increase their terms to match, then C cannot force an election, but its term remains "in the future" so it will not recognize messages from the current leader and remain frozen in its isolated state. How do you bring node C back into the fold? Do A and B somehow tell C to decrease its term?

Where (other than this email) is this "required tweak" discussed? I don't think I've seen it before.

-Ben

Kijana Woodard

unread,

Jun 17, 2015, 5:35:54 PM6/17/15

to raft...@googlegroups.com

The quoted idea is for the case of membership changes so that a recently removed server can't disrupt the cluster before being turned off.

This is where Diego went on to discuss "the set of election-free behaviors".

For temp partition, C won't update it's term if it doesn't get sufficient positive PreVote responses.

"Where (other than this email) is this "required tweak" discussed? I don't think I've seen it before."

In the full dissertation, but it wasn't referred to as "required" there.

--

Diego Ongaro

unread,

Jun 18, 2015, 2:01:08 PM6/18/15

to raft...@googlegroups.com

Relying on asymmetry: the term in the RequestVote request is not adopted, but the term in the AppendEntries response is. The leader (A or B) will send a heartbeat to C, C will reply to that heartbeat with its newer term, and the leader will step down and adopt that newer term.

Where (other than this email) is this "required tweak" discussed? I don't think I've seen it before.

In the last two paragraphs of the cluster membership changes section of the conference paper and Extended Version paper, as well as in the dissertation. We added this fairly late in the game, so if you've mostly read pre-release drafts of the paper, it may not have been present there.

-Diego

Ben Darnell

unread,

Jun 18, 2015, 2:12:30 PM6/18/15

to raft...@googlegroups.com

On Thu, Jun 18, 2015 at 2:00 PM, Diego Ongaro <onga...@gmail.com> wrote:

On Wed, Jun 17, 2015 at 1:20 PM, Ben Darnell <b...@bendarnell.com> wrote:
On Wed, Jun 17, 2015 at 3:35 AM, Diego Ongaro <onga...@gmail.com> wrote:
Avoiding Disruptions

First off, if you're doing membership changes, servers shouldn't update their terms based on RequestVote requests for a minimum (baseline) election timeout after receiving a heartbeat. I consider that tweak required if you support membership changes. All the other approaches I know about don't quite handle all cases of disruptive servers during and after membership changes. (You can make a best effort to tell a server it's removed from the cluster, but that might fail, and you can't retry forever.)

I see how this avoids disruptions, but doesn't it cause other problems? Suppose node C is partitioned away and increments its term, so its term is now ahead of the majority A and B. If A and B refuse to increase their terms to match, then C cannot force an election, but its term remains "in the future" so it will not recognize messages from the current leader and remain frozen in its isolated state. How do you bring node C back into the fold? Do A and B somehow tell C to decrease its term?

Relying on asymmetry: the term in the RequestVote request is not adopted, but the term in the AppendEntries response is. The leader (A or B) will send a heartbeat to C, C will reply to that heartbeat with its newer term, and the leader will step down and adopt that newer term.

Ah, I see. This is about ensuring that removed nodes can't cause endless disruptions, but a node that is partitioned away and comes back can still force an election.

This requires heartbeats to be real empty AppendEntries requests and not just a ping indicating that the server is alive. That's incompatible with our coalesced heartbeats so I'll have to think about this some more.

Where (other than this email) is this "required tweak" discussed? I don't think I've seen it before.

In the last two paragraphs of the cluster membership changes section of the conference paper and Extended Version paper, as well as in the dissertation. We added this fairly late in the game, so if you've mostly read pre-release drafts of the paper, it may not have been present there.