Bug in the raft paper and proposed solution (found because of problem with hashicorp implementation)

Jelte Fennema

unread,

Oct 25, 2017, 11:46:19 AM10/25/17

to raft-dev

I found a problem in the Hashicorp implementation of raft and after investigating where it came from I found out that it was caused by a recommendation from the raft paper itself.

The paper states the following at the end of section 6 that servers should disregard RequestVote RPCs when they believe a current leader exists:

The third issue is that removed servers (those not in Cnew) can disrupt the cluster. These servers will not receive heartbeats, so they will time out and start new elections. They will then send RequestVote RPCs with new term numbers, and this will cause the current leader to revert to follower state. A new leader will eventually be elected, but the removed servers will time out again and the process will repeat, resulting in poor availability.

To prevent this problem, servers disregard RequestVote RPCs when they believe a current leader exists. Specifically, if a server receives a RequestVote RPC within the minimum election timeout of hearing from a current leader, it does not update its term or grant its vote. This does not affect normal elections, where each server waits at least a minimum election timeout before starting an election. However, it helps avoid disruptions from removed servers: if a leader is able to get heartbeats to its cluster, then it will not be deposed by larger term numbers.

This indeed works to solve this problem. However, it has a very nasty side effect. Consider this scenario with nodes A, B and C

- A is leader and B and C are followers

- C because of some network problems misses enough of the heartbeats to start an election and starts sending RequestVote RPCs to A and B

- These RequestVote RPCs requests reach A and B, but they ignore them because A is leader

- Heartbeats and AppendEntries RPC's from A are still being sent to C, but C ignores them because they are from an older term

- This state continues until the connection between A and B is disrupted and they accept C's RequestVote RPCs.

Although this does not cause downtime or service disruption it does carry some problems:

- C will get behind indefinitely and probably has to restore from a snapshot when it rejoins the cluster.

- The performance of the raft cluster is degraded, because if B is slow the commit is slow as opposed to being limited by the fastest of B and C.

I think I found a simple solution to the issue mentioned in the paper which does not add these nasty side effects:

Nodes should not disregard all RequestVote RPCs when a leader is believed to exist. Instead they should disregard RequestVote RPCs from servers that are not part of their configuration.

This way a server that doesn't itself know that it is removed from the cluster cannot disrupt the cluster by sending RequestVote RPCs, but a member of the cluster that is trying to start an election is not ignored indefinitely.

I'm currently working on a pull request to the Hashicorp implementation to fix this issue and some others https://github.com/hashicorp/raft/pull/256. Any feedback on my proposed solution is very welcome, especially ones that see problems with my solution. I think it would be a good idea to update the paper, if this is deemed a good solution or a better one is proposed.

Ben Darnell

unread,

Oct 25, 2017, 11:56:44 AM10/25/17

to raft...@googlegroups.com

On Wed, Oct 25, 2017 at 11:46 AM Jelte Fennema <m...@jeltef.nl> wrote:

- A is leader and B and C are followers
- C because of some network problems misses enough of the heartbeats to start an election and starts sending RequestVote RPCs to A and B
- These RequestVote RPCs requests reach A and B, but they ignore them because A is leader
- Heartbeats and AppendEntries RPC's from A are still being sent to C, but C ignores them because they are from an older term

etcd/raft solves this a bit differently: When a node receives an Append message from an older term, it sends a response with its current term instead of just ignoring the message. This will trigger a new election with a higher term across the cluster, breaking out of this stuck state.

Of course, this is somewhat disruptive, so the pre-vote RPC (section 9.6) would also be recommended to avoid incrementing the term in the first place.

-Ben

- This state continues until the connection between A and B is disrupted and they accept C's RequestVote RPCs.

Although this does not cause downtime or service disruption it does carry some problems:
- C will get behind indefinitely and probably has to restore from a snapshot when it rejoins the cluster.
- The performance of the raft cluster is degraded, because if B is slow the commit is slow as opposed to being limited by the fastest of B and C.

I think I found a simple solution to the issue mentioned in the paper which does not add these nasty side effects:

Nodes should not disregard all RequestVote RPCs when a leader is believed to exist. Instead they should disregard RequestVote RPCs from servers that are not part of their configuration.
This way a server that doesn't itself know that it is removed from the cluster cannot disrupt the cluster by sending RequestVote RPCs, but a member of the cluster that is trying to start an election is not ignored indefinitely.

I'm currently working on a pull request to the Hashicorp implementation to fix this issue and some others https://github.com/hashicorp/raft/pull/256. Any feedback on my proposed solution is very welcome, especially ones that see problems with my solution. I think it would be a good idea to update the paper, if this is deemed a good solution or a better one is proposed.

--
You received this message because you are subscribed to the Google Groups "raft-dev" group.
To unsubscribe from this group and stop receiving emails from it, send an email to raft-dev+u...@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Henrik Ingo

unread,

Oct 25, 2017, 12:40:53 PM10/25/17

to raft...@googlegroups.com

Hi Jelte

There are actually 2 different reasons why servers could ignore
RequestVote RPCs. Perhaps you're only talking about one of them, but
for clarity I'll explain both separately. Especially because in real
world implementations you will typically see both of these addressed.

On Wed, Oct 25, 2017 at 6:46 PM, Jelte Fennema <m...@jeltef.nl> wrote:
> I think I found a simple solution to the issue mentioned in the paper which
> does not add these nasty side effects:
>
> Nodes should not disregard all RequestVote RPCs when a leader is believed to
> exist. Instead they should disregard RequestVote RPCs from servers that are
> not part of their configuration.

Correct. The Raft paper doesn't address the fact that in real world
implementations, you of course need some kind of mechanism to identify
nodes, and you would obviously ignore nodes not part of the cluster
configuration. All of this is implied, if you will.

> This way a server that doesn't itself know that it is removed from the
> cluster cannot disrupt the cluster by sending RequestVote RPCs, but a member
> of the cluster that is trying to start an election is not ignored
> indefinitely.
>

In practice, implementations also use an optimization where if A, B
and C are all part of the configuration, and A and B can talk to each
other but C has network issues, then a call for election from C will
be ignored. (And C also must not increase its term.)

I tried to write down how this works in a paper-like format once:
http://openlife.cc/blogs/2015/september/4-modifications-raft-consensus

henrik
--
henri...@avoinelama.fi
+358-40-5697354 skype: henrik.ingo irc: hingo
www.openlife.cc

My LinkedIn profile: http://fi.linkedin.com/pub/henrik-ingo/3/232/8a7

Oren Eini (Ayende Rahien)

unread,

Oct 25, 2017, 3:12:49 PM10/25/17

to raft...@googlegroups.com

Ignoring a vote with a higher term number is not a good idea, for this reason.

What is commonly being done is that you have a trial election before, which allow other nodes to disregard it safely without actually creating the new term number on the source.

Only when there has been enough votes to say that the election will succeed (but not _who_ will succeed) will the source node actually commit to the new term number.

Hibernating Rhinos Ltd

Oren Eini l CEO l Mobile: + 972-52-548-6969

Office: +972-4-622-7811 l Fax: +972-153-4-622-7811

On Wed, Oct 25, 2017 at 6:46 PM, Jelte Fennema <m...@jeltef.nl> wrote:

--
You received this message because you are subscribed to the Google Groups "raft-dev" group.

To unsubscribe from this group and stop receiving emails from it, send an email to raft-dev+unsubscribe@googlegroups.com.

pre...@hashicorp.com

unread,

Oct 31, 2017, 9:40:42 AM10/31/17

to raft-dev

Hi Jelte

Thanks for bringing this up. We've been talking about this internally. While the problem you identified is correct, the proposed implementation will trigger leader elections for every instability in a follower node ( such as CPU contention, temporary network blips etc). That is not ideal because applications that rely on this library (Consul/Nomad) will have availability issues during those frequent leader elections, every time that some follower node is unstable.

Some options we've been talking about are:

Implementing the pre-vote RPC or trial term idea (so that the source node does not increment its term until it knows that its vote request succeeds).

Having a way to indicate in the response when the vote request has been ignored because there is a leader already so that the source node can roll back its term (causing future append entries to start being accepted again).

Ben's suggestion in this thread with triggering an election in the follower if it gets an AppendEntries call with a term older than its current term.

Will need to think through the implications of all this some more, so any other feedback from others on this thread is appreciated.

Jelte Fennema

unread,

Oct 31, 2017, 10:52:36 AM10/31/17

to raft-dev

Thanks everyone for the insights and ideas! Especially the paper by Henrik has some very good additions to "plain" raft. I think it would be very useful to have it published on the Raft website: https://raft.github.io/. Possibly even in the quick links.

@preetapan I understand your concerns with my proposed solution and I agree. I think the pre-vote RPC would be a much better solution. After I read about it I just didn't want to implement it without first discussing it, since it would require a change the Transport interface to support the new RPC. The other solution you're proposing where you decrease the term sounds like a good way to violate some of the things Raft assumes to be true.

We're currently using the code from my PR in production, because of our chosen timeouts and network setup an election once in a while is not really a problem. However we would like to get some commonly accepted solution in the official lib. If you want, you can reach me at jelte at getstream.io, to maybe set up a skype call or something so we can discuss what needs to be done in more detail.

Jelte

Reply all

Reply to author

Forward