Pre-Vote cluster deadlock


Mathieu Borderé

Jun 9, 2022, 1:01:57 PM
to raft-dev
Hi all,

I'm maintaining a raft library and one of our users has reported the following issue in a cluster with Pre-Vote enabled.

4 cluster members, nodes 1 through 4

- node 1 is down
- node 2 starts and wins an election and forms quorum with 3 & 4, their term is now 1 larger than node 1's term
- node 2 goes down (and stays down), cluster loses quorum
- node 1 comes back up
- cluster quorum never recovers because in our implementation node 1 doesn't update its term in response to the Pre-Vote RequestVote RPCs of nodes 3 and 4; it only updates its term when receiving real RequestVote RPCs.

Is it a bug to not update the term when receiving Pre-Vote RequestVote RPCs? My solution would be to update node 1's term to Pre-Vote-RequestVoteRPC.term - 1, because that is the real term of the candidate sending the Pre-Vote RequestVote RPC. Is there another solution?
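
A minimal sketch of the fix proposed above, assuming a pre-vote request carries the candidate's real term plus one. The `PreVoteRequest` and `PreVoteReceiver` types and the function names are hypothetical and not taken from any particular Raft library.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical message type: a pre-vote request advertises real term + 1.
struct PreVoteRequest {
    std::uint64_t term;
};

// Recovers the sender's real term from a well-formed pre-vote request.
inline std::uint64_t candidate_real_term(const PreVoteRequest& req) {
    assert(req.term > 0 && "pre-vote term is real term + 1, so never 0");
    return req.term - 1;
}

// Hypothetical receiver state, reduced to the term bookkeeping only.
struct PreVoteReceiver {
    std::uint64_t current_term = 0;

    // Adopts the candidate's real term when it is ahead of the local term.
    // Returns true if the local term changed.
    bool on_pre_vote_request(const PreVoteRequest& req) {
        if (req.term == 0) return false;  // malformed request, ignore
        const std::uint64_t real_term = candidate_real_term(req);
        if (real_term <= current_term) return false;
        current_term = real_term;
        return true;
    }
};
```

In the scenario above, node 1 at term 4 would receive a pre-vote request with term 6 from node 3 or 4 and move to term 5, so its reply is no longer behind the sender's term.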

Thanks for the input!

Kind regards,
Mathieu

Mathieu Borderé

Jun 9, 2022, 1:20:40 PM
to raft-dev
> - cluster quorum never recovers because in our implementation node 1 doesn't update its term in response to the Pre-Vote RequestVote RPC's of nodes 3 and 4, it will only update term when receiving Real RequestVote RPC's

To clarify, nodes 3 and 4 ignore node 1's RequestVote responses because of its lower term.

Konstantin Osipov

Jun 9, 2022, 1:39:18 PM
to raft...@googlegroups.com
* Mathieu Borderé <mathieu...@gmail.com> [22/06/09 20:29]:
> *- cluster quorum never recovers because in our implementation node 1
> doesn't update its term in response to the Pre-Vote RequestVote RPC's of
> nodes 3 and 4, it will only update term when receiving Real RequestVote
> RPC's*
> To clarify, node 3 and 4 ignore node 1's RequestVoteResponse RPC's due to
> its lower term.

Nodes 3 and 4 should become candidates eventually though? Both of
them should be able to win an election.

--
Konstantin Osipov, Moscow, Russia

1912751295

Jun 9, 2022, 1:40:07 PM
to kostja, raft-dev
Stop sending this damn stuff

Mathieu Borderé

Jun 9, 2022, 1:46:11 PM
to raft-dev
Yes, nodes 3 and 4 should be able to win an election, but in our case they didn't: they ignored node 1's RequestVote response because of its lower term, and node 1's vote is needed for a majority.
Node 1's term is outdated because our implementation does not (or rather, did not) update a node's term when it receives a Pre-Vote RequestVote RPC.

1912751295

Jun 9, 2022, 1:47:06 PM
to mathieu.bordere, raft-dev
[expletive]

Jinkun Geng

Jun 9, 2022, 1:51:54 PM
to raft-dev
If you don't want to be bothered by raft-dev email, you just need to send an email to raft-dev+u...@googlegroups.com to unsubscribe. There is no need to broadcast profanity to the whole community.

On Thursday, June 9, 2022 at 10:47:06 AM UTC-7 19127...@qq.com wrote:
> [expletive]

Konstantin Osipov

Jun 9, 2022, 1:53:01 PM
to raft-dev

Do you mean nodes 3 and 4 were not able to win a pre-vote, because a pre-vote request doesn't change node 1's term, and the response from node 1 contains node 1's term?

Our implementation puts the pre-vote request's term in the response: https://github.com/scylladb/scylla/blob/master/raft/fsm.cc#L794

// The term in the original message and current local term are the
// same in the case of regular votes, but different for pre-votes.
//
// When responding to {Pre,}Vote messages we include the term
// from the message, not the local term. To see why, consider the
// case where a single node was previously partitioned away and
// its local term is now out of date. If we include the local term
// (recall that for pre-votes we don't update the local term), the
// (pre-)campaigning node on the other end will proceed to ignore
// the message (it ignores all out of date messages).
send_to(from, vote_reply{request.current_term, true, request.is_prevote});
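
To see the effect of that choice from the pre-candidate's side, here is a minimal sketch. It is not Scylla's code: the `VoteReply` struct, `count_pre_votes`, and `has_majority` are hypothetical names, and the filter assumes, as described earlier in the thread, that a reply whose term is below the receiver's local term is dropped.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical reply type mirroring the three fields used above.
struct VoteReply {
    std::uint64_t term;
    bool granted;
    bool is_prevote;
};

// Counts the pre-votes a pre-candidate accepts, including its own.
// Replies carrying a term below the local term are treated as out of date.
inline std::size_t count_pre_votes(std::uint64_t local_term,
                                   const std::vector<VoteReply>& replies) {
    std::size_t votes = 1;  // the pre-candidate votes for itself
    for (const VoteReply& reply : replies) {
        if (reply.term < local_term) continue;  // stale reply, ignored
        if (reply.is_prevote && reply.granted) ++votes;
    }
    return votes;
}

// Strict majority of the full cluster, counting nodes that are down.
inline bool has_majority(std::size_t votes, std::size_t cluster_size) {
    assert(cluster_size > 0);
    return votes > cluster_size / 2;
}
```

With four nodes and node 3 at local term 5, a reply from node 1 stamped with its own term 4 is dropped, leaving 2 votes out of 4. If node 1 echoes the request's term 6 instead, the reply is counted and node 3 reaches 3 votes.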

Mathieu Borderé

Jun 9, 2022, 1:57:30 PM
to raft-dev
> Do you mean node 3 and 4 were not able to win a pre-vote, because a pre-vote request doesn't change 1's term, and the response from 1 contains 1's term?
Yes, that's what I mean. Thanks, I think your approach looks fine too.

Henrik Ingo

Jun 9, 2022, 2:00:50 PM
to raft...@googlegroups.com
Hi Mathieu

Since a Pre-Vote algorithm isn't spelled out in the original Raft papers, may I start by asking which Pre-Vote algorithm your implementation is based on? Can you link to a written description of your PreVote RPC?

henrik

--
You received this message because you are subscribed to the Google Groups "raft-dev" group.
To unsubscribe from this group and stop receiving emails from it, send an email to raft-dev+u...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/raft-dev/e11c4e2b-a882-4e62-8fbf-35e0046aed36n%40googlegroups.com.

Archie Cobbs

Jun 9, 2022, 2:02:04 PM
to raft-dev
FWIW in my implementation "pre-vote" simply means: make sure you can successfully ping a majority of nodes (counting yourself) before starting an election. Ping requests and responses carry no state.

This way of doing things may fail in certain corner cases, but because pings are stateless, when a majority is reachable the forward progress guarantees of Raft are still valid.

-Archie

Mathieu Borderé

Jun 9, 2022, 2:20:18 PM
to raft-dev
> Since a Pre-Vote algorithm isn't spelled out in the original Raft papers, may I start by asking which Pre-Vote algorithm is your implementation based on? Can you link to a written description of your PreVote RPC?

Our Pre-Vote election proceeds like a normal election: a candidate has to obtain a majority of the pre-votes before it can start a normal election.
The differences are that the term in the Pre-Vote RequestVote RPC is 1 higher than the candidate's actual term, and that a node does not update its term in response to receiving Pre-Vote RequestVote RPCs.
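
The two differences described above can be sketched roughly as follows. This is only an illustration under those stated rules: `RequestVote`, `Candidate`, and `Receiver` are hypothetical types, and vote granting (log comparison, counting replies) is left out entirely.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical RPC shape: one message type, flagged as pre-vote or real.
struct RequestVote {
    std::uint64_t term;
    bool pre_vote;
};

struct Candidate {
    std::uint64_t current_term;

    // Pre-vote: advertise term + 1 but leave the local term untouched.
    RequestVote start_pre_vote() const {
        return RequestVote{current_term + 1, true};
    }

    // Real election: the local term is actually incremented.
    RequestVote start_election() {
        ++current_term;
        return RequestVote{current_term, false};
    }
};

struct Receiver {
    std::uint64_t current_term;

    // Only a real RequestVote with a higher term moves the local term.
    void on_request_vote(const RequestVote& req) {
        if (!req.pre_vote && req.term > current_term) current_term = req.term;
    }
};

// Usage example with the terms from this thread (node 3 at 5, node 1 at 4).
inline void pre_vote_example() {
    Candidate node3{5};
    Receiver node1{4};
    const RequestVote probe = node3.start_pre_vote();
    node1.on_request_vote(probe);
    assert(probe.term == 6 && node3.current_term == 5);
    assert(node1.current_term == 4);  // unchanged by the pre-vote
    const RequestVote real = node3.start_election();
    node1.on_request_vote(real);
    assert(real.term == 6 && node1.current_term == 6);
}
```

The example shows why node 1 stays at its old term for as long as no real election starts, which is the situation reported at the top of the thread.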

Henrik Ingo

Jun 9, 2022, 5:57:45 PM
to raft...@googlegroups.com
Ok.

You're kind of correct, but the "carry no state" part is the problem. As you have described, you now end up in situations where

 - The leader was lost
 - Some nodes are on a term that is lower than the majority's
 - The only way for those nodes to learn the current term of the majority is the RequestVote RPC
 - But the nodes with the highest current term can never execute RequestVote because they can never pass the PreVote.

FWIW, I once wrote an addition to Raft where I added a well-defined PreVote RPC step. It was reviewed on this list.

henrik


Archie Cobbs

Jun 9, 2022, 6:35:40 PM
to raft-dev
> You're kind of correct, but the "carry no state" part is the problem.

Bleh, sorry, I forgot how my own code works ... the Ping messages are NOT stateless - just like all other Raft messages, they carry the sender's current term.

Then these rules apply:
  • If a recipient gets a Ping with a higher term, it reverts to a follower in the new term per normal Raft rules, unless it has heard from its leader within the minimum election timeout (dissertation, section 4.2.3).
  • If a recipient gets a Ping with a lower term, it replies anyway. But the Ping reply also contains that remote node's higher term, so the original sender will see it, revert to a follower in that higher term, and start over.

So in the original scenario of this thread, node 1 should get a ping reply from node 3 or 4 with the higher term, and would then become a follower in that higher term. Then one of nodes 1, 3, or 4 would be able to successfully ping a majority of nodes and become a candidate.
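
A minimal sketch of the two Ping rules above, not the actual implementation: `Ping`, `PingReply`, `PingNode`, and the `heard_from_leader_recently` flag are hypothetical names, and the flag stands in for a real timer check against the minimum election timeout.

```cpp
#include <cassert>
#include <cstdint>

enum class Role { Follower, Candidate, Leader };

// Both messages carry the sender's current term.
struct Ping { std::uint64_t term; };
struct PingReply { std::uint64_t term; };

struct PingNode {
    std::uint64_t current_term = 0;
    Role role = Role::Follower;
    // Assumed to be maintained elsewhere: true while a leader was heard
    // from within the minimum election timeout.
    bool heard_from_leader_recently = false;

    // Rule 1: a higher term makes us a follower in that term, unless a
    // leader was heard from recently. Rule 2: reply in any case, carrying
    // our (possibly higher) term.
    PingReply on_ping(const Ping& ping) {
        if (ping.term > current_term && !heard_from_leader_recently) {
            current_term = ping.term;
            role = Role::Follower;
        }
        return PingReply{current_term};
    }

    // The original sender adopts a higher term seen in the reply.
    void on_ping_reply(const PingReply& reply) {
        if (reply.term > current_term) {
            current_term = reply.term;
            role = Role::Follower;
        }
    }
};

// Usage example: node 1 (term 4) pings node 3 (term 5) and catches up.
inline void ping_example() {
    PingNode node1;
    node1.current_term = 4;
    PingNode node3;
    node3.current_term = 5;
    const PingReply reply = node3.on_ping(Ping{node1.current_term});
    node1.on_ping_reply(reply);
    assert(reply.term == 5 && node3.current_term == 5);
    assert(node1.current_term == 5 && node1.role == Role::Follower);
}
```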

-Archie

Vilho Raatikka

Jun 13, 2022, 2:44:57 PM
to raft...@googlegroups.com
Hi, a minor thing to take into account with the pre-voting addition is that the sender of a pre-vote request is not a candidate, which means that its term is not increased before sending pre-vote requests. As a result, pre-vote messages from subsequent pre-voting rounds may be identical, unlike voting messages, which are unique in terms of the voter and the term.

Regards

Vilho
