An Improvement Idea for Leader Election in the Raft Algorithm

198 views
Skip to first unread message

孙涛涛

unread,
Mar 6, 2025, 12:07:23 PMMar 6
to raft-dev

For the Raft algorithm, the leader election process is as follows:

1. Node Roles

Nodes in the Raft algorithm have three roles:

(1) Leader: Responsible for handling client requests, managing log replication, and sending heartbeats.

(2) Follower: Passively accepts logs and heartbeats from the Leader and does not initiate requests.

(3) Candidate: During the election process, a Follower can transition to a Candidate and initiate an election.

 

2. Election Trigger Conditions

Leader elections are typically triggered under the following conditions:

(1) Leader failure: A Follower does not receive a heartbeat from the Leader within a certain period (election timeout).

(2) New node joining: A newly joined node may trigger an election.

(3) Partition recovery: An election may be triggered after network partition recovery.

 

3. Election Process

The election process consists of the following steps:

(1) Follower Transitions to Candidate

  1. When a Follower does not receive a heartbeat from the Leader within the election timeout period, it considers the Leader to have failed.
  2. The Follower transitions its role to Candidate and initiates a new election.

(2) Candidate Initiates Voting

  1. The Candidate first increments its term number (Term) by 1, indicating a new round of elections.
  2. The Candidate sends a RequestVote RPC to other nodes in the cluster to request votes.
  3. The Candidate votes for itself.

(3) Other Nodes Vote

  1. Upon receiving the RequestVote RPC, nodes check the following conditions:
  • Whether the Candidate's Term is greater than or equal to their own Term.
  • Whether they have already voted for another Candidate.
  • Whether the Candidate's log is at least as up-to-date as their own.
  1. If the conditions are met, the node votes for the Candidate and resets its election timeout.

(4) Election Result

  1. If the Candidate receives votes from a majority of nodes, it becomes the new Leader.
  2. The new Leader immediately sends heartbeats to other nodes to prevent them from initiating new elections.

(5) Election Failure

  1. If the Candidate does not receive enough votes within the election timeout period, the election fails.
  2. The Candidate waits for a random period before re-initiating the election (to avoid split votes caused by multiple nodes initiating elections simultaneously).

 

4. Election Timeout

(1) To prevent multiple nodes from initiating elections simultaneously, Raft uses randomized election timeout periods (typically 150ms-300ms).

(2) Randomized timeout periods reduce the probability of election conflicts.

 

5. Log Consistency Guarantee

(1) During the election process, Raft ensures that only nodes with sufficiently up-to-date logs can become Leaders by comparing the Term and Index of logs.

(2) This guarantees that the Leader's log contains all committed log entries, ensuring consistency.

 

6. Election Optimization

(1) Pre-Vote Mechanism: Before formally initiating an election, a Candidate can first send a Pre-Vote request to confirm whether it has a chance to win the election, avoiding unnecessary Term increments.

(2) Leader Transfer: In certain situations, the Leader can proactively transfer leadership to another node to avoid frequent elections.

 

 

Modified Election Process:

The election process is divided into the following steps:

(1) Follower Transitions to Candidate (unchanged)

  1. When a Follower does not receive a heartbeat from the Leader within the election timeout period, it considers the Leader to have failed.
  2. The Follower transitions its role to Candidate and initiates a new election.

(2) Candidate Initiates Voting (modified)

  1. The Candidate first increments its term number (Term) by 1 and generates a random int64 number, randPriority, indicating a new round of elections.
  2. The Candidate sends a RequestVote RPC to other nodes in the cluster (this RPC request carries the random number randPriority) to request votes.
  3. The Candidate votes for itself.

(3) Other Nodes Vote

  1. Upon receiving the RequestVote RPC, nodes check the following conditions:
  • Whether the Candidate's Term is greater than or equal to their own Term.
  • Whether they have already voted for another Candidate.
  • Whether the Candidate's log is at least as up-to-date as their own.
  1. If the conditions are met, the node votes for the Candidate and resets its election timeout.
  2. Upon receiving the RequestVote RPC, nodes record the maximum value maxNodePriority (first by log offset, then by the random number randPriority) among Candidate nodes that meet the following conditions:
  • The Candidate's Term is greater than or equal to their own Term.
  • The Candidate's log offset is at least as up-to-date as their own.
  1. If a node refuses to vote for the Candidate (conditions not met or already voted for another node), it will return a message informing the Candidate.

(4) Voting Result (Receiving Rejected Votes)

  1. If the Candidate receives rejected vote messages and the following conditions are met:
  • If the message informs the voting node that the log is outdated,
  • If the message informs the voting node that it has already voted for another node, then the Candidate node will record these nodes in a list (VoteOtherList) until the number of rejected nodes exceeds a majority (meaning the node cannot be elected according to the standard Raft algorithm, in which case it needs to self-rescue). If the Candidate's nodePriority is less than the recorded maxNodePriority or less than the maximum nodePriority in VoteOtherList, the node will initiate a specific node re-voting process.
  1. Specific Node Re-Voting Process:
  • Actively relinquish the Candidate role for this round.
  • Send a message to all nodes that voted for it, requesting them to re-vote (the message includes the Candidate node with the maximum nodePriority recorded, obtained from VoteOtherList and maxNodePriority).
  • Upon receiving this message, nodes compare the maxNodePriority in the message with the locally recorded maxNodePriority and send a vote to the node with the larger value.
  • The specific node re-voting process aims to concentrate votes on the node with the highest nodePriority, thereby completing the election process.

(5) Voting Result (Receiving an Accepted Vote)

  1. If the Candidate receives votes from a majority of nodes, it becomes the new Leader.
  2. The new Leader immediately sends heartbeats to other nodes to prevent them from initiating new elections.

(6) Election Failure

  1. If the Candidate does not receive enough votes within the election timeout period, the election fails.
  2. The Candidate waits for a random period before re-initiating the election.

 

 

Leader Election Examples

Example 1: Raft Election Succeeds in One Round

Assume a 5-node cluster (A, B, C, D, E):

  1. Initially, A is the Leader, and the other nodes are Followers.
  2. After A fails, B and C's election timeout expires, and they transition to Candidates.
  3. B and C initiate elections, sending RequestVote RPCs to other nodes.
  4. Assume B's log is more up-to-date than C's, and D and E vote for B.
  5. B receives 3 votes (including its own) and becomes the new Leader.

 

Example 2: Raft Election Fails in One Round

  1. Initially, A is the Leader, and the other nodes are Followers.
  2. After A fails, B, C, D, and E's election timeout expires, and they transition to Candidates.
  3. B, C, D, and E initiate elections, sending RequestVote RPCs to other nodes. B, C, D, and E receive random numbers 100, 90, 80, and 70, respectively.
  4. Assume B, C, D, and E each receive 1 vote. Since C, D, and E each receive three rejected vote messages (exceeding a majority), these three nodes relinquish their Candidate roles for this round and send messages to themselves (these three nodes vote for themselves), instructing them to transfer their votes to node B.
  5. Node B receives 4 votes, exceeding a majority, and completes the election process.

 

Example 3: Another Raft Scenario

  1. Initially, A is the Leader, and the other nodes are Followers.
  2. After A fails, B and C's election timeout expires, and they transition to Candidates.
  3. B and C initiate elections, sending RequestVote RPCs to other nodes, with random numbers 900 and 1000, respectively.
  4. Assume B and C have the same log offset and receive votes from D and E, respectively. Here, a heuristic approach can be considered: since node A cannot vote, both B and C believe that under the existing Raft algorithm logic, neither will become the Leader. Since node C has a larger offset, node C will not initiate specific node re-voting. Node B will relinquish its Candidate role for this round and instruct nodes B and D, which voted for it, to transfer their votes to node C.
  5. Node C receives 4 votes, exceeding a majority, and completes the election process.

 

Correctness of the Idea

At any moment, each node will only vote for one Candidate, so at any moment, there is at most one node with votes from a majority of nodes.

 

Efficiency

Under normal Raft operation, the new modifications have no impact. However, when the Raft algorithm cannot elect a Leader in one election round, the new modifications may facilitate the election of a Leader.


Thank you for your opinions, and welcome to talk about it.


Chris Johnson

unread,
Mar 10, 2025, 3:40:28 PMMar 10
to raft-dev
I agree with the premise that votes can be treated as tokens that can be passed around.

In Example 3, point 4 i disagree that A cannot vote.  It can come back and vote.  Maybe a bit pedantic, B and C could assume A will never vote for them and time out and it wouldn't really cause any problems if A had voted.

However, again in Example 3 what happens if B and C can't communicate and each consider themselves the highest priority?  As described neither will relinquish their votes.

What happens if a server gives up its candidacy and tells servers that voted for it to vote for a server they can't communicate with? Raft assumes an asynchronous network model so a server can't be sure whether it went through or not once sent and take the vote back without explicit consent from the receiver.

Given these problems i would think an election would still need to time out and try again with a higher term number, and if you need to account for that it puts into question whether the extra complexity is worth it given that elections should be rare events.

孙涛涛

unread,
Mar 10, 2025, 11:03:50 PMMar 10
to raft-dev

(1)   For the Example 3, point 4. This is a heuristic way, in other word, you could choose not to use this heuristic way, and just let the candidate gives up its candidacy when it knows that it could not become the leader (that not less than half of nodes vote for other candidates). But it is a low probability case that node A would come back after a few seconds. If in most cases it is true, then this heuristic way could speed up the election process, and in a few cases, it could slow down the election process, then it is feasible way. And we could do the heuristic way in the first few election rounds, then we could the election process would work when the normal raft election process works.
(2) In Example 3, we don’t need B and C could communicate with each other. Because we judge the highest priority according to only the information from B and C, and also D and E. You could refer to the following:

(4) Voting Result (Receiving Rejected Votes)

  1. If the Candidate receives rejected vote messages and the following conditions are met:
  • If the message informs the voting node that the log is outdated,
  • If the message informs the voting node that it has already voted for another node, then the Candidate node will record these nodes in a list (VoteOtherList) until the number of rejected nodes exceeds a majority (meaning the node cannot be elected according to the standard Raft algorithm, in which case it needs to self-rescue). If the Candidate's nodePriority is less than the recorded maxNodePriority or less than the maximum nodePriority in VoteOtherList, the node will initiate a specific node re-voting process.
  1. Specific Node Re-Voting Process:
  • Actively relinquish the Candidate role for this round.
  • Send a message to all nodes that voted for it, requesting them to re-vote (the message includes the Candidate node with the maximum nodePriority recorded, obtained from VoteOtherList and maxNodePriority).
  • Upon receiving this message, nodes compare the maxNodePriority in the message with the locally recorded maxNodePriority and send a vote to the node with the larger value.
  • The specific node re-voting process aims to concentrate votes on the node with the highest nodePriority, thereby completing the election process.

 

And in other way, if there are network partition, and the cluster splits into many disconnected parts, then there are no way to get a leader. And I don’t say that the election process is bound to be succeed in one round using this proposal.

(3) For what happens if a server gives up is candidacy and tells servers that voted for a server they can’t communicate with? First, it is a rare case, because if they can’t communicate with another candidate, then they don’t know the random priority, and then they would not think that candidate has high priority. If it is the case that we get the information for the candidate, and we could not connect to it when we want to do the re-voting process. Except that the candidate who gives up the candidacy, there are no other changes compared to the normal raft election. And the candidate who gives up the candidacy doesn’t need to know whether the node who was asked to re-voting receives the re-voting request or not, and whether the node revoting or not.
For example:

There are 7 nodes, and A is leader now, B, C, D, E, F, G are candidates. The node A stops now, and candidate B, C, D become candidates, and get the candidate from themselves.

And B get the vote from E, C get the vote from F, D get the vote from G.

B has the rand priority 300, C has the rand priority 200, D has the rand priority 100.

Case 1:
If B, C, D, E, F, G could communicate with each other, but when D and G want to do the re-voting, D and G could not communicate with B. then D and G would keep to do the re-voting to note B, but could not succeed. C and F would give their votes to B, and B get 4 votes, and becomes the leader.

Case 2:

If B, C, D, E, F, G could communicate with each other, but after D tells G to do the re-voting to the node B, then D disconnected with G. Now, B, C, E, F, G are connected, then B would get 5 nodes, and becomes the leader.

4I agree with you that an election still need to time out and may need to try again with a higher term number. The example I gives is just for one round, I don’t give the information about another round because it not related to the changes. You could refer to the following:


(5) Voting Result (Receiving an Accepted Vote)

  1. If the Candidate receives votes from a majority of nodes, it becomes the new Leader.
  2. The new Leader immediately sends heartbeats to other nodes to prevent them from initiating new elections.

(6) Election Failure

  1. If the Candidate does not receive enough votes within the election timeout period, the election fails.
  2. The Candidate waits for a random period before re-initiating the election.


For “whether the extra complexity is worth it given that elections should be rare events”, because there are some proposals that are used to speed up the election process for example the random wait time, and the following:

6. Election Optimization

(1) Pre-Vote Mechanism: Before formally initiating an election, a Candidate can first send a Pre-Vote request to confirm whether it has a chance to win the election, avoiding unnecessary Term increments.

From the above, we could know that speeding up the election process is very important. And when the cluster is in the leader election process, the cluster could not do the real work. Any speed up would make sense. Of course, we could do more investigation and experiments to check it. And there is not much extra complexity for the software development, CPU cost and network cost.

 

And thank you for your response. According to your response, I could find that you have a lot of thinking about it. Thank you very much.

Chris Johnson

unread,
Mar 11, 2025, 3:35:49 AMMar 11
to raft-dev
Yeah i'm really just playing devil's advocate, the argument for why not to do it is the increased complexity.  If you can show a significant improvement in election completion times in the majority of cases then the added complexity might be worth it to some people.

One thing to note, the pre-vote mechanism also prevents livelock, see: https://blog.cloudflare.com/a-byzantine-failure-in-the-real-world/

That said, after thinking about it a bit more i think you can tweak the idea so that it still reaches a decision even with partial network failure.

1) Pre-vote ensures that all candidates can directly communicate with the majority of the group.

2) If A >= B >= C in terms of vote priority, if C gives its vote to B, B can directly give it to A once it decides it can't become leader rather than sending it back to C, as it must be true that A >= C.  Initially i was worried that you could elect a server that couldn't commit anything if you did this, but because of 1) we know that all candidates do have a quorum.

3) There is still the possibility that B can't communicate with A, and if B gives the votes back to C as you propose you then have to worry about if C can communicate with A.  My first thought is you get around this by having all messages in the election phase be flood broadcast among the group.  As you said, it's not doing much else when an election is happening so you can handle the extra traffic.

4) You now don't need the initial timeout before deciding to give up the candidacy and pass on the votes, as soon as you learn of a candidate for the same term with a higher priority you can give them your votes and forward any votes you receive in the future.

5) You can avoid the case where two candidates pick the same random number through lexicographical ordering.

I think you now have something that decides on a leader without any additional waiting after the initial timeout of the old leader.  It may fail to elect a leader if a candidate crashes of course, and you have to have some mechanism to retry messages in case they are dropped such as TCP.

There's a lot of research on voting algorithms so i'm sure you can find some clever ideas in the literature.  The main requirements that are particular to Raft are the pre-vote to ensure all candidates are valid and will be able to make progress if they are elected, and the log being a factor in determining who to vote for.

孙涛涛

unread,
Mar 11, 2025, 11:28:47 PMMar 11
to raft-dev
Thank you, I have seen the pre-vote before. The main usage for pre-vote is that if a node check whether the leader node is connected, and whether the node has the biggest log offset, or the biggest term. The target is not same with my idea. I am sorry that I am busy these two days. I would give you response after about two days.

孙涛涛

unread,
Mar 15, 2025, 8:20:33 PMMar 15
to raft-dev
Thank you for your response. And I think you have a lot of ideas. And we need to get more information about the normal network conditions in the real environment for the product environments, and we need to do a lot of experiments to check the real profit. And for some personal reasons, I don't have time to do this for a few months, or even a few years. If you have time, you could do this, and may get some interesting results. And thank you for your response, again. And have fun.

在2025年3月11日星期二 UTC+8 15:35:49<cjljo...@gmail.com> 写道:

dr-dr xp

unread,
Apr 6, 2025, 12:16:16 PMApr 6
to raft-dev
Does it mean to replace the vote request handling

```
if req.term >= self.term 
```

with 

```
if (req.term, req.prio) >= (self.term, self.prio)
```

It seems like an analogy to paxos ballot number

孙涛涛

unread,
May 12, 2025, 1:00:12 AMMay 12
to raft-dev
Hi, we don't need to replace the vote request handling from
```
if req.term >= self.term
```
with
```
if (req.term, req.prio) >= (self.term, self.prio)
```
When we receive a vote request result, we need to check whether it is a vote to myself, or vote for other nodes, if most of the nodes vote for other nodes, and my prio is small, the node (myself) could give up its right to become the leader in this round, and tell the nodes who vote for it to revote for other nodes.
It is looks like the following:
if reqresult.vote_for == myself
   if myself.myself.couldBeLeader
     myself.recevied_notes++
     if myself.received_notes >= threshold
        self.becomeLeader
   else
        askVoteForOtherNodes(reqresult.from_node)
else
   other.received_notes++
   if other.received_notes >= threshold && myself.prio < max(others.prio)
      myself.couldBeLeader = false

Thanks
Reply all
Reply to author
Forward
0 new messages