When does a Leader realise that (all of) its Followers are gone?


mlj

Feb 19, 2025, 11:50:23 AM
to raft-dev
Hi,

I've been looking at the Raft Algorithm for the last couple of days and I've got a question that's bugging me, and I've not been able to find the answer.

Consider a 3 node cluster: A (Leader), B and C (Followers). Reset between each scenario.

If A fails: B & C will run down their randomised election timers, one of them (let's say B) will nominate itself to Candidate, call an Election, vote for itself and C will vote for it as well. B is now the Leader, C a Follower and A is dead.

If B fails: A will continue to send RPCs, get no reply from B but will continue as Leader and C will continue as Follower. B is dead.

If A && B fail: C will run down its election timer, nominate itself to Candidate, vote for itself, but not receive any other votes, and therefore will never be elected. The cluster has failed; Raft can't recover from this failure.

If B && C fail: A will continue to send them RPCs, get no reply, but will continue as Leader, and the cluster will remain "operational"??? Raft doesn't detect this as a failure mode?

Are my conclusions re: the last scenario correct?

I haven't found any mechanism described in the Raft documentation whereby an incumbent Leader reacts to the loss of some/the majority/all of its Followers. The closest I could find is in Section 5.5 of the paper, which says that RPCs are retried indefinitely, but that doesn't indicate that any "Oh, the cluster is really, really FUBAR" mechanism exists.

I understand that the Leader could "know" that it doesn't have any Followers (because it's getting no RPC responses), but even if that's being tracked, that isn't to say that anything will be done about it.

It seems like there is a sensitivity (for lack of a better word) to which nodes in the cluster have failed, and whether or not the cluster remains operational.

Is this correct, or have I missed something?

Many thanks,

mlj

A. Jesse Jiryu Davis

Feb 19, 2025, 12:03:53 PM
to raft...@googlegroups.com
If a leader doesn’t hear from a majority of followers (no heartbeats, no responses to AppendEntries) for an election timeout, the leader steps down and becomes a follower. 

In the meantime it thinks it's still the leader, but it fails to commit any writes. In Raft's default algorithm the leader also can't serve reads during this time, because it checks for a majority on each read. If leases are enabled, the leader can serve reads until it steps down.
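
A minimal sketch of that check, assuming invented names and a Go-style implementation (this is not any particular library's API):

package raft

import "time"

// heardFromMajority reports whether the leader, counting itself, has
// heard from a majority of the cluster within one election timeout.
// lastHeard maps follower IDs to the arrival time of their most recent
// heartbeat or AppendEntries response.
func heardFromMajority(lastHeard map[string]time.Time, clusterSize int,
	electionTimeout time.Duration, now time.Time) bool {
	count := 1 // the leader always counts itself
	for _, t := range lastHeard {
		if now.Sub(t) < electionTimeout {
			count++
		}
	}
	return count > clusterSize/2
}

The leader's main loop would call this periodically and revert to follower when it returns false.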


Archie Cobbs

Feb 19, 2025, 12:10:11 PM
to raft...@googlegroups.com
On Wed, Feb 19, 2025 at 10:50 AM mlj <mlitt...@gmail.com> wrote:
I've been looking at the Raft Algorithm for the last couple of days and I've got a question that's bugging me, and I've not been able to find the answer.

This is a side note, but here's a general piece of advice that helps me avoid getting confused: With Raft it helps to avoid thinking in terms of "the state of the cluster". The only thing that is real is what each individual node thinks. If you have a cluster of size 3, then you have three (somewhat) independent facts F1, F2, F3. There is no single "global state". What Raft guarantees is certain restrictions on what combinations of F1, F2, and F3 are possible.
 
If B && C fail: A will continue to send them RPCs, get no reply, but will continue as Leader, and the cluster will remain "operational"??? Raft doesn't detect this as a failure mode?

In this scenario, all that matters is what A thinks. In this scenario A thinks that B and C are being extremely quiet :) The key side effect of that is that A will not be able to commit any changes, because it will never get a majority (2/3) of acknowledgements. Otherwise, everything is "normal".
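
To make "never get a majority of acknowledgements" concrete, here is a hedged Go sketch of the standard commit rule (names are illustrative, not from any real codebase):

package raft

import "sort"

// commitIndex returns the highest log index stored on a majority.
// matchIndex holds, per follower, the highest index known to be
// replicated there; the leader's own last index is counted too.
func commitIndex(leaderLastIndex uint64, matchIndex []uint64) uint64 {
	all := append([]uint64{leaderLastIndex}, matchIndex...)
	// Sort descending; the entry at position len/2 is on a majority.
	sort.Slice(all, func(i, j int) bool { return all[i] > all[j] })
	return all[len(all)/2]
}

With B and C silent, their matchIndex entries never advance, so A's commit index stays where it was and no new entry is ever committed.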
 
It seems like there is a sensitivity (for lack of a better word) to which nodes in the cluster have failed, and whether or not the cluster remains operational.

Raft doesn't provide answers to that question - i.e., whether the cluster has "failed" or what that even means.

But practically speaking: in my applications I have a periodic health checker that wakes up every five seconds and attempts to commit a read-only transaction. If it succeeds within a certain time frame then I declare that the cluster is "currently healthy".
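
Something like this Go sketch (the client interface and the timeouts are made up for illustration):

package health

import (
	"context"
	"time"
)

// Cluster is a hypothetical client; ReadOnlyTxn is assumed to succeed
// only after a quorum round-trip (the default Raft read path).
type Cluster interface {
	ReadOnlyTxn(ctx context.Context) error
}

// monitor wakes every five seconds, attempts a read-only transaction,
// and reports "currently healthy" only if it completes in time.
func monitor(c Cluster, healthy chan<- bool) {
	ticker := time.NewTicker(5 * time.Second)
	defer ticker.Stop()
	for range ticker.C {
		ctx, cancel := context.WithTimeout(context.Background(), 2*time.Second)
		err := c.ReadOnlyTxn(ctx)
		cancel()
		healthy <- err == nil
	}
}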

On Wed, Feb 19, 2025 at 11:03 AM 'A. Jesse Jiryu Davis' via raft-dev <raft...@googlegroups.com> wrote:
If a leader doesn’t hear from a majority of followers (no heartbeats, no responses to AppendEntries) for an election timeout, the leader steps down and becomes a follower.

I don't think that is correct. The leader will just keep broadcasting into the void.

The followers are the ones who trigger elections after a timeout, i.e., when they have not heard from the leader.
 
-Archie

--
Archie L. Cobbs

A. Jesse Jiryu Davis

Feb 19, 2025, 2:43:30 PM
to raft...@googlegroups.com
Fair points, Archie. I think the Raft paper and thesis don't mention the "stepdown if you haven't heard from a majority for a while" rule. It's not essential, but nice to have, since it makes multiple-leader periods short and rare. LogCabin implements it, and MongoDB has a more sophisticated rule: "stepdown if you haven't transitively heard from a majority in a while".


Jason Aten

Aug 16, 2025, 4:02:41 PM
to raft-dev
On Wednesday, February 19, 2025 at 7:43:30 PM UTC je... wrote:
Fair points, Archie. I think the Raft paper and thesis don't mention the "stepdown if you haven't heard from a majority for a while" rule. It's not essential, but nice to have...

Actually the dissertation (at least) does discuss this. See page 69 (as numbered; page 86 in the PDF numbering) in chapter 6 (section 6.2) of the online version, where it says:

"Raft must also prevent stale leadership information from 
delaying client requests indefinitely. Leadership information 
can become stale all across the system, in leaders, 
followers, and clients:

• Leaders: A server might be in the leader state, 
but if it isn’t the current leader, it could be needlessly 
delaying client requests. For example, suppose a leader 
is partitioned from the rest of the cluster, but it can 
still communicate with a particular client. Without 
additional mechanism, it could delay a request from 
that client forever, being unable to replicate a log 
entry to any other servers. Meanwhile, there might 
be another leader of a newer term that is able to 
communicate with a majority of the cluster and 
would be able to commit the client’s request. Thus, 
a leader in Raft steps down if an election timeout 
elapses without a successful round of heartbeats 
to a majority of its cluster; this allows clients to retry 
their requests with another server."

I note, however, that this can cause a lot of leader churn when manually bootstrapping a new cluster if you are not very fast about it.

So I disable this for the first minute (or so -- it is configurable). In the same way, during that first minute it is useful to require communication with all nodes, not just a quorum, so as to detect misconfiguration (e.g. a bad IP/port conflict that prevents a node from ever starting) that might otherwise be hidden by Raft's fault-tolerance properties.
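
A sketch of that grace-period idea (names and the defaults are invented for illustration):

package raft

import "time"

// Bootstrap models a configurable startup window during which the
// stepdown rule is disabled and full-cluster acks are required.
type Bootstrap struct {
	StartedAt   time.Time
	GracePeriod time.Duration // e.g. time.Minute; configurable
}

// stepdownEnabled: only enforce "step down without a majority" once
// the grace period has elapsed.
func (b Bootstrap) stepdownEnabled(now time.Time) bool {
	return now.Sub(b.StartedAt) >= b.GracePeriod
}

// requiredAcks: during the grace period, require a reply from every
// node so misconfiguration is surfaced rather than masked by Raft's
// fault tolerance; afterwards a simple majority suffices.
func (b Bootstrap) requiredAcks(clusterSize int, now time.Time) int {
	if !b.stepdownEnabled(now) {
		return clusterSize
	}
	return clusterSize/2 + 1
}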


A. Jesse Jiryu Davis

Aug 18, 2025, 10:31:28 AM
to raft-dev
Nice find, Jason, thanks!

MongoDB solves the leader churn problem during bootstrapping like this: you start up N nodes with no replica set configuration. They can accept connections and "replSetReconfig" commands, but they're otherwise powerless. Choose one node and send it a "replSetReconfig" command with a single-node replica set config containing just that node; the node elects itself the leader of the single-node replica set. Then you send more "replSetReconfig" commands to the leader, adding followers one at a time. The leader will reject the new config if it can't communicate with the new follower. (Of course, communication could fail later, but we at least check that the leader and follower can exchange messages just before the follower is added to the set.) This way the replica set usually contains only nodes that are running and accepting connections, so there's little risk of leader churn.
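
A sketch of that pre-add connectivity check (the Pinger interface and all names here are hypothetical; MongoDB's real replSetReconfig machinery is far more involved):

package reconfig

import (
	"context"
	"fmt"
)

// Pinger is a stand-in for whatever transport the leader uses to
// probe a prospective member.
type Pinger interface {
	Ping(ctx context.Context, addr string) error
}

type Config struct {
	Members []string
}

// AddFollower adds one member at a time, single-server-at-a-time
// style: verify the leader can reach the new follower right now,
// then install the new configuration (replicating it would follow).
func AddFollower(ctx context.Context, p Pinger, cfg *Config, addr string) error {
	if err := p.Ping(ctx, addr); err != nil {
		return fmt.Errorf("refusing reconfig, cannot reach %s: %w", addr, err)
	}
	cfg.Members = append(cfg.Members, addr)
	return nil
}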

More info:

A. Jesse Jiryu Davis

Aug 18, 2025, 10:32:11 AM
to raft-dev
Jason, what Raft implementation are you developing?

Jason Aten

Aug 19, 2025, 9:17:54 PM
to raft-dev
Hi Jesse, thanks for the description/links on how Mongo bootstraps. I do something
similar for testing, but have everything config-file driven for production mode so
as to aim for reproducibility.

My Raft implementation isn't public, at least at the moment. I still have some features
to implement, and I want to do a bunch more chaos testing on the network simulator. 

- Jason

Jason Aten

Sep 2, 2025, 1:43:50 AM
to raft-dev
Hi Jesse (and other Mongo folks...)

I've been reading the MongoRaftReconfig algorithm paper,
"Design and Analysis of a Logless Dynamic Reconfiguration Protocol"
https://arxiv.org/abs/2102.11960

and I find it appealing -- both that the storage is separate from the regular Raft log, and that there seems to be pretty strong evidence of correctness (a human proof, some model checking, and I am guessing several years of real-world operational experience at this point).

I wonder if it can be adapted to do Joint Consensus (JC) instead of Single-server-at-a-time (SSAT) reconfigurations? Just require two separate quorums, one from each of the two member sets C_old and C_new in the JC, as in Raft's JC approach. Does the mechanism break?

Thanks!
Jason

A. Jesse Jiryu Davis

Sep 2, 2025, 9:54:26 AM
to raft...@googlegroups.com
My colleague Will Schultz, an author on that paper, says his intuition is that joint consensus would be compatible with logless reconfig. We still just rely on waiting on particular quorum conditions in the logless protocol. We could likely alter these waiting conditions appropriately to support joint consensus.

At MongoDB we decided that single-server-at-a-time is simpler and easier to explain. We've never, to my knowledge, regretted the decision: we don't need joint consensus.
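
For reference, the joint-consensus quorum condition Jason asks about is easy to state in code; a hedged Go sketch (illustrative only, not from the paper or MongoDB):

package reconfig

// majorityOf reports whether the acked set contains a majority of members.
func majorityOf(members []string, acked map[string]bool) bool {
	n := 0
	for _, m := range members {
		if acked[m] {
			n++
		}
	}
	return n > len(members)/2
}

// jointQuorum: during joint consensus, a set of acknowledging voters
// counts as a quorum only if it holds a majority of C_old and,
// separately, a majority of C_new.
func jointQuorum(cOld, cNew []string, acked map[string]bool) bool {
	return majorityOf(cOld, acked) && majorityOf(cNew, acked)
}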


Jason Aten

Sep 2, 2025, 10:56:41 AM
to raft-dev
Thanks guys!