leader fails fsync after entries are committed by majority followers


Nirvik Ghosh

Jun 21, 2024, 1:43:49 PM
to raft-dev
Regarding section 10.2.1 in the dissertation: it talks about the leader sending acknowledgements back to the client once a majority of the followers have applied the log entry and fsync'd it.

However, what happens when a majority of followers have applied the log entry and fsync'd it, but the leader hasn't finished its own fsync? In that case the leader would still consider the write committed and send an ACK to the client. But what if the fsync then fails on the leader and the client sends a read request to that leader? In that case, the leader would serve a stale entry, right?

P.S. The client expects consistent reads.

Here is the flow (say we have a cluster of 3 nodes: L (leader), F1 (follower 1), F2 (follower 2)):

1. Client sends W1 to L
2. L appends W1 to its log 
3. L sends W1 to F1 and F2 
4. L initiates fsync 
5. F1 and F2 apply W1 and finish their fsync
6. F1 and F2 send a successful AppendResult back to the leader
7. L marks the entry as committed and sends an ACK to the client
8. The fsync on L fails
9. Client sends a read request to L for the same region where W1 was applied


Oren Eini (Ayende Rahien)

Jun 21, 2024, 2:31:50 PM
to raft...@googlegroups.com
A failed fsync usually requires restarting the server to ensure that you are back in a consistent state.
In other words, you'll lose leadership.
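
As a rough illustration of that policy, here is a minimal Python sketch that treats any fsync error as fatal for the node's leadership. The `Node` class, its `needs_restart` flag, and the injectable `fsync` parameter are hypothetical names made up for this example and are not taken from any Raft implementation; the injected `failing_fsync` only simulates the error.

```python
import os
import tempfile


class Node:
    """Hypothetical node that treats any fsync error as fatal."""

    def __init__(self, fd, fsync=os.fsync):
        self.fd = fd
        self.fsync = fsync          # injectable so a failure can be simulated
        self.role = "leader"
        self.needs_restart = False

    def durable_append(self, payload: bytes) -> bool:
        """Write and fsync one record; on failure, give up leadership."""
        try:
            os.write(self.fd, payload)
            self.fsync(self.fd)
            return True
        except OSError:
            # After a failed fsync the on-disk state is unknown, so stop
            # acting as leader and flag the process for a restart.
            self.role = "follower"
            self.needs_restart = True
            return False


def failing_fsync(fd):
    """Stand-in for a disk that rejects the flush."""
    raise OSError("simulated fsync failure")


fd, path = tempfile.mkstemp()
try:
    healthy = Node(fd)
    ok = healthy.durable_append(b"W1\n")
    broken = Node(fd, fsync=failing_fsync)
    failed = broken.durable_append(b"W2\n")
finally:
    os.close(fd)
    os.remove(path)

print(ok, healthy.role)      # True leader
print(failed, broken.role)   # False follower
```

A real server would also stop answering client requests once `needs_restart` is set; that part is left out here.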

--
You received this message because you are subscribed to the Google Groups "raft-dev" group.
To unsubscribe from this group and stop receiving emails from it, send an email to raft-dev+u...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/raft-dev/90d55c57-fc1d-4c61-876f-61a08eaa2d78n%40googlegroups.com.



Nirvik Ghosh

Jun 21, 2024, 5:08:11 PM
to raft-dev
Thanks for the prompt response. 

What about the case where the leader is still in the middle of the fsync and a client read request comes in? Should the leader wait until the fsync finishes successfully before replying to the read request, or, if it finishes with an error, step down to follower?

Oren Eini (Ayende Rahien)

Jun 21, 2024, 5:16:41 PM
to raft...@googlegroups.com
Remember, Raft is a two-stage process.
You have the commit-to-the-log stage, and then you have the apply stage.

When the other nodes confirm that they got (and fsynced!) the value, the leader can start applying the command. Only after _that_ is done can read requests actually respond with the new data.

In general, the way I implemented it (and how most others do), you accept the command, write it to disk + fsync (or O_DIRECT, etc.), and then send it to the other nodes.
That is complicated by batching, etc.
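
The two stages can be sketched roughly as follows. This is a toy, single-process Python model under stated assumptions: the `Leader` class and its method names are hypothetical, replication and fsync are not modeled, and `mark_committed` stands in for "a majority has confirmed the entry".

```python
class Leader:
    """Toy model of the commit stage and the apply stage."""

    def __init__(self):
        self.log = []           # (key, value) commands in log order
        self.commit_index = 0   # number of entries known to be committed
        self.last_applied = 0   # number of entries applied to the state machine
        self.state = {}         # the state machine that reads are served from

    def append(self, key, value):
        self.log.append((key, value))
        return len(self.log)    # 1-based index of the new entry

    def mark_committed(self, index):
        # Stage 1: the entry is committed to the log (never moves backwards).
        self.commit_index = max(self.commit_index, min(index, len(self.log)))

    def apply_committed(self):
        # Stage 2: apply committed entries, in order, to the state machine.
        while self.last_applied < self.commit_index:
            key, value = self.log[self.last_applied]
            self.state[key] = value
            self.last_applied += 1

    def read(self, key):
        # Reads only ever see applied entries.
        return self.state.get(key)


leader = Leader()
idx = leader.append("x", 1)
before_commit = leader.read("x")   # None: only appended
leader.mark_committed(idx)
before_apply = leader.read("x")    # None: committed but not yet applied
leader.apply_committed()
after_apply = leader.read("x")     # 1: applied, so reads can return it
```

In this model the client ACK for the write would be sent only after `apply_committed` has run, so a later read cannot miss an acknowledged write.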



Archie Cobbs

Jun 21, 2024, 5:24:13 PM
to raft...@googlegroups.com
I think there may be a subtle misunderstanding implied by the question here:

On Fri, Jun 21, 2024 at 12:43 PM Nirvik Ghosh <nirvi...@gmail.com> wrote:
Regarding section 10.2.1 in the dissertation: it talks about the leader sending acknowledgements back to the client once a majority of the followers have applied the log entry and fsync'd it.

It actually doesn't say "majority of followers" is required. What's required is a "majority of nodes" where "nodes" includes the leader as well.

Slightly confusingly, the caption for Figure 10.2 says "Once the leader receives positive AppendEntries responses from half of its followers...", but that's because there are 4 followers in the diagram example, so half of them (2) plus the leader (1) makes 3, and that constitutes a majority (3 of 5 total nodes).

Only when a (strict) majority of nodes have fsync'd the data and updated their commit index can it be considered committed - but that "majority" includes the leader.
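
The arithmetic above can be written out as a small Python sketch. The function names `majority` and `follower_acks_needed` are made up for this illustration.

```python
def majority(cluster_size: int) -> int:
    """Smallest number of nodes that forms a strict majority."""
    return cluster_size // 2 + 1


def follower_acks_needed(cluster_size: int, leader_has_fsynced: bool) -> int:
    """Follower confirmations needed, given whether the leader counts itself."""
    own_vote = 1 if leader_has_fsynced else 0
    return majority(cluster_size) - own_vote


print(majority(5))                      # 3
print(follower_acks_needed(5, True))    # 2, i.e. half of the 4 followers
print(follower_acks_needed(3, True))    # 1
```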

-Archie

--
Archie L. Cobbs

Nirvik Ghosh

Jun 21, 2024, 6:00:48 PM
to raft-dev
It is very interesting that the majority includes the leader. My understanding came from this particular line in the dissertation (towards the end of 10.2.1):
"The leader may even commit an entry before it has been written to its own disk, if a majority of followers have written it to their disks; this is still safe. LogCabin implements this optimization."

To me, this statement implied that the majority need not include the leader and that the acknowledgement can be sent to the client.

Archie Cobbs

Jun 21, 2024, 6:27:29 PM
to raft...@googlegroups.com
Sorry if what I said was ambiguous... The majority *may* include the leader, but doesn't have to, e.g. in a weird case where the followers can receive a message, write it to disk, and respond with a confirmation faster than the leader can write to its own local disk.

Remember a leader can stop being a leader at any time. Or more than one node can think it's the leader at the same time, or zero nodes. So technically speaking it's not precise to refer to "the leader" unless you are assuming a stable cluster. So you can imagine why there needs to be symmetry with respect to how nodes are counted for commit purposes.

You can think of Raft as (a) a protocol with guarantees that if a (strict) majority of nodes have committed a log entry to disk, then that log entry can never be reverted (although that may take an arbitrarily long time to discover/confirm), plus (b) leader and follower roles that nodes play to ensure this property. From the perspective of (a), the leader is just another node.
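
To make the "leader is just another node" counting concrete, here is a minimal Python sketch. It assumes each node reports the highest log index it has fsync'd. The function name `committed_index` and the node labels are hypothetical, and the sketch leaves out Raft's additional rule that a leader only commits entries from its current term by counting replicas.

```python
def committed_index(durable_index_by_node: dict) -> int:
    """Highest log index that a strict majority of nodes has on disk.

    Every node is counted the same way; the leader gets no special role.
    """
    indexes = sorted(durable_index_by_node.values(), reverse=True)
    majority = len(indexes) // 2 + 1
    # The majority-th largest value is durable on at least `majority` nodes.
    return indexes[majority - 1]


# Leader's own fsync is still pending, but both followers have entry 1 on disk.
print(committed_index({"L": 0, "F1": 1, "F2": 1}))   # 1

# Only the leader has entry 1 on disk: not committed yet.
print(committed_index({"L": 1, "F1": 0, "F2": 0}))   # 0
```

The first call is the case from the dissertation quote: the entry is committed even though the leader's own disk write has not finished.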

-Archie
