Scenario where leader moves current_lsn of the follower behind unanimously committed lsn

Nirvik Ghosh

Mar 23, 2021, 4:44:01 PM
to raft-dev
Question 1:
Is it normal/expected for the leader to receive a rejection message from a follower that causes the leader to move that follower's current_lsn behind the unanimously committed entry?
(P.S.: There is no new member involved in this scenario.)

If the answer is No, please read the Scenario section.
If the answer is Yes, wouldn't that violate the durability guarantee?

Scenario
Let's say the follower is at term X, lsn Y was unanimously committed at term X-5, and the term of that log entry was X-10.

During term X-10, the leader sent an AppendEntriesRequest { lsn: Y, term: X-10 } to the follower. However, due to some slowness, the message only reaches the follower at term X.
Now, at term X, the follower processes the AppendEntriesRequest { lsn: Y, term: X-10 } from the leader of term X-10, who is also the leader at term X. The follower rejects the message with a stale-term error and sets the current term of the AppendResult to X, i.e., it sends out AppendResult { success: false, lsn: Y, term: X }.

Since the leader is at term X and receives a rejected AppendResult with term X, it will not ignore the AppendResult and will end up moving the follower's current lsn behind the lsn that was unanimously committed.


Thanks,
Nirvik

Jordan Halterman

Mar 23, 2021, 4:55:25 PM
to raft...@googlegroups.com
A rejected AppendEntries RPC should not cause the leader to modify the commitIndex. The commitIndex is computed from the set of matchIndex values acknowledged by followers. The leader does not update a follower's matchIndex when it receives a rejected AppendEntries response, so the commitIndex remains unchanged. The commitIndex only moves when a follower responds successfully to an AppendEntries RPC.
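For illustration, here is a minimal Python sketch of that rule (the names advance_commit_index, match_index, etc. are my own, not from any particular implementation). Since a rejected response never updates match_index, this computation can only ever move commit_index forward:

```python
def advance_commit_index(log, match_index, current_term, commit_index, cluster_size):
    """Hypothetical leader-side commit rule; lsn/indices are 1-based."""
    for n in range(commit_index + 1, len(log) + 1):
        # Count replicas holding entry n: the leader itself plus every
        # follower whose acknowledged matchIndex has reached n.
        replicas = 1 + sum(1 for m in match_index.values() if m >= n)
        # Advance only on a majority, and only for entries from the
        # leader's own term (Raft paper, section 5.4.2).
        if 2 * replicas > cluster_size and log[n - 1]["term"] == current_term:
            commit_index = n
    return commit_index

# Example: 3-node cluster; one follower has acknowledged up to entry 2.
log = [{"term": 1}, {"term": 2}]
print(advance_commit_index(log, {"f1": 2, "f2": 0}, 2, 0, 3))  # -> 2
```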

Jordan Halterman

Nirvik Ghosh

Mar 23, 2021, 5:15:51 PM
to raft-dev
Thanks for the quick response.

My follow-up question would be: given the scenario explained above, is it possible for nextIndex[follower] to become less than matchIndex[follower]? If yes, is that normal?

Nirvik

John Ousterhout

Mar 23, 2021, 5:31:01 PM
to raft...@googlegroups.com
That should never be able to happen. The leader only sets its matchIndex for a follower to a given value once it knows the follower's log matches its own up to that point, so there is no reason for nextIndex ever to fall below that.
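To make the invariant concrete, here is a hypothetical sketch of the leader's bookkeeping (names are illustrative, not from the paper): nextIndex is clamped so it can never fall below matchIndex + 1:

```python
class LeaderState:
    """Per-leader bookkeeping; lsn/indices are 1-based."""
    def __init__(self, followers):
        self.match_index = {f: 0 for f in followers}  # highest lsn known to match
        self.next_index = {f: 1 for f in followers}   # next lsn to send

def on_append_success(state, follower, acked_lsn):
    # matchIndex only ever moves forward, and nextIndex stays just past
    # it, so nextIndex >= matchIndex + 1 holds after every success.
    state.match_index[follower] = max(state.match_index[follower], acked_lsn)
    state.next_index[follower] = max(state.next_index[follower],
                                     state.match_index[follower] + 1)

def on_append_rejection(state, follower):
    # Back off to retry with an earlier entry, but never behind the
    # highest lsn the follower is already known to match.
    state.next_index[follower] = max(state.match_index[follower] + 1,
                                     state.next_index[follower] - 1)
```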

-John-

Nirvik Ghosh

Mar 23, 2021, 6:58:00 PM
to raft-dev
Thanks for the response John.

What if the network reorders messages? Then it could still be possible, right?

Considering the same scenario explained above:

Term X-10
1. Leader sends AppendEntriesRequest { lsn: Y, term: X-10 } to the follower. ------- (1*)
2. The follower has not received this request yet.

Term X
3. The leader of term X-10 becomes leader again in term X.
4. The leader tries to figure out the matchIndex of that follower.
5. Leader sends AppendEntriesRequest { lsn: Y, term: X } to the follower. --------- (2*)
6. Follower receives request (2*) and sends back a successful response, AppendResult { success: true, lsn: Y, term: X }. ---- (3*)
7. Follower now receives request (1*) and sends back a rejection due to the stale term, AppendResult { success: false, lsn: Y, term: X }. --- (4*)

Now the leader receives (3*) and (4*), in that order. Processing (3*) sets matchIndex[follower] to Y. Processing (4*) sets nextIndex[follower] to Y-1.
By now, nextIndex[follower] < matchIndex[follower].

Do let me know if there are any inaccuracies in the scenario explained above.

John Ousterhout

Mar 23, 2021, 7:51:49 PM
to raft...@googlegroups.com
The short answer is that the leader has to ignore response (4*) because it is stale. How that is done may depend on the implementation. The spec in the paper assumes RPCs, where each response is associated with a specific request, and the leader has the request in hand when it processes the response. If you implement this spec, then the leader will have no request associated with (4*), so it will presume that (4*) is left over from an earlier term (which it is) and discard it. Another possibility is for the follower to include the term of the request in each response; this would also allow the leader to detect that it is stale.
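For concreteness, here is a rough Python sketch of the second option (AppendResult, LeaderState, and handle_append_result are hypothetical names, not the paper's): the follower echoes the term of the request it is answering, and the leader drops any response whose echoed term is not its current term:

```python
from dataclasses import dataclass, field

@dataclass
class AppendResult:
    success: bool
    lsn: int
    term: int           # follower's current term when it answered
    request_term: int   # term carried by the request being answered

@dataclass
class LeaderState:
    current_term: int
    match_index: dict = field(default_factory=dict)  # follower -> highest matching lsn
    next_index: dict = field(default_factory=dict)   # follower -> next lsn to send

def handle_append_result(state, follower, result):
    if result.request_term != state.current_term:
        return  # stale response to a request from an earlier term, like (4*)
    if result.success:
        state.match_index[follower] = max(state.match_index.get(follower, 0),
                                          result.lsn)
        state.next_index[follower] = state.match_index[follower] + 1
    else:
        # A genuine mismatch in the current term: back off, but never
        # behind the known match point.
        state.next_index[follower] = max(state.match_index.get(follower, 0) + 1,
                                         state.next_index.get(follower, 1) - 1)
```

With this check, (4*) carries request_term = X-10 while the leader's current term is X, so it is discarded and the nextIndex[follower] < matchIndex[follower] situation cannot arise.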

-John-

Nirvik Ghosh

Mar 23, 2021, 9:54:54 PM
to raft-dev
That makes sense. Thanks 

Arun Sharma

Mar 25, 2021, 5:15:45 PM
to raft-dev
On Tuesday, March 23, 2021 at 4:51:49 PM UTC-7 ous...@cs.stanford.edu wrote:
The short answer is that the leader has to ignore response (4*) because it is stale. How that is done may depend on the implementation. The spec in the paper assumes RPCs, where each response is associated with a specific request, and the leader has the request in hand when it processes the response. If you implement this spec, then the leader will have no request associated with (4*), so it will presume that (4*) is left over from an earlier term (which it is) and discard it. Another possibility is for the follower to include the term of the request in each response; this would also allow the leader to detect that it is stale.

Thank you for confirming this. A Python async implementation of Raft I'm working on had this problem, and I coded up the same solution you suggest above. I looked around to see whether other implementations were doing something similar but couldn't readily confirm my hypothesis. Perhaps the handling is buried in the guts of the RPC system.

Here is the solution I used with zeromq sockets:

* Each Raft message gets a UUID.
* RPC is async, not request-response.
* When responses come in, they carry the UUID of the request they correspond to.
* The leader keeps a cache of recently sent messages, with an upper limit on the size and a TTL.

When a response comes in and we find a match, we know what the response was about. If we don't find a match, we discard the response, just as you suggest. But it's possible we hit that case simply because there were a lot of messages and our size-limited tracking structure evicted some entries. I don't think that's a correctness issue; just wanted to make sure.
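In case it helps, a minimal sketch of that cache (hypothetical names; the real version sits on top of zeromq sockets and carries full Raft messages):

```python
import time
import uuid
from collections import OrderedDict

class PendingRequests:
    """Size- and TTL-bounded map from message UUID to the request it tracks."""
    def __init__(self, max_size=1024, ttl_seconds=5.0):
        self.max_size = max_size
        self.ttl_seconds = ttl_seconds
        self.cache = OrderedDict()  # uuid -> (request, sent_at)

    def track(self, request):
        msg_id = str(uuid.uuid4())
        self.cache[msg_id] = (request, time.monotonic())
        while len(self.cache) > self.max_size:
            self.cache.popitem(last=False)  # evict the oldest entry
        return msg_id  # attach this to the outgoing message

    def match(self, msg_id):
        entry = self.cache.pop(msg_id, None)
        if entry is None:
            return None  # unknown or evicted: treat the response as stale
        request, sent_at = entry
        if time.monotonic() - sent_at > self.ttl_seconds:
            return None  # expired: likewise discard
        return request
```

Dropping a response because its entry was evicted just looks like a lost message, which Raft already has to tolerate via retries.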

I have other questions about the correctness of the changes I'm making to raft, but will ask those in a separate thread.

 -Arun

Jordan Halterman

Mar 25, 2021, 5:24:17 PM
to raft...@googlegroups.com
Your solution is perfectly fine. This is essentially how synchronous RPCs are implemented on asynchronous transports. 

Arun Sharma

Mar 25, 2021, 8:33:11 PM
to raft-dev
Thanks for confirming that it's not crazy :)

It looks like many of the existing implementations use TCP (reliable, in-order delivery) for RPCs. However, the layers above still need to handle the case where the connection is reset and packets must be redelivered. I was looking for implementations that use IP/UDP broadcast to ease the network load on the leader in the presence of a large number of followers, relying on Raft itself to deal with lost packets (since that's needed even with TCP).

The Raft protocol itself will probably need changes to support more aggressive network optimizations. But before I propose anything: are there implementations that use broadcast that I can look at?