Questions regarding AppendEntry RPC response.

195 views
Skip to first unread message

Jaseem Abid

unread,
Nov 21, 2015, 2:47:14 PM11/21/15
to raft-dev
Hey everyone, 

I'm trying to implement RAFT and I'm unable to figure out a few
things.

The paper doesn't mention a property called match_index in AppendEntry RPC
reply, but the formal TLA+ spec does. If the response to an RPC is of the form
{from, term, success}, there is no way for the leader to know when to increment
the corresponding nextIndex or matchIndex for the particular client.

Since a heartbeat will reply `success=true`, the leader cannot increment
`next_index` by one for all successful RPC calls. The leader needs to be able to
selectively increment corresponding peer's next_index only after appending
client commands, and an explicit `match_index` in RPC reply seems like the the
only way to do so.

Also, it would be really helpful if someone can explain why 3 variables
(next_index, match_index and commit_index) are needed, instead of only 2
(next_index and commit_index).

I'm unable to reach a conclusion. Any help appreciated.

jordan.h...@gmail.com

unread,
Nov 21, 2015, 8:48:18 PM11/21/15
to raft...@googlegroups.com
The TLA+ spec tends to use a few patterns that are different than what the paper because of the semantics of TLA+. Indeed, it does look like the AppendEntries response contains matchIndex. Normally, the leader holds matchIndex and nextIndex for each follower, and indeed it seems the end result of what's in the TLA+ spec and typical Raft implementations is the same. The changes are just sent in AppendEntries responses in the spec. 

Really, in the spec the follower is just sending prevLogIndex + Len(entries) back to the leader as matchIndex. For a successful AppendEntries RPC in a real implementation, that can be deduced without sending that information in responses since it's easier to hold state related to a specific request/response. So, that information just seems to be sent in responses in the spec to simplify the spec. I suppose you could do the same in an implementation, it would just be odd in a normal programming language.

So, that is, in a typical Raft implementation the leader knows when to increment nextIndex and matchIndex based on the context of an AppendEntries request and response. TLA+ may not handle context well, but most languages do. If the leader sends 10 entries to a follower with prevLogIndex = 5 and the follower responds success = true, indicating prevLogIndex/Term matched and the follower appended the entries, the leader can set matchIndex to 15 and nextIndex to 16 for that follower. If the follower responds success = false, the leader can decrement nextIndex to 5 and try again.

On nextIndex, matchIndex, and commitIndex:
You mention specifically why not nextIndex and commitIndex? The reason for separating matchIndex and nextIndex is because there can be big gaps between nextIndex and matchIndex and nextIndex and commitIndex. Note in the spec (and in the paper) when a server becomes the leader, matchIndex is initialized to 0 and nextIndex to Len(log) + 1. In practice, this allows the leader to start sending entries at the end of its log rather than at the beginning, assuming the follower will have most of the leader's entries. If you initialize nextIndex to 1 then the leader may end up sending its entire log to the follower. You may think that the follower could just send back its last log index, but that index's term may not match the leader's term or even be present in the leader's log. Consistency checks still need to be done to find the last index in the follower's log that matches the leader's log. So, it's really about converging from the leader's last log index to the follower's last log index. As AppendEntries RPCs fail and nextIndex is decremented, eventually an AppendEntries RPC will succeed (this process can be sped up by having the follower send back its last index and setting nextIndex to the follower's lastIndex + 1). That is where the follower's matchIndex is set. Before the matchIndex can be set, a successful AppendEntries RPC must be received by the leader for that index, indicating the term of that index matches the leader's term for that index. That way, we can calculate commitIndex based on the matchIndex for all followers. So, in that sense, matchIndex indicates not only that a follower has entries up to that index, but also that those entries are equal to the leader's entries up to that index, if that makes sense. nextIndex doesn't necessarily indicate anything about a follower's log, only that the leader will try that index next, so we can't assume anything with respect to commitIndex based on nextIndex.

Sorry if that explanation seemed all over the place :-)
--
You received this message because you are subscribed to the Google Groups "raft-dev" group.
To unsubscribe from this group and stop receiving emails from it, send an email to raft-dev+u...@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Diego Ongaro

unread,
Nov 22, 2015, 5:24:14 AM11/22/15
to raft...@googlegroups.com
Excellent answer, Jordan. It saved me a bunch of typing.

The formal spec indeed uses message-passing whereas the paper uses more structured request-response RPCs. There are a few additional differences in the spec; these are explained in section 8.1 of my dissertation: https://github.com/ongardie/dissertation#readme

As to nextIndex vs matchIndex, the distinction is also useful if you pipeline the AppendEntries requests.

If you're into reading specs, you may want to look into Verdi's as well. BTW, I'm not aware of any write-up comparing the Verdi spec and the TLA+ spec; that'd be quite welcome.

Best,
Diego
Reply all
Reply to author
Forward
0 new messages