Hello guys,
I have been reading the Raft paper recently and I have some confusion about the snapshotting algorithm. Can you please check the following scenario:
Initially, there are 3 nodes: Node1, Node2, and Node3, as shown in STATE 1 below. Node1 is the leader and the current term is 1. Nodes take snapshots independently, which is ok due to the Raft algorithm. Lets say a node takes a new snapshot when there are more than 100 committed entries after the last snapshot. In STATE 1, Node1 has managed to commit 100 entries together with Node2. On the other hand, for some reason Node3 has missed the last AppendEntries RPC containing the entry with index=100.
==== STATE 1 ====
Node1 (LEADER):
Term: 1
Log: [ ]
CommitIndex: 100
Snapshot: <index=100, term=1>
Node3-MatchIndex: 100, Node3-NextIndex: 101
Node3-MatchIndex: 99, Node2-NextIndex: 100
Node2 (FOLLOWER):
Term: 1
CommitIndex: 100
Log: [ ]
Snapshot: <index=100, term=1>
Node3 (FOLLOWER):
Term: 1
CommitIndex: 99
Log: [ ..., Entry<index=98, term=1>, Entry<index=99, term=1> ]
Snapshot: -
=================
When the system is in this state, STATE 1, Node1 crashes. Node2 and Node3 eventually timeout, start elections, and Node2 wins the election with term=3, since it has a longer log. Now we are in STATE 2. Node2 initializes the match index and next index for Node3 with 0 and 101, respectively.
==== STATE 2 ====
Node1: CRASHED.
Node2 (LEADER):
Term: 3
CommitIndex: 100
Log: [ ]
Snapshot: <index=100, term=1>
Node3-MatchIndex: 0, Node2-NextIndex: 101
Node3 (FOLLOWER):
Term: 3
CommitIndex: 99
Log: [ Entry<index=98, term=1>, Entry<index=99, term=1> ]
Snapshot: -
=================
In this case, lets say Node2 appends a new entry: Entry<index=101, term=3>, as in STATE 3, then it sends an AppendEntries RPC with the following content: AppendEntries{prevLogIndex=100, prevLogTerm=1, entries=[ Entry<index=101, term=3> ]}. Node3 rejects this AppendEntries RPC since it does not have an entry at index=100. Then, Node2 decrements the next index to 100 for Node3, as in STATE 4, and notices that index=100 has been put into the last snapshot. Therefore, it will send an InstallSnapshot RPC to Node3 with the following content: InstallSnapshot{index=100, term=1}.
==== STATE 3 ====
Node2 (LEADER):
Term: 3
CommitIndex: 100
Log: [ Entry<index=101, term=3> ]
Snapshot: <index=100, term=1>
Node3-MatchIndex: 0, Node2-NextIndex: 101
Node3 (FOLLOWER):
Term: 3
CommitIndex: 99
Log: [ Entry<index=98, term=1>, Entry<index=99, term=1> ]
Snapshot: -
=================
==== STATE 4 ====
Node2 (LEADER):
Term: 3
CommitIndex: 100
Log: [ Entry<index=101, term=3> ]
Snapshot: <index=100, term=1>
Node3-MatchIndex: 0, Node2-NextIndex: 100
Node3 (FOLLOWER):
Term: 3
CommitIndex: 99
Log: [ Entry<index=98, term=1>, Entry<index=99, term=1> ]
Snapshot: NA (not taken yet)
=================
The InstallSnapshot RPC (Figure 13) part of the Raft paper starts the receiver implementation as follows:
1) Reply immediately if term < currentTerm
So, when Node3 receives InstallSnapshot{index=100, term=1} RPC, it just ignores the RPC since its current term=3 is bigger than the term=1 of the RPC.
In this case, Node2 cannot replicate any new entries and cannot install a snapshot to Node3. It just hangs. Is this a bug in combination of the leadership change and snapshotting logics, or is there something I am missing?
Thank you for your time and thank you in advance for your reply.
Regards,