Leadership change and Snapshotting

Ensar Basri Kahveci

unread,

Nov 27, 2017, 9:00:10 AM11/27/17

to raft-dev

Hello guys,

I have been reading the Raft paper recently and I have some confusion about the snapshotting algorithm. Can you please check the following scenario:

Initially, there are 3 nodes: Node1, Node2, and Node3, as shown in STATE 1 below. Node1 is the leader and the current term is 1. Nodes take snapshots independently, which is ok due to the Raft algorithm. Lets say a node takes a new snapshot when there are more than 100 committed entries after the last snapshot. In STATE 1, Node1 has managed to commit 100 entries together with Node2. On the other hand, for some reason Node3 has missed the last AppendEntries RPC containing the entry with index=100.

==== STATE 1 ====

Node1 (LEADER):

Term: 1

Log: [ ]

CommitIndex: 100

Snapshot: <index=100, term=1>

Node3-MatchIndex: 100, Node3-NextIndex: 101

Node3-MatchIndex: 99, Node2-NextIndex: 100

Node2 (FOLLOWER):

Term: 1

CommitIndex: 100

Log: [ ]

Snapshot: <index=100, term=1>

Node3 (FOLLOWER):

Term: 1

CommitIndex: 99

Log: [ ..., Entry<index=98, term=1>, Entry<index=99, term=1> ]

Snapshot: -

=================

When the system is in this state, STATE 1, Node1 crashes. Node2 and Node3 eventually timeout, start elections, and Node2 wins the election with term=3, since it has a longer log. Now we are in STATE 2. Node2 initializes the match index and next index for Node3 with 0 and 101, respectively.

==== STATE 2 ====

Node1: CRASHED.

Node2 (LEADER):

Term: 3

CommitIndex: 100

Log: [ ]

Snapshot: <index=100, term=1>

Node3-MatchIndex: 0, Node2-NextIndex: 101

Node3 (FOLLOWER):

Term: 3

CommitIndex: 99

Log: [ Entry<index=98, term=1>, Entry<index=99, term=1> ]

Snapshot: -

=================

In this case, lets say Node2 appends a new entry: Entry<index=101, term=3>, as in STATE 3, then it sends an AppendEntries RPC with the following content: AppendEntries{prevLogIndex=100, prevLogTerm=1, entries=[ Entry<index=101, term=3> ]}. Node3 rejects this AppendEntries RPC since it does not have an entry at index=100. Then, Node2 decrements the next index to 100 for Node3, as in STATE 4, and notices that index=100 has been put into the last snapshot. Therefore, it will send an InstallSnapshot RPC to Node3 with the following content: InstallSnapshot{index=100, term=1}.

==== STATE 3 ====

Node2 (LEADER):

Term: 3

CommitIndex: 100

Log: [ Entry<index=101, term=3> ]

Snapshot: <index=100, term=1>

Node3-MatchIndex: 0, Node2-NextIndex: 101

Node3 (FOLLOWER):

Term: 3

CommitIndex: 99

Log: [ Entry<index=98, term=1>, Entry<index=99, term=1> ]

Snapshot: -

=================

==== STATE 4 ====

Node2 (LEADER):

Term: 3

CommitIndex: 100

Log: [ Entry<index=101, term=3> ]

Snapshot: <index=100, term=1>

Node3-MatchIndex: 0, Node2-NextIndex: 100

Node3 (FOLLOWER):

Term: 3

CommitIndex: 99

Log: [ Entry<index=98, term=1>, Entry<index=99, term=1> ]

Snapshot: NA (not taken yet)

=================

The InstallSnapshot RPC (Figure 13) part of the Raft paper starts the receiver implementation as follows:

1) Reply immediately if term < currentTerm

So, when Node3 receives InstallSnapshot{index=100, term=1} RPC, it just ignores the RPC since its current term=3 is bigger than the term=1 of the RPC.

In this case, Node2 cannot replicate any new entries and cannot install a snapshot to Node3. It just hangs. Is this a bug in combination of the leadership change and snapshotting logics, or is there something I am missing?

Thank you for your time and thank you in advance for your reply.

Regards,

Ensar Basri Kahveci

unread,

Nov 27, 2017, 9:06:58 AM11/27/17

to raft...@googlegroups.com

Sorry for my last minute copy-paste errors :)

Please see the following corrections to my question:

For Node1 in STATE 1:

Node2-MatchIndex: 100, Node2-NextIndex: 101

Node3-MatchIndex: 99, Node3-NextIndex: 100

For Node2 in STATE 2:

Node3-MatchIndex: 0, Node3-NextIndex: 101

For Node2 in STATE 3:

Node3-MatchIndex: 0, Node3-NextIndex: 101

For Node2 in State 4:

Node3-MatchIndex: 0, Node3-NextIndex: 100

Regards,

--
You received this message because you are subscribed to a topic in the Google Groups "raft-dev" group.
To unsubscribe from this topic, visit https://groups.google.com/d/topic/raft-dev/fW0F92DstZs/unsubscribe.
To unsubscribe from this group and all its topics, send an email to raft-dev+unsubscribe@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

--

Ensar Basri Kahveci

Distributed Systems Engineer

Ensar Basri Kahveci

unread,

Nov 28, 2017, 3:26:45 AM11/28/17

to raft...@googlegroups.com

Ok. I got it.

I am missing the term parameter in the InstallSnapshot RPC and mistakenly doing the term check with lastIncludedTerm. When "Reply immediately if term < currentTerm" check is done with the leader's term, then the algorithm works fine.

Regards,

Reply all

Reply to author

Forward