Some Raft newcomer questions:
The Raft dissertation and talks mention that the commit index
advancement is kept simple for understandability, and
that if we knew all servers had a log entry, then the Figure 3.7
problem cannot occur.
I've tried to implement that suggestion below.
(If all servers have an entry then commit it, even if not term=current.)
This is probably inconsequential as
far as optimizations goes, but I'm asking to make sure I have
understand the commit index setting mechanism properly...
and also to see if it might fail under certain other circumstances.
I looked at this code to get the idea of how to implement advanceCommitIndex()
and then wrote the following in Go... my question / attempt at implementing
this (admittedly minor) optimization is inline below.
Questions: basically, have I got the right idea? Also does joint-consensus conflict with it?
// AdvanceCommitIndex:
//
// When the Raft leader server hears back from the
// peer servers and updates their matchIndex, it knows
// the longest common prefix of all the reporting peers' logs,
// and can mark that index as "committed" when
// a) quorum is reached; and
// b) all future leaders will have that entry.
//
// and b) means either
// 1) the entry is from this leader's term (standard Raft rule);
// 2) all peers have the entry (not just a quorum;
// not standard Raft; I'm proposing implementation below
// to see if I've understood the idea...)
//
func (s *RaftServer) AdvanceCommitIndex() {
if s.role != LEADER {
return
}
// Find the smallest match index that is replicated to a majority
var matchIndexes []int64
// add leader's own, since we are not in peers
matchIndexes = append(matchIndexes, s.lastLogIndex())
for _, peer := range s.peers {
matchIndexes = append(matchIndexes, peer.MatchIndex)
}
// sort in increasing order
sort.Slice(matchIndexes, func(i, j int) bool {
return matchIndexes[i] < matchIndexes[j]
})
// For an ascending sorted slice of MatchIndexes,
// quorumMin computes the index of the
// smallest match still in quorum, so the
// log index that is smallest among a bare
// minimum of quorum members.
// The rest of the members of the
// quorum have >= matchIndexes.
// -- idea stolen from the RaftConsensus.cc:2185 and line 467
// Configuration::SimpleConfiguration::quorumMin() method.
i := s.quorumMinSliceIndex()
if i >= len(matchIndexes) {
// not enough for quorum??? would be out of bounds...
return
}
newCommitIndex := matchIndexes[i]
// Is the "all-nodes-have-it" exception to
// the rule possible?
newCommitIndexHasFullClusterAgreement := false
if len(matchIndexes) == s.clusterSize() &&
matchIndexes[0] == newCommitIndex {
newCommitIndexHasFullClusterAgreement = true
}
if s.state.CommitIndex >= newCommitIndex {
return // RaftConsensus.cc:2186
}
if newCommitIndex > int64(len(s.wal.RaftLog)) {
panic("should be impossible...out of bounds hmmm")
return
}
if s.wal.RaftLog[newCommitIndex-1].Term != s.state.CurrentTerm {
// "At least one of these entries must also be from the current term to
// guarantee that no server without them can be elected."
// -- RaftConsensus.cc:2192
if newCommitIndexHasFullClusterAgreement {
// optimized version falls through to commitWhatWeCan()
countTimesOptimizationPathTaken() // accumulate statistics
} else {
return
}
// the safe version, the standard Raft rule; what logcabin does:
//return // RaftConsensus.cc:2194
}
s.state.CommitIndex = newCommitIndex // RaftConsensus.cc:2195
// THIS IS THE COMMIT ON THE LEADER.
s.commitWhatWeCan(true)
...
// The other question/reason I'm not doing this atm is because
// I haven't implemented the joint consensus
// transition stuff, and I don't know how this
// optimization would interact with that; I
// don't know what the proper definition of
// cluster size would be in those situations,
// should the C_new, C_old, C_old+new, or
// what be used? Better to play it safe for now,
// but maybe the raft-dev list can offer
// some insight?
//
// With the joint-consensus cluster membership
// change, is the optimization above still safe?
Question 3) On page 74 of the Raft dissertation, in figure 6.3, why is
the lease extension = start + (the election timeout divided by the clock drift bound)?
I would have thought you subtracted the drift bound, not divided by it? Is this just a typo?
Thanks in advance for your input.
Jason