Hey Pablo,
My replies are below. I wrote too much, but hopefully this helps.
On Sun, Jan 19, 2014 at 1:14 PM, Pablo Sebastián Medina
<
pablom...@gmail.com> wrote:
> Hi all,
>
> A few questions after reviewing the latest draft of the paper:
>
> 1. Shouldn't commitIndex be part of persistent state on all servers? How can
> a Leader elected after a cluster restart know the commitIndex if its a
> volatile state?
The only purpose of commitIndex is to bound how far state machines can
advance. It's ok to hold back state machines for brief periods of
time, especially when a new leader comes online, since state machines
are briefly stalled anyways with respect to new entries then.
The source of truth for which entries are committed (guaranteed to
persist forever) is the cluster's logs. Servers keep track of the
latest index they know is committed, but there may always be more
entries committed (in the guaranteed to persist forever sense) than
servers' commitIndex values reflect. For example, the moment an entry
is sufficiently replicated, it is committed, but it takes the leader
receiving a bunch of acknowledgements to realize this and update its
commitIndex. If you think of commitment that way, it's not a big
stretch to accept the idea that servers' commitIndex values may
sometimes reset to 0 (when they reboot) and nothing bad comes of it.
A newly elected leader will advance its commitIndex value once it
commits its first entry. (Was that a circular definition? I mean once
it replicates the first entry having its current term to a majority of
the cluster.) Then, other servers will advance their commitIndex
values on the leader's next AppendEntries request (or heartbeat) to
them. So even if a leader starts out with a commitIndex of 0, it and
the rest of the cluster will be able to advance their state machines
shortly after the leader commits a new entry.
The last complication is that In the absence of client requests, the
leader might not create any entries, so it wouldn't be able to commit
any new entries. If you want it to update its commitIndex quickly
anyway, for example for read-only operations, you can have new leaders
append a no-op entry to their logs. The paper explains this in the
client interaction section.
> 2. What is the purpose of lastApplied state ? The paper states the next
> about it: "...index of highest log entry applied to state
> machine". If a commited entry is applied to the state machine, why Raft
> handles commits and applies as different things?
This is something you probably figured out on your own, and somehow
the paper has given you the impression that there's more to it than
the obvious. There's not.
There's two ways for Raft to interface with your state machine: push
or pull. Raft can push committed entries to your state machine and
force it to process them the moment Raft realizes they're committed,
or your state machine can pull committed entries from Raft as it is
ready for them. Either way, you need to track which indexes the state
machine has already applied so that it can apply the next one in
order. In the push case, when commitIndex advances, this looks like:
while (lastApplied < newCommitIndex) { apply(lastApplied + 1);
lastApplied += 1; }
There are two important things to note here. First, lastApplied must
increase by 1 each time an entry is applied so that it reflects the
state of the state machine. It should never go backwards or skip
values or anything funny like that. Second, lastApplied should be as
durable as the state machine. If the state machine lives in memory,
then lastApplied should reset to 0 on a restart. If the state machine
lives on disk, then lastApplied should also be stored on disk.
> 3. What is the difference between nextIndex and matchIndex ? I don't get the
> purpose of matchIndex. It follows nextIndex increments when receiving
> succesfuls appendEntries results but not on failed appendEntries. It seems
> to be allowed to be bigger than nextIndex (due to not being decremented on
> failed appendEntries). Could you please clarify it.
nextIndex is a hint as to which entry to send to the follower next. It
might be too high (then AppendEntries requests get rejected) or too
low (then AppendEntries requests are redundant and have no effect),
but it'll self-correct.
matchIndex, on the other hand, has stricter guarantees, since it's
used to compute commitIndex. Raft maintains the invariant that
leader.log[1..matchIndex] == follower.log[1..matchIndex]. Once the
matchIndex values for a majority of the followers exceed the leader's
commitIndex, the leader can advance the commitIndex.
From that invariant, you can see that it's ok for matchIndex to be too
low (this just means that commitIndex won't advance as quickly as it
might otherwise). However, it's never ok for matchIndex to be too high
(this would allow a leader to mark entries committed that aren't
sufficiently replicated yet, leading to terrible safety problems). So
at the start of a leader's term, each follower's matchIndex is set to
0, and it's only advanced upon successful AppendEntries RPCs to that
follower.
This does mean that in normal operation, once the leader has found
where its log and the follower's log diverges, matchIndex will pretty
much always be equal to nextIndex - 1. However, this isn't always
true, so it would be a big mistake to "optimize" out one of these
variables.
Best,
Diego