On Commit index persistence, lastApplied and matchIndex

Pablo Sebastián Medina

unread,

Jan 19, 2014, 4:14:49 PM1/19/14

to raft...@googlegroups.com

Hi all,

A few questions after reviewing the latest draft of the paper:

1. Shouldn't commitIndex be part of persistent state on all servers? How can a Leader elected after a cluster restart know the commitIndex if its a volatile state?

2. What is the purpose of lastApplied state ? The paper states the next about it: "...index of highest log entry applied to state

machine". If a commited entry is applied to the state machine, why Raft handles commits and applies as different things?

3. What is the difference between nextIndex and matchIndex ? I don't get the purpose of matchIndex. It follows nextIndex increments when receiving succesfuls appendEntries results but not on failed appendEntries. It seems to be allowed to be bigger than nextIndex (due to not being decremented on failed appendEntries). Could you please clarify it.

Regards.

Pablo.

Diego Ongaro

unread,

Jan 19, 2014, 6:37:08 PM1/19/14

to Pablo Sebastián Medina, raft...@googlegroups.com

Hey Pablo,

My replies are below. I wrote too much, but hopefully this helps.

On Sun, Jan 19, 2014 at 1:14 PM, Pablo Sebastián Medina
<pablom...@gmail.com> wrote:
> Hi all,
>
> A few questions after reviewing the latest draft of the paper:
>
> 1. Shouldn't commitIndex be part of persistent state on all servers? How can
> a Leader elected after a cluster restart know the commitIndex if its a
> volatile state?

The only purpose of commitIndex is to bound how far state machines can
advance. It's ok to hold back state machines for brief periods of
time, especially when a new leader comes online, since state machines
are briefly stalled anyways with respect to new entries then.

The source of truth for which entries are committed (guaranteed to
persist forever) is the cluster's logs. Servers keep track of the
latest index they know is committed, but there may always be more
entries committed (in the guaranteed to persist forever sense) than
servers' commitIndex values reflect. For example, the moment an entry
is sufficiently replicated, it is committed, but it takes the leader
receiving a bunch of acknowledgements to realize this and update its
commitIndex. If you think of commitment that way, it's not a big
stretch to accept the idea that servers' commitIndex values may
sometimes reset to 0 (when they reboot) and nothing bad comes of it.

A newly elected leader will advance its commitIndex value once it
commits its first entry. (Was that a circular definition? I mean once
it replicates the first entry having its current term to a majority of
the cluster.) Then, other servers will advance their commitIndex
values on the leader's next AppendEntries request (or heartbeat) to
them. So even if a leader starts out with a commitIndex of 0, it and
the rest of the cluster will be able to advance their state machines
shortly after the leader commits a new entry.

The last complication is that In the absence of client requests, the
leader might not create any entries, so it wouldn't be able to commit
any new entries. If you want it to update its commitIndex quickly
anyway, for example for read-only operations, you can have new leaders
append a no-op entry to their logs. The paper explains this in the
client interaction section.

> 2. What is the purpose of lastApplied state ? The paper states the next
> about it: "...index of highest log entry applied to state
> machine". If a commited entry is applied to the state machine, why Raft
> handles commits and applies as different things?

This is something you probably figured out on your own, and somehow
the paper has given you the impression that there's more to it than
the obvious. There's not.

There's two ways for Raft to interface with your state machine: push
or pull. Raft can push committed entries to your state machine and
force it to process them the moment Raft realizes they're committed,
or your state machine can pull committed entries from Raft as it is
ready for them. Either way, you need to track which indexes the state
machine has already applied so that it can apply the next one in
order. In the push case, when commitIndex advances, this looks like:
while (lastApplied < newCommitIndex) { apply(lastApplied + 1);
lastApplied += 1; }

There are two important things to note here. First, lastApplied must
increase by 1 each time an entry is applied so that it reflects the
state of the state machine. It should never go backwards or skip
values or anything funny like that. Second, lastApplied should be as
durable as the state machine. If the state machine lives in memory,
then lastApplied should reset to 0 on a restart. If the state machine
lives on disk, then lastApplied should also be stored on disk.

> 3. What is the difference between nextIndex and matchIndex ? I don't get the
> purpose of matchIndex. It follows nextIndex increments when receiving
> succesfuls appendEntries results but not on failed appendEntries. It seems
> to be allowed to be bigger than nextIndex (due to not being decremented on
> failed appendEntries). Could you please clarify it.

nextIndex is a hint as to which entry to send to the follower next. It
might be too high (then AppendEntries requests get rejected) or too
low (then AppendEntries requests are redundant and have no effect),
but it'll self-correct.

matchIndex, on the other hand, has stricter guarantees, since it's
used to compute commitIndex. Raft maintains the invariant that
leader.log[1..matchIndex] == follower.log[1..matchIndex]. Once the
matchIndex values for a majority of the followers exceed the leader's
commitIndex, the leader can advance the commitIndex.

From that invariant, you can see that it's ok for matchIndex to be too
low (this just means that commitIndex won't advance as quickly as it
might otherwise). However, it's never ok for matchIndex to be too high
(this would allow a leader to mark entries committed that aren't
sufficiently replicated yet, leading to terrible safety problems). So
at the start of a leader's term, each follower's matchIndex is set to
0, and it's only advanced upon successful AppendEntries RPCs to that
follower.

This does mean that in normal operation, once the leader has found
where its log and the follower's log diverges, matchIndex will pretty
much always be equal to nextIndex - 1. However, this isn't always
true, so it would be a big mistake to "optimize" out one of these
variables.

Best,
Diego

Diego Ongaro

unread,

Jan 19, 2014, 8:10:41 PM1/19/14

to Pablo Sebastián Medina, raft...@googlegroups.com

Oops, forgot that pesky second condition for commitment. Should read:

Once the matchIndex values for a majority of the followers exceed the

leader's commitIndex AND the term of log[matchIndex] is the leader's
current term, the leader can advance the commitIndex.

Jason Ni

unread,

Mar 1, 2015, 7:30:31 AM3/1/15

to raft...@googlegroups.com, pablom...@gmail.com

Hi Diego,

When I was reading your paper, I had the same question #3. The reason for the nextIndex would be too high could be that the leader set the value to last log index + 1 when this leader come to power. What about to set it to last committed log index + 1? So it couldn't be too high?

Best Regards,

Jason

在 2014年1月20日星期一 UTC+8上午7:37:08，Diego Ongaro写道：

Diego Ongaro

unread,

Mar 2, 2015, 12:33:33 PM3/2/15

to Jason Ni, raft...@googlegroups.com, Pablo Sebastián Medina

Hi Jason,

Even if a new leaders initialize the nextIndex values to its
commitIndex + 1, that might still be too high for a minority of
followers. For example, if a follower has an empty log because it's
been hibernating since the beginning of time, that nextIndex value
will still be too high.

The other thing is if you're pipelining the AppendEntries RPCs, the
easiest way to do this is to send the AppendEntries request and
increment nextIndex optimistically before getting a response. But, if
that request never makes it to the follower, nextIndex will be set too
high.

So nextIndex might still too high, and in practice the logic isn't too
burdensome to implement.

Hope this helps,
Diego

> --
> You received this message because you are subscribed to the Google Groups
> "raft-dev" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to raft-dev+u...@googlegroups.com.
> For more options, visit https://groups.google.com/d/optout.

Luca Lovagnini

unread,

Mar 3, 2015, 9:48:30 AM3/3/15

to raft...@googlegroups.com, pablom...@gmail.com

I was going to open a new post about the same question, luckily I found this one! I'm talking about the fact that commitIndex should be a persistent parameter and not a volatile one. I think that it should be persistent for two reason:

If you play with Raft Visualization you'll notice that if a node change its commitIndex, it will never be reset (while nextIndex and matchIndex will be correctly reinitialized when the node switch to a candidate state), just like currentTerm and votedFor, which are (in fact) persistent fields. My only conclusion is that commitIndex is a persistent state too.
In addiction to the previous point, I'll try to explain to you a case where there is an inconsistent state (I don't know if that's the correct term) if commitIndex is not and even lastApplied are not persistent fields! Suppose that we are implementing the Raft algorithm for a Distributed File System (as in my case). Suppose that a client submit a request to the FS to create a new file, which correspond to a new entry. The entry (the new file request) is correctly replicated by the leader to the other followers and so it's commited. So, the entry is applied to the state machine and finally the new file is created. Now, suppose that a Follower goes down, and then up again. What now happens (if commitIndex and lastApplied are not persistent and so reset) is that the entry which was already committed before will be committed again with the first AppendEntries received by the leader (in fact it will update commitIndex). As conclusion the entry is commited twice and the file is created two time (which is obviously not allowed)!

So my question is: where am I wrong? In particular in my example in point 2.

Diego Ongaro

unread,

Mar 3, 2015, 8:22:34 PM3/3/15

to Luca Lovagnini, raft...@googlegroups.com, Pablo Sebastián Medina

Hey Luca,

1. That's just unintended behavior in RaftScope. I created
https://github.com/ongardie/raftscope/issues/12 for this (though I
haven't worked on RaftScope for a long time now). While RaftScope is a
fun toy, please don't treat it as a reliable reference implementation.

2. lastApplied must be as persistent as the state machine. In other
words, if you lose the state machine, you should set lastApplied to 0.
If you are keeping the state machine on disk, you should keep
lastApplied along with it on disk.

See also section 3.8 "Persisted state and server restarts" in my
dissertation: https://github.com/ongardie/dissertation#readme

Good luck with your filesystem,
Diego

Luca Lovagnini

unread,

Mar 4, 2015, 4:46:33 AM3/4/15

to raft...@googlegroups.com, lucal...@gmail.com, pablom...@gmail.com

Thanks Diego for your replies, they were really useful! And thanks for your job, it's one of the most incredible algorithm that I have ever seen (in my modest experience)!

Diego Ongaro

unread,

Mar 5, 2015, 12:08:26 PM3/5/15

to Luca Lovagnini, raft...@googlegroups.com, Pablo Sebastián Medina

Aww, thanks Luca, but Professor Ousterhout deserves half the credit :)

Georgios Bitzes

unread,

Sep 15, 2016, 7:59:58 AM9/15/16

to raft-dev

Hi,

> 2. lastApplied must be as persistent as the state machine. In other
> words, if you lose the state machine, you should set lastApplied to 0.
> If you are keeping the state machine on disk, you should keep
> lastApplied along with it on disk.

This makes complete sense, but is not what is implied on page 4 of https://raft.github.io/raft.pdf.

lastApplied is marked as being part of the volatile state, which to me would mean it gets re-initialized every time a server reboots, just like commitIndex. This is clearly not the case when storing the state machine on disk, like you mention.

Maybe a small update in wording is due? I found this page while searching for an explanation on how it could possibly work if lastApplied is not backed on stable storage, so others might get confused, too.