New key/value database based on Raft


Archie Cobbs

Jun 10, 2015, 12:47:08 PM
to raft...@googlegroups.com
Hello Rafters,

I've been working on a new distributed key/value database[1] based on Raft. This database is written in Java and provides a sorted key/value store with arbitrary byte[] keys and values, and linearizable ACID consistency. It is designed for lighter workloads where consistency and robustness are more important than performance and scalability.

This project is part of a larger Java persistence framework[2] which runs on top of key/value stores.

This post is to let folks know about it, but also to do a brain dump of the experience, choices made, lessons learned, etc. And if I did something really stupid hopefully someone will point that out too, it's not too late to fix it :)

Goals and Background

The goals for this project were:
  • Need a distributed, transactional, key/value database with linearizable ACID consistency.
  • Must be a pure Java implementation (and unfortunately, can't contain Java 8 syntax yet).
  • Need to understand and be able to maintain the code.
  • Must support dynamic (runtime) cluster changes.
  • Performance should be good but that is not the main goal; transaction load will be moderate.
    • Simplicity, reliability, and maintainability are more important than high performance.
    • However, must support concurrent transactions (e.g., do MVCC and/or some kind of locking)
A non-goal was having separate servers vs. clients. In this implementation, the "clients" are Java threads talking to the database through a Java API[3]. Another non-goal was read-only nodes.

The current state of this code: functionally complete, lightly tested, not yet deployed anywhere.

Summary

Part of the motivation for Raft was to make implementation easier by being easier to understand. I definitely found this to be the case. After reading the paper and dissertation, I could understand how it all should work. So at least in my case, the promise and benefit of "understandability" was met.

In a nutshell, here's what I discovered:
  • Understandability does make a big difference in my ability to reason about the code and debug it quickly, in two ways:
    • Getting the core Raft algorithm implemented correctly
    • Adding required additional functionality without breaking things
  • As simple as Raft is, there are still a lot of details to face before you are done
Backing Store

Raft nodes need some form of local persistence. Currently I'm using the Java port of LevelDB[4] but other options are possible by implementing the required interface[5].

The Raft "state machine" in this case is the LevelDB key/value store (modulo a small amount of additional persistent meta-data).

Concurrent Transactions


Raft tells you how to create a distributed state machine, not a distributed database. So a good portion of the work in this project went into converting the state machine abstraction into a database abstraction with support for concurrent transactions (if you didn't need concurrent transactions, this would be easy).

If you want to preserve linearizable ACID semantics and have concurrent transactions, you have to deal with conflicts, which can occur when some transaction X changes the state machine while transaction Y is still open.

The choice here was a simple optimistic locking scheme. Followers base new transactions on their most recent log entry (termed the transaction's "base log entry"), which represents the initial view of the key/value store that the transaction sees (to create this view, any unapplied log entries are layered on top of a snapshot of the current state machine, aka. key/value store).

This works well with LevelDB, since LevelDB provides point in time "snapshot" views.

While transactions are open, all reads use this view, and all mutations are captured in memory (reads of mutated keys see the most recent mutation). In addition, all keys (but not values) read during the transaction are tracked in memory. No network traffic occurs during the transaction.

On commit, the follower sends both the mutations and the ranges of keys it read during the transaction to the leader (in a CommitRequest). The leader first verifies that the transaction's base log entry exists in its log; if not found, the transaction fails and must retry. This can happen if a new leader was elected while the transaction was open.

Otherwise, if any log entries exist in the leader's log after the base log entry, then those log entries' mutations are cross-checked against the transaction's read key ranges. If there is any overlap, the transaction fails due to a conflict (again, the transaction must retry). If they don't overlap, the leader may safely reorder the transaction after its most recent log entry.

The leader then appends a new log entry containing the transaction's mutations; this log entry becomes the transaction's "commit log entry". The leader then responds to the follower with the index & term of the commit log entry (in a CommitResponse). Finally, the follower waits until it receives and commits the transaction's commit log entry, at which time it can complete the transaction. If this does not occur within some timeout, the transaction fails and must retry. Note the race condition here that admits a small possibility of a transaction failing with a retry error, but actually having succeeded (see below).
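
As a concrete illustration of the commit check just described, here is a minimal Java sketch of the leader-side logic. All class and method names (CommitRequest, LogEntry, overlaps(), etc.) are invented for illustration and are not the actual RaftKVDatabase API.

    // Hypothetical sketch of the leader handling a CommitRequest; names are assumptions.
    CommitResponse handleCommitRequest(CommitRequest req) {

        // The transaction's base log entry must still exist in the leader's log;
        // otherwise the transaction must retry (e.g., a new leader was elected).
        if (!this.log.contains(req.getBaseIndex(), req.getBaseTerm()))
            return CommitResponse.retry("base log entry not found");

        // Cross-check the transaction's read key ranges against the mutations in
        // every log entry appended after the base log entry.
        for (LogEntry entry : this.log.entriesAfter(req.getBaseIndex())) {
            if (entry.getMutations().overlaps(req.getReadRanges()))
                return CommitResponse.retry("read/write conflict");
        }

        // No conflict: the transaction can safely be reordered after the latest entry.
        LogEntry commitEntry = this.log.append(this.currentTerm, req.getMutations());
        return CommitResponse.success(commitEntry.getIndex(), commitEntry.getTerm());
    }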

Like any optimistic locking scheme, this design should perform well for concurrent transactions when contention is low. However, it would not perform well when there are simultaneous transactions with high contention (e.g., all reading and writing the same key) - you'd get many transactions having to retry.

If the transaction happens to be running on a leader, then the CommitRequest/CommitResponse traffic is eliminated but the conflict checking and commit logic stays the same.

Read-Only Transactions

For read-only transactions, things get simpler. First, we can easily support a couple of weaker consistency levels:
  • If uncommitted stale reads are allowed, the transaction always succeeds immediately with no checks or network traffic (just rollback the transaction!)
  • If stale reads are allowed, then no network traffic occurs - the follower just waits for the transaction's base log entry to be committed
Otherwise, the follower sends a CommitRequest to the leader just like in a read/write transaction, but the mutations are empty and so the leader doesn't append a new log entry.

Instead, we implement the "leader lease" optimization described in the dissertation. Every AppendRequest contains the leader's current timestamp, and this timestamp is reflected back to the leader in the corresponding AppendResponse. Every time the leader gets such a response, it updates its "lease timeout", which is calculated as the most recent time in the past at which the leader had sent AppendRequests to a majority of followers who have since responded, plus the minimum election timeout, minus a small adjustment for possible clock drift (this assumes all nodes have the same minimum election timeout configured). This "lease timeout" is the earliest time at which another leader could be elected.

In addition to the leader's current timestamp, every AppendRequest also contains the leader's current lease timeout.
 
Now let X be the leader's timestamp at the moment it receives the read-only CommitRequest. If the leader's current lease timeout is already after X, the read cannot be stale (no other leader could exist who might have added a new, unknown log entry). Otherwise, all the follower needs to do to commit the transaction is wait for the leader's lease timeout to exceed X: the value X is included in the CommitResponse, and the follower waits for an AppendRequest carrying a lease timeout that exceeds it. Either way, the follower waits (if necessary) only until the leader's lease timeout exceeds X, the leader's timestamp when it received the CommitRequest.
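
To make the lease arithmetic concrete, here is a rough Java sketch of the leader-side bookkeeping, with times as simple millisecond values. The field names and the majorityAckedSendTime() helper are assumptions, not the real implementation.

    // Hypothetical sketch of lease timeout maintenance and the read-only check.
    void handleAppendResponse(String follower, long echoedLeaderSendTime) {
        // Record the (leader-local) send time this follower has acknowledged
        this.ackedSendTime.put(follower, echoedLeaderSendTime);

        // Most recent send time acknowledged by a majority of the cluster
        long majoritySendTime = this.majorityAckedSendTime();

        // No other leader can be elected before this point in time
        long newLeaseTimeout = majoritySendTime + this.minElectionTimeout - CLOCK_DRIFT_FUDGE;
        this.leaseTimeout = Math.max(this.leaseTimeout, newLeaseTimeout);   // only moves forward
    }

    // A read-only CommitRequest received at leader time X is safe once leaseTimeout > X
    boolean readOnlyCommitReady(long commitRequestReceiveTime) {
        return this.leaseTimeout > commitRequestReceiveTime;
    }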

Finally, the follower needs to know which log entry is the transaction's "commit log entry": it is simply the same as the base log entry. That the transaction's commit log entry can be its base log entry, and not the last log entry in the leader's log, was not obvious to me at first. The reason this works is that, at the time the leader receives the CommitRequest, what the follower's transaction saw in the key/value store is consistent with the leader's current log state, even if there are new log entries that the follower doesn't have; otherwise there would have been a conflict (i.e., none of the log entries added after the transaction's base log entry contain mutations that would have been visible to the transaction). If these log entries are committed, all is good; if another leader gets elected and they are never committed, then whatever log entries eventually replace them will necessarily have been added sometime after the leader's lease timeout, which is always moving forward. Because the follower waits until the leader's lease timeout exceeds the timestamp at which the CommitRequest was received, that point in time becomes a valid "linearizable" time point for the transaction's observation.

In a normally functioning network, the leader's lease timeout should always be in the future, because the heartbeat interval is a lot shorter than the election timeout. This means read-only transactions can commit as soon as the transaction's base log entry is committed, with the only additional network traffic being the single CommitRequest/CommitResponse round trip between follower and leader. If there happen to be no other mutating transactions occurring simultaneously, the transaction's base log entry will likely already be committed, in which case the transaction can complete as soon as the CommitResponse is received by the follower.

If the leader's lease timeout is not in the future:
  • The leader sends probes to all followers in an attempt to increase the lease timeout quickly, and
  • The leader keeps track of the fact that the follower is waiting on that lease timeout, and notifies it immediately (via an updated AppendRequest) when its lease timeout advances beyond that point.
Random Notes

Miscellaneous random details I encountered...

Add log entry at start of term

During testing I encountered the following problem: suppose a follower tries to commit a transaction. The leader appends a new log entry at index X, term T, and sends back the transaction's commit index = X and term = T to the follower. Now the follower will wait for X/T to be committed. But then an election occurs, another leader who has received X/T gets elected, and no additional transactions occur. The new leader cannot declare X/T committed because it is from a prior term. So X/T will sit there uncommitted indefinitely, which means the follower will time out waiting to commit its transaction. Having the leader append a trivial log entry at the beginning of each term fixes this corner case.
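
In code terms the fix is tiny; something like the following (invented names) in the become-leader path:

    // Hypothetical sketch: append a trivial entry when a new term begins, so that
    // entries from prior terms (like X/T above) become committable promptly.
    void becomeLeader() {
        this.role = Role.LEADER;
        this.log.append(this.currentTerm, Mutations.empty());   // trivial "start of term" entry
        for (String follower : this.currentConfig.otherNodes())
            this.sendAppendRequest(follower);
    }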

Output queue empty notification

Nodes receive notification from the network when the output queue for a recipient has gone empty. This allows them to send data as fast as possible, but no faster, avoiding output queue pile-ups and dropped messages. For example, an InstallSnapshot operation is broken into several (pipelined) messages. We want to send these as fast as possible, but if any of them gets dropped we have to start over. In other words, if you are pipelining messages you need flow control up to the application level.

Delaying log entry application

Normally nodes apply committed log entries as soon as possible. However, it turns out there are two reasons in this implementation for leaders not to do this.

First, leaders try to keep log entries around, unapplied, for a minimum amount of time after they have been committed, to help avoid retry failures on long-running transactions. Because the base log entry is chosen when the transaction is opened, if the transaction runs for a long time before committing, the base log entry and at least one other could get applied to the state machine in the meantime; the leader would then be unable to determine whether (a) the transaction's base log entry is actually in its log, or (b) there were any read/write conflicts, and would be forced to reject the transaction with a retry. For this reason, leaders leave log entries sitting unapplied for a "minimum supported transaction duration", as long as they aren't taking up too much memory.

There is a similar problem with log compaction and InstallSnapshot: when an InstallSnapshot is in effect for a follower, that "snapshot" is based on a certain log index. We need to defer applying any log entries beyond that index to the state machine; otherwise, after the snapshot is completed, the follower's next log entry would not be available for us to send, and we'd have to send a whole new snapshot. We also try to avoid InstallSnapshot scenarios in the first place by waiting (again, up to some time and memory limit) until every follower has received a log entry before applying it. (Leaders also optimistically send InstallSnapshots based on their last log entry, not their last committed log entry, to help with this problem.)

However, this change creates another quirk. In "stock" Raft, a leader's state machine always stays ahead of follower state machines. However, when the leader is deferring application of the log as described above, that's no longer true. So the leader logic to decrement nextIndex for a follower might possibly result in a nextIndex less than the follower's last applied log index, in which case the follower would not know what to do or how to stop the decrementing! Instead, followers send their last applied log index to leaders in every AppendResponse and the leaders use this as a lower bound on that follower's nextIndex.
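
In sketch form (invented names), the resulting nextIndex adjustment looks like this:

    // Hypothetical sketch: clamp nextIndex using the follower's reported last applied index.
    void handleFailedAppendResponse(String follower, AppendResponse resp) {
        long next = this.nextIndex.get(follower) - 1;              // normal Raft decrement
        next = Math.max(next, resp.getLastAppliedIndex() + 1);     // never probe below what the follower has already applied (and compacted)
        this.nextIndex.put(follower, next);
    }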

Client session tracking

For simplicity, I opted not to do any of the client session tracking stuff; instead, clients are expected to perform idempotent transactions and interpret a retry exception as meaning "your transaction was probably not committed, but there's a small chance that it actually was". Because the client will retry possibly committed transactions, they must be idempotent.
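
In practice that means clients wrap their work in a retry loop along these lines. The class and exception names below are taken loosely from the JSimpleDB key/value API linked above, but treat this as an illustrative sketch rather than exact usage.

    // Illustrative client-side retry loop for an idempotent transaction.
    void setValue(KVDatabase db, byte[] key, byte[] value) {
        while (true) {
            KVTransaction tx = db.createTransaction();
            try {
                tx.put(key, value);     // idempotent: re-running after an unnoticed commit is harmless
                tx.commit();
                return;
            } catch (RetryTransactionException e) {
                // "probably not committed, but possibly was"; just retry the idempotent work
            }
        }
    }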

AppendRequest probes and pipelining

Leaders track the notion of whether a follower is "synced" or not. A follower is synced if the last AppendResponse indicated success. Unsynced followers only get probes (empty AppendRequests).

Instead of batching multiple log entries into a single AppendRequest, each AppendRequest only contains one log entry. However, as long as a follower stays "synced" to the leader, the leader will pipeline AppendRequests as necessary. The effect is the same as batching, with a few extra bytes flowing over the network.
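
For example (invented names), when a new log entry is appended the leader can do roughly this:

    // Hypothetical sketch: one log entry per AppendRequest, pipelined to synced followers.
    void onLogEntryAppended(LogEntry entry) {
        for (String follower : this.currentConfig.otherNodes()) {
            if (this.isSynced(follower) && this.nextIndex.get(follower) == entry.getIndex()) {
                this.sendAppendRequest(follower, entry);          // don't wait for earlier responses
                this.nextIndex.put(follower, entry.getIndex() + 1);
            }
        }
    }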

Reflected log entry optimization

A leader doesn't need to send a log entry that corresponds to a transaction back to the follower who created the transaction, because the follower already has that data. So leaders just send the transaction ID instead in that case.

Initialization and cluster changes

Nodes are initially "unconfigured", meaning they have an empty log and an empty cluster configuration. In this state they are followers who do not start elections.

An unconfigured node in this state will accept a local configuration change that adds itself to the cluster (this is done via a special method of the transaction object). This is the only configuration change allowed when unconfigured. The follower then starts an election, which it immediately wins, and becomes leader of its new single-node cluster.

Therefore, the first log entry in any Raft log always contains a configuration change adding the first node in the cluster. So "unconfigured" and "empty log" mean the same thing.

The other way for an unconfigured node to get configured is to receive an AppendRequest from a leader. By the previous statements, it then has a non-empty log and a non-empty cluster configuration.

So to build a cluster, you start up a bunch of machines. They will all be unconfigured and so are stuck as passive followers, not starting any elections but listening for a leader. On one machine, you tell it to create a new cluster containing itself. It then becomes the leader of its own single-node cluster. Then you tell it to add each of the other nodes, one at a time, by adding them to the configuration. When a leader sees a new node added to its configuration, it starts sending AppendRequests to it. This converts the unconfigured followers into configured followers one by one.
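
In API terms the bootstrap sequence is roughly the following; configChange() here stands in for the actual configuration-change method on the transaction object, and the type names and addresses are just for illustration.

    // Hypothetical sketch: bootstrap a three-node cluster, one node at a time, from node1.
    void bootstrapCluster(RaftKVDatabase db) {
        RaftKVTransaction tx = db.createTransaction();
        tx.configChange("node1", "192.168.1.1:9000");   // unconfigured node adds itself: single-node cluster
        tx.commit();

        // now this node is the leader; add the remaining nodes one at a time
        for (String[] n : new String[][] { { "node2", "192.168.1.2:9000" },
                                           { "node3", "192.168.1.3:9000" } }) {
            RaftKVTransaction tx2 = db.createTransaction();
            tx2.configChange(n[0], n[1]);
            tx2.commit();
        }
    }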

Out-of-cluster behavior

Once a node is configured, a separate issue is whether the node is included in its own current configuration, i.e., whether the node is a member of its cluster. A node that is not a member of its cluster does not count its own vote to determine committed log entries or lease timeouts (if a leader), and does not start elections (if a follower). However, it will accept and respond to incoming AppendRequests and RequestVotes. A node that is not a member of its own cluster can never become a candidate because it doesn't start elections.

In addition, leaders follow these rules:
  • If a leader is removed from a cluster, it remains the leader until the configuration change that removed it is committed (not counting its own vote), and then steps down (reverts to follower).
  • If a follower is added to a cluster, the leader immediately starts sending that follower AppendRequests, even before committing the log entry that added it.
  • If a follower is removed from a cluster, the leader continues to send that follower AppendRequests until the follower acknowledges receipt of the log entry containing the configuration change.
  • Configuration changes that remove the last node in a cluster are never allowed.
  • Only one configuration change may take place at a time.
Cluster ID's

A cluster ID is chosen randomly and permanently recorded when an unconfigured node is told to create a new cluster (by adding itself as the only member). The other way for an unconfigured node to get a cluster ID is to receive an AppendRequest from a leader, at which time the node joins that cluster and permanently records that cluster ID.

All messages contain this 32 bit cluster ID. Once a cluster ID is recorded, any messages containing other cluster ID's are rejected. This prevents "mixing" nodes from different clusters. I see this as cheap insurance to avoid a disastrous potential situation.
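
The check itself is just a guard at message receipt; a sketch with invented names:

    // Hypothetical sketch of the cluster ID guard described above.
    void receiveMessage(Message msg) {
        if (this.clusterId == 0 && msg instanceof AppendRequest)
            this.clusterId = msg.getClusterId();        // unconfigured node joins the sender's cluster
        if (this.clusterId != 0 && msg.getClusterId() != this.clusterId)
            return;                                     // reject: never "mix" nodes from different clusters
        this.dispatch(msg);
    }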

Elections and Voting

I've implemented the technique in section 4.2.3 of the dissertation whereby followers ignore a RequestVote that is received within a minimum election timeout of an AppendRequest. Also, followers that are not part of their own cluster do not start elections. Together these should help keep disruptions to a minimum (I hope).
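
The follower-side guard is straightforward; a sketch with invented names:

    // Hypothetical sketch of the section 4.2.3 rule: ignore a RequestVote if we have
    // heard from a leader within the minimum election timeout.
    void handleRequestVote(RequestVote msg) {
        if (this.now() - this.lastLeaderContactTime < this.minElectionTimeout)
            return;                                     // a current leader is probably still alive
        // ...normal RequestVote handling (term check, log up-to-date check, grant or ignore)...
    }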

I did not implement pre-voting. More generally I did not try to add optimizations to speed up elections. On a properly functioning network, elections should be rare. Put another way, it seemed to me that if you are having so many elections that they are taking up an appreciable amount of time, then you in fact have some other, larger problem that needs your attention.

Followers only send replies to RequestVote if the vote is granted; there is no negative reply. Negative replies seemed useless to me (unless I'm missing something) because they have no effect on the state of the recipient.

References

[1] http://archiecobbs.github.io/jsimpledb/publish/reports/javadoc/index.html?org/jsimpledb/kv/raft/RaftKVDatabase.html
[2] https://github.com/archiecobbs/jsimpledb/
[3] http://archiecobbs.github.io/jsimpledb/publish/reports/javadoc/index.html?org/jsimpledb/kv/KVDatabase.html
[4] https://github.com/archiecobbs/leveldb
[5] http://archiecobbs.github.io/jsimpledb/publish/reports/javadoc/index.html?org/jsimpledb/kv/mvcc/AtomicKVStore.html


jordan.h...@gmail.com

Jun 10, 2015, 3:29:32 PM
to Archie Cobbs, raft...@googlegroups.com
Sounds really interesting and well thought out! Gotta say I love the documentation of the algorithm, but I like to write too much :-)

A couple of comments inline...
Sounds interesting. I think I'll definitely dig in to the code to understand this.


In a normally functioning network, the leader's lease timeout should always be in the future, because the heartbeat interval is a lot shorter that the election timeout, so this means read-only transactions can commit as soon as the transaction's base log entry is committed, with the only additional network traffic being the single CommitRequest/CommitResponse round trip between follower and leader. If there happen to be no other mutating transactions occurring simultaneously, the transaction's base log entry will likely already be committed, in which case the transaction can complete as soon as the CommitResponse is received by the follower.

If the leader's lease timeout is not in the future:
  • The leader sends probes to all followers in an attempt to increase the lease timeout quickly, and
  • The leader keeps track of the fact that the follower is waiting on that lease timeout, and notifies it immediately (via an updated AppendRequest) when its lease timeout advances beyond that point.
Random Notes

Miscellaneous random details I encountered...

Add log entry at start of term

During testing I encountered the following problem: suppose a follower tries to commit a transaction. The leader appends a new log entry at index X term N, and sends back the transaction's commit index = X & term = T to the follower. Now the the follower will wait for X/T to be committed. But then an election occurs, and another leader who has received X/T gets elected, and no additional transactions occur. The new leader cannot declare X/T committed because it is from a prior term. So X/T will sit there uncommitted indefinitely, which means the follower will timeout waiting to commit its transaction. Having leader append a trivial log entry at the beginning of each term fixes this corner case.
Committing a no-op entry at the beginning of a leader's term to force commitment of entries from previous terms is a standard component of the Raft algorithm. See 5.4.2 in the Raft paper, which oddly doesn't seem to actually mention explicitly committing a no-op entry, but section 8 does, and I think the dissertation does as well. But even so, good on you for noticing that issue too!


Output queue empty notification

Nodes receive notification from the network when the output queue for a recipient has gone empty. This allows them to send data as fast as possible, but no faster, avoiding output queue pile-ups and dropped messages. For example, an InstallSnapshot operation is broken into several (pipelined) messages. We want to send these as fast as possible, but if any of them gets dropped we have to start over. In other words, if you are pipelining messages you need flow control up to the application level.

Delaying log entry application

Normally nodes apply committed log entries as soon as possible. However it turns out there are two reasons for leaders to not do this in this implementation.

First, leaders try to keep log entries around, unapplied, for a minimum amount of time after they have been committed to help avoid retry failures on long-running transactions. Otherwise, because the base log entry is chosen when the transaction is opened, if the transaction runs for a long time before being committed, then in the mean time that base log entry and at least one other could get applied to the state machine, and then the leader will be unable to determine whether (a) the transaction's base log entry is actually in its log, or (b) there were any read/write conflicts, and be forced to reject the transaction with a retry. For this reason, leaders leave log entries sitting unapplied for a "minimum supported transaction duration" time, as long as they aren't taking up too much memory.

There is a similar problem with log compaction and InstallSnapshot: when an InstallSnapshot is in effect for a follower, that "snapshot" is based on a certain log index. We need to defer applying any log entries beyond that index to the state machine, otherwise after the snapshot is completed, the follower's next log entry would not be available to us to send, and so we'd then have to send a whole new snapshot.
This is an interesting issue that I've had with snapshots too. For snapshots on leaders, it seems best to ensure there's some buffer between the snapshot and the end of the log in order to prevent immediately sending a snapshot after taking it. But alternatively, leaders can transfer their leadership (perhaps using the method from Diego's dissertation) before taking a snapshot, thus ensuring you won't be compacting entries that need to be sent to some follower. Blocking application of commands on the leader means blocking reads and writes in most implementations, which could be unacceptable if a not insignificant amount of time is required to take a snapshot.
Also, we also try to avoid InstallSnaphot scenarios in the first place by waiting (again, up to some time and memory limit) until every follower has recevied a log entry before applying it. (Leaders also optimistically send InstallSnapshots based on their last log entry, not their last committed log entry, to help this problem.)

However, this change creates another quirk. In "stock" Raft, a leader's state machine always stays ahead of follower state machines. However, when the leader is deferring application of log as described above, that's no longer true. So the leader logic to decrement nextIndex for a follower might possibly result in a nextIndex less that the follower's last applied log index, in which case the follower would not know what to do or how to stop the decrementing! Instead, followers send their last applied log index to leaders in every AppendResponse and the leaders use this as a lower bound on that follower's nextIndex.
I'm not following on this. The leader's nextIndex for a follower should reference the next entry to send to that follower, not the next entry to be applied on that follower. The follower's lastApplied index may differ from its commitIndex, and nextIndex should simply not decrease past the follower's last log index, though it's not a tragedy if it does. The entries applied to a follower's state machine correlate with (but may or may not be equal to) commitIndex rather than their last log index, so I'm not seeing how nextIndex relates to the follower's state machine, but maybe there's something specific to this implementation that I'm missing.
Awesome stuff! Wish I had more time to digest all this information, but I'll definitely dig through it!

Archie Cobbs

Jun 10, 2015, 4:30:41 PM
to raft...@googlegroups.com, archie...@gmail.com
Hi Jordan,


On Wednesday, June 10, 2015 at 2:29:32 PM UTC-5, Jordan Halterman (kuujo) wrote:
Add log entry at start of term

During testing I encountered the following problem: suppose a follower tries to commit a transaction. The leader appends a new log entry at index X term N, and sends back the transaction's commit index = X & term = T to the follower. Now the the follower will wait for X/T to be committed. But then an election occurs, and another leader who has received X/T gets elected, and no additional transactions occur. The new leader cannot declare X/T committed because it is from a prior term. So X/T will sit there uncommitted indefinitely, which means the follower will timeout waiting to commit its transaction. Having leader append a trivial log entry at the beginning of each term fixes this corner case.
Committing a no-op entry at the beginning of a leader's term to force commitment of entries from previous terms is a standard component of the Raft algorithm. See 5.4.2 in the Raft paper, which oddly doesn't seem to actually mention explicitly committing a no-op entry, but section 8 does, and I think the dissertation does as well. But even so, good on you for noticing that issue too!

It seems like a good thing to do for several reasons, at least according to other threads on the forum. For example, it allows you to compact the log sooner.

There is a similar problem with log compaction and InstallSnapshot: when an InstallSnapshot is in effect for a follower, that "snapshot" is based on a certain log index. We need to defer applying any log entries beyond that index to the state machine, otherwise after the snapshot is completed, the follower's next log entry would not be available to us to send, and so we'd then have to send a whole new snapshot.
This is an interesting issue that I've had with snapshots too. For snapshots on leaders, it seems best to ensure there's some buffer between the snapshot and the end of the log in order to prevent immediately sending a snapshot after taking it. But alternatively, leaders can transfer their leadership (perhaps using the method from Diego's dissertation) before taking a snapshot, thus ensuring you won't be compacting entries that need to be sent to some follower. Blocking application of commands on the leader means blocking reads and writes in most implementations, which could be unacceptable if a not insignificant amount of time is required to take a snapshot.

I'm not understanding the transfer of leadership idea and how that would help. Plus it seems like it would make things more complicated, because now you have the possibility of snapshot messages sent from non-leaders.

My implementation is very simple... snapshots only go from leaders to followers, and if anything weird happens it has to be restarted.  So for example the snapshot will fail if an election occurs during the transfer.

However, an in-progress snapshot install does not prevent anyone from doing work, other than the follower receiving the snapshot. It only prevents the leader from applying committed log entries (the leader is still free to append new log entries). Multiple snapshot installs and transactions can be happening at the same time, etc.

Also, we also try to avoid InstallSnaphot scenarios in the first place by waiting (again, up to some time and memory limit) until every follower has recevied a log entry before applying it. (Leaders also optimistically send InstallSnapshots based on their last log entry, not their last committed log entry, to help this problem.)

However, this change creates another quirk. In "stock" Raft, a leader's state machine always stays ahead of follower state machines. However, when the leader is deferring application of log as described above, that's no longer true. So the leader logic to decrement nextIndex for a follower might possibly result in a nextIndex less that the follower's last applied log index, in which case the follower would not know what to do or how to stop the decrementing! Instead, followers send their last applied log index to leaders in every AppendResponse and the leaders use this as a lower bound on that follower's nextIndex.
I'm not following on this. The leader's nextIndex for a follower should reference the next entry to send to that follower, not the next entry to be applied on that follower. The follower's lastApplied index may differ from its commitIndex, and nextIndex should simply not decrease past the follower's last log index, though it's not a tragedy if it does. The entries applied to a follower's state machine correlate with (but may or may not be equal to) commitIndex rather than their last log index, so I'm not seeing how nextIndex relates to the follower's state machine, but maybe there's something specific to this implementation that I'm missing.

Here's the problem: suppose every node has a fully compacted log with last applied index = 10. Now the leader appends index = 11 and index = 12 and sends these to all followers. All followers reply affirmative, so both entries are committed, and the leader updates commitIndex -> 12 and sends this new commitIndex to all followers. So all followers apply the committed log entries #11 and #12 to their state machines, and their logs are again fully compacted with last applied index = 12. But the leader has not yet applied log entry #11 or #12. The leader has nextIndex = 13 for all followers.

Now suppose follower X drops off the network for a short time, just long enough for the next two AppendRequests to that follower to fail. The leader decrements nextIndex twice for that follower, back down to 11. Now the follower comes back online. The leader then sends him an AppendRequest with log entry #11.

What does the follower do now? The follower has no way to determine whether the term associated with log entry #11 is the same as the one he already compacted into his log, because that information is lost. So the follower must reply in the negative. So then the leader decrements his nextIndex again, back down to 10. This will repeat until the leader reaches his own log compaction point, after which a snapshot install will be forced.

I suppose there is a simple logical argument to prove that the follower's term for log entry #11 must match, using the leader completeness principle: anything in the follower's compacted log must be committed, and therefore must match what's in the current leader's log. So I suppose the follower can just reply in the positive instead of the negative even though it can't explicitly verify the log entry term, and that would eliminate the need for this extra bit of communication.

In any case, the Raft paper doesn't seem to explicitly specify what followers should do in the "I received an AppendEntries log entry but that index is already compacted away in my log" case.
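
If that argument holds, the follower-side check reduces to something like this (a sketch with invented names, not the actual implementation):

    // Hypothetical sketch: an AppendRequest whose previous entry has already been
    // applied (and compacted) is assumed to match, by leader completeness.
    boolean prevEntryMatches(long prevIndex, long prevTerm) {
        if (prevIndex <= this.lastAppliedIndex)
            return true;                                // compacted away, but necessarily committed and matching
        LogEntry entry = this.log.getEntry(prevIndex);
        return entry != null && entry.getTerm() == prevTerm;
    }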

-Archie

jordan.h...@gmail.com

Jun 10, 2015, 5:33:43 PM
to Archie Cobbs, raft...@googlegroups.com




On Jun 10, 2015, at 1:30 PM, Archie Cobbs <archie...@gmail.com> wrote:

Hi Jordan,

On Wednesday, June 10, 2015 at 2:29:32 PM UTC-5, Jordan Halterman (kuujo) wrote:
Add log entry at start of term

During testing I encountered the following problem: suppose a follower tries to commit a transaction. The leader appends a new log entry at index X term N, and sends back the transaction's commit index = X & term = T to the follower. Now the the follower will wait for X/T to be committed. But then an election occurs, and another leader who has received X/T gets elected, and no additional transactions occur. The new leader cannot declare X/T committed because it is from a prior term. So X/T will sit there uncommitted indefinitely, which means the follower will timeout waiting to commit its transaction. Having leader append a trivial log entry at the beginning of each term fixes this corner case.
Committing a no-op entry at the beginning of a leader's term to force commitment of entries from previous terms is a standard component of the Raft algorithm. See 5.4.2 in the Raft paper, which oddly doesn't seem to actually mention explicitly committing a no-op entry, but section 8 does, and I think the dissertation does as well. But even so, good on you for noticing that issue too!

It seems like a good thing to do for several reasons, at least according to other threads on the forum. For example, it allows you to compact the log sooner.

There is a similar problem with log compaction and InstallSnapshot: when an InstallSnapshot is in effect for a follower, that "snapshot" is based on a certain log index. We need to defer applying any log entries beyond that index to the state machine, otherwise after the snapshot is completed, the follower's next log entry would not be available to us to send, and so we'd then have to send a whole new snapshot.
This is an interesting issue that I've had with snapshots too. For snapshots on leaders, it seems best to ensure there's some buffer between the snapshot and the end of the log in order to prevent immediately sending a snapshot after taking it. But alternatively, leaders can transfer their leadership (perhaps using the method from Diego's dissertation) before taking a snapshot, thus ensuring you won't be compacting entries that need to be sent to some follower. Blocking application of commands on the leader means blocking reads and writes in most implementations, which could be unacceptable if a not insignificant amount of time is required to take a snapshot.

I'm not understanding the transfer of leadership idea and how that would help. Plus it seems like it would make things more complicated, because now you have the possibility of snapshot messages sent from non-leaders.
There's no possibility of snapshots being sent from non-leaders. That doesn't make sense. Only leaders should have the ability to send a snapshot, because that's dictated by whether a follower indicates that it doesn't have entries back to the snapshot index via an AppendEntries response.


My implementation is very simple... snapshots only go from leaders to followers, and if anything weird happens it has to be restarted.  So for example the snapshot will fail if an election occurs during the transfer.

However, an in-progress snapshot install does not prevent anyone from doing work, other than the follower receiving the snapshot. It only prevents the leader from applying committed log entries (the leader is still free to append new log entries).
But this is an essential responsibility of the leader. If it can't apply an entry during a snapshot, it can't respond to a write (if the write needs output from the state machine).

Multiple snapshot installs and transactions can be happening at the same time, etc.
So, consider the case of a state machine with an atomic CAS that returns a Boolean value indicating whether the set was successful. In order for this to be committed, it must go through the leader and be applied to a state machine in log order. If the leader is taking a snapshot during that process and has paused applying operations to its state machine during that process, it won't be able to apply the CAS and respond with the result. If the snapshot takes a long time, writes will essentially be blocked during that period.

I suppose the leader could legitimately get the result of that operation from a follower if followers are applying commits before the leader.

But what I'm saying is, rather than pausing application of operations to the leader's state machine during a snapshot, the leader can transfer its leadership to a follower before taking its snapshot. In that case, because the leader becomes a follower prior to snapshotting its state machine and removing entries from its log, it's not possible that a follower can suddenly need an entry in the snapshotted node's log because it's no longer the leader and is therefore no longer responsible for replicating entries to followers.

What will have happened is that another node, which is not taking a snapshot of its state machine, will become the leader (assuming some randomization of snapshots). Perhaps that node took a snapshot recently, but long enough ago that it's unlikely followers will be missing entries that aren't in the new leader's log. So followers will never receive entries from a leader that just completed a snapshot, because leaders transfer their leadership prior to taking snapshots.

Also, we also try to avoid InstallSnaphot scenarios in the first place by waiting (again, up to some time and memory limit) until every follower has recevied a log entry before applying it. (Leaders also optimistically send InstallSnapshots based on their last log entry, not their last committed log entry, to help this problem.)

However, this change creates another quirk. In "stock" Raft, a leader's state machine always stays ahead of follower state machines. However, when the leader is deferring application of log as described above, that's no longer true. So the leader logic to decrement nextIndex for a follower might possibly result in a nextIndex less that the follower's last applied log index, in which case the follower would not know what to do or how to stop the decrementing! Instead, followers send their last applied log index to leaders in every AppendResponse and the leaders use this as a lower bound on that follower's nextIndex.
I'm not following on this. The leader's nextIndex for a follower should reference the next entry to send to that follower, not the next entry to be applied on that follower. The follower's lastApplied index may differ from its commitIndex, and nextIndex should simply not decrease past the follower's last log index, though it's not a tragedy if it does. The entries applied to a follower's state machine correlate with (but may or may not be equal to) commitIndex rather than their last log index, so I'm not seeing how nextIndex relates to the follower's state machine, but maybe there's something specific to this implementation that I'm missing.

Here's the problem: suppose every node has a fully compacted log with last applied index = 10. Now the leader appends index = 11 and index = 12 and sends these to all followers. All followers reply affirmative, so both entries is committed, and the leader updates commitIndex -> 12 and sends this new commitIndex to all followers. So all followers apply the committed log entries #11 and #12 to their state machines, so their logs are again fully compacted with last applied index = 12. But the leader has not yet applied log entry #11 or #12 yet. The leader has nextIndex = 13 for all followers.

Now suppose follower X drops off the network for a short time, just long enough for the next two AppendRequests to that follower to fail. The leader decrements nextIndex twice for that follower, back down to 11. Now the follower comes back online. The leader then sends him an AppendRequest with log entry #11.

What does the follower do now? The follower has no way to determine whether the term associated with log entry #11 is the same as the one he already compacted into his log, because that information is lost. So the follower must reply in the negative. So then the leader decrements his nextIndex again, back down to 10. This will repeat until the leader reaches his own log compaction point, after which a snapshot install will be forced.

I suppose there is a simple logical argument to prove that the follower's term for log entry #11 must match, using the leader completeness principle: anything in the follower's compacted log must be committed, and therefore must match what's in the current leader's log. So I suppose the follower can just reply in the positive instead of the negative even though it can't explicitly verify the log entry term, and that would eliminate the need for this extra bit of communication.

In any case, the Raft paper doesn't seem to explicitly specify what followers should do in the "I received an AppendEntries log entry but that index is already compacted away in my log" case.

-Archie

Jordan Halterman

Jun 10, 2015, 5:39:20 PM
to Archie Cobbs, raft...@googlegroups.com
Actually, what I'm referring to with regard to leadership transfers prior to leader snapshots is in Diego's dissertation (as I thought), in section 5.1.2; his phrasing may be a better explanation:

"It may also be possible to schedule snapshots in a way that client requests never wait on a server that is snapshotting. In this approach, servers would coordinate so that only up to a minority of the servers in the cluster would snapshot at any one time (when possible). Because Raft only requires a majority of servers to commit log entries, the minority of snapshotting servers would normally have no adverse effect on clients. When a leader wished to snapshot, it would step down first, allowing another server to manage the cluster in the meantime. If this approach was sufficiently reliable, it could also eliminate the need to snapshot concurrently; servers could just be unavailable while they took their snapshots (though they would count against the cluster’s ability to mask failures). This is an exciting opportunity for future work because of its potential to both improve overall system performance and reduce mechanism."

Archie Cobbs

Jun 10, 2015, 6:00:47 PM
to raft...@googlegroups.com, archie...@gmail.com

On Wednesday, June 10, 2015 at 4:33:43 PM UTC-5, Jordan Halterman (kuujo) wrote:
There is a similar problem with log compaction and InstallSnapshot: when an InstallSnapshot is in effect for a follower, that "snapshot" is based on a certain log index. We need to defer applying any log entries beyond that index to the state machine, otherwise after the snapshot is completed, the follower's next log entry would not be available to us to send, and so we'd then have to send a whole new snapshot.
This is an interesting issue that I've had with snapshots too. For snapshots on leaders, it seems best to ensure there's some buffer between the snapshot and the end of the log in order to prevent immediately sending a snapshot after taking it. But alternatively, leaders can transfer their leadership (perhaps using the method from Diego's dissertation) before taking a snapshot, thus ensuring you won't be compacting entries that need to be sent to some follower. Blocking application of commands on the leader means blocking reads and writes in most implementations, which could be unacceptable if a not insignificant amount of time is required to take a snapshot.

I'm not understanding the transfer of leadership idea and how that would help. Plus it seems like it would make things more complicated, because now you have the possibility of snapshot messages sent from non-leaders.
There's no possibility of snapshots being sent from non-leaders. That doesn't make sense. Only leaders should have the ability to send a snapshot, because that's dictated by whether a follower indicates that it doesn't have entries back to the snapshot index via an AppendEntries response.

My implementation is very simple... snapshots only go from leaders to followers, and if anything weird happens it has to be restarted.  So for example the snapshot will fail if an election occurs during the transfer.

However, an in-progress snapshot install does not prevent anyone from doing work, other than the follower receiving the snapshot. It only prevents the leader from applying committed log entries (the leader is still free to append new log entries).
But this is an essential responsibility of the leader. If it can't apply an entry during a snapshot, it can't respond to a write (if the write needs output from the state machine).

We are having terminology confusion. I thought by "snapshot" you were referring to the InstallSnapshot operation, but you are referring to "snapshotting" as described in Section 5.1 of the dissertation.

My implementation doesn't ever do that, which is why I didn't understand what you were talking about.

You are talking about this type of log compaction (quoting dissertation):
  • Snapshotting for memory-based state machines is conceptually the simplest approach. In snapshotting, the entire current system state is written to a snapshot on stable storage, then the entire log up to that point is discarded
Instead my implementation uses this model (quoting again):
  • With disk-based state machines, a recent copy of the system state is maintained on disk as part of normal operation. Thus, the Raft log can be discarded as soon as the state machine reflects writes to disk, and snapshotting is used only when sending consistent disk images to other servers (Section 5.2).

And since it's using LevelDB, which supports "point in time" snapshots, for the local persistent state (including the state machine itself) it can be sending an InstallSnapshot and appending or applying log entries at the same time. So there's never any need to step down, etc.

I'm using a "disk based state machine" because I wanted to avoid the limitation that the entire database must fit into memory.


jordan.h...@gmail.com

Jun 10, 2015, 7:41:05 PM
to Archie Cobbs, raft...@googlegroups.com
Gotcha, makes sense.

Oren Eini (Ayende Rahien)

Jun 11, 2015, 2:44:59 AM
to Archie Cobbs, raft...@googlegroups.com
Why are you allowing transactions on any node?
That just complicates everything. Because transactions have to go through the leader anyway, it is _much_ simpler to state that write transactions have to be executed on the leader.
That way, it can commit directly, and fail immediately if there is a network split.

Also, note that for consistency, you have 3 modes:
- Read from follower (potentially stale)
- Read from leader (better, but leader might have been demoted without knowing it yet)
- Read from quorum (force a log commit to ensure that we are actually the leader)

A leader most certainly can commit entries from previous terms.


Also, leaders shouldn't discard log entries after they have been committed; that just leads to not being able to send entries, or delays the actual state machine application.
You assume that applying a log entry also means deleting it, which isn't the case. And in your case, it can lead to effective unavailability of the cluster (it processes commands, but it doesn't apply them, which is really bad).

I don't follow your sync vs. unsync distinction, or why you send just a single entry. What happens if I execute two transactions concurrently? I want to send them out as fast as possible.

The reflected log entry is a bad idea. It is a minor optimization for a bigger perf issue (see above), but it also means that you now need to handle the case of a follower crashing after sending it to the leader and getting the full value back. It is more complex.

Random cluster ids are probably a bad idea. You might get identical ones (it's easier than you would expect to run into that in prod). You probably want to use a GUID or something like that.

Negative replies are really important for debugging, troubleshooting and ops.




Oren Eini (Ayende Rahien)

Jun 11, 2015, 2:54:43 AM
to Archie Cobbs, raft...@googlegroups.com
However, an in-progress snapshot install does not prevent anyone from doing work, other than the follower receiving the snapshot. It only prevents the leader from applying committed log entries (the leader is still free to append new log entries). Multiple snapshot installs and transactions can be happening at the same time, etc.


Not applying commands means that the system is down, as far as the user is concerned.

Here's the problem: suppose every node has a fully compacted log with last applied index = 10. Now the leader appends index = 11 and index = 12 and sends these to all followers. All followers reply affirmative, so both entries is committed, and the leader updates commitIndex -> 12 and sends this new commitIndex to all followers. So all followers apply the committed log entries #11 and #12 to their state machines, so their logs are again fully compacted with last applied index = 12. But the leader has not yet applied log entry #11 or #12 yet. The leader has nextIndex = 13 for all followers.

Why wouldn't it apply them? The leader usually applies them when it gets majority votes from the cluster.
That is a really strange design decision.
 
What does the follower do now? The follower has no way to determine whether the term associated with log entry #11 is the same as the one he already compacted into his log, because that information is lost.

The follower wouldn't ask for entry #11. It would say "already have it, thanks".
It would only need to compare its latest term/entry with the leader's top term/entry. Raft ensures that if they match, all previous ones match as well.

Oren Eini (Ayende Rahien)

Jun 11, 2015, 2:56:10 AM
to Jordan Halterman, Archie Cobbs, raft...@googlegroups.com
To be honest, it is probably easier to handle snapshotting concurrently.



PatC

Jun 11, 2015, 3:19:29 AM
to raft...@googlegroups.com

> Followers only send replies to RequestVote if the vote is granted; there is no negative reply. Negative replies seemed useless to me (unless I'm missing something) because they have no effect on the state of the recipient.

I thought one reason to send a negative reply is to update the sender's currentTerm, which could be lower.

> I've implemented the technique in section 4.2.3 of the dissertation whereby followers ignore a RequestVote that is received within a minimum election timeout of an AppendRequest.

I've been wondering if the "minimum election timeout" is the same timeout that causes a follower to become a candidate. If so, then the rule would simply be that followers always ignore RequestVotes.

Oren Eini (Ayende Rahien)

Jun 11, 2015, 3:26:40 AM
to Archie Cobbs, raft...@googlegroups.com

Instead my implementation uses this model (quoting again):
  • With disk-based state machines, a recent copy of the system state is maintained on disk as part of normal operation. Thus, the Raft log can be discarded as soon as the state machine reflects writes to disk, and snapshotting is used only when sending consistent disk images to other servers (Section 5.2).

And since it's using LevelDB, which supports "point in time" snapshots, for the local persistent state (including the state machine itself) it can be sending an InstallSnapshot and appending or applying log entries at the same time. So there's never any need to step down, etc.

I'm using a "disk based state machine" because I wanted to avoid the limitation that the entire database must fit into memory.


This is a pretty bad idea. If you rely on the state machine snapshots, compacting the log should be done in a routine manner, but leave enough of a tail that sending a snapshot isn't usually required.
We typically compact the log and leave the last 10% - 25% intact.
 

Archie Cobbs

Jun 11, 2015, 10:37:12 AM
to raft...@googlegroups.com, archie...@gmail.com
On Thursday, June 11, 2015 at 1:44:59 AM UTC-5, Ayende Rahien wrote:
Why are you allowing transactions on any node?
That just complicate everything. Because transactions has to go through the leader anyway, it is _much_ simpler to state that write transactions has to be executed on the leader.
That way, it can commit directly, and fail if there is a network split directly.

Well, here's the thinking behind that design decision...

First of all, whether clients talk directly to leaders, or clients talk to the local node who then talks to the leader on the client's behalf, you're going to have the same amount of network traffic. So there's no real difference from that point of view.

In my design there is no requirement to support non-cluster node clients, so allowing clients to be separated from cluster nodes is not an advantage for that reason in my scenario.

As far as complication, I think there are trade-offs. My design separates the transaction layer from the Raft layer: what clients see is simply a transactional key/value store. They know nothing about Raft, who is the leader, what state the local node is in, etc. To me, this is actually simpler than a design which complicates the transaction layer with Raft-specific details. It may help to note that in this project Raft is only one of many key/value store database implementations, so it is already required to implement a generic transaction interface.

In my design the transaction state and the Raft state are two different state machines. A client could open a transaction, do some work, and commit that transaction, while one or more Raft elections occurred simultaneously under the covers. The client would see this only as a delay or possibly a retry exception. A transaction that was opened when the local node was a follower could be committed when the local node was a leader, etc.

If you're interested, here are the transaction state machine states.
 
Also, note that for consistency, you have 3 modes:
- Read from follower (potentially stale)
- Read from leader (better, but leader might have been demoted without knowing it yet)
- Read from quorum (force a log commit to ensure that we are actually the leader)

Not sure what you mean. There is the option to configure a transaction for stale reads, but that's just an option. Normal transactions have fully linearizable ACID consistency, no matter what node you execute them on.

Also, leaders shouldn't discard log entries after they have been committed; that just leads to not being able to send entries, or delays the actual state machine application.

I agree and implemented it this way... see description of leaders delaying the application of log entries.
 
You assume that applying a log entry also means deleting it; that isn't the case. And in your case, it can lead to effective unavailability of the cluster (it processes commands, but it doesn't apply them, which is really bad).

Well, you can either apply them early and keep them around, or keep them around and apply them later... the effect is the same: they remain available longer than they otherwise would, so you can send them to followers.

In any case the existence of unapplied log entries does not make the cluster unavailable... not sure where you're getting that from. Leaders can function just fine with one or more unapplied log entries.
 
I don't follow your sync vs. unsync distinction, or why you send just a single entry. What happens if I execute two transactions concurrently? I want to send them out as fast as possible.

Sync just means that the follower's latest AppendRequest response was positive. This will of course be the normal steady state.

The leader will pipeline AppendRequests to synced followers if needed. Sending multiple AppendRequests in a row is functionally equivalent to sending one AppendRequest containing multiple log entries, plus a few extra bytes of overhead (e.g., you will get an AppendResponse for each request, but these are very small). So they will still go out as fast as possible.

To give a specific example, if the leader sends log entry #10 to a follower, but the follower's response has not been received yet, and then a new log entry #11 is added, the leader will go ahead and send out #11 to that follower even though it hasn't yet received the response to entry #10.
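Here's a hypothetical sketch of that pipelining behavior (all class and field names are invented for illustration, not the project's actual types):

    // Sketch: once a follower is "synced" (its last AppendResponse was
    // positive), each new log entry is sent immediately, without waiting
    // for the response to the previous one.
    import java.util.ArrayDeque;
    import java.util.Queue;

    class PipelineSketch {

        static class LogEntry {
            final long index;
            final long term;
            LogEntry(long index, long term) { this.index = index; this.term = term; }
        }

        static class Follower {
            boolean synced = true;   // last AppendResponse was positive
            long nextIndex = 10;     // next log index to send
            final Queue<LogEntry> inFlight = new ArrayDeque<LogEntry>();  // sent, awaiting responses
        }

        // Called whenever the leader appends a new log entry.
        static void replicate(Follower f, LogEntry entry) {
            if (f.synced && entry.index == f.nextIndex) {
                f.inFlight.add(entry);   // "send" it; the response will arrive later
                f.nextIndex++;
                System.out.println("sent entry #" + entry.index
                  + " (" + f.inFlight.size() + " in flight)");
            }
            // If not synced, probing/backtracking logic (not shown) applies.
        }

        public static void main(String[] args) {
            Follower f = new Follower();
            replicate(f, new LogEntry(10, 3));   // sent even though...
            replicate(f, new LogEntry(11, 3));   // ...#10's response hasn't arrived yet
        }
    }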
 
The reflected log entry is a bad idea; it is a minor optimization for a bigger perf issue (see above), but it also means that you now need to handle the case of a follower crashing after sending it to the leader, and getting the full value back. It is more complex.

Well, the implementation is actually very simple. When the leader receives a transaction X from follower Y, it sets a flag "Y sent X". When later sending that log entry back to Y, if the flag is set it:
  1. Sends only the transaction ID instead of the whole log entry; and
  2. Clears the flag
So this only happens the first time. If the message fails, the flag has already been cleared, so on the next attempt the normal thing happens and everybody recovers.
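For illustration, a hypothetical sketch of that bookkeeping (names invented, not the actual implementation):

    // Sketch of the "reflected log entry" optimization: when follower Y
    // originated transaction X, the leader can send Y just the transaction
    // ID instead of repeating the log entry's data.
    import java.util.HashMap;
    import java.util.HashSet;
    import java.util.Map;
    import java.util.Set;

    class ReflectedEntrySketch {

        // For each follower, the set of transaction IDs that follower sent us.
        private final Map<String, Set<Long>> sentByFollower = new HashMap<String, Set<Long>>();

        // Invoked when the leader receives transaction txId from follower.
        void onCommitRequest(String follower, long txId) {
            Set<Long> ids = sentByFollower.get(follower);
            if (ids == null) {
                ids = new HashSet<Long>();
                sentByFollower.put(follower, ids);
            }
            ids.add(txId);                              // remember "Y sent X"
        }

        // Invoked when the leader is about to send txId's log entry back.
        // Returns true if only the transaction ID needs to be sent.
        boolean sendAbbreviated(String follower, long txId) {
            Set<Long> ids = sentByFollower.get(follower);
            return ids != null && ids.remove(txId);     // clears the flag, so a
        }                                               // retry sends the full entry
    }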
 
Random cluster IDs are probably a bad idea. You might get identical ones (it's easier than you would expect to run into that in prod). You probably want to use a GUID or something like that.

32 bits of randomness seemed like enough for me... that's a 1 in 4.3 billion chance of a collision (I'm using SecureRandom FWIW).
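For example, a sketch of generating such an ID (the project's actual derivation may differ):

    // Sketch: a random 32-bit cluster ID from SecureRandom.
    import java.security.SecureRandom;

    class ClusterIdSketch {
        public static void main(String[] args) {
            SecureRandom random = new SecureRandom();
            int clusterId;
            do {
                clusterId = random.nextInt();   // 32 bits of randomness
            } while (clusterId == 0);           // assumption: zero reserved for "unconfigured"
            System.out.println("cluster ID = " + (clusterId & 0xFFFFFFFFL));
        }
    }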
 
Negative replies are really important for debugging, troubleshooting and ops.

Perhaps. I've been relying on inspection of the follower logs to understand what the follower is doing and this has seemed sufficient so far.

Thanks for your comments.

-Archie

Archie Cobbs

unread,
Jun 11, 2015, 10:41:40 AM6/11/15
to raft...@googlegroups.com, archie...@gmail.com
On Thursday, June 11, 2015 at 1:54:43 AM UTC-5, Ayende Rahien wrote:

However, an in-progress snapshot install does not prevent anyone from doing work, other than the follower receiving the snapshot. It only prevents the leader from applying committed log entries (the leader is still free to append new log entries). Multiple snapshot installs and transactions can be happening at the same time, etc.

Not applying commands means that the system is down, as far as the user is concerned.

There may be some terminology confusion. By "applying committed log entries" I'm referring to compacting the log, i.e., applying the mutations to the state machine. Even if the leader is not currently doing this, the leader may still append new log entries to the log. So normal progress continues, new transactions may be committed, etc.
 
Here's the problem: suppose every node has a fully compacted log with last applied index = 10. Now the leader appends index = 11 and index = 12 and sends these to all followers. All followers reply affirmatively, so both entries are committed, and the leader updates commitIndex -> 12 and sends this new commitIndex to all followers. So all followers apply the committed log entries #11 and #12 to their state machines, so their logs are again fully compacted with last applied index = 12. But the leader has not yet applied log entry #11 or #12. The leader has nextIndex = 13 for all followers.

Why wouldn't it apply them? The leader usually applies them when it gets replies from a majority of the cluster.
That is a really strange design decision.

The leader delays applying committed log entries to the state machine for the reasons stated earlier (e.g., an in-progress InstallSnapshot).
 
 
What does the follower do now? The follower has no way to determine whether the term associated with log entry #11 is the same as the one it already compacted into its log, because that information is lost.

The follower wouldn't ask for entry #11. It would say "already have it, thanks".
It only needs to compare its latest term/index with the leader's latest term/index. Raft ensures that if they match, all previous entries match as well.

Yes you are right... I didn't realize that at the time but do now. So this bit of complexity can go away.

-Archie
 

Archie Cobbs

unread,
Jun 11, 2015, 10:48:06 AM6/11/15
to raft...@googlegroups.com
On Thursday, June 11, 2015 at 2:19:29 AM UTC-5, PatC wrote:
> Followers only send replies to RequestVote if the vote is granted; there is no negative reply. Negative replies seemed useless to me (unless I'm missing something) because they have no effect on the state of the recipient.

I thought one reason to send a negative reply is to update the sender's currentTerm, which could be lower.

Yes, by not sending negative replies you lose that possibility. But this problem will correct itself soon anyway when the deposed leader hears directly from a new leader. So this amounts to a small optimization for an odd corner case, and so was not worth the trouble (in my opinion).
 
> I've implemented the technique in section 4.2.3 of the dissertation whereby followers ignore a RequestVote that is received within a minimum election timeout of an AppendRequest.

I've been wondering if the "minimum election timeout" is the same timeout that causes a follower to become a candidate. If so, then the rule would simply be that followers always ignore RequestVotes.

Yes, it's the same timeout. Under this rule, followers ignore a RequestVote if they have heard from a leader within the minimum election timeout; if they have not, they respond.

This includes the case where the RequestVote arrives in the interval between the minimum election timeout and the follower's actual (randomly chosen) election timeout. But this is how Raft elections always work: the first follower to time out starts the election, and at that time, by definition, all other followers' chosen election timeouts have not yet expired. There is a small race window, but it's handled by failing the election and retrying.
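For illustration, the rule boils down to something like this hypothetical check (names and the timeout value are invented):

    // Sketch of the dissertation section 4.2.3 rule: a follower ignores
    // RequestVote if it has heard from a leader within the minimum
    // election timeout.
    class VoteGuardSketch {

        static final long MIN_ELECTION_TIMEOUT_MILLIS = 250;   // assumed config value

        long lastLeaderContactMillis;   // updated on every AppendRequest from the leader

        boolean shouldConsiderRequestVote(long nowMillis) {
            return nowMillis - lastLeaderContactMillis >= MIN_ELECTION_TIMEOUT_MILLIS;
        }
    }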

-Archie


Archie Cobbs

unread,
Jun 11, 2015, 10:54:14 AM6/11/15
to raft...@googlegroups.com, archie...@gmail.com

Not sure what you're specifically referring to by "bad idea".

We all agree that avoiding sending unnecessary snapshots is a good idea. But this seems to me like a tuning issue, not a design issue. Of course, the design needs to support being tuned.

So you would look at how far behind a node is likely to ever be, how long it takes to send a snapshot (roughly, the size of the database divided by network speed), network reliability, etc., and then set your parameters accordingly. I have not really done much of this tuning yet, but you are correct that it will need to be done at some point.

-Archie

Oren Eini (Ayende Rahien)

unread,
Jun 11, 2015, 1:28:10 PM6/11/15
to Archie Cobbs, raft...@googlegroups.com
Imagine 5 node cluster.
S1 is the leader.
S4 is down for a while.

We are now processing commands, and we can apply them to S1,S2,S3,S5.

But S4 is not responding, and we can't send anything to it. 

If we remove committed entries from the log, it means that when S4 is back up, we'll need to send a snapshot.

In my case, snapshots can be _very_ large (GBs or higher), we want to avoid that if at all possible.


Archie Cobbs

unread,
Jun 11, 2015, 2:38:29 PM6/11/15
to raft...@googlegroups.com, archie...@gmail.com
On Thursday, June 11, 2015 at 12:28:10 PM UTC-5, Ayende Rahien wrote:
Imagine 5 node cluster.
S1 is the leader.
S4 is down for a while.

We are now processing commands, and we can apply them to S1,S2,S3,S5.

But S4 is not responding, and we can't send anything to it. 

If we remove committed entries from the log, it means that when S4 is back up, we'll need to send a snapshot.

In my case, snapshots can be _very_ large (GBs or higher), we want to avoid that if at all possible.

OK. Like I said, this is a tuning issue.

If your goal is to try really, really hard to avoid snapshots, then (in my system) the leader would be configured to delay application of committed log entries for a long, long time. The existence of unapplied log entries does not cause any harm, other than a correspondingly longer list that certain internal operations must iterate through.

How is that ultimately so different from what you are advocating doing?

-Archie

Oren Eini (Ayende Rahien)

unread,
Jun 11, 2015, 5:32:03 PM6/11/15
to Archie Cobbs, raft...@googlegroups.com
No, it isn't a tuning issue. It is a pretty fundamental issue.
You are delaying committing log entries because you equate committing a log entry with deleting it.

And not applying the log entry means that the operation didn't happen, the client is going to be upset about it.


Archie Cobbs

unread,
Jun 11, 2015, 5:48:59 PM6/11/15
to raft...@googlegroups.com, archie...@gmail.com
On Thursday, June 11, 2015 at 4:32:03 PM UTC-5, Ayende Rahien wrote:
No, it isn't a tuning issue. It is a pretty fundamental issue.
You are delaying committing log entries because you equate committing a log entry with deleting it.

And not applying the log entry means that the operation didn't happen, the client is going to be upset about it.

You're misunderstanding how it works... I think part of the problem is that the terminology is confusing.

It's true that I'm deleting log entries once they are applied to the state machine (aka, "compacted")... and I assume "applied to the state machine" is what you meant by "committing a log entry".

But your assertion "not applying the log entry means that the operation didn't happen" is not true. My log entries are recorded to disk persistently as well as kept in memory.

So if a node has 5 outstanding log entries that have not yet been applied to its state machine, and then it crashes and restarts, those 5 log entries will still be there.

Perhaps I should have made this more clear in the description. I guess I thought it was obvious, because the whole correctness of Raft presumes that a node's log is part of its persistent state.

Also to clarify another point which may be confusing: transactions always observe a view of the node's current state machine with any unapplied log entries layered on top. So in effect, the transaction sees the state as of the most recent log entry, even if that entry has not been applied yet.
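For illustration, here's a hypothetical sketch of such a layered view (String keys for brevity; the real store uses byte[] keys, and these names are invented):

    // Sketch: a transaction's reads layer any unapplied log entries,
    // newest first, on top of the state machine's key/value store.
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    class TxViewSketch {

        // One unapplied log entry: key -> new value, or key -> null for a delete.
        static class LogEntry {
            final Map<String, String> writes = new HashMap<String, String>();
        }

        private final Map<String, String> stateMachine;   // already-applied data
        private final List<LogEntry> unapplied;           // oldest first

        TxViewSketch(Map<String, String> stateMachine, List<LogEntry> unapplied) {
            this.stateMachine = stateMachine;
            this.unapplied = unapplied;
        }

        // Read as of the most recent log entry, applied or not.
        String read(String key) {
            for (int i = unapplied.size() - 1; i >= 0; i--) {   // newest entry wins
                LogEntry entry = unapplied.get(i);
                if (entry.writes.containsKey(key))
                    return entry.writes.get(key);               // may be null (deleted)
            }
            return stateMachine.get(key);
        }
    }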

Then when a follower sends the CommitRequest to the leader, the leader verifies that any additional log entries that the follower doesn't know about yet do not create conflicts (this is the optimistic locking part).

So transactions can be running and log entries can be appended in an inter-mixed fashion, even if there are outstanding, unapplied log entries on the follower and/or leader.

Hope this clears it up.

-Archie

Oren Eini (Ayende Rahien)

unread,
Jun 11, 2015, 5:58:17 PM6/11/15
to Archie Cobbs, raft...@googlegroups.com
inline


On Fri, Jun 12, 2015 at 12:48 AM, Archie Cobbs <archie...@gmail.com> wrote:
On Thursday, June 11, 2015 at 4:32:03 PM UTC-5, Ayende Rahien wrote:
No, it isn't a tuning issue. It is a pretty fundamental issue.
You are delaying committing log entries because you equate committing a log entry with deleting it.

And not applying the log entry means that the operation didn't happen, the client is going to be upset about it.

You're misunderstanding how it works... I think part of the problem is that the terminology is confusing.

Okay, so let's define things.
Commit - the leader gets replies from N/2+1 nodes that they accepted a command
Applied - the command is processed by the state machine
Deleted - removed from the persistent log
Compaction - deleting old entries from the log and generating a stable snapshot of the state machine.
 

It's true that I'm deleting log entries once they are applied to the state machine (aka, "compacted")...

No, the fact that you applied it to the state machine doesn't mean that you compacted anything. You assume that applying the command requires deleting it; that is false.
 
and I assume "applied to the state machine" is what you meant by "committing a log entry".
 

But your assertion "not applying the log entry means that the operation didn't happen" is not true. My log entries are recorded to disk persistently as well as kept in memory.


According to your description, only until they are applied to the state machine.
 
So if a node has 5 outstanding log entries that have not yet been applied to its state machine, and then it crashes and restarts, those 5 log entries will still be there.


So, but _it wasn't applied_. 

Consider the case of a simple key/value store. 
I sent the following commands:

Set X = 1
Set Y = 2

You have now stored those commands in the log (persistent or in memory, doesn't matter). But until you apply them to the state machine, you can't read those values back.
That means that as far as the client is concerned, they didn't happen.

This basically means that you have created a system where updates are accepted, and the reply you get back is "oh, yeah, we'll apply it some time in the future".

What happens when you have a command like:

Set X = 2 IF Y = 1 

You send the command to the leader, it is committed, but you don't have any reply. And that is something that the client is likely to want.

 
Perhaps I should have made this more clear in the description. I guess I thought it was obvious, because the whole correctness of Raft presumes that a node's log is part of its persistent state.

You are focusing on the log, but you missed the part where, if you don't apply entries to the state machine, you are spending a lot of effort not doing much.

Also to clarify another point which may be confusing: transactions always observe a view of the node's current state machine with any unapplied log entries layered on top. So in effect, the transaction sees the state as of the most recent log entry, even if that entry has not been applied yet.

That is wrong. Because you don't know if a log entry will be rolled back. And if you are only doing that for committed entries, then that "view" is when you actually apply the state machine command.


Then when a follower sends the CommitRequest to the leader, the leader verifies that any additional log entries that the follower doesn't know about yet do not create conflicts (this is the optimistic locking part).

Again, this is extra complexity. If the leader did all the work on top of the fully applied state machine, you wouldn't need it.
 

So transactions can be running and log entries can be appended in an inter-mixed fashion, even if there are outstanding, unapplied log entries on the follower and/or leader.

Hope this clears it up.

-Archie

Archie Cobbs

unread,
Jun 11, 2015, 6:23:40 PM6/11/15
to raft...@googlegroups.com, archie...@gmail.com
On Thursday, June 11, 2015 at 4:58:17 PM UTC-5, Ayende Rahien wrote:

Okay, so let's define things.
Commit - the leader gets replies from N/2+1 nodes that they accepted a command
Applied - the command is processed by the state machine
Deleted - removed from the persistent log
Compaction - deleting old entries from the log and generating a stable snapshot of the state machine.

Let's stop here because I'm already confused :)

My persistent state consists of:
  • Raft meta-data
    • Current-Term
    • Voted-For
    • Last-Applied-Term
    • Last-Applied-Index
    • Last-Applied-Config
  • A key/value store
  • A list of unapplied log entries, each consisting of:
    • Log-Entry-Index (will always be equal to Last-Applied-Index plus this entry's offset in list)
    • Log-Entry-Term
    • A set of key/value mutations (puts & deletes)
There are only two log-related operations ever performed on this persistent state:
  1. Append a new log entry to the list of unapplied log entries
  2. Apply the first unapplied log entry in the list to the key/value store (aka "state machine"), which means doing the following:
    1. Apply the log entry's mutations to the key/value store (do the puts & deletes)
    2. Set Last-Applied-Index equal to Log-Entry-Index (i.e., increment Last-Applied-Index by one)
    3. Set Last-Applied-Term equal to Log-Entry-Term
    4. Remove the log entry we just applied from the front of the list of unapplied log entries
So you can call those things whatever you want.

I call operation #1 "appending a log entry" and operation #2 "applying a log entry".

There is no separate "compaction" operation. The "compaction" is implicit in step #2.
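For illustration, here's a hypothetical sketch of that persistent state and the two operations (names invented; the real store is LevelDB-backed, not an in-memory map):

    // Sketch: operation #1 appends an unapplied entry; operation #2 applies
    // the oldest unapplied entry to the key/value store and advances
    // Last-Applied-Index/Term ("compaction" is implicit in #2).
    import java.util.ArrayList;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    class PersistentStateSketch {

        static class LogEntry {
            final long term;
            final Map<String, String> mutations;   // value == null means delete
            LogEntry(long term, Map<String, String> mutations) {
                this.term = term;
                this.mutations = mutations;
            }
        }

        long lastAppliedIndex;
        long lastAppliedTerm;
        final Map<String, String> kvStore = new HashMap<String, String>();
        final List<LogEntry> unapplied = new ArrayList<LogEntry>();  // entry i has log index lastAppliedIndex + i + 1

        // Operation #1: append a new log entry.
        void append(LogEntry entry) {
            unapplied.add(entry);
        }

        // Operation #2: apply the first unapplied log entry.
        void applyNext() {
            LogEntry entry = unapplied.remove(0);
            for (Map.Entry<String, String> m : entry.mutations.entrySet()) {
                if (m.getValue() == null)
                    kvStore.remove(m.getKey());
                else
                    kvStore.put(m.getKey(), m.getValue());
            }
            lastAppliedIndex++;            // equals the applied entry's log index
            lastAppliedTerm = entry.term;
        }
    }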
 
So if a node has 5 outstanding log entries that have not yet been applied to its state machine, and then it crashes and restarts, those 5 log entries will still be there.

So, but _it wasn't applied_. 

Consider the case of a simple key/value store. 
I sent the following commands:

Set X = 1
Set Y = 2

You have now stored those commands in the log (persistent or in memory, doesn't matter). But until you apply them to the state machine, you can't read those values back.

The statement "you can't read those values back" is not correct. You can read them back... they may not be committed yet, so a transaction that reads them cannot commit yet, but they are readable.

That means that as far as the client is concerned, they didn't happen.

This basically means that you have created a system where updates are accepted, and the reply you get back is "oh, yeah, we'll apply it some time in the future".

What happens when you have a command like:

Set X = 2 IF Y = 1 

You send the command to the leader, it is committed, but you don't have any reply. And that is something that the client is likely to want.

That's not how it works.

The client does its own reading of the key/value state. When commit is invoked, it sends the leader its writes, plus the keys that it read - the latter being used only to detect conflicts. Then, only after the leader responds with the index & term of the associated log entry, and the follower sees that log entry committed (in the Raft sense), does the transaction succeed and the commit() method return successfully. If something goes wrong, commit() throws a RetryTransactionException instead.
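For illustration, a hypothetical sketch of the leader-side conflict check at commit time (names invented; the sketch simplifies reads to individual keys, whereas the real transactions can also read key ranges):

    // Sketch: the CommitRequest carries the transaction's writes plus the
    // keys it read; the leader rejects the commit if any log entry appended
    // after the transaction's base log entry touched one of those read keys.
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;
    import java.util.Set;

    class CommitCheckSketch {

        static class LogEntry {
            final long index;
            final Map<String, String> writes;
            LogEntry(long index, Map<String, String> writes) {
                this.index = index;
                this.writes = writes;
            }
        }

        // Returns the new entry's index on success, or -1 to signal a retry.
        static long commit(List<LogEntry> log, long baseIndex,
                           Set<String> keysRead, Map<String, String> writes) {
            for (LogEntry entry : log) {
                if (entry.index <= baseIndex)
                    continue;                          // transaction already saw this entry
                for (String key : entry.writes.keySet()) {
                    if (keysRead.contains(key))
                        return -1;                     // conflict: follower must retry
                }
            }
            long newIndex = log.isEmpty() ? baseIndex + 1 : log.get(log.size() - 1).index + 1;
            log.add(new LogEntry(newIndex, new HashMap<String, String>(writes)));
            return newIndex;                           // follower waits for this entry to commit
        }
    }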
 
Also to clarify another point which may be confusing: transactions always observe a view of the node's current state machine with any unapplied log entries layered on top. So in effect, the transaction sees the state as of the most recent log entry, even if that entry has not been applied yet.

That is wrong. Because you don't know if a log entry will be rolled back. And if you are only doing that for committed entries, then that "view" is when you actually apply the state machine command.

The transaction does not successfully commit until the log entry that it read from is successfully committed, and the transaction will block until that happens.
 
Then when a follower sends the CommitRequest to the leader, the leader verifies that any additional log entries that the follower doesn't know about yet do not create conflicts (this is the optimistic locking part).

Again, this is extra complexity. If the leader did all the work on top of the fully applied state machine, you wouldn't need it.

But then you'd also lose the ability to have concurrent transactions, which is unacceptable.

In short, this is an optimistic locking scheme that uses the "versioning" inherent in the Raft log for MVCC to build a transactional database supporting concurrent transactions.

-Archie

Oren Eini (Ayende Rahien)

unread,
Jun 11, 2015, 6:57:27 PM6/11/15
to Archie Cobbs, raft...@googlegroups.com
inline


On Fri, Jun 12, 2015 at 1:23 AM, Archie Cobbs <archie...@gmail.com> wrote:
On Thursday, June 11, 2015 at 4:58:17 PM UTC-5, Ayende Rahien wrote:


  1. Append a new log entry to the list of unapplied log entries
  2. Apply the first unapplied log entry in the list to the key/value store (aka "state machine"), which means doing the following:
    1. Apply the log entry's mutations to the key/value store (do the puts & deletes)
    2. Set Last-Applied-Index equal to Log-Entry-Index (i.e., increment Last-Applied-Index by one)
    3. Set Last-Applied-Term equal to Log-Entry-Term
    4. Remove the log entry we just applied from the front of the list of unapplied log entries
So you can call those things whatever you want.


My issue is with 2.4: you don't need to do that at this time. You can do it when you have enough of a buffer that you won't force us to send a snapshot over the network.
 
I call operation #1 "appending a log entry" and operation #2 "applying a log entry".

There is no separate "compaction" operation. The "compaction" is implicit in step #2.
You send the command to the leader, it is committed, but you don't have any reply. And that is something that the client is likely to want.

That's not how it works.

The client does its own reading of the key/value state. When commit is invoked, it sends the leader its writes, plus the keys that it read - the latter being used only to detect conflicts. Then, only after the leader responds with the index & term of the associated log entry, and the follower sees that log entry committed (in the Raft sense), does the transaction succeed and the commit() method return successfully. If something goes wrong, commit() throws a RetryTransactionException instead.

That assumes that you have very simple operations, and that they can be handled on their own.

What about something like:

INC x

You don't know what the value of x is, you need to actually execute it to get it.
And you _want_ something like that to allow concurrency. 
That is what I meant by not having a result until this is actually applied to the state machine.

 
But then you'd also lose the ability to have concurrent transactions, which is unacceptable.

Why? You can most certainly submit multiple commands to Raft that will be processed in a single batch.

Archie Cobbs

unread,
Jun 11, 2015, 9:38:17 PM6/11/15
to raft...@googlegroups.com, archie...@gmail.com
On Thursday, June 11, 2015 at 5:57:27 PM UTC-5, Ayende Rahien wrote:
On Fri, Jun 12, 2015 at 1:23 AM, Archie Cobbs <archie...@gmail.com> wrote:
  1. Append a new log entry to the list of unapplied log entries
  2. Apply the first unapplied log entry in the list to the key/value store (aka "state machine"), which means doing the following:
    1. Apply the log entry's mutations to the key/value store (do the puts & deletes)
    2. Set Last-Applied-Index equal to Log-Entry-Index (i.e., increment Last-Applied-Index by one)
    3. Set Last-Applied-Term equal to Log-Entry-Term
    4. Remove the log entry we just applied from the front of the list of unapplied log entries
So you can call those things whatever you want.


My issue is with 2.4: you don't need to do that at this time. You can do it when you have enough of a buffer that you won't force us to send a snapshot over the network.

I agree with your concern, but accomplish the same thing another way, which is to delay the entire #2 "apply" operation.
 
I call operation #1 "appending a log entry" and operation #2 "applying a log entry".

The client does its own reading of the key/value state. When commit is invoked, it sends the leader its writes, plus the keys that it read - the latter being used only to detect conflicts. Then, only after the leader responds with the index & term of the associated log entry, and the follower sees that log entry committed (in the Raft sense), does the transaction succeed and the commit() method return successfully. If something goes wrong, commit() throws a RetryTransactionException instead.

That assumes that you have very simple operations, and that they can be handled on their own.

What about something like:

INC x

You don't know what the value of x is, you need to actually execute it to get it.
And you _want_ something like that to allow concurrency. 
That is what I meant by not having a result until this is actually applied to the state machine.

All of the transaction's reads and writes (and INC's or whatever) happen on the client node, within the transaction's private little world, entirely locally and not involving any network communication. It's only at commit() time that we start communicating with the leader. And at that time, what we communicate is the recorded mutations (puts and removes) and keys read (gets). Whether those gets, puts, and removes were implementing INC's or whatever other fancy logic is irrelevant at that point.
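For illustration, a hypothetical sketch of a transaction doing all of its work locally and only accumulating what commit() will send (names invented):

    // Sketch: gets, puts, and removes are recorded locally; nothing goes
    // over the network until commit(), when the recorded reads and writes
    // are sent to the leader.
    import java.util.HashMap;
    import java.util.HashSet;
    import java.util.Map;
    import java.util.Set;

    class LocalTxSketch {

        private final Map<String, String> baseView;   // view as of the base log entry
        private final Map<String, String> writes = new HashMap<String, String>();  // null value == remove
        private final Set<String> keysRead = new HashSet<String>();

        LocalTxSketch(Map<String, String> baseView) { this.baseView = baseView; }

        String get(String key) {
            keysRead.add(key);                         // remembered for the conflict check
            return writes.containsKey(key) ? writes.get(key) : baseView.get(key);
        }

        void put(String key, String value) { writes.put(key, value); }
        void remove(String key)            { writes.put(key, null); }

        // An "INC x" is just a get followed by a put, all local:
        void increment(String key) {
            String value = get(key);
            long n = value == null ? 0 : Long.parseLong(value);
            put(key, Long.toString(n + 1));
        }

        // At commit() time, (keysRead, writes) is what gets sent to the leader.
    }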

But then you'd also lose the ability to have concurrent transactions, which is unacceptable.

Why? You can most certainly submit multiple commands to Raft that will be processed in a single batch.

A bunch of commands is not the same thing as a linearizable ACID transaction over the entire key/value store, which is an explicit goal of this implementation. A linearizable transaction allows you to (conceptually) perform a bunch of reads and writes, affecting a bunch of different keys and key ranges, all on a virtual point-in-time snapshot of the entire key/value store, and then have it appear as if all of those writes were applied atomically at that same point in time, and moreover, all such transactions appear to have occurred in some total linear ordering.

If you are just doing the reads and writes of individual key/value pairs one at a time, then you don't have a linearizable ACID transaction over the entire key/value store - unless you disallow concurrent transactions.

To take a simple counter-example, suppose only one user can be "king", but two different, concurrent transactions execute "if nobody is king, then set x as king" but with different x's. Then you possibly end up with two kings. On the other hand, linearizable semantics for the transactions would have made that outcome impossible.

-Archie

 

Oren Eini (Ayende Rahien)

unread,
Jun 12, 2015, 2:28:28 AM6/12/15
to Archie Cobbs, raft...@googlegroups.com

What about something like:

INC x

You don't know what the value of x is, you need to actually execute it to get it.
And you _want_ something like that to allow concurrency. 
That is what I meant by not having a result until this is actually applied to the state machine.

All of the transaction's reads and writes (and INC's or whatever) happen on the client node, within the transaction's private little world, entirely locally and not involving any network communication. It's only at commit() time that we start communicating with the leader. And at that time, what we communicate is the recorded mutations (puts and removes) and keys read (gets). Whether those gets, puts, and removes were implementing INC's or whatever other fancy logic is irrelevant at that point.

Assume that you are using INC to generate unique ids. That is a classic case for a distributed consensus. 
Using your model, it is impossible for the client to get a valid reply for INC. Because two followers with the same state will run it locally and get different results because they aren't aware of the other running.
Unless you fail the transaction for one because they both touched the same key, which is bad for something like INC, which is explicitly designed to be parallel. 
 

A bunch of commands is not the same thing as a linearizable ACID transaction over the entire key/value store, which is an explicit goal of this implementation. A linearizable transaction allows you to (conceptually) perform a bunch of reads and writes, affecting a bunch of different keys and key ranges, all on a virtual point-in-time snapshot of the entire key/value store, and then have it appear as if all of those writes were applied atomically at that same point in time, and moreover, all such transactions appear to have occurred in some total linear ordering.

Yes, of course. But you already have single-threaded behavior. The log ensures that by making sure that all commands go into it in this way. So you don't actually get anything from this.

To take a simple counter-example, suppose only one user can be "king", but two different, concurrent transactions execute "if nobody is king, then set x as king" but with different x's. Then you possibly end up with two kings. On the other hand, linearizable semantics for the transactions would have made that outcome impossible.


Because the log commits are happening in sequence, _both_ commands will be appended to the log.
Then they will be applied to the state machine. The state machine would set X as the king on the first command, then return an error on the second.

Archie Cobbs

unread,
Jun 12, 2015, 10:26:24 AM6/12/15
to raft...@googlegroups.com, archie...@gmail.com
On Friday, June 12, 2015 at 1:28:28 AM UTC-5, Ayende Rahien wrote:
All of the transaction's reads and writes (and INC's or whatever) happen on the client node, within the transaction's private little world, entirely locally and not involving any network communication. It's only at commit() time that we start communicating with the leader. And at that time, what we communicate is the recorded mutations (puts and removes) and keys read (gets). Whether those gets, puts, and removes were implementing INC's or whatever other fancy logic is irrelevant at that point.

Assume that you are using INC to generate unique ids. That is a classic case for a distributed consensus. 
Using your model, it is impossible for the client to get a valid reply for INC. Because two followers with the same state will run it locally and get different results because they aren't aware of the other running.
Unless you fail the transaction for one because they both touched the same key, which is bad for something like INC, which is explicitly designed to be parallel. 

Exactly - one of the transactions will fail.

As mentioned, this is an optimistic locking scheme, and it will not perform well when there is a high degree of contention for the same key. That's part of the design.

<Digression>
The API does support a lockless counter type that accommodates the use case of incrementing a counter without conflict, to handle this specific use case.

In general however, applications that want to be scalable should not read and write the same database key/row at a high rate from concurrent threads. This is just inherently unscalable behavior. However, since counters are a common requirement, a special accommodation for them is included (see the sketch after this digression).
</Digression>
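Here's a hypothetical sketch of the lockless-counter idea (this is not the project's actual counter API; names are invented): the transaction records an "adjust by delta" mutation instead of a read-modify-write, so concurrent increments merge at apply time instead of conflicting.

    // Sketch: counter adjustments register no read, so the commit-time
    // conflict check never sees the counter key as "read".
    import java.util.HashMap;
    import java.util.Map;

    class CounterSketch {

        // Pending adjustments recorded by a transaction.
        private final Map<String, Long> adjustments = new HashMap<String, Long>();

        void adjust(String counterKey, long delta) {
            Long pending = adjustments.get(counterKey);
            adjustments.put(counterKey, (pending == null ? 0L : pending) + delta);
        }

        // At apply time, each adjustment is folded into the stored value, so
        // two transactions that each did adjust("hits", +1) both commit and
        // the final value reflects both increments.
        static long applyAdjustment(Map<String, Long> stateMachine, String key, long delta) {
            Long current = stateMachine.get(key);
            long updated = (current == null ? 0L : current) + delta;
            stateMachine.put(key, updated);
            return updated;
        }
    }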

Yes, of course. But you already have single-threaded behavior. The log ensures that by making sure that all commands go into it in this way. So you don't actually get anything from this.
To take a simple counter-example, suppose only one user can be "king", but two different, concurrent transactions execute "if nobody is king, then set x as king" but with different x's. Then you possibly end up with two kings. On the other hand, linearizable semantics for the transactions would have made that outcome impossible.

Because the log commits are happening in sequence, _both_ commands will be appended to the log. 
Then they will be applied to the state machine.

No, only one transaction will succeed, and the other will fail and have to retry. When it retries it will see that somebody is already marked as "king".

-Archie

 

Oren Eini (Ayende Rahien)

unread,
Jun 12, 2015, 3:57:05 PM6/12/15
to Archie Cobbs, raft...@googlegroups.com
Okay, we have a very different method of operating here.

You are logging the _results_ of a transaction to raft.
I'm logging the commands.

In other words, in the concurrent "I'm king if no one else is".

You:
 - Check if someone else is king
 - Make me king
 - Send the command to raft with optimistic concurrency
 - I assume that if it fails due to optimistic concurrency, it isn't on the log?

Me:
 - Write the command to the log and commit it
 - Let the _state machine itself_ process the command, which may result in an error
 - Even failed commands are in the log.




Archie Cobbs

unread,
Jun 12, 2015, 4:04:15 PM6/12/15
to raft...@googlegroups.com, archie...@gmail.com
On Friday, June 12, 2015 at 2:57:05 PM UTC-5, Ayende Rahien wrote:
Okay, we have a very different method of operating here.

You are logging the _results_ of a transaction to raft.
I'm logging the commands.

In other words, in the concurrent "I'm king if no one else is".

You:
 - Check if someone else is king
 - Make me king
 - Send the command to raft with optimistic concurrency
 - I assume that if it fails due to optimistic concurrency, it isn't on the log?

Correct. Transaction is rejected, and nobody but the client even knows it ever happened. Client must then retry and hopefully will have better luck next time :)
 
Me:
 - Write the command to the log and commit it
 - Let the _state machine itself_ process the command, which may result in an error
 - Even failed commands are in the log.

Got it... definitely two different ways of doing things.

In your case, presumably a client waits for the state machine to process the command and return a success/fail result before it can complete.

In either way of doing things, the client must wait for confirmation. That confirmation occurs at different points in the two ways of doing things but the end result (from the client's point of view) is the same.

-Archie

Oren Eini (Ayende Rahien)

unread,
Jun 12, 2015, 4:09:23 PM6/12/15
to raft...@googlegroups.com, Archie Cobbs
Yes, that is correct.
The idea with having everything handled by the state machine is that we have what I believe is a simpler mechanism for concurrency and error handling.
The Raft impl is opaque: it doesn't care about the actual commands and doesn't need to know about the state machine (we currently have 3 different state machines running on the same Raft impl).




Archie Cobbs

unread,
Jun 12, 2015, 4:23:34 PM6/12/15
to raft...@googlegroups.com, archie...@gmail.com
On Friday, June 12, 2015 at 3:09:23 PM UTC-5, Ayende Rahien wrote:
Yes, that is correct.
The idea with having everything handled by the state machine is that we have what I believe is a simpler mechanism for concurrency and error handling.
The Raft impl is opaque: it doesn't care about the actual commands and doesn't need to know about the state machine (we currently have 3 different state machines running on the same Raft impl).

Your approach is definitely simpler, and mine is more complex, but for a good reason: I'm allowing arbitrary transactions to do whatever they want and have a consistent view over the entire key/value store. These are not just atomic "do this" transactions, these are more like "open transaction, do something, do something else, go eat some pizza, do some more stuff, ok now commit this transaction" transactions.

To do what you're doing, my clients would have to serialize their Java code and send it to the leader to execute. This is not a crazy idea but it has all sorts of limitations, e.g., a transaction would not be able to read from a local file, you have to have compatible versions of your application's Java class files on every node (and you can only work with Java clients), the leader would have to enforce time limits for long-running code, etc.

-Archie

Oren Eini (Ayende Rahien)

unread,
Jun 12, 2015, 4:25:45 PM6/12/15
to raft...@googlegroups.com, Archie Cobbs
Why do it this way?
You run your transaction, and that aggregates a set of commands to be applied to the state machine. Then you execute this, and get a success/fail result.



Archie Cobbs

unread,
Jun 12, 2015, 4:31:08 PM6/12/15
to raft...@googlegroups.com, archie...@gmail.com
On Friday, June 12, 2015 at 3:25:45 PM UTC-5, Ayende Rahien wrote:
Why do it this way?
You run your transaction, and that aggregates a set of commands to be applied to the state machine. Then you execute this, and get a success/fail result.

That's effectively what I'm doing. The node doing the transaction converts the arbitrary actions of the code doing the transaction into a simple list of key/value puts and deletes, and then sends those to the leader.

Assuming they pass the conflict test, we are basically done - there's nothing left to validate. Puts and deletes can't "fail". So there's no hurry in applying them to the state machine.

-Archie

Oren Eini (Ayende Rahien)

unread,
Jun 12, 2015, 4:34:58 PM6/12/15
to raft...@googlegroups.com, Archie Cobbs
You now need to maintain the state machine, as well as a view over the state machine for unapplied log entries.


Archie Cobbs

unread,
Jun 12, 2015, 4:38:35 PM6/12/15
to raft...@googlegroups.com, archie...@gmail.com
Indeed. This is inherent in doing MVCC.

-Archie

Oren Eini (Ayende Rahien)

unread,
Jun 12, 2015, 4:40:12 PM6/12/15
to raft...@googlegroups.com, Archie Cobbs
Not really, no. We do MVCC without needing this.