In a normally functioning network, the leader's lease timeout should always be in the future, because the heartbeat interval is much shorter than the election timeout. This means read-only transactions can commit as soon as the transaction's base log entry is committed, with the only additional network traffic being the single CommitRequest/CommitResponse round trip between follower and leader. If there happen to be no other mutating transactions occurring simultaneously, the transaction's base log entry will likely already be committed, in which case the transaction can complete as soon as the CommitResponse is received by the follower.
If the leader's lease timeout is not in the future:
- The leader sends probes to all followers in an attempt to increase the lease timeout quickly, and
- The leader keeps track of the fact that the follower is waiting on that lease timeout, and notifies it immediately (via an updated AppendRequest) when its lease timeout advances beyond that point.
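To make the leader-side handling concrete, here is a minimal sketch in Java; the names (leaseTimeout, waitingFollowers, sendCommitResponse(), probeAllFollowers()) are illustrative, not the actual RaftKVDatabase code:

    import java.util.HashMap;
    import java.util.Map;

    class LeaseCheck {
        long leaseTimeout;                                   // latest time we know no newer leader can exist
        final Map<String, Long> waitingFollowers = new HashMap<>();

        // Handle a CommitRequest for a read-only transaction from a follower.
        void handleReadOnlyCommit(String follower, long txBaseIndex, long txBaseTerm) {
            long now = System.nanoTime();
            if (leaseTimeout > now) {
                // Lease still in the future: reply right away; the transaction can
                // commit as soon as its base log entry is committed.
                sendCommitResponse(follower, txBaseIndex, txBaseTerm);
            } else {
                // Lease expired: remember that this follower is waiting on it, and
                // probe all followers to advance the lease timeout quickly.
                waitingFollowers.put(follower, now);
                probeAllFollowers();
            }
        }

        void sendCommitResponse(String follower, long index, long term) { /* network send */ }
        void probeAllFollowers() { /* immediate AppendRequests to all followers */ }
    }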
Random Notes
Miscellaneous random details I encountered...
Add log entry at start of term
During testing I encountered the following problem: suppose a follower tries to commit a transaction. The leader appends a new log entry at index X, term T, and sends back the transaction's commit index = X and term = T to the follower. Now the follower will wait for X/T to be committed. But then an election occurs, and another leader who has received X/T gets elected, and no additional transactions occur. The new leader cannot declare X/T committed because it is from a prior term. So X/T will sit there uncommitted indefinitely, which means the follower will time out waiting to commit its transaction. Having the leader append a trivial log entry at the beginning of each term fixes this corner case.
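A minimal sketch of the fix, with illustrative (not actual) names: on winning an election, the new leader immediately appends a no-op entry in its own term, which it can then replicate and commit, committing X/T along with it:

    import java.util.ArrayList;
    import java.util.List;

    class NewLeaderNoOp {
        static final class Entry {
            final long term;
            final byte[] mutations;                   // null for a no-op entry
            Entry(long term, byte[] mutations) { this.term = term; this.mutations = mutations; }
        }

        long currentTerm;
        final List<Entry> log = new ArrayList<>();

        void onElectionWon() {
            // Append a trivial entry carrying no mutations, in the new leader's own term.
            log.add(new Entry(currentTerm, null));
            // Once this entry is replicated to a majority it can be committed, and
            // committing it also commits every earlier entry such as X/T.
        }
    }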
Output queue empty notification
Nodes receive notification from the network when the output queue for a recipient has gone empty. This allows them to send data as fast as possible, but no faster, avoiding output queue pile-ups and dropped messages. For example, an InstallSnapshot operation is broken into several (pipelined) messages. We want to send these as fast as possible, but if any of them gets dropped we have to start over. In other words, if you are pipelining messages you need flow control up to the application level.
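A sketch of what that flow control might look like for pipelined snapshot chunks, assuming a hypothetical onOutputQueueEmpty() callback from the network layer (names are illustrative):

    import java.util.Iterator;

    class SnapshotSender {
        private final String follower;
        private final Iterator<byte[]> remainingChunks;    // the snapshot, broken into messages

        SnapshotSender(String follower, Iterator<byte[]> remainingChunks) {
            this.follower = follower;
            this.remainingChunks = remainingChunks;
        }

        // Invoked by the network layer whenever the follower's output queue has
        // drained; only then do we hand it the next chunk.
        void onOutputQueueEmpty() {
            if (remainingChunks.hasNext())
                send(follower, remainingChunks.next());
            // If any chunk is dropped, the entire InstallSnapshot is restarted.
        }

        void send(String recipient, byte[] chunk) { /* enqueue on the network output queue */ }
    }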
Delaying log entry application
Normally, nodes apply committed log entries as soon as possible. However, it turns out there are two reasons in this implementation for leaders not to do this.
First, leaders try to keep log entries around, unapplied, for a minimum amount of time after they have been committed, to help avoid retry failures on long-running transactions. Because the base log entry is chosen when the transaction is opened, if the transaction runs for a long time before being committed, then in the meantime that base log entry and at least one other could get applied to the state machine. The leader would then be unable to determine whether (a) the transaction's base log entry is actually in its log, or (b) there were any read/write conflicts, and would be forced to reject the transaction with a retry. For this reason, leaders leave log entries sitting unapplied for a "minimum supported transaction duration", as long as they aren't taking up too much memory.
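A sketch of that rule as a simple predicate; the constants are hypothetical tuning knobs, not the implementation's actual limits:

    class ApplyDelayPolicy {
        // Hypothetical tuning knobs; the real implementation's limits differ.
        static final long MIN_TX_DURATION_MILLIS = 5_000;
        static final long MAX_UNAPPLIED_BYTES = 10 * 1024 * 1024;

        // Apply a committed entry only once it is "old enough" for any still-running
        // transaction based on it to have committed, unless memory pressure forces it.
        boolean shouldApply(long entryCommitTimeMillis, long totalUnappliedBytes) {
            long age = System.currentTimeMillis() - entryCommitTimeMillis;
            return age >= MIN_TX_DURATION_MILLIS || totalUnappliedBytes > MAX_UNAPPLIED_BYTES;
        }
    }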
There is a similar problem with log compaction and InstallSnapshot: when an InstallSnapshot is in effect for a follower, that "snapshot" is based on a certain log index. We need to defer applying any log entries beyond that index to the state machine, otherwise after the snapshot is completed, the follower's next log entry would not be available to us to send, and so we'd then have to send a whole new snapshot.
We also try to avoid InstallSnapshot scenarios in the first place by waiting (again, up to some time and memory limit) until every follower has received a log entry before applying it. (Leaders also optimistically send InstallSnapshots based on their last log entry, not their last committed log entry, to help with this problem.)
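Taken together with the snapshot rule above, the leader's "may I apply this committed entry yet?" decision might look roughly like this (illustrative names, not the actual code):

    import java.util.Collection;

    class ApplyDeferralRules {
        // May the leader apply committed entry `index` to its state machine yet?
        boolean mayApply(long index,
                         Collection<Long> inProgressSnapshotBaseIndexes,
                         Collection<Long> followerMatchIndexes,
                         boolean overTimeOrMemoryLimit) {
            for (long snapshotBase : inProgressSnapshotBaseIndexes)
                if (index > snapshotBase)
                    return false;            // would strand the follower receiving that snapshot
            if (!overTimeOrMemoryLimit)
                for (long matchIndex : followerMatchIndexes)
                    if (matchIndex < index)
                        return false;        // some follower hasn't received it yet; keep waiting
            return true;
        }
    }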
However, this change creates another quirk. In "stock" Raft, a leader's state machine always stays ahead of follower state machines. When the leader is deferring application of the log as described above, that's no longer true. So the leader logic to decrement nextIndex for a follower might result in a nextIndex less than the follower's last applied log index, in which case the follower would not know what to do or how to stop the decrementing! Instead, followers send their last applied log index to leaders in every AppendResponse, and the leaders use this as a lower bound on that follower's nextIndex.
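A sketch of that lower bound on the leader side, assuming (as described) that every AppendResponse carries the follower's last applied index:

    class NextIndexTracker {
        long nextIndex;     // leader's nextIndex for one follower

        // Called when an AppendRequest is rejected; the AppendResponse is assumed
        // to carry the follower's last applied log index.
        void onAppendRejected(long followerLastAppliedIndex) {
            nextIndex = Math.max(nextIndex - 1, followerLastAppliedIndex + 1);
            // The decrement can never probe below entries the follower has already
            // applied (and possibly compacted away).
        }
    }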
> Add log entry at start of term
> During testing I encountered the following problem: suppose a follower tries to commit a transaction. The leader appends a new log entry at index X, term T, and sends back the transaction's commit index = X and term = T to the follower. Now the follower will wait for X/T to be committed. But then an election occurs, and another leader who has received X/T gets elected, and no additional transactions occur. The new leader cannot declare X/T committed because it is from a prior term. So X/T will sit there uncommitted indefinitely, which means the follower will time out waiting to commit its transaction. Having the leader append a trivial log entry at the beginning of each term fixes this corner case.

Committing a no-op entry at the beginning of a leader's term to force commitment of entries from previous terms is a standard component of the Raft algorithm. See 5.4.2 in the Raft paper, which oddly doesn't seem to explicitly mention committing a no-op entry, but section 8 does, and I think the dissertation does as well. But even so, good on you for noticing that issue too!

> There is a similar problem with log compaction and InstallSnapshot: when an InstallSnapshot is in effect for a follower, that "snapshot" is based on a certain log index. We need to defer applying any log entries beyond that index to the state machine, otherwise after the snapshot is completed, the follower's next log entry would not be available to us to send, and so we'd then have to send a whole new snapshot.

This is an interesting issue that I've had with snapshots too. For snapshots on leaders, it seems best to ensure there's some buffer between the snapshot and the end of the log in order to prevent immediately sending a snapshot after taking it. But alternatively, leaders can transfer their leadership (perhaps using the method from Diego's dissertation) before taking a snapshot, thus ensuring you won't be compacting entries that need to be sent to some follower. Blocking application of commands on the leader means blocking reads and writes in most implementations, which could be unacceptable if a not insignificant amount of time is required to take a snapshot.

> We also try to avoid InstallSnapshot scenarios in the first place by waiting (again, up to some time and memory limit) until every follower has received a log entry before applying it. (Leaders also optimistically send InstallSnapshots based on their last log entry, not their last committed log entry, to help with this problem.)
> However, this change creates another quirk. In "stock" Raft, a leader's state machine always stays ahead of follower state machines. When the leader is deferring application of the log as described above, that's no longer true. So the leader logic to decrement nextIndex for a follower might result in a nextIndex less than the follower's last applied log index, in which case the follower would not know what to do or how to stop the decrementing! Instead, followers send their last applied log index to leaders in every AppendResponse, and the leaders use this as a lower bound on that follower's nextIndex.

I'm not following on this. The leader's nextIndex for a follower should reference the next entry to send to that follower, not the next entry to be applied on that follower. The follower's lastApplied index may differ from its commitIndex, and nextIndex should simply not decrease past the follower's last log index, though it's not a tragedy if it does. The entries applied to a follower's state machine correlate with (but may or may not be equal to) commitIndex rather than their last log index, so I'm not seeing how nextIndex relates to the follower's state machine, but maybe there's something specific to this implementation that I'm missing.
Hi Jordan,
On Wednesday, June 10, 2015 at 2:29:32 PM UTC-5, Jordan Halterman (kuujo) wrote:
> Committing a no-op entry at the beginning of a leader's term to force commitment of entries from previous terms is a standard component of the Raft algorithm. See 5.4.2 in the Raft paper, which oddly doesn't seem to explicitly mention committing a no-op entry, but section 8 does, and I think the dissertation does as well. But even so, good on you for noticing that issue too!

It seems like a good thing to do for several reasons, at least according to other threads on the forum. For example, it allows you to compact the log sooner.

> This is an interesting issue that I've had with snapshots too. For snapshots on leaders, it seems best to ensure there's some buffer between the snapshot and the end of the log in order to prevent immediately sending a snapshot after taking it. But alternatively, leaders can transfer their leadership (perhaps using the method from Diego's dissertation) before taking a snapshot, thus ensuring you won't be compacting entries that need to be sent to some follower. Blocking application of commands on the leader means blocking reads and writes in most implementations, which could be unacceptable if a not insignificant amount of time is required to take a snapshot.

I'm not understanding the transfer of leadership idea and how that would help. Plus it seems like it would make things more complicated, because now you have the possibility of snapshot messages being sent from non-leaders.

My implementation is very simple... snapshots only go from leaders to followers, and if anything weird happens it has to be restarted. So, for example, the snapshot will fail if an election occurs during the transfer.

However, an in-progress snapshot install does not prevent anyone from doing work, other than the follower receiving the snapshot. It only prevents the leader from applying committed log entries (the leader is still free to append new log entries).

Multiple snapshot installs and transactions can be happening at the same time, etc.

> I'm not following on this. The leader's nextIndex for a follower should reference the next entry to send to that follower, not the next entry to be applied on that follower. The follower's lastApplied index may differ from its commitIndex, and nextIndex should simply not decrease past the follower's last log index, though it's not a tragedy if it does. The entries applied to a follower's state machine correlate with (but may or may not be equal to) commitIndex rather than their last log index, so I'm not seeing how nextIndex relates to the follower's state machine, but maybe there's something specific to this implementation that I'm missing.
Here's the problem: suppose every node has a fully compacted log with last applied index = 10. Now the leader appends index = 11 and index = 12 and sends these to all followers. All followers reply affirmatively, so both entries are committed, and the leader updates commitIndex -> 12 and sends this new commitIndex to all followers. So all followers apply the committed log entries #11 and #12 to their state machines, and their logs are again fully compacted with last applied index = 12. But the leader has not yet applied log entry #11 or #12. The leader has nextIndex = 13 for all followers.
Now suppose follower X drops off the network for a short time, just long enough for the next two AppendRequests to that follower to fail. The leader decrements nextIndex twice for that follower, back down to 11. Now the follower comes back online. The leader then sends him an AppendRequest with log entry #11.
What does the follower do now? The follower has no way to determine whether the term associated with log entry #11 is the same as the one he already compacted into his log, because that information is lost. So the follower must reply in the negative. So then the leader decrements his nextIndex again, back down to 10. This will repeat until the leader reaches his own log compaction point, after which a snapshot install will be forced.
I suppose there is a simple logical argument to prove that the follower's term for log entry #11 must match, using the leader completeness principle: anything in the follower's compacted log must be committed, and therefore must match what's in the current leader's log. So I suppose the follower can just reply in the positive instead of the negative even though it can't explicitly verify the log entry term, and that would eliminate the need for this extra bit of communication.
In any case, the Raft paper doesn't seem to explicitly specify what followers should do in the "I received an AppendEntries log entry but that index is already compacted away in my log" case.
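A sketch of that suggested follower-side rule, i.e. treating anything at or below the compaction point as matching by the Leader Completeness argument (illustrative names only):

    class FollowerConsistencyCheck {
        interface Log {
            Long termAtIndex(long index);   // null if the entry is not in the log
        }

        long lastAppliedIndex;              // everything at or below this has been compacted away

        // Does the (prevIndex, prevTerm) consistency check in an AppendRequest pass?
        boolean prevEntryMatches(Log log, long prevIndex, long prevTerm) {
            if (prevIndex <= lastAppliedIndex)
                return true;                // compacted => committed => must match, by leader completeness
            Long term = log.termAtIndex(prevIndex);
            return term != null && term == prevTerm;
        }
    }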
-Archie
There's no possibility of snapshots being sent from non-leaders. That doesn't make sense. Only leaders should have the ability to send a snapshot, because that's dictated by whether a follower indicates that it doesn't have entries back to the snapshot index via an AppendEntries response.

> I'm not understanding the transfer of leadership idea and how that would help. Plus it seems like it would make things more complicated, because now you have the possibility of snapshot messages being sent from non-leaders.

But this is an essential responsibility of the leader. If it can't apply an entry during a snapshot, it can't respond to a write (if the write needs output from the state machine).

> My implementation is very simple... snapshots only go from leaders to followers, and if anything weird happens it has to be restarted. So, for example, the snapshot will fail if an election occurs during the transfer.
> However, an in-progress snapshot install does not prevent anyone from doing work, other than the follower receiving the snapshot. It only prevents the leader from applying committed log entries (the leader is still free to append new log entries).
With disk-based state machines, a recent copy of the system state is maintained on disk as part of normal operation. Thus, the Raft log can be discarded as soon as the state machine reflects writes to disk, and snapshotting is used only when sending consistent disk images to other servers (Section 5.2).
Instead my implementation uses this model (quoting again):

> With disk-based state machines, a recent copy of the system state is maintained on disk as part of normal operation. Thus, the Raft log can be discarded as soon as the state machine reflects writes to disk, and snapshotting is used only when sending consistent disk images to other servers (Section 5.2).

I'm using a "disk based state machine" because I wanted to avoid the limitation that the entire database must fit into memory.

And since it's using LevelDB, which supports "point in time" snapshots, for the local persistent state (including the state machine itself), it can be sending an InstallSnapshot and appending or applying log entries at the same time. So there's never any need to step down, etc.
Why are you allowing transactions on any node? That just complicates everything. Because transactions have to go through the leader anyway, it is _much_ simpler to state that write transactions have to be executed on the leader. That way, they can commit directly, and fail directly if there is a network split.
Also, note that for consistency, you have 3 modes:
- Read from follower (potentially stale)
- Read from leader (better, but the leader might have been demoted without knowing it yet)
- Read from quorum (force a log commit to ensure that we are actually the leader)
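Sketched as an enum, purely for illustration:

    enum ReadConsistency {
        FOLLOWER,   // served locally by any node; results may be stale
        LEADER,     // served by the leader; stale only if it was demoted without noticing
        QUORUM      // leader confirms it is still leader (e.g. via a log commit) before replying
    }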
Also, leaders shouldn't discard log entries after they have been committed; that just leads to not being able to send entries, or delays the actual state machine application.
You assume that applying a log entry also means deleting it; that isn't the case. And in your case, it can lead to effective unavailability of the cluster (it processes commands, but it doesn't apply them, which is really bad).
I don't follow your sync vs. unsync distinction, or why you send just a single entry. What happens if I execute two transactions concurrently? I want to send them out as fast as possible.
The reflected log entry is a bad idea; it is a minor optimization for a bigger perf issue (see above), but it also means that you now need to handle the case of a follower crashing after sending it to the leader and getting the full value back. It is more complex.
Random cluster ids are probably a bad idea. You might get identical ones (it's easier than you would expect to run into that in production). You probably want to use a GUID or something like that.
Negative replies are really important for debugging, troubleshooting and ops.
> However, an in-progress snapshot install does not prevent anyone from doing work, other than the follower receiving the snapshot. It only prevents the leader from applying committed log entries (the leader is still free to append new log entries). Multiple snapshot installs and transactions can be happening at the same time, etc.

Not applying commands means that the system is down, as far as the user is concerned.
> Here's the problem: suppose every node has a fully compacted log with last applied index = 10. Now the leader appends index = 11 and index = 12 and sends these to all followers. All followers reply affirmatively, so both entries are committed, and the leader updates commitIndex -> 12 and sends this new commitIndex to all followers. So all followers apply the committed log entries #11 and #12 to their state machines, and their logs are again fully compacted with last applied index = 12. But the leader has not yet applied log entry #11 or #12. The leader has nextIndex = 13 for all followers.

Why wouldn't it apply them? The leader usually applies them when it gets majority votes from the cluster. That is a really strange design decision.
> What does the follower do now? The follower has no way to determine whether the term associated with log entry #11 is the same as the one he already compacted into his log, because that information is lost.

The follower wouldn't ask for entry #11. It would say "already have it, thanks". It would only need to compare its latest term/entry with the leader's top term/entry. Raft ensures that if they match, all previous ones match as well.
> Followers only send replies to RequestVote if the vote is granted; there is no negative reply. Negative replies seemed useless to me (unless I'm missing something) because they have no effect on the state of the recipient.

I thought one reason to send a negative reply is to update the sender's currentTerm, which could be lower.
> I've implemented the technique in section 4.2.3 of the dissertation whereby followers ignore a RequestVote that is received within a minimum election timeout of an AppendRequest.

I've been wondering if the "minimum election timeout" is the same timeout that causes a follower to become a candidate. If so, then the rule would simply be that followers always ignore RequestVotes.
Imagine a 5 node cluster. S1 is the leader. S4 is down for a while. We are now processing commands, and we can apply them on S1, S2, S3, S5. But S4 is not responding, and we can't send anything to it. If we remove committed entries from the log, it means that when S4 is back up, we'll need to send a snapshot. In my case, snapshots can be _very_ large (GBs or higher); we want to avoid that if at all possible.
No, it isn't a tuning issue. It is a pretty fundamental issue. You are delaying committing log entries because you equate committing a log entry with deleting it. And not applying the log entry means that the operation didn't happen; the client is going to be upset about it.
On Thursday, June 11, 2015 at 4:32:03 PM UTC-5, Ayende Rahien wrote:
> No, it isn't a tuning issue. It is a pretty fundamental issue. You are delaying committing log entries because you equate committing a log entry with deleting it. And not applying the log entry means that the operation didn't happen; the client is going to be upset about it.
You're misunderstanding how it works... I think part of the problem is that the terminology is confusing.
It's true that I'm deleting log entries once they are applied to the state machine (aka, "compacted")...
and I assume "applied to the state machine" is what you meant by "committing a log entry".
But your assertion "not applying the log entry means that the operation didn't happen" is not true. My log entries are recorded to disk persistently as well as kept in memory.
So if a node has 5 outstanding log entries that have not yet been applied to its state machine, and then it crashes and restarts, those 5 log entries will still be there.
Perhaps I should have made this more clear in the description. I guess I thought it was obvious, because the whole correctness of Raft presumes that a node's log is part of its persistent state.
Also to clarify another point which may be confusing: transactions always observe a view of the node's current state machine with any unapplied log entries layered on top. So in effect what the transaction sees is the entire, most up-to-date log entry, even if that log entry has not been applied yet.
Then when a follower sends the CommitRequest to the leader, the leader verifies that any additional log entries that the follower doesn't know about yet do not create conflicts (this is the optimistic locking part).
So transactions can be running and log entries can be appended in an inter-mixed fashion, even if there are outstanding, unapplied log entries on the follower and/or leader.
Hope this clears it up.
-Archie
Okay, so let's define things:
- Commit - the leader gets replies from N/2+1 nodes that they accepted a command
- Applied - the command is processed by the state machine
- Deleted - removed from the persistent log
- Compaction - deleting old entries from the log and generating a stable snapshot of the state machine
> So if a node has 5 outstanding log entries that have not yet been applied to its state machine, and then it crashes and restarts, those 5 log entries will still be there.

But _it wasn't applied_. Consider the case of a simple key/value store. I sent the following commands:
- Set X = 1
- Set Y = 2
You have now stored those commands in the log (persistent or in memory, doesn't matter). But until you apply them to the state machine, you can't read those values back.
That means that as far as the client is concerned, they didn't happen. This basically means that you have created a system where updates are accepted, and the reply you get back is "oh, yeah, we'll apply it some time in the future". What happens when you have a command like "Set X = 2 IF Y = 1"? You send the command to the leader, it is committed, but you don't have any reply. And a reply is something that the client is likely to want.
> Also to clarify another point which may be confusing: transactions always observe a view of the node's current state machine with any unapplied log entries layered on top. So in effect what the transaction sees is the entire, most up-to-date log entry, even if that log entry has not been applied yet.

That is wrong, because you don't know if a log entry will be rolled back. And if you are only doing that for committed entries, then that "view" is just what you get when you actually apply the state machine command.
> Then when a follower sends the CommitRequest to the leader, the leader verifies that any additional log entries that the follower doesn't know about yet do not create conflicts (this is the optimistic locking part).

Again, this is extra complexity. If the leader did all the work, on top of the fully applied state machine, you wouldn't need it.
On Thursday, June 11, 2015 at 4:58:17 PM UTC-5, Ayende Rahien wrote:

So you can call those things whatever you want. There are two operations:
1. Append a new log entry to the list of unapplied log entries
2. Apply the first unapplied log entry in the list to the key/value store (aka "state machine"), which means doing the following:
  2.1. Apply the log entry's mutations to the key/value store (do the puts & deletes)
  2.2. Set Last-Applied-Index equal to Log-Entry-Index (i.e., increment Last-Applied-Index by one)
  2.3. Set Last-Applied-Term equal to Log-Entry-Term
  2.4. Remove the log entry we just applied from the front of the list of unapplied log entries
I call operation #1 "appending a log entry" and operation #2 "applying a log entry".
There is no separate "compaction" operation. The "compaction" is implicit in step #2.
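Here is roughly what those two operations look like in code; the Entry layout and map types are illustrative only, not the actual classes:

    import java.util.ArrayDeque;
    import java.util.Deque;
    import java.util.Map;
    import java.util.TreeMap;

    class UnappliedLog {
        static final class Entry {
            final long index;
            final long term;
            final Map<String, byte[]> puts;     // deletes omitted for brevity
            Entry(long index, long term, Map<String, byte[]> puts) {
                this.index = index; this.term = term; this.puts = puts;
            }
        }

        final Deque<Entry> unapplied = new ArrayDeque<>();
        final Map<String, byte[]> kvStore = new TreeMap<>();    // the "state machine"
        long lastAppliedIndex;
        long lastAppliedTerm;

        void append(Entry entry) {                  // operation #1
            unapplied.addLast(entry);
        }

        void applyNext() {                          // operation #2
            if (unapplied.isEmpty())
                return;
            Entry entry = unapplied.peekFirst();
            kvStore.putAll(entry.puts);             // 2.1: apply the mutations
            lastAppliedIndex = entry.index;         // 2.2
            lastAppliedTerm = entry.term;           // 2.3
            unapplied.removeFirst();                // 2.4: discard it (the implicit "compaction")
        }
    }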
> You send the command to the leader, it is committed, but you don't have any reply. And that is something that the client is likely to want.
That's not how it works.
The client does its own reading of the key/value state. When commit is invoked, what it sends to the leader is its writes, plus the keys that it read - the latter only being used to detect conflicts. Then only after the leader responds with the index & term of the associated log entry, and the follower sees that log entry committed (in the Raft sense), does the transaction succeed and the commit() method return successfully. If something goes wrong, commit() throws a RetryTransactionException instead.
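A sketch of that follower-side commit path; sendCommitRequest() and waitForCommit() are hypothetical placeholders for the real network round trip and wait, and the RetryTransactionException here is a local stand-in for the exception named above:

    import java.util.Map;
    import java.util.Set;

    class FollowerCommit {
        static final class CommitResponse { long index; long term; }
        static class RetryTransactionException extends RuntimeException { }

        // Follower-side commit: ship the read set (for conflict detection) and the
        // writes to the leader, then wait for the resulting log entry to commit.
        void commit(Set<String> keysRead, Map<String, byte[]> writes) throws InterruptedException {
            CommitResponse response = sendCommitRequest(keysRead, writes);
            if (!waitForCommit(response.index, response.term))
                throw new RetryTransactionException();   // conflict, timeout, leadership change, ...
        }

        CommitResponse sendCommitRequest(Set<String> keysRead, Map<String, byte[]> writes) {
            return new CommitResponse();                  // placeholder for the real network round trip
        }

        boolean waitForCommit(long index, long term) throws InterruptedException {
            return true;                                  // placeholder: block until index/term is committed
        }
    }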
But then you'd also lose the ability to have concurrent transactions, which is unacceptable.
On Fri, Jun 12, 2015 at 1:23 AM, Archie Cobbs <archie...@gmail.com> wrote:
> So you can call those things whatever you want. There are two operations:
> 1. Append a new log entry to the list of unapplied log entries
> 2. Apply the first unapplied log entry in the list to the key/value store (aka "state machine"), which means doing the following:
>   2.1. Apply the log entry's mutations to the key/value store (do the puts & deletes)
>   2.2. Set Last-Applied-Index equal to Log-Entry-Index (i.e., increment Last-Applied-Index by one)
>   2.3. Set Last-Applied-Term equal to Log-Entry-Term
>   2.4. Remove the log entry we just applied from the front of the list of unapplied log entries
My issue is with 2.4: you don't need to do that at this time. You can do it later, when you have enough of a buffer that you won't force us to send a snapshot over the network.
> I call operation #1 "appending a log entry" and operation #2 "applying a log entry".
> The client does its own reading of the key/value state. When commit is invoked, what it sends to the leader is its writes, plus the keys that it read - the latter only being used to detect conflicts. Then only after the leader responds with the index & term of the associated log entry, and the follower sees that log entry committed (in the Raft sense), does the transaction succeed and the commit() method return successfully. If something goes wrong, commit() throws a RetryTransactionException instead.
That assumes that you have very simple operations, and that they can be handled on their own. What about something like "INC x"? You don't know what the value of x is; you need to actually execute it to get it. And you _want_ something like that to allow concurrency. That is what I meant by not having a result until this is actually applied to the state machine.
> But then you'd also lose the ability to have concurrent transactions, which is unacceptable.

Why? You can most certainly submit multiple commands to Raft that will be processed in a single batch.
> What about something like "INC x"? You don't know what the value of x is; you need to actually execute it to get it. And you _want_ something like that to allow concurrency. That is what I meant by not having a result until this is actually applied to the state machine.
All of the transaction's reads and writes (and INCs or whatever) happen on the client node, within the transaction's private little world, entirely locally and not involving any network communication. It's only at commit() time that we start communicating with the leader. And at that time, what we communicate is the recorded mutations (puts and removes) and keys read (gets). Whether those gets, puts, and removes were implementing INCs or whatever other fancy logic is irrelevant at that point.
A bunch of commands is not the same thing as a linearizable ACID transaction over the entire key/value store, which is an explicit goal of this implementation. A linearizable transaction allows you to (conceptually) perform a bunch of reads and writes, affecting a bunch of different keys and key ranges, all on a virtual point-in-time snapshot of the entire key/value store, and then have it appear as if all of those writes were applied atomically at that same point in time, and moreover, all such transactions appear to have occurred in some total linear ordering.
To take a simple counter-example, suppose only one user can be "king", but two different, concurrent transactions execute "if nobody is king, then set x as king" but with different x's. Then you possibly end up with two kings. On the other hand, linearizable semantics for the transactions would have made that outcome impossible.
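A sketch of the leader-side conflict check this implies: the transaction's read set is compared against the writes of every log entry appended after its base entry, so in the "king" example whichever transaction commits second is forced to retry (illustrative names only):

    import java.util.List;
    import java.util.Set;

    class ConflictCheck {
        // Leader-side check at CommitRequest time. In the "king" example both
        // transactions read the "king" key, so the second one to commit sees the
        // first one's write in writesAppendedSinceBase and must retry.
        boolean mayCommit(Set<String> txReads, List<Set<String>> writesAppendedSinceBase) {
            for (Set<String> writes : writesAppendedSinceBase)
                for (String key : writes)
                    if (txReads.contains(key))
                        return false;    // read/write conflict => RetryTransactionException
            return true;
        }
    }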
> All of the transaction's reads and writes (and INCs or whatever) happen on the client node, within the transaction's private little world, entirely locally and not involving any network communication. It's only at commit() time that we start communicating with the leader. And at that time, what we communicate is the recorded mutations (puts and removes) and keys read (gets). Whether those gets, puts, and removes were implementing INCs or whatever other fancy logic is irrelevant at that point.

Assume that you are using INC to generate unique ids. That is a classic case for distributed consensus. Using your model, it is impossible for the client to get a valid reply for INC, because two followers with the same state will run it locally and get different results, since they aren't aware of the other running. Unless you fail the transaction for one because they both touched the same key, which is bad for something like INC, which is explicitly designed to be parallel.
Yes, of course. But you already have single threaded behavior. The log ensures that by making sure that all commands go into it in this way. So you don't actually get anything from this.

> To take a simple counter-example, suppose only one user can be "king", but two different, concurrent transactions execute "if nobody is king, then set x as king" but with different x's. Then you possibly end up with two kings. On the other hand, linearizable semantics for the transactions would have made that outcome impossible.

Because the log commits are happening in sequence, _both_ commands will be appended to the log. Then they will be applied to the state machine.
Okay, we have a very different method of operating here. You are logging the _results_ of a transaction to Raft. I'm logging the commands. In other words, in the concurrent "I'm king if no one else is" case:

You:
- Check if someone else is king
- Make me king
- Send the command to Raft with optimistic concurrency
- I assume that if it fails due to optimistic concurrency, it isn't on the log?
Me:
- Write the command to the log and commit it
- Let the _state machine itself_ process the command, which may result in an error
- Even failed commands are in the log.
Yes, that is correct.

The idea with having everything handled by the state machine is that it gives us what I believe is a simpler mechanism for concurrency and error handling. The Raft impl is opaque: it doesn't care about the actual commands, and it doesn't need to know about the state machine (we currently have 3 different state machines running using the same Raft impl).
Why do it this way? You run your transaction, and that aggregates a set of commands to be applied to the state machine. Then you execute this, and get a "success/fail" result.