Simple optimization - is it safe?


Archie Cobbs

Oct 6, 2016, 1:43:44 PM
to raft-dev
I'm considering a simple optimization and I think it would be safe but wanted to get some other opinions...

Here's the scenario...

Let M be the size of a majority.

Suppose the following:
  1. The leader L is about to send an AppendEntries containing log entry with index N to follower F.
  2. L has already received confirmation from M - 1 nodes that log entry N has been successfully received and added to their logs (i.e., itself plus M - 2 other followers).
  3. L's current commitIndex is N - 1.
Then in the AppendEntries that L sends to F, it sets leaderCommit to N, instead of N - 1 as it would normally do.

Why is this safe? Because by the time F is ready to update its commitIndex, we know log entry N has now been replicated on a majority M of servers, so it is committed. The leader doesn't know this yet (he'll find out when he gets F's positive reply), but the follower, being the one who actually puts N over the commit threshold, can take advantage of that new fact.
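The counting argument above can be sketched in code. This is a hypothetical sketch, not from any particular implementation; names like `match_index` and `commit_index` are my own.

```python
def leader_commit_to_send(n, follower, match_index, commit_index, cluster_size):
    """leaderCommit value to piggyback on the AppendEntries carrying entry N.

    match_index maps each node (including the leader itself) to the highest
    log index known to be durably stored on that node.
    """
    majority = cluster_size // 2 + 1
    # Count nodes other than F that already hold entry N (the leader counts
    # itself here once its own write of N is durable).
    acked_without_f = sum(1 for node, idx in match_index.items()
                          if node != follower and idx >= n)
    if acked_without_f >= majority - 1:
        # F's own successful append will complete the majority, so F may
        # safely treat N as committed as soon as it persists the entry.
        return max(commit_index, n)
    return commit_index
```

For example, in a 3-node cluster where only the leader holds entry N so far, sending to either follower satisfies the check, which matches the observation below that the optimization mostly pays off in 2- or 3-node clusters.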

How does this improve things? Because it eliminates a round-trip from the time it takes observers on F to see that a transaction they originated is committed. Depending on how you're using Raft, this may or may not be important (in my case it is).

This optimization would be most applicable in a two or three node cluster, because condition #2 is always true. In larger clusters, condition #2 is less likely to be true in general because L should be sending new log entries to all followers in parallel.

Condition #3 is often true if load on L is light but becomes less likely to be true when the load on L is heavy (assuming L is pipelining log entries as it should be).

Thoughts?

-Archie

Archie Cobbs

Oct 6, 2016, 4:43:32 PM
to raft-dev
Following up on my own post...


On Thursday, October 6, 2016 at 12:43:44 PM UTC-5, Archie Cobbs wrote:
Suppose the following:
  1. The leader L is about to send an AppendEntries containing log entry with index N to follower F.
  2. L has already received confirmation from M - 1 nodes that log entry N has been successfully received and added to their logs (i.e., itself plus M - 2 other followers).
  3. L's current commitIndex is N - 1.
Then in the AppendEntries that L sends to F, it sets leaderCommit to N, instead of N - 1 as it would normally do.

I think to be safe and consistent with the current Raft logic we also need this additional condition:

          4. The term associated with log entry N is equal to L's current term

Of course this will always be true, except at the beginning of L's term.

-Archie

Archie Cobbs

Oct 6, 2016, 6:39:07 PM
to raft-dev
Sorry to keep talking to myself.

There is another condition required:

    5. The leader's current matchIndex for F is less than N

This should be sort-of obvious - F's successful application of the log entry with index N doesn't "put N over the commit threshold" if F is already being counted in the tally.

For a newly created log entry, condition #5 will always be the case of course.

Also, condition #3 can be relaxed to say:

    3. L's current commitIndex is less than N.

This follows from leader completeness.

To summarize, the new (renumbered) conditions are:
  1. For M - 1 nodes not including F, L's matchIndex for each node is >= N.
  2. The term associated with log entry N is equal to L's current term.
Then the leader can set leaderCommit = MAX(commitIndex, N).
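The full condition set can be folded into one predicate. Here's a hypothetical sketch under the same caveat, with invented names, not from any particular implementation:

```python
def early_leader_commit(n, entry_term, current_term, follower,
                        match_index, commit_index, cluster_size):
    """leaderCommit to send with entry N to `follower`, applying the
    conditions summarized above; falls back to the normal commitIndex."""
    majority = cluster_size // 2 + 1
    # Condition 1: M - 1 nodes other than F (leader included) already hold N.
    acked_without_f = sum(1 for node, idx in match_index.items()
                          if node != follower and idx >= n)
    if (acked_without_f >= majority - 1
            and match_index.get(follower, 0) < n   # F not already in the tally
            and entry_term == current_term         # entry is from L's own term
            and commit_index < n):                 # N is not yet committed
        return max(commit_index, n)
    return commit_index
```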

    -Archie

    Oren Eini (Ayende Rahien)

Oct 7, 2016, 1:01:08 AM
    to raft...@googlegroups.com
It is quite confusing with all those letters; I suggest you use meaningful variable names.
I keep thinking about this issue, and I think I agree: it will be a performance optimization for the 3-node scenario.
For 5 nodes, I don't think it will be very useful.

Hibernating Rhinos Ltd

Oren Eini | CEO | Mobile: +972-52-548-6969

Office: +972-4-622-7811 | Fax: +972-153-4-622-7811

     


    --
    You received this message because you are subscribed to the Google Groups "raft-dev" group.
    To unsubscribe from this group and stop receiving emails from it, send an email to raft-dev+unsubscribe@googlegroups.com.
    For more options, visit https://groups.google.com/d/optout.

    Юрий Соколов

Oct 7, 2016, 4:53:02 PM
    to raft-dev
I doubt it could ever be an optimization, because the optimized version sends AppendEntries to all followers (and writes to disk) simultaneously.
So this optimization will only do meaningful work when sending to lagging replicas.
If you don't have a non-lagging majority, then you already have serious troubles.

    jordan.h...@gmail.com

Oct 7, 2016, 5:42:39 PM
    to raft...@googlegroups.com
    I think the point is it *is* an optimization in a 2-3 node cluster, because the leader can write to disk and send all AppendEntries requests with the new commitIndex since it knows itself plus any follower is a majority.

    Oren Eini (Ayende Rahien)

Oct 7, 2016, 5:49:25 PM
    to raft...@googlegroups.com
    It _has_ to first write to the disk, and only then send it out.




    Юрий Соколов

Oct 7, 2016, 6:49:58 PM
    to raft-dev
    Cite from Raft dissertation:

    10.2.1
    Writing to the leader’s disk in parallel
One useful performance optimization can remove a disk write from Raft's critical path. In a naïve implementation, the leader writes the new log entry to disk before replicating the entry to its followers. Then, the followers write the entry to their disks. This results in two sequential disk writes on the path to process a request, contributing significant latency for deployments where disk writes are a dominant factor.
Fortunately, the leader can write to its disk in parallel with replicating to the followers and them writing to their disks ...

On Saturday, October 8, 2016 at 0:49:25 UTC+3, Ayende Rahien wrote:
It _has_ to first write to the disk, and only then send it out.


    Юрий Соколов

Oct 7, 2016, 6:51:53 PM
    to raft-dev
    The leader may even commit an entry before it has been written to its own disk, if a majority of
    followers have written it to their disks; this is still safe. LogCabin implements this optimization.

(this is also a quote from 10.2.1)


    Oren Eini (Ayende Rahien)

Oct 8, 2016, 1:59:44 AM
    to raft...@googlegroups.com
Yes, but if the optimization discussed here is implemented, you _have_ to sync to disk first (and probably pay an fsync).

Consider the case:

* N1 is the leader
* N2 & N3 are followers

N1 gets an entry and sends it to N2 & N3 with a note that it is committed (since if they get it, it will be on a majority).

N3 never gets the message.
N1 crashes and recovers, but it didn't write the entry to disk.

So N1 & N3 pick N3 as the leader.
However, now we have a committed entry that isn't in the leader's log.
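This failure can be replayed as a toy simulation (everything here is invented for illustration):

```python
# Three nodes; each "log" is just a list of entries, which survives a crash
# only if the node fsynced it first.
logs = {"N1": [], "N2": [], "N3": []}
entry = (1, "x=1")  # (term, command)

# N1 (the leader) sends the entry marked committed *before* its own write
# is durable.
logs["N2"].append(entry)   # N2 receives it, persists it, applies it
                           # N3 never receives the message
# N1 crashes before its fsync; on restart its log is empty again.
logs["N1"] = []

# N1 and N3 form a majority (2 of 3) and elect N3, whose log lacks the entry:
new_leader = "N3"
assert entry not in logs[new_leader]   # a "committed" entry the leader lacks
assert entry in logs["N2"]             # ...even though N2 already applied it
```

This is why, under the proposed optimization, the leader must make entry N durable locally before sending the AppendEntries that marks it committed.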

    jordan.h...@gmail.com

Oct 8, 2016, 6:15:42 AM
    to raft...@googlegroups.com
Yeah... I was thinking about that too. I suppose its usefulness depends on whether followers are for some reason better off learning about committed entries faster. In this case, a follower learns an entry is committed before the leader does. But being able to flush to disk while replication is already in progress is a nice optimization and probably better for most implementations.

    Юрий Соколов

Oct 8, 2016, 6:31:05 AM
    to raft-dev
I've said:

"I doubt it could ever be an optimization, because the optimized version sends AppendEntries to all followers (and writes to disk) simultaneously."

So, I mean, there is no reason to implement this topic's "optimization", because it is better to implement the optimization from dissertation section 10.2.1, "Writing to the leader's disk in parallel".

In normal mode (a non-lagging majority), 10.2.1 gives a leader latency of "one network round + one write to disk", while the discussed optimization still gives "two writes to disk + one network round". On a follower (even with a cluster of 3 and a majority of 2), 10.2.1 gives a latency of "3/2 network rounds + one write to disk", while the discussed optimization gives "two writes to disk + 1/2 network round".

BTW, fsync is necessary in any mode, IMHO. If you think a consensus protocol is viable without fsync... let's not discuss it here; I don't really want to hear that.

    Oren Eini (Ayende Rahien)

Oct 8, 2016, 7:32:40 AM
    to raft...@googlegroups.com
You can do persistent consensus without fsync, but this requires some sort of append-only storage, since you need to be able to lose everything past a certain point (and it requires that you understand that in a log of 1, 2, 3, 4, you might have 1, 2, X, 4 after a restart).

But the key point is that on startup you might have previously sent confirmations that you have "forgotten" about, so you are forced to come back up as a new node, effectively.

In short, it isn't worth it.



    Archie Cobbs

Oct 8, 2016, 2:52:10 PM
    to raft-dev

On Saturday, October 8, 2016 at 5:31:05 AM UTC-5, Юрий Соколов wrote:
BTW, fsync is necessary in any mode, IMHO. If you think a consensus protocol is viable without fsync... let's not discuss it here; I don't really want to hear that.

I agree - for the purpose of this discussion let's assume that everybody must fsync() (i.e., truly persist) a log entry before it is considered applied. Any other option is a separate discussion. But in fact it doesn't really matter what happens - see below.

    On Saturday, October 8, 2016 at 5:31:05 AM UTC-5, Юрий Соколов wrote:
    In normal mode (majority is not lagging)
    10.2.1 will give leader latency of "one network round + write to disk", while discussed optimization is still "two writes to disk + one network round". On follower (even with cluster of 3 and majority 2) 10.2.1 will give latency "3/2 network round + one write to disk", while discussed optimization "two writes to disk + 1/2 network round".

    So it depends on the relative time of disk vs. network.

    Let's say applying a log entry takes time Ta. This is the time it takes for a node to write a new log entry and fsync() the disk (or whatever you're doing :).

    Let's say the one-way communication time between nodes is Tc. So the round trip time RTT = 2 * Tc.

    Now suppose the leader of a 2 or 3 node cluster is at the point where it is going to append a new log entry. Let's see what happens if you do or don't do the optimization.

    Note what we are measuring here is how long it takes before the follower gets confirmation that the new log entry has been committed. That's what this optimization is trying to optimize.

    Of course, whether this is important to your application is a separate discussion. But for the purpose of argument let's assume that this is an important thing to optimize.

    Without the optimization suggested by this thread, this happens:
    1. Leader applies log entry locally - takes time Ta
    2. Leader sends log entry to followers - takes time Tc
    3. Followers apply log entry locally - takes time Ta
    4. Followers reply positively to leader - takes time Tc
    5. Leader updates commitIndex
  6. Leader sends new leaderCommit to followers - takes time Tc
Steps 2 - 6 must execute in sequence, but the leader can do step #1 in parallel with steps 2 - 4, as suggested by section 10.2.1.

In that case the total time for a follower to get updated with the new leaderCommit is Ta + 3*Tc. This is doing steps 1 and 3 in parallel (Ta) plus steps 2, 4, 5, 6 in serial (3*Tc).

    With the optimization suggested by this thread, this happens:
    1. Leader applies log entry locally - takes time Ta
    2. Leader sends log entry - with new leaderCommit - to follower - takes time Tc
    3. Follower applies log entry locally - takes time Ta
    4. Followers reply positively to leader - takes time Tc
    5. Leader updates commitIndex
  6. Leader sends new leaderCommit to followers - takes time Tc
We CANNOT do steps #1 and #2 in parallel, as pointed out by Oren.

However, the total time for the follower to get updated with the new leaderCommit is 2*Ta + Tc, because this happens at the end of step 3.

So the comparison is 2*Ta + Tc (with the optimization)  vs.  Ta + 3*Tc (without it).

    So it depends on the actual values for Ta and Tc whether the optimization is worth it.

    Doing the math, the optimization is worth it if Ta < 2*Tc.

    That is, the optimization is worth it if it takes less than one RTT to persist a new log entry.
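Plugging in illustrative numbers (these particular Ta and Tc values are only examples, not measurements from the thread) makes the break-even point concrete:

```python
def follower_commit_latency(ta_us, tc_us, early_commit):
    """Time until a follower learns its entry is committed.

    early_commit=True:  leader fsync (Ta), then send with leaderCommit (Tc),
                        then follower fsync (Ta)            -> 2*Ta + Tc
    early_commit=False: 10.2.1 only; the leader's fsync overlaps the send,
                        and the new commitIndex arrives one message later
                                                            -> Ta + 3*Tc
    """
    return 2 * ta_us + tc_us if early_commit else ta_us + 3 * tc_us

ta, tc = 450, 250  # e.g. ~450 us durable SSD write, ~250 us one-way network
with_opt = follower_commit_latency(ta, tc, True)      # 2*450 + 250 = 1150 us
without_opt = follower_commit_latency(ta, tc, False)  # 450 + 3*250 = 1200 us
assert (with_opt < without_opt) == (ta < 2 * tc)      # break-even at Ta = RTT
```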

    -Archie


    Oren Eini (Ayende Rahien)

Oct 8, 2016, 3:07:06 PM
    to raft...@googlegroups.com
    In this case, this is relevant:

So the round trip time (RTT = 2*Tc) is about 500 us

I ran some tests on HD speed for durable writes; you can find them here:

    For non fsync buffered writes, the cost for a single write is 30 us.

For durable writes, the cost is around 300 us on Linux, around 120 us on Windows.

    Note that those are SSD drives, on HDD, you will see about 700 us (and that is without any contention).




    Archie Cobbs

Oct 8, 2016, 7:07:31 PM
    to raft-dev
    On Saturday, October 8, 2016 at 2:07:06 PM UTC-5, Ayende Rahien wrote:
    For durable writes, the cost is around 300 us o Linux, around 120 us on Windows.

    Note that those are SSD drives, on HDD, you will see about 700 us (and that is without any contention).

    On my MacBook Pro (with SSD) I'm seeing similar durable persist times averaging around 450 usec, but with occasional much higher values.

    So if RTT is 500 usec then the optimization gives a slight improvement.

    So another consideration, especially when the two options are roughly comparable, is variance.

I would guess that unless your network is overcrowded, SSD write latency probably has higher variance than network communication. At least network congestion seems like an easier problem to solve than SSD I/O bandwidth congestion.

If Ta is more highly variable than Tc, the option that has less Ta and more Tc (i.e., not applying the optimization) would be preferable.

    Still, seems like not a huge difference either way with current technology.

    -Archie

    Юрий Соколов

Oct 9, 2016, 2:52:09 AM
    to raft-dev
    > So the comparison is 2*Ta + Tc (without optimization 10.2.1, but with discussed) vs. Ta + 3*Tc (with optimization 10.2.1).

I've said exactly that - I just measured in network round trips, so it sounded like "2*Ta + 1/2*Tr" and "Ta + 3/2*Tr".

And you're forgetting about leader latency: with 10.2.1 it is "Ta + 2*Tc", while with the discussed optimization it is "2*Ta + 2*Tc".

    Archie Cobbs

Oct 9, 2016, 12:37:14 PM
    to raft-dev

    Yep - good point.

    Thanks.