log persistence craftiness for improving performance of linearizable read


Hitoshi Mitake

Aug 15, 2016, 3:49:52 AM
to raft...@googlegroups.com, Xiang Li, anthony...@coreos.com
Hi Raft folks,

I'm considering how to improve the performance of linearizable reads in
Raft. Even when requests to a replicated state machine have no side
effects (e.g. read requests to a KVS), Raft needs to write them to its
persistent log, so the synchronous write limits read performance.

There are some known workarounds for avoiding the overhead. Serializable
reads can be served without log persistence (and without consensus of
the cluster), and batched linearizable reads (described in section 6.4
of the thesis) can improve the throughput of linearizable reads. However,
serializable reads sacrifice consistency and batched linearizable reads
sacrifice latency.

The simplest approach to avoiding the log persistence without sacrificing
consistency or latency would be to just skip writing such an entry.
However, this approach introduces holes in the log, so it would be
harmful to Raft's safety properties.

My plan is to delay writing log entries without side effects until a
log entry with a side effect is appended. With this scheme, holes in
the log are avoided as long as an entry with a side effect follows the
entries without side effects. If no such entry appears in the log, the
entries without side effects can be lost safely, because replaying them
does nothing to the replicated state machine.
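
To make the plan concrete, here is a minimal sketch (Go-like; Entry,
WAL and the side-effect flag are hypothetical names for illustration,
not etcd code) of how a leader could defer persisting side-effect-free
entries:

    // Sketch only: Entry, WAL and HasSideEffect are hypothetical,
    // not types from etcd or any real Raft library.
    type Entry struct {
        Index         uint64
        Data          []byte
        HasSideEffect bool
    }

    type WAL struct {
        pending []Entry // entries appended but not yet persisted
    }

    // Append decides whether an entry must be fsynced now or can wait.
    func (w *WAL) Append(e Entry) error {
        w.pending = append(w.pending, e)
        if !e.HasSideEffect {
            // Defer persistence: losing this entry on a crash is safe,
            // because replaying it would not change the state machine.
            return nil
        }
        // A side-effecting entry arrived: flush all deferred entries
        // together with it, so the persisted log has no holes.
        return w.syncPending()
    }

    func (w *WAL) syncPending() error {
        // write w.pending to disk and fsync (omitted in this sketch)
        w.pending = w.pending[:0]
        return nil
    }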

Of course, this approach is a little bit dirty because, to implement it,
Raft needs to interpret the log entries and check whether they have side
effects. Normally, the contents of log entries should be uninterpreted
byte sequences from Raft's perspective. However, my simple experiment
shows that the approach can improve linearizable read performance by
more than 20x in the case of etcd (the benchmark result can be found in
the etcd issue below).

The plan is quite simple, so someone else has probably considered or
tried it in the past. I'd be really happy to hear opinions or
experiences from others.

I'm implementing the idea in etcd; a rough draft can be found here:
and the ongoing discussion can be found here:

Thanks,
Hitoshi

Archie Cobbs

Aug 15, 2016, 9:30:02 AM
to raft-dev, xian...@coreos.com, anthony...@coreos.com
On Monday, August 15, 2016 at 2:49:52 AM UTC-5, Hitoshi Mitake wrote:
Even when requests to a replicated state machine have no side
effects (e.g. read requests to a KVS), Raft needs to write them to its
persistent log.

I'm already confused by this second sentence and must be missing something.

Why do reads need to be persisted into the log in order to be linearizable?

 Note the requirement is "linearizable", not "linearized". You don't have to actually linearize the operations...

-Archie

David B Murray

Aug 15, 2016, 10:23:07 AM
to raft...@googlegroups.com, xian...@coreos.com, anthony...@coreos.com
The point of persisting a no-op in the log is to find the "current" (for some definition of current) end of the log. If you then wait until your local state machine has processed up through that point before answering the read, you prevent stale reads.

If you've got some other useful thing to append to the log at around the same time, you can certainly piggyback on that to discover the end of the log. You do still need to wait before executing the read though.
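
Roughly, in code (a Go-style sketch; appendNoop, waitApplied and
localGet are made-up helper names, not any particular library's API):

    // Sketch of a linearizable read that piggybacks on a log entry.
    func linearizableRead(key string) (string, error) {
        // 1. Put something in the log (here a no-op) to learn the
        //    "current" end of the log.
        idx, err := appendNoop()
        if err != nil {
            return "", err
        }
        // 2. Wait until the local state machine has processed up
        //    through that entry; only then is it safe to answer.
        if err := waitApplied(idx); err != nil {
            return "", err
        }
        return localGet(key), nil
    }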

-d


Archie Cobbs

Aug 15, 2016, 10:37:02 AM
to raft-dev, xian...@coreos.com, anthony...@coreos.com
Thanks, now I get it. I agree any reasonable implementation ought to optimize follower reads, especially since in the common case the follower will already be up to date.

E.g., simply asking the leader what its current end of the log is, and waiting for that entry to appear.
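
Something like this, roughly (a Go-style sketch with invented helper
names, just to illustrate what I mean):

    // Sketch of a follower-side read: ask the leader for its current
    // last log index, wait until the local state machine has applied
    // at least that far, then read locally. askLeaderLastIndex,
    // waitAppliedLocally and localGet are hypothetical helpers.
    func followerRead(key string) (string, error) {
        idx, err := askLeaderLastIndex()
        if err != nil {
            return "", err
        }
        waitAppliedLocally(idx)
        return localGet(key), nil
    }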

I guess I thought this was a solved problem, but I may still be missing something.

-Archie

David B Murray

Aug 15, 2016, 4:03:55 PM
to raft...@googlegroups.com, Xiang Li, anthony...@coreos.com
E.g., simply asking the leader what its current end of the log is, ...

The thing to watch out for here is accidentally asking a "leader" who is not a leader any more, but hasn't realized it yet. Depending on how much you trust your clocks you might be able to get away with a leader assuming no one will try to vote it out for some amount of time after it gets elected, but the only way to be *sure* (that I have seen; would love to be wrong) is to successfully put something in the log after getting the read request.

In my experience, if you want to scale reads up you need to give up on linearizability.

-d


Hitoshi Mitake

Aug 17, 2016, 3:12:41 AM
to raft...@googlegroups.com, Xiang Li, anthony...@coreos.com
Hi Archie, David, thanks for your replies.

On Tue, Aug 16, 2016 at 5:03 AM, David B Murray <david....@gmail.com> wrote:
E.g., simply asking the leader what its current end of the log is, ...

The thing to watch out for here is accidentally asking a "leader" who is not a leader any more, but hasn't realized it yet. Depending on how much you trust your clocks you might be able to get away with a leader assuming no one will try to vote it out for some amount of time after it gets elected, but the only way to be *sure* (that I have seen; would love to be wrong) is to successfully put something in the log after getting the read request.

Yes, to prevent a stale read from an isolated node (which can think of itself as the leader before the next election starts), read requests must be appended to the logs of a quorum.

In my experience, if you want to scale reads up you need to give up on linearizability.

To improve scalability, of course, linearizability must be sacrificed. My idea focuses on skipping the synchronous disk write in read request processing; it does not address scalability (even if the number of nodes increases, read performance cannot be improved).

Thanks,
Hitoshi

Diego Ongaro

Aug 22, 2016, 4:52:09 PM
to raft...@googlegroups.com
I'm still confused by the original email. Writing something into the log gives you an obvious linearization (the log index), and that's simple but slow. My dissertation explains how to get the exact same guarantees without touching the log in section 6.4.0, requiring one round of communication from the leader to half the followers to confirm its leadership. (David, I hope you do truly love being wrong.) Then 6.4.1 talks about eliminating the round of communication by relying on clocks.
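
In pseudocode, the 6.4.0 approach is roughly the following (a Go-style
sketch with invented names; this is not LogCabin or etcd code):

    // Sketch of the section 6.4 read-only optimization: confirm
    // leadership with one heartbeat round instead of a log write.
    // commitIndex, confirmLeadershipWithQuorum, waitApplied, localGet
    // and errNotLeader are hypothetical.
    func readOnly(key string) (string, error) {
        // 1. Record the current commit index as the read index.
        //    (A new leader must first have committed an entry of its
        //    own term.)
        readIndex := commitIndex()

        // 2. Send a round of heartbeats and wait for acks from a
        //    majority; this proves no newer leader exists, without
        //    appending anything to the log.
        if !confirmLeadershipWithQuorum() {
            return "", errNotLeader
        }

        // 3. Wait until the state machine has applied at least
        //    readIndex, then serve the read from local state.
        waitApplied(readIndex)
        return localGet(key), nil
    }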

Hitoshi, is your proposal solving a different problem than 6.4.0? Or is it better by some metric than the solution in 6.4.0? It may also help if you walk us through a detailed example of a read request with your proposed hole approach.

-Diego

David B Murray

Aug 22, 2016, 6:23:20 PM
to raft...@googlegroups.com
Oh hey, I *do* love being wrong (and I clearly need to get around to reading the rest of your dissertation one of these days :)).

-d

Hitoshi Mitake

Aug 24, 2016, 2:43:58 AM
to raft...@googlegroups.com
Hi Diego,

On Tue, Aug 23, 2016 at 5:51 AM, Diego Ongaro <onga...@gmail.com> wrote:
I'm still confused by the original email. Writing something into the log gives you an obvious linearization (the log index), and that's simple but slow. My dissertation explains how to get the exact same guarantees without touching the log in section 6.4.0, requiring one round of communication from the leader to half the followers to confirm its leadership. (David, I hope you do truly love being wrong.) Then 6.4.1 talks about eliminating the round of communication by relying on clocks.

Hitoshi, is your proposal solving a different problem than 6.4.0? Or is it better by some metric than the solution in 6.4.0? It may also help if you walk us through a detailed example of a read request with your proposed hole approach.

I read section 6.4.0 of your thesis again and noticed that my idea is almost the same as the idea described in that section.

In the steps described in the section, step 3 has the below line:
"It issues a new round of heartbeats and waits for their acknowledgments from a majority of the cluster."

When I first read the description, I didn't think that the "new round of heartbeats" was triggered by the leader's read request processing. So I thought it was a throughput optimization that batches multiple read requests (the batching idea is described in the sentence beginning "To improve efficiency further...").

However, if the round is triggered by read request processing (is this right?), the optimization allows read requests to be processed in a timely manner without a synchronous disk write, and that is almost the same as my idea.

Sorry for the confusion, and thanks for correcting my understanding!

Thanks,
Hitoshi

Diego Ongaro

Aug 25, 2016, 1:18:37 PM
to raft...@googlegroups.com
On Tue, Aug 23, 2016 at 11:43 PM, Hitoshi Mitake <mitake....@gmail.com> wrote:
Hi Diego,

On Tue, Aug 23, 2016 at 5:51 AM, Diego Ongaro <onga...@gmail.com> wrote:
I'm still confused by the original email. Writing something into the log gives you an obvious linearization (the log index), and that's simple but slow. My dissertation explains how to get the exact same guarantees without touching the log in section 6.4.0, requiring one round of communication from the leader to half the followers to confirm its leadership. (David, I hope you do truly love being wrong.) Then 6.4.1 talks about eliminating the round of communication by relying on clocks.

Hitoshi, is your proposal solving a different problem than 6.4.0? Or is it better by some metric than the solution in 6.4.0? It may also help if you walk us through a detailed example of a read request with your proposed hole approach.

I read section 6.4.0 of your thesis again and noticed that my idea is almost the same as the idea described in that section.

In the steps described in the section, step 3 has the below line:
"It issues a new round of heartbeats and waits for their acknowledgments from a majority of the cluster."

When I first read the description, I didn't think that the "new round of heartbeats" was triggered by the leader's read request processing. So I thought it was a throughput optimization that batches multiple read requests (the batching idea is described in the sentence beginning "To improve efficiency further...").

However, if the round is triggered by read request processing (is this right?), the optimization allows read requests to be processed in a timely manner without a synchronous disk write, and that is almost the same as my idea.

Yep, I'll leave that decision up to you. As a simple point in the design space, LogCabin sends out the heartbeats right away if there aren't any outstanding, otherwise queues the next round of heartbeats when the first gets back. The next round should satisfy any reads that have arrived in the meantime.
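
The scheduling looks roughly like this (a Go-style sketch with invented
names; not the actual LogCabin code):

    // Batch reads onto heartbeat rounds: a read is answered only by a
    // round that started after the read arrived. Read,
    // sendHeartbeatRound and serveAfterApply are hypothetical.
    var (
        currentRound  []Read // reads covered by the in-flight round
        nextRound     []Read // reads that arrived after it was sent
        roundInFlight bool
    )

    func onReadRequest(r Read) {
        if !roundInFlight {
            currentRound = append(currentRound, r)
            roundInFlight = true
            sendHeartbeatRound() // confirm leadership, no log write
            return
        }
        // A round is already out; this read waits for the next one.
        nextRound = append(nextRound, r)
    }

    func onRoundAckedByQuorum() {
        for _, r := range currentRound {
            serveAfterApply(r) // answer once the state machine catches up
        }
        currentRound, nextRound = nextRound, nil
        if len(currentRound) > 0 {
            sendHeartbeatRound()
        } else {
            roundInFlight = false
        }
    }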
 
Sorry for the confusion, and thanks for correcting my understanding!

No problem, just glad we're on the same page now.

Hitoshi Mitake

Aug 26, 2016, 2:17:48 AM
to raft...@googlegroups.com
On Fri, Aug 26, 2016 at 2:18 AM, Diego Ongaro <onga...@gmail.com> wrote:
On Tue, Aug 23, 2016 at 11:43 PM, Hitoshi Mitake <mitake....@gmail.com> wrote:
Hi Diego,

On Tue, Aug 23, 2016 at 5:51 AM, Diego Ongaro <onga...@gmail.com> wrote:
I'm still confused by the original email. Writing something into the log gives you an obvious linearization (the log index), and that's simple but slow. My dissertation explains how to get the exact same guarantees without touching the log in section 6.4.0, requiring one round of communication from the leader to half the followers to confirm its leadership. (David, I hope you do truly love being wrong.) Then 6.4.1 talks about eliminating the round of communication by relying on clocks.

Hitoshi, is your proposal solving a different problem than 6.4.0? Or is it better by some metric than the solution in 6.4.0? It may also help if you walk us through a detailed example of a read request with your proposed hole approach.

I read section 6.4.0 of your thesis again and noticed that my idea is almost the same as the idea described in that section.

In the steps described in the section, step 3 has the below line:
"It issues a new round of heartbeats and waits for their acknowledgments from a majority of the cluster."

When I first read the description, I didn't think that the "new round of heartbeats" was triggered by the leader's read request processing. So I thought it was a throughput optimization that batches multiple read requests (the batching idea is described in the sentence beginning "To improve efficiency further...").

However, if the round is triggered by read request processing (is this right?), the optimization allows read requests to be processed in a timely manner without a synchronous disk write, and that is almost the same as my idea.

Yep, I'll leave that decision up to you. As a simple point in the design space, LogCabin sends out the heartbeats right away if there aren't any outstanding, otherwise queues the next round of heartbeats when the first gets back. The next round should satisfy any reads that have arrived in the meantime.

I see. I'll play with LogCabin to understand the behavior of the mechanism.

Thanks for your explanation,
Hitoshi