Batching, latency and throughput


Hugues Evrard

Jun 7, 2016, 4:35:44 AM
to raft...@googlegroups.com
Hi all,

Batching, i.e., gathering multiple entries in one AER (Append Entries
Request), is an effective way to cope with high throughput. However, if
it is implemented as "the leader waits a fixed amount of time to gather
entries before sending an AER", it hurts latency.

For instance, I made some experiments with CoreOS etcd 18 months ago:
they used to make the leader wait 50ms before sending an AER so that it
could do batching (I _guess_ 50ms also corresponds to the heartbeat
delay). Under a series of write requests from a single client, which
waits for the answer before sending the next request, this approach led
to mediocre performance: approx. <nb requests> x 50ms (so 100 requests
took 5 seconds). So batching degrades the latency of serial requests from
a single client, but leads to good throughput under many concurrent
requests from several distinct clients, which must be the main use case
for etcd.

My question is: does any Raft implementation try to keep latency low for
serial requests, and use batching only when necessary to handle many
concurrent requests? For example, batching over a variable delay that
depends on the workload?
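To make the idea concrete, a rough sketch of such a variable-delay batcher might look like the following (Go used just for illustration; the Entry type, channel and send callback are made up, not taken from any real implementation):

```go
package batchsketch

import "time"

// Entry stands in for a Raft log entry; send would issue one AER.
type Entry struct{ Data []byte }

// batcher never delays a lone proposal, but once proposals start queuing
// up it gathers them, waiting at most maxWait for stragglers, so the
// added latency is bounded and only paid under load.
type batcher struct {
	in      chan Entry          // buffered channel of client proposals
	send    func(batch []Entry) // ships one AppendEntries request
	maxWait time.Duration       // e.g. 1ms, the upper bound on added latency
}

func (b *batcher) run() {
	for first := range b.in {
		batch := []Entry{first}
		// Drain whatever is already queued; this adds no latency at all.
		for len(b.in) > 0 {
			batch = append(batch, <-b.in)
		}
		// Only if something was queued (i.e. we are under load) wait a
		// little longer to amortise the per-AER cost further.
		if len(batch) > 1 {
			timer := time.NewTimer(b.maxWait)
			for open := true; open; {
				select {
				case e := <-b.in:
					batch = append(batch, e)
				case <-timer.C:
					open = false
				}
			}
		}
		b.send(batch)
	}
}
```

The point is that a lone serial client never pays the batching delay, while concurrent clients get batched automatically once entries start queuing up.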

(This question came to mind when I read the "Process / Flow Control"
mail thread started by Philip Haynes, but I thought it would be better
to isolate it in a new thread.)

Best,
Hugues

--
Hugues Evrard, Eng. PhD
Team CONVECS, Inria & LIG
http://hevrard.org

Oren Eini (Ayende Rahien)

Jun 7, 2016, 4:40:06 AM
to raft...@googlegroups.com
Technically speaking, you can have the client indicate how long it is willing to wait, or whether to just send whatever it has right now.
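A minimal sketch of that idea, with a hypothetical Proposal type carrying a per-request latency budget (nothing here is taken from an actual implementation):

```go
package waithint

import (
	"math"
	"time"
)

// Proposal is a hypothetical client request: MaxDelay is how long the
// client is willing to let the leader hold it back for batching.
// Zero means "send whatever you have right now".
type Proposal struct {
	Data     []byte
	MaxDelay time.Duration
}

// batchWindow picks the effective batching delay for the currently
// buffered proposals: the smallest budget wins, so a latency-sensitive
// client is never held hostage by throughput-oriented ones.
func batchWindow(pending []Proposal) time.Duration {
	if len(pending) == 0 {
		return 0
	}
	window := time.Duration(math.MaxInt64)
	for _, p := range pending {
		if p.MaxDelay < window {
			window = p.MaxDelay
		}
	}
	return window
}
```

The leader would then use the smallest outstanding budget as its batching window, so a caller asking for "right now" forces an immediate send.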


Hibernating Rhinos Ltd
Oren Eini | CEO | Mobile: +972-52-548-6969
Office: +972-4-622-7811 | Fax: +972-153-4-622-7811



Philip Haynes

Jun 7, 2016, 8:22:19 PM
to raft-dev, Hugues...@inria.fr
Hi Hugues,

Our experience with low-latency techniques is that throughput versus latency is not actually an engineering tradeoff.
Aeron enables trivial saturation of a 1G network with small messages. The new flow control algorithm
allows ~5 AER messages queued up in flight at all times, so followers are maxed out servicing requests.
Current non-optimal performance numbers show throughput of ~30K AERs per second with a mean latency of 300µs.

Thus service processing times would have to improve by at least one to two orders of magnitude
before message batching could make sense. Even then, message batching could complicate recovery.

RE: Concurrency. Multiple requests from distinct web clients are aggregated using an MPSC queue.
Main processing is done in a tight busy-spin loop within a single thread on dedicated CPU cores, so
you don't incur locking costs and can control your cache behaviour better.
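For readers unfamiliar with the pattern, here is a rough sketch of that shape (Go used only for illustration; the names and types are made up, and a real dedicated-core design would also pin the consumer thread to an isolated core at the OS level):

```go
package spinsketch

import (
	"runtime"
	"sync/atomic"
)

// Request is a placeholder for one client write request.
type Request struct{ Payload []byte }

// A buffered channel plays the role of the MPSC queue: any number of
// goroutines serving web clients send into it, one goroutine consumes.
var requests = make(chan Request, 1<<16)

// consume is the single consumer. It pins itself to an OS thread and
// busy-spins instead of blocking, so the hot path never takes a lock and
// stays warm in one core's caches.
func consume(stop *atomic.Bool, apply func(Request)) {
	runtime.LockOSThread()
	defer runtime.UnlockOSThread()
	for !stop.Load() {
		select {
		case r := <-requests:
			apply(r) // single-threaded: apply needs no locking
		default:
			runtime.Gosched() // nothing queued: spin, but let the scheduler breathe
		}
	}
}
```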

Given the current state of practice, a low-latency focus does require a lot of engineering effort. Additionally,
the operational characteristics of the system are quite different from a more conventional multi-threaded web design.
This implies a _lot_ of education for the operations team so they understand the behaviour of the system (e.g. why
the CPU cores are running maxed out, and how to use DTrace to see actual utilisation).

So for something like etcd, which is meant to be used by a lot of people, my guess is that you want an application
that looks like most others, and that low-latency craziness could well complicate acceptance of CoreOS's product.

I hope this answers your question.

Kind Regards,
Philip



Oren Eini (Ayende Rahien)

Jun 8, 2016, 6:19:57 AM
to raft...@googlegroups.com, Hugues...@inria.fr
I still think that you ignore the fact that you have a cost _per_ message.
In the case of RocksDB, flushing to the log is the expensive part.
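For what it's worth, the usual way to pay that per-message log cost less often without adding a fixed delay is group commit: the first writer to reach the sync flushes on behalf of everyone who appended in the meantime. A minimal sketch with hypothetical types (this is not RocksDB's actual API):

```go
package groupcommit

import (
	"os"
	"sync"
)

// wal groups concurrent appends so that one fsync covers many of them,
// amortising the per-message cost of flushing the log.
type wal struct {
	mu      sync.Mutex
	f       *os.File
	pending []chan error // appenders waiting for a covering sync
}

// Append writes one entry and returns once a sync covering it has completed.
// A lone writer pays one fsync, as before; N concurrent writers share one.
func (w *wal) Append(entry []byte) error {
	done := make(chan error, 1)

	w.mu.Lock()
	if _, err := w.f.Write(entry); err != nil {
		w.mu.Unlock()
		return err
	}
	w.pending = append(w.pending, done)
	leader := len(w.pending) == 1 // first appender of the group runs the sync
	w.mu.Unlock()

	if leader {
		// Take ownership of everything appended so far, then sync once.
		w.mu.Lock()
		group := w.pending
		w.pending = nil
		w.mu.Unlock()

		err := w.f.Sync()
		for _, ch := range group {
			ch <- err
		}
	}
	return <-done
}
```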


Philip Haynes

Jun 9, 2016, 6:38:17 AM
to raft-dev, Hugues...@inria.fr
Hi,

Per-message service time / cost obviously must include a fixed component associated with the Raft protocol itself.
Our measurements of this, which include calls to RocksDB, are used as part of the flow control system we are finalising for testing and review.
Provisionally, compared to the variance in on-the-wire processing, the service-time variance is
much smaller (~30-50µs for service times against 300-500,000µs of network-time variance, before the network stack fails).
Whilst it is conceivable the need may arise to introduce support for multiple entries per request, we are far from that.

Having gone through the journey of implementing a Raft flow control system, I now think it is an essential part of a
high-performance Raft implementation. If people are interested, I will write up the algorithm we ended up with,
along with supporting test data and dead ends.

Cheers,
Philip

Oren Eini (Ayende Rahien)

Jun 9, 2016, 6:39:24 AM
to raft...@googlegroups.com, Hugues...@inria.fr
I would love to read that, yes.

Henrik Ingo

Jun 13, 2016, 5:08:53 AM
to raft...@googlegroups.com, Hugues...@inria.fr
On Wed, Jun 8, 2016 at 1:19 PM, Oren Eini (Ayende Rahien) <aye...@ayende.com> wrote:
I still think that you ignore the fact that you have a cost _per_ message.
In the case of RocksDB, flushing to the log is the expensive part.


Tangential, but at this point I always have an urge to point out that although the Raft paper does mandate flushing messages to durable storage, this is not really required.

More precisely, it depends on what type of failure you want to guarantee availability against. Generally a distributed database system (such as the one described by Raft) assures persistence in failure scenarios where at least M out of N nodes are still healthy. Typically M is a majority, so > ½N. Within this model there is no requirement to "ever" flush anything to durable storage, because at least M nodes are always expected to work perfectly fine. The system of course has to be deployed in a way such that node failures are independent; for example, in AWS you would have to use different regions.

The Raft paper mandates that committed data must be flushed to durable storage. This provides the additional assurance that if all nodes were to fail simultaneously, for example because they were all in the same AWS region, you will not lose any committed data. On the other hand, this additional safety comes with a rather high performance cost. Apart from this additional durability guarantee, there's no benefit in flushing the log all the time, and personally I would still consider a database that doesn't do it a "Raft implementation". (Not that my opinion of course has any authority on that label...)

henrik

--
henri...@avoinelama.fi
+358-40-5697354        skype: henrik.ingo            irc: hingo
www.openlife.cc

My LinkedIn profile: http://fi.linkedin.com/pub/henrik-ingo/3/232/8a7

Юрий Соколов

Jun 14, 2016, 2:44:44 PM
to raft-dev, Hugues...@inria.fr, henri...@avoinelama.fi
Henrik, I think you are not quite right.

Raft is a consensus protocol, but it does not handle Byzantine failures.

Imagine you have three servers A, B and C.
B was the leader, and it had connectivity with A.
B wrote record R1-1 to itself and A, so B assumes R1-1 is committed. Perhaps B even informs A that R1-1 is committed.
But A didn't flush this knowledge to disk, and it "reboots".
Then A gains connectivity with C. C didn't know about R1-1, and A forgot about R1-1,
so C can become leader for term 2 and commit record R2-1 (together with A).
And now you have split brain.

So, not flushing to durable storage can quickly lead to Byzantine failure.
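To make the required ordering concrete, here is a minimal sketch of the follower side, with made-up placeholder types; a real handler would also check terms and perform the log-consistency check:

```go
package durablesketch

import "os"

// Placeholder types; a real handler also carries terms, the leader id,
// the commit index, and so on.
type Entry struct {
	Term uint64
	Data []byte
}

type AppendEntriesRequest struct {
	PrevLogIndex uint64
	Entries      []Entry
}

type AppendEntriesReply struct {
	Success    bool
	MatchIndex uint64
}

// handleAppendEntries shows only the ordering under discussion: the
// follower must not acknowledge entries to the leader until they are on
// stable storage, otherwise a crash-and-restart lets an "acknowledged"
// entry vanish, which is exactly how A forgets R1-1 above.
func handleAppendEntries(req AppendEntriesRequest, logFile *os.File) AppendEntriesReply {
	for _, e := range req.Entries {
		if _, err := logFile.Write(e.Data); err != nil {
			return AppendEntriesReply{Success: false}
		}
	}
	if err := logFile.Sync(); err != nil { // fsync before replying
		return AppendEntriesReply{Success: false}
	}
	return AppendEntriesReply{
		Success:    true,
		MatchIndex: req.PrevLogIndex + uint64(len(req.Entries)),
	}
}
```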

Another approach: A should not rejoin the cluster after a "reboot". But that could quickly lead to
cluster degradation if several members "reboot" one after another in a short time.

So, while theoretically there is no need for durable storage, in practice it is
hard to build a durable system without durable storage.


Henrik Ingo

Jun 15, 2016, 7:00:54 AM
to raft...@googlegroups.com, Hugues...@inria.fr
On Tue, Jun 14, 2016 at 9:44 PM, Юрий Соколов <funny....@gmail.com> wrote:
Henrik, I think you are not quite right.

Raft is a consensus protocol, but it does not handle Byzantine failures.


Just to clarify, I haven't been talking about byzantine failure, nor does it seem that you are.
(https://en.wikipedia.org/wiki/Byzantine_fault_tolerance) Nor do I think trying to survive byzantine failures is a useful goal to set.

 
Imagine you have three servers A, B and C.
B was the leader, and it had connectivity with A.
B wrote record R1-1 to itself and A, so B assumes R1-1 is committed. Perhaps B even informs A that R1-1 is committed.
But A didn't flush this knowledge to disk, and it "reboots".
Then A gains connectivity with C. C didn't know about R1-1, and A forgot about R1-1,
so C can become leader for term 2 and commit record R2-1 (together with A).

Yes, I'm well aware of this discussion, and you have simply affirmed my point.

- If A "reboots", then you would expect node A to shutdown cleanly and flush its state to disk before completely shutting down.

- What you intended to say is that A crashes (and then reboots). In other words you're describing a situation where a majority of nodes are unavailable. This is precisely what I said as well: If the requirement, and this is commonly the requirement, is to survive failure of a minority of nodes, then surviving the kind of failure you describe, is outside of requirements.

- The old school thinking in your argument is to think that data is somehow magically safe because it was written to a disk. If you replace "A reboots" with "A has a disk failure after which disk is replaced with a new disk" in your sequence above, then you still cannot survive the failure. Hence, the argument for relying only on the distributed consensus part for durability, is that disk failures needn't be treated as a special kind of failure. Disks are just like any other components that can fail.
 
And now you have split brain.


Not if you implemented Raft correctly. Because B is not connected to a majority of the cluster, it must reject incoming requests. Hence it doesn't matter what state B is in.

 
So, while theoretically there is no need for durable storage, in practice it is
hard to build a durable system without durable storage.


Unfortunately, the "practical" world does not agree with this statement. In reality, pretty much all mainstream distributed databases have implementations that do not flush to disk on commit, either as the only option or as default option:

Only option: MySQL NDB Cluster

Default option: MySQL Galera Cluster, Riak, Cassandra, Couchbase

MongoDB by default is neither, but recent versions provide configurability to choose either option.

I'm unaware of what Redis and Amazon DynamoDB do. And by this I believe I have covered all relevant DB products that have any kind of mainstream adoption. In reality, then, aiming to survive minority failures is more than sufficient for achieving good HA, and the performance cost of flushing to disk is too big for people to actually use it.


Юрий Соколов

Jun 15, 2016, 8:18:03 AM
to raft-dev, Hugues...@inria.fr, henri...@avoinelama.fi


On Wednesday, June 15, 2016 at 14:00:54 UTC+3, Henrik Ingo wrote:
On Tue, Jun 14, 2016 at 9:44 PM, Юрий Соколов <funny....@gmail.com> wrote:
Henrik, I think you are not quite right.

Raft is a consensus protocol, but it does not handle Byzantine failures.


Just to clarify, I haven't been talking about byzantine failure, nor does it seem that you are.
(https://en.wikipedia.org/wiki/Byzantine_fault_tolerance) Nor do I think trying to survive byzantine failures is a useful goal to set.

 
Imagine you have three servers A, B and C.
B was the leader, and it had connectivity with A.
B wrote record R1-1 to itself and A, so B assumes R1-1 is committed. Perhaps B even informs A that R1-1 is committed.
But A didn't flush this knowledge to disk, and it "reboots".
Then A gains connectivity with C. C didn't know about R1-1, and A forgot about R1-1,
so C can become leader for term 2 and commit record R2-1 (together with A).

Yes, I'm well aware of this discussion, and you have simply affirmed my point.

- If A "reboots", then you would expect node A to shutdown cleanly and flush its state to disk before completely shutting down.

- What you intended to say is that A crashes (and then reboots). In other words you're describing a situation where a majority of nodes are unavailable. This is precisely what I said as well: If the requirement, and this is commonly the requirement, is to survive failure of a minority of nodes, then surviving the kind of failure you describe, is outside of requirements.

You are right. 

- The old school thinking in your argument is to think that data is somehow magically safe because it was written to a disk. If you replace "A reboots" with "A has a disk failure after which disk is replaced with a new disk" in your sequence above, then you still cannot survive the failure. Hence, the argument for relying only on the distributed consensus part for durability, is that disk failures needn't be treated as a special kind of failure. Disks are just like any other components that can fail.

You are wrong. First: a disk failure is the kind of failure that a system administrator should handle, and the system administrator should not allow A to rejoin the cluster with a wrong log.
But a server may crash and quickly restart without administrator interference (for example, if the power supply fails for a fraction of a second), and if the confirmed log records were synced to disk, it may safely rejoin the cluster.
 
 
And now you have split brain.


Not if you implemented Raft correctly. Because B is not connected to a majority of the cluster, it must reject incoming requests. Hence it doesn't matter what state B is in.

And you are wrong again: B was connected to a majority of the cluster - itself and A. That is a majority! So the Raft protocol allows B to assume it has committed R1-1.
Perhaps you mean C? But C is also connected to a majority - itself and A - so it may commit R2-1.

This case is a Byzantine failure which was caused by not writing to durable storage before responding to the leader.
Raft doesn't handle Byzantine failures.

This case could be fixed by not allowing A to rejoin the cluster after "failing", but that may lead to quick cluster degradation.

 
So, while theoretically there is no need for durable storage, in practice it is
hard to build a durable system without durable storage.


Unfortunately, the "practical" world does not agree with this statement. In reality, pretty much all mainstream distributed databases have implementations that do not flush to disk on commit, either as the only option or as the default option:


I've said "durable system". Not "practically used system".
Many users use "not durable" systems for performance.
 
Only option: MySQL NDB Cluster

Default option: MySQL Galera Cluster, Riak, Cassandra, Couchbase

MongoDB by default is neither, but recent versions provide configurability to choose either option.

I'm unaware of what Redis and Amazon DynamoDB do. And by this I believe I have covered all relevant DB products that have any kind of mainstream adoption. In reality, then, aiming to survive minority failures is more than sufficient for achieving good HA, and the performance cost of flushing to disk is too big for people to actually use it.


So, you just confirmed my point: many "practical" systems are not durable by default, but they
allow switching them into a durable state.

Let's just distinguish between "durable" and "some users think it is durable enough", OK?

Henrik Ingo

Jun 15, 2016, 9:51:23 AM
to raft...@googlegroups.com, Hugues...@inria.fr
On Wed, Jun 15, 2016 at 3:18 PM, Юрий Соколов <funny....@gmail.com> wrote:
- The old school thinking in your argument is to think that data is somehow magically safe because it was written to a disk. If you replace "A reboots" with "A has a disk failure after which disk is replaced with a new disk" in your sequence above, then you still cannot survive the failure. Hence, the argument for relying only on the distributed consensus part for durability, is that disk failures needn't be treated as a special kind of failure. Disks are just like any other components that can fail.


You are wrong. First: a disk failure is the kind of failure that a system administrator should handle, and the system administrator should not allow A to rejoin the cluster with a wrong log.
But a server may crash and quickly restart without administrator interference (for example, if the power supply fails for a fraction of a second), and if the confirmed log records were synced to disk, it may safely rejoin the cluster.
 

As I'm explaining below, allowing A to connect with C does not cause split brain.

It does of course cause R1-1 to be lost, I'm not arguing against that. I'm just saying that even when flushing the log to disk all the time, the world does still have failures where R1-1 can be lost. Hence, all that we can offer users is to be clear about what level of HA and durability a system offers. Neither your nor my opinion provides a 100% guarantee.
 
And now you have split brain.


Not if you implemented Raft correctly. Because B is not connected to a majority of the cluster, it must reject incoming requests. Hence it doesn't matter what state B is in.

And you are wrong again: B was connected to a majority of the cluster - itself and A. That is a majority! So the Raft protocol allows B to assume it has committed R1-1.
Perhaps you mean C? But C is also connected to a majority - itself and A - so it may commit R2-1.


Let me try to be clearer: B is no longer a primary, as it is not connected to a majority (neither A nor C, in fact).

Your worrying is justified to the extent that there is a short lived anomaly in this situation: For a short period B will remain a primary, until it reaches its election timeout, and during this period clients can still connect to it and read R1-1. And after the election timeout is reached, R1-1 is no longer readable from anywhere.

However, this loss of R1-1 is due to temporarily losing a majority of the cluster. Again, it is outside the scope of the given requirements.


This case could be fixed by not allowing A to rejoin the cluster after "failing", but that may lead to quick cluster degradation.


Not necessary.
 
I've said "durable system". Not "practically used system".
Many users use "not durable" systems for performance.
 
...
 
So, you just confirmed my point: many "practical" systems are not durable by default, but they
allow switching them into a durable state.

Let's just distinguish between "durable" and "some users think it is durable enough", OK?

My point is twofold:

 - There is no argument that flushing the log is "more durable". However, even that is just an incremental improvement; it is naive (and IMO legacy) thinking to say it is perfectly durable, or even some kind of optimal state that an implementation must achieve.

 - From a practical point of view, a system only implementing the more durable option, or even offering it as the default, is likely to see very little adoption. Offering both alternatives is of course ok. A minority of users will go for the more durable setting for sure.

henrik

Юрий Соколов

Jun 15, 2016, 3:18:37 PM
to raft-dev, Hugues...@inria.fr, henri...@avoinelama.fi


On Wednesday, June 15, 2016 at 16:51:23 UTC+3, Henrik Ingo wrote:
On Wed, Jun 15, 2016 at 3:18 PM, Юрий Соколов <funny....@gmail.com> wrote:
- The old school thinking in your argument is to think that data is somehow magically safe because it was written to a disk. If you replace "A reboots" with "A has a disk failure after which disk is replaced with a new disk" in your sequence above, then you still cannot survive the failure. Hence, the argument for relying only on the distributed consensus part for durability, is that disk failures needn't be treated as a special kind of failure. Disks are just like any other components that can fail.


You are wrong. First: a disk failure is the kind of failure that a system administrator should handle, and the system administrator should not allow A to rejoin the cluster with a wrong log.
But a server may crash and quickly restart without administrator interference (for example, if the power supply fails for a fraction of a second), and if the confirmed log records were synced to disk, it may safely rejoin the cluster.
 

As I'm explaining below, allowing A to connect with C does not cause split brain.

It does of course cause R1-1 to be lost, I'm not arguing against that. I'm just saying that even when flushing the log to disk all the time, the world does still have failures where R1-1 can be lost. Hence, all that we can offer users is to be clear about what level of HA and durability a system offers. Neither your nor my opinion provides a 100% guarantee.

B is sure that R1-1 is committed, and B *reports to the client that R1-1 is committed*!!!!
C is sure that R1-1 never existed, and C *reports to the client that (the effect of) R1-1 never existed*!!!!
B may not roll R1-1 back, because B is sure R1-1 is committed.
Raft has an invariant that committed records are never overwritten, and this invariant is violated.

It is hard for me to predict exactly what will happen in this case, but I doubt it will be a good thing.
 
 
And now you have split brain.


Not if you implemented Raft correctly. Because B is not connected to a majority of the cluster, it must reject incoming requests. Hence it doesn't matter what state B is in.

And you are wrong again: B was connected to a majority of the cluster - itself and A. That is a majority! So the Raft protocol allows B to assume it has committed R1-1.
Perhaps you mean C? But C is also connected to a majority - itself and A - so it may commit R2-1.


Let me try to be clearer: B is no longer a primary, as it is not connected to a majority (neither A nor C, in fact).

Your worrying is justified to the extent that there is a short lived anomaly in this situation: For a short period B will remain a primary, until it reaches its election timeout, and during this period clients can still connect to it and read R1-1. And after the election timeout is reached, R1-1 is no longer readable from anywhere.

However, this loss of R1-1 is due to temporarily losing a majority of the cluster. Again, it is outside the scope of the given requirements.

You confuse availability requirements with consistency requirements.
The presence of a majority is an availability requirement.
But Raft does not require the presence of a majority for consistency. It requires that all log records be written to durable storage before responding to the leader.
 

This case could be fixed by not allowing A to rejoin the cluster after "failing", but that may lead to quick cluster degradation.


Not necessary.

Not necessary leads to cluster degradation?
Here is a quote from the Raft book: "For example,
it can be dangerous for the cluster to automatically remove failed servers, as it could then be left
with too few replicas to satisfy the intended durability and fault-tolerance requirements."
 
 
I've said "durable system". Not "practically used system".
Many users use "not durable" systems for performance.
 
...
 
So, you just confirmed my point: many "practical" systems are not durable by default, but they
allow switching them into a durable state.

Let's just distinguish between "durable" and "some users think it is durable enough", OK?

My point is twofold:

 - There is no argument that flushing the log is "more durable". However, even that is just an incremental improvement; it is naive (and IMO legacy) thinking to say it is perfectly durable, or even some kind of optimal state that an implementation must achieve.

I never said "perfectly durable".
There are just different levels of durability.
And I'm saying: not flushing to disk is very, very unreliable.

For example: we had a server that "crashed" 10 times during 2 hours.
It destroyed 10GB of data in the process, because in some piece of code the wrong file descriptor was flushed :-(

The file cache is huge! If a server "crashes", hundreds of megabytes could easily be lost. Even gigabytes.
 
 - From a practical point of view, a system only implementing the more durable option, or even offering it as the default, is likely to see very little adoption. Offering both alternatives is of course ok. A minority of users will go for the more durable setting for sure.

PostgreSQL offers durable storage by default, and rarely does a user switch it to a less durable option.
So do Oracle, Microsoft SQL Server and IBM DB2.

Sorry, but you are not right. "More durable" is a very valuable option, and many systems offer it by default,
because it is much easier not to have "anomalies" than to recover from them.
It is much easier to have consistent data than to solve puzzles of inconsistency.
And consistency is almost impossible without durability.

If you don't need consistency, then you don't need a consensus protocol !!!!!!!!!!
Just use asynchronous replication, and it will cover all your needs.

Henrik Ingo

Jun 15, 2016, 5:34:03 PM
to raft...@googlegroups.com, Hugues...@inria.fr


On Wed, Jun 15, 2016 at 10:18 PM, Юрий Соколов <funny....@gmail.com> wrote:

Ok, now you're just shouting. I think we're done here...


PostgreSQL offers durable storage by default, and rarely does a user switch it to a less durable option.
So do Oracle, Microsoft SQL Server and IBM DB2.


Like I said, your school of thought usually comes from people using the traditional legacy databases, which are not distributed database systems but rely on the disk to provide durability. The above list perfectly illustrates what I mean.

Юрий Соколов

Jun 16, 2016, 3:40:40 AM
to raft...@googlegroups.com, Hugues...@inria.fr

I wish you never realize you are wrong.
Good day to you, Henrik.
