Questions on scalability in Raft


George Samman

Apr 13, 2016, 5:51:15 AM4/13/16
to raft-dev
Hi all, I was wondering if someone could answer the following questions on scalability for me. Thanks in advance.

What is the current time measurement?
           - For a transaction to be validated?
           - For consensus to be achieved?
Total volume (# of trades) per second?
What is the current measure of scalability?
Are there limitations on the number of fields?
Is the speed of the system impacted if the system is made more scalable?
Does synchronization have any impact on scalability?

Henrik Ingo

Apr 13, 2016, 9:29:02 AM4/13/16
to raft...@googlegroups.com
Hi George

I think your questions are more relevant to specific Raft
implementations than to the algorithm itself. Ongaro's thesis includes
some performance tests for the reference implementation they built.

henrik



--
henri...@avoinelama.fi
+358-40-5697354 skype: henrik.ingo irc: hingo
www.openlife.cc

My LinkedIn profile: http://fi.linkedin.com/pub/henrik-ingo/3/232/8a7

George Samman

Apr 13, 2016, 9:31:00 AM4/13/16
to raft-dev, henri...@avoinelama.fi
Any idea how I can find answers to those questions? Thanks

Jordan Halterman

Apr 13, 2016, 4:08:31 PM4/13/16
to raft...@googlegroups.com
Those measurements are in Diego's dissertation.

You'll probably have to talk to specific authors for that information. But in general, you'll find that most Raft implementations are limited in terms of scalability at some level simply because Raft is a leader-based algorithm. All operations ultimately have to go through the leader. But there are various ways to improve the performance of write and read operations, and that's why your question is implementation-dependent.

Some Raft implementations support sharding, which allows the cluster to scale far beyond what a typical Raft implementation could do. Similarly, some implementations may sacrifice linearizability for reads in order to scale them across multiple nodes (allowing reads from both leaders and followers). But even in those cases, there's some cost to maintaining sequential consistency, to ensure a read on a follower does not observe state that predates a write the client already completed through the leader.
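One common way to get that sequential-consistency guarantee for follower reads (a minimal sketch, not taken from any specific implementation: the `Follower` class and its methods here are hypothetical) is to have the client remember the log index of its last acknowledged write, and have the follower serve the read only once it has applied at least that index:

```python
# Hypothetical sketch: session-consistent reads from a Raft follower.
# The client passes the index of its last committed write; the follower
# refuses (or retries) until it has applied that far in the log.

class Follower:
    def __init__(self):
        self.last_applied = 0   # highest log index applied to local state
        self.state = {}         # replicated key-value state machine

    def apply(self, index, key, value):
        # Entries arrive from the leader in log order.
        self.state[key] = value
        self.last_applied = index

    def read(self, key, min_index):
        # Serving a read before catching up to min_index could expose
        # state that predates a write the client already saw committed.
        if self.last_applied < min_index:
            return None  # caller should retry or fall back to the leader
        return self.state.get(key)

f = Follower()
f.apply(1, "x", "old")
# Client wrote x="new" at log index 2 via the leader; this follower lags.
assert f.read("x", min_index=2) is None   # stale: refuse the read
f.apply(2, "x", "new")
assert f.read("x", min_index=2) == "new"  # caught up: safe to serve
```

The cost Jordan mentions is visible here: lagging followers must either block, retry, or bounce the client back to the leader.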

Performance is going to be largely dependent on what's being written to the cluster, the hardware on which the cluster is running, the number of clients submitting operations to the cluster, the ratio of reads to writes, the guarantees provided by the cluster for reads and writes, etc., so you should really just look at Diego's dissertation for an idea, and that will be as good as any information you'll get from the other authors of Raft implementations.


See section 10.3 for the performance charts.

I have some benchmarks from Copycat that show similar performance, but Diego's dissertation is more thorough and I think it should suffice for what you're looking for. If not, what is missing?

Philip Haynes

Apr 14, 2016, 6:17:57 AM4/14/16
to raft-dev
Hi George,

As other responders have indicated, the answer is a big "it depends" on your hardware and your usage patterns.
But I will share what I have found so far, to give you a feel for what you can expect.

Our Raft implementation is built using low-latency Java techniques (specifically, no memory allocations in the tight central loop), Aeron multicast messaging, and RocksDB.
We have finalised development and are currently deploying into a server environment running SmartOS (Solaris-derived), where we are doing performance tests.
Each server has dual E5-2660 Xeon CPUs, 256 GB RAM, and SSDs on 1G and 10G networks.

One of our Raft use cases is guaranteed messaging support. For relatively large messages (~5 KB), profiling shows actual performance matches
the expected performance models for the hardware's capability - particularly how fast one can write 5 KB log records to SSD. On my development Mac, for example,
where I read 10K event records from disk and then store these records in a three-node replica, I am getting ~1K committed records per second, where 40% of the time
is spent writing to disk (ballpark LogCabin results). Apart from disk I/O, the next big performance issue is queuing behaviour due to a long tail. As we roll out into the production
setup I am expecting a big leap in performance. The protocol itself isn't adding measurable overhead.


> What is the current time measurement?
>           - For a transaction to be validated?
~120 microseconds at > 100M per day - but this has nothing particularly to do with Raft.

>           - For consensus to be achieved?
In a 3-node cluster, ~80 microseconds + 2 x record write time + long-tail latency overhead.
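That latency model is easy to turn into a back-of-the-envelope number. In this sketch the ~80 microsecond consensus overhead comes from the figure above, but the per-record SSD write time and tail allowance are assumed values for illustration only:

```python
# Rough commit-latency model for a 3-node cluster, following the shape
# "overhead + 2 x record write time + tail". Only the 80 us overhead is
# from the thread; the other two figures are assumptions.

CONSENSUS_OVERHEAD_US = 80    # protocol overhead (given above)
RECORD_WRITE_US = 300         # assumed: fsync of one ~5 KB log record
TAIL_OVERHEAD_US = 100        # assumed: long-tail queuing allowance

commit_latency_us = CONSENSUS_OVERHEAD_US + 2 * RECORD_WRITE_US + TAIL_OVERHEAD_US

# With one outstanding commit at a time this bounds throughput; real
# implementations pipeline and batch, which raises it considerably.
serial_tps = 1_000_000 / commit_latency_us
print(f"commit latency ~{commit_latency_us} us, serial TPS ~{serial_tps:.0f}")
```

The point of the model is that disk write time dominates the protocol overhead, which matches the ~40% disk-time observation above.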

> Total volume (# of trades) per second?
On our 3-node cluster with 5 KB messages, I am expecting worst-case performance of 4K TPS, but hoping for better, as we employ a multicast design that allows us to scale horizontally.

> What is the current measure of scalability?
Not sure what you mean here.

> Are there limitations on the number of fields?
We have > 500 fields, so no, apart from the I/O consequences.
In our validation we make decisions only on the fields actually required, which significantly improves performance.

> Is the speed of the system impacted if the system is made more scalable?
Not sure what you mean here.

> Does synchronization have any impact on scalability?
Yes - you have to go through a strong leader, and you get queuing behaviour.

Hope this helps. If you are interested, I can share the final production hardware numbers with you as well when I get them.

Kind Regards,
Philip


Oren Eini (Ayende Rahien)

Apr 14, 2016, 6:23:03 AM4/14/16
to raft...@googlegroups.com
How are you writing to disk?
1K committed records/sec x 5 KB each is a pretty poor write rate for an SSD.
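For context, the arithmetic behind that observation is simple: the reported rate works out to only a few MB/s of log writes, far below typical SSD sequential write rates (hundreds of MB/s):

```python
# Quick check of the write rate being questioned: 1K commits/sec
# at ~5 KB per record.
records_per_sec = 1_000
record_size_kb = 5
mb_per_sec = records_per_sec * record_size_kb / 1024
print(f"~{mb_per_sec:.1f} MB/s")  # well under SSD sequential throughput
```

The gap usually points at per-record fsync latency rather than raw bandwidth, which is consistent with the 40% disk-time figure mentioned earlier in the thread.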


Hibernating Rhinos Ltd

Oren Eini l CEO Mobile: +972-52-548-6969

Office: +972-4-622-7811 l Fax: +972-153-4-622-7811


George Samman

Apr 14, 2016, 8:51:35 AM4/14/16
to raft-dev
Hi Jordan:

Thanks much. Is there any chance I can email you, as I may have a few more questions around implementation? Unless of course you don't mind me putting them right in here. Thanks again.

George Samman

Apr 14, 2016, 8:52:32 AM4/14/16
to raft-dev
Sorry, this was meant for Philip Haynes. Thanks.

TaiZe Wang

Apr 14, 2016, 11:25:02 PM4/14/16
to raft-dev
Hi Philip,

We have just implemented Raft based on LevelDB. It reaches 9K write QPS on a five-node cluster.

Diego Ongaro

Apr 17, 2016, 4:51:27 PM4/17/16
to raft...@googlegroups.com
> What is current measure of scalability?

> But in general, you'll find that most Raft implementations are limited in terms of scalability at some level simply because Raft is a leader-based algorithm. All operations ultimately have to go through the leader.

It's a bit more fundamental than this:
Paxos/Raft/VR/ZK won't lose your data if you have a majority of servers up
=> every command has to be replicated on at least a majority of servers to be committed (the overlap of any two majorities is non-empty)
=> every server has to participate in the commitment of a majority of the commands
=> with a single consensus group, no matter what the algorithm does, it can never achieve twice the throughput of an individual non-replicated server.

To be clear, Raft isn't optimal (the leader can be a bottleneck), but there's a pretty low bar for what you can hope to achieve without partitioning/sharding.
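The chain of reasoning above can be checked with a little arithmetic. If the group commits T commands/sec and each command must reach a majority, the total work is at least T x majority, but the cluster can do at most N x C work, where C is one server's capacity. The capacity figure below is an illustrative assumption, not a measurement:

```python
# Back-of-the-envelope check of the majority-replication throughput bound
# for a single consensus group (Paxos/Raft/VR/ZK alike).

def max_group_throughput(n_servers, single_server_capacity):
    """Upper bound on committed commands/sec for one consensus group.

    Every command must be processed by at least a majority of servers,
    so throughput * majority <= n_servers * single_server_capacity.
    """
    majority = n_servers // 2 + 1
    return n_servers * single_server_capacity / majority

# No matter the cluster size, the bound never reaches twice the
# throughput of a single non-replicated server:
for n in (3, 5, 7, 101):
    bound = max_group_throughput(n, single_server_capacity=10_000)
    assert bound < 2 * 10_000
    print(f"{n} servers: at most {bound:,.0f} commands/sec")
```

As n grows, n / majority approaches 2 from below, which is exactly the "twice the throughput of an individual non-replicated server" ceiling stated above.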

-Diego

Philip Haynes

Apr 17, 2016, 5:12:03 PM4/17/16
to raft-dev
I should have been clearer - these numbers are from my development Mac. Each commit equates to:
a) 1 read of a test record.
b) 3 x 5 KB writes (one for each "node").

I am doing a key-value write of the index using RocksDB.
The next round of testing will likely be faster as:
a) the machines will be clean, and
b) only writing will be done on each node, without all the contention I have at the moment.