High performance raft library/system


moranti...@gmail.com

May 27, 2018, 1:52:23 PM
to raft-dev
Hi all, 

For my PhD, I'm running some benchmarks on distributed systems/tools/libraries. For Raft implementations, I've tried etcd, consul and logcabin. Are there any other "high performance" ones I should give a go?
I have two cluster setups, with 1-gigabit and 10-gigabit Ethernet; each cluster has 3 nodes (16-core CPUs, 30 GB RAM each) plus another node for clients. None of the Raft implementations could saturate even the 1-gigabit setup.
So I'm actually looking for a "high performance" one; please let me know if there is any, I don't want to miss it.

P.S. I ran all products with default configs. All commands were puts/writes from multiple clients, so any configuration tips for performance are appreciated.

Thanks.
Mora Ntikili

Oren Eini (Ayende Rahien)

May 27, 2018, 3:22:45 PM
to raft...@googlegroups.com
Raft will usually do request/response, so you're limited by the latency of the network, not the bandwidth.
Technically, you can do it in a batch mode, but that gets funny after a while, because out-of-order responses make for hard-to-follow code.
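To illustrate what I mean by batching, here is a minimal sketch in Go, with made-up types and helpers rather than any real implementation's API: the leader drains whatever commands queued up while the previous round trip was in flight and ships them in one AppendEntries.

```go
package sketch

// Sketch only: Entry, AppendEntriesRequest and sendAppendEntries are
// hypothetical stand-ins, not any real library's API.
type Entry struct{ Data []byte }

type AppendEntriesRequest struct {
	Term         uint64
	PrevLogIndex uint64
	Entries      []Entry
}

func sendAppendEntries(req AppendEntriesRequest) { /* blocks for one round trip */ }

func replicateLoop(pending <-chan Entry, nextIndex, term uint64) {
	for first := range pending {
		batch := []Entry{first}
	drain: // pick up whatever else queued while we were waiting on the last round trip
		for {
			select {
			case e := <-pending:
				batch = append(batch, e)
			default:
				break drain
			}
		}
		sendAppendEntries(AppendEntriesRequest{Term: term, PrevLogIndex: nextIndex - 1, Entries: batch})
		nextIndex += uint64(len(batch)) // throughput ~ batch size per round trip
	}
}
```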

Hibernating Rhinos Ltd  

Oren Eini | CEO   Mobile: +972-52-548-6969

Office: +972-4-622-7811 | Fax: +972-153-4-622-7811

 



Vilho Raatikka

May 27, 2018, 3:37:19 PM
to raft...@googlegroups.com
Hi, are you planning to use a simple request/response kind of load, or something more complicated such as multi-statement transactions? I agree that the network shouldn't be what limits performance. With a simplistic load, the technical report by Howard et al. concentrated on leader election and tuning timeouts. What I haven't seen investigated yet is, for example, the effect of parallelism with non-trivial load.
A three-node cluster also sounds way too small. I understand the difficulty of getting dozens of nodes just for experimenting, but if it isn't mandatory to run nodes on separate hosts, nodes should be able to share hosts. Of course, then you should concentrate on things other than network latencies.

Cheers, Vilho


Timothy Moran

May 27, 2018, 4:32:37 PM
to raft...@googlegroups.com
Hi,

Thanks for the quick responses.


@Oren 
E.g., I ran etcd with the "linearizable" option (-l), so I assume it sends requests in "batch mode", but I couldn't get more than 20k requests (1 KB each) per second.

@Vilho
I expect implementations to send messages in batch mode. Multiple requests in flight are acceptable to me; I don't mind that.
I think Raft wouldn't be effective with dozens of nodes, as we have to replicate the data to each of them (for redundancy that's OK, but for performance it might not be acceptable).
Sharing the same hosts also seems ineffective: if a host crashes, multiple nodes will be down.

@Oren, @Vilho
Sorry, I forgot to mention: the nodes in the cluster setup have SSDs (~450 MB/s write).
So, if an implementation uses batch mode and you have a fast disk,
theoretically it can saturate gigabit Ethernet.

(1-gigabit Ethernet = ~125 MB/s; e.g., clients send 60 MB/s total and the leader replicates to two followers, so the leader's outbound traffic would be ~120 MB/s plus some acks, still well under the SSD's write speed)
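To spell out the back-of-envelope numbers (all approximate, using the figures above):

    1 Gbit/s NIC            ≈ 125 MB/s
    clients -> leader          60 MB/s
    leader -> 2 followers      2 × 60 = 120 MB/s outbound   (< 125 MB/s NIC)
    log writes per node        ~60 MB/s                     (< 450 MB/s SSD)

For comparison, the ~20k × 1 KB requests/s I measured is only ~20 MB/s, far below both limits.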

I don't know if that is possible with the Raft algorithm :) So I'm just asking whether there is any implementation that performs that well.

Thank you.

Oren Eini (Ayende Rahien)

May 27, 2018, 5:17:55 PM
to raft...@googlegroups.com
I'm not sure how you _can_ do batches in this regard.
My implementation of Raft does:

* Accept a command in the leader
* Write it to disk durably (fsync)
* A thread for each follower wakes up and sends it to the other side
* Wait for confirmation about this from the other side

* Follower listens to the network
* Get new commands, write them to disk durably (fsync)
* Confirm receipt
* Apply commit index (execute state machine)

* Leader gets the response, records the number of confirmations for it durably (fsync)
* Return to client

In particular, during the AppendEntries call, there are no concurrent AppendEntries going to the same follower.
Even if we ignore the cost of fsyncs, there is _no_ traffic from the leader to the follower during at least a full round trip.
That alone should mean you can't get full network utilization.

Note that in my case, there are also explicit limits in place. If the leader accepts a lot of commands in the meantime, I'm not going to send all of them in one shot, because networks suck and we need to get confirmation from the follower within 300ms.
That also puts a limit on the amount of traffic we can afford to send to a follower in one shot.
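For concreteness, the shape of that per-follower loop, as a rough Go sketch (the names are made up for illustration; the real code obviously differs):

```go
package sketch

// Hypothetical types, just enough to show the shape of the loop.
type Entry struct{ Index uint64 }

type AppendEntriesResponse struct {
	Success    bool
	MatchIndex uint64
}

type Follower struct {
	ID        int
	NextIndex uint64
}

// Blocks for a full round trip: send, follower fsyncs, follower acks.
func (f *Follower) SendAppendEntries(entries []Entry) AppendEntriesResponse {
	return AppendEntriesResponse{}
}

type RaftLog struct{}

func (l *RaftLog) ReadSince(from uint64, maxBytes int) []Entry { return nil }
func (l *RaftLog) RecordAck(followerID int, matchIndex uint64) {}

const maxBatchBytes = 4 << 20 // arbitrary cap so the follower can ack within ~300ms

// Exactly one AppendEntries is in flight to this follower at a time: the link
// sits idle for at least a full round trip between sends.
func followerLoop(f *Follower, log *RaftLog) {
	for {
		entries := log.ReadSince(f.NextIndex, maxBatchBytes) // already fsynced on the leader
		resp := f.SendAppendEntries(entries)
		if resp.Success {
			f.NextIndex += uint64(len(entries))
			log.RecordAck(f.ID, resp.MatchIndex) // leader durably records the confirmation
		} else if f.NextIndex > 1 {
			f.NextIndex-- // walk back to find the last matching entry
		}
	}
}
```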




Hibernating Rhinos Ltd  

Oren Eini | CEO   Mobile: +972-52-548-6969

Office: +972-4-622-7811 | Fax: +972-153-4-622-7811

 


Timothy Moran

May 27, 2018, 7:12:11 PM
to raft...@googlegroups.com
I understand your implementation and requirements. You could send another AppendEntries while waiting for confirmation of the previous one, which might lead to higher throughput.

I think Raft lets you implement it this way (does it?): multiple RPCs, with multiple AppendEntries in flight, as long as you send them sequentially:

Client -> RPC1 -> Leader
Leader -> AppendEntries1 -> Follower1, Follower2
Client -> RPC2 -> Leader
Follower1 -> AppendEntries1Ack -> Leader
Leader -> AppendEntries2 -> Follower1, Follower2
Leader -> RPC1Ack -> Client
Follower2 -> AppendEntries1Ack, AppendEntries2Ack -> Leader
Leader -> RPC2Ack -> Client
Follower1 -> AppendEntries2Ack -> Leader

jordan.h...@gmail.com

May 27, 2018, 9:49:33 PM
to raft...@googlegroups.com
Sure. Search “pipelining” in the Raft dissertation. We do it in Atomix. Atomix’s Raft implementation works more or less the same way, but multiple AppendEntries RPCs can be in transit at a time. In order to avoid small batch sizes (e.g. two 1-entry AppendEntries RPCs after two entries are appended) we keep track of round trip times and send the next batch at some interval based on average RTT or something like that.
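The flow-control part looks roughly like this (a Go-flavored sketch of the idea only; Atomix itself is Java and the actual code differs in detail):

```go
package sketch

import (
	"sync"
	"time"
)

type Entry struct{ Index uint64 } // hypothetical

// pipeline keeps a bounded number of AppendEntries in flight to a single
// follower and tracks an average round-trip time, which can be used to
// decide when to cut the next batch.
type pipeline struct {
	inFlight chan struct{} // semaphore: capacity = max concurrent AppendEntries
	mu       sync.Mutex
	avgRTT   time.Duration
}

func newPipeline(maxInFlight int) *pipeline {
	return &pipeline{inFlight: make(chan struct{}, maxInFlight)}
}

func (p *pipeline) send(batch []Entry, rpc func([]Entry) error) {
	p.inFlight <- struct{}{} // blocks if too many requests are already outstanding
	go func() {
		defer func() { <-p.inFlight }()
		start := time.Now()
		if err := rpc(batch); err == nil {
			p.mu.Lock()
			p.avgRTT = (9*p.avgRTT + time.Since(start)) / 10 // crude moving average
			p.mu.Unlock()
		}
	}()
}
```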

Oren Eini (Ayende Rahien)

May 28, 2018, 1:22:30 AM
to raft...@googlegroups.com
The moment you go down that route, you run into some complex scenarios.
For example, if you are using a single TCP connection, you have head-of-line blocking anyway.
There is also an issue of timing: you got an AppendEntriesResponse, but for which AppendEntries? Is it too late to count, etc.?
If using multiple TCP streams / UDP, you don't get sequencing. 

Note that this isn't to say that you can't do it this way, only that there are some traps that you have to deal with.

I considered having a separate channel for heartbeats, apart from the actual AppendEntries, but that is risky. In particular, it is possible to construct a scenario where you can get heartbeats but can't process AppendEntries, which should result in elections.

Simply splitting the send/recv for each follower to be independent will also work, and means that you can start sending the next AppendEntries before the other side has received the previous one. But be aware that it doesn't actually buy you much.

Eventually, you are going to hit the TCP buffer size and stall until the other side reads from the network and acknowledges the received packets. So after all of this, you still end up limited by how fast the other side can read and process each AppendEntries.

jordan.h...@gmail.com

May 28, 2018, 6:27:30 AM
to raft...@googlegroups.com
Sure, there are certainly limits to the approach, but in practice we've seen it indeed makes a difference. It can degrade performance quickly, though, so the pipelining currently has a hard-coded limit of two concurrent AppendEntries requests to each follower. We have a lot going on in the network beyond Raft and can't afford to saturate it anyway.

Atomix uses a messaging abstraction that uses a single TCP socket for AppendEntries RPCs (this would be very complicated without the ordering guarantee) and essentially uses correlation IDs for request/reply: the future for the relevant request is completed once a response for a given request is received. That allows the code to correlate a response with the context within which it was sent, which is effectively held in memory until the response arrives.
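The correlation part is conceptually something like this (a Go sketch of the idea, not Atomix's actual Java code; all names are made up):

```go
package sketch

import "sync"

type Response struct {
	ID      uint64
	Success bool
}

// conn multiplexes request/response over a single ordered connection:
// each outgoing request gets a correlation ID, and the reply completes the
// "future" (here just a buffered channel) registered under that ID.
type conn struct {
	mu      sync.Mutex
	nextID  uint64
	pending map[uint64]chan Response // context held in memory until the reply arrives
	sendRaw func(id uint64, payload []byte)
}

func (c *conn) request(payload []byte) <-chan Response {
	c.mu.Lock()
	c.nextID++
	id := c.nextID
	ch := make(chan Response, 1)
	c.pending[id] = ch
	c.mu.Unlock()
	c.sendRaw(id, payload) // one TCP socket, so ordering is preserved
	return ch
}

// onResponse is called by the read loop for every incoming reply.
func (c *conn) onResponse(r Response) {
	c.mu.Lock()
	ch, ok := c.pending[r.ID]
	delete(c.pending, r.ID)
	c.mu.Unlock()
	if ok {
		ch <- r
	}
}
```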

A state machine is used to track when AppendEntries requests are unsuccessful, and pipelining is disabled when conflicts occur on a follower (i.e., while the leader is trying to find the last matching entry in the follower's log); otherwise there are ways to get into a state where optimistically sending entries to a follower that always rejects them actually prevents the leader from decreasing nextIndex. This solution is simpler than ignoring responses that arrive after an earlier request was rejected, and there's no performance benefit to pipelining while the follower is rejecting entries anyway.

Anyway, good points! It would be interesting to do a practical experiment with different configurations. I did that at some point, but it was so long ago I don't remember the details. All I do know is that we arrived at a very minimal implementation, but that may also be a product of the use case.

Timothy Moran

May 28, 2018, 7:46:19 AM
to raft...@googlegroups.com
Great points indeed. Thank you all for the detailed explanations.

I'll add Atomix to the benchmark suite. I know it would be better to ask on GitHub, but one quick question: if I modify and run RaftPerformanceTest.java, will it employ multiple clusters (sharding)? That is, will I get results for multiple clusters rather than a single one?

jordan.h...@gmail.com

May 29, 2018, 2:38:41 PM
to raft...@googlegroups.com
RaftPerformanceTest is just a test on a single Raft cluster, so it should be comparable to other single-instance systems. That class is really just used to debug performance issues in the IDE now and then. But all our regular benchmarking is done with sharding, so that's probably the best place to start.

Юрий Соколов

May 31, 2018, 1:47:10 AM
to raft-dev
A friend of mine built a storage system with a pipelined Raft implementation without many limits (well, there was a cap of about 1024 in-flight AppendEntries, but that was dictated by other technical issues).
The storage had the ability to roll back unacknowledged records, so when a majority of followers failed, it rolled back to the last record known to be committed and started an election.
Unfortunately, I doubt it is open-sourced.
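I don't know the details of that system, but the rollback idea is presumably something like this (a generic Go sketch, not their code):

```go
package sketch

type Entry struct{ Index uint64 }

// rollbackToCommitted drops every record after the last index known to be
// committed; in the system described above this would happen when a
// majority of followers fails, before starting a new election.
func rollbackToCommitted(log []Entry, commitIndex uint64) []Entry {
	for i, e := range log {
		if e.Index > commitIndex {
			return log[:i] // the tail was never acknowledged by a majority
		}
	}
	return log
}
```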

Arthur Wang

Aug 22, 2018, 4:40:52 AM
to raft...@googlegroups.com
I'm excited to see such a performance-related discussion, and I'm currently working on my own implementation of Raft aimed at exactly this (mostly enhancing throughput).
The basic idea of my version is to take advantage of parallelism as much as possible. Here are some key points of my design (some of these are already done):

1. Implemented in C++:
    1> inherent performance merits.
    2> strong features brought by C++11/14/17/20.

2. Using gRPC as the server-side network framework:
   1> a multi-worker-thread model that I can use directly; I wouldn't know how to do all those tricks with alternatives like libevent.
   2> an out-of-the-box serialization solution: protobuf.
        1) good compression ratio to decrease traffic size.
        2) a programmer-friendly interface.
   3> a form of asynchronous communication (based on its `CompletionQueue`) between leader and followers.
  
3. Pipelined:
  1> Each leader-follower pair has its own connection pool. A connection in the pool is of course represented by a `Stub` object in gRPC.
      1) A separate thread is responsible for the health of each connection, periodically sending heartbeats.
      2) I'm not sure whether each TCP connection can be multiplexed under gRPC, but I believe so.

  2> AppendEntries RPC calls are issued in parallel to all the followers, and the leader just waits for responses until a majority has confirmed, after which it responds to the client with a success message (see the sketch after this list).
       1) The remaining AppendEntries responses, and further the CommitEntries RPC calls, are handled by other threads spawned by the user code.
       2) Communication between the threads goes through a multiple-producer multiple-consumer lock-free ring-buffer queue.

  3> The follower is able to cope properly with out-of-order arrival.
      1) It looks at the first entry attached to each AppendEntries; if that entry's previous log ID is beyond the ID of the last replicated log, the thread pends, waiting (only for a bounded amount of time, of course) for the previous entries to arrive.
      2) A pending queue (a doubly linked list) holds the entries that arrived ahead of time. That queue is lock-free as well.
      3) When a pending thread is allowed to proceed and, at that point, its logs have already been appended by another thread, its job is done. Notification is done via a condition variable.

  4> Disk operations are serialized.
        This is the ultimate problem that pipelining can't get around: when multiple threads append logs to the same file, they have to be well coordinated.
      1) A last_id value maintained in memory indicates the ID of the last appended log.
      2) As in 3>1), if the earliest ID of the logs a thread wants to write is beyond last_id, the thread hangs and waits. A condition variable does the notification again.

    The fundamental idea of my pipeline design is: wherever multiple threads could accelerate the whole process, never use a single thread.
    Although the subsidiary problems that multithreading brings are big and hard, they are not unsolvable.

4. Persistent storage on each node is layered, in a manner inspired by LevelDB:
    1) good blind-write throughput.
    2) SSTables, an in-memory log, bloom filters, and so on.
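As referenced in 3>2> above, the majority wait can be sketched like this (in Go for brevity, with made-up names; the real implementation is C++/gRPC):

```go
package sketch

type Entry struct{ Index uint64 }
type Follower struct{ ID int }

// replicate fans an AppendEntries batch out to all followers in parallel and
// returns as soon as a majority of the cluster (leader included) has
// confirmed; the remaining responses are left for other goroutines to handle.
// sendAppendEntries stands in for the real RPC call.
func replicate(followers []Follower, batch []Entry,
	sendAppendEntries func(Follower, []Entry) bool) bool {

	acks := make(chan bool, len(followers)) // buffered so late responders never block
	for _, f := range followers {
		go func(f Follower) { acks <- sendAppendEntries(f, batch) }(f)
	}
	needed := (len(followers) + 1) / 2 // follower acks needed so that leader + acks form a majority
	got := 0
	for i := 0; i < len(followers); i++ {
		if <-acks {
			got++
			if got >= needed {
				return true // majority confirmed: respond to the client now
			}
		}
	}
	return false
}
```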


All of the above is about Raft's log replication. As for the other two parts:
I. election
II. membership change

I don't think there is much room to play the optimization game there.

Once all of the above is done, it will be more like a distributed k-v storage system than just the Raft protocol.

This project is tremendously large, like a monster. I've been working on it for several months (strictly in my spare time), and I think it will need another several months to see a beta release.

I hope this can be a novel implementation of Raft.



Arthur Wang

Aug 22, 2018, 5:26:45 AM
to raft...@googlegroups.com
The time I've spent on this project has mostly gone into two parts:
1> The design of lock-free (or maybe wait-free, I don't care too much about the distinction) data structures.
     They rely heavily on CAS operations. By now I can convince myself of their correctness.
2> Coordination among the threads.

And some other thoughts on how to get around the synchronous file-writing problem:
1> Different threads write to different files, which improves overall write throughput; within each file, log IDs are monotonically increasing but may be non-adjacent.
2> In scenarios where the files need to be read, keep a handle for each file and read them sequentially to recover the right order of the whole log, like a merge sort (see the sketch below).

I've never tried this file-splitting idea in practice, but it should be viable.
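The merge-style read in 2> could look roughly like this (a Go sketch with in-memory slices standing in for the per-thread files; names are made up):

```go
package sketch

type Entry struct{ ID uint64 }

// mergeLogs reconstructs the global order of the log from several per-thread
// files, each internally sorted by ID (possibly with gaps), by repeatedly
// taking the file whose next entry has the smallest ID.
func mergeLogs(files [][]Entry) []Entry {
	var out []Entry
	heads := make([]int, len(files)) // current read position in each file
	for {
		bestFile, bestID := -1, uint64(0)
		for i, f := range files {
			if heads[i] < len(f) && (bestFile == -1 || f[heads[i]].ID < bestID) {
				bestFile, bestID = i, f[heads[i]].ID
			}
		}
		if bestFile == -1 {
			return out // all files exhausted
		}
		out = append(out, files[bestFile][heads[bestFile]])
		heads[bestFile]++
	}
}
```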

Oren Eini (Ayende Rahien)

Aug 22, 2018, 7:10:14 AM
to raft...@googlegroups.com
Pay attention to the fact that Raft's throughput is limited, by design, by:
* Round trip time for a majority of the network
* Persisting the data to disk (if using persistence)
Nothing else will be as costly as these.
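As a rough back-of-envelope way to put it (my own framing, not a formal bound):

    time per replication round ≈ max(majority round-trip time, fsync latency)
    throughput                 ≈ data carried per round / time per round

so once those two costs are fixed, the only real lever left is how much each round carries, i.e. batching and pipelining.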


Arthur Wang

Aug 22, 2018, 7:32:29 AM
to raft...@googlegroups.com
 - Round trip time for a majority of the network  

I don't see why this would be a factor limiting throughput; it relates much more to the latency of a single AppendEntries RPC call.
 
If you can issue the AppendEntries RPC calls for an entry before its previous entries are majority-confirmed, then there is actually no throughput limit imposed by Raft's design at all.
One thing to consider with this approach: when the previous entries cannot get majority confirmation (e.g., the cluster is down), a large amount of unnecessary traffic will be generated on the network.



Oren Eini (Ayende Rahien)

Aug 22, 2018, 7:41:35 AM
to raft...@googlegroups.com
Sorry, I tend to care more about latency than throughput, and that is what I design for.
Be aware that you want some limit on the number of in-flight requests. You have to execute and report results back to the client after the commit, and that can eventually cause timeouts if you let too much into the queue.


Arthur Wang

Aug 22, 2018, 7:46:06 AM
to raft...@googlegroups.com
Yes, there should be a limit on that number. I almost forgot.
