Questions about consensus for raft


George Samman

Apr 2, 2016, 6:44:57 PM4/2/16
to raft-dev
Hi all, Diego Ongaro suggested I join this group to get some questions answered on consensus with Raft. I'm not a software engineer, so forgive my ignorance in the questions. I work with blockchains and am putting together a whitepaper on consensus mechanisms. Any help here would be greatly appreciated.

Here are my questions:

1) How much time does a node need to reach a decision?

2)  How do you measure if a node behaves with error (maybe non-malicious errors but due to data corruption or implementation errors)?

3) How does this scale? How many transactions can you do per second? 

4) What is the synchronization process?  (Is it the leader who synchronizes and sends out a replicated log?)  

5) In this mechanism the leader seems to have all the power, is this true and is that good for consensus as a whole? What if he's a malicious actor?

6) How is access restricted for malicious actors? Can they be kicked off the network?

7) Is there signature verification in place?

8) What is the cost to implement this? How much time does it take to implement?

9) How many nodes can you have on the network? Is it unlimited?

10) How many nodes need to validate a transaction? (%)

11) Is the system hard coded, or built with code flexibility?

12) - Fault tolerance levels?  (How many nodes need to be compromised before everything is shut down?)

13) - What is disaster recovery process?

Ruslan Rusu

Apr 2, 2016, 9:52:17 PM4/2/16
to raft-dev

Hey,

maybe you can read the whitepaper and come back if there are any unanswered questions left

Cheers

Diego Ongaro

Apr 2, 2016, 10:29:32 PM4/2/16
to raft...@googlegroups.com
I think George has already looked at the Raft paper, and we can see indications of that in the way he's asked some of the questions, especially 4 and 5. George, you can probably find the answers to 4 (yes), 10, and 12 in the paper pretty easily. George also asked some more difficult and interesting questions (including 2, 8, 13), where I'm hoping we can help shed some light or point him to other resources and past discussions. I'll reply with my answers when I have some more time, if no one's beaten me to it.

-Diego


George Samman

Apr 3, 2016, 6:17:59 AM4/3/16
to raft-dev
Hi all

Thanks for getting back to me!! Much appreciated.

I did read the white paper. Some of the questions I think I know the answers to, and they are in there, but I just want to be sure since, as I said, I'm not very technical. I think I basically answered 4 with my follow-up question, but I wanted to be sure. As for 5, my impression from the whitepaper was that the leader gets elected and stays the leader until he goes offline or does something "wrong", but while he is the leader it seems all the other nodes give him control to synchronize and replicate the log. True or false?

Also, I believe the whitepaper used an example of 5 nodes with 2 going down while the system stayed OK, so that's 40%? That seems quite high to me; am I wrong about that number?

Thanks again all.

Alex Bligh

Apr 3, 2016, 7:01:52 AM4/3/16
to raft...@googlegroups.com, Alex Bligh
George,

On 2 Apr 2016, at 23:44, George Samman <joking...@gmail.com> wrote:

> Hi all Diego Ongaro suggested I join this group to get some questions answered on consensus with Raft. I'm not a software engineer so forgive my ignorance on the questions. I work with blockchains and am putting together a whitepaper on consensus mechanisms. Any help here would be greatly appreciated.

Let me have a stab. In general, many of your questions are implementation-dependent. There are many aspects of Raft left to the implementor (e.g. the absolute values of timeouts, as opposed to their relative values), and some areas are ambiguous in the sense that there is more than one way to implement them in a compliant manner, or the spec itself gives more than one way (a classic example being the section on client behaviour, which IIRC is not in the paper at all but is in Chapter 6 of the dissertation, with several options).

For this reason several of my answers may initially seem unhelpful.

> Here are my questions:
>
> 1) How much time does a node need to reach a decision?

You'll need to be clearer on what decision you mean. In normal operation the node will apply items to its FSM and read from its FSM. You could (with appropriate timeouts) have each of those take 1 second; that would, however, involve setting an election timeout an order of magnitude higher.

> 2) How do you measure if a node behaves with error (maybe non-malicious errors but due to data corruption or implementation errors)?

The raft specs do not specify measurement. Most implementations measure some aspects (e.g. with monotonic counters), and often send them off to a stats server. grep for 'stats' in hashicorp/raft for instance.

If you mean 'how does the algorithm tell whether a node has errored' it simply looks for either the absence of a valid reply or heartbeat within a timeout, or the presence of an invalid reply or heartbeat (where 'valid' and 'invalid' are determined by the spec).
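
The timeout-based detection described above can be sketched roughly as follows (a minimal illustration, not taken from any real implementation; the class and method names are made up, and the timeout range is the typical one suggested in the Raft paper):

```python
import random
import time

# Sketch of heartbeat-based failure detection (illustrative names only).
# A follower suspects the leader has failed when no valid heartbeat
# (AppendEntries RPC) arrives within its randomized election timeout.

ELECTION_TIMEOUT_MIN = 0.150  # seconds; typical values from the Raft paper
ELECTION_TIMEOUT_MAX = 0.300

class Follower:
    def __init__(self):
        self.last_heartbeat = time.monotonic()
        self.timeout = random.uniform(ELECTION_TIMEOUT_MIN, ELECTION_TIMEOUT_MAX)

    def on_heartbeat(self):
        # A valid heartbeat from the leader resets the timer.
        self.last_heartbeat = time.monotonic()
        self.timeout = random.uniform(ELECTION_TIMEOUT_MIN, ELECTION_TIMEOUT_MAX)

    def leader_suspected_down(self):
        # Absence of a valid heartbeat past the timeout: the follower
        # would become a candidate and start an election.
        return time.monotonic() - self.last_heartbeat > self.timeout
```

The randomization is what keeps split votes rare: two followers are unlikely to time out at the same instant.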

> 3) How does this scale? How many transactions can you do per second?

Raft communication is unicast. Under normal operation the leader makes RPC calls to the followers, so communication is O(n). As the cluster gets larger, there is more chance of an individual node failing in a given time interval; on the other hand, it takes more individual node failures to break a quorum. When individual nodes fail and come back online (assuming no re-election), their logs need to be 'caught up'. The failure of the network fabric is also going to have some relationship with the number of nodes. There are some studies of Raft scaling in both papers, particularly as it affects election timeouts.

In general, the number of nodes is not going to be the key determinant of transactions per second. I suspect the practical limit in most scenarios is 'a few hundred', though doubtless the algorithm could be tuned. Many use cases with more 'nodes' have a small subset participating in Raft (and sharing the FSM) and a larger subset that are nodes but not Raft nodes (e.g. consul).

The number of transactions per second is going to depend more on network latency and the time to apply entries to the FSM than on the number of nodes. Remember that whilst each follower processes every log entry, followers can do so in parallel.

> 4) What is the synchronization process? (Is it the leader who synchronizes and sends out a replicated log?)

Yes. This is clear from both raft papers - I suggest you read them in more detail.

> 5) In this mechanism the leader seems to have all the power, is this true and is that good for consensus as a whole? What if he's a malicious actor?

Yes, until an election occurs, where (in essence) he has no more 'power' than anyone else.

It's not bad for consensus as the whole point of an FSM is that if any actor applies the same data to an FSM in the same state, they get the same result.
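
That determinism argument is easy to see with a toy replicated state machine (a hypothetical example; `apply_log` is not a real Raft API):

```python
# Toy key-value state machine (hypothetical; not a real Raft API).
# Any node that applies the same committed log entries in the same order
# ends up in exactly the same state, regardless of which node it is.
def apply_log(entries):
    state = {}
    for key, value in entries:
        state[key] = value
    return state

log = [("x", 1), ("y", 2), ("x", 3)]
# Two "nodes" applying the same log reach the same state.
assert apply_log(log) == apply_log(log)
assert apply_log(log) == {"x": 3, "y": 2}
```

This is why the leader only needs to replicate the log, not the state: every node can reconstruct the state locally.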

Raft itself provides few if any protections against malicious actors as far as I know.

> 6) How is access restricted for malicious actors? Can they be kicked off the network?

Raft does not specify a transport layer, and authn/authz is a matter for the transport layer. hashicorp/raft from memory uses TLS, as can etcd/raft. However, this is outside the scope of the Raft spec. When Raft gets the messages, it is assumed they have already been authenticated/authorized.

> 7) Is there signature verification in place?

As per 6. Note that TCP itself will prevent most corruption and ensure retransmission, and most users run Raft over TCP (with or without encryption on top). Note that Raft does not have to run over TCP (I submitted a UDP version, for instance, and running over, say, ZeroMQ would be quite feasible).

> 8) What is the cost to implement this? How much time does it take to implement?

Transport dependent.

> 9) How many nodes can you have on the network? Is it unlimited?

The protocol sets no limit. I'm not sure you'd want more than a few hundred. I suspect raft scalability would be a fruitful area for further academic study.

> 10) How many nodes need to validate a transaction? (%)

Logs need to be propagated to an absolute majority of the network (including the leader itself), i.e. strictly more than half, i.e. >50%. Exactly 50% is insufficient.
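
The strict-majority rule can be written as a one-line helper (illustrative only; the function name is made up):

```python
# Strict-majority quorum (illustrative helper): an entry is committed once
# it is stored on more than half of the cluster, leader included.
def quorum(cluster_size: int) -> int:
    return cluster_size // 2 + 1

assert quorum(5) == 3  # 3 of 5 copies commit an entry
assert quorum(4) == 3  # exactly 50% (2 of 4) is not enough
```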

> 11) Is the system hard coded, or built with code flexibility?

Implementation-dependent. Many implementations are inherently flexible (see etcd/raft or hashicorp/raft). I'm sure it would be possible to design something gross that is hard-coded.

> 12) - Fault tolerance levels? (How many nodes need to be compromised before everything is shut down?)

Compromised is the wrong word.

If half or more of the nodes are down, the cluster will no longer be able to commit new log writes. Reads from the FSM may still work if stale reads are permitted.

If you really meant 'compromised' in the sense of 'hacked', I suspect a single node being compromised would be sufficient to DoS the system without further protection.
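
Putting that availability rule into arithmetic (an illustrative sketch, which also matches the 5-node example from the paper that George asked about):

```python
# Availability arithmetic (illustrative): a cluster of n nodes keeps
# committing writes as long as a majority is up, so it tolerates
# floor((n - 1) / 2) crashed nodes.
def tolerated_failures(n: int) -> int:
    return (n - 1) // 2

assert tolerated_failures(5) == 2  # the 5-node, 2-down example
assert tolerated_failures(3) == 1
assert tolerated_failures(4) == 1  # adding a 4th node tolerates no extra failure
```

This is why clusters are usually sized at odd numbers: an even-sized cluster pays the replication cost of an extra node without gaining any fault tolerance.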

> 13) - What is disaster recovery process?

Read the Raft papers searching for 'stable store' or similar. The whole point is that Raft log items, once accepted, are guaranteed to be on stable storage. So in the event of a total outage, the cluster should just come back alive. Similarly, you only need one copy of the FSM to restore the whole lot.

The details of the process will depend both on the implementation and on the end user's operational procedures (e.g. how backups are taken). Note that the fact that Raft is designed to operate across data centers makes this easier. One minor challenge is that the set of peers is itself part of the replicated state, so if you lose a lot of peers, adding peers back to form a quorum needs thought. From memory this is covered in chapter 4.

--
Alex Bligh




George Samman

Apr 3, 2016, 7:04:54 AM4/3/16
to raft-dev
Thank you Alex!  



Diego Ongaro

Apr 17, 2016, 4:35:48 PM4/17/16
to raft...@googlegroups.com
Thanks, Alex. I think you covered almost all of it; I'm just chiming in here (rather late) to answer this one:

8) What is the cost to implement this? How much time does it take to implement?

Are you asking how much human time it takes to create an implementation? It's a bit hard to measure. A basic implementation of the core protocol is about 400 lines and takes me about a day. By the time you layer on:
 - durability
 - membership changes
 - compaction
 - versioning and rolling upgrades
 - correctness testing
 - performance optimizations
 - metrics and benchmarks,
it adds up. For one engineer, it's somewhere between months and years.

LogCabin is one data point that offers a loose upper bound of 3 man-years from zero to production (summer 2012 when consensus work started, to summer 2015). I say loose upper bound because:
 - I didn't have a book on Raft when I started LogCabin and didn't understand basic things like the replicated state machine approach. I hope it's easier for people now.
 - Work on it was bursty and not full-time for long periods. It was common for months to go by with no commits. Eyeballing the commits graph, you could easily discount at least 1 man-year for whitespace.

Do others on the list have better data points? I'm curious about the number of man-years from zero to production (full-featured, etc), especially if it's been a focused full-time effort.

Cheers,
Diego


Li Xiang

Apr 17, 2016, 4:55:06 PM4/17/16
to raft...@googlegroups.com
On Sun, Apr 17, 2016 at 1:35 PM, Diego Ongaro <onga...@gmail.com> wrote:
> Do others on the list have better data points? I'm curious about the number of man-years from zero to production (full-featured, etc), especially if it's been a focused full-time effort.


etcd/raft took a little less than one man-year to become relatively stable (full-featured, tested, reasonable performance, etc.). But we are still slowly improving it over time.

jordan.h...@gmail.com

Apr 17, 2016, 7:49:15 PM4/17/16
to raft...@googlegroups.com
Copycat has also been in development for a few years and is now stable, but a lot of that time was also spent working on related projects, like a serialization framework and Atomix, or on developing and experimenting with algorithms that aren't fully described in the Raft literature, like incremental compaction and session events.

If I had to guess, the actual time it took to develop Copycat itself to stability was closer to a year and a half, with sessions, events, membership changes, incremental compaction, snapshots, etc. all included. But that feels pretty anecdotal, and the extensions (session events and incremental compaction) certainly added a lot of time.