On 2 Apr 2016, at 23:44, George Samman <joking...@gmail.com> wrote:
> Hi all, Diego Ongaro suggested I join this group to get some questions answered on consensus with Raft. I'm not a software engineer, so forgive my ignorance on the questions. I work with blockchains and am putting together a whitepaper on consensus mechanisms. Any help here would be greatly appreciated.
Let me have a stab. In general, many of your questions are implementation-dependent. Raft leaves many aspects to the implementor (e.g. the absolute values of timeouts, as opposed to their relative values), and some areas are ambiguous in the sense that there is more than one compliant way to implement them, or the spec itself gives more than one way. A classic example is client behaviour, which IIRC is not in the paper at all but is covered (with several options) in Chapter 6 of the dissertation.
For this reason several of my answers may initially seem unhelpful.
> Here are my questions:
> 1) How much time does a node need to reach a decision?
You'll need to be clearer on what decision you mean. In normal operation the node will apply items to its FSM and read from its FSM. You could (with appropriate timeouts) have each of those take 1 second. This would however involve setting an election timeout an order of magnitude higher.
> 2) How do you measure if a node behaves with error (maybe non-malicious errors but due to data corruption or implementation errors)?
The Raft specs do not specify any measurement. Most implementations measure some aspects (e.g. with monotonic counters) and often send them off to a stats server; grep for 'stats' in hashicorp/raft, for instance.
If you mean 'how does the algorithm tell whether a node has errored' it simply looks for either the absence of a valid reply or heartbeat within a timeout, or the presence of an invalid reply or heartbeat (where 'valid' and 'invalid' are determined by the spec).
> 3) How does this scale? How many transactions can you do per second?
Raft communication is unicast. Under normal operation the leader makes RPC calls to the followers, so communication is O(n). As the cluster gets larger there is more chance of an individual node failing in a given time interval; however, it also takes more individual node failures to break a quorum. When individual nodes fail and come back online (assuming no re-election), their logs need to be 'caught up'. Failures of the network fabric will also have some relationship with the size of the cluster. There are some studies on Raft scaling in both papers - particularly as it affects election timeouts.
In general, the number of nodes is not going to be the key determinant of transactions per second. I suspect the practical limit in most scenarios is 'a few hundred', though doubtless the algorithm could be tuned. Many use cases with more 'nodes' have a small subset participating in Raft (and sharing the FSM) and a larger subset that are nodes but not Raft nodes (e.g. Consul).
The number of transactions per second is going to depend more on network latency and time to apply to the FSM than number of nodes. Remember that whilst each follower processes every log entry, they can do so in parallel.
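To illustrate the replication step, here is a hedged sketch (the function and the matchIndex values are hypothetical, not from any real implementation) of how a leader advances its commit index: an entry counts as committed once it is on a strict majority of nodes, leader included.

```go
package main

import (
	"fmt"
	"sort"
)

// commitIndex returns the highest log index replicated on a strict
// majority of the cluster, given the leader's own last index and the
// matchIndex reported by each follower.
func commitIndex(leaderLastIndex int, followerMatch []int) int {
	all := append([]int{leaderLastIndex}, followerMatch...)
	sort.Sort(sort.Reverse(sort.IntSlice(all)))
	// The median position (index n/2 when sorted descending) is the
	// highest index present on a strict majority of nodes.
	return all[len(all)/2]
}

func main() {
	// Hypothetical 5-node cluster: leader at index 7, followers at 7, 6, 3, 2.
	// Indices up to 6 are on 3 of 5 nodes, so 6 is committed.
	fmt.Println(commitIndex(7, []int{7, 6, 3, 2})) // 6
}
```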
> 4) What is the synchronization process? (Is it the leader who synchronizes and sends out a replicated log?)
Yes. This is clear from both raft papers - I suggest you read them in more detail.
> 5) In this mechanism the leader seems to have all the power, is this true and is that good for consensus as a whole? What if he's a malicious actor?
Yes, until an election occurs, at which point (in essence) the leader has no more 'power' than anyone else.
It's not bad for consensus: the whole point of an FSM is that if any actor applies the same sequence of commands to an FSM in the same initial state, they get the same result.
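A toy example of that determinism (a hypothetical key/value FSM, not from any real implementation): two independent 'nodes' replaying the same log always end in the same state.

```go
package main

import "fmt"

// kvFSM is a trivial deterministic state machine: a key/value map.
type kvFSM struct{ data map[string]string }

type entry struct{ key, value string }

func (f *kvFSM) apply(e entry) { f.data[e.key] = e.value }

// replay builds a fresh FSM and applies every log entry in order,
// exactly as a Raft node would.
func replay(log []entry) map[string]string {
	f := &kvFSM{data: map[string]string{}}
	for _, e := range log {
		f.apply(e)
	}
	return f.data
}

func main() {
	log := []entry{{"x", "1"}, {"y", "2"}, {"x", "3"}}
	a, b := replay(log), replay(log) // two independent "nodes"
	fmt.Println(fmt.Sprint(a) == fmt.Sprint(b)) // true: identical final state
}
```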
Raft itself provides few if any protections against malicious actors as far as I know.
> 6) How is access restricted for malicious actors? Can they be kicked off the network?
Raft does not specify a transport layer, and authn/authz is a matter for the transport layer. hashicorp/raft, from memory, uses TLS, as can etcd/raft. However, this is outside the Raft spec. By the time Raft gets the messages, it is assumed they have been authenticated/authorized.
> 7) Is there signature verification in place?
As per 6. Note that TCP itself will prevent most corruption and ensure retransmission, and most users use TCP (with or without encryption on top). Note also that Raft does not have to run over TCP (I submitted a UDP version, for instance, and running over, say, ZeroMQ would be quite feasible).
> 8) What is the cost to implement this? How much time does it take to implement?
> 9) How many nodes can you have on the network? Is it unlimited?
The protocol sets no limit. I'm not sure you'd want more than a few hundred. I suspect raft scalability would be a fruitful area for further academic study.
> 10) How many nodes need to validate a transaction? (%)
Logs need to be propagated to an absolute majority of the network (including the leader itself), i.e. strictly more than a half, i.e. >50%. 50% itself is insufficient.
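The arithmetic, as a small sketch (the function is mine, but the n/2 + 1 formula is standard):

```go
package main

import "fmt"

// quorum returns the strict majority of n voting nodes: n/2 + 1 with
// integer division. Exactly 50% is not enough.
func quorum(n int) int { return n/2 + 1 }

func main() {
	for _, n := range []int{3, 4, 5} {
		// Note that a 4-node cluster tolerates no more failures than a
		// 3-node one, which is why odd cluster sizes are preferred.
		fmt.Printf("n=%d quorum=%d tolerates %d failures\n", n, quorum(n), n-quorum(n))
	}
}
```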
> 11) Is the system hard coded, or built with code flexibility?
Implementation dependent. Many implementations are inherently flexible (see etcd/raft / hashicorp/raft). I'm sure it would be possible to design something gross that is hard coded.
> 12) - Fault tolerance levels? (How many nodes need to be compromised before everything is shut down?)
Compromised is the wrong word.
If half or more of the nodes are down, the cluster will no longer be able to commit new log writes. Reads from the FSM may still work if stale reads are permitted.
If you really meant 'compromised' in the sense of 'hacked', I suspect a single node being compromised would be sufficient to DoS the system without further protection.
> 13) - What is disaster recovery process?
Read the Raft papers searching for 'stable store' or similar. The whole point is that the Raft log items, once accepted, are guaranteed to be on stable storage. So in the event of a total outage, the cluster should just come back alive. Similarly, you only need one copy of the FSM to restore the whole lot.
The details of the process will depend both on the implementation and on the end user's operational procedures (e.g. how backups are taken). Note that the fact that Raft is designed to operate between data centers makes this easier. One minor challenge is that the state of the peers is itself in the FSM, so if you lose a lot of peers, adding peers back to form a quorum needs thought. From memory this is covered in Chapter 4.