What if the majority fail?

101 views

Daniel Shen

Oct 20, 2017, 03:22:33
to raft-dev
Sorry if this is a beginner question.

I am designing a group computing system with N peers (N = n + e, e < n) that fail at an observed rate. Assume at most n - 1 peers can fail at the same time. Once the system has fewer than n peers, it restores the group size to n + e by recruiting new processes from a peer-to-peer network. Peers in a group need to deliver (or process) incoming requests in the same order. Also assume the only failure modes are node crashes, network latency, and temporary message loss; there are no network partitions.


Basically, this is an atomic broadcast problem, which can be reduced to a consensus problem. To maintain correctness, is it possible to use Raft for this problem, and how should I set the quorum size? Specifically:


1) Is it OK to change the quorum size from N/2 + 1 to n, since in my case up to n - 1 peers (rather than N/2) can fail simultaneously?

2) If the number of failures reaches or exceeds the quorum, is it still OK to use Raft by temporarily pausing consensus on new requests (i.e., letting the system become temporarily unavailable), waiting for the quorum to recover, and then resuming?

3) During the recovery process, is it possible to use Raft to agree on membership changes?

Archie Cobbs

Oct 20, 2017, 10:37:48
to raft-dev
On Friday, October 20, 2017 at 2:22:33 AM UTC-5, Daniel Shen wrote:
I am designing a group computing system with N peers (N = n + e, e < n) that fail at an observed rate. Assume at most n - 1 peers can fail at the same time. Once the system has fewer than n peers, it restores the group size to n + e by recruiting new processes from a peer-to-peer network. Peers in a group need to deliver (or process) incoming requests in the same order. Also assume the only failure modes are node crashes, network latency, and temporary message loss; there are no network partitions.


Basically, this is an atomic broadcast problem, which can be reduced to a consensus problem. To maintain correctness, is it possible to use Raft for this problem, and how should I set the quorum size? Specifically:


1) Is it OK to change the quorum size from N/2 + 1 to n, since in my case up to n - 1 peers (rather than N/2) can fail simultaneously?


If N = n + e and e < n then n > N/2 .. so n is already a majority... ?
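Archie's arithmetic here can be checked mechanically: with N = n + e and e < n, a group of n live peers always meets the standard Raft majority of floor(N/2) + 1. A quick brute-force verification:

```python
# Check: with N = n + e and e < n, n peers always form a Raft majority.
def majority(N):
    """Standard Raft quorum size for a cluster of N nodes."""
    return N // 2 + 1

for n in range(1, 50):
    for e in range(0, n):        # the constraint e < n
        N = n + e
        assert n >= majority(N), (n, e)

print("n >= N//2 + 1 holds for all n, e with e < n")
```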

However I'm uncertain about what you mean by "the system has less than n peers".

If you mean "how many peers are actually available at this moment", that is in some sense an unknowable question. You could however have nodes attempt to empirically guess at this (using ping timeouts, etc.) and then if they deem a node insufficiently available, initiate a Raft configuration change to remove that node from the cluster...  but of course then the value of N changes and you have a new, smaller cluster.
 
In other words, you can always ask "What is the value of N?" but you can't ask "How many nodes are currently failing?"

But you CAN ask "Are a majority of nodes currently functioning?" To ask that question, simply attempt to commit something. If the commit succeeds, the answer is definitely YES. If the commit fails, the answer is probably NO (it could be YES e.g. if the local node is the only one with a problem (e.g., cable unplugged)).
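That probe can be sketched against a hypothetical client API (`propose` is an assumed method here, not from any particular Raft library): attempt to commit a no-op entry and interpret success as proof of a live quorum.

```python
# Sketch of the "try to commit something" probe described above.
# `raft_client.propose` is a hypothetical API, not a real library call.
def quorum_probably_up(raft_client, timeout=1.0):
    """Try to commit a no-op entry through the Raft cluster.

    Returns True if the commit succeeded (a quorum is definitely alive),
    False on timeout (probably no quorum -- but possibly just a local
    problem, e.g. this node's cable is unplugged).
    """
    try:
        raft_client.propose(b"no-op", timeout=timeout)
        return True
    except TimeoutError:
        return False
```

Note the asymmetry Archie describes: True is a definite answer, False is only a probable one.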

2) If the number of failures reaches or exceeds the quorum, is it still OK to use Raft by temporarily pausing consensus on new requests (i.e., letting the system become temporarily unavailable), waiting for the quorum to recover, and then resuming?


Yes, this happens automatically. As long as a quorum is not present, it will be impossible to commit anything new.
 

3) During the recovery process, is it possible to use Raft to agree on membership changes?


Using the "one at a time" method, any node can unilaterally attempt to add or remove a single node to/from the cluster at any time. If the attempt is successfully committed, then Raft guarantees that the membership change has been safely applied.
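The safety of the one-at-a-time method rests on the fact that any majority of the old configuration and any majority of the new one (which differ by a single node) must share a member, so two leaders can never be elected for the same term. A brute-force check of that overlap property on small clusters:

```python
# Verify that when configurations differ by one node, every majority of
# the old configuration intersects every majority of the new one.
from itertools import combinations

def majorities(nodes):
    """All minimal majority subsets of a node set."""
    q = len(nodes) // 2 + 1
    return [set(c) for c in combinations(nodes, q)]

for size in range(1, 6):
    old = set(range(size))
    new = old | {size}            # add a single node
    for m_old in majorities(old):
        for m_new in majorities(new):
            assert m_old & m_new, (m_old, m_new)

print("every old/new majority pair overlaps")
```

By symmetry the same check covers removing a single node (swap the roles of old and new).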

-Archie

 

Daniel Shen

Oct 23, 2017, 02:46:45
to raft-dev
Thanks for your reply, Archie. Your answer really helps.

So, in Raft, membership monitoring is done through heartbeat exchange between a leader and followers?

Archie Cobbs

Oct 23, 2017, 10:03:30
to raft-dev
Hi Daniel,


On Monday, October 23, 2017 at 1:46:45 AM UTC-5, Daniel Shen wrote:
Thanks for your reply, Archie. Your answer really helps.

So, in Raft, membership monitoring is done through heartbeat exchange between a leader and followers?

Actually the purpose of Raft heartbeats is the maintenance of the (guaranteed unique) leader in the current term, and the triggering of a new election when that exchange breaks down.

Not sure what you mean by "membership monitoring". With Raft and schemes like it, you have to define these terms carefully before you can say anything definitive about them.

If by "membership monitoring" you mean "at any given time, what is the set of nodes that some node X believes is part of the Raft cluster?" then this is specified in the paper (namely: look at the most recent configuration change in the log, whether or not committed).
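That rule reduces to a simple backward scan of the log. The log and entry shapes below are illustrative, not from any real implementation:

```python
# A node's current configuration is whatever the latest configuration
# entry in its log says, committed or not (illustrative data shapes).
def current_config(log):
    """log: list of (entry_type, payload) tuples, oldest first."""
    for entry_type, payload in reversed(log):
        if entry_type == "config":
            return payload       # most recent config change wins
    return None                  # no configuration entry yet

log = [
    ("config",  {"A", "B", "C"}),
    ("command", "x=1"),
    ("config",  {"A", "B", "C", "D"}),   # not yet committed
    ("command", "x=2"),
]
assert current_config(log) == {"A", "B", "C", "D"}
```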

If by "membership monitoring" you mean "how many peers are actually available at this moment", then see previous response :)

-Archie

Daniel Shen

Oct 23, 2017, 23:43:07
to raft-dev
Hi Archie,

Sorry for the ambiguous terminology LOL. By "membership monitoring" I meant the first case you mentioned. Thanks for your answer.