I am designing a group computing system with N peers (N = n + e, e < n), where peers fail at an observed failure rate. Assume that at most n-1 peers can fail at the same time. Once the system has fewer than n peers, it restores the group size to n + e by recruiting new processes from a peer-to-peer network. Peers in a group need to deliver (or process) incoming requests in the same order. Also assume that the only faults are node crashes, network latency, and temporary message loss; there are no network partitions.
Basically, this is an atomic broadcast problem, which can be reduced to a consensus problem. To maintain system correctness, is it possible to use Raft for this problem? How should I set the quorum size in this case? Specifically,
1) Is it OK to change the quorum size from (N/2+1) to n, since in my case up to n-1 peers, rather than N/2, could fail simultaneously?
2) If the number of failures reaches or exceeds the quorum, is it still OK to use Raft by temporarily pausing consensus (i.e., letting the system become temporarily unavailable for new requests), waiting for the quorum to recover, and then resuming?
3) During the recovery process, is it possible to use Raft for membership change agreement?
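To make question 1 concrete, the arithmetic behind the two quorum choices can be sketched as below. This is only a sketch under the assumptions stated above (crash faults only, a fixed group of N = n + e peers with e < n). The function names and the example values n = 5, e = 2 are hypothetical and are not part of Raft or any Raft library.

```python
def majority(group_size: int) -> int:
    """Majority quorum size: floor(N/2) + 1."""
    return group_size // 2 + 1


def quorums_intersect(group_size: int, quorum: int) -> bool:
    """Any two quorums of this size share at least one peer iff 2 * quorum > N."""
    return 2 * quorum > group_size


def tolerated_crashes(group_size: int, quorum: int) -> int:
    """How many peers may crash while a full quorum is still alive."""
    return group_size - quorum


# Hypothetical example values, chosen only for illustration (e < n holds).
n, e = 5, 2
N = n + e

for label, q in (("majority", majority(N)), ("quorum = n", n)):
    print(label, q, quorums_intersect(N, q), tolerated_crashes(N, q))
```

With these example values both quorum sizes keep any two quorums overlapping, but a quorum of n stays available through only e = 2 crashes, whereas the majority quorum of 4 stays available through 3.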
Thanks for your reply, Archie. Your answer really helps.
So, in Raft, is membership monitoring done through heartbeat exchanges between the leader and the followers?