Hi,
Good question.
In this particular case, the system hangs, it does not proceed to next step.
Yes, this is not good, but more tolerable than promising client something and later on figure out it should have been revoked. (Give client a response while the system is not consences, a.k.a. correctness)
The raft algorithm explicitly assumes if you expect the system to be live, you should make sure you have majority of nodes up,
In practice, if you think yourself being a admin of a cluster of N machines. All machine failure occurs one by another in time axis, and you get noticed when each machine fails, by the time the failure node reach N/2 machines, you probably already recovered earliest failed node, this keep the total running, functional node count above N/2.
The risk of all node failed at almost the same time is extremely low, you should be OK with that.
In another word, if you care about this risk, you should invest more on hardware, reduce the failure rate of single node, and make it bearable,.
To give you a brief intro to some knowledge background you may want to study:
Under the theory of FLP,
You cannot build a distributed system in real life with all these 3 properties,
termination, agreement, and validity
Paxos/Raft system designers decided to sacrifice termination and retain other 2 properties.
If you are interested learn more on this topic, I encourage you to read these 2 papers.
FLP
Paxos paper
I quoted form it:
"Rather than sacrifice correctness, which some variants (including the one I described) of 3PC do, Paxos sacrifices liveness, i.e. guaranteed termination"
Please let me know if this makes sense to you :P
Thanks.
Cheng