hi raft-dev,
I've recently been having fun trying to learn raft by implementing it. it's fun!
cutting to the chase: it seems to me that you could simplify the practical implementation of raft, because it isn't actually necessary to have any state 'updated on stable storage before responding to RPCs' (in the words of the paper's figure 2).
The reason this is important to me is that the 'hard state' (specifically currentTerm and votedFor) that 'must be persisted' is actually quite tricky to persist correctly. the paper doesn't specify how the hard state is persisted, but my intuition is that it would not work well to keep it in a separate 'metadata' file alongside the log, because the filesystem may not flush/update the log and metadata files in order. so the obvious thing to do is to append it to the log file itself.
however, putting it in the log means that the log file will be different on different peers, because they may have different voting histories/timings. the log file thus becomes a mixture of 'real' log entries and 'occasional record of voting' entries, and the latter differ across peers. This is slightly undesirable from a complexity/backup/debugging point of view: it would be nice if, in a particular implementation of raft, the files holding logs on all machines could be byte-identical up to the agreed-upon commitIndex. In this email I propose a way to achieve this by getting rid of the hard state entirely, but it seems likely that I am missing something that makes this approach unsafe. hence, this email. apologies if this has been covered elsewhere, or is a mind-numbingly stupid proposal.
my understanding is that the reason for this state being 'hard' is entirely to prevent double-voting for a candidate within a term. the persisted 'currentTerm' field could be made 'soft' and initialized to zero, as it will rapidly be set to the leader's term. so the only really important state to persist is 'votedFor'. the bad scenario as I understand it: a follower could come up, be asked for a vote, cast its vote, go down, come back up, and get asked again for a vote within the same term. assuming it has incorrectly forgotten the vote it cast for this term, it re-casts it, causing the candidate to double-count the single peer's vote and incorrectly assume a majority. bad! so far, so 'by the book'.
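to make the scenario concrete, here is a minimal python sketch of it as I understand it. all names (Follower, request_vote, crash_and_restart) are made up for illustration and are not from the paper. it also assumes a strict follower that grants at most one vote per term, which is stricter than the 'null or candidateId' rule in figure 2, and a candidate that only counts 'granted' replies.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Follower:
    """Hypothetical follower that keeps currentTerm/votedFor in memory only."""
    name: str
    current_term: int = 0
    voted_for: Optional[str] = None

    def request_vote(self, term: int, candidate: str) -> bool:
        if term < self.current_term:
            return False
        if term > self.current_term:
            # Newer term: adopt it and forget any older vote.
            self.current_term = term
            self.voted_for = None
        if self.voted_for is None:  # assumption: one grant per term, no repeats
            self.voted_for = candidate
            return True
        return False

    def crash_and_restart(self) -> None:
        # Nothing was written to stable storage, so the vote is lost.
        self.current_term = 0
        self.voted_for = None


# A follower that stays up refuses the duplicate request.
steady = Follower("C")
first_try = steady.request_vote(term=5, candidate="A")
second_try = steady.request_vote(term=5, candidate="A")

# A follower that restarts in between grants the same request twice.
flaky = Follower("B")
granted = [True]  # candidate A's vote for itself
granted.append(flaky.request_vote(term=5, candidate="A"))
flaky.crash_and_restart()
granted.append(flaky.request_vote(term=5, candidate="A"))

votes = sum(granted)      # counts replies, not distinct voters
majority = 5 // 2 + 1     # five-peer cluster
```

here votes reaches the majority threshold of a five-peer cluster even though only A and B ever voted.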
observation 1: for the problem to occur, the RequestVote message has to be delivered to the crashing follower twice. raft as described makes no requirement for RPCs to be ordered/idempotent, so I guess this is possible. however, in an implementation of raft where the RPCs and their responses are ordered (eg over TCP connections), it can be arranged that the 'request for a vote' is never sent more than once to any given follower on any given TCP connection; so even if the connection 'flaps', we will never send RequestVote twice, and thus never get a double vote in response. that would remove the possibility of a double vote, and in turn remove the need to store votedFor persistently across runs of a peer.
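a rough python sketch of what I mean by observation 1, assuming 'once' means once per follower per term, and that the candidate itself stays up for the whole election. VoteRequester and the send callable are hypothetical names, standing in for whatever the transport layer looks like.

```python
class VoteRequester:
    """Hypothetical candidate-side guard: one RequestVote per (term, peer)."""

    def __init__(self, send):
        self._send = send      # assumed callable(peer, term) that writes to the socket
        self._asked = set()    # (term, peer) pairs already sent

    def ask(self, peer, term):
        key = (term, peer)
        if key in self._asked:
            return False       # already asked in this term: do not resend
        self._asked.add(key)
        self._send(peer, term)
        return True


sent = []
requester = VoteRequester(lambda peer, term: sent.append((peer, term)))
requester.ask("B", 5)
requester.ask("B", 5)   # connection flapped and reconnected: suppressed
requester.ask("B", 6)   # a later term is a new election, so it is allowed
```

after this runs, sent holds one request for term 5 and one for term 6.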
observation 2: even if we stick with an unordered RPC approach (or one with retries that happen below the raft layer), the candidate peer could store an explicit list of votes collected, rather than just a count, and screen out double votes that way - if it sees two votes from the same peer (that presumably crashed or flapped or had RPCs retried), it simply does not count them.
with either or both of these approaches, the need to persist votedFor (and any state other than the log) goes away. this seems simpler?
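a minimal python sketch of observation 2, assuming the candidate keeps a set of voter ids and counts a repeated vote from the same peer only once. Election and record_vote are hypothetical names, not anything from the paper.

```python
class Election:
    """Hypothetical candidate-side tally keyed by voter id, not a bare counter."""

    def __init__(self, term, self_id, cluster_size):
        self.term = term
        self.cluster_size = cluster_size
        self.voters = {self_id}   # the candidate votes for itself

    def record_vote(self, peer, term, granted):
        # Ignore refusals and replies that belong to another term.
        if granted and term == self.term:
            self.voters.add(peer)  # adding the same peer twice changes nothing
        return self.has_majority()

    def has_majority(self):
        return len(self.voters) > self.cluster_size // 2


election = Election(term=5, self_id="A", cluster_size=5)
election.record_vote("B", 5, True)
election.record_vote("B", 5, True)    # duplicate reply from B
won_early = election.has_majority()   # only A and B so far
election.record_vote("C", 5, True)
won = election.has_majority()         # A, B and C
```

the duplicate reply from B leaves the tally at two voters, and the election is only won once a third distinct peer grants its vote.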
I am most likely missing something since I am relatively new to raft and have only been learning-by-coding for a week or so. I'd appreciate any input or insight from the list. If I'm right, that would be a lovely simplification for my implementation!
many thanks for your time and for the algorithm!
alex