alternative way to avoid double voting within a term, thus avoiding the need for 'hard state'


Alex Evans

Oct 18, 2016, 5:37:16 PM
to raft-dev
hi raft-dev,
I've recently been having fun trying to learn raft by implementing it. it's fun!

cutting to the chase: it seems to me that you could simplify the practical implementation of raft, because it isn't actually necessary to have any state 'updated on stable storage before responding to RPCs' (in the words of the paper's figure 2).
The reason this matters to me is that the 'hard state' (specifically currentTerm and votedFor) that 'must be persisted' is actually quite tricky to persist correctly; the paper doesn't specify how the hard state is persisted, but my intuition is that it would not work well to keep it in a separate 'metadata' file alongside the log, because the filesystem may not flush/update the log and metadata files in order. so the obvious thing to do is to append it to the log file itself.

however, putting it in the log means that the log file will differ across peers, because they may have different voting histories/timings - and thus the log file becomes a mixture of 'real' log entries and occasional 'record of voting' entries, which are different on every peer. This is slightly undesirable from a complexity/backup/debugging point of view - it would be nice if, in a particular implementation of raft, the files holding the logs on all machines could be byte-identical up to the agreed-upon commitIndex. In this email I propose a way to achieve this by getting rid of the hard state entirely, but it seems likely I am missing something that makes this approach unsafe. hence, this email. apologies if this has been covered elsewhere, or is a mind-numbingly stupid proposal.

my understanding is that the reason for this state being 'hard' is entirely to prevent double-voting for a candidate within a term. the persisted currentTerm field could be made 'soft' and initialized to zero, as it will rapidly be set to the leader's term. so the only really important state to persist is votedFor. the bad scenario as I understand it: a follower could come up, be asked for a vote, submit a vote, go down, come back up, get asked again for a vote within the same term - and, having incorrectly forgotten its cast vote for this term, re-cast it, causing the candidate to double-count the single peer's vote and incorrectly assume a majority. bad! so far, so 'by the book'.

observation 1: for the problem to occur, the RequestVote message has to be delivered to the crashing follower twice. raft as described makes no requirement for RPCs to be ordered/idempotent, so I guess this is possible. however, in an implementation of raft where the RPCs and their responses are ordered (eg over TCP connections), it can be arranged that the request for a vote is never sent more than once to any given follower on any given TCP connection; so even if the connection 'flaps', we will never send RequestVote twice, and thus never get a double vote in response. that would remove the possibility of double-voting, and in turn remove the need to store votedFor persistently across runs of a peer.
observation 2: even if we stick with an unordered RPC approach (or one with retries that happen below the raft layer), the candidate peer could store an explicit list of votes collected, rather than just a count, and screen double votes that way - if it sees two votes from the same peer (which presumably crashed, flapped, or had RPCs retried), it simply counts them once.
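
to make observation 2 concrete, here's roughly what I have in mind (a Go sketch; all the names are made up, not from any real implementation):

```go
// Candidate-side tallying for observation 2: record *which* peers
// granted a vote, not just how many, so a duplicate RequestVote
// response (retried RPC, flapping connection, rebooted follower)
// can never be counted twice.
type Candidate struct {
	term    uint64
	granted map[string]bool // peer ID -> granted us a vote in this term
	quorum  int             // majority size
}

func newCandidate(term uint64, clusterSize int) *Candidate {
	return &Candidate{
		term:    term,
		granted: make(map[string]bool),
		quorum:  clusterSize/2 + 1,
	}
}

// onVoteResponse returns true once a genuine majority has been seen.
func (c *Candidate) onVoteResponse(peer string, term uint64, voteGranted bool) bool {
	if term != c.term || !voteGranted {
		return false
	}
	c.granted[peer] = true // idempotent: a duplicate vote changes nothing
	return len(c.granted) >= c.quorum
}
```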

with either or both of these approaches, the need to persist votedFor (and any state other than the log) goes away. this seems simpler?

I am most likely missing something since I am relatively new to raft and have only been learning-by-coding for a week or so. I'd appreciate any input or insight from the list. If I'm right, that would be a lovely simplification for my implementation!

many thanks for your time and for the algorithm!
alex




Archie Cobbs

Oct 18, 2016, 5:55:08 PM
to raft-dev
Hi Alex,


On Tuesday, October 18, 2016 at 4:37:16 PM UTC-5, Alex Evans wrote:
> my understanding is that the reason for this state being 'hard' is entirely to prevent double-voting for a candidate within a term. the persisted currentTerm field could be made 'soft' and initialized to zero, as it will rapidly be set to the leader's term. so the only really important state to persist is votedFor. the bad scenario as I understand it: a follower could come up, be asked for a vote, submit a vote, go down, come back up, get asked again for a vote within the same term - and, having incorrectly forgotten its cast vote for this term, re-cast it, causing the candidate to double-count the single peer's vote and incorrectly assume a majority. bad! so far, so 'by the book'.

There's also the possibility that the follower votes twice, but not for the same candidate, resulting in multiple leaders for the same term. This is the real concern...  and without the persistent state there's no easy way to detect/prevent this situation.

-Archie

Alex / Bluespoon

Oct 18, 2016, 6:08:22 PM
to raft...@googlegroups.com
aha yes, I knew I must be missing something! and there it is. thanks for the fast reply.
I've been browsing the list and copycat seems nice! where do you choose to persist the votedFor state in your implementation?
thanks again.


Alex Evans

Oct 18, 2016, 6:23:55 PM
to raft-dev
> There's also the possibility that the follower votes twice, but not for the same candidate, resulting in multiple leaders for the same term. This is the real concern... and without the persistent state there's no easy way to detect/prevent this situation.
 
thinking a little more, and at the risk of flogging a dead horse: might it be sufficient to initialize votedFor and currentTerm to 'don't know' on boot (rather than from stable storage); then, on the first RequestVote RPC observed after boot, a peer always votes negatively, and initializes votedFor and currentTerm from the RPC params :) in other words, it always lies once on boot, and casts a vote for 'nobody'.
maybe that is sufficient to remove the possibility of double voting, at the cost of delaying newly booted peers from contributing a positive vote for one extra election timeout. definitely a tradeoff, but the less persistent state, the better - I hate disks!

Archie Cobbs

Oct 18, 2016, 6:25:37 PM
to raft-dev, al...@bluespoon.com
Hi Alex,


On Tuesday, October 18, 2016 at 5:08:22 PM UTC-5, Alex / Bluespoon wrote:
> aha yes, I knew I must be missing something! and there it is. thanks for the fast reply.
> I've been browsing the list and copycat seems nice! where do you choose to persist the votedFor state in your implementation?

Jordan is the copycat guy.

My implementation is different from most - it's a transactional key/value database (link).

-Archie

Alex / Bluespoon

Oct 18, 2016, 6:29:53 PM
to raft...@googlegroups.com
> Jordan is the copycat guy.
> My implementation is different from most - it's a transactional key/value database (link).
ah, apologies. it's late! I will definitely check that out, looks interesting! I will hunt around to find how/where you safely store votedFor :) cheers.

Archie Cobbs

Oct 18, 2016, 8:43:36 PM
to raft-dev

I think that would be safe. But you'd be stuck in cases where half or more of the nodes crashed at the same time, and on the very first startup of your cluster, requiring manual intervention.

Regarding your original issue, I don't see separate files for the log and the meta-data as a big problem. There's no requirement in Raft that I know of that these two things must be updated atomically together. You can simply update and sync them independently when needed. The application of a single log entry must be atomic, of course (there's no such thing as half a log entry). Also don't forget about log compaction (aka state machine application), which does need to be atomic with respect to the log entries applied and the associated "last applied log index" and "last applied log term" state machine meta-data.
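
For example (a hedged sketch in Go, with invented names rather than my actual code), the meta-data can live in its own small file and be replaced atomically via the usual write-temp, fsync, rename pattern; votedFor is encoded as a numeric node ID purely for simplicity:

```go
import (
	"encoding/binary"
	"os"
	"path/filepath"
)

// saveHardState durably replaces the meta-data file, independent of
// the log. Because the temp file is fsynced before the rename, a
// crash leaves either the old state or the new one, never a torn
// mixture. (A fully defensive version would also fsync the containing
// directory after the rename.)
func saveHardState(dir string, currentTerm, votedFor uint64) error {
	buf := make([]byte, 16)
	binary.LittleEndian.PutUint64(buf[0:8], currentTerm)
	binary.LittleEndian.PutUint64(buf[8:16], votedFor)

	tmp := filepath.Join(dir, "hardstate.tmp")
	f, err := os.Create(tmp)
	if err != nil {
		return err
	}
	defer f.Close()
	if _, err := f.Write(buf); err != nil {
		return err
	}
	if err := f.Sync(); err != nil { // force the bytes to stable storage
		return err
	}
	return os.Rename(tmp, filepath.Join(dir, "hardstate"))
}
```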

-Archie

Doug Woos

Oct 18, 2016, 8:55:04 PM
to Alex Evans, raft-dev
Alex Evans <blue...@gmail.com> writes:
> thinking a little more, and at the risk of flogging a dead horse,
> might it be sufficient to initialize votedFor and currentTerm to 'dont
> know' on boot (rather than from stable storage); then on the first
> request vote RPC observed after boot, a peer always votes negatively,
> and initializes votedFor and currentTerm using the RPC params :) in
> other words, it always lies once on boot, and casts a vote for
> 'nobody'.

I'm fairly certain this would not be safe, since after it comes back the
node could receive its first RequestVote RPC from an arbitrarily old
term--say, from a node that was partitioned from the rest of the system
and therefore missed several elections. It would then initialize its
currentTerm and votedFor incorrectly, which could lead to double-voting.

Doug

Alex Evans

Oct 19, 2016, 4:22:19 AM
to raft...@googlegroups.com
thanks for all your replies pointing out the holes in my logic. i think my refined position is: raft requires that no peer votes twice in any term. it suggests implementing this by persisting votedFor before replying to a vote. i now believe that simply refusing to vote positively for any candidate for one max election timeout after boot is also sufficient to prevent an accidental double vote on crash recovery, and is simpler to implement.

doug: good point re arbitrarily old terms. i think my 'lie once' is thus insufficient. luckily the duration of all terms is bounded - by the max election timeout. so by sitting out votes for that short period on boot, we know that any candidate we may have forgotten we voted for will either have timed out their election or become leader via a legitimate majority.
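
to be concrete, the gate i have in mind is tiny (a Go sketch with made-up names; whether the timing assumption is truly safe is of course exactly what's at issue):

```go
import "time"

// Boot-time vote quarantine: until one full maximum election timeout
// has elapsed since boot, answer every RequestVote negatively. By the
// end of that window, any election this node voted in before crashing
// has either timed out or been won by a real majority.
type Node struct {
	bootTime           time.Time
	maxElectionTimeout time.Duration
	// ... currentTerm, votedFor, log, etc. ...
}

// mayGrantVote is consulted before any of the normal vote checks.
func (n *Node) mayGrantVote() bool {
	return time.Since(n.bootTime) >= n.maxElectionTimeout
}
```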

archie: i don't think it gets stuck in any case, since it reverts to normal behaviour after a fixed time (or, in the version you were replying to, after seeing one RequestVote RPC).

re atomicity:
quote: "Regarding your original issue, I don't see separate files for the log and the meta-data as a big problem. There's no requirement in Raft that I know of that these two things must be updated atomically together"

you're right, raft doesn't require the files to be synced. thanks! thinking more clearly in the morning, what it does require is that you never forget a cast vote, even in the case of a crash, so you must do something like fsync on your metadata before replying to a vote, or use some other completely stable storage. my claim is that this is actually still quite hard to come by, as an implementor; and i hope that simply waiting out votes for a short period on boot is easier & sufficient...

but as you all have so helpfully pointed out, i am usually wrong when it comes to this stuff. thanks for the discussion!
alex

Archie Cobbs

Oct 19, 2016, 9:50:36 AM
to raft-dev, al...@bluespoon.com
Hi Alex,


On Wednesday, October 19, 2016 at 3:22:19 AM UTC-5, Alex Evans wrote:
> doug: good point re arbitrarily old terms. i think my 'lie once' is thus insufficient. luckily the duration of all terms is bounded - by the max election timeout. so by sitting out votes for that short period on boot, we know that any candidate we may have forgotten we voted for will either have timed out their election or become leader via a legitimate majority.
>
> archie: i don't think it gets stuck in any case, since it reverts to normal behaviour after a fixed time (or, in the version you were replying to, after seeing one RequestVote RPC).

Sounds reasonable intuitively, but nailing down the details and proving it all correct in the face of arbitrary message drops, duplications, and delays is another thing altogether (worthy of a PhD dissertation in fact :)


> re atomicity:
> quote: "Regarding your original issue, I don't see separate files for the log and the meta-data as a big problem. There's no requirement in Raft that I know of that these two things must be updated atomically together"
>
> you're right, raft doesn't require the files to be synced. thanks! thinking more clearly in the morning, what it does require is that you never forget a cast vote, even in the case of a crash, so you must do something like fsync on your metadata before replying to a vote, or use some other completely stable storage. my claim is that this is actually still quite hard to come by, as an implementor; and i hope that simply waiting out votes for a short period on boot is easier & sufficient...

Doing an fsync() is not hard, you just have to know the right function/command to invoke in whatever language you're programming with.

Not only is it not hard, it's mandatory - if you can't guarantee something's been durably persisted, then you've broken an assumption on which the correctness of Raft is based. In particular, Raft assumes certain state transitions occur on a node prior to that node sending out certain messages. If you are going to add the new possibility that these state transitions can be reverted, then you've set yourself up for a vastly more difficult correctness proof.
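
Concretely, the ordering looks something like this Go sketch (the types are minimal and the helpers are stubs standing in for whatever durable write and log-comparison logic an implementation actually uses):

```go
type RequestVoteArgs struct {
	Term         uint64
	CandidateID  string
	LastLogIndex uint64
	LastLogTerm  uint64
}

type RequestVoteReply struct {
	Term        uint64
	VoteGranted bool
}

type Node struct {
	currentTerm uint64
	votedFor    string
}

// Placeholder stubs; real implementations go here.
func (n *Node) logUpToDate(lastIndex, lastTerm uint64) bool { return true }
func (n *Node) persistHardState() error                     { return nil }

// handleRequestVote makes the state transition durable *before* the
// reply leaves the node: a vote that hasn't survived an fsync is a
// vote that was never granted.
func (n *Node) handleRequestVote(args RequestVoteArgs) RequestVoteReply {
	if args.Term > n.currentTerm { // newer term: adopt it, forget old vote
		n.currentTerm = args.Term
		n.votedFor = ""
	}
	grant := args.Term == n.currentTerm &&
		(n.votedFor == "" || n.votedFor == args.CandidateID) &&
		n.logUpToDate(args.LastLogIndex, args.LastLogTerm)
	if grant {
		n.votedFor = args.CandidateID
	}
	if err := n.persistHardState(); err != nil {
		// fail closed: if we can't persist the vote, we must not grant it
		return RequestVoteReply{Term: n.currentTerm, VoteGranted: false}
	}
	return RequestVoteReply{Term: n.currentTerm, VoteGranted: grant}
}
```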

-Archie
 

Alex Evans

Oct 19, 2016, 11:41:09 AM
to raft-dev, al...@bluespoon.com
> Doing an fsync() is not hard, you just have to know the right function/command to invoke in whatever language you're programming with.
> Not only is it not hard, it's mandatory - if you can't guarantee something's been durably persisted, then you've broken an assumption on which the correctness of Raft is based. 

indeed! amen to those wise words. I am very familiar with fsync and the like, FWIW, having written a production-scale database with a few million users in the past. but there's always more to learn, and I am certainly not yet familiar with raft. :)

> In particular, Raft assumes certain state transitions occur on a node prior to that node sending out certain messages. If you are going to add the new possibility that these state transitions can be reverted, then you've set yourself up for a vastly more difficult correctness proof.

also indeed, although this whole thread started because I wanted to be more specific about 'certain state transitions' and 'certain messages'. upon actually attempting to implement raft, which after all touts having only 2 RPCs as one of its major selling points, it seemed to me that the only state transition that relied on stable storage *other than the log* was the suppression of double votes (and my (initially) faulty understanding of that) via stable persistence of votedFor. given that it seemed such an edge case, it felt wrong that persistent storage was necessary for this...

That is to say, 'certain state transitions... on certain messages' seems only to be 'deciding how to reply to a RequestVote RPC based on currentTerm and votedFor'. (the (many) other durability requirements are all on the log, which I am fine with investing lots of effort in, since the log is the central data structure).
I think at this point you've helped me hugely and I'm probably just taking up your time whipping a dead horse - but unless I figure out (or have pointed out to me) a counterexample (entirely possible!), adding a 'don't vote for any candidate for a short time on boot' rule does seem to remove the only need for stable storage of votedFor/currentTerm, which to me is a big 'ops' win (since the only durable state of a running raft cluster becomes the log itself, with no need for 'hard state' anywhere, and the log will be byte-identical on every peer, from offset 0 to commitIndex).

(*) of course, by sidestepping the edge cases of stable storage, the new edge case is that my delay might fail to provide the necessary safety if the clock of any machine during the vulnerable period mis-measures the election timeout relative to the other peers holding the elections. conversely, the original stable-storage requirement is essentially 'stable storage never fails', without specifying what stable storage actually is. I'd claim that on most linux filesystems, for example, even with fsync, a defensive implementation of vanilla raft would still need to handle the case that the 'metadata'/'hardstate' read on boot might be missing or corrupted - and the raft paper/proof makes no comment about what to do in this case, just as I make no comment about what to do in the case of gigantic-fleeting-clock-rate-differences-between-peers-during-node-boot. both are rare - fsync and disks because they're pretty good even under crash; delay-on-boot not just because clocks are good enough to tick at rates within a factor of 2 of each other over a few hundred milliseconds, but also because a rebooting node almost certainly takes time to do it - easily a few hundred milliseconds when loading a large executable - which means that in many practical scenarios, the election in which I might conceivably have voted before I crashed is extremely likely to be long gone. so in some sense, this is all moot / at the point of diminishing returns...

Archie Cobbs

Oct 19, 2016, 12:06:06 PM
to raft-dev, al...@bluespoon.com
On Wednesday, October 19, 2016 at 10:41:09 AM UTC-5, Alex Evans wrote:
>> In particular, Raft assumes certain state transitions occur on a node prior to that node sending out certain messages. If you are going to add the new possibility that these state transitions can be reverted, then you've set yourself up for a vastly more difficult correctness proof.
>
> also indeed, although this whole thread started because I wanted to be more specific about 'certain state transitions' and 'certain messages'. upon actually attempting to implement raft, which after all touts having only 2 RPCs as one of its major selling points, it seemed to me that the only state transition that relied on stable storage *other than the log* was the suppression of double votes (and my (initially) faulty understanding of that) via stable persistence of votedFor. given that it seemed such an edge case, it felt wrong that persistent storage was necessary for this...

You may very well be right that there is an optimization to be had here.

However I have been burned by my own "optimizations" that ended up breaking something inadvertently. There's a lot of subtlety to it all, and constructing a rigorous proof of correctness - or even a disproving counter-example - is not always easy, at least for me. So my advice is to... be careful :)

FWIW here's a good example of a simple optimization I did that broke something.

Many implementations have leaders send a "probe" AppendEntries message that doesn't contain any log entries to probe followers for their log state. Once the leader learns where the follower's log is (based on a positive reply), then it starts sending actual log entries to the follower.

The problem is that the AppendEntries also contains a leaderCommit which is a copy of the leader's commitIndex and which the follower is supposed to use to update its own commitIndex. However, the follower must not do this for probe messages, because the follower's log might have a conflicting log entry that won't get detected and removed (because the AppendEntries has no 'correct' log entry to compare it to). If you fail to take this into account, you get inconsistent logs on different nodes - oops!
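
The safe rule, which handles probes automatically, is the one in the paper's Figure 2: only advance commitIndex as far as the last index this particular AppendEntries actually verified. A Go sketch (field names assumed):

```go
type Node struct{ commitIndex uint64 }

// advanceCommit clamps leaderCommit to the last index this RPC
// verified: prevLogIndex plus however many entries it carried. For a
// zero-entry probe that is just prevLogIndex, so a conflicting entry
// beyond the verified prefix can never be committed by mistake.
func (n *Node) advanceCommit(leaderCommit, prevLogIndex uint64, numEntries int) {
	lastVerified := prevLogIndex + uint64(numEntries)
	if leaderCommit > n.commitIndex {
		if leaderCommit < lastVerified {
			n.commitIndex = leaderCommit
		} else {
			n.commitIndex = lastVerified
		}
	}
}
```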

-Archie

Alex Evans

Oct 19, 2016, 12:54:07 PM
to raft-dev, al...@bluespoon.com
ah, that tip is gold! thanks archie. I had also implemented the probe approach after reading (skimming, really) the etcd implementation. I'll look out for that once I get it running in anger :)
even for probes, I always send the previous term & log index, as per the paper's RPC spec, so even in the case of a probe with 0 entries there's always enough in the RPC parameters to determine if the logs match - there is always a log index & term to compare, in other words. if it matches, my empty AppendEntries probe currently carries on as if it were a normal one; in other words, it ends up truncating the log at the agreed-upon index without appending anything new. commitIndex is then updated, but bounded to the length of the receiver's log, and so it shouldn't move forward.

But in general I absolutely agree that this sort of stuff is sooooo subtle. I'll definitely heed your warnings :) 

I think the existence of good raft implementations and the clear paper are fantastic and rare (props to Diego et al) however it seems that the gap is slowly widening between practice and the paper on the front page, even between the thesis and the paper. 
I've noticed a few appearing several times on this list and as I felt around raft; not relevant directly to this thread, and you'll know these already, but since we're here, and you gave a good example 'tip', these ones felt crucial but well hidden, in case anyone in the future stumbles on this thread:
- you need to ignore RequestVotes if you've received a valid AppendEntries from a good leader within one minimum election timeout, to help with disruptive servers. this is mentioned somewhere in the thesis (client interactions?), and possibly in the paper, and definitely in the '4 modifications' paper linked on this list, but it's not front and center in the main algorithm description.
- a newly elected leader should immediately try to commit a no-op log entry in order to permit commitIndex to advance (it can only do so once an entry from the current term is committed; a sketch follows this list). again, not implemented in eg Raft Scope, and it is mentioned in various diffuse places (including the thesis and posts from diego several times) but not front and center in the core algo description AFAIK
- cluster membership changes seem to be an active area of research; I'd love it if the published raft paper made this clearer, and also made it clear that diego considers the newer, one-at-a-time approach (with the recent fix) a better option than the joint consensus that gets much more space in the paper, in the etcd implementation, etc
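
to make the no-op tip concrete, here's the sketch promised above (Go, names made up; broadcastAppendEntries is an assumed helper):

```go
type LogEntry struct {
	Term uint64
	NoOp bool
	Data []byte
}

type Node struct {
	currentTerm uint64
	log         []LogEntry
}

func (n *Node) broadcastAppendEntries() { /* replicate to all peers */ }

// becomeLeader: raft only lets a leader advance commitIndex via an
// entry from its *own* term, so a fresh leader appends a no-op right
// away to unblock commitment of everything before it.
func (n *Node) becomeLeader() {
	n.log = append(n.log, LogEntry{Term: n.currentTerm, NoOp: true})
	n.broadcastAppendEntries()
}
```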

all of which are minor niggles on an atypically well documented algorithm :) I wonder if a seasoned practitioner could one day write a follow-up paper on 'implementing a minimal raft in practice' - 'plank', maybe? - that could then be linked from the raft homepage, encapsulating in one place all these caveats to the original.
in the meantime, I'm enjoying playing 'hunt the experience gems' across the web, this list, and so on :). thanks!

John Ousterhout

Oct 19, 2016, 3:14:17 PM
to raft...@googlegroups.com, al...@bluespoon.com
I have only partially been following this discussion, so I may have missed something relevant, but I wanted to mention that one of our key design criteria for Raft was to avoid depending on time for safety. That is, no matter how confused clocks might get, the system will never perform incorrectly. Achieving availability inevitably depends on time (this has been proven by the FLP impossibility paper), and Raft depends on timeouts to detect server crashes and for other availability issues. But I believe it does not currently depend on time in any way for safety.

If I understand this proposal correctly, it would make Raft dependent on time for safety. You can decide whether or not you are comfortable with that...

-John-



Alex Evans

Oct 19, 2016, 3:38:48 PM
to raft-dev, al...@bluespoon.com
you are absolutely right, this proposal violates that design goal. I have an alternative in mind that does not, but it's not yet simpler than the original proposal of just using stable storage. if I ever get anywhere useful, I'll post here.
thanks for your time.

jordan.h...@gmail.com

Oct 19, 2016, 5:14:09 PM
to raft...@googlegroups.com
The term/votedFor state in Copycat is stored in a separate file.

Diego Ongaro

Oct 20, 2016, 12:14:09 AM
to raft...@googlegroups.com
This thread grows in words at an alarming rate. I agree fully with John above, and I want to add a couple more thoughts.

When it comes to writing to stable storage, I don't think the currentTerm+vote is a large burden, as Archie argued above. What worries me slightly more is removing conflicting uncommitted entries off the end of the log, which probably relies on ftruncate doing its job in combination with fsync. On the other hand, I don't know that this is a real problem, either.

Supposing you were worried about ftruncate, it might be wiser to append a marker to the log instead of removing any bytes. Then your logs would no longer be byte-wise identical across replicas, but with that objective off the table, you'd be free to put the currentTerm+vote there too. Maybe it's a good design point.
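
For example (a Go sketch; this record layout is invented for illustration, it is not LogCabin's), replay would honor the marker instead of relying on ftruncate:

```go
type recordKind uint8

const (
	recEntry    recordKind = iota // a normal log entry
	recTruncate                   // "logically discard entries after Index"
)

type record struct {
	Kind  recordKind
	Index uint64 // for recTruncate: keep entries up to and including Index
	Term  uint64
	Data  []byte
}

// replay rebuilds the in-memory log from on-disk records; a truncate
// marker rolls back the conflicting suffix without removing any bytes.
func replay(records []record) []record {
	var log []record
	for _, r := range records {
		switch r.Kind {
		case recEntry:
			log = append(log, r)
		case recTruncate:
			for len(log) > 0 && log[len(log)-1].Index > r.Index {
				log = log[:len(log)-1]
			}
		}
	}
	return log
}
```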

BTW, a description of LogCabin's storage implementation is https://logcabin.github.io/talk/#/segmentedstorage (press down, not right, to advance), and the code is https://github.com/logcabin/logcabin/blob/master/Storage/SegmentedLog.h and .cc.


Oren Eini (Ayende Rahien)

Oct 20, 2016, 2:52:20 AM
to raft...@googlegroups.com, al...@bluespoon.com
inline

On Wed, Oct 19, 2016 at 6:41 PM, Alex Evans <blue...@gmail.com> wrote:
>> Doing an fsync() is not hard, you just have to know the right function/command to invoke in whatever language you're programming with.
>> Not only is it not hard, it's mandatory - if you can't guarantee something's been durably persisted, then you've broken an assumption on which the correctness of Raft is based.
>
> indeed! amen to those wise words. I am very familiar with fsync and the like, FWIW, having written a production-scale database with a few million users in the past. but there's always more to learn, and I am certainly not yet familiar with raft. :)


That isn't actually inaccurate. 
Or, to be rather more exact, it is gaining safety by using a 5KG hammer.

Fsync will ensure the data is on disk (assuming the disk isn't lying), but it also has a REALLY expensive cost. 
In any real world system, the cost of fsync dominates pretty much anything else. 

In general, by the way, I wouldn't recommend writing your own storage system for a one-off implementation like Raft; you can use LevelDB, LMDB, SQLite or something similar that is focused on solving this problem, leaving you out of the gnarly details of stable storage.
 

Oren Eini (Ayende Rahien)

Oct 20, 2016, 2:54:30 AM
to raft...@googlegroups.com, al...@bluespoon.com
> (*) of course, by sidestepping the edge cases of stable storage, the new edge case is that my delay might fail to provide the necessary safety if the clock of any machine during the vulnerable period mis-measures the election timeout relative to the other peers holding the elections.


That is actually a really good point, fsync might take more than the election timeout, depending on system load and other various params.
 
> conversely, the original stable-storage requirement is essentially 'stable storage never fails', without specifying what stable storage actually is. I'd claim that on most linux filesystems, for example, even with fsync, a defensive implementation of vanilla raft would still need to handle the case that the 'metadata'/'hardstate' read on boot might be missing or corrupted - and the raft paper/proof makes no comment about what to do in this case,

That is something that you have to deal with anyway.
Assume that you wrote something to the disk, then called fsync, then crashed midway through. There is no guarantee that the data has been updated atomically on the disk. That is why I recommend using an existing storage solution, since they handle that.

Alex Evans

Oct 20, 2016, 6:09:37 AM
to raft-dev, al...@bluespoon.com
jordan: thanks for the datapoint. 
I am convinced by everyone's arguments that the simplest thing to do is to write the metadata to disk. thanks all!

diego: sorry for the excess of words :) I followed a similar segmented architecture to your logcabin one, thanks for the link. it seems that even if ftruncate fails and a log has a 'stale tail', it can reliably be detected (as I'm sure logcabin does) because (even ignoring checksum failures) the terms in the log will decrease at the point that needs truncating after recovery. eg the old log contains terms 112233, a peer tried to truncate at index 2 and append 4 to get 114, but crash! even if the truncate failed and the file contains eg 114233 (and it's more likely it will contain 114<checksum failure>), you can detect the point to truncate because the term goes from 4 to 2 in consecutive log entries. and, as you say, ftruncate is pretty reliable in the first place. raft is a beautiful thing!

ayende: 
> In general, by the way, I wouldn't recommend just writing your own storage system for a one off implementation like Raft, you can use LevelDB, LMDB, Sqlite or something like that that is focused on solving this problem and leaving you out of the gnarly details of stable storage.
ah, but where would the fun be in that?! good advice in general, but I prefer to build things up myself from lower level building blocks. agree it's not rational, but I find the (low level) journey at least as motivating as the destination :) right, let's get on with inventing some wheels...

Diego Ongaro

Oct 21, 2016, 9:49:59 PM
to raft...@googlegroups.com, al...@bluespoon.com
On Thu, Oct 20, 2016 at 3:09 AM, Alex Evans <blue...@gmail.com> wrote:
> jordan: thanks for the datapoint.
> I am convinced by everyone's arguments that the simplest thing to do is to write the metadata to disk. thanks all!
Yay!
 

> diego: sorry for the excess of words :) I followed a similar segmented architecture to your logcabin one, thanks for the link. it seems that even if ftruncate fails and a log has a 'stale tail', it can reliably be detected (as I'm sure logcabin does) because (even ignoring checksum failures) the terms in the log will decrease at the point that needs truncating after recovery. eg the old log contains terms 112233, a peer tried to truncate at index 2 and append 4 to get 114, but crash! even if the truncate failed and the file contains eg 114233 (and it's more likely it will contain 114<checksum failure>), you can detect the point to truncate because the term goes from 4 to 2 in consecutive log entries. and, as you say, ftruncate is pretty reliable in the first place. raft is a beautiful thing!

I don't think your claim about term numbers being out of order always holds. For a counterexample, see Figure 3.7 in my dissertation, where S1 in (c) has [1][2][4], then in (d1) has [1][3]. If that ftruncate somehow didn't happen, it might end up with [1][3][4], which looks monotonic. But yeah, maybe we can just rely on ftruncate to do its job :)
