fsync, crashes, and guarantees

Ramnatthan Alagappan

unread,

Jul 12, 2018, 12:54:06 PM7/12/18

to raft-dev

Hi,

I have a few questions regarding the need for synchronous fsync-s before acknowledging updates. I will first explain my understanding and then ask the questions.

My understanding: When clients perform updates to the state machine, at least a majority of replicas synchronously persist the updates to their on-disk logs by issuing fsync calls (possibly with batching to improve throughput). So in the event of crashes (even correlated ones), data will be durable and available once a majority of replicas have recovered from the failure. Further, according to chapter 6 in Diego's dissertation, the leader issues a round of heartbeats to confirm leadership before serving reads, ensuring that the reads are not stale. Also, duplicate requests are filtered to ensure exactly-once semantics. The combination of the above three mechanisms (synchronous persistence, heartbeats before reads, and duplicate-request filtering), enables Raft to provide linearizability.

While synchronous persistence provides strong durability guarantees, it does incur a cost (tens of ms on HDDs and around 100 us on flash). I am wondering what if an implementation does not issue fsync calls in the critical log-update path but performs them in the background. In such a system, a data loss will not be noticeable if a majority of nodes do not crash together. However, if a majority of nodes crash together and recover, the system may lose a few recent updates. I have two specific questions about such a system.

1. The system may serve stale data after failure recovery (because a few writes have been acknowledged, but they are now lost). I believe this behavior violates linearizability. Is this correct? I think the answer is yes because, for linearizability, operations need to take effect sometime between invocation and response; but here, in some sense, a few completed writes have lost their effects.

2. Is this behavior sequentially consistent? I am thinking the answer is yes because of the following (incorrect?) reasoning. The requirement for sequential consistency is that each client needs to see a monotonically progressing state machine (chapter 6, section 6.4). When a correlated failure occurs, I believe all client sessions need to be re-established, meaning that they are new clients. Now each client would observe a prefix of the state machine state (and monotonically progressing from there). Is this correct? Can such a system be considered to provide sequential consistency?

Thanks

Ram

Andy Schwerin

unread,

Jul 12, 2018, 8:29:28 PM7/12/18

to raft...@googlegroups.com

We've experimented with this a little in MongoDB. There are some challenging problems.

In cases where a majority of nodes fail in this mode, linearizability is violated because the system has suffered more than the tolerated number of failures.

Also, to your second question, treating re-established client sessions as new sessions to side-step the sequential consistency problem won't satisfy actual users. Their application didn't really crash, so they're probably not expecting the connection re-establishment to mean that they have effectively become a new client. Better to be up front with them about the kinds of anomalies they can experience in these scenarios.

-Andy

--
You received this message because you are subscribed to the Google Groups "raft-dev" group.
To unsubscribe from this group and stop receiving emails from it, send an email to raft-dev+u...@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Archie Cobbs

unread,

Jul 12, 2018, 9:57:18 PM7/12/18

to raft-dev

On Thursday, July 12, 2018 at 11:54:06 AM UTC-5, Ramnatthan Alagappan wrote:

1. The system may serve stale data after failure recovery (because a few writes have been acknowledged, but they are now lost). I believe this behavior violates linearizability. Is this correct? I think the answer is yes because, for linearizability, operations need to take effect sometime between invocation and response; but here, in some sense, a few completed writes have lost their effects.

I think it's worse than just a potential "rollback".

AFAICT the proof of Raft's correctness relies on the assumption that when a node says it has persistently appended X, Y, and Z to the log, then the only possible future scenarios after a crash are (a) the node can actually recover X, Y, and Z, or (b) the node is completely lost. There is no option for (c) the node can recover X and Y, but not Z.

So all bets are off. In particular I'm pretty sure it's possible (with the right sequence of failures) for your system to have positively committed log entries R, S, T, but then after the crash the log instead appears to contain committed log entries R, V, W. In other words, you don't just get a rollback, you get a completely different result.

-Archie

Oren Eini (Ayende Rahien)

unread,

Jul 13, 2018, 1:39:55 AM7/13/18

to raft...@googlegroups.com

Can you explain "linearizability is violated because the system has suffered more than the tolerated number of failures" ?

As I follow it, if you have more than allowed failures, the system _stops_, it doesn't violate any consistency rules.

Hibernating Rhinos Ltd

Oren Eini l CEO l Mobile: + 972-52-548-6969

Office: +972-4-622-7811 l Fax: +972-153-4-622-7811

To unsubscribe from this group and stop receiving emails from it, send an email to raft-dev+unsubscribe@googlegroups.com.

For more options, visit https://groups.google.com/d/optout.

--

You received this message because you are subscribed to the Google Groups "raft-dev" group.

To unsubscribe from this group and stop receiving emails from it, send an email to raft-dev+unsubscribe@googlegroups.com.

Oren Eini (Ayende Rahien)

unread,

Jul 13, 2018, 1:42:00 AM7/13/18

to raft...@googlegroups.com

We do durable writes for log appends, but another scenario that you have to handle in this case is the issue of restoring from backups.

In a failure scenario, you may get two node failures that you have to restore from backup.

In this case, you have a node that got updates, and committed them, and yet a majority of the nodes don't have it and actually elected a leader that doesn't include this node.

This is outside the scope of Raft, but something that should be handled as an application using it.

Hibernating Rhinos Ltd

Oren Eini l CEO l Mobile: + 972-52-548-6969

Office: +972-4-622-7811 l Fax: +972-153-4-622-7811

--

Andy Schwerin

unread,

Jul 13, 2018, 8:03:03 AM7/13/18

to raft...@googlegroups.com

I was thinking about recovery. What if those failures are transient? If nodes don't flush log writes to stable storage before acknowledging, then when those crashed nodes restart they may elect a leader who is missing writes because they had acknowledged and then lost.

If you don't flush writes, you need to treat all node failures as permanent, I believe. You have to treat a node restart like a membership change rather than just the end of a network partition.

To unsubscribe from this group and stop receiving emails from it, send an email to raft-dev+u...@googlegroups.com.

For more options, visit https://groups.google.com/d/optout.

--
You received this message because you are subscribed to the Google Groups "raft-dev" group.

To unsubscribe from this group and stop receiving emails from it, send an email to raft-dev+u...@googlegroups.com.

For more options, visit https://groups.google.com/d/optout.

--
You received this message because you are subscribed to the Google Groups "raft-dev" group.

To unsubscribe from this group and stop receiving emails from it, send an email to raft-dev+u...@googlegroups.com.

Юрий Соколов

unread,

Jul 13, 2018, 3:25:06 PM7/13/18

to raft-dev

Andy Schwerin is absolutely right:
Even if single follower lost previously acknowledged record, it could participate in election of other leader, who has never known about lost records.

Therefore, whole guarantee "committed once is committed forever" is lost in case of crash of single follower.

So, such follower had to be considered as lost forever. Cluster should be reconfigured to remove that follower from cluster, and to add it again as a new one.
Before follower will be added as new one, all previously acknowledged records will be inevitable replicated to remained majority.

пятница, 13 июля 2018 г., 15:03:03 UTC+3 пользователь Andy Schwerin написал:

Ensar Basri Kahveci

unread,

Jul 31, 2019, 4:40:41 AM7/31/19

to raft-dev

> Even if single follower lost previously acknowledged record, it could participate in election of other leader, who has never known about lost records.

This is a very important realization. Thanks for writing it down!

We have been discussing exactly the same topic and totally missed this point. We falsely thought that as long as a single node from the majority is alive, others can crash and restore with losing some acknowledged writes but they will always get those lost writes from the healthy node that has all log entries. But we totally missed the case that even a single node restart with missing acknowledged entries can create another "invalid" majority. Just for future readers, I would like to write down the simplest case:

[A, B, C] are group members and A is the leader. A commits an entry with B and C has no idea about this entry yet. If B crashes and comes back without this entry, B and C can elect a new leader and boom!

"Safe" rafting everyone...

Reply all

Reply to author

Forward