Is log persistence still necessary if the Raft assumption is always true?


Jinkun Geng

Oct 7, 2021, 8:39:05 PM
to raft-dev

In the Raft paper, it is assumed:
"They are fully functional (available) as long as any majority of the servers are operational and can communicate with each other and with clients. "

In Diego's thesis,
Chap. 11.7.3, he discusses avoiding persistent storage writes:
''
The trade-off is that data loss is possible in catastrophic events. For example, if a majority of the cluster were to restart simultaneously, the cluster would have potentially lost entries and would not be able to form a new view. Raft could be extended in similar ways to support disk-less operation, but we think the risk of availability or data loss usually outweighs the benefits.
''

Diego argues that "persisting logs to disk" is more beneficial than the "diskless solution" used by other protocols, citing the example of "a majority being restarted simultaneously".
I agree that persistent logs can help avoid the example mentioned here. But Raft assumes "a majority is operational". If this assumption is true, then this example is out of scope, right? In that case, for the performance benefit, the diskless solution should be favored, right?

I understand we can find different practical scenarios to support different solutions. But since Raft has made the assumption "a majority is operational", it only guarantees safety under that assumption and is not obliged to make extra promises when the assumption is violated. Within this scope, is log persistence still necessary? In other words, if nobody blames the protocol when it goes wrong in the case of "a majority is restarted simultaneously", is there any other benefit to persisting the log to disk?

If I understand correctly, the disk write can constrain Raft performance because it stays in the critical path ("Raft’s RPCs typically require the recipient to persist information to stable storage, so the broadcast time may range from 0.5ms to 20ms, depending on storage technology."). Now my thinking is: if everybody agrees to suffer data loss when a majority fails, and nobody blames Raft in such cases, then Raft can move the disk write out of its critical path and use a background thread to write to disk.
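A minimal sketch of the idea (hypothetical, not from any real Raft implementation): entries are acknowledged from memory on the critical path, while a background thread flushes them to disk. Anything acknowledged but not yet flushed is lost on a crash, which is exactly the durability trade-off being discussed.

```python
import threading
import queue

class AsyncLogWriter:
    """Hypothetical sketch: acknowledge entries from memory and let a
    background thread flush them later. Entries acknowledged but not yet
    flushed are lost if the process crashes."""

    def __init__(self):
        self.log = []              # in-memory log (the critical path)
        self.flushed = []          # stand-in for the on-disk log
        self.pending = queue.Queue()
        self._t = threading.Thread(target=self._flush_loop, daemon=True)
        self._t.start()

    def append(self, entry):
        self.log.append(entry)     # critical path touches memory only
        self.pending.put(entry)    # the disk write happens later
        return len(self.log) - 1   # "ack" before the entry is durable

    def _flush_loop(self):
        while True:
            entry = self.pending.get()
            self.flushed.append(entry)   # real code would fsync here
            self.pending.task_done()

w = AsyncLogWriter()
for i in range(5):
    w.append("entry-%d" % i)
w.pending.join()                   # wait for the background flush
assert w.flushed == w.log          # eventually durable, but ack came first
```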



Archie Cobbs

Oct 7, 2021, 10:05:39 PM
to raft-dev
On Thursday, October 7, 2021 at 7:39:05 PM UTC-5 gjk...@stanford.edu wrote:
But Raft assumes "a majority is operational"

That's not strictly correct - Raft makes no assumptions about how many nodes are operational.

But it's also true that Raft only guarantees availability if a (strict) majority is operational.

If Raft assumed that a majority was always operational, then I would have to throw away my Raft cluster as soon as there were ever not a majority, because otherwise who knows what state it had got into?? There would be no guarantee of anything.

But since Raft has made the assumption "a majority is operational", it only guarantees safety under that assumption and is not obliged to make extra promises when the assumption is violated.

See above.

I think you are also mixing two different guarantees...

Raft guarantees consistency even if nodes are tardy in persisting their logs. But it doesn't guarantee durability in that case. This is the trade-off with doing that.

If you need ACI_ (but not ACID), then sure - take your time persisting the logs. But if you want the full ACID then nodes must actually persist their logs before telling the rest of the world that they have done so, or else you've created the possibility that committed transactions may "disappear" after a restart.

Note also the difference between the whole cluster suddenly restarting (e.g., power failure) vs. the whole cluster being destroyed (e.g., meteor strike).

Of course, in the meteor strike scenario it doesn't matter what's on your disk because the disk is gone anyway.

But in the restart scenario, if you're not guaranteeing durability then it's possible some client successfully committed a transaction just before the crash ("Yes! We have received your important payment!") but after restart it could turn out whoops, the transaction never happened ("Your important payment is overdue!").
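To make the trade-off concrete, here is a toy sketch (not real Raft, just an illustration of the payment scenario above): a node that acks before persisting can "forget" a committed entry across a simulated restart, while one that persists first cannot.

```python
class Node:
    """Toy model: volatile_log is lost on crash; only disk survives."""

    def __init__(self):
        self.volatile_log = []
        self.disk = []

    def commit(self, entry, persist_first):
        if persist_first:
            self.disk.append(entry)      # fsync before acking: durable
        self.volatile_log.append(entry)
        return "committed"               # the client sees success either way

    def crash_and_restart(self):
        self.volatile_log = list(self.disk)   # volatile state is gone

eager, lazy = Node(), Node()
eager.commit("payment", persist_first=True)
lazy.commit("payment", persist_first=False)
eager.crash_and_restart()
lazy.crash_and_restart()
assert "payment" in eager.volatile_log       # survives the restart
assert "payment" not in lazy.volatile_log    # "Your payment is overdue!"
```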

-Archie

Jinkun Geng

Oct 7, 2021, 10:17:19 PM
to raft-dev
Hi, Archie.
I am a bit confused with "Raft guarantees consistency even if nodes are tardy in persisting their logs. But it doesn't guarantee durability in that case."

Could you clarify more about "guarantee consistency but not durability"?

As a client, if I submit a request and the Raft replicas take a very long time to write to disk, and before replying to me there is suddenly a power failure, then when they all restart they forget my request. In this case, I think it violates neither durability nor consistency, because the leader never replied committed to me.

So can you give a more concrete example "guarantee consistency but not durability"?

Archie Cobbs

Oct 7, 2021, 10:28:49 PM
to raft-dev
On Thursday, October 7, 2021 at 9:17:19 PM UTC-5 gjk...@stanford.edu wrote:
I am a bit confused with "Raft guarantees consistency even if nodes are tardy in persisting their logs. But it doesn't guarantee durability in that case."

Could you clarify more about "guarantee consistency but not durability"?

As a client, if I submit a request and the Raft replicas take a very long time to write to disk, and before replying to me there is suddenly a power failure, then when they all restart they forget my request. In this case, I think it violates neither durability nor consistency, because the leader never replied committed to me.

So can you give a more concrete example "guarantee consistency but not durability"?

By "consistency" I mean that clients will see a consistent evolution of the log. For example, no client will ever append X and then Y, but then go back later to inspect the log and see Y appended before X. Raft's linearizability guarantee is still there... as long as the power doesn't fail.

In other words, imagine a Raft cluster that had no disks but just stored everything in memory. Then Raft's guarantee of linearizability would still hold. In fact everything that Raft promises would still hold... until the power failure.

I guess you could say that after the power failure Raft starts being inconsistent because previously committed data is no longer there. But usually I think people would refer to that as a durability problem, not a consistency problem. In other words, you're still getting a consistent view, it just happens to be from a long time ago :)

-Archie

Jinkun Geng

Oct 7, 2021, 10:40:04 PM
to raft-dev
As for the definition of consistency, you are saying "a consistent evolution of the log".
But when I checked another paper from John's group, https://www.usenix.org/system/files/nsdi19-park.pdf [see Appendix A]:
  • Consistency: if a client completes an operation, its result returned to an application remains consistent after server crash recoveries.

If we follow this definition, then consistency means "the consistency of the result before/after crash".
In this way, I feel durability becomes a necessary condition of consistency. For example, if x=0 initially, and a request "Write x=1" is committed and viewed by the world, but then lost during a failure (no durability), I will see the inconsistent result x=0.
Under this definition, if durability is not guaranteed, then consistency cannot be guaranteed either.

What do you think?

Jordan Halterman

Oct 7, 2021, 10:56:32 PM
to raft...@googlegroups.com
As Diego mentions, Raft was simply not designed for ephemeral logs. If you don’t persist logs to disk, you have to be okay with your cluster losing _all_ the logs even when only a majority (but not all) of the nodes go down. So, in a three node cluster, even losing only two nodes will mean you lose either consistency or data, and neither is typically acceptable for a strongly consistent database. 

The reason is that in Raft, consistency is guaranteed by a leader election protocol that compares candidate logs to guarantee the leader will have all prior committed entries, and the protocol depends on quorums to do that. To commit a change, the leader must verify the entry is stored on a majority of nodes. And to ensure a leader is elected with all prior committed entries, a candidate must receive votes from a majority of nodes. This intersection of the leader election and log replication protocols is how Raft ensures consistency. But if logs are not persisted, losing a majority of replicas means there are no log entries stored on a majority of nodes, and even a node with no entries in its log can receive votes from a majority of peers and be elected leader. To avoid this sort of catastrophic data loss following even a less-than-total loss of nodes, you really have to rethink the protocol IMO.

Even if you exclude candidates with empty logs, it’s impossible to use quorums to guarantee consistency after losing a quorum. Perhaps you could use periodic snapshots to limit the data loss. But you won’t be able to preserve the sorts of guarantees Raft was designed for without changing the protocol.
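A toy illustration of the quorum-intersection argument (ignoring terms, votedFor, and the rest of real Raft): with logs intact, any vote quorum intersects the commit quorum, so a stale candidate cannot win; if a majority restarts with empty logs, an empty candidate can.

```python
def up_to_date(candidate, voter):
    """Raft's vote check: the candidate's log must be at least as
    up-to-date as the voter's (compare last term, then log length)."""
    return candidate >= voter        # tuples (last_term, length)

def can_win(candidate_log, voter_logs, cluster_size):
    """Self-vote plus every voter whose check passes; needs a majority."""
    votes = 1 + sum(up_to_date(candidate_log, v) for v in voter_logs)
    return votes > cluster_size // 2

# 5-node cluster; an entry committed at (term=2, length=3) is on 3 nodes.
committed = (2, 3)

# Logs intact: a stale candidate (1, 1) faces three committed voters
# and cannot reach a majority -- the quorums intersect.
healthy_voters = [committed, committed, committed, (1, 1)]
assert not can_win((1, 1), healthy_voters, 5)

# Three nodes restart with empty logs (0, 0): an empty candidate plus
# two empty voters form a majority, and the committed entry is gone.
restarted_voters = [(0, 0), (0, 0), committed, committed]
assert can_win((0, 0), restarted_voters, 5)
```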

If you can tolerate that sort of data loss when a majority of Raft nodes goes down, I would suggest Raft is not what you should be using at all. You’d be better off with a protocol that doesn’t depend on quorums for consistency — e.g. a primary-backup protocol with n=f+1 replicas — using consensus only for coordinating the protocol.


Jinkun Geng

Oct 7, 2021, 11:46:18 PM
to raft-dev
It seems to be an open question, depending on what definition of "safety" we are using.
Here is a relevant discussion.
I think Henrik's comment gives a good summary.

"In both #2 and #3 the transaction is safe against any failure not
affecting the majority of nodes. In addition #2 is also safe against
some other failures, in particular catastrophic shutdown of the entire
cluster, as long as a majority of nodes can restart with the disk
intact. This additional safety comes with a relatively high
performance cost. I would expect at least a 50% performance overhead,
but in practice it often tends to be more. "

Whether or not to persist to disk is decided by whether we expect to protect against the "catastrophic shutdown of the entire cluster".

I checked a few academic papers, and it seems there is also a divergence.
But typical consensus protocols like to use the notation "2f+1", with f as the maximum number of tolerated failures.
I am just wondering: if catastrophic shutdown must be considered, then we must log every request to stable storage. In that case, is it still necessary to maintain the notation "f out of 2f+1"? Since we already have stable storage (at a high performance cost), even if all replicas fail, we can still recover. Then what is the significance of highlighting 2f+1 in papers?

Jordan Halterman

Oct 8, 2021, 6:36:28 AM
to raft...@googlegroups.com
Sure, a Raft cluster that persists its logs to disk can tolerate a loss of all the nodes without losing data. But Raft is a consensus protocol, and 2f+1 nodes represents what is required to achieve consensus, i.e. to reach agreement on a proposal(s). While logs will persist through the failure of f+1 nodes, the cluster won’t be able to make progress until a quorum can be formed again. In other words, the cluster cannot service linearizable reads nor commit new changes until f+1 replicas are available. That’s the significance of 2f+1 here: it’s an availability formula.
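The arithmetic behind the 2f+1 availability formula, as a small illustration:

```python
def quorum(n):
    """Smallest strict majority of an n-node cluster."""
    return n // 2 + 1

def max_failures(n):
    """f in the n = 2f+1 formula: how many nodes can be down
    while the rest can still form a quorum."""
    return n - quorum(n)

# 2f+1 as an availability formula: 5 nodes tolerate 2 being down...
assert quorum(5) == 3 and max_failures(5) == 2
# ...and adding a 6th node does not raise f, it only raises the
# quorum size -- which is why odd cluster sizes are preferred.
assert quorum(6) == 4 and max_failures(6) == 2
```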

Archie Cobbs

Oct 8, 2021, 6:51:24 AM
to raft-dev
On Thursday, October 7, 2021 at 9:40:04 PM UTC-5 gjk...@stanford.edu wrote:
As for the definition of consistency, you are saying "a consistent evolution of the log".
But when I checked another paper from John's group, https://www.usenix.org/system/files/nsdi19-park.pdf [see Appendix A]:
  • Consistency: if a client completes an operation, its result returned to an application remains consistent after server crash recoveries.

If we follow this definition, then consistency means "the consistency of the result before/after crash".
In this way, I feel durability becomes a necessary condition of consistency. For example, if x=0 initially, and a request "Write x=1" is committed and viewed by the world, but then lost during a failure (no durability), I will see the inconsistent result x=0.
Under this definition, if durability is not guaranteed, then consistency cannot be guaranteed either.

Yes I agree. I'm just throwing around these traditional terms casually and should probably leave it to the experts to give precise meanings.

Part of the problem is that terms like these are often defined with implicit reference to assumed implementation details. For example, Wikipedia says:

    Durability guarantees that once a transaction has been committed, it will remain committed even in the case of a system failure (e.g., power outage or crash). This usually means that completed transactions (or their effects) are recorded in non-volatile memory.

But what is a "system failure"? That's an implementation-dependent concept (so is "non-volatile memory" of course). From the client's perspective, who cares what kind of disaster is happening in the data center...? All the client cares about is what guarantees it can assume based on its interaction with the cluster.

By the way the same "assumed implementation" problem exists with SQL isolation levels, which were defined with implicit reference to "locking".

Ideally all such properties should be defined only in terms of what a client can see from the outside. If we did that, then surely a system that failed to appear "durable" from a client's perspective would also fail to appear "consistent" from a client's perspective.

-Archie
 

Freddy Rios

Jan 11, 2024, 10:48:22 AM
to raft-dev
The way I see it, if a node with an empty log manages to get a majority, some of the Raft properties would be lost. Specifically, the leader could accept new log entries with the current term and a log index that was already used by entries in the committed logs of other nodes. Raft assumes this can't happen, so surely it would end badly.

I have been thinking about this because I have a use case where an operational distributed system needs the type of guarantees Raft provides. However, if this distributed system goes down, we need to re-evaluate/re-create the state without using the previous state (we can't make any assumptions about what was being done previously, and need to figure out a new shared view of the world based purely on external data). A full power failure is easy, as we just start fresh, but it seems a failure of a majority of nodes would require some external mechanism that triggers a full reset of the whole cluster.
