Raft in-memory without storage


z g

Mar 1, 2024, 8:56:39 PM
to raft-dev
1) Is there an implementation of Raft that works in-memory? i.e., a Raft cluster with no disk storage (or access) that commits log entries to memory only? 
2) Would that be a valid implementation of Raft? 

i.e., are there edge cases that make durable storage for log entries an absolute requirement for a valid and complete Raft implementation?

Jinkun Geng

Mar 1, 2024, 9:26:47 PM
to raft...@googlegroups.com
Hi, zgeorge,

I think I know the answers to your two questions.

  1. Disk (stable storage) is a necessary requirement for Raft. That is to say, a standard Raft implementation must include disk writes. However, many open-source Raft implementations, especially those from commercial companies, such as NuRaft (https://github.com/eBay/NuRaft/issues/267), rely internally on proprietary stable storage that is not open-sourced, so the open-source version works in-memory. We should note that this is not the proper way to use Raft (even some top-tier papers disable disk writes during evaluation to report higher, prettier numbers).

  2. To answer your question of why stable storage is necessary for Raft: see the error trace I wrote in (the earlier version of) this paper, https://arxiv.org/pdf/2206.03285v1.pdf [Section K in the appendix].
  3. Actually (maybe with some of my personal biases), I have always thought diskless consensus protocols should be the future trend. Since Prof. Liskov's first attempt to make VR diskless, https://pmg.csail.mit.edu/papers/vr-revisited.pdf (though it was later found buggy), many researchers have spent huge effort on diskless consensus protocols, and I consider https://syslab.cs.washington.edu/papers/recovering-disc17.pdf a landmark paper. The concept in that paper may be a little abstract on a first read; in my technical report, https://arxiv.org/pdf/2206.03285.pdf [Section A in the appendix], I have provided concrete pseudo-code showing how to use the diskless technique (i.e., the crash vector).

Hope this is helpful to you.
Jinkun



Free Ekanayaka

Mar 2, 2024, 6:07:36 AM
to z g, raft-dev
If you don't persist entries to disk and a majority of your nodes goes
down, then you might permanently lose data.

If you can be absolutely certain that there's no way that a majority of
nodes goes down, then saving log entries only in-memory and not on disk
is safe.

However, I believe there are very few scenarios (if any) where the
latter is true.

Free

Jinkun Geng

Mar 2, 2024, 5:34:02 PM
to raft-dev
What Free says is definitely correct.
Actually, if you break the assumption that "a majority of replicas are alive", then almost all consensus protocols, such as Viewstamped Replication and Paxos-like protocols, will be broken.

But I want to point out that, even when the assumption "a majority of replicas are alive" holds, Raft can still violate correctness (whereas other diskless protocols are free from such errors). The root cause is Raft's election mechanism, which is quite different from Viewstamped Replication's: the correctness of Raft's election relies deeply on stable storage. Again, see Section K in the appendix of https://arxiv.org/pdf/2206.03285v1.pdf 


Jinkun 

Jinkun Geng

Mar 2, 2024, 5:39:06 PM
to raft-dev

Actually, a slightly divergent opinion on Free's argument: "If you can be absolutely certain that there's no way that a majority of nodes goes down... I believe there are very few scenarios (if any) where the latter is true."

I feel every user of a consensus protocol should believe "there's no way that a majority of nodes goes down", because that is the bedrock assumption under which a consensus protocol works, and the replication factor f is really a trade-off between cost and your confidence in the assumption "a majority will always be alive". If you are worried that you cannot guarantee that a majority out of 3 replicas (i.e., 2) will always be alive, then you pay for 5 replicas to increase your confidence. If you still don't believe that a majority out of 5 replicas (i.e., 3) will always be alive, then you continue to pay for 7 replicas to build your fault-tolerant system. That is the trade-off: the less confident you are in the assumption, the more replicas you pay for to relieve your worry when building your fault-tolerant system.

Doug Woos

Mar 2, 2024, 6:21:15 PM
to raft...@googlegroups.com
Jinkun,

Your error trace depends on a node restarting (and therefore losing
its state) and then rejoining the cluster and voting for a new leader.
Since it has lost its state, though, it should really be treated as a
new node (which could be implemented by generating a UUID when a node
starts up). So I'm not sure I agree that this is a fundamental issue
with Raft--you'd just need to do a reconfiguration in order for a
failed node to rejoin the cluster.

Doug

Jinkun Geng

Mar 2, 2024, 6:50:14 PM
to raft-dev
Thanks Doug.
 
Some useful information to supplement:
The more general term for this problem is "stray messages"; the name was given in TDSN'21 by the JPaxos team (https://ieeexplore.ieee.org/abstract/document/8758801), but the problem was first explicitly studied by the NOPaxos team (https://syslab.cs.washington.edu/papers/diskless-tr16.pdf).

Now let's continue the discussion:
> it should really be treated as a new node (which could be implemented by
> generating a UUID when a node starts up).

How should the other nodes distinguish whether a given node is a new node or not? You are saying that every node can generate a UUID when it starts up. Then how should the node notify the other nodes of its UUID? And after a node receives another node's UUID, how does the receiver know that the sender of that UUID is still alive? (For example, Node A sends a UUID, fails and reboots, then sends another UUID, then fails and reboots again and sends a third UUID; all these UUIDs are reordered in the network, and the receiver ends up with three UUIDs. How does the receiver know which UUID to keep as the record?)

Thinking along this direction, you will find the problem becomes cascadingly complex.

More importantly, Raft does not talk about how to design such a diskless mechanism. Also, you mention reconfiguration, but as you may have noticed in the Raft paper and thesis, its reconfiguration still relies on stable storage (a reconfiguration command is treated as a normal command to be committed and executed).
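
To make that last point concrete, here is a rough Go sketch of what "a reconfiguration is just another log entry" looks like (the field names are illustrative, loosely in the spirit of libraries such as etcd/raft, not an exact API): the membership change travels through the same replicated log that Raft expects to sit on stable storage.

```go
package raftlog

// EntryType distinguishes ordinary client commands from membership
// changes; both kinds travel through the same replicated log.
type EntryType int

const (
	EntryNormal     EntryType = iota // an ordinary client command
	EntryConfChange                  // an encoded cluster membership change
)

// Entry is what each replica is expected to persist before acknowledging.
type Entry struct {
	Term  uint64
	Index uint64
	Type  EntryType
	Data  []byte // for EntryConfChange, the new configuration
}
```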


If we really want to solve these questions, we need to think much harder and more thoroughly, and that is why many other works have appeared, such as Crash Vector, ViewSS/EpochSS, and Matchmaker (which tries to do fast reconfiguration) https://arxiv.org/abs/2007.09468
Again, please remember: none of these works is trivial, and none of them is included in the Raft design.

[Below may include some of my personal biases]
No doubt, Raft is a brilliant work, and it has even become the industry standard. But what makes me unhappy, and strikes me as unfair, is that since Raft came out it seems to have overshadowed all the other work on consensus, both past and present. Doug's reply reminds me of a similar discussion I had with some friends about a year ago.

Whenever I want to do some protocol design different from Raft, many people say, "Hey, Raft is already there, how could you beat it?" Whenever I reply, "Here is Raft's performance problem: it suffers from disk overheads, and the protocol becomes incorrect when you remove disks and keep everything in memory," people continue with, "That is not a big deal, you can easily fix it, for example with UUIDs." Then I ask the questions above: "What if these edge cases occur?" And my friends either become impatient or keep saying "it is not a big deal." Actually, if you really want to solve the "bypass the disk" issue, you have to spend much effort to keep correctness; that is why the other works (e.g., crash vectors) came out. However, nobody gives enough attention and credit to those works. Instead, many people tend to attribute their contributions to the Raft protocol, even though Raft does not consider these issues at all.

Doug Woos

Mar 2, 2024, 7:28:12 PM
to raft...@googlegroups.com
Jinkun,

I certainly don't want to dismiss any of the work on
diskless/non-crash-fail consensus! I guess maybe we're disagreeing
about what counts as "Raft?" The simplest solution to the UUID issues
you raise is for every node to include its UUID on every message it
sends; I wouldn't categorise this as a fundamental change to the
protocol (and if I tried to publish a paper about "diskless Raft"
where this was the only change, I think reviewers would likely reject
it!). This change is equivalent to making nodes crash-fail--a node's
identity, for all protocol purposes, is its UUID. My narrow claim is
that Raft does not in any meaningful way depend on persistent storage
for correctness (as opposed to performance). The original question was
whether one can have a valid implementation of Raft that does not rely
on persistent storage, so I think the answer is "yes!"

Doug

Jinkun Geng

Mar 2, 2024, 7:33:40 PM
to raft-dev
Now let's say we want to design a diskless Raft. Can we first try to address the following questions?

(1) How should a node notify the other nodes of its UUID?
(2) After a node receives another node's UUID, how does the receiver know that the sender of that UUID is still alive? (For example, Node A sends a UUID, fails and reboots, sends another UUID, then fails and reboots again and sends a third UUID; all these UUIDs are reordered in the network, and the receiver, say Node B, ends up with three UUIDs. How does Node B know which UUID to keep as the record?)

I would be very happy to hear your thoughts on these questions, which could be very inspiring for follow-up research.

Jinkun Geng

Mar 2, 2024, 8:26:50 PM
to raft-dev
Re: Doug's comments:

> This change is equivalent to making nodes crash-fail--a node's
> identity, for all protocol purposes, is its UUID.

I think the change should really be "making rebooted nodes remember that they have failed before" and "making alive nodes able to distinguish which incoming messages were sent by failed nodes (and are potentially no longer valid) and which were sent by alive nodes (and should be processed)".

A UUID can change every time a node reboots.

> The original question was whether one can have a valid implementation of Raft that does not rely on persistent storage, so I think the answer is "yes!"

If the answer is "yes", how do we easily make that change happen? i.e., how do we make rebooted nodes remember that they have failed before, and make alive nodes able to distinguish which incoming messages were sent by failed nodes (potentially no longer valid) and which were sent by alive nodes (which should be processed)?

Doug Woos

Mar 2, 2024, 8:32:06 PM
to raft...@googlegroups.com
Jinkun,

1) We're assuming that a node joining the cluster with a new UUID
triggers a reconfiguration. (This will require some special
intervention at the beginning of time, which is pretty common in
consensus-based systems--e.g., IIRC Zookeeper requires some special
startup config). So, on startup a node generates a UUID and sends
messages to the rest of the nodes in the cluster to say "hello, I am
node <UUID>, please let me join the cluster"; the leader attempts to
reconfigure the cluster on receipt of such a message.
2) I'm not sure I see the issue here. In such a scenario, the cluster
would initiate some reconfiguration attempts, some of which would
fail. If the node ever managed to start up with a UUID and stay up
through such a request, the cluster would eventually be reconfigured
to include that node and its UUID.
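
To sketch the shape of that handshake in Go (the names below, JoinRequest, Hello, and newID, are invented purely for illustration and not taken from any existing Raft implementation):

```go
package join

import (
	"crypto/rand"
	"encoding/hex"
)

// JoinRequest is the "hello, I am node <UUID>, please let me join the
// cluster" message; every subsequent message from this node would carry
// the same NodeID so peers know which incarnation sent it.
type JoinRequest struct {
	NodeID string // fresh identifier generated at process start
	Addr   string // where the new node can be reached
}

// newID plays the role of a UUID: random, different on every start.
func newID() string {
	b := make([]byte, 16)
	if _, err := rand.Read(b); err != nil {
		panic(err)
	}
	return hex.EncodeToString(b)
}

// Hello builds the join request a node sends right after startup; on
// receipt, the leader proposes an ordinary configuration change that
// adds NodeID to the cluster.
func Hello(addr string) JoinRequest {
	return JoinRequest{NodeID: newID(), Addr: addr}
}
```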

Doug

Doug Woos

Mar 2, 2024, 8:41:06 PM
to raft...@googlegroups.com
I'm willing to concede your point here, though, Jinkun--I do think
it's fair to say that implementing a version of Raft w/o persistent
storage requires some care in implementation in order to avoid either
liveness or safety issues vis-a-vis reconfiguration or the
participation of old nodes. I don't think these implementation issues
rise to the level of fundamental protocol changes (and any practical
implementation of a consensus protocol will require dealing with some
correctness-affecting "implementation details" that aren't typically
covered in detail in a high-level protocol description) but it's
certainly true that the algorithm described by the Raft paper assumes
persistent storage and so if one is looking for a diskless consensus
algorithm it probably makes sense to take a look at some of the more
recent work!

Doug

Free Ekanayaka

Mar 3, 2024, 5:46:37 AM
to Jinkun Geng, raft-dev
Hello Jinkun,

I agree with everything Doug said, including his final remark that if
one is looking for a protocol explicitly designed for diskless
operation, then Raft might not be the best choice.

However, I also want to highlight what Doug already said:

> The original question was whether one can have a valid implementation
> of Raft that does not rely on persistent storage, so I think the
> answer is "yes!"

And that's easy to do, without any real modification of the protocol, as
Doug himself noted.

Raft (like any other protocol) already requires a unique and stable
identity for each node. That identity must be included in each message
being sent, so the node receiving the message knows what node has sent
it.

The identity must be "stable" in the sense that if a node reboots, then
it must maintain the same identity it had before.

Each implementation is free to choose how to implement such a node
identity; it can be the machine's IP, the hostname, a UUID, or anything
else. The only requirement is that the identity is unique and stable
across reboots.

The critical point to keep in mind when implementing a diskless Raft is
this:

Nodes never actually reboot. After they crash, they always restart as
brand new nodes with brand new identities; they are effectively new
nodes.

For example, the identity could be implemented with a UUID which is
generated when the node process starts, and is different every time.

When a node starts again after a crash, it is not the same node anymore;
it is a brand new node with a new identity that is not yet part of the
configuration. The node that crashed effectively no longer exists, from
the point of view of the protocol. That means this newly started node
can't vote and can't become a leader: since it's not part of the
configuration, other nodes will not vote for it and will not count its
vote. Likewise, this new node does not count towards the commitment
quorum, since other nodes won't consider it when deciding whether a log
entry is committed.

So when a node starts again after a crash, a configuration change will
be needed in order to include it in the configuration. That is a normal
configuration change, already described in the Raft protocol, nothing
special.

This is the way Raft can be made diskless, maintaining safety and
correctness.
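
A minimal, self-contained Go sketch of that rule, assuming UUID-style identities and modelling the membership set as a plain map (illustration only, not real Raft code):

```go
package main

import (
	"crypto/rand"
	"encoding/hex"
	"fmt"
)

// newID stands in for a UUID: generated at process start, different
// every time the process starts.
func newID() string {
	b := make([]byte, 16)
	if _, err := rand.Read(b); err != nil {
		panic(err)
	}
	return hex.EncodeToString(b)
}

// config is the cluster membership: the set of node identities that
// count for votes and for the commitment quorum.
type config map[string]bool

func main() {
	cfg := config{}

	// First start: the node joins through an ordinary membership change.
	oldID := newID()
	cfg[oldID] = true
	fmt.Println("joined as:", oldID)

	// The process crashes and restarts. All in-memory state is gone, so
	// for protocol purposes this is a different node.
	restartedID := newID()
	fmt.Println("restarted as:", restartedID, "- member?", cfg[restartedID]) // false

	// Until a configuration change removes the dead identity and adds the
	// new one, the restarted process cannot vote and does not count
	// towards any quorum.
	delete(cfg, oldID)
	cfg[restartedID] = true
	fmt.Println("members after reconfiguration:", len(cfg))
}
```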

Hope that helps,

Free

Free Ekanayaka

Mar 3, 2024, 6:03:33 AM
to Jinkun Geng, raft-dev
For clarity, I will add this:

when choosing to implement node identities using UUIDs, every message
will contain the sender's UUID, and the configuration will also contain
the UUIDs.

No additional identity is needed (IP, hostname, or anything else);
there is nothing else identifying the node.

So for all intents and purposes: the UUID is the node. A new UUID means
a new node, and there is nothing special to do to "handle changing
UUIDs"; they are just new nodes.

z g

Mar 11, 2024, 9:10:13 AM
to raft-dev
Is it fair to say that some of the suggested approaches (e.g., UUIDs) MAY enable a diskless implementation of Raft, but that this isn't part of the Raft paper or of its correctness argument?
But if this approach (using fresh UUIDs for recovering or new nodes) were implemented, what role does quorum play in such an implementation? I am thinking of a recovery-from-failure scenario whereby nodes that failed (no disk) start back up and send out new UUIDs, but (say) fail after sending them, causing another instance to start up (ad nauseam, for the sake of argument) and send out its new UUID, etc.
How is quorum defined in a scenario where there are multiple instances that are attempting to start up and partially failing, but hanging around and sending out their UUIDs?
It feels like any candidate receiving votes in favor cannot rely on an upper bound on the number of active nodes in the cluster?

Free Ekanayaka

Mar 11, 2024, 10:24:06 AM
to z g, raft-dev
z g <zgeo...@gmail.com> writes:

> Is it fair to say that some of the suggested approaches (e.g., UUIDs) MAY
> enable a diskless implementation of Raft, but that this isn't part of the
> Raft paper or of its correctness argument?

The proposed approach IS pure Raft; there is no modification of the
protocol described in the Raft paper. Therefore it is correct.

> But if this approach (using fresh UUIDs for recovering or new nodes) were
> implemented, what role does quorum play in such an implementation? I am
> thinking of a recovery-from-failure scenario whereby nodes that failed (no
> disk) start back up and send out new UUIDs, but (say) fail after sending
> them, causing another instance to start up (ad nauseam, for the sake of
> argument) and send out its new UUID, etc.
> How is quorum defined in a scenario where there are multiple instances that
> are attempting to start up and partially failing, but hanging around and
> sending out their UUIDs?

This is a detail that I omitted for brevity. When a node fails, you
should first delete it from the configuration before restarting it with
a new UUID (i.e. its old UUID will not be part of the configuration
anymore).

This could be done for example by storing the old UUID to disk, and
that'd be the only bit of information that'd need to be persisted.

There might be other equivalent ways to achieve the same goal; the
important part is to delete failed nodes from the configuration.
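
A small sketch of that single persisted bit, in Go; the file name and the surrounding flow are assumptions for illustration only:

```go
package main

import (
	"fmt"
	"os"
)

// idFile is a hypothetical well-known location for the node's previous
// identity; it is the only piece of persisted state in this scheme.
const idFile = "node-id"

// previousID returns the identity the process had before it last
// stopped, or "" on first start (or if the file is unreadable).
func previousID() string {
	b, err := os.ReadFile(idFile)
	if err != nil {
		return ""
	}
	return string(b)
}

// rememberID records the identity we are about to use, so a future
// restart can ask for it to be removed from the configuration.
func rememberID(id string) error {
	return os.WriteFile(idFile, []byte(id), 0o600)
}

func main() {
	if old := previousID(); old != "" {
		fmt.Println("ask the leader to remove", old, "from the configuration")
	}
	id := "freshly-generated-uuid" // new identity on every start
	if err := rememberID(id); err != nil {
		fmt.Println("warning: could not persist identity:", err)
	}
	fmt.Println("request to join the cluster as", id)
}
```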

Free

Henrik Ingo

Mar 12, 2024, 3:04:04 AM
to raft...@googlegroups.com
Hi George

Fsyncing to disk is not necessary, but a few modifications to Raft are required to keep it safe in all scenarios. This configuration is in my experience quite common in practice but at least used to be woefully understudied in the literature.


I got help from Ganesan when writing it and really it's them who pointed out the cases to watch out for.

Henrik 


z g

Apr 11, 2024, 11:21:21 AM
to raft-dev
My original query on this thread was about a pure in-memory Raft implementation (thanks to you all for your responses).
A simpler version of that query is: can a Raft implementation use ephemeral storage (say, /tmp) and operate correctly?

Please consider this scenario:
  1. 3-node cluster: Node0, Node1, Node2.
  2. Node0 is the current leader. The cluster works as expected.
     - Logs and snapshots are being written on each node to /tmp (NOT durable).
  3. Node0 goes down.
  4. Node1 or Node2 takes over leadership.
     - The cluster continues working (as expected).
  5. Node0 starts back up and joins the cluster.
     - Its /tmp has been cleared.
     - All of its logs/snapshots have been cleared.
Would that matter to the correctness of the Raft algorithm?
Wouldn't Node0 be able to join even though all of its snapshots and logs were cleared?
i.e., as long as the number of failed nodes (using /tmp for storage) stays below 'f' nodes (with 2f+1 being the cluster size), wouldn't the algorithm work correctly?

Thanks again. 


Archie Cobbs

Apr 11, 2024, 3:51:33 PM
to raft-dev
Not sure if this is helpful, but here are some observations...

1. Don't ask about "Doing Raft with ephemeral storage" - there is no such thing and that is an undefined phrase. Raft nodes, by definition, have persistent storage. If a Raft node loses its memory then it's no longer a member of any cluster, instead it is (at best) a new, unaffiliated node. It would need to be added back using the normal configuration change protocol if you wanted it to rejoin that original cluster and still call your cluster a Raft cluster.

2. If your question is this: "OK but suppose I have nodes doing the Raft protocol but using ephemeral storage, do there exist some failure scenarios where linearizable consistency is maintained?" then the answer is yes.

3. If your question is this: "OK but suppose I have nodes doing the Raft protocol but using ephemeral storage, do there exist some failure scenarios where linearizable consistency is not maintained?" then the answer is also yes.

Example of #2 - your example.

Example of #3 - see other answers.

-Archie

z g

Apr 12, 2024, 2:26:37 PM
to raft-dev
Thanks. That helps. 

(https://kubernetes.io/docs/concepts/storage/ephemeral-volumes/)
How about this rephrasing...
If we define ephemeral storage as storage that does NOT persist across restarts, is that kind of storage "persistent enough" for a Raft node (which by definition requires persistent storage)?
From your observations #1 and #2, one could conclude that the cluster will continue to operate correctly with the remaining nodes (provided there are enough for a quorum) AND the recovering node can join back into the cluster so long as it is treated as a new Raft node.
Based on your observation #3, could one conclude that the failure scenarios where linearizable consistency is NOT maintained arise under the same conditions as with durable storage? I guess what I am looking for is: are there specific failure scenarios that occur because of using ephemeral storage, all else being equal?

Jinkun Geng

Apr 12, 2024, 3:48:28 PM
to raft-dev
Hi Zg.
> Based on your observation #3, could one conclude that the failure scenarios where linearizable consistency is NOT maintained arise under the same conditions as with durable storage? I guess what I am looking for is: are there specific failure scenarios that occur because of using ephemeral storage, all else being equal?
See this trace below.
[attachment: raft-error-trace.jpg]

At a high level, why does Raft go wrong when storage is not persistent?
Because with only ephemeral storage, Raft replicas will forget some messages they have already sent (i.e., stray messages). With such a "bad memory", leader election can go wrong, and the problematic leader then leads to incorrectness.

I think I mentioned this earlier in the thread, but the discussion got sidetracked.

Archie's answer is sharp and correct. If he had joined the discussion earlier, we would have saved much time on unnecessary and distracting discussion:
"Don't ask about "Doing Raft with ephemeral storage" - there is no such thing and that is an undefined phrase."

(Whenever we point out that Raft is not correct without persistent storage, somebody always jumps out saying "That is not a big deal, you can do XYZ to easily fix that". But in the end, nobody gives a concrete, complete, and proven solution for how to run Raft without stable storage. Actually, Diego and John clearly state in the paper and thesis that Raft should be equipped with stable storage. I don't understand why so many Raft adherents like to overclaim Raft's application scope.)


 

jianyu niu

Sep 4, 2024, 2:07:32 AM
to raft-dev
For a crashed node, we can assign it a new UUID and use reconfiguration to remove the node with the old UUID and add the new one. Another way is to recover its state from the other nodes. These two methods seem to achieve a similar goal. I wonder if there are intrinsic differences? Or, if we have a simple reconfiguration method, do we no longer need the complicated recovery mechanism?