Parallel universe due to amnesia when replacing dead nodes

Mike Percy

Jun 30, 2015, 3:08:58 PM
to raft...@googlegroups.com
I've come up with a somewhat nasty scenario involving dynamic config change and am wondering if others are handling this case. I don't think this particular situation is handled by the protocol because it violates the persistent storage requirement (although sometimes disks fail, so we have to deal with the possibility of amnesia).

Say I have a 3-node Raft cluster that I am rotating across 6 servers, using membership change to replace nodes as they fail. If nodes come back online with amnesia (say they had a catastrophic disk failure), it becomes easy to create a split-brain / parallel-universe scenario where there are two clusters with divergent histories. Let's assume we bring old nodes back with the same "ID" (e.g. hostname or UUID) and they don't remember anything in their logs or their votes. This is the scenario I am considering, where "L" means leader, "F" means follower, "down" means crashed, and blank means we don't care / not involved yet.

Node:     A       B       C       D       E       F
---------------------------------------------------
term 1:   F       F       L
          down    F       L
          down    F       L       F
          down    down    L       F       F
          down    down    down    F       F
term 2:   down    down    down    L       F
          down    down    down    L       F       F

At this point, say we reformat nodes B & C and bring them back online with an empty configuration so they can act as hot standbys as needed. Say we also find that we can just restart A (maybe it only had a kernel panic), so A does not have amnesia when it comes back online. In this scenario, A would believe that the first configuration is still valid and ask B & C for votes; B and C would vote to elect A, A would then catch B & C up to its own log, and now we have a split-brain scenario with two parallel clusters: {A,B,C} and {D,E,F}.

Assuming this is possible, I can think of the following approaches to preventing the issue:
  1. Never reuse instance IDs if you bring a dead box back, i.e. use some additional identifier for cluster members or use some kind of sequence number in addition to the hostname. Reject someone's vote / AppendEntries request if they call you by the wrong ID (see the sketch after this list). Also, never actively delete logs or election histories from a box when you take a node out of a cluster if you will ever put it back into that cluster without changing its ID.
  2. Have a monitor or "master" node that keeps track of all of the instances and kills nodes that come back online with old configurations. This is a reactive approach, and it's possible for the above config to be running for some amount of time before it is detected and deleted. It's also a parallel system that you have to keep running.
  3. Never bring servers online with an empty configuration (no logs, no term information). Always pull a snapshot from a machine in the latest configuration when spinning up a new box. I'm not convinced this isn't still racy and susceptible to this problem without quiescing though.
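
To make #1 a bit more concrete, here is a minimal sketch (Go, with made-up message and field names -- not from any particular implementation) of what the identity check at the RPC boundary might look like:

package main

import (
    "errors"
    "fmt"
)

// RequestVoteArgs is a hypothetical RPC payload; the only fields that matter
// here are the permanent identities involved.
type RequestVoteArgs struct {
    Term          uint64
    CandidateID   string // never-reused ID of the sender
    DestinationID string // ID the sender believes the receiver has
}

// A Node's ID is assigned once, stored alongside its log, and never reused.
type Node struct {
    id string
}

var errWrongID = errors.New("request addressed to a different node ID; rejecting")

// checkIdentity rejects any request that calls this node by an ID it does not
// own. A box that was wiped and re-provisioned gets a fresh ID, so peers that
// still remember the old ID cannot drag it back into a stale configuration.
func (n *Node) checkIdentity(args *RequestVoteArgs) error {
    if args.DestinationID != n.id {
        return errWrongID
    }
    return nil
}

func main() {
    n := &Node{id: "node-7f3a"} // fresh ID generated after the disk was replaced
    stale := &RequestVoteArgs{Term: 1, CandidateID: "A", DestinationID: "node-0001"}
    fmt.Println(n.checkIdentity(stale)) // prints the rejection error
}
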
Is anyone else dealing with this case? It's easier to imagine running into this if you have a component taking care of automatic provisioning and replacement of failed nodes. I'm not really satisfied with the ideas I've come up with so far for dealing with the issue, although I'm leaning towards implementing some version of #1.

Mike

Diego Ongaro

Jun 30, 2015, 3:51:01 PM
to raft...@googlegroups.com
Though I suspect the details will differ from system to system, I've also been thinking about this along the lines of #1.

Another idea, just to throw it out there, is to use randomized instance IDs, where servers would store their IDs along with their logs (fate-sharing), and if a server didn't have an ID, it would generate a random one (from a space large enough that collisions are unlikely).
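
A rough sketch of that fate-sharing, assuming the ID lives in a file (name made up here) inside the same data directory as the log, so wiping one wipes both:

package main

import (
    "crypto/rand"
    "encoding/hex"
    "fmt"
    "os"
    "path/filepath"
)

// loadOrCreateID reads the instance ID stored next to the Raft log. If the
// data directory was wiped (amnesia), the ID is gone too, so the server comes
// back with a brand-new identity instead of impersonating its old self.
func loadOrCreateID(dataDir string) (string, error) {
    idPath := filepath.Join(dataDir, "instance_id") // hypothetical file name
    if b, err := os.ReadFile(idPath); err == nil {
        return string(b), nil
    }
    buf := make([]byte, 16) // 128 random bits: collisions are negligible
    if _, err := rand.Read(buf); err != nil {
        return "", err
    }
    id := hex.EncodeToString(buf)
    // Persist (and ideally fsync) the ID before using it anywhere, so the ID
    // and the log truly share fate.
    if err := os.WriteFile(idPath, []byte(id), 0o600); err != nil {
        return "", err
    }
    return id, nil
}

func main() {
    id, err := loadOrCreateID(".")
    if err != nil {
        panic(err)
    }
    fmt.Println("instance ID:", id)
}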

Mike Percy

Jun 30, 2015, 5:26:42 PM
to raft...@googlegroups.com
I agree that automatically-generated randomized IDs per server should be sufficient to effectively isolate a machine that has amnesia, assuming that ID is validated at all communication points. We currently have this in the form of UUIDs but are not yet doing those kinds of validation checks.

There was a blog post from Flavio Junqueira (of ZooKeeper) last month discussing this topic as well: http://fpj.me/2015/05/28/dude-wheres-my-metadata/ -- he has some interesting ideas, but I'm not sure they're more practical than the unique-ID idea.

Regarding fate-sharing, the issue gets even more complicated when you consider that a server may have its WAL or state machine striped across multiple hard disks. The easiest way to maintain correctness and avoid split-brain in such a case is that any time corruption is detected on any disk, the whole box is brought offline, the disk(s) are replaced, and the entire machine is wiped, re-ID'ed, reintroduced, and re-snapshotted. However, with machines these days able to host 4x12TB of data per box or more, such an approach becomes an extremely expensive proposition at scale.

Mike

David Leon Gil

Jul 1, 2015, 2:09:47 AM
to raft...@googlegroups.com
Interestingly enough, this seems to be another fairly nice example of an ABA problem.

I think that option (1) is the same solution that I mentioned for supporting clients who want to implement a reliable FIFO buffer for commands which don't have ExactlyOnceRPC semantics.

(And I am curious whether implementing that option is *sufficient* -- if one wants -- to eliminate the need for sessions.)


- David

Oren Eini (Ayende Rahien)

Jul 1, 2015, 2:19:21 AM
to raft...@googlegroups.com
The way I handle that: each node has its own metadata (separate from the cluster config).
It generates a GUID on startup if one isn't there already, so an old node without history is a new node, even if it is on the old URL.

A more interesting issue is actually what happens when you restore from an old backup. That is a case where a node was basically time travelled, and its confirmed state on other machines has changed.

Hibernating Rhinos Ltd

Oren Eini | CEO | Mobile: +972-52-548-6969
Office: +972-4-622-7811 | Fax: +972-153-4-622-7811

Archie Cobbs

Jul 1, 2015, 3:05:51 PM
to raft...@googlegroups.com
The randomized, unique cluster ID is the approach I took in JSimpleDB.

Thought process...

The Raft algorithm talks about "the cluster" and implicitly assumes everybody knows who "the cluster" is.

In reality, lots of different clusters exist, and (a) mixing two nodes from two different clusters and/or (b) having a node "forget" its persistent state violates Raft in all kinds of bad ways.

In practice, this means we need some notion of cluster identity, and we need a way to (a) verify that identity and (b) invalidate that identity if a node "forgets" (or does anything else that violates a Raft assumption).

So the effect of wiping a node's persistent state should be that you have just permanently removed that node from its cluster... later, if/when you start the node back again, it cannot rejoin its cluster but instead becomes the first member in a brand new cluster.

The simplest way to implement all of this is with unique cluster ID's that are stored in persistent storage along with all the other persistent data, and have nodes reject messages with the wrong cluster ID.
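
A minimal sketch of what that might look like (Go, hypothetical names -- not JSimpleDB's actual code):

package main

import (
    "crypto/rand"
    "encoding/hex"
    "fmt"
)

// Envelope is a hypothetical wrapper stamped onto every Raft message.
type Envelope struct {
    ClusterID string
    // ... term, sender ID, RPC payload, etc.
}

// Server persists its cluster ID alongside its log and vote.
type Server struct {
    clusterID string
}

// newServer models a node that starts with no persistent state: it founds a
// brand-new cluster with a fresh cluster ID, so it can never rejoin (and
// corrupt) the cluster it used to belong to.
func newServer() (*Server, error) {
    buf := make([]byte, 16)
    if _, err := rand.Read(buf); err != nil {
        return nil, err
    }
    return &Server{clusterID: hex.EncodeToString(buf)}, nil
}

// accept drops any message stamped with a different cluster ID.
func (s *Server) accept(m *Envelope) bool {
    return m.ClusterID == s.clusterID
}

func main() {
    s, err := newServer()
    if err != nil {
        panic(err)
    }
    fmt.Println(s.accept(&Envelope{ClusterID: "some-old-cluster"})) // false: rejected
    fmt.Println(s.accept(&Envelope{ClusterID: s.clusterID}))        // true
}
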

-Archie

Mike Percy

Jul 2, 2015, 2:58:46 AM
to raft...@googlegroups.com
David Leon Gil <cor...@gmail.com> wrote:
I think that option (1) is the same solution that I mentioned for supporting clients who want to implement a reliable FIFO buffer for commands which don't have ExactlyOnceRPC semantics. 
I am curious whether implementing that option is *sufficient* -- if one wants -- to eliminate the need for sessions.

Hmm, I am not sure about that; there are a bunch of factors to consider depending on what you need to guarantee and for how long. I'm not trying to guarantee that level of request de-duping yet.

Oren Eini (Ayende Rahien) <aye...@ayende.com> wrote:
A more interesting issue is actually what happens when you restore from an old backup. That is a case where a node was basically time travelled, and its confirmed state on other machines has changed.

Yes, that is pretty scary, and if you literally mean a disk backup then I would hope you got the full state and not just a partial one. In that case I think that node should never be able to get elected leader (assuming this only happened on one box) and the leader should re-snapshot it. It could still be bad, though -- if a node forgot its votes then it could potentially vote differently multiple times in the same term, and you could end up with multiple leaders in the same term. The probability of this might be small, since restoring from a backup would usually take a long time. Still, I don't think we can ever safely fully restore from backup...

Archie Cobbs <archie...@gmail.com> wrote:
In practice, this means we need some notion of cluster identity, and we need a way to (a) verify that identity and (b) invalidate that identity if a node "forgets" (or does anything else that violates a Raft assumption).

So the effect of wiping a node's persistent state should be that you have just permanently removed that node from its cluster... later, if/when you start the node back again, it cannot rejoin its cluster but instead becomes the first member in a brand new cluster.

The simplest way to implement all of this is with unique cluster ID's that are stored in persistent storage along with all the other persistent data, and have nodes reject messages with the wrong cluster ID.

I'm confused by this approach. A GUID per node makes sense to me, but I'm not sure what purpose cluster IDs are serving, unless you literally mean a different database or some type of sharding (etcd raft calls this multi-node), which is clearly good for certain things, but not for this. I don't think cluster IDs help with safely adding a node back into the pool for its original cluster after a disk failure (assuming the implication is that all nodes in a cluster share the same cluster ID).

Mike

Archie Cobbs

Jul 2, 2015, 12:03:19 PM
to raft...@googlegroups.com
On Thursday, July 2, 2015 at 1:58:46 AM UTC-5, Mike Percy wrote:
Oren Eini (Ayende Rahien) <aye...@ayende.com> wrote:
A more interesting issue is actually what happens when you restore from an old backup. That is a case where a node was basically time travelled, and its confirmed state on other machines has changed.

Yes, that is pretty scary, and if you literally mean a disk backup then I would hope you got the full state and not just a partial one. In that case I think that node should never be able to get elected leader (assuming this only happened on one box) and the leader should re-snapshot it. It could still be bad, though -- if a node forgot its votes then it could potentially vote differently multiple times in the same term, and you could end up with multiple leaders in the same term. The probability of this might be small, since restoring from a backup would usually take a long time. Still, I don't think we can ever safely fully restore from backup...

Archie Cobbs <archie...@gmail.com> wrote:
In practice, this means we need some notion of cluster identity, and we need a way to (a) verify that identity and (b) invalidate that identity if a node "forgets" (or does anything else that violates a Raft assumption).

So the effect of wiping a node's persistent state should be that you have just permanently removed that node from its cluster... later, if/when you start the node back again, it cannot rejoin its cluster but instead becomes the first member in a brand new cluster.

The simplest way to implement all of this is with unique cluster ID's that are stored in persistent storage along with all the other persistent data, and have nodes reject messages with the wrong cluster ID.

I'm confused by this approach. A GUID per node makes sense to me, but I'm not sure what purpose cluster IDs are serving, unless you literally mean a different database or some type of sharding (etcd raft calls this multi-node), which is clearly good for certain things, but not for this. I don't think cluster IDs help with safely adding a node back into the pool for its original cluster after a disk failure (assuming the implication is that all nodes in a cluster share the same cluster ID).

The underlying issue here is that all of Raft's correctness is based on a node's persistent state never going backwards or being lost (the latter is really just a special case of the former).

For this reason, those situations (including "time travel") must be avoided at all costs -- unless you plan to write a new dissertation on Raft-supporting-time-travel. You're welcome to do that, but personally I'm just going to follow the existing rules :)

The cluster ID serves as a practical mechanism to prevent this from happening in the case of persistent state being lost (because the cluster ID will therefore also be lost).

But you are correct that it does not prevent disaster in the time travel case (fyi, this was never claimed).

If you wanted to guard against time travel, you would have to add an additional mechanism. It seems like it would be hard because there's a chicken-and-egg situation: consensus algorithms are based on reliable node persistence, yet you would need some kind of consensus algorithm to verify the correctness of a node's persistence...

-Archie

 


Mike Percy

Jul 2, 2015, 1:31:15 PM
to raft...@googlegroups.com
On Thu, Jul 2, 2015 at 9:03 AM, Archie Cobbs <archie...@gmail.com> wrote:
If you wanted to guard against time travel, you would have to add an additional mechanism.

Right. Assigning a new ID to the node maintains correctness and also lets you safely reuse the box, since it has a new identity. I'm just saying that there's no need to isolate the box forever.

Mike

Kijana Woodard

Jul 2, 2015, 2:11:23 PM
to raft...@googlegroups.com
A simple solution would be to exclude the node ID from backups.

On a simple crash/stop -> restart, everything works normally.
On restore from backup, the node must get a new ID and be added back to the cluster.
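
A sketch of that exclusion (Go), assuming the node ID sits in its own file -- the name here is made up -- inside the data directory:

package main

import (
    "fmt"
    "io/fs"
    "path/filepath"
)

// backupPaths walks the data directory and returns everything that belongs in
// a backup, skipping the node's identity file. A restore therefore comes up
// with no ID, generates a fresh one, and has to be added back to the cluster
// instead of impersonating its pre-backup self.
func backupPaths(dataDir string) ([]string, error) {
    var paths []string
    err := filepath.WalkDir(dataDir, func(path string, d fs.DirEntry, walkErr error) error {
        if walkErr != nil {
            return walkErr
        }
        if d.IsDir() {
            return nil
        }
        if d.Name() == "instance_id" { // hypothetical identity file
            return nil // deliberately excluded from the backup
        }
        paths = append(paths, path)
        return nil
    })
    return paths, err
}

func main() {
    paths, err := backupPaths("/var/lib/raft") // hypothetical data directory
    if err != nil {
        fmt.Println("walk error:", err)
        return
    }
    fmt.Println("files to back up:", paths)
}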

Archie Cobbs

Jul 2, 2015, 2:32:37 PM
to raft...@googlegroups.com

By my definition, if you give it a new identity then it's a different node - from Raft's perspective at least. So we are saying the same thing.

-Archie

Archie Cobbs

Jul 2, 2015, 2:36:21 PM
to raft...@googlegroups.com
On Thursday, July 2, 2015 at 1:11:23 PM UTC-5, Kijana Woodard wrote:
A simple solution would be to exclude the node ID from backups.

On a simple crash/stop -> restart, everything works normally.
On restore from backup, the node must get a new ID and be added back to the cluster.

Yep, that's a nice simple solution that seems like it would work.

-Archie

Mike Percy

Jul 2, 2015, 3:24:55 PM
to raft...@googlegroups.com
Yeah, this is a nice practical workaround; thanks for the idea, Kijana.

Mike
