recovering minority clusters from a network split

Anirban Rahut

Jan 9, 2015, 1:35:04 PM
to raft...@googlegroups.com
Hello all,

I would like to give administrators the ability to recover from a network split by using a force setConfiguration.

Let's assume we have two datacenters, with some nodes in datacenter A and some nodes in datacenter B, and
there is a network split. The admin is aware that the split won't heal for a day or so.

So they would like to reconfigure the cluster into two separate clusters. Cluster A might have quorum and cluster B
does not, but we would like both clusters to be manually overridden.

Any ideas?
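For concreteness, here is a toy sketch of what a forced setConfiguration override on a single node might look like. All names here are hypothetical, and a real implementation would also need fencing against the old configuration:

```python
# Toy model of a forced membership override on one node.
# Hypothetical API; real Raft implementations differ.

class Node:
    def __init__(self, node_id, peers, term=0):
        self.node_id = node_id
        self.peers = set(peers)   # current cluster membership
        self.term = term

    def quorum_size(self):
        return len(self.peers) // 2 + 1

    def force_set_configuration(self, new_peers):
        """Admin override: replace membership outside the normal
        joint-consensus path. Bumps the term so that messages from
        stale leaders of the old configuration are rejected."""
        self.peers = set(new_peers)
        self.term += 1

# A datacenter-B node, cut off from the 5-node cluster's majority:
b1 = Node("b1", {"a1", "a2", "a3", "b1", "b2"})
assert b1.quorum_size() == 3   # unreachable with only b1, b2 up

b1.force_set_configuration({"b1", "b2"})
assert b1.quorum_size() == 2   # quorum now attainable within B alone
```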


Kijana Woodard

Jan 9, 2015, 3:38:28 PM
to Anirban Rahut, raft...@googlegroups.com
What would the behavior of the two clusters be while the network split remained?
What will happen when the network split is resolved? Will the clusters merge? If so, how?


Anirban Rahut

Jan 9, 2015, 7:50:08 PM
to raft...@googlegroups.com, ara...@gmail.com
When the network is split, the admin will reconfigure the two datacenters into two separate clusters.
If there is a catastrophic failure in one datacenter, you don't have the ability to access those nodes,
so you will only configure datacenter B into a smaller cluster.

When the network heals, I realize that there can be a problem with the old cluster trying to contact the new cluster.
We can go two ways on this: make sure admins understand the risk and follow some specific steps
to prepare for a network heal, or do something in the Raft code which rejects messages from the old cluster.
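One common fencing technique for the second option (an assumption about how it might be done, not something Raft itself specifies) is to stamp every RPC with a cluster ID and drop messages whose ID does not match:

```python
import uuid

def make_rpc(cluster_id, term, payload):
    """Every outgoing RPC carries the sender's cluster ID."""
    return {"cluster_id": cluster_id, "term": term, "payload": payload}

def accept_rpc(local_cluster_id, rpc):
    """Drop any message stamped with a foreign cluster ID, so the
    old cluster cannot disturb the new one after the network heals."""
    return rpc["cluster_id"] == local_cluster_id

old_id = uuid.uuid4().hex
new_id = uuid.uuid4().hex   # assigned when the admin forces the split

assert accept_rpc(new_id, make_rpc(new_id, 7, "AppendEntries"))
assert not accept_rpc(new_id, make_rpc(old_id, 9, "AppendEntries"))
```

Note that the foreign message is rejected even though it carries a higher term; the cluster-ID check runs before any term comparison.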

Kijana Woodard

Jan 9, 2015, 9:55:39 PM
to Anirban Rahut, raft...@googlegroups.com
" or do something in the raft code which rejects messages from the old cluster."
What do you do about programs and people that have made decisions based on the data in the old cluster?

It sounds like you are emphasizing datacenter availability. Maybe you should have two clusters [one in each datacenter] all the time. If reconciling the two seems hard, it won't be easier in a crisis.

Do you have any numbers on how likely a datacenter split is to occur, how long it is likely to last, and what the cost of being down in the minority datacenter would be? Maybe it's worth giving clients links to both datacenters so they can always [probably] find the leader.

Fwiw, the problem with what you're proposing is the nodes in the minority datacenter won't be able to take cluster changes through raft since they, by definition, won't be able to have a leader. All you'd be able to do is start a new cluster and seed it with the existing data. Joining the two later, especially if log compaction had occurred, would be non-trivial.

Oren Eini (Ayende Rahien)

Jan 11, 2015, 6:29:29 AM
to Anirban Rahut, raft...@googlegroups.com
Let us assume that you are doing a hashtable, because that is the easiest model.

You have the current state:

A -> 1
B -> 1

Now you have a split, and you have two clusters.

On one side, A is set to 1+1, B is set to 1 - 1.
On the other side, A is set to 1+2, B is set to 1+3.

Now you have things healed. How are you going to deal with the merging of the data?
What about clients that can see both clusters, and may contact both?
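The divergence in this example can be stated in a few lines (a pure illustration of the scenario above, not a merge algorithm):

```python
initial = {"A": 1, "B": 1}

# After the split, each side applies its own writes independently.
side1 = dict(initial)
side1["A"] = 1 + 1
side1["B"] = 1 - 1

side2 = dict(initial)
side2["A"] = 1 + 2
side2["B"] = 1 + 3

# On heal, every key written on both sides is a conflict with no
# Raft-defined resolution; some external policy has to pick a value.
conflicts = {k for k in side1 if side1[k] != side2[k]}

assert side1 == {"A": 2, "B": 0}
assert side2 == {"A": 3, "B": 4}
assert conflicts == {"A", "B"}
```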



Anirban Rahut

Jan 14, 2015, 1:52:29 PM
to raft...@googlegroups.com, ara...@gmail.com
Hi Ayende,

I understand the merge issue. However, our use case for Raft is simpler. We use Raft for leader election, configuration changes,
and leader failover, not for actually sending client-driven log entries. This means the log only contains transitional
configurations, final configurations, and no-op entries.

Currently, the problem I am trying to work around is this:

There is no way I can do anything on the minority cluster because there is no leader.
Since there is no leader, I am even unable to add nodes to it to make it larger and regain quorum.

Would something like this work? Or maybe there is something simpler along similar lines.

1. Bring another node up.
2. Manually replicate all the log entries to this node.
3. Create an artificially high term number on this node.
4. Add two new log entries to this node's log: a transitional configuration and a final configuration with a smaller set of nodes.
5. Bootstrap this node as leader with this log, and then make it replicate these logs to the other nodes on the site.
Since the other nodes will have lower term numbers, they should accept these log entries and start working on the new configuration.
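The five steps above might look roughly like this, under the assumption stated earlier that the log holds only configuration and no-op entries (all names and the term offset are illustrative):

```python
def bootstrap_minority(surviving_log, old_term, new_members):
    """Steps 1-5: build a seed node whose log ends with a transitional
    and a final configuration for the shrunken cluster, at a term high
    enough that the surviving followers will accept replication."""
    forced_term = old_term + 1_000_000            # step 3: artificially high term
    log = list(surviving_log)                     # step 2: copied entries
    members = sorted(new_members)
    log.append(("config-transitional", forced_term, members))   # step 4
    log.append(("config-final", forced_term, members))          # step 4
    return {"term": forced_term, "log": log, "role": "leader"}  # step 5

# Log salvaged from a surviving datacenter-B node (configs and no-ops only):
old_log = [
    ("config-final", 3, ["a1", "a2", "a3", "b1", "b2"]),
    ("noop", 3, None),
]
seed = bootstrap_minority(old_log, old_term=3, new_members={"b1", "b2"})

assert seed["term"] > 3
assert seed["log"][-1] == ("config-final", seed["term"], ["b1", "b2"])
assert seed["role"] == "leader"
```

The danger, as noted elsewhere in this thread, is that nothing here stops the old majority side from also electing a leader; without some fencing (such as a cluster-ID check), the two sides will fight when the network heals.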


thanks for your suggestions.

Kijana Woodard

Jan 14, 2015, 2:17:27 PM
to Anirban Rahut, raft...@googlegroups.com
"We use raft for leader election, configuration changes and leader failover. Not for actually sending client driven log entries."

Why are clients contacting the cluster?
What is the cluster achieving outside of its own existence [leader & configuration]?

I think I'm misunderstanding either your terminology or your use case.

Have you considered having a "local datacenter cluster" *and* a "cross datacenter cluster"? What would that mean to your use case?