quorum-based split brain resolution

1,063 views
Skip to first unread message

shikhar

unread,
May 5, 2014, 2:07:16 AM5/5/14
to akka...@googlegroups.com
I have been hacking on a discovery plugin for elasticsearch using akka cluster and I wanted to add some automated downing, and the auto-down-unreachable-after is not really an option since it can lead to split brain.

So I went with the approach of using a quorum of members to determine whether the unreachable node should be downed. I'm curious to hear what you think of this.


1. The VotingMembers passed in the constructor are the seed nodes. Using seed nodes was just an easy choice since they are specified before-hand. So ideally there should be 3 or more seed nodes.

2. I am using an app-level ping layer on top of the UNREACHABLE events. When a ping request to an unreachable node, made via the seed nodes "affirmatively times-out" (i.e. they must explicitly return a timeout response rather than the ping request timing out, so that we don't consider an unreachable seed-node as a voter!), then we DOWN that unreachable node. Instead of these app-level pings maybe it makes sense to utilize the Akka private[cluster] metadata like Reachability.isReachable(observer, node) but I'm not entirely sure of the semantics.

3. Currently this QuorumBasedPartitionMonitor actor gets started on every seed node. So in case a member becomes unreachable, they'd all end up trying to arrange for a distributed ping to the unreachable node via one another, and possibly downing it. This seems a bit like a thundering herd so not ideal. But on the other hand I don't want to use a cluster-singleton because this partition resolver is trying to be the layer that allows for singleton failover to happen smoothly. I'd love to hear ideas on how to handle this better.

4. Maybe a generic solution for quorum-based partition resolution should be a part of Akka proper/contrib? It seems AutoDown is rarely a good answer.

Akka Team

unread,
May 6, 2014, 2:28:03 PM5/6/14
to Akka User List
Hi Shikhar,

thanks for sharing!

There are many possible ways of dealing with partitions in a distributed system, and it depends very much on the use-case what the best solution is. One fundamental choice you need to make is whether you can live with split brain scenarios, or whether you can tolerate unavailability; you cannot exclude both, no matter what you do. To demonstrate this consider a rule that a still connected subset of the cluster must have more than N members in order to continue, which means that it will shut itself down if it has less than this quorum. Then you can configure your cluster with M nodes such that N>0.5•M and you can be sure that split-brain scenarios are excluded. The price is that a three-way split can kill all three parts, shutting the whole cluster down. The same reasoning applies to referee schemes (i.e. subset continues if it contains a designated node, in which case the death of this node will kill the whole cluster, and there are countless variations on this scheme).

OTOH you could have a market place whose foremost requirement is availability; in this case you will have to tolerate the (temporary) split into multiple disconnected market places in order to “guarantee” availability (well, there is no such thing, really).

So, when it comes to Akka Cluster in particular, you have a few choices (and endless refinements):
  • implement a auto-downing scheme (using voting or quorum or whatever you like) that prevents split brain
  • implement an auto-downing scheme that limits split brain but aims at availability (e.g. just downing when up to N nodes are unreachable, e.g. N=1 to allow individual failures)
  • implement aggressive auto-downing for high availability while tolerating split brain (needs human oversight to guard against edge cases)
  • do not implement auto-downing and have a 24/7 ops team run the system manually (the cost—while significant—may well be justified in some cases)
  • do not use downing (apart from manually removing failed nodes) and rely on CRDTs (or equivalent) for synchronizing state between nodes, so that after the partition everything heals back together
I probably forgot some possibilities, but this can serve as a starting point. And it should explain why Akka does not ship with “sophisticated” schemes at this point: the use-cases are too diverse. So, for now we cover only the most primitive ways of handling partitions, and we will eventually add algorithms which have proven worth their salt (or SLOC) in production—hopefully with the help of our wonderfully smart community ;-)

Regards,

Roland



--
>>>>>>>>>> Read the docs: http://akka.io/docs/
>>>>>>>>>> Check the FAQ: http://doc.akka.io/docs/akka/current/additional/faq.html
>>>>>>>>>> Search the archives: https://groups.google.com/group/akka-user
---
You received this message because you are subscribed to the Google Groups "Akka User List" group.
To unsubscribe from this group and stop receiving emails from it, send an email to akka-user+...@googlegroups.com.
To post to this group, send email to akka...@googlegroups.com.
Visit this group at http://groups.google.com/group/akka-user.
For more options, visit https://groups.google.com/d/optout.



--
Akka Team
Typesafe - The software stack for applications that scale
Blog: letitcrash.com
Twitter: @akkateam

Endre Varga

unread,
May 7, 2014, 4:13:25 AM5/7/14
to akka...@googlegroups.com
That is a really good summary, Roland. This should go into docs or at least a blog post?

-Endre

Eric Pederson

unread,
May 8, 2014, 11:14:05 AM5/8/14
to akka...@googlegroups.com
+1 for adding it to the docs.   It's essential info.

Lawrence Wagerfield

unread,
May 8, 2014, 11:35:25 AM5/8/14
to akka...@googlegroups.com
Fascinating and very helpful! 

Out of interest, where does RAFT sit on the aforementioned spectrum? I only ask as there's a few Akka RAFT implementations floating around...

shikhar

unread,
May 8, 2014, 4:38:04 PM5/8/14
to akka...@googlegroups.com
Thanks for the thoughtful reply Ronald!

Akka Team

unread,
May 8, 2014, 5:18:01 PM5/8/14
to Akka User List
Thanks for the motivating words, I’ll try to find the time to write this up more properly; that might happen for the Reactive Design Patterns book at first, but I agree that the docs can be improved as well.

Regards,

Roland

Jonas Bonér

unread,
May 9, 2014, 8:31:46 AM5/9/14
to Akka User List
On Thu, May 8, 2014 at 5:35 PM, Lawrence Wagerfield <lawr...@dmz.wagerfield.com> wrote:
Fascinating and very helpful! 

Out of interest, where does RAFT sit on the aforementioned spectrum? I only ask as there's a few Akka RAFT implementations floating around...

​We have not talked about adding Raft (or similar, like Paxos/VR) to the Akka distribution. As everything in Akka, features are driven by need. If it turns out to be an important building block, either as an abstraction/tool for our users, or internally (if we f.e. decide to implement our own replicated Akka Persistence Journal) then we will add it. Many of the good things in Akka started out as external contributions, that after proven essential, was added to the core distribution—examples are Spray, eventsourced and Akka Camel. 

--
>>>>>>>>>> Read the docs: http://akka.io/docs/
>>>>>>>>>> Check the FAQ: http://doc.akka.io/docs/akka/current/additional/faq.html
>>>>>>>>>> Search the archives: https://groups.google.com/group/akka-user
---
You received this message because you are subscribed to the Google Groups "Akka User List" group.
To unsubscribe from this group and stop receiving emails from it, send an email to akka-user+...@googlegroups.com.
To post to this group, send email to akka...@googlegroups.com.
Visit this group at http://groups.google.com/group/akka-user.
For more options, visit https://groups.google.com/d/optout.



--
Jonas Bonér
Phone: +46 733 777 123
Home: jonasboner.com
Twitter: @jboner

Eric Pederson

unread,
May 9, 2014, 5:50:23 PM5/9/14
to akka...@googlegroups.com
Lawrence - were you wondering where the existing Akka-based Raft implementations (eg. ktoso/akka-raft) sit in Roland's auto downing to no downing spectrum and/or if RAFT has any particular consistency requirements that map to one of Roland's categories? I'm curious too.


-- Eric


You received this message because you are subscribed to a topic in the Google Groups "Akka User List" group.
To unsubscribe from this topic, visit https://groups.google.com/d/topic/akka-user/UBSF3QQnGaM/unsubscribe.
To unsubscribe from this group and all its topics, send an email to akka-user+...@googlegroups.com.

Lawrence Wagerfield

unread,
May 9, 2014, 5:58:40 PM5/9/14
to akka...@googlegroups.com
Yes, precisely what I was wondering, although probably not as well articulated!

I am curious as, to my limited knowledge, RAFT prohibits split-brain. Does that mean RAFT therefore sits at the 'unavailable' end of the continuum you described, or does it somehow provide slightly greater availability compared to more primitive schemes like a N>0.5•M quorum?

Thanks,
Lawrence

Roland Kuhn

unread,
May 10, 2014, 4:57:01 AM5/10/14
to akka-user
Yes, RAFT sits at the consistency end of the spectrum: it avoids split-brain at the cost on unavailability, which is what “consensus” is all about. Quoting from the RAFT paper:

Consensus algorithms for practical systems typically have the following properties:
    • They ensure safety (never returning an incorrect result) under all non-Byzantine conditions, including network delays, partitions, and packet-loss, duplication, and reordering.
    • They are fully functional (available) as long as any majority of the servers are operational and can communicate with each other and with clients. Thus, a typical cluster with five servers can tolerate the failure of any two servers. Servers are assumed to fail by stopping; they may later recover from state on stable storage and rejoin the cluster.
    • They do not depend on timing to ensure the consistency of the logs; faulty clocks and extreme message delays can, at worst, cause availability problems.
    • In the common case, a command can complete as soon as a majority of the cluster has responded to a single round of remote procedure calls; a minority of slow servers need not impact overall system performance.

The second bullet point is the most interesting one for this discussion. Interestingly, we have developed the Akka Cluster replicated state machine without such up-front research, it has evolved naturally into an epidemically disseminated CRDT with built-in leader determination for synchronizing certain actions that correspond to the terms in RAFT terminology. But since unavailability in case of a partition is a choice that we want to leave to the user, our clustering support leaves this point open: if you decide to implement a majority quorum scheme, you get a fully consistent but potentially unavailable cluster, but if you relax the conditions you can tone it down to a highly available and “mostly consistent” one (at the end of the spectrum it is just a best effort neighborhood discovery service).

This makes it clear that we still have some work to do: we need to curate implementations of schemes placed along this axis which make sense in certain scenarios and document which one serves what purpose.

Regards,

Roland


Dr. Roland Kuhn
Akka Tech Lead
Typesafe – Reactive apps on the JVM.
twitter: @rolandkuhn


Lawrence Wagerfield

unread,
May 10, 2014, 5:23:22 AM5/10/14
to akka...@googlegroups.com
Thanks again Roland, most helpful.

Stating the schemes in Akka Cluster documentation would be invaluable. My guess is this would prevent some potentially unsound homegrown solutions. Just listing the basics (i.e. the majority quorum approach, RAFT, etc) might prevent a few of these cases, especially if the tradeoffs where given upfront.

Jonas Bonér

unread,
May 10, 2014, 10:54:22 AM5/10/14
to Akka User List

Great summary Roland.

We should definitely give tools to model the different guarantees. It's not just Raft/Paxos-style OR full eventual consistency. It's an axis as you say. Riak gives you some flexibility with its R/W values. For the interested I recommend reading Peter Bailey's HAT paper and Eric Brewer's CAP twelve years later.

--
Jonas Bonér
Phone: +46 733 777 123
Home: jonasboner.com
Twitter: @jboner

Eric Pederson

unread,
May 10, 2014, 8:52:15 PM5/10/14
to akka...@googlegroups.com
Thanks Roland!

I read Distributed Systems for Fun And Profit a few weeks ago.  It has a diagram that was very helpful for me in understanding the difference between the various types of replication systems:

I need to read the Raft paper and I need to look at the source of some of the Akka-based implementations to see how they implement the protocol.
Reply all
Reply to author
Forward
0 new messages