During a network partition, the partitioned node is removed from the cluster after auto-down occurs and is quarantined, such that it must be restarted in order to rejoin the cluster once the partition heals. A manual restart after a temporary network outage is problematic when one is developing a commercial product with end users who will (rightly) expect automatic recovery. One option is to disable auto-down, but that introduces another issue. In lieu of that:

1) Is there any way to disable the quarantine behavior?
2) Is there any way for code on the node to know, or get notified, that it has been quarantined and must be restarted, so this can be handled automatically?
--
Thanks,
Tom
>>>>>>>>>> Read the docs: http://akka.io/docs/
>>>>>>>>>> Check the FAQ: http://doc.akka.io/docs/akka/current/additional/faq.html
>>>>>>>>>> Search the archives: https://groups.google.com/group/akka-user
---
You received this message because you are subscribed to the Google Groups "Akka User List" group.
To unsubscribe from this group and stop receiving emails from it, send an email to akka-user+...@googlegroups.com.
To post to this group, send email to akka...@googlegroups.com.
Visit this group at http://groups.google.com/group/akka-user.
For more options, visit https://groups.google.com/d/optout.
Patrik Nordwall
Typesafe - Reactive apps on the JVM
Twitter: @patriknw
Hi,

We have solved these issues like this:

We have a ClusterListener on each node that "pings" the database: as long as the node is online and a happy member of the cluster, it updates a timestamp in the database.

To detect split-brain scenarios, we do this:
* The ClusterListener on each node keeps track of all members of the cluster in memory.
* Periodically we check whether there are more alive nodes in the database than we know to be members of our cluster.
* If we see more alive nodes than we have in the cluster, we know we have a split-brain scenario.
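The detection step described above can be sketched as two small pure functions (names, parameters, and the timestamp representation are illustrative, not from the actual implementation):

```scala
// A node counts as "alive" if it refreshed its timestamp in the
// database within the timeout.
def aliveNodes(heartbeats: Map[String, Long], now: Long, timeoutMs: Long): Set[String] =
  heartbeats.collect { case (node, ts) if now - ts <= timeoutMs => node }.toSet

// Split brain is suspected when the database shows alive nodes that
// this node's own cluster membership does not contain.
def splitBrainSuspected(aliveInDb: Set[String], clusterMembers: Set[String]): Boolean =
  (aliveInDb -- clusterMembers).nonEmpty
```

In the real system, the heartbeat map would come from the database and the member set from the ClusterListener's in-memory view.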
To recover from it, the node waits a random number of seconds, then triggers itself to restart (we spawn a process that executes "./application.sh restart").
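The randomized back-off before restarting can be sketched like this (the bounds are illustrative, not from the actual implementation); a random delay makes it unlikely that both halves of a split restart at exactly the same moment:

```scala
import scala.util.Random

// Pick a restart delay uniformly between minSeconds and maxSeconds
// (inclusive). After sleeping for this long, the node would spawn
// something like "./application.sh restart", as described above.
def restartDelaySeconds(minSeconds: Int, maxSeconds: Int, rnd: Random = new Random): Int =
  minSeconds + rnd.nextInt(maxSeconds - minSeconds + 1)
```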
When the node is starting up (again), we use the same "alive" mechanism in the database to find seed nodes, so we actually join the existing cluster. If no one is alive, we know we are the first one starting up, so we become our own seed node.
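The seed-node discovery described above amounts to a small selection function (names are illustrative, not from the actual implementation):

```scala
// Use the "alive" table to find running nodes to join; if no one else
// is alive, this node becomes its own seed node.
def chooseSeedNodes(aliveInDb: Set[String], self: String): List[String] = {
  val others = (aliveInDb - self).toList.sorted
  if (others.nonEmpty) others // join the existing cluster
  else List(self)             // first one up: be our own seed node
}
```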
If it decided to join a cluster but failed to do so, it starts over again with a new restart.

This solution has, at least for us, turned out to be robust. It supports:
* staged or instant startup of multiple nodes,
* auto-restarting multiple nodes when deploying a new version,
* auto-healing when something odd happens in our data center (like network glitches, or something causing the CPU to stall for too long).
We're planning to open-source this code soon. I hope this info was helpful.

Regards,
Morten
Hi Morten,

We'd like to implement a solution similar to yours. Can you elaborate on some details of your solution?
On Wednesday, August 5, 2015 at 2:58:19 PM UTC+3, Morten Kjetland wrote:
> Hi, We have solved these issues like this: We have a ClusterListener on each node that "pings" the database - As long as it is "online" and a happy member of the cluster it updates a timestamp in the database. To detect split-brain scenarios, we do this: The ClusterListener on each node keeps track of all members of the cluster in memory. Periodically we check if there are more alive nodes in the database than we know are members of our cluster. If we see more alive nodes than we have in the cluster, we know we have a split brain scenario.

I think there might be a possible race condition here: what if one or more new nodes join the cluster and update the DB before all the other nodes have learned about them? In that case, other nodes might think they are in a split-brain situation and restart themselves, right? How do you prevent this?
> To recover from it, the node waits a random amount of seconds, then trigger itself to restart (we spawn a process that executes "./application.sh restart")

While waiting this random number of seconds, is the node still part of the cluster, or has it already left the cluster?
> When the node is starting up (again) we use the same "alive" mechanism in the database to find seed-nodes - so we actually join the existing cluster. If no one is alive, we know we are the first one starting up, so we're going to be our own seed node.

"If no one is alive, we know we are the first one starting up" - have you implemented this with some atomic operation, like "check and set", to prevent starting two clusters? Which DB do you use for this?
> If it decided to join a cluster but failed to do so, it starts over again with a new restart. This solution has, at least for us, turned out to be a robust solution which supports staged or instant startup of multiple nodes, auto-restarting multiple nodes when deploying a new version, and auto-healing when something odd happens in our data-center.

Do you use Akka Persistence / Cluster Sharding? I'm asking because we use both and have found them to be sensitive to split brain.
The Split Brain Resolver is part of the Typesafe Reactive Platform and implements a number of strategies for performing downing more safely than plain timeouts (auto-downing). The strategies include, for example, "static quorum" and "keep majority". Each of them has specific trade-offs, i.e. scenarios where it works well, and failure scenarios where the strategy would make a decision consistent with how it works, but maybe not what you need.
The docs are available here: http://doc.akka.io/docs/akka/rp-15v09p01/scala/split-brain-resolver.html and go pretty in-depth about how it all works.
Konrad did a webinar about new features in Akka 2.4 and Reactive Platform and it also covered the Split Brain Resolver a bit: https://youtu.be/D3mPl8OUrjs?t=9m11s (9 minute mark is about SBR).
In order to use this in production you'll need to obtain a Reactive Platform subscription; more details here: http://www.typesafe.com/products/typesafe-reactive-platform (the page also explains at the bottom how you can try it out).
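For reference, enabling a strategy is a matter of configuration. The fragment below is an illustrative sketch only - the key names and the downing provider class vary between versions, so consult the linked docs for your Reactive Platform version:

```hocon
# Illustrative sketch only -- check the linked documentation for the
# exact settings and provider class in your version.
akka.cluster {
  downing-provider-class = "com.lightbend.akka.sbr.SplitBrainResolverProvider"
  split-brain-resolver {
    active-strategy = keep-majority
    stable-after    = 20s
  }
}
```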