Split Brain Scenarios

Stefan Leonhartsberger

Aug 2, 2016, 8:29:32 AM
to Serf
Hi all

I have a question about a split-brain scenario with the SWIM-based algorithm that Serf implements.

Let's assume the following situation:
The cluster consists of 8 member nodes.
Let's say one link goes down and the connection between 2 sites is lost: the cluster is divided into two halves with no connectivity to each other.

After some time, the link between the 2 halves comes back up.

How can such a situation be resolved with Serf? Is it possible to reach convergence again (forming a single cluster again), e.g. by probing one of the other members from time to time?
Do you have any suggestions?

Thank you and best regards

Armon Dadgar

Aug 2, 2016, 9:36:04 PM
to ser...@googlegroups.com, Stefan Leonhartsberger

Serf will already handle this scenario for you. I believe Serf will attempt to recover from a network
partition for up to 72 hours before it considers the other side to be permanently failed. Even if that
happens, you can use “serf join” to just merge the two clusters together again.

Best Regards,
Armon Dadgar

Dhananjay Kumar

Jul 17, 2017, 3:09:20 AM
to Serf, stefan.leon...@gmail.com
Hi Armon, I believe the default is 24 hours, and it is configurable (reconnect_timeout).

I have a question about 'serf join'. According to the documentation, join should not replay past events by default. However, during split-brain recovery, when I run 'serf join failed_IP', all the earlier events that happened while the failed node was absent are passed along. How can I disable this event replay?


Armon Dadgar

Jul 17, 2017, 2:04:38 PM
to ser...@googlegroups.com, Dhananjay Kumar, stefan.leon...@gmail.com
Hey Dhananjay,

I just replied on the other thread, but I think the node hasn’t left properly if it is replaying.

Best Regards,
Armon Dadgar