Split Brain Scenarios

Stefan Leonhartsberger

Aug 2, 2016, 8:29:32 AM
to Serf
Hi all

I have a question regarding a split-brain scenario with the SWIM-based algorithm that Serf implements.

Let's assume the following situation:
The cluster consists of 8 member nodes.
Let's say one link goes down and the connection between 2 sites is severed: the cluster is divided into two halves with no connectivity to each other.

After some time, the link between the 2 halves comes back up.

How can such a situation be resolved with Serf? Is it possible to reach convergence again (forming a single cluster again), e.g. by probing one of the other members from time to time?
Do you have any suggestions?

Thank you and best regards,
Stefan

Armon Dadgar

Aug 2, 2016, 9:36:04 PM
to ser...@googlegroups.com, Stefan Leonhartsberger
Stefan,

Serf will already handle this scenario for you. I believe Serf will attempt to recover from a network
partition for up to 72 hours before it considers the other side to be permanently failed. Even if that
happens, you can use “serf join” to just merge the two clusters together again.

Best Regards,
Armon Dadgar
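
For readers who want to script the manual fallback mentioned above, here is a minimal sketch in Python. It assumes the `serf` binary is on the PATH, that `serf join` exits with status 0 on success, and that the peer addresses are placeholders you replace with nodes from the other half of the cluster. The helper names (`build_join_command`, `rejoin_until_success`) are hypothetical and not part of Serf.

```python
import subprocess
import time
from typing import Callable, List, Sequence


def build_join_command(peers: Sequence[str], replay: bool = False) -> List[str]:
    """Build a `serf join` invocation for the given peer addresses."""
    if not peers:
        raise ValueError("at least one peer address is required")
    cmd = ["serf", "join"]
    if replay:
        # -replay asks Serf to replay past user events on join.
        cmd.append("-replay")
    cmd.extend(peers)
    return cmd


def rejoin_until_success(
    peers: Sequence[str],
    attempts: int = 5,
    interval_s: float = 30.0,
    run: Callable[[List[str]], int] = subprocess.call,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Retry `serf join` a bounded number of times.

    Assumption: an exit status of 0 means the join succeeded.
    Returns True on the first success, False if every attempt fails.
    """
    cmd = build_join_command(peers)
    for attempt in range(attempts):
        if run(cmd) == 0:
            return True
        if attempt < attempts - 1:
            sleep(interval_s)
    return False
```

Calling `rejoin_until_success(["10.0.0.5"])` on a node in one half would try to merge it with the half containing the (hypothetical) address 10.0.0.5. This is only needed once Serf has stopped retrying on its own; within the reconnect window the cluster should heal without it.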

Dhananjay Kumar

Jul 17, 2017, 3:09:20 AM
to Serf, stefan.leon...@gmail.com
Hi Armon, I believe it is 24 hours by default but configurable (reconnect_timeout). I have a question about 'serf join'. According to the documentation, join should not replay past events by default. However, during split-brain recovery, when I run 'serf join failed_IP', the failed node receives all the events that occurred while it was absent. How can I disable this event replay?

Thanks,
Dhananjay
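
As a rough illustration of the two settings discussed in this post, the sketch below generates a Serf agent configuration file in JSON. It assumes the agent config keys `reconnect_timeout` (a duration string with a unit suffix) and `replay_on_join`; the function names and the output file name are hypothetical. Setting `replay_on_join` to false only reflects the documented default and may not change the behavior described here if the replay has another cause.

```python
import json


def agent_config(reconnect_timeout: str = "24h", replay_on_join: bool = False) -> dict:
    """Return a minimal Serf agent config fragment.

    Assumption: reconnect_timeout takes a duration string such as "24h".
    """
    return {
        "reconnect_timeout": reconnect_timeout,
        "replay_on_join": replay_on_join,
    }


def render(cfg: dict) -> str:
    """Serialize the config as stable, human-readable JSON."""
    return json.dumps(cfg, indent=2, sort_keys=True)


if __name__ == "__main__":
    # Hypothetical file name; pass it to the agent with -config-file.
    with open("serf-agent.json", "w") as fh:
        fh.write(render(agent_config(reconnect_timeout="72h")) + "\n")
```

The agent could then be started with `serf agent -config-file=serf-agent.json`.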

Armon Dadgar

Jul 17, 2017, 2:04:38 PM
to ser...@googlegroups.com, Dhananjay Kumar, stefan.leon...@gmail.com
Hey Dhananjay,

I just replied on the other thread, but I think that if events are being replayed, the node did not leave the cluster properly.

Best Regards,
Armon Dadgar