Auto clustering causes split brain (2 clusters to form) in some scenarios

Aaron Stainback

Dec 19, 2017, 5:13:23 PM
to rabbitmq-users
Please see here and here for more info.

The general problem goes like this.

I have 5 nodes. For some reason, either mismatched Erlang cookies or a temporary network partition caused by network issues, two distinct groups form. I'll call them group A and group B. Group A consists of nodes 0-2 and group B consists of nodes 3-4. So in this case two distinct clusters are formed and all nodes report healthy. This is very misleading and causes a split-brain problem.

This issue was first found while deploying with the RabbitMQ 3.7 K8s Helm chart.

This helm chart by default generates a random erlang.cookie and gives it to all the nodes.

So step #1 was to create a 3-node cluster in K8s using the Helm chart.  Everything worked fine; it is very easy to create a RabbitMQ cluster this way.

Step #2 was to scale the cluster up to 5 nodes using helm upgrade, but unknown to me at the time, this generated a new random Erlang cookie for the two new nodes.

Still, the nodes came up just fine in K8s; all 5 reported healthy, so I thought everything was good.

That is, until I started sending and receiving messages.  Messages were getting lost, not delivered, and just seemed to be disappearing.  So I logged into RabbitMQ's management UI on node 0 and saw only 3 nodes in the cluster, even though K8s said 5 were healthy.

Then I logged into the management UI on node 4 and there was a cluster of two nodes.  Wait, how is this possible? Two different clusters formed under one StatefulSet in K8s?  This seems ridiculous; it should not happen, and some error should have been reported.

Since then I've been able to re-create this issue by forcing a temporary network partition during the formation of a 5-node cluster, isolating 2 nodes from the other 3 just during cluster formation.
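For reference, one way to force a partition like this on plain VMs is to temporarily drop traffic between the two groups with iptables while the nodes start up. This is only a sketch; the IP addresses stand in for nodes 0-2 and need to be adjusted:

    # on nodes 3 and 4, before starting RabbitMQ: cut them off from nodes 0-2
    for ip in 10.0.0.1 10.0.0.2 10.0.0.3; do
        iptables -A INPUT  -s "$ip" -j DROP
        iptables -A OUTPUT -d "$ip" -j DROP
    done

    # start all five nodes; two clusters form. Then heal the "partition":
    for ip in 10.0.0.1 10.0.0.2 10.0.0.3; do
        iptables -D INPUT  -s "$ip" -j DROP
        iptables -D OUTPUT -d "$ip" -j DROP
    done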

Something is wrong with the way RabbitMQ does auto cluster formation that allows split-brain clusters to form in K8s (and even outside K8s; this problem is not unique to K8s and can happen with any RabbitMQ cluster).

It seems a simple solution could be a status command in rabbitmqctl that takes cluster membership health into account, or something of that nature.
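In the meantime, the closest workaround I know of is to run rabbitmqctl cluster_status against every node and compare the member lists by hand. A sketch for the K8s case (the pod names are the usual StatefulSet defaults and may differ):

    # print each node's view of the cluster; in the split-brain case pods
    # 0-2 and 3-4 report two disjoint running_nodes lists
    for i in 0 1 2 3 4; do
        echo "== rabbitmq-$i =="
        kubectl exec "rabbitmq-$i" -- rabbitmqctl cluster_status
    done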

I'm curious whether others have run into this.  It would also be great if this could be fixed so it becomes a non-problem.

I've tried filing bugs on the RabbitMQ GitHub repos, but they were closed and I was told to come here and start a discussion, so here I am.

Thanks.

Michael Klishin

Dec 20, 2017, 1:03:58 AM
to rabbitmq-users@googlegroups.com
Hi Aaron,

I already explained this on GitHub and will explain it again here. Your problem comes down to Erlang cookie management. It is not
a bug in peer discovery or partition recovery, nor the lack of a CLI command.

What happens here is this: two nodes with a different cookie from the rest of the cluster use the same discovery mechanism. They discover
all peers and then cannot cluster with the original cluster — the cookies do not match — but can cluster with each other, which they do.
Nodes cannot know what peers they are supposed to have at any given time. They try to cluster with the discovered nodes in order, and the first successful attempt is also
the last one tried.

Local node health checks cannot detect this: all nodes are fine and are members of a cluster. A node cannot know which cluster is the "right" one, because
all it is given is a name, a cookie and a discovery endpoint.

Peer discovery features such as health checks against the backend cannot solve this either. What can and will solve it is
correct cookie management, where all nodes that are supposed to be members of a single cluster get the same cookie.
This is the "leader election" solution you are looking for (there are no leader or special nodes in RabbitMQ, by the way).

The problem also doesn't apply to all discovery backends: with the classic config and DNS backends the list of nodes is predefined and known
ahead of time. With AWS autoscaling groups it is sort of that way as well (or somewhere in between). For AWS EC2 tags, Consul, etcd and Kubernetes
this works exactly the same way: use two sets of nodes with different cookies and the same discovery endpoint and they will succeed in joining
whoever they can.
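For reference, a predefined list with the classic config backend looks roughly like this in rabbitmq.conf (3.7 new-style config; the hostnames are just examples):

    cluster_formation.peer_discovery_backend = rabbit_peer_discovery_classic_config
    cluster_formation.classic_config.nodes.1 = rabbit@rabbitmq-0
    cluster_formation.classic_config.nodes.2 = rabbit@rabbitmq-1
    cluster_formation.classic_config.nodes.3 = rabbit@rabbitmq-2
    cluster_formation.classic_config.nodes.4 = rabbit@rabbitmq-3
    cluster_formation.classic_config.nodes.5 = rabbit@rabbitmq-4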

Again, there is no solution other than fixing cookie management. How that's best done on Kubernetes, I'm not qualified to say.
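That said, one common approach, and this is only a sketch with placeholder names and an obviously fake cookie value, is to generate the cookie once, store it in a Kubernetes Secret, and inject it into every pod so the whole StatefulSet shares it:

    # create the secret once, outside of any per-release templating
    kubectl create secret generic rabbitmq-erlang-cookie \
        --from-literal=cookie='PICK-ONE-LONG-RANDOM-STRING'

    # in the StatefulSet pod template, expose it to the container;
    # the official Docker image picks up RABBITMQ_ERLANG_COOKIE
    env:
      - name: RABBITMQ_ERLANG_COOKIE
        valueFrom:
          secretKeyRef:
            name: rabbitmq-erlang-cookie
            key: cookie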

We will document this scenario in our docs and stress the importance of correct cookie management. I can think of one CLI command
that would help with final cluster state verification in cases like this — ensure_cluster_size or similar — but by definition during cluster formation
not all nodes will be up, therefore it has to be used very carefully and possibly with a window of time in which it waits for convergence.
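Until something like that exists, a rough approximation can live in deployment tooling: poll a node until its view of the running cluster reaches the expected size or a deadline passes. A sketch (the expected size and timeout are arbitrary):

    # wait up to 60s for this node to see 5 running cluster members
    expected=5
    deadline=$((SECONDS + 60))
    while [ "$SECONDS" -lt "$deadline" ]; do
        running=$(rabbitmqctl eval 'length(rabbit_mnesia:cluster_nodes(running)).')
        [ "$running" -ge "$expected" ] && exit 0
        sleep 5
    done
    echo "cluster did not reach $expected nodes within 60s" >&2
    exit 1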

Thanks for bringing this interesting case up. We've solved it, or seen it solved, in a number of automation scenarios, most importantly with BOSH,
and now the question is primarily how to do this on Kubernetes, including in our example provided with the peer discovery plugin.

--
MK

Staff Software Engineer, Pivotal/RabbitMQ

Aaron Stainback

Dec 20, 2017, 1:40:38 AM
to rabbitmq-users
I drew a diagram, since it seems you are having a hard time understanding that this occurs even when all the cookies are correct and there are no cookie issues.



Network partitions during peer discovery cause this issue as well.

Michael Klishin

Dec 20, 2017, 1:42:29 AM
to rabbitmq-users@googlegroups.com
Peer discovery will cluster with the nodes that can be reached from the discovered list. Forming a cluster with a networking split already in place is going to have this effect.

There aren't too many alternatives I can think of. One of them, waiting for a set of nodes to be up before forming a cluster, has been tried in
the rabbitmq-clusterer plugin and it turned out to be a true operational disaster: now if one node is down, your entire cluster won't form.

So this works as expected from peer discovery at the moment.


Aaron Stainback

Dec 20, 2017, 1:56:14 AM
to rabbitmq-users
A potential solution would be for peer discovery to pass the expected member count to the actual RabbitMQ clustering code and not report the cluster as healthy until an N+1 majority has been reached.  This would only ever allow one cluster to form.

Michael Klishin

Dec 20, 2017, 2:06:15 AM
to rabbitmq-users@googlegroups.com
Some backends do know how many nodes are expected (classic config, DNS), others don't or that number can vary, potentially
fairly often (e.g. with AWS autoscaling groups).

We've tried the "let's coordinate cluster formation, it sounds awesome" thing before in rabbitmq-clusterer. It is a lot more complicated
than it sounds, and we eventually abandoned that plugin because it solved 1 problem and introduced 10. Having nodes retry joining
a cluster in N attempts every T seconds was sufficient, without all the downsides. Unfortunately that only became obvious to us in retrospect.

rabbitmq-clusterer overreaches and tries to wait for all expected members to be online, so one node that is slow to start, or down, holds up the
entire cluster formation. Requiring a quorum of nodes would be an improvement, but I have no doubt it is less awesome in practice than it sounds.

Peer discovery and health checks are orthogonal concerns. I suggested a CLI command that would verify that a cluster reaches N nodes in T seconds.
I am not convinced it should be a part of the peer discovery system. It's a valid command to use after a cluster is formed, for example.

Another idea that is easy to try is this: when a peer is contacted right after discovery, we can introduce the N retries every T seconds idea there.
It should help with the transient network partition scenario as long as it goes away in a certain window of time.
Any thoughts on that?
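If we go that route, the knobs would presumably end up in rabbitmq.conf. Purely illustrative; these keys do not exist today:

    # hypothetical settings for the proposed behaviour: retry contacting
    # discovered peers up to 10 times, 500 ms apart
    cluster_formation.discovery_retry_limit = 10
    cluster_formation.discovery_retry_interval = 500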



Michael Klishin

Dec 20, 2017, 2:11:01 AM
to rabbitmq-users@googlegroups.com
One downside of the retries would be potentially much slower cluster formation in some scenarios
(when retries happen a lot). Is that a reasonable trade-off for improved resilience to network splits during cluster formation, as demonstrated
in the diagram above? (Nothing else around partition handling would change before 4.0.)

Aaron Stainback

Dec 20, 2017, 2:22:20 AM
to rabbitmq-users
I'm too tired at this point to put deep thought into the issue and come up with a solution other than N+1.  I'll think about it some more and see what I can come up with.  A warning in the logs, or in some metrics system, would be good just to say that the expected node count does not equal the actual node count, based on the latest discovery counts vs. the actual RabbitMQ cluster counts.
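Even without broker changes, that kind of check could run as a cron job or probe next to the cluster. A sketch for K8s (the service and pod names are placeholders):

    # how many pod IPs the discovery (headless) service currently exposes
    discovered=$(kubectl get endpoints rabbitmq \
        -o jsonpath='{.subsets[*].addresses[*].ip}' | wc -w)

    # how many nodes one broker believes are running in its cluster
    running=$(kubectl exec rabbitmq-0 -- \
        rabbitmqctl eval 'length(rabbit_mnesia:cluster_nodes(running)).')

    if [ "$discovered" -ne "$running" ]; then
        echo "WARNING: discovery sees $discovered nodes, rabbitmq-0 is clustered with $running" >&2
    fi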

Michael Klishin

Dec 20, 2017, 2:22:54 AM
to rabbitmq-users@googlegroups.com
One actionable issue filed so far based on this discussion: https://github.com/rabbitmq/rabbitmq-cli/issues/235.

Michael Klishin

Dec 20, 2017, 2:25:22 AM
to rabbitmq-users@googlegroups.com
That's a good suggestion. We already have a feature where cluster members that are no longer on the discovery list can be kicked out.
This area can be extended.
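For reference, the relevant settings for that cleanup feature look roughly like this in rabbitmq.conf (the values are just examples):

    # every 60 seconds, compare cluster members against what the discovery
    # backend returns; only log a warning instead of removing unknown nodes
    cluster_formation.node_cleanup.interval = 60
    cluster_formation.node_cleanup.only_log_warning = true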

First improvements will have to wait until 3.7.2 (say, late January) so we have plenty of time to think this through.
