RabbitMQ cluster goes into split brain with pause_minority


Radu Marian

Apr 26, 2023, 4:02:17 AM
to rabbitmq-users
Hello rabbitmq users,

We have the following RabbitMQ setup in our project:

1. Three-node cluster (nodes called rabbitmq-1, rabbitmq-2 and rabbitmq-3) with partition handling set to pause_minority (the exact setting is shown below)
2. RabbitMQ 3.8.27 with Erlang 24.2
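
For reference, this corresponds to the following line in rabbitmq.conf (new-style format; the classic Erlang-term config has an equivalent key):

    cluster_partition_handling = pause_minority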

Incident:

At some point a network outage occurred between rabbitmq-1 and the other two nodes, but it appears to have been short-lived, lasting only a few seconds.

We expected the isolated node, rabbitmq-1, to either rejoin the cluster or shut down (due to pause_minority). Instead, rabbitmq-1 never rejoined the cluster; it continued to accept requests and even promoted its own mirrored queue replicas to master.

See attached logs.

Looking at the logs, the only interesting thing I noticed is that rabbitmq-1 at some point reports both of its peers, rabbitmq-2 and rabbitmq-3, as down and then up at almost the same instant:


Apr 20, 2023 @ 10:05:04.473 +00:00 2023-04-20 10:05:04.470 [info] <0.6029.3> node 'rabbit@rabbitmq-2' down: connection_closed  rabbitmq-1
Apr 20, 2023 @ 10:05:04.486 +00:00 2023-04-20 10:05:04.470 [info] <0.6029.3> node 'rabbit@rabbitmq-3' down: connection_closed  rabbitmq-1
Apr 20, 2023 @ 10:05:04.486 +00:00 2023-04-20 10:05:04.471 [info] <0.6029.3> node 'rabbit@rabbitmq-3' up  rabbitmq-1
Apr 20, 2023 @ 10:05:04.486 +00:00 2023-04-20 10:05:04.471 [info] <0.6029.3> node 'rab...@rabbitmq-2.mx-a' up  rabbitmq-1


Has anyone experienced a similar issue? Could it be some race condition in the cluster partition handling if the outage is short-lived?

Many thanks,
Radu.

Rabbitmq-Server-Logs.txt

kfre...@gmail.com

Aug 14, 2023, 5:10:28 PM
to rabbitmq-users
We recently experienced something similar on a 3-node cluster running RabbitMQ 3.10.6 and Erlang 25.0.2. I did not find any logs indicating that a partition was detected, but the cluster ended up in a split-brain configuration even though the config specifies a partition handling strategy of pause_minority. I am collecting additional information about the incident.

Kevin

kjnilsson

Aug 15, 2023, 3:47:26 AM
to rabbitmq-users
The way pause_minority works is fundamentally flawed, as it requires a node to make accurate decisions about the partition status of other nodes, which isn't possible to do reliably. It is very likely that there will occasionally be odd behaviour. It is unlikely we can do much to fix it, but we would still recommend upgrading to the latest version of RabbitMQ: 3.8 is out of support and 3.10 is also getting old by now.

Luckily, with RabbitMQ 4.0 there will be no more partition handling strategies, and the cluster should heal itself as long as a quorum of nodes is available.

Cheers
Karl

kfre...@gmail.com

Aug 16, 2023, 4:31:46 PM
to rabbitmq-users
Thanks for the reply.  The logs from the event are extremely sparse, basically indicating a single process crash, which is quite different from prior partition events and probably not worth reviewing.  We are moving ahead with a planned upgrade cycle and will use autoheal until the release of 4.0.
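
For anyone else following along, switching strategies should just be a one-line change in rabbitmq.conf (a sketch, assuming the new-style config format):

    cluster_partition_handling = autoheal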

Kevin

Radu Marian

Aug 22, 2023, 8:24:08 AM
to rabbitmq-users
Hi Karl,

Thank you for your reply.

We upgraded to 3.11.19 and the issue is still there.

Do you know if there is a target date for RabbitMQ 4.0? In the meantime we will try to live with pause_minority as it is.

One more thing: while investigating the matter, I wrote a Python script that reproduces the problem quite reliably. Unfortunately it does not help us that much, since we don't have the Erlang experience needed to debug the problem further.

I have attached it; perhaps it will help someone else here. If you or one of the RabbitMQ developers wants to use it to reproduce the issue, let me know and I can explain how to run it.
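
In rough terms, the idea is to drop inter-node traffic on one node for a few seconds and then restore it. The outline below is only a simplified sketch of that approach, not the attached script; the peer names, the default Erlang distribution port (25672) and the use of iptables are assumptions about the environment.

    import subprocess
    import time

    # Illustrative values only -- adjust for your environment.
    PEERS = ["rabbitmq-2", "rabbitmq-3"]   # peers to cut off from this host
    DIST_PORT = "25672"                    # default Erlang distribution port
    OUTAGE_SECONDS = 3                     # length of the simulated outage

    def iptables(action, chain, host_flag, peer):
        # Add (-A) or delete (-D) a DROP rule for inter-node traffic.
        subprocess.run(
            ["iptables", action, chain, host_flag, peer,
             "-p", "tcp", "--dport", DIST_PORT, "-j", "DROP"],
            check=True,
        )

    def block(peer):
        iptables("-A", "INPUT", "-s", peer)
        iptables("-A", "OUTPUT", "-d", peer)

    def unblock(peer):
        iptables("-D", "INPUT", "-s", peer)
        iptables("-D", "OUTPUT", "-d", peer)

    if __name__ == "__main__":
        # Run on rabbitmq-1: cut it off from both peers briefly, then
        # restore connectivity and watch whether it pauses or keeps serving.
        for peer in PEERS:
            block(peer)
        time.sleep(OUTAGE_SECONDS)
        for peer in PEERS:
            unblock(peer)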

Thanks,
Radu.
network_outage_simulator.zip