What cluster_partition_handling configuration is recommended for a two node cluster?


Kiran D

May 30, 2017, 2:39:32 PM
to rabbitmq-users
Our RabbitMQ deployment is a simple two-node cluster on two hosts with HA queues. We need to decide how to handle network partitions when the network fails. I am testing the autoheal option and have a few observations, and I would like to know the recommended configuration for handling partitions. Our applications behave erratically during a partition, so we cannot ignore partition events.

My test setup:

1. RabbitMQ cluster nodes rabbit_01 and rabbit_02 (rabbit_01, rabbit_02 are two Ubuntu VMs on a host-only network).
2. cluster_partition_handling = autoheal
3. No rabbitmq clients were operating during the test.
4. RabbitMQ Broker version: 3.6.1

Failure conditions tested:
1. Disconnect network adapters for rabbit_02.
2. Tail rabbitmq logs and/or monitor the web console to be notified of a partition.
3. The autoheal option forces rabbit_02 to be frozen / disabled.
4. Reconnect network adapters for rabbit_02.
5. Autoheal option results in rabbit_02 rejoining the cluster.

This behavior is not consistent, though: it worked in about 3 out of 5 runs. In the other runs, step 3, where autoheal restarts the RabbitMQ broker on rabbit_02, failed and the node never recovered. I had to restart the rabbitmq service manually to restore the broker and the cluster. Has anyone faced this issue before?

Questions:
1. Is there something I am missing in the above configuration for autoheal?
2. What is the best partition handling configuration for a two-node cluster?
3. At this time, we are OK with loss of service on one node as long as the system detects the partition quickly and forces all clients to connect to a single node.
4. Is there any way to programmatically detect a network partition in our clients? We have clients written in C, C# and Java running on this system, and they are all part of the same software system.

Michael Klishin

May 30, 2017, 3:49:26 PM
to rabbitm...@googlegroups.com
According to 3), "pause_minority" could work well for you but autoheal can also satisfy those properties.

How quickly a partition is detected does not depend on the strategy used but rather on what peer inactivity
timeout is used: http://www.rabbitmq.com/nettick.html. Do not use values that are very low (e.g. 1 second), as they will lead
to false positives.
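
To make that trade-off concrete: as far as I understand the Erlang `net_ticktime` setting (60 seconds by default), an unresponsive peer is declared down somewhere between 75% and 125% of that value. Here is a minimal Python sketch of the arithmetic. The function name `detection_window` is made up for illustration and is not part of RabbitMQ.

```python
from typing import Tuple


def detection_window(net_ticktime_s: float = 60.0) -> Tuple[float, float]:
    """Return the (earliest, latest) time in seconds at which an
    unresponsive peer would be declared down.

    Assumption: the window is net_ticktime minus/plus one quarter,
    as described in the Erlang kernel documentation.
    """
    quarter = net_ticktime_s / 4
    return (net_ticktime_s - quarter, net_ticktime_s + quarter)


if __name__ == "__main__":
    for ticktime in (60, 10, 1):
        earliest, latest = detection_window(ticktime)
        print(f"net_ticktime={ticktime}s: peer declared down "
              f"after {earliest:g} to {latest:g} s")
```

With a 1 second setting the window shrinks to roughly one second, so a brief network hiccup or a long pause on a busy VM could already look like a dead peer.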

It works in a similar fashion for client connections:

The best configuration is to use 3 (or any odd number of) nodes, because with 2 nodes determining
which side is the majority is great fun: there is no right answer.
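
A rough illustration of why two nodes are awkward, assuming the rule I recall from the partitions guide that pause_minority pauses any node that can see half or fewer of the cluster members. The helper names below are hypothetical, not RabbitMQ functions.

```python
def is_minority(visible_nodes: int, cluster_size: int) -> bool:
    """True if a side that sees `visible_nodes` members (itself included)
    counts as a minority. Assumption: "half or fewer" is a minority."""
    return visible_nodes * 2 <= cluster_size


def paused_sides(partition_sizes):
    """For each side of a split, report whether it would pause."""
    total = sum(partition_sizes)
    return [is_minority(size, total) for size in partition_sizes]


if __name__ == "__main__":
    # Two-node cluster split 1/1: neither side has a majority.
    print(paused_sides([1, 1]))  # [True, True]
    # Three-node cluster split 2/1: only the single node pauses.
    print(paused_sides([2, 1]))  # [False, True]
```

Under that assumption a two-node split leaves no side running, while any odd-sized cluster always has exactly one side that keeps going.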

3.6.1 is 9 releases behind and lacks certain fixes related to autoheal specifically more at

You can request node status using the HTTP API that ships with the management plugin:
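
As a sketch of what such a check could look like from a monitoring script, here is a Python example. It assumes the management plugin listens on its default port 15672, that `GET /api/nodes` returns a JSON list of node objects, and that each object carries `name` and `partitions` fields; please verify the field names against your broker version. The guest credentials and the node names in the sample payload are placeholders.

```python
import base64
import json
import urllib.request


def fetch_nodes(base_url="http://localhost:15672", user="guest", password="guest"):
    """Fetch node status from the management HTTP API (assumed endpoint)."""
    request = urllib.request.Request(f"{base_url}/api/nodes")
    token = base64.b64encode(f"{user}:{password}".encode()).decode()
    request.add_header("Authorization", f"Basic {token}")
    with urllib.request.urlopen(request, timeout=5) as response:
        return json.load(response)


def partitioned_nodes(nodes):
    """Map each node that reports a partition to the peers it lost."""
    return {n["name"]: n["partitions"] for n in nodes if n.get("partitions")}


if __name__ == "__main__":
    # Hand-written sample payload, not captured from a real broker.
    sample = [
        {"name": "rabbit@rabbit_01", "running": True,
         "partitions": ["rabbit@rabbit_02"]},
        {"name": "rabbit@rabbit_02", "running": True, "partitions": []},
    ]
    print(partitioned_nodes(sample))
    # Against a live broker: partitioned_nodes(fetch_nodes())
```

An empty result means no node is reporting a partition. Since each side of a split may answer differently, it is probably worth polling every node rather than just one.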

Lastly, while Java and C# clients can use address lists and support automatic
connection recovery, librabbitmq-c does not, and recovery of client connections is an important
part of system recovery as a whole.
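
For a client without built-in recovery, the address-list idea can be approximated by hand. The following Python sketch only shows the control flow and is not the API of any RabbitMQ client: `connect_with_failover` and its parameters are hypothetical, and `connect` stands in for whatever function your client library uses to open a connection.

```python
import time


def connect_with_failover(addresses, connect, rounds=3, delay_s=1.0,
                          sleep=time.sleep):
    """Try each (host, port) in order; repeat for a few rounds with a
    growing pause between rounds. Returns whatever `connect` returns."""
    last_error = None
    for attempt in range(rounds):
        for host, port in addresses:
            try:
                return connect(host, port)
            except OSError as exc:  # refused, timed out, unreachable...
                last_error = exc
        if attempt + 1 < rounds:
            sleep(delay_s * (attempt + 1))  # simple linear backoff
    raise ConnectionError(
        f"no broker reachable after {rounds} rounds") from last_error


if __name__ == "__main__":
    def fake_connect(host, port):
        # Stand-in for a real client call: pretend rabbit_01 is down.
        if host == "rabbit_01":
            raise ConnectionRefusedError(f"{host}:{port} unreachable")
        return f"connected to {host}:{port}"

    nodes = [("rabbit_01", 5672), ("rabbit_02", 5672)]
    print(connect_with_failover(nodes, fake_connect))
```

A real client would also need to redeclare its channels, queues and consumers after reconnecting, which is the part the Java and C# clients handle automatically.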


--
You received this message because you are subscribed to the Google Groups "rabbitmq-users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to rabbitmq-users+unsubscribe@googlegroups.com.
To post to this group, send email to rabbitmq-users@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.



--
MK

Staff Software Engineer, Pivotal/RabbitMQ

Kiran D

May 30, 2017, 3:59:40 PM
to rabbitmq-users
Thanks, Michael. We have accounted for automatic connection recovery in our rabbitmq-c clients. I will upgrade to the latest broker version and test again.