Our RabbitMQ deployment is a simple two-node cluster on two hosts with HA queues. We need to decide how to handle network partitions when a network failure occurs. I am testing the autoheal option and have a few observations, and I would like to know the recommended configuration for handling partitions. Our applications behave erratically during a partition, so we cannot simply ignore a partition event.
My test setup:
1. RabbitMQ cluster nodes rabbit_01 and rabbit_02 (rabbit_01, rabbit_02 are two Ubuntu VMs on a host-only network).
2. cluster_partition_handling = autoheal
3. No RabbitMQ clients were running during the test.
4. RabbitMQ Broker version: 3.6.1
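For reference, on 3.6.x this setting goes in the classic Erlang-term config file (by default /etc/rabbitmq/rabbitmq.config on Ubuntu); the `key = value` sysctl-style format only exists in newer releases. A minimal config fragment for my setup:

```erlang
%% /etc/rabbitmq/rabbitmq.config (classic Erlang-term format used by 3.6.x)
[
  {rabbit, [
    %% Valid modes: ignore | pause_minority | {pause_if_all_down, ...} | autoheal
    {cluster_partition_handling, autoheal}
  ]}
].
```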
Failure conditions tested:
1. Disconnect network adapters for rabbit_02.
2. Tail rabbitmq logs and/or monitor the web console to be notified of a partition.
3. The autoheal option forces rabbit_02 to be frozen / disabled.
4. Reconnect network adapters for rabbit_02.
5. Autoheal option results in rabbit_02 rejoining the cluster.
The above behavior is not consistent, though: it worked in only 3 out of 5 attempts. In the failing runs, step 3, where autoheal restarts the RabbitMQ broker on rabbit_02, fails and the node never recovers. I had to restart the RabbitMQ service manually to restore the broker and the cluster. Has anyone faced this issue before?
Questions:
1. Is there something I am missing in the above configuration for autoheal?
2. What is the best partition handling configuration for a two node cluster?
3. At this time, we are OK with loss of service on one node, as long as the system detects the partition quickly and forces all clients to connect to a single node.
4. Is there any way to programmatically detect a network partition in our clients? We have clients written in C, C#, and Java running on this system, and they are all part of the same software system.
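On question 4, one option we are considering: the management plugin's HTTP API reports a `partitions` list for each node at `/api/nodes`, which a client in any language can poll. Below is a rough Java sketch; the host `rabbit_01`, port 15672, and guest credentials are assumptions from my test setup, and the string check is deliberately crude (a real client should use a JSON parser):

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;
import java.util.Base64;

public class PartitionCheck {

    // A node entry with a non-empty "partitions" array means that node
    // currently sees itself partitioned from at least one cluster peer.
    // Crude substring check; replace with proper JSON parsing in production.
    static boolean hasPartition(String nodesJson) {
        return nodesJson.contains("\"partitions\":[\"");
    }

    public static void main(String[] args) throws Exception {
        // Assumed management endpoint and credentials for the test setup above.
        URL url = new URL("http://rabbit_01:15672/api/nodes");
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        String auth = Base64.getEncoder()
                .encodeToString("guest:guest".getBytes("UTF-8"));
        conn.setRequestProperty("Authorization", "Basic " + auth);

        StringBuilder body = new StringBuilder();
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(conn.getInputStream(), "UTF-8"))) {
            String line;
            while ((line = in.readLine()) != null) {
                body.append(line);
            }
        }
        System.out.println(hasPartition(body.toString())
                ? "PARTITION DETECTED" : "no partition reported");
    }
}
```

Polling this from each client (or from a watchdog process) would let us detect a partition quickly and steer all clients to a single node, which matches the requirement in point 3.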