We are experiencing a problem where using HA (mirrored) queues causes the entire cluster to partition and become unresponsive when the total send rate exceeds 10k messages per second.
We are running performance tests on a 5 node cluster hosted on AWS EC2 instances.
For example, within a few minutes of starting a test with 10 producers each sending 1k messages/sec, the rabbit nodes partition and, eventually, the entire cluster becomes unresponsive. Around the time the partition occurs, the rabbitmq logs show output like this:
Error log from the partial partition event:
=ERROR REPORT==== 8-Dec-2014::21:25:34 ===
Partial partition detected:
* We saw DOWN from rabbit@rabbit5
* We can still see rabbit@rabbit4 which can see rabbit@rabbit5
We will therefore intentionally disconnect from rabbit@rabbit4
In addition, our Management UI flashes a red alert stating:
"Network partition detected
Mnesia reports that this RabbitMQ cluster has experienced a network partition. This is a dangerous situation. RabbitMQ clusters should not be installed on networks which can experience partitions."
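For context, each of the 10 producers mentioned above does roughly the following. This is an illustrative sketch using the Python pika client; the host, queue name, and 1 KB payload are placeholders rather than an exact copy of our test harness.

import time
import pika

# Illustrative producer: hold roughly 1,000 publishes per second on one queue.
connection = pika.BlockingConnection(pika.ConnectionParameters(host="rabbit1"))
channel = connection.channel()
channel.queue_declare(queue="perf-test")

payload = b"x" * 1024
while True:
    start = time.time()
    for _ in range(1000):
        channel.basic_publish(exchange="", routing_key="perf-test", body=payload)
    # Sleep off whatever is left of the second to keep the rate near 1k msg/sec.
    time.sleep(max(0.0, 1.0 - (time.time() - start)))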
We have tried several approaches to increase performance on AWS, including upgrading to compute-optimized 8-vCPU instances, enabling EBS optimization, and using placement groups, without much improvement.
We have also experimented with autoheal for cluster_partition_handling and with increasing net_ticktime to 180. These helped slightly, but only delayed the inevitable partition by a couple of minutes.
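For reference, those two settings were applied in rabbitmq.config roughly as follows (illustrative excerpt; the file location and surrounding entries differ per node):

[
  {rabbit, [
    %% Auto-heal after a partition instead of the default "ignore" behaviour
    {cluster_partition_handling, autoheal}
  ]},
  {kernel, [
    %% Raise the Erlang distribution tick time from the default of 60 seconds
    {net_ticktime, 180}
  ]}
].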
Our HA policy setting was configured as follows:
rabbitmqctl set_policy ha-two-nodes ".*" '{"ha-mode":"exactly","ha-params":2}'
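The policy can be confirmed on each node with:
rabbitmqctl list_policies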
Further analysis with collectd monitoring on the EC2 nodes shows spikes in disk I/O and average write time on some rabbit nodes around the time of the partition.
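The disk metrics come from the standard collectd disk plugin, configured roughly like this (the device name is illustrative and varies per EC2 instance):

LoadPlugin disk
<Plugin disk>
  # Monitor the EBS-backed device holding the rabbit data directory
  Disk "xvda"
  IgnoreSelected false
</Plugin>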
Why do mirrored queues cause a spike in disk I/O when memory usage is well below the high water mark?
Is there anything we missed in the set_policy configuration, or are HA queues simply not recommended for total throughput greater than 10k messages/sec?
Any insight would be appreciated.
RabbitMQ version: 3.4.2