Dear support,
We are running a 3-node cluster with RabbitMQ 3.12.14 and Erlang 25.3.2.21.
All queues are mirrored queues (exactly 2).
RABBITMQ_DISTRIBUTION_BUFFER_SIZE=1048576.
When we stop one node for maintenance, the other 2 nodes become network partitioned 1-2 minutes after node0 is reset and stopped.
On the Prometheus/Grafana dashboard and in the node logs, we observed:
6:29:30: erlang_vm_dist_port_queue_size_bytes > 75k
6:32:00: erlang_vm_dist_node_queue_size_bytes > 200M (well below the configured limit of 1048576 KB, i.e. 1 GB).
6:30:18: rmq1 and rmq2 logged that rmq0 is down.
6:32:09: rmq1 and rmq2 reported a network partition.
In our experience, when erlang_vm_dist_node_queue_size_bytes exceeds 1 GB, the cluster almost always ends up split-brained. This time, however, it was only 200 MB.
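For reference, this is roughly how we compare the metric with the configured limit. It is only a sketch: it assumes RABBITMQ_DISTRIBUTION_BUFFER_SIZE is given in kilobytes (as the RabbitMQ documentation describes it) and that the 200M reading means 200 MiB. The function names are ours, not part of RabbitMQ.

```python
KIB = 1024  # bytes per kilobyte, assuming the setting uses 1024-byte units


def buffer_limit_bytes(distribution_buffer_size_kb: int) -> int:
    """Convert the configured buffer size (assumed kilobytes) to bytes."""
    return distribution_buffer_size_kb * KIB


def fraction_of_limit(queue_size_bytes: int, distribution_buffer_size_kb: int) -> float:
    """Return the observed dist queue size as a fraction of the configured limit."""
    return queue_size_bytes / buffer_limit_bytes(distribution_buffer_size_kb)


if __name__ == "__main__":
    configured_kb = 1048576            # our RABBITMQ_DISTRIBUTION_BUFFER_SIZE
    observed = 200 * 1024 * 1024       # ~200M seen at 6:32:00 (assumed MiB)
    print(buffer_limit_bytes(configured_kb))                   # 1073741824
    print(f"{fraction_of_limit(observed, configured_kb):.1%}")  # 19.5%
```

So at the time of the partition the queue was at roughly one fifth of the configured limit.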
Is there any difference between erlang_vm_dist_port_queue_size_bytes and erlang_vm_dist_node_queue_size_bytes?
What values of these metrics would trigger a network partition in a mirrored-queue setup?
We would be grateful if you could share any related documentation.
BR
xiaoyan