We're running rabbitmq-server 3.5.3 with OpenStack, DB2, and Pacemaker. We frequently (but not always) get partitions when moving a virtual IP. I've narrowed it down a little: the trigger is disabling the virtual IP via Pacemaker (as opposed to moving it to another system), which also shuts down HAProxy and no doubt causes a flurry of activity as services react to losing their connections through the virtual IP. In the process, rabbitmq reports "node down" errors and we get partitions. Autoheal usually handles this fine, but the partition itself is disruptive to rabbitmq applications running elsewhere.
For this example, I have a cluster with three nodes: vs181, vs182, and vs184, and I started with the virtual IP hosted on vs184. The nodes are VMware VMs with two NICs, but for all practical purposes everything (including the virtual IP) uses a single interface. RabbitMQ does not use the virtual IP, though it binds to all interfaces.
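For what it's worth, this is roughly how I satisfied myself that the inter-node (Erlang distribution) traffic is not going over the virtual IP (commands shown only as a sketch, output omitted):
# epmd -names            (ports epmd has registered for the rabbit node)
# ss -tnp | grep beam    (established distribution connections from the beam process)
# rabbitmqctl status     (lists rabbit's listeners and their bind addresses)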
It appears the "node down" errors are the result of timeout errors.
# pcs resource disable virtualip
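For comparison, the "move" case I mentioned above is along the lines of
# pcs resource move virtualip vs181
(target node name only as an example); disabling the resource by itself is enough to reproduce the problem, and that is what the logs below show.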
/var/log/messages on vs184:
Jul 7 16:50:53 vs184 systemd: Stopping HAProxy Load Balancer...
Jul 7 16:50:53 vs184 systemd: Stopped HAProxy Load Balancer.
Jul 7 16:50:53 vs184 haproxy_agent(haproxy)[22739]: INFO: HAProxy (haproxy) stopped
Jul 7 16:50:53 vs184 crmd[2005]: notice: process_lrm_event: Operation haproxy_stop_0: ok (node=vs184.xxx, call=274, rc=0, cib-update=123, confirmed=true)
Jul 7 16:50:53 vs184 IPaddr2(virtualip)[22817]: INFO: IP status = ok, IP_CIP=
Jul 7 16:50:53 vs184 crmd[2005]: notice: process_lrm_event: Operation virtualip_stop_0: ok (node=vs184.xxx, call=276, rc=0, cib-update=124, confirmed=true)
Jul 7 16:50:53 vs184 avahi-daemon[662]: Withdrawing address record for x.x.48.163 on br-ex.
rabbitmq log on vs181:
=INFO REPORT==== 7-Jul-2015::16:51:08 ===
rabbit on node 'rab...@x.x.48.184' down
=INFO REPORT==== 7-Jul-2015::16:51:08 ===
node 'rab...@x.x.48.184' down: etimedout
=INFO REPORT==== 7-Jul-2015::16:51:14 ===
Autoheal request received from 'rab...@x.x.48.182'
rabbitmq log on vs182:
=INFO REPORT==== 7-Jul-2015::16:51:14 ===
rabbit on node 'rab...@x.x.48.184' down
=INFO REPORT==== 7-Jul-2015::16:51:14 ===
Keep rab...@x.x.48.184 listeners: the node is already back
=ERROR REPORT==== 7-Jul-2015::16:51:14 ===
Mnesia('rab...@x.x.48.182'): ** ERROR ** mnesia_event got {inconsistent_database, running_partitioned_network, 'rab...@x.x.48.184'}
=INFO REPORT==== 7-Jul-2015::16:51:14 ===
node 'rab...@x.x.48.184' down: etimedout
=INFO REPORT==== 7-Jul-2015::16:51:14 ===
node 'rab...@x.x.48.184' up
=INFO REPORT==== 7-Jul-2015::16:51:14 ===
Autoheal request sent to 'rab...@x.x.48.181'
rabbitmq log on vs184:
=INFO REPORT==== 7-Jul-2015::16:51:13 ===
rabbit on node 'rab...@x.x.48.182' down
=ERROR REPORT==== 7-Jul-2015::16:51:14 ===
Mnesia('rab...@x.x.48.184'): ** ERROR ** mnesia_event got {inconsistent_database, running_partitioned_network, 'rab...@x.x.48.182'}
=INFO REPORT==== 7-Jul-2015::16:51:16 ===
node 'rab...@x.x.48.182' down: etimedout
=INFO REPORT==== 7-Jul-2015::16:51:16 ===
node 'rab...@x.x.48.182' up
=ERROR REPORT==== 7-Jul-2015::16:51:16 ===
Mnesia('rab...@x.x.48.184'): ** ERROR ** mnesia_event got {inconsistent_database, running_partitioned_network, 'rab...@x.x.48.181'}
Could these be caused directly by deleting the virtual IP? Or does that just trigger a flurry of activity that bogs the system down enough to cause timeouts in rabbitmq?
Is there some timeout setting that might get rabbitmq-server through this more gracefully? I've tried changing net_ticktime with no obvious effect.
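In case I have the syntax wrong, this is roughly what I put in /etc/rabbitmq/rabbitmq.config for that experiment (values shown only for illustration; autoheal is what we were already running):
[
  {kernel, [
    %% raise the Erlang distribution tick time from the default 60 seconds
    {net_ticktime, 120}
  ]},
  {rabbit, [
    %% we already rely on autoheal to recover from partitions
    {cluster_partition_handling, autoheal}
  ]}
].
and I checked the running value with something like
# rabbitmqctl eval 'net_kernel:get_net_ticktime().'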
John McMeeking