Hi all,
I have a cluster of 8 nodes running RabbitMQ 3.5.1 on Windows Server 2008 R2. I sometimes experience a network partition that I cannot recover from unless I restart all nodes.
Apparently all nodes are working correctly and no network link is down.
Two nodes are, at the same time, unable to communicate with each other, while the other nodes can still see both of them as up.
Here are the logs from the two nodes (node 4 and node 6):
Node 4
=INFO REPORT==== 3-May-2015::12:37:28 ===
rabbit on node imagiclerabbit@PVFAXAS06V down
=INFO REPORT==== 3-May-2015::12:37:31 ===
node imagiclerabbit@PVFAXAS06V down: connection_closed
=ERROR REPORT==== 3-May-2015::12:37:31 ===
Partial partition detected:
* We saw DOWN from imagiclerabbit@PVFAXAS06V
* We can still see imagiclerabbit@PVFAXAS01V which can see imagiclerabbit@PVFAXAS06V
We will therefore intentionally disconnect from imagiclerabbit@PVFAXAS01V
=ERROR REPORT==== 3-May-2015::12:37:32 ===
Partial partition detected:
* We saw DOWN from imagiclerabbit@PVFAXAS06V
* We can still see imagiclerabbit@PVFAXAS07V which can see imagiclerabbit@PVFAXAS06V
We will therefore intentionally disconnect from imagiclerabbit@PVFAXAS07V
=ERROR REPORT==== 3-May-2015::12:37:33 ===
Partial partition detected:
* We saw DOWN from imagiclerabbit@PVFAXAS06V
* We can still see imagiclerabbit@PVFAXAS03V which can see imagiclerabbit@PVFAXAS06V
We will therefore intentionally disconnect from imagiclerabbit@PVFAXAS03V
=ERROR REPORT==== 3-May-2015::12:37:34 ===
Partial partition detected:
* We saw DOWN from imagiclerabbit@PVFAXAS06V
* We can still see imagiclerabbit@PVFAXAS05V which can see imagiclerabbit@PVFAXAS06V
We will therefore intentionally disconnect from imagiclerabbit@PVFAXAS05V
=ERROR REPORT==== 3-May-2015::12:37:34 ===
Mnesia(imagiclerabbit@PVFAXAS04V): ** ERROR ** mnesia_event got {inconsistent_database, running_partitioned_network, imagiclerabbit@PVFAXAS01V}
Node 6
=INFO REPORT==== 3-May-2015::12:37:28 ===
rabbit on node imagiclerabbit@PVFAXAS04V down
=INFO REPORT==== 3-May-2015::12:37:30 ===
node imagiclerabbit@PVFAXAS04V down: connection_closed
=ERROR REPORT==== 3-May-2015::12:37:30 ===
Partial partition detected:
* We saw DOWN from imagiclerabbit@PVFAXAS04V
* We can still see imagiclerabbit@PVFAXAS08V which can see imagiclerabbit@PVFAXAS04V
We will therefore intentionally disconnect from imagiclerabbit@PVFAXAS08V
=ERROR REPORT==== 3-May-2015::12:37:31 ===
Partial partition detected:
* We saw DOWN from imagiclerabbit@PVFAXAS04V
* We can still see imagiclerabbit@PVFAXAS07V which can see imagiclerabbit@PVFAXAS04V
We will therefore intentionally disconnect from imagiclerabbit@PVFAXAS07V
=ERROR REPORT==== 3-May-2015::12:37:33 ===
Partial partition detected:
* We saw DOWN from imagiclerabbit@PVFAXAS04V
* We can still see imagiclerabbit@PVFAXAS05V which can see imagiclerabbit@PVFAXAS04V
We will therefore intentionally disconnect from imagiclerabbit@PVFAXAS05V
=INFO REPORT==== 3-May-2015::12:37:34 ===
rabbit on node imagiclerabbit@PVFAXAS08V down
=ERROR REPORT==== 3-May-2015::12:37:37 ===
Mnesia(imagiclerabbit@PVFAXAS06V): ** ERROR ** mnesia_event got {inconsistent_database, running_partitioned_network, imagiclerabbit@PVFAXAS05V}
It all seems to happen within a few seconds.
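For anyone who wants to line up the two logs, here is a minimal Python sketch that pulls the timestamp and peer out of each "intentionally disconnect" report. It is only an illustration based on the log format shown above; the function name `summarize_disconnects` and the regular expressions are my own assumptions, not part of RabbitMQ or any of its tools.

```python
import re

# Report header, e.g. "=ERROR REPORT==== 3-May-2015::12:37:31 ==="
HEADER_RE = re.compile(r"^=(\w+) REPORT==== (\S+) ===$")
# Final line of a partial-partition report, e.g.
# "We will therefore intentionally disconnect from imagiclerabbit@PVFAXAS01V"
DISCONNECT_RE = re.compile(r"intentionally disconnect from (\S+)")


def summarize_disconnects(log_text):
    """Return (timestamp, peer) for every intentional disconnect in the log.

    Hypothetical helper: it assumes the report layout seen in the excerpts
    above and ignores every other kind of log entry.
    """
    events = []
    timestamp = None
    for raw_line in log_text.splitlines():
        line = raw_line.strip()
        header = HEADER_RE.match(line)
        if header:
            # Remember the timestamp of the report we are currently inside.
            timestamp = header.group(2)
            continue
        hit = DISCONNECT_RE.search(line)
        if hit:
            events.append((timestamp, hit.group(1)))
    return events


# Two reports copied from the node 4 log above.
sample = """\
=ERROR REPORT==== 3-May-2015::12:37:31 ===
Partial partition detected:
* We saw DOWN from imagiclerabbit@PVFAXAS06V
* We can still see imagiclerabbit@PVFAXAS01V which can see imagiclerabbit@PVFAXAS06V
We will therefore intentionally disconnect from imagiclerabbit@PVFAXAS01V
=ERROR REPORT==== 3-May-2015::12:37:32 ===
Partial partition detected:
* We saw DOWN from imagiclerabbit@PVFAXAS06V
* We can still see imagiclerabbit@PVFAXAS07V which can see imagiclerabbit@PVFAXAS06V
We will therefore intentionally disconnect from imagiclerabbit@PVFAXAS07V
"""

for when, peer in summarize_disconnects(sample):
    print(when, peer)
```

Running it over the full log of each node gives one line per disconnect, which makes it easier to compare the two timelines side by side.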
After that, I cannot connect to the other nodes, which log errors like these:
=ERROR REPORT==== 3-May-2015::14:25:19 ===
closing AMQP connection <0.17372.85> ([::1]:50659 -> [::1]:5672):
{heartbeat_timeout,running}
The management plugin doesn't work on any node.
Questions:
1) How can I reduce the sensitivity to very short link outages on Windows Server?
2) How can I reproduce this on a lab test cluster?
3) Why do the other nodes become unreachable/unavailable?
Thanks,
Rick