Hi,
We have a RabbitMQ cluster running with the policies below.
There are 3 servers in this cluster. The versions of RabbitMQ and Erlang, and the server OS, are as follows:
RabbitMQ version: 3.6.1
Erlang: 19.2
Servers: Windows 2012 R2 Server OS
In the last couple of months we have had 2 major incidents where we lost a lot of messages. On both occasions, some queues were shown in the RabbitMQ management console as running on only 1 or 2 nodes (the 3rd node was completely missing from the list of synchronised mirrors on all the queues). (Refer to the QueuesNotSynced image.)
The Overview tab was showing a "Network partition detected" error message (refer to the networkPartitions image).
From the remaining disk space, I could see that the 3rd node contained all the messages, but the node wasn't appearing in the cluster, and the Overview tab in the RabbitMQ management page was showing the node in red with a "node not running" message.
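In case it helps anyone compare against their own cluster: as far as I understand, the same state shown on the Overview tab can also be read from the management HTTP API (`GET /api/nodes`, which lists each node with `running` and `partitions` fields). Below is a minimal sketch of how that payload could be summarised. The node names, port, credentials and the sample payload are made up for illustration and are not taken from our cluster.

```python
import base64
import json
import urllib.request


def fetch_nodes(base_url, user, password):
    """Fetch the node list from the management API (assumes the plugin is enabled)."""
    token = base64.b64encode(f"{user}:{password}".encode()).decode()
    req = urllib.request.Request(
        f"{base_url}/api/nodes",
        headers={"Authorization": f"Basic {token}"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)


def partition_report(nodes):
    """Summarise running state and reported partitions per node."""
    return {
        node["name"]: {
            "running": node.get("running", False),
            "partitions": node.get("partitions", []),
        }
        for node in nodes
    }


def has_partition(nodes):
    """True if any node reports at least one partitioned peer."""
    return any(node.get("partitions") for node in nodes)


# Hypothetical payload shaped like the /api/nodes response (illustration only).
SAMPLE_NODES = [
    {"name": "rabbit@Server1", "running": True, "partitions": ["rabbit@Server3"]},
    {"name": "rabbit@Server2", "running": True, "partitions": []},
    {"name": "rabbit@Server3", "running": False},
]

for name, state in partition_report(SAMPLE_NODES).items():
    print(name, state)
print("partition detected:", has_partition(SAMPLE_NODES))
```

Against a live cluster, `fetch_nodes("http://localhost:15672", user, password)` would replace `SAMPLE_NODES`, assuming the management plugin is listening on its default port.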
We wanted to recover the messages, as there were about 1.2 million messages on the queues at that time. We tried stopping the RabbitMQ service on the 2 nodes that were showing as synchronised on some queues, and then stopping and starting the 3rd node (which we think contained all the lost messages). The 3rd node came back online, but all the queues were showing a status of Down and were in red. I also started noticing the available disk space going up (meaning that the messages were being deleted). At this point we brought the other 2 servers back online as well, and the cluster started working as normal, but the messages were lost completely.
Can you help me investigate this issue please? I am not sure what kind of information I need to provide, but I can provide any logs that you are interested in. I think the entries below are the first error messages we saw on all 3 servers when the issue started to occur.
We would appreciate any help you can provide.
Server 1
2018-04-01 00:02:08.183
inconsistent_database, running_partitioned_network
Mnesia('Server1'): ** ERROR ** mnesia_event got {inconsistent_database, running_partitioned_network, 'Server3'}
=INFO REPORT==== 1-Apr-2018::00:01:50 ===
Mirrored queue 'Queue1.DeadLetter' in vhost 'VHost1': Master <Server1.2.967.0> saw deaths of mirrors <Server2.2.29262.127> <Server3.1.668.0>
=INFO REPORT==== 1-Apr-2018::00:02:06 ===
Mirrored queue 'Queue2.Wait10' in vhost 'VHost2': Master <Server1.2.767.0> saw deaths of mirrors <Server2.2.29501.127> <Server3.1.835.0>
Server 2
=INFO REPORT==== 1-Apr-2018::00:01:33 ===
rabbit on node 'Server1' down
=INFO REPORT==== 1-Apr-2018::00:01:33 ===
node 'Server1' down: econnaborted
Server 3
=INFO REPORT==== 1-Apr-2018::00:01:32 ===
rabbit on node 'Server1' down
=INFO REPORT==== 1-Apr-2018::00:01:33 ===
node 'Server1' down: connection_closed
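To make the order of events easier to see, the `=INFO REPORT====` entries above can be merged into a single timeline sorted by timestamp. This is only a rough sketch: it assumes the header format shown in the excerpts, assumes the clocks on the 3 servers are reasonably in sync, and skips the first Server 1 lines, which use a different timestamp format. The function names are my own.

```python
import re
from datetime import datetime

# Matches headers such as "=INFO REPORT==== 1-Apr-2018::00:01:50 ==="
HEADER = re.compile(r"=(\w+) REPORT==== (\d{1,2}-\w{3}-\d{4}::\d{2}:\d{2}:\d{2}) ===")


def parse_reports(server, text):
    """Split one server's log text into report entries with parsed timestamps."""
    events = []
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        match = HEADER.match(line)
        if match:
            stamp = datetime.strptime(match.group(2), "%d-%b-%Y::%H:%M:%S")
            current = {"time": stamp, "server": server, "level": match.group(1), "body": []}
            events.append(current)
        elif current is not None and line:
            current["body"].append(line)
    return events


def timeline(logs):
    """Merge the entries from all servers and sort them by timestamp."""
    events = []
    for server, text in logs.items():
        events.extend(parse_reports(server, text))
    return sorted(events, key=lambda e: e["time"])


# The report entries from the excerpts above.
LOGS = {
    "Server 1": """
=INFO REPORT==== 1-Apr-2018::00:01:50 ===
Mirrored queue 'Queue1.DeadLetter' in vhost 'VHost1': Master <Server1.2.967.0> saw deaths of mirrors <Server2.2.29262.127> <Server3.1.668.0>
=INFO REPORT==== 1-Apr-2018::00:02:06 ===
Mirrored queue 'Queue2.Wait10' in vhost 'VHost2': Master <Server1.2.767.0> saw deaths of mirrors <Server2.2.29501.127> <Server3.1.835.0>
""",
    "Server 2": """
=INFO REPORT==== 1-Apr-2018::00:01:33 ===
rabbit on node 'Server1' down
=INFO REPORT==== 1-Apr-2018::00:01:33 ===
node 'Server1' down: econnaborted
""",
    "Server 3": """
=INFO REPORT==== 1-Apr-2018::00:01:32 ===
rabbit on node 'Server1' down
=INFO REPORT==== 1-Apr-2018::00:01:33 ===
node 'Server1' down: connection_closed
""",
}

for event in timeline(LOGS):
    print(event["time"].time(), event["server"], "|", " ".join(event["body"])[:70])
```

Sorted this way, Server 3 and Server 2 report 'Server1' as down at 00:01:32 and 00:01:33, before Server 1 logs the deaths of its mirrors at 00:01:50 and 00:02:06.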