Unstable Rabbit Cluster

Krishna M

Apr 3, 2018, 11:26:23 AM

to rabbitmq-users

Hi,

We have a RabbitMQ cluster running with the following policies:

ha-mode:all

ha-sync-mode:automatic

We have 3 servers running in this cluster. The RabbitMQ and server specs are as follows:

RabbitMQ version: 3.6.1

Erlang: 19.2

Servers: Windows 2012 R2 Server OS

In the last couple of months we had 2 major incidents where we lost a lot of messages. On both occasions the RabbitMQ management console showed some queues as running on only 1 or 2 nodes (the 3rd node was completely missing from the list of synchronised mirrors on all the queues). (See the QueuesNotSynced image.)

The Overview tab was showing a "Network partition detected" error message (see the NetworkPartitions image).

From the remaining disk space, I could see that the 3rd node contained all the messages, but the node wasn't appearing in the cluster, and the Overview tab in the RabbitMQ management page was showing the node in red with a message that the node was not running.

We wanted to recover the messages, as there were about 1.2 million messages on the queues at that time. We tried stopping the RabbitMQ service on the 2 nodes (the ones that were showing up as synced on some queues) and then stopping and starting the 3rd node (which we think contained all the lost messages). The 3rd node came back online, but all the queues were showing a status of Down and were in red. I also started noticing the available disk space going up (meaning that the messages were getting deleted). At this point we brought the other 2 servers online as well and the cluster started working as normal, but the messages were lost completely.

Can you help me investigate this issue please? I am not sure what kind of information I need to provide, but I can provide any logs that you are interested in. I think the entries below are the first error messages we saw on all 3 servers when the issue started to occur.

I would appreciate any help you can provide.

Server 1

2018-04-01 00:02:08.183

inconsistent_database, running_partitioned_network

Mnesia('Server1'): ** ERROR ** mnesia_event got {inconsistent_database, running_partitioned_network, 'Server3'}

=INFO REPORT==== 1-Apr-2018::00:01:50 ===

Mirrored queue 'Queue1.DeadLetter' in vhost 'VHost1': Master <Server1.2.967.0> saw deaths of mirrors <Server2.2.29262.127> <Server3.1.668.0>

=INFO REPORT==== 1-Apr-2018::00:02:06 ===

Mirrored queue 'Queue2.Wait10' in vhost 'VHost2': Master <Server1.2.767.0> saw deaths of mirrors <Server2.2.29501.127> <Server3.1.835.0>

Server 2

=INFO REPORT==== 1-Apr-2018::00:01:33 ===

rabbit on node 'Server1' down

=INFO REPORT==== 1-Apr-2018::00:01:33 ===

node 'Server1' down: econnaborted

Server 3

=INFO REPORT==== 1-Apr-2018::00:01:32 ===

rabbit on node 'Server1' down

=INFO REPORT==== 1-Apr-2018::00:01:33 ===

node 'Server1' down: connection_closed
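
To see which node noticed the failure first, excerpts like the ones above can be merged into one chronological list. The following is only a rough sketch: it assumes the `=INFO REPORT==== 1-Apr-2018::00:01:33 ===` header format shown in these excerpts, English month abbreviations, and reasonably synchronised clocks on all three servers. The function names `parse_reports` and `timeline` are made up for illustration.

```python
import re
from datetime import datetime

# Header format seen in the excerpts above, e.g.
# "=INFO REPORT==== 1-Apr-2018::00:01:33 ==="
HEADER = re.compile(
    r"^=(\w+) REPORT==== (\d{1,2}-\w{3}-\d{4}::\d{2}:\d{2}:\d{2}) ===$"
)


def parse_reports(node, text):
    """Return (timestamp, node, level, message) tuples from one node's log text."""
    events = []
    current = None
    for raw_line in text.splitlines():
        line = raw_line.strip()
        match = HEADER.match(line)
        if match:
            # Month names are parsed with %b, which assumes an English locale.
            stamp = datetime.strptime(match.group(2), "%d-%b-%Y::%H:%M:%S")
            current = [stamp, node, match.group(1), []]
            events.append(current)
        elif line and current is not None:
            # Non-empty lines after a header belong to that report.
            current[3].append(line)
    return [(ts, name, level, " ".join(body)) for ts, name, level, body in events]


def timeline(logs):
    """Merge per-node logs (dict of node name -> log text) into one sorted list."""
    merged = []
    for node, text in logs.items():
        merged.extend(parse_reports(node, text))
    return sorted(merged, key=lambda event: event[0])
```

Calling `timeline({"Server2": server2_text, "Server3": server3_text})` with the raw log text of each node would list the reports in timestamp order, which is only as trustworthy as the clocks on the servers.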

QueuesNotSynced.PNG

NetworkPartitions.PNG

Michael Klishin

Apr 3, 2018, 8:25:49 PM

to rabbitm...@googlegroups.com

I'm not sure what your question is. There are events of nodes losing connectivity in the log: are you sure your cluster is configured to recover accordingly? [1]

You have to pick a strategy for nodes to use.

1. http://www.rabbitmq.com/partitions.html
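
For context, the partitions guide describes the `cluster_partition_handling` setting, with values such as `ignore`, `pause_minority` and `autoheal`. Whichever strategy is picked, it can help to check from outside whether any node currently reports a partition. Below is a minimal sketch, assuming the management plugin is enabled and that `GET /api/nodes` returns a JSON list whose entries carry `name` and `partitions` fields. The base URL, the credentials and the function names `fetch_nodes` and `partitioned_nodes` are placeholders, not something taken from this thread.

```python
import base64
import json
import urllib.request


def fetch_nodes(base_url, user, password):
    """Fetch the node list from the management HTTP API (GET /api/nodes)."""
    request = urllib.request.Request(base_url.rstrip("/") + "/api/nodes")
    token = base64.b64encode(f"{user}:{password}".encode()).decode()
    request.add_header("Authorization", "Basic " + token)
    with urllib.request.urlopen(request, timeout=10) as response:
        return json.load(response)


def partitioned_nodes(nodes):
    """Map each node name to the peers it reports as partitioned (non-empty only)."""
    report = {}
    for node in nodes:
        peers = node.get("partitions") or []
        if peers:
            report[node["name"]] = list(peers)
    return report
```

For example, `partitioned_nodes(fetch_nodes("http://localhost:15672", "guest", "guest"))` would return an empty dict when no node reports a partition. The URL and credentials here are the usual local defaults and would need to be replaced for a real cluster.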
