Machines that are part of our RabbitMQ cluster seem to randomly drop off from the cluster


Denis Abramov

May 28, 2015, 11:16:02 AM
to rabbitm...@googlegroups.com
It seems that the machines in our RabbitMQ cluster keep dropping off randomly from the cluster. They always seem to be running fine, nothing in the event log but I can't figure out why this is happening from the logs produced by RabbitMQ.
Most recently we noticed that one of the machines in the cluster stopped responding to RabbitMQ connection requests at 2:11pm on 5/27. The machine was otherwise running fine, nothing in the Event Log. I am attaching its RabbitMQ logs.
Can someone please guide me on how to figure out what happened with this machine?

Thank you for your help,
Denis Abramov
Rabbitmq-logs.zip

Michael Klishin

May 28, 2015, 11:45:47 AM
to rabbitm...@googlegroups.com, Denis Abramov
On 28 May 2015 at 18:16:05, Denis Abramov (da...@columbia.edu) wrote:
> Most recently we noticed that one of the machines in the cluster
> stopped responding to RabbitMQ connection requests at 2:11pm
> on 5/27. The machine was otherwise running fine, nothing in the
> Event Log. I am attaching its RabbitMQ logs.
> Can someone please guide me on how to figure out what happened
> with this machine?

At least around that particular time, RabbitMQ was ordered to stop:

=INFO REPORT==== 27-May-2015::14:25:42 ===
Stopping RabbitMQ

There are no errors in your SASL log, only progress reports (a lot of them). 
--
MK

Staff Software Engineer, Pivotal/RabbitMQ


Denis Abramov

May 28, 2015, 12:25:11 PM
to rabbitm...@googlegroups.com, da...@columbia.edu
Yes, we probably stopped RabbitMQ around that time because we kept getting the following exception:

RabbitMQ.Client.Exceptions.OperationInterruptedException: The AMQP operation was interrupted: AMQP close-reason, initiated by Peer, code=404, text="NOT_FOUND - home node 'rabbit@ELLINAPP2' of durable queue 'Ellin.Request' in vhost '/' is down or inaccessible", classId=50, methodId=10, cause=

   at RabbitMQ.Client.Impl.SimpleBlockingRpcContinuation.GetReply()

   at RabbitMQ.Client.Impl.ModelBase.QueueDeclare(String queue, Boolean passive, Boolean durable, Boolean exclusive, Boolean autoDelete, IDictionary`2 arguments)

   at RabbitMQ.Client.Impl.AutorecoveringModel.QueueDeclare(String queue, Boolean durable, Boolean exclusive, Boolean autoDelete, IDictionary`2 arguments)


Also, when we logged in through the web interface to see how the cluster was doing, we saw an error saying something about RabbitMQ not supporting a partitioned network (don't recall the exact message at this point in time) and this particular server whose logs are above was in "RED" until we stopped and restarted it. Then everything was back to normal.

Aaron Kondziela

May 28, 2015, 12:40:06 PM
to rabbitm...@googlegroups.com
Rabbit can be very sensitive to network partitions. I've had problems with cloud-hosted clusters in the past, where normal hiccups or maintenance work would cause the cluster to fall apart. It's worth looking into.

Michael Klishin

May 28, 2015, 1:12:16 PM
to rabbitm...@googlegroups.com, Aaron Kondziela
 On 28 May 2015 at 19:40:08, Aaron Kondziela (aa...@aaronkondziela.com) wrote:
> I've had problems with cloud-hosted clusters in the past, where
> normal hiccups or maintenance work would cause the cluster to
> fall apart.

The log clearly says the node was ordered to stop.

Michael Klishin

May 28, 2015, 1:14:46 PM
to rabbitm...@googlegroups.com, Denis Abramov
On 28 May 2015 at 19:25:13, Denis Abramov (da...@columbia.edu) wrote:
> Yes, we probably stopped RabbitMQ around that time because
> we kept getting the following exception:
>
>
> RabbitMQ.Client.Exceptions.OperationInterruptedException:
> The AMQP operation was interrupted: AMQP close-reason, initiated
> by Peer, code=404, text="NOT_FOUND - home node 'rabbit@ELLINAPP2'
> of durable queue 'Ellin.Request' in vhost '/' is down or inaccessible",
> classId=50, methodId=10, cause=

This means nodes could not communicate for whatever reason.

Make sure you use autoheal partition handling. 

Denis Abramov

May 28, 2015, 2:23:26 PM
to rabbitm...@googlegroups.com, da...@columbia.edu
What is "autoheal partition handling"? How do I enable it?

Denis Abramov

May 28, 2015, 2:40:02 PM
to rabbitm...@googlegroups.com
MK,
   thanks, I found it: https://www.rabbitmq.com/partitions.html. Will try.

Which mode should I pick?

It's important to understand that allowing RabbitMQ to deal with network partitions automatically does not make them less of a problem. Network partitions will always cause problems for RabbitMQ clusters; you just get some degree of choice over what kind of problems you get. As stated in the introduction, if you want to connect RabbitMQ clusters over generally unreliable links, you should use federation or the shovel.

With that said, you might wish to pick a recovery mode as follows:

  • ignore - Your network really is reliable. All your nodes are in a rack, connected with a switch, and that switch is also the route to the outside world. You don't want to run any risk of any of your cluster shutting down if any other part of it fails (or you have a two node cluster).
  • pause_minority - Your network is maybe less reliable. You have clustered across 3 AZs in EC2, and you assume that only one AZ will fail at once. In that scenario you want the remaining two AZs to continue working and the nodes from the failed AZ to rejoin automatically and without fuss when the AZ comes back.
  • autoheal - Your network may not be reliable. You are more concerned with continuity of service than with data integrity. You may have a two node cluster.
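
For reference, a minimal sketch of how one of these modes might be selected, assuming the classic Erlang-term rabbitmq.config format and the cluster_partition_handling key described on the partitions page linked above. The choice of autoheal here simply follows MK's suggestion; the file location depends on the installation, and the same value would presumably need to be set on every node in the cluster, with each node restarted afterwards.

```erlang
%% rabbitmq.config (sketch; file location is installation-specific)
%% Assumption: no other settings are needed in the rabbit section.
%% Valid values per the partitions guide: ignore (the default),
%% pause_minority, autoheal.
[
  {rabbit, [
    {cluster_partition_handling, autoheal}
  ]}
].
```

If the file already contains a rabbit section, the cluster_partition_handling tuple would go inside that existing list rather than in a second rabbit entry.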