Yet another RabbitMQ cluster crash

Pier Castonguay

Mar 27, 2020, 11:15:14 AM

to rabbitmq-users

Even though I did not receive sufficient help with my 2 previous reports [1] and [2], I guess I will keep reporting them.

Woke up this morning with all client applications connecting to RabbitMQ stalled.

Checked the management console, 3 nodes are green. Checked the queues, red warning "unsynchronized mirror" everywhere.

Seems to be a different problem yet again.

Attached crash.log

I'm very disappointed. I feel like I will need to switch technology for production use; RabbitMQ is not stable enough.

Michael Klishin

Mar 27, 2020, 5:55:41 PM

to rabbitmq-users

We can only respond with the same "insufficient" response we had in [2].

We don't know the root cause of file system operations failing with "eexist". It's an OS response [1] to a runtime operation that RabbitMQ does not implement.

I do not recall this scenario being reported on any other OS, so you are welcome to give running RabbitMQ nodes on Linux a shot to see if the problem persists.

If you feel that this community-provided, entirely "best effort" mailing list support is insufficient, there are commercial support options, including from companies not affiliated with VMware.

While RabbitMQ maintainers and community try to build the best tool we can and provide a reasonable degree of support, all software eventually has issues. RabbitMQ and Erlang are both open source, so you are welcome to investigate and fix the root cause if you can find it. Some prominent community members have tried in the past [2] and we haven't succeeded.

Perhaps if some users with affected environments would step up and investigate the issue, it would be addressed soon enough.

Claiming that you haven't received "sufficient help" with an OSS piece of software you get to use for free and haven't contributed to is not productive.

No one on this list owes you anything.

1. https://docs.microsoft.com/en-us/cpp/c-runtime-library/errno-constants?view=vs-2019

2. https://github.com/rabbitmq/rabbitmq-server/issues/545

Michael Klishin

Mar 27, 2020, 6:12:18 PM

to rabbitmq-users

And while we are at it, there is no evidence of "a cluster crash". There is evidence of a file system operation that was rejected by the OS, which means at least one queue had to stop. There is evidence of channels terminating and eventually a connection (they tried to perform operations on said queue).

We do not know whether applications use appropriate data safety features [1][2]. Availability and data safety are a joint responsibility of servers and clients.

We do not know if this queue is a quorum one or mirrored. If not then there's only so much availability that can be expected from said queue.

We do know that there's little appreciation in this thread for the work this community does and shares free of charge every single day of the year.

Nodes in the management UI will be green if they are connected. This is not a node health indicator in any other sense.

Distributed infrastructure can fail in all kinds of non-trivial ways, including — as we see here — when there are disk write failures.

Even extensive monitoring [3][4] at several levels sometimes cannot catch every single problem, and as [3] explains, node health evaluation is a harder problem than it sounds, with every operations team having their own definition of what "healthy" means.

We have listed several operationally important aspects that affect availability and several options you can try, from hiring a consultant to switching [5] to an OS where this issue does not exhibit itself (to our knowledge).

I sense this still may not be satisfactory enough but hey, it'd cost you next to nothing and this list says "best effort community support" right on the home page.

1. https://www.rabbitmq.com/dotnet-api-guide.html

2. https://www.rabbitmq.com/confirms.html

3. https://www.rabbitmq.com/monitoring.html

4. https://www.rabbitmq.com/prometheus.html

5. https://www.rabbitmq.com/blue-green-upgrade.html

Michael Klishin

Mar 27, 2020, 6:17:26 PM

to rabbitmq-users

I now see "unsynchronised mirrors" are mentioned. I'd normally ask for logs to see what the sequence of events was beyond the provided 6 minute piece, but since we are obviously looking for a quick fix, consider using a quorum queue.

QQs are superior to classic mirrored queues in nearly every way [1], although they also have some limitations. They won't prevent the OS from rejecting writes, but mean time to recovery/resync, leader election and data safety in general would be much improved, and trying them out really doesn't take much effort.

1. https://www.rabbitmq.com/quorum-queues.html

M K

Mar 27, 2020, 6:23:31 PM

to rabbitmq-users

The first link should have been [1].

1. https://www.rabbitmq.com/dotnet-api-guide.html#recovery

Pier Castonguay

Mar 28, 2020, 12:26:27 AM

to rabbitmq-users

Hello Michael,

I have to say I was indeed unnecessarily harsh, and I'm sorry for how I expressed myself in my original post. I understand that this is free community help, and yes, we are considering hiring a consultant, if only to offer a bit of assurance that the solution is still viable.

You have to understand my frustration when I found out that my long-running stability test failed once again for reasons that seem out of my control. Add to that the fact that, with all due respect, the log files are extremely hard to analyze, and that I found no easy way to recover from the situation (in this case the "synchronize" button in the web UI doesn't seem to do anything) other than resetting all nodes and restarting from scratch.

From my external point of view, all 3 errors seemed like completely different situations, because they caused completely different behaviors. It's only now that you pointed it out, and after searching the error log, that I see the "eexist" keyword in there. I've since learned that the crashes happened again while the daily disk backup snapshots were being taken. Latency in file system operations could be the cause. Now I ask myself, if I can't rely on RabbitMQ to survive that kind of regular event, what can I trust it for? And what's the point of creating redundant clusters if it's just going to cause them to crash more instead of surviving such events?

Now I don't really understand what you mean by "runtime operation that RabbitMQ does not implement", but "eexist" is a properly documented and common error code of the file open call, returned when you try to create a file using flags that ask for that specific error if the file already exists. I wish I could take a look at the Erlang/RabbitMQ source code and find those file open operations, but it goes a bit beyond my field of competence. In any case, it wouldn't hurt to open a ticket asking for someone to look at the functions in the Mnesia database code that create new files and at how that error code could be caught and handled.
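To make the "eexist" semantics described above concrete, here is a minimal sketch of an exclusive create. It only illustrates the general OS behavior; it does not claim that Erlang or RabbitMQ open their files this way, and the file name `segment.tmp` is a made-up placeholder.

```python
import errno
import os
import tempfile


def create_exclusively(path):
    """Create `path` only if it does not exist yet.

    Returns None on success, or the symbolic errno name on failure.
    """
    try:
        # O_CREAT | O_EXCL asks the OS to fail instead of reusing an existing file.
        fd = os.open(path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    except FileExistsError as exc:
        return errno.errorcode[exc.errno]
    os.close(fd)
    return None


with tempfile.TemporaryDirectory() as tmp:
    target = os.path.join(tmp, "segment.tmp")  # hypothetical file name
    first = create_exclusively(target)   # file did not exist yet
    second = create_exclusively(target)  # same path again: the OS refuses
```

The second call fails because the file is already there, which is the situation a caller has to decide how to handle (retry, pick another name, or give up).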

> with every operations team having their own definition of what "healthy" means.

From my very basic point of view, if messages reach the queues and can be consumed, it's healthy. When the system is throwing errors, it's not. It's as simple as that. In my case, the system was completely non-functional after the error events happened.

> We do not know whether applications use appropriate data safety features. Availability and data safety are joint responsibility of servers and clients.

I couldn't say for sure. As I mentioned in a previous post, we don't control the RabbitMQ client code directly, but use an external library by Particular Software. From what I can tell, they are very competent, so I would trust them to have implemented safety features. The source code of their implementation is available here: https://github.com/Particular/NServiceBus.RabbitMQ. If anything pointed toward a bad implementation in the client code, I would turn to them. But from what I understand, the problem here isn't message loss or the client applications failing, but the RabbitMQ server itself ending up in an erroneous state.

> consider using a quorum queue

I see this is a recent feature, and it looks promising. But since it needs to be defined at creation time (instead of in the policies), and since queue creation is managed by that external library I mentioned, it's unfortunately not possible for me to test them as a quick fix. It seems they are considering implementing them in future versions, but have doubts because of the lack of TTL support (https://github.com/Particular/NServiceBus.RabbitMQ/issues/556)

> to switching to an OS where this issue does not exhibit itself (to our knowledge)

If that's your recommended action, this might very well be what we are going to try next. It's not a quick and free solution; it will require doubling the number of hosts in the infrastructure (currently all other software in the solution runs on Windows). And if you have doubts about the Windows implementation, as you seem to have, maybe think about adding a warning on the website saying that the Windows version has known issues?

Luke Bakken

Mar 28, 2020, 10:35:37 AM

to rabbitmq-users

Hi Pier,

> I've since learned that the crashes happened again while the daily disk backup snapshots were being taken. A latency on the file system operation could be the cause

This is an important new piece of information. Are you running virtual machines? Are the machines paused during a snapshot?

I don't think we know details about your environment. When I re-read your other discussions, I remember seeing Windows Server 2016. Could you let us know more about the environment in which they run? Virtual machines, what is their RAM capacity, disk, etc?

Could you re-share your complete configuration files? Please attach them to your response.

Do you have any system or RabbitMQ monitoring in place? https://www.rabbitmq.com/monitoring.html

> to switching to an OS where this issue does not exhibit itself (to our knowledge)

> If that's your recommended action, this might very well be what we are going to try next. It's not a quick and free solution; it will require doubling the number of hosts in the infrastructure (currently all other software in the solution runs on Windows). And if you have doubts about the Windows implementation, as you seem to have, maybe think about adding a warning on the website saying that the Windows version has known issues?

I haven't yet read anything that seems Windows-specific in your issue descriptions. What I'm guessing at the moment is that your disk snapshot process pauses a virtual machine, which is known to not be handled well by RabbitMQ - https://www.rabbitmq.com/partitions.html#suspend

Thanks,

Luke

--
Senior Member of Technical Staff
VMware / RabbitMQ

Pier Castonguay

Mar 30, 2020, 11:42:47 AM

to rabbitmq-users

Hello Luke,

> This is an important new piece of information. Are you running virtual machines? Are the machines paused during a snapshot?

Indeed. I'm not certain a snapshot was the cause in the two other situations, but it was this time.

Yes we are running them under a vSphere environment.

I asked my network administrator. Snapshots are taken from vCenter directly, but he mentioned that a VSS operation is done first to tell the OS to handle open files for the copy.

The machines are never suspended per se, but it seems there is a small freeze of a few seconds during the operation.

> I don't think we know details about your environment. When I re-read your other discussions, I remember seeing Windows Server 2016. Could you let us know more about the environment in which they run? Virtual machines, what is their RAM capacity, disk, etc?

The first machine is actually Windows Server 2012 (yeah, I know), the other two are Windows Server 2016. All x64.

Intel Xeon Gold 6130 @ 2.1 GHz, 8 GB RAM per instance, 100 GB disk space per disk per instance.

> Could you re-share your complete configuration files? Please attach them to your response.

They haven't changed since the beginning, except for the Erlang/RabbitMQ version. All pretty basic:

RabbitMQ 3.8.2, Erlang 22.2

Cluster configuration is as follows:

cluster_formation.peer_discovery_backend = rabbit_peer_discovery_classic_config

cluster_formation.classic_config.nodes.1 = rabbit@Developpement
cluster_formation.classic_config.nodes.2 = rabbit@server2
cluster_formation.classic_config.nodes.3 = rabbit@server3

cluster_partition_handling = autoheal

HA policy is as follows:

Name: HA
Pattern: .*
Apply to: Exchanges and queues

ha-mode: exactly
ha-params: 2
ha-sync-mode: automatic
max-length-bytes: 50000000
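For anyone wanting to reproduce this setup, the policy above can be written as the JSON body the management HTTP API accepts for `PUT /api/policies/{vhost}/{name}`. This is a minimal sketch that only builds the payload; it assumes that "Exchanges and queues" in the UI corresponds to `"apply-to": "all"`, and the function name is made up for illustration.

```python
import json


def ha_policy_payload():
    """Build the JSON body for the "HA" policy listed above."""
    return {
        "pattern": ".*",
        "apply-to": "all",  # assumption: UI label "Exchanges and queues"
        "definition": {
            "ha-mode": "exactly",
            "ha-params": 2,
            "ha-sync-mode": "automatic",
            "max-length-bytes": 50000000,
        },
    }


body = json.dumps(ha_policy_payload(), indent=2)
print(body)
```

With `ha-mode: exactly` and `ha-params: 2` on a three-node cluster, each queue has a leader and one mirror, so one node holds no copy of any given queue.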

> Do you have any system or RabbitMQ monitoring in place?

We do have basic application-level monitoring in place via ServicePulse Monitoring, which logs exceptions from each service and shows live graphs of each queue's length, throughput, processing time, etc., similar to the RabbitMQ monitoring page. Of course, when the crash happens, this all drops to 0.

We also have OS-level monitoring via software called PRTG Network Monitor. It only takes a reading once per minute, but keeps track of CPU, memory, network, disk, uptime, service status, and even specific processes such as RabbitMQ memory usage and other custom performance counters. On those graphs, I see absolutely no unusual spike on the day of the event.

Unfortunately, we do not have advanced monitoring interacting with RabbitMQ CLI tools as described on your website.

> I haven't yet read anything that seems Windows-specific in your issue descriptions

This was based on Michael Klishin's answer, which seemed to imply that Windows was causing the problem.

> pauses a virtual machine, which is known to not be handled well by RabbitMQ

I would not have expected a pause/resume operation on a VM to be any different from a network failure or a system shutdown in terms of how RabbitMQ tries to resume its operation when it's back. I believe running RabbitMQ on VMs (even on AWS/Azure hosts) is quite common, and that taking snapshots of servers is also a common operation.

So for my questions:

1. How do you propose we handle server snapshots if RabbitMQ can't survive a short pause?

2. Is there a way to quickly recover from the situation once it failed without resetting all nodes? As I said, each situation caused completely different behavior:

- The first time, the RabbitMQ server itself became unresponsive.

- The second time, exchange bindings were gone.

- The third time, queues did not resynchronize automatically.

- And now, in my current state, I tried to restore from backup instead of resetting (because that is a long, unpleasant manual operation), and what I see is that all queues are shown in the management UI on each node but marked as "down", and if I click on them I get "The object you clicked on was not found; it may have been deleted on the server.".

3. Is there something that can be done from the Client library to recover from it? I see a lot of this exception in my application logs:

RabbitMQ.Client.Exceptions.OperationInterruptedException: The AMQP operation was interrupted: AMQP close-reason, initiated by Peer, code=404, text="NOT_FOUND - home node 'rabbit@Developpement' of durable queue 'nsb.delay-level-27' in vhost '/' is down or inaccessible", classId=50, methodId=10, cause=

Maybe the client library could catch the NOT_FOUND error and try to re-create the queues/exchanges in this situation? Is there a precise description I could post as an issue on the Particular Software GitHub that would be clear and help keep the cluster valid?

Thanks

Luke Bakken

Mar 30, 2020, 12:14:14 PM

to rabbitmq-users

Hello -

> pauses a virtual machine, which is known to not be handled well by RabbitMQ

> I would not have expected a pause/resume operation on a VM to be any different from a network failure or a system shutdown in terms of how RabbitMQ tries to resume its operation when it's back. I believe running RabbitMQ on VMs (even on AWS/Azure hosts) is quite common, and that taking snapshots of servers is also a common operation.

Pausing a VM is quite different than a network failure or graceful shutdown and that isn't unique to RabbitMQ or Erlang. Some distributed systems handle it better than others. By RabbitMQ version 4.0 we expect to have this scenario handled, but at this time pausing VMs should be disabled.

Based on the issues you describe it does not appear that the pauses were short, either.

> 1. How do you propose we handle server snapshots if RabbitMQ can't survive a short pause?

Don't take snapshots since there is little need for them for a message broker. Automate the setup of Windows and RabbitMQ if you have to recover from a disaster, or take much-less-frequent snapshots where you gracefully stop RabbitMQ first, then re-start after the snapshot is done.
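The "stop, snapshot, restart" option above can be sketched as a small wrapper. This is a minimal sketch under stated assumptions: that `rabbitmq-service.bat` is reachable on PATH on the Windows host, and that `take_snapshot` is a hypothetical callable standing in for whatever triggers the vSphere/VSS snapshot in a given environment. Neither detail comes from this thread.

```python
import subprocess

# Assumption: the RabbitMQ Windows service script is reachable on PATH.
STOP_BROKER = ["rabbitmq-service.bat", "stop"]
START_BROKER = ["rabbitmq-service.bat", "start"]


def snapshot_with_broker_stopped(take_snapshot, run=subprocess.check_call):
    """Stop the local node, run the snapshot step, then always start the node again.

    `take_snapshot` is a hypothetical callable supplied by the caller.
    `run` executes a command list and raises on failure; it is injectable
    so the sequence can be exercised without touching a real service.
    """
    run(STOP_BROKER)
    try:
        take_snapshot()
    finally:
        # Restart even if the snapshot step failed, so the node is not left down.
        run(START_BROKER)
```

Nothing runs until the function is called. In a cluster it would presumably be applied to one node at a time so the remaining nodes keep serving clients.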

> 2. Is there a way to quickly recover from the situation once it failed without resetting all nodes?

It depends on the situation. I can't give specific advice with the information provided. Start by disabling snapshots, then re-run your tests.

> 3. Is there something that can be done from the Client library to recover from it? I see a lot of this exception in my application logs:

> RabbitMQ.Client.Exceptions.OperationInterruptedException: The AMQP operation was interrupted: AMQP close-reason, initiated by Peer, code=404, text="NOT_FOUND - home node 'rabbit@Developpement' of durable queue 'nsb.delay-level-27' in vhost '/' is down or inaccessible", classId=50, methodId=10, cause=

> Maybe the client library could catch the NOT_FOUND error and try to re-create the queues/exchanges in this situation? Is there a precise description I could post as an issue on the Particular Software GitHub that would be clear and help keep the cluster valid?

I think what you need to do first is disable snapshots in your environment, then re-run your tests.
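For anyone writing up the exception quoted above as an issue, it may help to pull the structured fields out of the close reason. The following is a minimal sketch, assuming the message keeps the exact shape shown in this thread; the helper name and the small lookup table are illustrative and are not part of any client library.

```python
import re

# A hand-picked subset of AMQP 0-9-1 (class id, method id) pairs.
AMQP_METHODS = {
    (40, 10): "exchange.declare",
    (50, 10): "queue.declare",
    (50, 20): "queue.bind",
    (60, 20): "basic.consume",
    (60, 40): "basic.publish",
}

CLOSE_REASON = re.compile(
    r'code=(?P<code>\d+), text="(?P<text>.*)", '
    r"classId=(?P<class_id>\d+), methodId=(?P<method_id>\d+)"
)
HOME_NODE = re.compile(r"home node '([^']+)' of durable queue '([^']+)'")


def parse_close_reason(message):
    """Extract reply code, failing AMQP method, home node and queue name, if present."""
    match = CLOSE_REASON.search(message)
    if match is None:
        return None
    key = (int(match["class_id"]), int(match["method_id"]))
    home = HOME_NODE.search(match["text"])
    return {
        "code": int(match["code"]),
        "method": AMQP_METHODS.get(key, "unknown"),
        "home_node": home.group(1) if home else None,
        "queue": home.group(2) if home else None,
    }


sample = (
    "The AMQP operation was interrupted: AMQP close-reason, initiated by Peer, "
    "code=404, text=\"NOT_FOUND - home node 'rabbit@Developpement' of durable queue "
    "'nsb.delay-level-27' in vhost '/' is down or inaccessible\", "
    "classId=50, methodId=10, cause="
)
info = parse_close_reason(sample)
print(info)
```

In AMQP 0-9-1, class 50 with method 10 is `queue.declare`, which suggests the client was already trying to declare the queue when the broker refused because the queue's home node was unavailable. That detail is probably worth including in any issue description.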