Issues with RabbitMQ crashing, maybe a mnesia error or memory spike

Josh Vander Berg

Oct 3, 2019, 5:02:33 PM
to rabbitmq-users
We've been having issues with RabbitMQ 'soft' crashing over the last couple of days.

It never really stops working entirely, but it generates a high volume of error messages, and some of our customers are unable to initiate new subscriptions.

If we restart RabbitMQ, it appears to recover properly and then resumes processing the same client load it was handling previously, but without errors.

I've attached a sample of the crash log around the latest event, and a sample of the rabbitmq log at the same period of time.

We've also got a process which captures the 'rabbitmqctl status' output, and I've attached a 'good' one from when there were no errors earlier in the morning, and one with several snapshots around the time we started having issues.

Around these events we see a spike in load, but no significant spike in CPU.  One event had a memory spike, but a more recent event did not, so we are not sure whether that's a factor.

This RabbitMQ instance is relatively busy, using around 1000% CPU on a 48-core machine.

My gut is that we've just reached some saturation point in message throughput and need to scale, but I'd like to validate that this is the cause, as the error messages don't really seem all that explanatory.

It would also be nice to know if we are somehow misconfigured in a way that limits our throughput.  We've been through the RabbitMQ production guide and didn't see anything obvious there.
crash-log.txt
rabbitmq-sample.log
bad_statusdumps.txt
good_statusdump.txt

Luke Bakken

Oct 3, 2019, 5:56:31 PM
to rabbitmq-users
Hi Josh,

To start, both the Erlang and RabbitMQ versions you're using are out of date.

The log file suggests that there may be high connection churn going on. I can see that mnesia is overloaded and that queue declarations are timing out. If your STOMP clients only connect briefly, over and over, you'll want to tune RabbitMQ and your system for this:

https://www.rabbitmq.com/networking.html#dealing-with-high-connection-churn

If nothing else, it can help with memory consumption.
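
For illustration, the kernel-level knobs that guide discusses look roughly like this; the values below are placeholders, not recommendations for your workload:

    # /etc/sysctl.d/99-rabbitmq-churn.conf -- illustrative values only
    # clean up sockets from closed connections sooner
    net.ipv4.tcp_fin_timeout = 30
    # give churned connections a wider ephemeral port range to draw from
    net.ipv4.ip_local_port_range = 10000 65535

They can be applied with 'sysctl --system' or at boot.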

Do you monitor other system stats along with CPU? I would be interested to know what disk I/O wait looks like during one of these events. Can you describe your environment? Clustered or not, VMs, what capacity, whether you mirror queues ... all of those details matter.
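
For example, something as simple as the following captured during an event would help (assuming the sysstat tools are installed):

    # extended per-device stats plus a CPU summary that includes %iowait,
    # sampled every 5 seconds
    iostat -x 5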

Thanks -
Luke

Josh Vander Berg

Oct 3, 2019, 6:21:56 PM
to rabbitm...@googlegroups.com
The churn, we think, is just related to a large pool of clients connecting and disconnecting over time.  By default we've got one websocket connection per tab of our web application, and this results in a long-lived STOMP connection to RabbitMQ; the connection closes when somebody closes a tab.  With thousands of customers, each having many tabs open, the amount of tab opening and closing can be high, and each one comes with a connect/disconnect.

The RabbitMQ instance is running on a 48-core virtual machine.  It's not dedicated: there's a Java application server and an Apache Camel instance also running on the same machine.  The JVM and beam.smp split their CPU usage almost equally, and I think they use on average a total of about 40% CPU (20% each).

There is no clustering, though we do have a hot standby machine, so effectively it's just a single instance of RabbitMQ.  We use Spring's messaging platform on the JVM, which wraps RabbitMQ as a message broker, and we use Apache Camel to handle some inbound messaging from third parties.  So we don't really write code that directly interacts with RabbitMQ ourselves; we configure Spring Messaging and Apache Camel, and they handle orchestrating RabbitMQ.

Our logging indicates that yes, we did have large spikes in reads/writes around the time of the most recent incidents.

It's worked quite well without much intervention, until recently.


Michael Klishin

Oct 4, 2019, 12:51:16 AM
to rabbitmq-users
This does look like a high connection and queue churn scenario [1].

Instead of one shared 48-core machine, it would make more sense to use a three-node 3.8 cluster with a smaller total number of cores.
Depending on the spikes (which your monitoring should be able to quantify [2][3]), the TCP connection backlog might need increasing [4], but
we cannot infer that from the provided files.
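
For reference, if it does turn out to be needed, raising the backlog is a small change; 4096 below is purely illustrative:

    # rabbitmq.conf
    tcp_listen_options.backlog = 4096

    # the kernel cap (sysctl) must be at least as high
    net.core.somaxconn = 4096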

CPU context switching and contention can make things worse during a spike of connections or disconnections. With 48 cores
and a busy neighbor I'd take a look at the rate of CPU context switching and what Erlang VM scheduler binding is used [5].
Note that kernel versus user space CPU load is not particularly relevant, since what you are observing is operation timeouts during a spike.
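
As a sketch of what inspecting and changing scheduler binding looks like (whether a bind type such as db is appropriate depends on your topology):

    # inspect the current binding on a running node
    rabbitmqctl eval 'erlang:system_info(scheduler_bind_type).'

    # set it via rabbitmq-env.conf (db = default bind)
    RABBITMQ_SERVER_ADDITIONAL_ERL_ARGS="+stbt db"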

It sounds like a good opportunity to beef up your monitoring and to upgrade using the Blue/Green deployment strategy, since you only
have one node. I think you should go directly to 3.8.0, or whatever 3.8.x is available at the time [9].

This would make CPU contention a non-problem, but the downside of having N nodes is that all schema
operations (queue declaration and deletion, for example) become more expensive, since now three nodes have to perform
them and coordinate instead of just one. Nonetheless, to me a beefy shared machine sounds suboptimal and harder
to reason about compared to one (or N) nodes for Camel plus a smaller three-node RabbitMQ cluster.

Having a new separate cluster would allow you to experiment with it before committing.

Lastly, if the activity on those connections is relatively low, you can probably reduce their TCP buffer size and save some
memory [8], but that's not really related to the incident at hand.
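
For reference, that is also a small rabbitmq.conf change; the sizes below are illustrative for mostly-idle connections:

    # smaller per-connection TCP buffers trade peak throughput for memory
    tcp_listen_options.sndbuf = 32768
    tcp_listen_options.recbuf = 32768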




--
MK

Staff Software Engineer, Pivotal/RabbitMQ

Michael Klishin

Oct 4, 2019, 12:59:59 AM
to rabbitmq-users
This reminds me that with 3.8, you get a powerful monitoring option [1] almost for free.
I have no doubt that you have decent monitoring in place already, but high connection churn events specifically
would be particularly easy to spot with the dashboards in [1].
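
As a sketch, assuming you go with the Prometheus support that ships with 3.8:

    # enable the built-in metrics endpoint
    rabbitmq-plugins enable rabbitmq_prometheus

    # Prometheus (and the Grafana dashboards) then scrape this endpoint
    curl -s http://localhost:15692/metrics | head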

JFYI :)

Josh Vander Berg

Oct 4, 2019, 8:09:52 AM
to rabbitm...@googlegroups.com
Cool, thanks for the tip. We are planning to upgrade; can't hurt, might help.

Luke Bakken

Oct 4, 2019, 12:56:44 PM
to rabbitmq-users
Hi Josh,

It's best to always dedicate a VM or machine to the Erlang VM and not have it fight for CPU. The Erlang VM assumes it has full control over all the CPUs available and schedules work accordingly.

It would probably be an improvement to use two 24-core VMs rather than a single VM. Or you can configure the Erlang VM and the JVM to each use a disjoint subset of cores.
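
A sketch of the second option on Linux; the core split and the Java invocation are purely illustrative:

    # pin the Erlang VM to cores 0-23 and start 24 schedulers
    RABBITMQ_SERVER_ADDITIONAL_ERL_ARGS="+S 24:24" taskset -c 0-23 rabbitmq-server

    # pin the JVM to the remaining cores (app.jar is a placeholder)
    taskset -c 24-47 java -jar app.jar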

Thanks -
Luke