RabbitMQ crashing due to mnesia error

Chris

Mar 25, 2019, 1:57:34 PM
to rabbitmq-users
I use RabbitMQ on Ubuntu 18.04 as my messaging backend for a Celery worker cluster, and about once or twice a year, RabbitMQ will crash and be unable to start, showing nothing in the logs except the error:

    Mnesia is overloaded

Googling this message shows a handful of people have encountered this, but no one's found a solution, short of uninstalling and reinstalling RabbitMQ, or simply deleting everything under /var/lib/rabbitmq/mnesia/, which is nearly the equivalent.

I've recently made a change to my application that drastically increases the number of Celery tasks created, and now RabbitMQ crashes with this error every couple of days, so this problem has become debilitating.

How do I fix this, or at least diagnose it? It's frustrating that a few people have reported this error, but no one's found a fix. Is RabbitMQ no longer under active development, or is this type of crash with a large number of messages a known limitation of RabbitMQ?

Daniil Fedotov

Mar 25, 2019, 2:25:03 PM
to rabbitmq-users
Hi,

The "Mnesia is overloaded" message is a warning, telling you that Mnesia (RabbitMQ's internal database) cannot dump writes to disk fast enough. This may cause memory accumulation or fill the disk, but should not by itself crash the node. Memory accumulation and a full disk may cause a crash, though.

You should check the system logs and monitoring to see if the node is killed because it exceeds memory limits or if the disk is full. You can also check if there is an `erl_crash.dump` file, which should contain the node crash information.
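The checks above can be sketched as a small Python script. This is only a rough illustration: the log path, the data directories and the exact wording of the kernel's OOM-killer messages are assumptions for a typical Ubuntu package install and vary between systems and kernel versions.

```python
import re
import shutil
from pathlib import Path

# Kernel OOM-killer lines usually resemble
#   "Out of memory: Kill process 1234 (beam.smp) score ..."
#   "Killed process 1234 (beam.smp) total-vm:..."
# The wording is an assumption and differs between kernel versions.
OOM_PATTERN = re.compile(
    r"(?:Out of memory|Killed process).*\((?:beam\.smp|erl)\)", re.IGNORECASE
)


def find_oom_kills(log_lines):
    """Return log lines that look like the kernel killing the Erlang VM."""
    return [line.rstrip("\n") for line in log_lines if OOM_PATTERN.search(line)]


def disk_used_fraction(path="/"):
    """Return the used fraction (0.0 to 1.0) of the filesystem holding path."""
    usage = shutil.disk_usage(path)
    return usage.used / usage.total


def find_crash_dumps(search_dirs):
    """Return erl_crash.dump files found directly inside the given directories."""
    candidates = (Path(d) / "erl_crash.dump" for d in search_dirs)
    return [p for p in candidates if p.is_file()]


if __name__ == "__main__":
    # All paths below are assumptions; adjust them for your install.
    kern_log = Path("/var/log/kern.log")
    try:
        with kern_log.open(errors="replace") as fh:
            for hit in find_oom_kills(fh):
                print("possible OOM kill:", hit)
    except OSError as exc:
        print("could not read kernel log:", exc)

    data_dir = Path("/var/lib/rabbitmq")
    if data_dir.is_dir():
        print(f"disk used under {data_dir}: {disk_used_fraction(data_dir):.0%}")

    for dump in find_crash_dumps(["/var/lib/rabbitmq", "/var/log/rabbitmq"]):
        print("crash dump found:", dump)
```

Reading the kernel log may need root. If no OOM lines turn up and no dump file exists, that points away from a memory kill, but it does not rule one out if the logs have been rotated.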

There are ways to tweak the Mnesia log dump mechanism: http://erlang.org/doc/man/mnesia.html#configuration-parameters
You can try setting different values of dc_dump_limit and dump_log_write_threshold, but they have an impact on memory and disk usage and could make things worse in your case.
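As a rough sketch of where such parameters would go: assuming the classic Erlang-terms format of `rabbitmq.config`, Mnesia's settings sit in their own `mnesia` section next to any `rabbit` section. The Python below only renders such a fragment so the shape is visible. The helper name is hypothetical and the values are placeholders, not recommendations.

```python
def render_mnesia_section(params):
    """Render Mnesia application settings as an Erlang-terms config fragment.

    `params` maps Mnesia parameter names to integer values. The result is
    meant to be merged by hand into an existing classic-format config file.
    """
    entries = ",\n".join(
        f"    {{{key}, {value}}}" for key, value in params.items()
    )
    return "[\n  {mnesia, [\n" + entries + "\n  ]}\n]."


if __name__ == "__main__":
    # Placeholder values chosen purely for illustration.
    print(render_mnesia_section({
        "dump_log_write_threshold": 50000,
        "dc_dump_limit": 40,
    }))
```

If the file already contains a `{rabbit, [...]}` section, the `{mnesia, [...]}` tuple would be added to the same top-level list rather than as a second list, and the node has to be restarted for the change to be picked up.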

Chris

Mar 25, 2019, 4:09:46 PM
to rabbitmq-users
RabbitMQ is definitely crashing, after which it refuses to start. Running `sudo service rabbitmq-server start` returns a timeout error. I read that with enough queued messages, Rabbit may actually have started but be taking so long that it appears not to be starting. Indeed, some Erlang and beam.smp processes are running after the attempted start, which I believe Rabbit launches on startup, but I'm not familiar enough with Rabbit's internals to know what this indicates.

I considered the disk-full issue, but my disk is only 45% full, so I don't think that's the problem.

It's certainly possible that it's being killed by the kernel for consuming too much memory, but I haven't directly observed this. If that's the case, how would I fix that? My server has 16 GB of memory, only half of which seems to be in use at any given time.

>There are ways to tweak the Mnesia log dump mechanism: http://erlang.org/doc/man/mnesia.html#configuration-parameters
>You can try setting different values of dc_dump_limit and dump_log_write_threshold, but they have an impact on memory and disk usage and could make things worse in your case.

I've seen that link before, but I don't see how it helps me. As I understand it, Mnesia is a separate product from RabbitMQ, and is configured independently. Even worse, those docs don't explain how to actually apply any of those parameters. What's the filename to edit? What's the format of the file? None of the RabbitMQ docs explain how to modify any Mnesia settings. I tried toying around by putting some Mnesia values in /etc/rabbitmq/rabbitmq.config, but it had no effect on RabbitMQ.

Luke Bakken

Mar 25, 2019, 6:37:05 PM
to rabbitmq-users
Hi Chris,

Please include your entire rabbitmq.config, including what you tried to configure Mnesia.

Can you please describe what Celery does when you "drastically increase the number of Celery tasks created"? When you do this, does Celery create and delete queues much more frequently, or bindings, or any other RabbitMQ entity (other than messages themselves)?

Usually, you only see "Mnesia is overloaded" when queues, connections, channels, exchanges, bindings, users, etc. are created and deleted over and over (high churn). Mnesia is not used to store messages or message metadata.
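One rough way to check for queue churn is to take two snapshots of the queue names a short time apart and compare them. The sketch below assumes `rabbitmqctl` is on the PATH and can be run with sufficient privileges. The helper names and the 60-second interval are arbitrary choices for illustration, and it only covers queues, not bindings, channels or connections.

```python
import shutil
import subprocess
import time


def list_queue_names():
    """Return the current set of queue names as reported by rabbitmqctl."""
    out = subprocess.run(
        ["rabbitmqctl", "-q", "list_queues", "name"],
        check=True, capture_output=True, text=True,
    ).stdout
    # Some versions also print a header line; it appears in every snapshot,
    # so it cancels out when two snapshots are compared.
    return {line.strip() for line in out.splitlines() if line.strip()}


def queue_churn(before, after):
    """Compare two snapshots of queue names and report what changed."""
    before, after = set(before), set(after)
    return {
        "created": sorted(after - before),
        "deleted": sorted(before - after),
    }


if __name__ == "__main__" and shutil.which("rabbitmqctl"):
    first = list_queue_names()
    time.sleep(60)  # arbitrary sampling interval
    second = list_queue_names()
    churn = queue_churn(first, second)
    print(f"{len(churn['created'])} queues created, "
          f"{len(churn['deleted'])} queues deleted in 60 s")
```

Queues that are created and deleted within the same interval are missed, so this gives a lower bound on churn. A steady stream of new and vanished names between snapshots would still be a strong hint.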

Thanks,
Luke

Michael Klishin

Mar 29, 2019, 2:15:34 PM
to rabbitmq-users
IIRC older versions of Celery created a queue per task, which is unnecessary IMO. I think modern versions no longer do that but they might still perform binding operations and such which also have no practical effect.

A traffic capture would say a lot about what the client really does [1].


--
MK

Staff Software Engineer, Pivotal/RabbitMQ