Server rebuilds index after crash, uses up all memory, dies

Linas Valiukas

May 26, 2016, 4:15:17 PM
to rabbitmq-users
Hello!

In order to try out RabbitMQ with our workload, I fired up an EC2 instance and added ~42 million messages to one of the queues.

All the messages were added to the persistent queue just fine, but then I kill -9'ed the RabbitMQ instance to test whether it would start up again after being shut down abruptly. I'm aware that the proper way to stop the server is via "rabbitmqctl stop", but I wanted to see what would happen in the event of a system crash.

The problem is that upon startup, the server tries to rebuild the index ("rebuilding indices from scratch"), apparently uses up all the memory, and dies silently:

=INFO REPORT==== 26-May-2016::14:08:20 ===
Starting RabbitMQ 3.6.2 on Erlang 18.3
Copyright (C) 2007-2016 Pivotal Software, Inc.
Licensed under the MPL.  See http://www.rabbitmq.com/

=INFO REPORT==== 26-May-2016::14:08:20 ===
node           : mediacloud@localhost
home dir       : /home/ubuntu
config file(s) : /mediacloud/data/rabbitmq/rabbitmq.config
cookie hash    : 6RiAbVAbV1rGeqQ/Q2/AJQ==
log            : /mediacloud/data/rabbitmq/logs/media...@localhost.log
sasl log       : /mediacloud/data/rabbitmq/logs/media...@localhost-sasl.log
database dir   : /mediacloud/data/rabbitmq/mnesia/mediacloud@localhost

=INFO REPORT==== 26-May-2016::14:08:23 ===
Memory limit set to 1506MB of 3766MB total.

=INFO REPORT==== 26-May-2016::14:08:23 ===
Disk free limit set to 50MB

=INFO REPORT==== 26-May-2016::14:08:23 ===
Limiting to approx 65436 file handles (58890 sockets)

=INFO REPORT==== 26-May-2016::14:08:23 ===
FHC read buffering:  OFF
FHC write buffering: ON

=INFO REPORT==== 26-May-2016::14:08:23 ===
Priority queues enabled, real BQ is rabbit_variable_queue

=INFO REPORT==== 26-May-2016::14:08:23 ===
Management plugin: using rates mode 'basic'

=INFO REPORT==== 26-May-2016::14:08:23 ===
msg_store_transient: using rabbit_msg_store_ets_index to provide index

=INFO REPORT==== 26-May-2016::14:08:23 ===
msg_store_persistent: using rabbit_msg_store_ets_index to provide index

=WARNING REPORT==== 26-May-2016::14:08:23 ===
msg_store_persistent: rebuilding indices from scratch

=INFO REPORT==== 26-May-2016::14:29:05 ===
vm_memory_high_watermark set. Memory used:1868367216 allowed:1579615846

=WARNING REPORT==== 26-May-2016::14:29:05 ===
memory resource limit alarm set on node mediacloud@localhost.

**********************************************************
*** Publishers will be blocked until this alarm clears ***
**********************************************************

Why does RabbitMQ use up all of my memory (~4 GB RAM + 3 GB swap)? Is this the "uses a small amount of memory" case mentioned in https://www.rabbitmq.com/persistence-conf.html? How do I start the RabbitMQ server in this case?

Regards,

Michael Klishin

May 26, 2016, 5:30:43 PM
to rabbitm...@googlegroups.com
RabbitMQ alarms at ~1.5 GB. I remember one message recovery problem that could ignore alarms,
but it should no longer be an issue in 3.6.x. Check the system log for OOM killer messages: regardless of how much
RAM RabbitMQ actually uses, it can be killed by the OOM killer on Linux simply because it's the OS process
that uses more memory than any other.

There is no evidence that index rebuilding is the root cause. Please capture and post `rabbitmqctl status` after you
hit the alarm.

Also, this does not happen immediately after start, so you should have enough time to make the queues lazy:
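
A minimal sketch using a policy (assuming the default vhost; the policy name and the catch-all pattern are illustrative, adjust them to your setup):

    # make all queues lazy via a policy (RabbitMQ 3.6.0+)
    rabbitmqctl set_policy lazy-queues "^" '{"queue-mode":"lazy"}' --apply-to queues

A policy can be applied to a running node, so existing queues don't have to be redeclared.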

To comment on whether index entries can consume a lot of RAM, we need to know your message size distribution
and how many messages are enqueued.

--
MK

Staff Software Engineer, Pivotal/RabbitMQ

Linas Valiukas

May 27, 2016, 12:07:13 PM
to rabbitm...@googlegroups.com
Michael, thank you for your reply.

Strangely, /var/log doesn't mention the process being killed (for memory or any other reason); it just went away. Also, when I tried to recreate the issue by killing RabbitMQ again, disabling swap (to hit the memory limit sooner) and restarting the server, the index was rebuilt properly and memory usage remained stable throughout the reindexing.

I have tried this with ~42 million 512-byte messages in a persistent queue (which makes it "lazy" in the sense that new messages get written to disk right away).

Is there a way to avoid the reindexing after an unclean shutdown? We are going to run RabbitMQ on a bigger machine than EC2's m3.medium, so the reindexing will be faster, but it's still annoying to wait (tens of) minutes for the server to start.


Michael Klishin

May 27, 2016, 5:31:18 PM
to rabbitm...@googlegroups.com
No. After an abnormal shutdown, RabbitMQ cannot trust the existing indices and has to perform a sequential scan.

Linas Valiukas

Jun 27, 2016, 9:49:05 PM
to rabbitmq-users
Hi Michael,

Sorry to resurrect an old thread, but we still have the same problem.

We are running RabbitMQ with about 10 queues, two of which are rather big (one contains 100M+ messages, the other around 30M+). All of the queues are made both persistent and lazy to reduce RAM usage as much as possible. vm_memory_high_watermark is set to 8 GB, which seems to be enough to run the server under our load.
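
For reference, an absolute watermark of that size would look roughly like this in the classic rabbitmq.config format (a sketch, not our exact config file; the value is in bytes):

    %% 8 GiB absolute memory watermark in rabbitmq.config
    [
      {rabbit, [
        {vm_memory_high_watermark, {absolute, 8589934592}}
      ]}
    ].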

After a recent unclean shutdown, RabbitMQ went into rebuilding indices as usual; however, 8 GB doesn't seem to be enough to rebuild the index. We observe CPU and disk activity up until the server hits the memory watermark. After hitting the limit, the server quickly drops back to normal memory usage, but the index rebuilding seems to stop: Erlang spawns ~540 threads, most of which idle in the "D" ("uninterruptible sleep") process state, and the load rises to 500+. On other attempts, the process crashes with a "Cannot allocate ... of memory (of type "old_heap")" error, asking for around 6 GB. Here's a sample crash dump:


We run RabbitMQ on a machine with 192 GB of RAM, so memory shouldn't be an issue. Also, we have tried setting vm_memory_high_watermark to 0, but RabbitMQ still crashes with OOM errors.

For reference, we're running RabbitMQ 3.6.2-1 (from packagecloud.io) and Erlang 17.5.3 (from erlang-solutions.com; 17 instead of 18, as advised in https://groups.google.com/forum/#!topic/rabbitmq-users/7K0Ac5tWUIY) on Ubuntu 12.04.

Please advise.

Regards,

Michael Klishin

Jun 28, 2016, 5:17:19 AM
to rabbitm...@googlegroups.com
RabbitMQ does not allocate memory directly. Try Erlang 18.3.4,
which is now available from the Erlang Solutions downloads page:

Linas Valiukas

Jun 28, 2016, 11:05:08 AM
to rabbitmq-users
We have tried rebuilding the index with Erlang 17.5.3, 18.3.4 and 19.0; all versions have failed. Additionally, I've tried running Erlang from both the esl- and erlang- Ubuntu packages, to no avail.

I get that RabbitMQ doesn't manage memory itself, but maybe it requests too much memory while rebuilding the index, and that's why Erlang crashes? If I were to do new long[Integer.MAX_VALUE][Integer.MAX_VALUE] in Java, the OutOfMemoryError would be my application's problem, not the VM's.

What are the other options? Should we try an alternate message store index plugin? Where can I find one?

Michael Klishin

Jun 28, 2016, 3:04:30 PM
to rabbitm...@googlegroups.com
rabbitmqctl status and rabbitmq_top can help identify what uses the most memory. The only similar
issue that I recall in recent history involved queue mirrors on a node.
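
Something along these lines (a sketch; rabbitmq_top ships with 3.6.x and, once enabled, adds per-process views to the management UI):

    # coarse memory breakdown by category
    rabbitmqctl status
    # finer-grained, per-Erlang-process view via the management UI
    rabbitmq-plugins enable rabbitmq_top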

The message store index is unlikely to be the problem; a more likely culprit is the queue index, which in recent versions embeds small messages.

Approximately how much data is on disk?

Linas Valiukas

Jun 28, 2016, 4:07:31 PM
to rabbitm...@googlegroups.com
"rabbitmqctl status" doesn't seem to be available while index is being rebuilt because server doesn't open a port the tool could connect to. erlang_crash.dump (attached in previous email) mentions that the stack dump was busy doing something index-related. Haven't tried "rabbitmq_top".

The last message before the Erlang OOM crash reads "msg_store_persistent: rebuilding indices from scratch", so I assumed this was caused by the message store index.

We store about 89 GB worth of data: roughly 130M messages across 2+ queues, with an average message size of around 1 KB.

Michael Klishin

Jun 28, 2016, 4:12:33 PM
to rabbitm...@googlegroups.com
The only way for us to investigate such issues is to obtain the data set (or something nearly identical with the message payloads scrubbed) and try to reproduce the problem.

You can run the node in the foreground with RABBITMQ_ALLOW_INPUT exported to 1, hit Enter a couple of times, then start the Observer Erlang app and look at the memory breakdown on a per-process basis.
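
Concretely, something like this (a sketch assuming rabbitmq-server is on the PATH; Observer needs a local GUI or X forwarding):

    # run the node in the foreground with an interactive Erlang shell
    RABBITMQ_ALLOW_INPUT=1 rabbitmq-server
    # once booted, press Enter a couple of times to get a "1>" prompt, then:
    #   observer:start().
    # the Processes tab can be sorted by memory to find the biggest consumers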

Linas Valiukas

Jun 28, 2016, 4:25:38 PM
to rabbitm...@googlegroups.com
Would it be an option for me to upload the data set to S3 and let you download it? There's nothing private about it: it's a bunch of Celery JSON messages with IDs, and our project is open source anyway, so there's nothing to hide.

Replicating the issue would then be rather trivial: start RabbitMQ and wait for it to crash with an OOM error.

Michael Klishin

Jun 28, 2016, 4:26:48 PM
to rabbitm...@googlegroups.com
Ah, perfect. Yes, an S3 link would work nicely. Just curious: can this be reproduced on a machine
with e.g. 16-24 GB of RAM?

Linas Valiukas

Jun 28, 2016, 6:55:16 PM
to rabbitmq-users
Thank you in advance, Michael!


Also, here's a shell script replicating our setup, setting the required environment variables and summarizing the problem: https://gist.github.com/pypt/e36057e44fb5ec3dda80e1e1eef04c43

I have reproduced the issue on an m3.xlarge EC2 instance with 15 GB of RAM, so I suspect 16-24 GB should be more than enough.

Please let me know if you need anything else from me to help you investigate.

Linas Valiukas

Jul 7, 2016, 4:07:00 PM
to rabbitmq-users
Michael, did you perhaps get a chance to try restoring this failed, unindexed queue? Any new insights?

Michael Klishin

Jul 7, 2016, 4:54:02 PM
to rabbitm...@googlegroups.com
No news.