Server rebuilds index after crash, uses up all memory, dies

Linas Valiukas

May 26, 2016, 4:15:17 PM
to rabbitmq-users
Hello!

In order to try out RabbitMQ with our workload, I fired up an EC2 instance and added ~42 million messages to one of the queues.

All the messages were added to the persistent queue just fine, but then I kill -9'ed the RabbitMQ instance to test whether it would start up again after being shut down abruptly. I'm aware that the proper way to stop the server is via "rabbitmqctl stop", but I wanted to see what would happen in the event of a system crash.

The problem is that upon startup, the server tries to rebuild the index ("rebuilding indices from scratch"), apparently uses up all the memory, and dies silently:

=INFO REPORT==== 26-May-2016::14:08:20 ===
Starting RabbitMQ 3.6.2 on Erlang 18.3
Copyright (C) 2007-2016 Pivotal Software, Inc.
Licensed under the MPL.  See http://www.rabbitmq.com/

=INFO REPORT==== 26-May-2016::14:08:20 ===
node           : mediacloud@localhost
home dir       : /home/ubuntu
config file(s) : /mediacloud/data/rabbitmq/rabbitmq.config
cookie hash    : 6RiAbVAbV1rGeqQ/Q2/AJQ==
log            : /mediacloud/data/rabbitmq/logs/media...@localhost.log
sasl log       : /mediacloud/data/rabbitmq/logs/media...@localhost-sasl.log
database dir   : /mediacloud/data/rabbitmq/mnesia/mediacloud@localhost

=INFO REPORT==== 26-May-2016::14:08:23 ===
Memory limit set to 1506MB of 3766MB total.

=INFO REPORT==== 26-May-2016::14:08:23 ===
Disk free limit set to 50MB

=INFO REPORT==== 26-May-2016::14:08:23 ===
Limiting to approx 65436 file handles (58890 sockets)

=INFO REPORT==== 26-May-2016::14:08:23 ===
FHC read buffering:  OFF
FHC write buffering: ON

=INFO REPORT==== 26-May-2016::14:08:23 ===
Priority queues enabled, real BQ is rabbit_variable_queue

=INFO REPORT==== 26-May-2016::14:08:23 ===
Management plugin: using rates mode 'basic'

=INFO REPORT==== 26-May-2016::14:08:23 ===
msg_store_transient: using rabbit_msg_store_ets_index to provide index

=INFO REPORT==== 26-May-2016::14:08:23 ===
msg_store_persistent: using rabbit_msg_store_ets_index to provide index

=WARNING REPORT==== 26-May-2016::14:08:23 ===
msg_store_persistent: rebuilding indices from scratch

=INFO REPORT==== 26-May-2016::14:29:05 ===
vm_memory_high_watermark set. Memory used:1868367216 allowed:1579615846

=WARNING REPORT==== 26-May-2016::14:29:05 ===
memory resource limit alarm set on node mediacloud@localhost.

**********************************************************
*** Publishers will be blocked until this alarm clears ***
**********************************************************

Why does RabbitMQ use up all of my memory (~4 GB RAM + 3 GB swap)? Is this the "uses a small amount of memory" case mentioned in https://www.rabbitmq.com/persistence-conf.html? How do I start the RabbitMQ server in this case?

Regards,

Michael Klishin

May 26, 2016, 5:30:43 PM
to rabbitm...@googlegroups.com
RabbitMQ alarms at ~1.5 GB. I remember one message recovery problem that could ignore alarms,
but it should no longer be an issue in 3.6.x. Check the system log for OOM killer messages: regardless of how much
RAM RabbitMQ actually uses, it can be killed by the OOM killer on Linux simply because it's the OS process
that uses more memory than any other.

There is no evidence that index rebuilding is the root cause. Please capture and post `rabbitmqctl status` after you
hit the alarm.

Also, this does not happen immediately after start, so you should have enough time to make the queues lazy:
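
A minimal sketch using a policy (assuming the default vhost; the policy name and the catch-all pattern are illustrative, adjust them to your setup):

    # make all queues lazy via a policy (RabbitMQ 3.6.0+)
    rabbitmqctl set_policy lazy-queues "^" '{"queue-mode":"lazy"}' --apply-to queues

A policy can be applied to a running node, so existing queues don't have to be redeclared.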

To comment on whether index entries can consume a lot of RAM, we need to know your message size distribution
and how many messages are enqueued.

--
MK

Staff Software Engineer, Pivotal/RabbitMQ

Linas Valiukas

May 27, 2016, 12:07:13 PM
to rabbitm...@googlegroups.com
Michael, thank you for your reply.

Strangely, /var/log doesn't mention the process being killed (for memory or any other reason); it just went away. Also, when I tried to recreate the issue by killing RabbitMQ again, disabling swap (to hit the memory limit sooner) and restarting the server, the index was rebuilt properly and memory usage remained stable throughout the reindexing.

I have tried this with ~42 million 512-byte messages in a persistent queue (which makes it "lazy" in the sense that new messages get written to disk right away).

Is there a way to avoid the reindexing after an unclean shutdown? We are going to run RabbitMQ on a bigger machine than EC2's m3.medium, so the reindexing will be faster, but it's still annoying to wait (tens of) minutes for the server to start.


Michael Klishin

May 27, 2016, 5:31:18 PM
to rabbitm...@googlegroups.com
No. After an abnormal shutdown, RabbitMQ cannot trust the existing indices and has to perform a sequential scan.

Linas Valiukas

Jun 27, 2016, 9:49:05 PM
to rabbitmq-users
Hi Michael,

Sorry to resurrect an old thread, but we still have the same problem.

We are running RabbitMQ with about 10 queues, two of which are rather big (one contains 100M+ messages, the other around 30M+). All of the queues are made both persistent and lazy to reduce RAM usage as much as possible. vm_memory_high_watermark is set to 8 GB, which seems to be enough to run the server under our load.
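
For reference, an absolute watermark of that size would look roughly like this in the classic rabbitmq.config format (a sketch, not our exact config file; the value is in bytes):

    %% 8 GiB absolute memory watermark in rabbitmq.config
    [
      {rabbit, [
        {vm_memory_high_watermark, {absolute, 8589934592}}
      ]}
    ].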

After a recent unclean shutdown, RabbitMQ went into rebuilding indices as usual; however, 8 GB doesn't seem to be enough to rebuild the index. We observe CPU and disk activity up until the server hits the memory watermark. After hitting the limit, the server quickly drops back to normal memory usage, but the index rebuilding seems to stop: Erlang spawns ~540 threads, most of which idle in the "D" ("uninterruptible sleep") process state, and the load rises to 500+. On other attempts, the process crashes with a "Cannot allocate ... of memory (of type "old_heap")" error, asking for around 6 GB. Here's a sample crash dump:


We run RabbitMQ on a machine with 192 GB of RAM, so memory shouldn't be an issue. Also, we have tried setting vm_memory_high_watermark to 0, but RabbitMQ still crashes with OOM errors.

For reference, we're running RabbitMQ 3.6.2-1 (from packagecloud.io) and Erlang 17.5.3 (from erlang-solutions.com; 17 instead of 18, as advised in https://groups.google.com/forum/#!topic/rabbitmq-users/7K0Ac5tWUIY) on Ubuntu 12.04.

Please advise.

Regards,

Michael Klishin

Jun 28, 2016, 5:17:19 AM
to rabbitm...@googlegroups.com
RabbitMQ does not allocate memory directly. Try Erlang 18.3.4,
which is now available from the Erlang Solutions downloads page:

Linas Valiukas

Jun 28, 2016, 11:05:08 AM
to rabbitmq-users
We have tried rebuilding the index with Erlang 17.5.3, 18.3.4 and 19.0; all versions have failed. Additionally, I've tried running Erlang from both the esl- and erlang- Ubuntu packages, to no avail.

I get that RabbitMQ doesn't manage memory itself, but maybe it requests too much memory while rebuilding the index, and that's why Erlang crashes? If I were to do new long[Integer.MAX_VALUE][Integer.MAX_VALUE] in Java, the OutOfMemoryError would be my application's problem, not the VM's.

What are the other options? Should we try an alternate message store index plugin? Where can I find one?

Michael Klishin

Jun 28, 2016, 3:04:30 PM
to rabbitm...@googlegroups.com
rabbitmqctl status and rabbitmq_top can help identify what uses the most memory. The only similar
issue that I recall in recent history involved queue mirrors on a node.
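
Something along these lines (a sketch; rabbitmq_top ships with 3.6.x and, once enabled, adds per-process views to the management UI):

    # coarse memory breakdown by category
    rabbitmqctl status
    # finer-grained, per-Erlang-process view via the management UI
    rabbitmq-plugins enable rabbitmq_top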

The message store index is unlikely to be the problem; a more likely culprit is the queue index, which in recent versions embeds small messages.

Approximately how much data is on disk?

Linas Valiukas

Jun 28, 2016, 4:07:31 PM
to rabbitm...@googlegroups.com
"rabbitmqctl status" doesn't seem to be available while index is being rebuilt because server doesn't open a port the tool could connect to. erlang_crash.dump (attached in previous email) mentions that the stack dump was busy doing something index-related. Haven't tried "rabbitmq_top".

The last message before the Erlang OOM crash reads "msg_store_persistent: rebuilding indices from scratch", so I assumed this was caused by the message store index.

We store about 89 GB worth of data: roughly 130M messages across 2+ queues, with an average message size of around 1 KB.

Michael Klishin

Jun 28, 2016, 4:12:33 PM
to rabbitm...@googlegroups.com
The only way for us to investigate such issues is to obtain the data set (or something nearly identical with the message payloads scrubbed) and try to reproduce the problem.

You can run the node in the foreground with RABBITMQ_ALLOW_INPUT exported to 1, hit Enter a couple of times, then start the Observer Erlang app and look at the memory breakdown on a per-process basis.
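
Concretely, something like this (a sketch assuming rabbitmq-server is on the PATH; Observer needs a local GUI or X forwarding):

    # run the node in the foreground with an interactive Erlang shell
    RABBITMQ_ALLOW_INPUT=1 rabbitmq-server
    # once booted, press Enter a couple of times to get a "1>" prompt, then:
    #   observer:start().
    # the Processes tab can be sorted by memory to find the biggest consumers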

Linas Valiukas

Jun 28, 2016, 4:25:38 PM
to rabbitm...@googlegroups.com
Would it be an option for me to upload the data set to S3 and let you download it? There's nothing private about it: it's a bunch of Celery JSON messages with IDs, and our project is open source anyway, so there's nothing to hide.

Replicating the issue would then be rather trivial: start RabbitMQ and wait for it to crash with an OOM error.

Michael Klishin

Jun 28, 2016, 4:26:48 PM
to rabbitm...@googlegroups.com
Ah, perfect. Yes, an S3 link would work nicely. Just curious: can this be reproduced on a machine
with e.g. 16-24 GB of RAM?

Linas Valiukas

Jun 28, 2016, 6:55:16 PM
to rabbitmq-users
Thank you in advance, Michael!


Also, here's a shell script replicating our setup, setting the required environment variables and summarizing the problem: https://gist.github.com/pypt/e36057e44fb5ec3dda80e1e1eef04c43

I have reproduced the issue on an m3.xlarge EC2 instance with 15 GB of RAM, so I suspect 16-24 GB should be more than enough.

Please let me know if you need anything else from me to help you investigate.

Linas Valiukas

Jul 7, 2016, 4:07:00 PM
to rabbitmq-users
Michael, did you perhaps get a chance to try restoring this failed, unindexed queue? Any new insights?

Michael Klishin

Jul 7, 2016, 4:54:02 PM
to rabbitm...@googlegroups.com
No news.