RabbitMQ crashes when a consumer dies and the queue has a lot of messages.


Sergey Chernov

Mar 29, 2017, 7:36:00 AM
to rabbitmq-users
The problem:


There is a publisher which produces a lot of messages.
There is a consumer which consumes messages.
When there are more messages than the consumer can consume, rabbitmq stores them in mnesia.
If the consumer dies, rabbitmq gets a memory high watermark alert and crashes.
If the consumer continues working, rabbitmq doesn't crash.


Once again, rabbitmq crashes when the consumer dies while the queue has a lot of messages.
Rabbitmq doesn't crash if there is no consumer at all.

Quite strange, isn't it?



I have a direct exchange and a queue.
I have a simple client (https://gist.github.com/chernser/8f79a0e103354edad87d71e24b5c52e3) which:

 * Connects to the rmq servers
 * Declares the queue and exchange
 * Calls basic consume, but does not consume the messages
 * Publishes 10k messages
 * Closes the connection

I have a virtual machine with 1 GB of RAM.

When the script ends and closes the connection, rabbitmq crashes (it is just terminated) with the following log:

=INFO REPORT==== 29-Mar-2017::11:00:06 ===
vm_memory_high_watermark set. Memory used:881321872 allowed:787377356
=WARNING REPORT==== 29-Mar-2017::11:00:06 ===
memory resource limit alarm set on node rabbit@inf.
*** Publishers will be blocked until this alarm clears ***

After that I need to restart rabbitmq.

My test client can produce 30k messages and more, but rabbitmq keeps running until the connection is closed.




Michael Klishin

Mar 29, 2017, 8:29:06 AM
to rabbitm...@googlegroups.com
Hi Sergey,

Your persistence in posting this is admirable. I wish you had the same persistence in understanding
the problem, even when we are explaining it in a reasonable amount of detail [1].

Let's reiterate:

 * Your node has 1 GB of RAM available. Let's assume that 90% of that amount is available to RabbitMQ, which is not the case in practice and definitely not what default configuration assumes.
 * Your publisher publishes 10K *transient* messages, each 100 kB in size, therefore instructing RabbitMQ to keep them in RAM as much as possible
 * Your consumer uses manual acknowledgements without channel QoS (no limit on the number of outstanding deliveries) and then goes away
 * Your node hits an alarm and blocks publishers

Given all this, RabbitMQ is asked to do the following:

 * Keep ~ 1 GB of messages in RAM with 90% of 1 GB available
 * Deliver them as quickly as possible to the consumer which doesn't acknowledge anything and has no delivery limit
 * Keep all the housekeeping items related to acknowledgements in RAM (because they are a part of channel state)
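
A rough back-of-the-envelope check of the numbers above may help. This is only a sketch: it assumes 100 kB means 100,000 bytes and 1 GB means 10^9 bytes, it ignores per-message overhead and acknowledgement bookkeeping, and the variable names are illustrative rather than anything RabbitMQ defines.

```python
# Back-of-the-envelope sizing for the scenario described above.
# Assumptions (not taken from RabbitMQ): 1 kB = 1,000 bytes, 1 GB = 10**9 bytes,
# and no per-message or per-acknowledgement overhead is counted.

MESSAGE_COUNT = 10_000          # messages published by the test client
MESSAGE_SIZE_BYTES = 100_000    # roughly 100 kB per message
NODE_RAM_BYTES = 10**9          # 1 GB machine

payload_bytes = MESSAGE_COUNT * MESSAGE_SIZE_BYTES
# The generous "90% of RAM is available to RabbitMQ" assumption, in integer math.
available_bytes = NODE_RAM_BYTES * 9 // 10
shortfall_bytes = payload_bytes - available_bytes

print(f"payload:   {payload_bytes / 10**9:.2f} GB")
print(f"available: {available_bytes / 10**9:.2f} GB")
print(f"shortfall: {shortfall_bytes / 10**9:.2f} GB")
```

Even under the optimistic 90% assumption the message bodies alone do not fit, before any channel state is counted.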

Now, how does persistence work in layman's terms (more can be found in [2][3]):

 * If a message is routed to a lazy queue, it is sent to disk as quickly as possible
 * If a message is routed to a durable queue and the message is persistent, it is sent to disk as quickly as possible
 * In all other cases, it is kept in RAM until the node is under memory pressure
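
The three rules above can be restated as a small decision function. This is a toy restatement of the list, not RabbitMQ's actual code; the names `Placement` and `placement` are made up for illustration.

```python
from enum import Enum


class Placement(Enum):
    DISK_ASAP = "sent to disk as quickly as possible"
    RAM_UNTIL_PRESSURE = "kept in RAM until the node is under memory pressure"


def placement(lazy_queue: bool, durable_queue: bool, persistent_message: bool) -> Placement:
    """Toy restatement of the three rules listed above (hypothetical helper)."""
    if lazy_queue:
        return Placement.DISK_ASAP
    if durable_queue and persistent_message:
        return Placement.DISK_ASAP
    return Placement.RAM_UNTIL_PRESSURE


# The test client in this thread publishes transient messages, so (assuming the
# queue is not lazy) the third rule applies whether or not the queue is durable.
print(placement(lazy_queue=False, durable_queue=True, persistent_message=False).value)
```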

When is a node under memory pressure?

 * When its estimated RAM consumption is above the configured watermark, which by default is 0.4 of the available RAM

What does a node that's under memory pressure do, simplified?

 * Throttles publishers by stopping socket reads (TCP buffers then fill up on the publisher and the OS accepts no more writes)
 * It tells queues to move all messages (including transient) to disk to the point when it's not under an alarm
 * That point is known as high VM memory watermark paging ratio and is 0.5 of the alarm limit by default
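
To put numbers on the two defaults quoted above (watermark 0.4, paging ratio 0.5), here is a small worked sketch. The helper names are hypothetical, and the simple `>` comparison is an assumption about how the alarm check might look, not RabbitMQ's implementation. The last lines reuse the figures from the log excerpt at the top of the thread.

```python
def memory_thresholds(total_ram_bytes, watermark=0.4, paging_ratio=0.5):
    """Return (alarm_limit, paging_threshold) in bytes for the given ratios."""
    alarm_limit = round(total_ram_bytes * watermark)
    paging_threshold = round(alarm_limit * paging_ratio)
    return alarm_limit, paging_threshold


def alarm_is_set(used_bytes, alarm_limit_bytes):
    """Simplified check: the alarm is on while usage exceeds the limit."""
    return used_bytes > alarm_limit_bytes


# A node that sees exactly 1 GB (10**9 bytes) of RAM, with default ratios.
alarm_limit, paging_threshold = memory_thresholds(10**9)
print(alarm_limit, paging_threshold)

# Figures copied from the log in the original post.
LOG_USED = 881_321_872
LOG_ALLOWED = 787_377_356
print(alarm_is_set(LOG_USED, LOG_ALLOWED))
print(round(LOG_ALLOWED * 0.5))  # paging threshold implied by the logged limit
```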

Cool, so how does having consumers affect this?

When a node has no consumers, it throttles publishers and tries to move more and more messages to disk
as needed. When a node *does* have consumers it also has to balance the amount of messages it keeps in RAM
with what has to be delivered, or kept around for acknowledgement purposes.
This is where things get really interesting. The exact algorithm for balancing what goes into
RAM vs. on disk is fairly involved. The best source of that information (other than the code) is in [3].
While slow consumers can and will throttle RabbitMQ to some extent, RabbitMQ prioritises consumers
over publishers because it is much more common to see the case where publishers outpace consumers
than a case with no consumers at all. Therefore messages can be loaded eagerly into RAM for delivery
when on the other hand a flush to disk would be nice. 
On top of that, consumer or any other client disconnects are sometimes not detected immediately [4], which
means RabbitMQ will assume it can read messages for delivery for a while after the consumer is gone.

All of this is a tough balancing act and if this part was easy, I guess messaging technologies would be
a largely solved problem by now, but they aren't [5][7], and this and similar problems around the age-old
publisher/consumer problem resurface in many systems, including those that have little to do with messaging
per se.

You claim that the behaviour in your particular case is a critical bug.
I claim that this is a lack of understanding of certain settings in your script
and AMQP 0-9-1 features, which may or may not be unfortunate but have been set in stone about a decade ago and
can contribute to resource consumption. It would be great if RabbitMQ was more resilient
in this particular and other scenarios and our team is working quite hard to make that happen
but as I've mentioned above, it's a non-trivial balancing act.

Instead of hammering the point that this is a critical bug and the world is on fire, you could have considered listening
to our feedback and tweaking your system such that:

 * It uses durable queues and publishes messages as persistent
 * Your consumer has a reasonable QoS setting [6] (or uses automatic acknowledgements if "fire and forget" delivery is acceptable)
 * Your queue has a reasonable message TTL
 * The heartbeat interval is reduced to 6-8 seconds (lower values often lead to false positives), so that a consumer that goes away by losing its TCP connection is detected sooner.
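
As a rough illustration of the settings listed above, here is a sketch using the Python `pika` client (1.x API assumed). The original client's language is not stated in this thread, so Python is an assumption, as are the concrete values (prefetch of 50, TTL of 60 seconds, heartbeat of 8 seconds) and the helper names. Publishing goes through the default exchange for brevity, whereas the test setup in this thread uses a direct exchange.

```python
# Hypothetical values gathered in one place; tune them for your workload.
RECOMMENDED = {
    "queue_durable": True,      # durable queue
    "delivery_mode": 2,         # AMQP 0-9-1: delivery mode 2 marks a message persistent
    "prefetch_count": 50,       # channel QoS: cap on unacknowledged deliveries
    "message_ttl_ms": 60_000,   # per-queue message TTL, set via "x-message-ttl"
    "heartbeat_seconds": 8,     # within the 6-8 second range suggested above
}


def queue_arguments(settings):
    """Build the optional queue arguments (only the message TTL is set here)."""
    return {"x-message-ttl": settings["message_ttl_ms"]}


def open_channel(host, queue, settings=RECOMMENDED):
    """Open a connection and a channel with the settings applied (assumes pika 1.x)."""
    import pika  # imported lazily so the rest of the sketch runs without pika installed

    params = pika.ConnectionParameters(host=host, heartbeat=settings["heartbeat_seconds"])
    connection = pika.BlockingConnection(params)
    channel = connection.channel()
    channel.queue_declare(queue=queue, durable=settings["queue_durable"],
                          arguments=queue_arguments(settings))
    channel.basic_qos(prefetch_count=settings["prefetch_count"])
    return connection, channel


def publish_persistent(channel, queue, body, settings=RECOMMENDED):
    """Publish via the default exchange with the persistent delivery mode."""
    import pika

    properties = pika.BasicProperties(delivery_mode=settings["delivery_mode"])
    channel.basic_publish(exchange="", routing_key=queue, body=body, properties=properties)
```

Note that redeclaring an existing queue with different durability or arguments fails with a precondition error, so a queue created by an earlier test run would need to be deleted first.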

Instead you decided to ignore the above feedback,
make claims that this is a bug around how "messages are stored in Mnesia" (they aren't)
and that the root cause is clear.

I sure hope it is clear to you now. We'll do our best to make RabbitMQ's internal flow control be more defensive in this
case, although with protocol features such as "unbounded outstanding number of messages in flight" any defensive change
in that area would be a protocol spec violation plus potentially pretty surprising to the user, so it's not such a
no-brainer as it may seem at first.

Feel free to switch to a different messaging technology in the meantime.



--
You received this message because you are subscribed to the Google Groups "rabbitmq-users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to rabbitmq-users+unsubscribe@googlegroups.com.
To post to this group, send email to rabbitmq-users@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.



--
MK

Staff Software Engineer, Pivotal/RabbitMQ

Michael Klishin

Mar 29, 2017, 8:35:06 AM
to rabbitm...@googlegroups.com
…I forgot to exclude the effects of queue process GC on peak memory consumption but it's not that
relevant.

This is a great example of how deliveries to consumers with manual acknowledgements can
lead to the decades-old "unbounded buffer problem", which certain messaging protocol features
(or their combination) make more likely to manifest.
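
A toy model of that unbounded buffer, assuming a consumer that never acknowledges anything: without a prefetch limit every published message ends up as an outstanding delivery, while a limit caps the number. The function below is purely illustrative and does not talk to a broker.

```python
from typing import Optional


def peak_unacked(published: int, prefetch: Optional[int]) -> int:
    """Peak number of deliveries awaiting acknowledgement for a consumer that never acks.

    prefetch=None models a channel with no QoS limit.
    """
    unacked = 0
    for _ in range(published):
        if prefetch is not None and unacked >= prefetch:
            break  # nothing more is delivered until something is acknowledged
        unacked += 1
    return unacked


print(peak_unacked(10_000, prefetch=None))  # no limit: every message is in flight
print(peak_unacked(10_000, prefetch=50))    # hypothetical limit of 50
```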

But it's always a data service's fault, never the apps, amirite?



Michael Klishin

Mar 29, 2017, 8:35:46 AM
to rabbitm...@googlegroups.com
On Wed, Mar 29, 2017 at 3:35 PM, Michael Klishin <mkli...@pivotal.io> wrote:
…I forgot to exclude the effects of queue process GC on peak memory consumption but it's not that
relevant.

This should read "I forgot to include…"

Sergey Chernov

Mar 29, 2017, 8:45:39 AM
to rabbitm...@googlegroups.com
Thanks, Michael, for the explanation!

Ok. Let's discuss the problem as I see it.

I understand that I need to use a limit, and I will.
I understand why the server crashes.

But I have a question regarding strange behavior.

Please consider the following cases:

1. If I have no consumer on the queue and run the test, the server keeps running after the publisher has disconnected.
2. If I have one idle consumer (which may be a hung machine, for example) and the same crazy publisher, rabbitmq crashes on consumer disconnect.
3. If I have two publishers producing twice the load and one idle consumer, rabbitmq does not crash but reports the high watermark limit.

So, for me it is quite strange that rmq keeps running in the worse cases and crashes in another case.
If it is server-internal, known behavior which requires effort to fix, that is ok.

If there is some kind of GC on consumer disconnect which leads to the crash, it would be nice to fix it.

One more observation: with the same amount of messages and an idle consumer, but a publisher that doesn't create/close a channel for each publish operation, the server keeps running.

Thanks in advance!

--
-- Sergey Chernov