Memory High Water mark hit, empty queues


Noah

Dec 1, 2014, 12:27:11 PM
to rabbitm...@googlegroups.com
I have an 8 node RMQ cluster, that handles around 10k messages a second.  During periods of high load, some HA queues will fill up, a master or one of their slaves will hit their high water mark, and the cluster will stop accepting publishes until the queues drain, as expected.

However, after the queues drain, the memory usage on the nodes remains high and the cluster accepts new publishes only sporadically. Needless to say, a period of high load is about the worst time for this sort of failure mode to happen.

I have tried kicking off garbage collection manually via "sudo /usr/sbin/rabbitmqctl eval '[garbage_collect(P) || P <- processes()].'" to no effect.

On the affected nodes the memory details show nearly all the memory being held in Queues, Slaves and Binaries.  Again - there are no messages in the queues. 

We are running RMQ 3.4.1 on erlang R15B01.   Does anyone have any suggestions on how to diagnose or resolve this issue?


-N

Michael Klishin

Dec 1, 2014, 12:53:51 PM
to rabbitm...@googlegroups.com, Noah
On 1 December 2014 at 20:27:13, Noah (noah....@gmail.com) wrote:
> On the affected nodes the memory details show nearly all the
> memory being held in Queues, Slaves and Binaries. Again - there
> are no messages in the queues.

Can you please post `rabbitmqctl report` output, at least the memory breakdown part?
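For just the memory breakdown, something like the following should also work on 3.4.x (a sketch; it assumes `rabbit_vm:memory/0`, the internal call behind the memory section of `rabbitmqctl status` and `report`, is exported in your version):

```erlang
%% Run via: rabbitmqctl eval 'rabbit_vm:memory().'
%% Returns a per-category proplist, roughly:
%%   [{total, N}, {connection_procs, N}, {queue_procs, N},
%%    {queue_slave_procs, N}, {binary, N}, ...]
rabbit_vm:memory().
```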
--
MK

Staff Software Engineer, Pivotal/RabbitMQ

Noah

Dec 1, 2014, 1:15:42 PM
to rabbitm...@googlegroups.com, noah....@gmail.com
Hi, I have attached the reports for the affected nodes.  

I have since stopped all publishers and the queues are completely empty (and idle), yet the memory usage persists.


-N
rmq8.log
rmq14.log

Noah

Dec 1, 2014, 1:30:54 PM
to rabbitm...@googlegroups.com, noah....@gmail.com
Another thing I just noticed is that pub confirms are very slowly trickling in long after the queues have been drained.  

My publishers wait to get all their confirmations back before they shut down and I noticed several still waiting and the rmq console showing a few confirmations per second with all other stats at zero.


-N

Michael Klishin

Dec 1, 2014, 1:38:03 PM
to rabbitm...@googlegroups.com, Noah
On 1 December 2014 at 21:30:56, Noah (noah....@gmail.com) wrote:
> Another thing I just noticed is that pub confirms are very slowly
> trickling in long after the queues have been drained.
>
> My publishers wait to get all their confirmations back before
> they shut down and I noticed several still waiting and the rmq
> console showing a few confirmations per second with all other
> stats at zero.

What does

rabbitmqctl eval '[erlang:process_info(element(7, Q), message_queue_len) || Q <- rabbit_amqqueue:list()].'

output?

Noah

Dec 1, 2014, 1:43:47 PM
to rabbitm...@googlegroups.com, noah....@gmail.com
typo?

# rabbitmqctl eval '[erlang:process_info(element(7, Q), message_queue_len) || Q <- rabbit_amqqueue:list()].'
Error: {badarg,[{erlang,process_info,[<6196.14039.11>,message_queue_len],[]},
                {erl_eval,do_apply,6,[{file,"erl_eval.erl"},{line,572}]},
                {erl_eval,eval_lc1,6,[{file,"erl_eval.erl"},{line,598}]},
                {erl_eval,eval_generate,7,[{file,"erl_eval.erl"},{line,627}]},
                {erl_eval,eval_lc,6,[{file,"erl_eval.erl"},{line,584}]},
                {rpc,'-handle_call_call/6-fun-0-',5,
                     [{file,"rpc.erl"},{line,203}]}]}

Michael Klishin

Dec 1, 2014, 1:50:36 PM
to rabbitm...@googlegroups.com, Noah
On 1 December 2014 at 21:43:49, Noah (noah....@gmail.com) wrote:
> typo?
>
> # rabbitmqctl eval '[erlang:process_info(element(7, Q),
> message_queue_len) || Q <- rabbit_amqqueue:list()].'
> Error: {badarg,[{erlang,process_info,[<6196.14039.11>,message_queue_len],[]},

No, more likely your Erlang version does not support this option or even erlang:process_info/2.

Try

rabbitmqctl eval '[proplists:lookup(message_queue_len, erlang:process_info(element(7, Q))) || Q <- rabbit_amqqueue:list()].'

My guess is that the root cause may be similar to 25788 which was fixed in 3.4.2:
http://www.rabbitmq.com/release-notes/README-3.4.2.txt 

Noah

Dec 1, 2014, 1:53:51 PM
to rabbitm...@googlegroups.com, noah....@gmail.com
Still no joy - we are on erlang R15B01 :(  We do have the upgrade to 3.4.2 slated for later today, however.

# rabbitmqctl eval '[proplists:lookup(message_queue_len, erlang:process_info(element(7, Q))) || Q <- rabbit_amqqueue:list()].' 
Error: {badarg,[{erlang,process_info,[<6196.14039.11>],[]},
                {erl_eval,do_apply,6,[{file,"erl_eval.erl"},{line,572}]},
                {erl_eval,expr_list,6,[{file,"erl_eval.erl"},{line,741}]},
                {erl_eval,expr,5,[{file,"erl_eval.erl"},{line,338}]},
                {erl_eval,eval_lc1,6,[{file,"erl_eval.erl"},{line,598}]},
                {erl_eval,eval_generate,7,[{file,"erl_eval.erl"},{line,627}]},
                {erl_eval,eval_lc,6,[{file,"erl_eval.erl"},{line,584}]},
                {rpc,'-handle_call_call/6-fun-0-',5,
                     [{file,"rpc.erl"},{line,203}]}]}


Michael Klishin

Dec 1, 2014, 2:14:23 PM
to rabbitm...@googlegroups.com, Noah
On 1 December 2014 at 21:53:53, Noah (noah....@gmail.com) wrote:
> Still no joy - we are on erlang R15B01 :( We do have the upgrade
> to 3.4.2 slated for later today, however.

OK, the issue is that some queue processes are remote and process_info won't work for them.
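A variant restricted to queue processes hosted on the local node (a sketch following the same `element(7, Q)` convention used above) sidesteps the remote-pid problem:

```erlang
%% Run via: rabbitmqctl eval '...'
%% Only query pids living on the node we are evaluating on;
%% erlang:process_info/2 raises badarg for remote pids.
[erlang:process_info(element(7, Q), message_queue_len)
 || Q <- rabbit_amqqueue:list(),
    node(element(7, Q)) =:= node()].
```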

If you don't have any publishers online on this node and it has no messages, feel free to do this:

[sudo] kill -SIGUSR1 [Erlang VM pid]

This will terminate the node and produce a crash dump (somewhere under the RabbitMQ user's $HOME).

Then send me the dump archived off-list.

Thank you.

Noah

Dec 1, 2014, 2:31:44 PM
to rabbitm...@googlegroups.com, noah....@gmail.com
Hi -

We are proceeding with the 3.4.2 upgrade so have taken the cluster down for maintenance.  If the issue happens again, I will get you a crash dump.

As an aside -

How much of an improvement in stability and performance should we expect moving from Erlang R15B01 to the latest (OTP 17.3) - and is there any reason not to adopt the latest Erlang release?


Thanks for your help!

-N

Michael Klishin

Dec 1, 2014, 2:37:21 PM
to rabbitm...@googlegroups.com, Noah
On 1 December 2014 at 22:31:46, Noah (noah....@gmail.com) wrote:
> How much of an improvement in stability and performance should
> we expect moving from Erlang R15B01 to the latest (OTP 17.3) -
> and is there any reason not to adopt the latest Erlang release?

We generally see a lot fewer issues reported on 17.x (and no, it's not because very
few people run it: some of the largest users have been running 17.x for some time).

There are scheduler improvements in 17.x, making throughput on multi-core machines
more consistent. There are also quite a few TLS correctness and feature completeness
improvements since R15B01 (in fact, we require R16B03 now for TLS).

In general, I'd recommend 17.x over R15B01 based on the user feedback we get
on this list and through other channels.

Simon MacMullen

Dec 2, 2014, 5:45:30 AM
to Noah, rabbitm...@googlegroups.com
On 01/12/14 17:27, Noah wrote:
> On the affected nodes the memory details show nearly all the memory
> being held in Queues, Slaves and Binaries. Again - there are no
> messages in the queues.

(I understand you're upgrading now but in case the problem reoccurs /
for other people finding this thread.)

> We are running RMQ 3.4.1 on erlang R15B01. Does anyone have any
> suggestions on how to diagnose or resolve this issue?

You might find the rabbitmq-top plugin useful. You can get it from
http://www.rabbitmq.com/community-plugins.html and enable it for long
enough to find the top memory-using processes.

Alternatively:

# rabbitmqctl eval 'lists:sublist(lists:reverse(lists:sort([{catch process_info(Pid, memory), Pid, catch process_info(Pid)} || Pid <- processes()])), 30).'

will print out process info for the top 30 memory-using processes

Also (at the risk of being patronising) are you *sure* that all queues
are empty? Could there be messages lurking in a different vhost, for
example?

Cheers, Simon

Simon MacMullen

Dec 2, 2014, 5:48:43 AM
to Michael Klishin, rabbitm...@googlegroups.com, Noah
On 01/12/14 18:50, Michael Klishin wrote:
> On 1 December 2014 at 21:43:49, Noah (noah....@gmail.com) wrote:
>> >typo?
>> >
>> ># rabbitmqctl eval '[erlang:process_info(element(7, Q),
>> >message_queue_len) || Q <- rabbit_amqqueue:list()].'
>> >Error: {badarg,[{erlang,process_info,[<6196.14039.11>,message_queue_len],[]},
> No, more likely your Erlang version does not support this option or even erlang:process_info/2.

erlang:process_info/2 has been around since forever, and I just tried
the above one liner on R15B and it worked. I think the problem was that
Noah was unlucky and ended up calling process_info/2 on a process which
had since disappeared. I suggest we insert "catch" into such one liners
in future...
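With `catch` added, the earlier one-liner would look something like this (a sketch; a badarg for a remote or vanished pid then comes back as a term in the result list instead of aborting the whole comprehension):

```erlang
%% Run via: rabbitmqctl eval '...'
%% catch converts any process_info failure into a returned term,
%% so one bad pid no longer kills the entire list comprehension.
[catch erlang:process_info(element(7, Q), message_queue_len)
 || Q <- rabbit_amqqueue:list()].
```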

> My guess is that the root cause may be similar to 25788 which was fixed in 3.4.2:
> http://www.rabbitmq.com/release-notes/README-3.4.2.txt

Hmm, I am not convinced; 25788 only applies while queue synchronisation is taking place (and the big problem there is the slowdown; the memory use increases too, but with a very small constant factor).

Cheers, Simon

Noah Magram

Dec 2, 2014, 2:20:17 PM
to Simon MacMullen, rabbitm...@googlegroups.com
Hi Simon,

> Also (at the risk of being patronising) are you *sure* that all queues are empty? Could there be messages lurking in a different vhost, for example?

No worries :)  Quite sure, as we only had a single vhost configured on this cluster.  

I read somewhere on this list that queues will not GC until they are in an idle state.  If we were in the weird mode where publisher confirmations were trickling back (despite no publishes to queues happening, and the queues themselves being empty), would this prevent the queues from freeing memory?

> # rabbitmqctl eval 'lists:sublist(lists:reverse(lists:sort([{catch process_info(Pid, memory), Pid, catch process_info(Pid)} || Pid <- processes()])), 30).'

Great - thanks!  The cluster is in a "healthy" state pushing around 10k/s, and the largest process appears to be the call to eval; am I interpreting this correctly?  Memory usage is usually pretty low when the queues are nearly empty.

{{memory,6657192},
  <5367.2949.399>,
  [{current_function,{erl_eval,do_apply,6}},
   {initial_call,{erlang,apply,2}},
   {status,running},
   {message_queue_len,0},
   {messages,[]},
   {links,[]},
   {dictionary,[]},
   {trap_exit,false},
   {error_handler,error_handler},
   {priority,normal},
   {group_leader,<0.22.0>},
   {total_heap_size,832040},
   {heap_size,317811},
   {stack_size,80},
   {reductions,181631},
   {garbage_collection,
       [{min_bin_vheap_size,46368},
        {min_heap_size,233},
        {fullsweep_after,65535},
        {minor_gcs,1}]},
   {suspending,[]}]}

Also I notice that most of the other processes in the top 10 are in "process_next_msg" functions, and the dictionary (property? attribute?) seems to hold counters (credit_to) that appear to be for the internal rate limiting / flow control.  Often we are trying to determine why a connection/channel/queue is in "flow"; could this information help diagnose that?

And finally, is there an easy way to get these erlang structures into valid JSON?  They seem pretty close already.


Thanks,

-N

Kyle O'Donnell

Dec 2, 2014, 2:31:27 PM
to Noah Magram, Simon MacMullen, rabbitm...@googlegroups.com, t...@stoo.org
Not sure if this applies directly, but we (Tim Stewart, big thanks!) found that an upgrade from Erlang R14 to R15+ required us to set the following in the rabbit config:

{rabbit, [
    {hipe_compile, true},
    {tcp_listen_options, [binary,
                          {sndbuf, 4096},
                          {recbuf, 4096},
                          {buffer, 4096},
                          {packet, raw},
                          {reuseaddr, true},
                          {backlog, 128},
                          {nodelay, true},
                          {exit_on_close, false}]}
]}

We had increased the linux kernel tcp default min/max buffers, which caused each connection to rabbit to use that minimum size buffer. When our app started we'd hit the watermark immediately.

Noah Magram

Dec 2, 2014, 2:38:26 PM
to Kyle O'Donnell, Simon MacMullen, rabbitm...@googlegroups.com, t...@stoo.org
Thanks Kyle!  I don't think this applies to our case as we have more channels and fewer (fast) connections, but I can see how that would help memory usage if you had a case where you had many tcp connections with slow clients and had sndbuf bloat.


-N

t...@stoo.org

Dec 2, 2014, 2:46:10 PM
to rabbitm...@googlegroups.com, noah....@gmail.com, si...@rabbitmq.com, t...@stoo.org, ky...@0b10.mx
Hello,

I had intended to reply to the following threads, though I now realize they are in the archives:

  1. http://lists.rabbitmq.com/pipermail/rabbitmq-discuss/2013-January/024737.html
  2. http://lists.rabbitmq.com/pipermail/rabbitmq-discuss/2013-April/026665.html

Specifically, the solution mentioned in Kyle's previous email and in thread #1 is what solved our memory issue described in thread #2.

-TimS

Simon MacMullen

Dec 3, 2014, 6:03:04 AM
to Noah Magram, rabbitm...@googlegroups.com
On 02/12/14 19:20, Noah Magram wrote:
> I read somewhere on this list that queues will not GC until they are in
> an idle state.

That's not really true. Queues GC after they have done a certain amount
of work (counted as "reductions" by Erlang and visible in the stats
below). However, they also do a sort of super-GC when they go idle, in
which the process memory is GCed then compressed. Obviously going into
and out of this state is expensive, which is why we don't do it all the time.

> If we were in the weird mode where publisher
> confirmations were trickling back (despite no publishes to queues
> happening, and the queues themselves being empty), would this prevent
> the queues from freeing memory?

It certainly shouldn't.

> > # rabbitmqctl eval 'lists:sublist(lists:reverse(lists:sort([{catch
> process_info(Pid, memory), Pid, catch process_info(Pid)} || Pid <-
> processes()])), 30).'
>
> Great - thanks! The cluster is in a "healthy" state pushing around
> 10k/s, and the largest process appears to the call to eval, am I
> interpreting this correctly? Memory usage is usually pretty low when
> the queues are nearly empty.

You are. I was hoping to see the output of that command when the system
was in a bad state...

> Also I notice a most of the other processes in the top 10 are
> "process_next_msg" functions, and the dictionary (property, attribute?)
> seems to hold counters (credit_to) that appear to be for the internal
> rate limiting / flow control.

Yes, you should also be able to see a description of the process under
'process_name' in the dictionary.

> Often we are trying to determine why a
> connection/channel/queue is in "flow", could this information help
> diagnose that?

Have you seen
http://www.rabbitmq.com/blog/2014/04/14/finding-bottlenecks-with-rabbitmq-3-3/
?
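For a quick first look at what is currently being throttled, the standard listings can also help (a sketch; it assumes the `state` info item, which reports "flow" on 3.3+ when a connection or channel is under flow control):

```shell
# Connections/channels in flow control show "flow" in the state column:
rabbitmqctl list_connections name state | grep flow
rabbitmqctl list_channels connection name state | grep flow
```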

> And finally, is there an easy way to get these erlang structures into
> valid JSON? They seem pretty close already.

It's a bit more fiddly than that, since Erlang terms and JSON have
different type systems; you need to make some decisions about how to
convert things.

But: the HTTP API exported by the rabbitmq-top plugin talks JSON and
gives very similar information.

Cheers, Simon
