RabbitMQ possible memory leak


Caroline Briffa

Nov 3, 2017, 2:00:50 PM
to rabbitmq-users
Hi,

We are encountering a situation where RabbitMQ is unexpectedly closing the AMQP connections of a simple producer/consumer POC (attached). This happens when the same channel is shared between the producing and consuming threads. The memory usage keeps increasing until the high watermark is reached, after which the connection is dropped. When creating a new channel per thread, the memory usage remains constant and this behaviour is not observed. 

Are there any known memory leaks which may be causing this issue?

The versions used to run this POC are:

RabbitMQ 3.6.12 
Erlang 19.2.1
amqp-client 5.0.0 
AMQP 0-9-1

and the RabbitMQ logs are:

=WARNING REPORT==== 3-Nov-2017::16:20:02 ===
memory resource limit alarm cleared on node rabbit@63cd86a8d6f7

=WARNING REPORT==== 3-Nov-2017::16:20:02 ===
memory resource limit alarm cleared across the cluster

=ERROR REPORT==== 3-Nov-2017::16:20:11 ===
closing AMQP connection <0.575.0> (172.17.0.1:58532 -> 172.17.0.2:5672):
{inet_error,enotconn}

Thanks,
Caroline
rabbit-poc.zip

Michael Klishin

Nov 3, 2017, 3:02:05 PM
to rabbitm...@googlegroups.com
It is almost certainly not a memory leak. High or "higher than expected"
memory consumption of a node is not evidence of a memory leak.

Contrary to popular belief, TCP connections consume RAM, and so do the transient
messages a producer publishes. "Transient" here means the publisher explicitly tells
RabbitMQ to

 * Use a non-durable queue
 * Keep messages in memory (the default delivery mode is transient)

There is no need to guess, however: RabbitMQ provides tools to break memory usage down.

See [1], `rabbitmqctl status`, management UI's memory breakdown on the node page, the sections on TCP connection buffer logging in [2],
[3], [4], [5]. The individual queue page in the management UI provides a breakdown of message locations
(RAM vs. disk, or *both*).

=ERROR REPORT==== 3-Nov-2017::16:20:11 ===
closing AMQP connection <0.575.0> (172.17.0.1:58532 -> 172.17.0.2:5672):
{inet_error,enotconn}

means a TCP socket was detected by RabbitMQ to be in the "not connected" state (a dead one). When a node is in alarm
state [6], it stops reading from the socket on connections that publish, which among other things can lead
to timeouts on the client end and client socket closure.

TL;DR: you probably want to use a durable queue, publish messages as persistent, or use a lazy queue. If you insist that
RabbitMQ keeps 10M messages in RAM for as long as possible, expect them to potentially consume a fair amount of RAM.
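
Here is a minimal sketch of the "durable queue plus persistent messages" option with the Java client (the queue name and host are illustrative, not taken from the attached POC):

```java
import com.rabbitmq.client.Channel;
import com.rabbitmq.client.Connection;
import com.rabbitmq.client.ConnectionFactory;
import com.rabbitmq.client.MessageProperties;

import java.nio.charset.StandardCharsets;

public class PersistentPublishSketch {
    public static void main(String[] args) throws Exception {
        ConnectionFactory factory = new ConnectionFactory();
        factory.setHost("localhost"); // adjust to your broker

        Connection conn = factory.newConnection();
        Channel ch = conn.createChannel();

        // durable = true: the queue definition survives a broker restart
        ch.queueDeclare("poc-queue", true, false, false, null);

        // PERSISTENT_TEXT_PLAIN sets delivery mode 2, so the message is
        // written to disk instead of being kept only in memory
        ch.basicPublish("", "poc-queue",
                MessageProperties.PERSISTENT_TEXT_PLAIN,
                "hello".getBytes(StandardCharsets.UTF_8));

        ch.close();
        conn.close();
    }
}
```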






--
MK

Staff Software Engineer, Pivotal/RabbitMQ

Michael Klishin

Nov 3, 2017, 3:06:01 PM
to rabbitm...@googlegroups.com
When creating a new channel per thread, the memory usage remains constant and this behaviour is not observed.

I hadn't noticed this line at first. Again, no need to guess: ask the node what consumes RAM. That said, I have a hypothesis
based on this observation:

 * Publishing on a shared channel is a no-no per [1]
 * …and it will result in a connection-level (fatal) framing issue due to incorrect interleaving on the wire (just search "concurrent AND interleaving on the wire" in this list's archive)
 * To which Java client 4.0+, which has automatic connection recovery [1] enabled by default, will react by reconnecting
 * Therefore connections pile up, and each connection consumes at least 100 kB for its TCP buffers by default on most OS'es [2]

Per [2], all successful inbound connections from clients are logged.
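
To illustrate the channel-per-thread pattern from the list above, here is a rough Java client sketch (host and queue name are illustrative): the connection is shared, but each publishing thread opens and owns its own channel.

```java
import com.rabbitmq.client.Channel;
import com.rabbitmq.client.Connection;
import com.rabbitmq.client.ConnectionFactory;

public class ChannelPerThreadSketch {
    public static void main(String[] args) throws Exception {
        ConnectionFactory factory = new ConnectionFactory();
        factory.setHost("localhost"); // adjust to your broker

        // Sharing a connection between threads is fine; sharing a channel
        // between publishing threads is not.
        final Connection conn = factory.newConnection();

        Runnable publisher = () -> {
            try {
                Channel ch = conn.createChannel(); // owned by this thread only
                ch.queueDeclare("poc-queue", true, false, false, null);
                for (int i = 0; i < 1000; i++) {
                    ch.basicPublish("", "poc-queue", null, ("msg " + i).getBytes());
                }
                ch.close();
            } catch (Exception e) {
                e.printStackTrace();
            }
        };

        Thread t1 = new Thread(publisher);
        Thread t2 = new Thread(publisher);
        t1.start();
        t2.start();
        t1.join();
        t2.join();
        conn.close();
    }
}
```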



Mark Galea

Nov 3, 2017, 4:43:51 PM
to rabbitmq-users
Hi Michael

It seems that your hypothesis is in line with what we are observing. What we observe is that when using a shared connection the rabbit node's port 5672 stops accepting connections and subsequently rabbit goes down. Another thing we observe while looking at the Wireshark capture is that the last message before the node shutdown has a really high channel id. So, let's say, for all messages except the last one the channel id is 1, whereas for the last one we see channel id 25232.

From our testing it seems that rabbit is more sensitive to this issue from version 3.6.10 onwards. Up until version 3.6.9 it's quite hard to replicate this issue. Any reason you can think of?

Another thing which is weird is that we don't see any FIN_WAIT connections on the rabbit server. Is our expectation correct here? Should we see FIN_WAIT connections when netstat'ing on the server?


Sorry for the typos. The mobile UI doesn't really help.

Thanks for the reply

Mark

Michael Klishin

Nov 3, 2017, 5:42:36 PM
to rabbitm...@googlegroups.com
3.6.9 underreports how much memory a node uses, so chances are it never hits the high memory watermark.

I cannot think of any other relevant change since then.

I cannot suggest whether a server-initiated close is what's actually going on without seeing
at least a couple of pages worth of logs. Connection-level exceptions should be easy to spot.

It used to be the case that blocked connections were impossible for the server to close
before a certain period of time had passed. That was addressed long before 3.6.9.

Try a heartbeat timeout value of 5-6 seconds (not lower) and see if it changes anything.
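
For reference, with the Java client the heartbeat is requested on the ConnectionFactory before the connection is created; a small sketch (host is illustrative, the value simply follows the suggestion above):

```java
import com.rabbitmq.client.Connection;
import com.rabbitmq.client.ConnectionFactory;

public class HeartbeatSketch {
    public static void main(String[] args) throws Exception {
        ConnectionFactory factory = new ConnectionFactory();
        factory.setHost("localhost");     // adjust to your broker
        factory.setRequestedHeartbeat(6); // heartbeat timeout in seconds; the final value is negotiated with the server

        Connection conn = factory.newConnection();
        // ... open channels and do work ...
        conn.close();
    }
}
```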

Also note that the client and RabbitMQ are not necessarily going to detect TCP connection
termination at the same time, which can explain certain discrepancies between client-reported I/O exceptions
and what RabbitMQ node(s) log.


Michael Klishin

Nov 3, 2017, 5:45:57 PM
to rabbitm...@googlegroups.com
Also, while I applaud users who want to get down to the root cause of a certain behaviour,
you pretty much have an answer: don't share channels for publishing, concurrency hazards
are expected in that case.

It would be great if you could post the memory breakdown section of `rabbitmqctl status` for each scenario.




--
MK

Staff Software Engineer, Pivotal/RabbitMQ

Caroline Briffa

Nov 4, 2017, 11:59:22 AM
to rabbitmq-users
Thanks for your replies Michael.

These are the memory breakdown sections for both scenarios. Snapshots were taken after approx. 350K messages, given that in the shared channel scenario the connection is aborted after approx. 400K and memory consumption goes back to normal levels when that happens.

Separate channels -

{memory,
     [{total,171943064},
      {connection_readers,94200},
      {connection_writers,570672},
      {connection_channels,1833104},
      {connection_other,145120},
      {queue_procs,9455640},
      {queue_slave_procs,0},
      {plugins,364288},
      {other_proc,22724936},
      {mnesia,63104},
      {mgmt_db,93472},
      {msg_index,37104},
      {other_ets,1371144},
      {binary,4332416},
      {code,27328278},
      {atom,992409},
      {other_system,102537177}]},

Shared channel -

 {memory,
     [{total,530111912},
      {connection_readers,53552},
      {connection_writers,539840},
      {connection_channels,32221744},
      {connection_other,71016},
      {queue_procs,298394024},
      {queue_slave_procs,0},
      {plugins,367280},
      {other_proc,42461328},
      {mnesia,63104},
      {mgmt_db,87368},
      {msg_index,37104},
      {other_ets,1370048},
      {binary,23516744},
      {code,27328278},
      {atom,992409},
      {other_system,102608073}]},


The API says to avoid publishing from different threads on the same channel, however 'Consuming in one thread and publishing in another thread on a shared channel can be safe'. In our scenario we are consuming and publishing on separate threads using the same channel, but manually acknowledging the message we consume too. Is the basicAck also considered publishing in this case?

Would you have time to look at the logs to spot any server-initiated closes if I posted them here? Is setting the log_levels granularity as [{connection, debug}] enough to retrieve this information?

Thanks again,
Caroline

Arnaud Cogoluègnes

Nov 6, 2017, 8:33:14 AM
to rabbitm...@googlegroups.com

The API says to avoid publishing from different threads on the same channel, however 'Consuming in one thread and publishing in another thread on a shared channel can be safe'. In our scenario we are consuming and publishing on separate threads using the same channel, but manually acknowledging the message we consume too. Is the basicAck also considered publishing in this case?



You're fine if you ack from the same thread as handleDelivery.
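
A rough sketch of that pattern with the Java client (connection setup and queue name are illustrative): the acknowledgement happens inside handleDelivery, i.e. on the same consumer work-pool thread that delivered the message.

```java
import com.rabbitmq.client.AMQP;
import com.rabbitmq.client.Channel;
import com.rabbitmq.client.Connection;
import com.rabbitmq.client.ConnectionFactory;
import com.rabbitmq.client.DefaultConsumer;
import com.rabbitmq.client.Envelope;

import java.io.IOException;

public class AckInHandleDeliverySketch {
    public static void main(String[] args) throws Exception {
        ConnectionFactory factory = new ConnectionFactory();
        factory.setHost("localhost"); // adjust to your broker

        Connection conn = factory.newConnection();
        final Channel ch = conn.createChannel();
        ch.queueDeclare("poc-queue", true, false, false, null);

        // autoAck = false: we acknowledge manually below
        ch.basicConsume("poc-queue", false, new DefaultConsumer(ch) {
            @Override
            public void handleDelivery(String consumerTag, Envelope envelope,
                                       AMQP.BasicProperties properties, byte[] body)
                    throws IOException {
                // ... process the message ...

                // ack on the thread that handleDelivery runs on
                getChannel().basicAck(envelope.getDeliveryTag(), false);
            }
        });
    }
}
```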

Michael Klishin

Nov 6, 2017, 11:17:19 AM
to rabbitm...@googlegroups.com
Acknowledging is not publishing.

The biggest difference is in the queue processes section. Queues do not generally care
about how many channels there are, but using 1 channel vs. N channels can have a material impact
on throughput since channels have limited throughput on the server (I don't think the difference on the consumer
end would make much difference since the consumer work pool is per-connection in the Java client IIRC).

In other words, it can be that in one workload your consumer(s) cannot keep up and thus messages pile up,
or higher queue process GC activity causes memory usage to go up rapidly but then go down just as quickly.

See how this correlates to other metrics such as various rates and message backlog.

You don't need to enable debug logging to find relevant information. Inbound connections, connection and channel
exceptions, missed heartbeats — all of that is visible with defaults.

Our small team makes no promises on whether we will inspect the logs. It can be a very time-consuming process.
You are definitely welcome to post them here because honestly, without logs, code and metrics we are playing a guessing game.



Mark Galea

Nov 6, 2017, 2:00:11 PM
to rabbitmq-users
We have tested a bit on our end and we seem to have identified a possible cause with our setup.  The setup is as follows:

We have run a specific interleaving to accentuate the problem.

Setup A - Shared Channel (Queues not transactional)
The producer produces 2,000,000 messages on the business_output queue.  The consumer is still switched off at this point.
Looking at queue information in the rabbit UI we can see that there are 139,571 messages in memory (section Messages) and 1,860,429 paged out.

Now we switch off the producer and we start a consumer with a QoS of 10 which does not acknowledge (so we see the impact on the memory usage).  
We see 10 unacked messages, 0 Ready, and the same number of messages in memory and paged out; 139,571 in memory (section Messages) and 1,860,429 paged out. So far things are fine.  

Next, we switch off the consumer and start a consumer with no QoS settings (prefetch 0), and we can see that 2,000,000 messages are now in memory (section Messages) and 0 are paged out.  Basically, the QoS allowed Rabbit to keep messages paged out to disk and maintain sane memory levels.  Without a QoS, the consumer keeps pulling messages and all of them are kept in memory until rabbit dies. Is there no protection against this scenario? Shouldn't flow control prevent this?

Other Observations:
When a consumer and producer are switched on using separate connections we can observe publish rates of 12K msg/sec and consumption rates of 12K msg/sec.  Hence we never get to the QoS drama situation described above.

When the consumer and producer are switched on using a shared connection we can observe that the producer publishes at a rate which is not equal to the consumption rate of the consumer.  The producer is around 7K/sec and the consumer is consuming at around 1K/sec.  Hence on a shared connection the producer hogs the process and starves the consumer.  This leads to the consumer retaining more messages in memory and will result in rabbit dying when the QoS is 0.  

Are channels the unit of concurrency? If we use multiple channels will we make better use of the underlying cores? Why are we observing uneven rates when the channel is shared? 

Another observation is that when we have 2,307,400 messages (of 1 byte each, all in memory) on the queue we are already hitting the high memory watermark and memory usage is at around 1.5 GB.  Our expectation here is that the queue should consume about 2.31 MB.  Any ideas why we are hitting such memory usage?

Michael Klishin

Nov 6, 2017, 2:27:57 PM
to rabbitm...@googlegroups.com
It depends on who you ask. Automatic acknowledgements are not the safest thing in the world
and they trade off a lot of things for efficiency. The guide that covers them warns about the possible
ramifications:

In addition to that, queues have a concept of "memory pressure" (what percentage or time frame of memory is free before the node hits
an alarm, given current ingress and egress rates). So if the node is far from reaching its limit, queues will eagerly load
messages into memory (why wouldn't they?). If you want a different behavior, either use manual acknowledgements
with a moderately high prefetch (say, 100 to 250) or tweak the VM high watermark and paging ratio.
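
In Java client terms, the prefetch part of that suggestion is a single call on the consuming channel; a sketch (the number is simply picked from the range mentioned above):

```java
import com.rabbitmq.client.Channel;
import com.rabbitmq.client.Connection;
import com.rabbitmq.client.ConnectionFactory;

public class PrefetchSketch {
    public static void main(String[] args) throws Exception {
        ConnectionFactory factory = new ConnectionFactory();
        factory.setHost("localhost"); // adjust to your broker

        Connection conn = factory.newConnection();
        Channel ch = conn.createChannel();

        // With manual acknowledgements, the prefetch caps the number of
        // unacknowledged deliveries outstanding on this channel, so a slow
        // consumer cannot pull an entire multi-million message backlog at once.
        ch.basicQos(250);

        // ... register a manually acknowledging consumer on ch ...
    }
}
```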

You keep using "connection" and "channel" seemingly interchangeably here. They are separate entities:

Each is backed by its own Erlang process, so yes, they are units of concurrency, although it may be a lot less obvious
for connections than for queues and channels.

A channel shared for both consumption and publishing will be a limiting factor.


Mark Galea

Nov 7, 2017, 3:56:54 PM
to rabbitmq-users
Hi Michael, 

Apologies for using channel and connection a bit loosely.  I have translated our POC to PerfTest commands so that things are a bit clearer.

Here is the use case I tried to describe yesterday.  We have a producer which produces 5 million messages.  

    bin/runjava com.rabbitmq.perf.PerfTest -x 1 -y 0 -u "non-lazy" --id "test 1" -pmessages 5000000

After this has completed we can see that there are 224,013 messages in memory and 4,775,987 paged out.  So far so good!

We now run the consumer with the following settings. 

    bin/runjava com.rabbitmq.perf.PerfTest -x 0 -y 1 -u "non-lazy" --id "test 1" -q 0 -A 5000001

We are not crazy :) we specifically run the consumer with a QoS of 0 and set it to acknowledge only after 5000001 messages so that we simulate a misconfigured or misbehaving consumer.

What we can observe is that the paged-out messages start moving back into memory (as expected) and rabbit eventually dies (hmmm!).  One easy way to solve this is by setting the QoS (something > 0), but why does this happen?  Why would a server choose death over, say, not handing out further messages?

This can be reproduced by running the following docker-compose file:

version: '3'
services:
    rabbitmq:
        image: rabbitmq:3.6.12-management
        hostname: rabbit
        ports:
            - "5672:5672"
            - "15672:15672"
        environment:
            - RABBITMQ_ERLANG_COOKIE='secret cookie here'

on a host with 8 CPUs and 7.2 GB of memory.

# Publish observations 
During the publish we can observe that we hit the memory high watermark alarm quite a number of times (and frequently); in some cases, the memory goes up to 3.7 GB (the high watermark is at 2.8 GB). In order to control the high watermark fluctuations we found that the following configuration gives better results:

# rabbit.config
```
[
        { rabbit, [
                { loopback_users, [ ] },
                { tcp_listeners, [ 5672 ] },
                { ssl_listeners, [ ] },
                { hipe_compile, false },
                { vm_memory_high_watermark_paging_ratio, 0.1 },
                { vm_memory_high_watermark, 0.3 }
        ] },
        { rabbitmq_management, [ { listener, [
                { port, 15672 },
                { ssl, false }
        ] } ] }
].
```

You mentioned a vm_memory_high_watermark of 0.3 in one of your replies - https://groups.google.com/forum/#!searchin/rabbitmq-users/memory$203.6.11$200.3%7Csort:date/rabbitmq-users/-qPkj_ty6I8/8R0mFUBOBQAJ.  Are the defaults of

                { vm_memory_high_watermark_paging_ratio, 0.5 },
                { vm_memory_high_watermark, 0.4 }

based on a particular configuration? Do you have any other insights which we might be missing?


During the investigation we have also tried lazy queues, and there seems to be a general feeling that these should be enabled when the producer produces at a faster rate than the consumer (our use case). We have experimented with lazy queues and we see no perceivable performance degradation; on the contrary, the memory is way more stable (as expected).  In our case we typically have a 4:1 ratio between producer and consumer rates. Are there any recommendations on when one would enable lazy queues?

Thanks a lot for your time

--
Mark

Michael Klishin

Nov 9, 2017, 5:54:40 AM
to rabbitm...@googlegroups.com
During the investigation we have also tried lazy queues, and there seems to be a general feeling that these should be enabled when the producer produces at a faster rate than the consumer (our use case).

Lazy queues were introduced exactly for that scenario.

Note that a routine 4:1 publisher:consumer rate will eventually run the node out of disk space. I'm sure you understand this
but it's worth pointing out because we see cases of nodes running out of disk space every few days.

In that case I'd recommend limiting queue length using http://www.rabbitmq.com/maxlength.html as well.
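
One way to apply such a limit from application code is via optional queue arguments at declaration time; a sketch (the limit is illustrative, and the maxlength page above also describes setting it with a policy instead):

```java
import com.rabbitmq.client.Channel;
import com.rabbitmq.client.Connection;
import com.rabbitmq.client.ConnectionFactory;

import java.util.HashMap;
import java.util.Map;

public class BoundedLazyQueueSketch {
    public static void main(String[] args) throws Exception {
        ConnectionFactory factory = new ConnectionFactory();
        factory.setHost("localhost"); // adjust to your broker

        Connection conn = factory.newConnection();
        Channel ch = conn.createChannel();

        Map<String, Object> queueArgs = new HashMap<>();
        queueArgs.put("x-max-length", 1000000); // cap the backlog (illustrative number); by default the oldest messages are dropped once the limit is exceeded
        queueArgs.put("x-queue-mode", "lazy");  // lazy queue: keep the backlog on disk rather than in RAM

        ch.queueDeclare("business_output", true, false, false, queueArgs);

        ch.close();
        conn.close();
    }
}
```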


Mark Galea

Nov 9, 2017, 6:04:58 AM
to rabbitmq-users
Thanks Michael! Will make sure to set the max length.   

We have continued our investigation of 'The strange incident of Rabbit death at the night time!' and we have discovered the following: 

RabbitMQ version 3.6.12, Erlang 19.2.1 - Rabbit is fine

RabbitMQ version 3.6.12, Erlang 18.0 - Rabbit dies
RabbitMQ version 3.6.12, Erlang 18.0 (management plugin disabled) - Rabbit is intermittent but typically survives!
RabbitMQ version 3.6.12, Erlang 18.0 (management plugin enabled but with rates_mode = none, collect_statistics = coarse, collect_statistics_interval = 60s) - Rabbit is fine

Do you have any thoughts on why this would happen? Is there a suggested Rabbit/Erl version which we should go with? 

Can someone explain the difference between collect_statistics in the rabbit section and rates_mode?  I read the documentation but it's still not clear what each setting is actually doing.

Did you have time to look at the QoS explanation above? Does it make sense that rabbit dies rather than throttling the consumers?

--
Mark

Michael Klishin

Nov 9, 2017, 7:24:20 AM
to rabbitm...@googlegroups.com
You haven’t provided enough information (namely server logs) to answer this and I do not speculate on “node crashes.”

The docs [1] cover Erlang versions supported and recommended at the moment.

As announced earlier on this list, 3.6.15 will drop support for Erlang/OTP versions older than 19.3.

Michael Klishin

Nov 9, 2017, 7:29:16 AM
to rabbitm...@googlegroups.com
This thread contains like 4 different questions now. This is poor mailing list etiquette.

Starting with 3.6.7, it is extremely rare to see the stats DB overloaded. It is likely that
you are chasing the wrong goose and the nature of your incidents has nothing to do with the stats DB. It is rarely necessary to tweak rates mode and stats collection intervals in modern versions.

rates_mode is the only setting that controls stats collection mode starting with 3.3.0 or 3.4.0 IIRC. Others are kept around for backwards compatibility.

See server logs and http://rabbitmq.com/memory-use.html. Collect data about your system over time instead of speculating.


Mark Galea

Nov 9, 2017, 7:51:46 AM
to rabbitmq-users
You are right, I'll open a separate thread.  Thanks for your help and apologies for flooding the thread.  