RMQ being killed by kernel due to memory overuse

Kapil Goyal

Apr 14, 2015, 11:04:38 PM
to rabbitm...@googlegroups.com
Hi Folks,

I am using RMQ 3.3.5 with Erlang R15B03 (64-bit). My setup consists of a single RMQ broker with about 500 clients (~1000 connections). The system has 12 GB RAM and RMQ is using the default vm_memory_high_watermark of 0.4. Some other processes are consuming about 6 GB RAM. The broker is being killed by the kernel within about a minute of starting because the system runs out of memory.

[75336.430645] Out of memory: Kill process 19802 (beam.smp) score 671 or sacrifice child
[75336.430865] Killed process 19802 (beam.smp) total-vm:10008664kB, anon-rss:9359808kB, file-rss:0kB

I generated an RMQ status report just before one such crash:

{memory,
     [{total,7349308784},
      {connection_procs,146138792},
      {queue_procs,46631136},
      {plugins,16382856},
      {other_proc,0},
      {mnesia,4372264},
      {mgmt_db,42765160},
      {msg_index,568968},
      {other_ets,5177064},
      {binary,6909454248},
      {code,21097280},
      {atom,801697},
      {other_system,163107339}]},
{vm_memory_high_watermark,0.4},
 {vm_memory_limit,5031488716},


It appears that the messages from all the clients that connect to the broker when it starts are causing this memory bloat. What still confuses me is why RMQ allowed its memory use to grow about 2 GB beyond the limit. I then tried reducing the watermark to 0.2 and reran the test:

19802 urabbit   20   0 9353m 8.6g 1376 S   29 73.4   2:43.69 beam.smp 

RMQ still ends up using about 8 GB and gets killed.

How do I protect RMQ?

Thanks
Kapil

Michael Klishin

Apr 15, 2015, 12:31:17 AM
to Kapil Goyal, rabbitm...@googlegroups.com
On 15 April 2015 at 06:04:38, Kapil Goyal (goy...@vmware.com) wrote:
> It appears that the messages from all the clients that connect
> to broker when it starts are causing this bloat in memory. It still
> confused me why RMQ was allowing its memory use to grow about 2
> GB more than the limit. I then tried reducing the watermark to
> .2 and reran the test:
>
> 19802 urabbit 20 0 9353m 8.6g 1376 S 29 73.4 2:43.69 beam.smp
>
> RMQ still ends up using about 8 GB and gets killed.

RabbitMQ's memory monitor performs its checks periodically (every 2500 ms). In that window
you can publish enough large messages to blow past the limit before the alarm kicks in. Combined
with the fairly lazy GC of binaries in the Erlang runtime, that produces the effect you see.

Starting with 3.4.0, GC is invoked after every 1 MB of processed message data [2]. Erlang 17
also seems to lower the threshold at which it considers binaries for collection [1].

So, upgrade to at least 3.4.x and Erlang 17. If you are running the VMware-produced monolithic
RPM, there's an updated version (17.4):
http://www.rabbitmq.com/releases/erlang/

You can also lower the watermark to, say, 0.3 but bump vm_memory_high_watermark_paging_ratio to 0.7
(but definitely not higher than 0.9) to keep the (absolute) RAM threshold for paging to disk roughly the same:

0.4 * 0.5 = 0.2 = 20% of total RAM before
0.3 * 0.7 = 0.21 = 21% of total RAM after 
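
In rabbitmq.config terms, that would look roughly like the following (a minimal sketch based on the example file in [3]; merge it into your existing config rather than replacing it):

[
  {rabbit, [
    %% block publishers once the node uses ~30% of system RAM
    {vm_memory_high_watermark, 0.3},
    %% start paging messages to disk at 70% of that limit (~21% of RAM)
    {vm_memory_high_watermark_paging_ratio, 0.7}
  ]}
].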

1. http://blog.bugsense.com/post/74179424069/erlang-binary-garbage-collection-a-love-hate
2. 25850 at http://www.rabbitmq.com/release-notes/README-3.4.0.txt
3. https://github.com/rabbitmq/rabbitmq-server/blob/master/docs/rabbitmq.config.example#L188
--
MK

Staff Software Engineer, Pivotal/RabbitMQ


Kapil Goyal

Apr 15, 2015, 5:02:33 AM
to Klishin, Michael (Pivotal), rabbitm...@googlegroups.com
Thanks Michael. Upgrading will make GC kick in more frequently. But what about the messages that haven't been consumed yet? They won't be GC candidates. Isn't there another way, for example decreasing the interval at which the memory monitor checks memory use? It seems to me that it would be better to stop accepting messages, even at the cost of some performance and throughput, than to accept more and risk crashing.

Is it possible to generate logs for Erlang GC so I can monitor whether it's kicking in at all before the crash? I read on the RMQ website that when GC runs, it can allocate as much memory for itself as is already used by the app (RMQ). Given that GC doesn't kick in until very late, i.e. when the system is already low on memory, this seems like a recipe for a crash. Surely I am misunderstanding something here. :)

Also, could you please explain the benefit of choosing 0.3 and 0.7 vs. 0.4 and 0.5?

Regards
Kapil Goyal



Jean-Sébastien Pédron

Apr 15, 2015, 5:17:42 AM
to rabbitm...@googlegroups.com
On 15.04.2015 11:01, Kapil Goyal wrote:
> But what about the messages that haven't been consumed yet? They'll
> not be GC candidates. Isn't there another way, for example
> decreasing the interval at which memory monitor check memory use? It
> seems to me that it'd be better to not accept messages even at the
> cost of some performance and throughput than accept more and risk
> crashing.

The memory watermarks Michael talked about are there for that purpose. They
won't trigger the checks more frequently, but the memory monitor will ask
queues to page messages to disk earlier. See the following documentation:
http://www.rabbitmq.com/memory.html

> Is it possible to generate logs for erlang GC so I can monitor if
> it's ever kicking in at all before the crash? I read on RMQ website
> that when GC runs, it could allocate as much memory for itself as
> already used by the app (RMQ). Considering that GC doesn't kick in
> until very late, ie when the system is already low on memory, it
> seems like a recipe for a crash. Surely I am misunderstanding
> something here. :)

When the GC kicks in, it can temporarily double the memory consumption.
That's why the high memory watermark (vm_memory_high_watermark) defaults
to 40% of the system memory. RabbitMQ tries to remain under that limit
all the time, but it's not a hard limit: we can't control the Erlang VM's needs.

The second parameter described in the article linked above
(vm_memory_high_watermark_paging_ratio) tells the memory monitor to
start paging queues out when the memory consumption is above that ratio
of the high memory watermark. The goal is to slow down the increase of
memory consumption.
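
To put concrete numbers on this for the node in the thread (the status output reported a vm_memory_limit of roughly 5 GB): with the defaults, paging to disk starts at about 0.5 * 5 GB = 2.5 GB of broker memory use, the memory alarm that blocks publishers fires at about 5 GB, and a GC-induced doubling on top of that is what the remaining 60% of system RAM is meant to absorb.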

--
Jean-Sébastien Pédron
Pivotal / RabbitMQ

Kapil Goyal

Apr 15, 2015, 12:31:58 PM
to Jean-Sébastien Pédron, rabbitm...@googlegroups.com
Hello Jean,

As I mentioned in my first email, reducing the watermark to 0.2 didn't help. The system is in a state now where RMQ cannot stay up for more than a minute. Short of upgrading, is there anything at all I can try to protect RMQ?

Thanks
Kapil

Michael Klishin

Apr 15, 2015, 3:41:31 PM
to Kapil Goyal, Jean-Sébastien Pédron, rabbitm...@googlegroups.com
On 15 April 2015 at 19:31:58, Kapil Goyal (goy...@vmware.com) wrote:
> As I mentioned in my first email, reducing the watermark to .2
> didn't help. The system is in a state now where RMQ cannot stay
> up for more than a minute. Short of upgrading, is there anything
> at all I can try to protect RabbitMQ?

Kapil,

If no value in the range 0.1 to 0.4 works, you may want to consider whether some applications
should be moved off this VM. While there's a window between monitor checks in which you can
run over the limit, you will likely overshoot by roughly the same amount on average, so with a lower
watermark the absolute RAM usage will be lower. If the OOM killer still kicks in, your VM is underprovisioned in terms of RAM.

We really don't see this problem often,
so the memory monitor interval is quite reasonable.

We've listed your options:

 * Lower the VM watermark (see the runtime example after this list)
 * Try a different paging ratio (it may make sense moving it lower if throughput is less important)
 * Upgrade to 3.4.4 or 3.5.2.
 * Try Erlang 17.x.
 * Finally, try running fewer processes in the same VM: OOM kicking in at 20% of RAM use suggests that RabbitMQ is not necessarily the largest RAM consumer in the VM. 
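
As an illustration of the first option, the watermark can also be changed on a running node without a restart (a sketch; check what your rabbitmqctl version supports):

rabbitmqctl set_vm_memory_high_watermark 0.3

or, on releases whose rabbitmqctl lacks that subcommand:

rabbitmqctl eval 'vm_memory_monitor:set_vm_memory_high_watermark(0.3).'

Either way the change is transient and reverts to the configured value when the node restarts.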

Kapil Goyal

Apr 15, 2015, 9:41:12 PM
to Klishin, Michael (Pivotal), Jean-Sébastien Pédron, rabbitm...@googlegroups.com
Thanks Michael.

I am now trying to understand which queues are contributing to the memory
bloat when the broker's memory use grows. The trouble is that when the
broker gets into this state, rabbitmqctl stops responding and I cannot get
any information, for example the list of queues or the report. How do you
recommend I go about troubleshooting this?

Regards
Kapil

Michael Klishin

Apr 15, 2015, 11:43:51 PM
to Kapil Goyal, Jean-Sébastien Pédron, rabbitm...@googlegroups.com
On 16 April 2015 at 04:41:09, Kapil Goyal (goy...@vmware.com) wrote:
> Trouble is that when broker gets into this state, rabbitmqctl stops
> responding and I cannot get any information, for example the list of
> queues or the report. How do you recommend I go about troubleshooting this?

This suggests the OS is swapping. Resource alarms do not affect `rabbitmqctl`. Tools
such as iostat and vmstat will provide more details.
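
For instance (standard OS tools, nothing RabbitMQ-specific), something along the lines of:

vmstat 5       # non-zero si/so columns mean the box is actively swapping
iostat -x 5    # per-device I/O; watch utilisation on the swap device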

If OS swapping is indeed going on, your best bet is to monitor queue size before swapping happens.

There are multiple features that can be used via policies (so, without modifying any apps) to limit
queue growth: https://www.rabbitmq.com/blog/2014/01/23/preventing-unbounded-buffers-with-rabbitmq/.
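
For example (a sketch only: the queue name pattern and the numbers are made up, adjust them to your naming and workload), a length cap plus a per-message TTL can be applied to all matching queues with a single policy:

rabbitmqctl set_policy --apply-to queues bounded "^mgmt\." '{"max-length":1000,"message-ttl":60000}'

Both keys should be usable via policies on 3.3.x, so no application changes are required.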

Kapil Goyal

Apr 16, 2015, 5:41:57 PM
to Klishin, Michael (Pivotal), Jean-Sébastien Pédron, rabbitm...@googlegroups.com
Thanks Michael. In an effort to understand the cause of the RMQ memory
growth, I monitored the rx/tx bytes per second of the system (broker) as
well as the top output. Before I explain my findings, I'd like to go over
our design a bit. We have a management web application collocated with the
broker process on the virtual machine (12 GB RAM). The web app talks to the
broker on the loopback interface. There are also about 500 other external
applications that talk to the broker on the public interface 'mgmt'. Now,
on to my findings.

1) RMQ memory use hovers around 400-600 MB during normal messaging, until
the web app publishes a big message (~15 MB) to the broker on the loopback
interface. This message is to be fanned out to all 500 external
applications over the 'mgmt' interface.

RMQ Memory:

Thu Apr 16 21:06:27 UTC 2015 24724 urabbit 20 0 798m 383m 3436 S 51 3.2 1:19.01 beam.smp
Thu Apr 16 21:06:30 UTC 2015 24724 urabbit 20 0 4839m 4.3g 3436 S 219 37.0 1:23.43 beam.smp
Thu Apr 16 21:06:32 UTC 2015 24724 urabbit 20 0 5790m 5.2g 3436 S 268 44.8 1:30.27 beam.smp
Thu Apr 16 21:06:35 UTC 2015 24724 urabbit 20 0 6597m 6.0g 3436 S 257 51.3 1:36.63 beam.smp
Thu Apr 16 21:06:37 UTC 2015 24724 urabbit 20 0 7402m 6.8g 3436 S 249 58.0 1:43.31 beam.smp
Thu Apr 16 21:06:40 UTC 2015 24724 urabbit 20 0 8071m 7.4g 3436 S 221 63.5 1:49.49 beam.smp


Loopback interface:


Thu Apr 16 21:06:27 UTC 2015 TX lo: 145 kB/s RX lo: 145 kB/s
Thu Apr 16 21:06:30 UTC 2015 TX lo: 11699 kB/s RX lo: 11699 kB/s
Thu Apr 16 21:06:32 UTC 2015 TX lo: 15 kB/s RX lo: 15 kB/s
Thu Apr 16 21:06:34 UTC 2015 TX lo: 11 kB/s RX lo: 11 kB/s


'mgmt' interface:

Thu Apr 16 21:06:38 UTC 2015 TX mgmt: 1513 kB/s RX mgmt: 12 kB/s
Thu Apr 16 21:06:40 UTC 2015 TX mgmt: 101721 kB/s RX mgmt: 854 kB/s
Thu Apr 16 21:06:42 UTC 2015 TX mgmt: 885220 kB/s RX mgmt: 7594 kB/s
Thu Apr 16 21:06:44 UTC 2015 TX mgmt: 1758334 kB/s RX mgmt: 14969 kB/s
Thu Apr 16 21:06:46 UTC 2015 TX mgmt: 1133121 kB/s RX mgmt: 9737 kB/s
Thu Apr 16 21:06:48 UTC 2015 TX mgmt: 2402 kB/s RX mgmt: 61 kB/s
Thu Apr 16 21:06:50 UTC 2015 TX mgmt: 46463 kB/s RX mgmt: 363 kB/s
Thu Apr 16 21:06:52 UTC 2015 TX mgmt: 23265 kB/s RX mgmt: 201 kB/s
Thu Apr 16 21:06:54 UTC 2015 TX mgmt: 1 kB/s RX mgmt: 0 kB/s
Thu Apr 16 21:06:56 UTC 2015 TX mgmt: 190 kB/s RX mgmt: 69 kB/s

2) RMQ memory use goes back to the 400-600 MB range until another big
message (~15 MB) is published by the web app. We see a similar pattern as
before:


RMQ Memory:

Thu Apr 16 21:09:26 UTC 2015 24724 urabbit 20 0 838m 435m 3436 S 2 3.6 2:23.60 beam.smp
Thu Apr 16 21:09:29 UTC 2015 24724 urabbit 20 0 1144m 730m 3436 S 83 6.1 2:24.34 beam.smp
Thu Apr 16 21:09:31 UTC 2015 24724 urabbit 20 0 6265m 5.7g 3436 S 292 48.9 2:32.14 beam.smp
Thu Apr 16 21:09:34 UTC 2015 24724 urabbit 20 0 7555m 7.0g 3436 S 380 59.5 2:41.31 beam.smp
Thu Apr 16 21:09:37 UTC 2015 24724 urabbit 20 0 8498m 7.9g 3436 S 341 67.2 2:50.59 beam.smp
Thu Apr 16 21:09:39 UTC 2015 24724 urabbit 20 0 8867m 8.2g 1140 S 237 70.2 2:57.13 beam.smp
Thu Apr 16 21:09:42 UTC 2015 24724 urabbit 20 0 8955m 8.3g 1228 S 204 70.8 3:01.53 beam.smp
Thu Apr 16 21:09:45 UTC 2015 24724 urabbit 20 0 9063m 8.4g 1260 S 184 71.5 3:06.42 beam.smp
Thu Apr 16 21:09:48 UTC 2015 24724 urabbit 20 0 9137m 8.4g 1260 S 84 72.0 3:09.83 beam.smp
Thu Apr 16 21:09:51 UTC 2015 24724 urabbit 20 0 9297m 8.6g 1232 S 251 73.0 3:15.27 beam.smp
Thu Apr 16 21:09:54 UTC 2015 24724 urabbit 20 0 9342m 8.6g 1232 S 26 73.4 3:17.07 beam.smp
Thu Apr 16 21:09:56 UTC 2015 24724 urabbit 20 0 9376m 8.6g 1244 S 49 73.5 3:18.60 beam.smp
Thu Apr 16 21:09:59 UTC 2015 24724 urabbit 20 0 9404m 8.6g 1212 S 41 73.7 3:20.05 beam.smp
Thu Apr 16 21:10:02 UTC 2015 24724 urabbit 20 0 9446m 8.7g 1224 S 71 73.9 3:21.64 beam.smp
Thu Apr 16 21:10:05 UTC 2015 24724 urabbit 20 0 9470m 8.7g 1224 S 55 74.1 3:22.98 beam.smp
Thu Apr 16 21:10:07 UTC 2015 24724 urabbit 20 0 9498m 8.7g 1236 S 20 74.3 3:23.91 beam.smp
Thu Apr 16 21:10:10 UTC 2015 24724 urabbit 20 0 9512m 8.7g 1236 S 20 74.4 3:24.64 beam.smp
Thu Apr 16 21:10:13 UTC 2015 24724 urabbit 20 0 9531m 8.7g 1236 S 79 74.5 3:25.67 beam.smp
Thu Apr 16 21:10:15 UTC 2015 24724 urabbit 20 0 9547m 8.7g 1236 S 28 74.7 3:26.40 beam.smp
Thu Apr 16 21:10:18 UTC 2015 24724 urabbit 20 0 9559m 8.8g 1236 S 22 74.8 3:27.18 beam.smp
Thu Apr 16 21:10:21 UTC 2015 24724 urabbit 20 0 9565m 8.8g 1236 S 12 74.8 3:27.81 beam.smp
Thu Apr 16 21:10:24 UTC 2015 24724 urabbit 20 0 9571m 8.8g 1236 S 45 74.9 3:28.34 beam.smp
Thu Apr 16 21:10:27 UTC 2015 24724 urabbit 20 0 9583m 8.8g 1248 S 16 75.0 3:29.11 beam.smp
Thu Apr 16 21:10:30 UTC 2015 24724 urabbit 20 0 9589m 8.8g 1248 S 22 75.1 3:29.84 beam.smp
Thu Apr 16 21:10:33 UTC 2015 24724 urabbit 20 0 9606m 8.8g 1256 S 46 75.2 3:30.78 beam.smp
Thu Apr 16 21:10:36 UTC 2015 24724 urabbit 20 0 9621m 8.8g 1256 S 29 75.3 3:31.57 beam.smp
Thu Apr 16 21:10:39 UTC 2015 24724 urabbit 20 0 9635m 8.8g 1280 S 24 75.5 3:32.44 beam.smp
Thu Apr 16 21:10:42 UTC 2015 24724 urabbit 20 0 9638m 8.8g 1280 S 75 75.5 3:33.77 beam.smp
Thu Apr 16 21:10:44 UTC 2015 24724 urabbit 20 0 9673m 8.9g 1280 S 30 75.8 3:35.18 beam.smp
Thu Apr 16 21:10:47 UTC 2015 24724 urabbit 20 0 9687m 8.9g 1280 S 95 75.9 3:36.00 beam.smp
Thu Apr 16 21:10:49 UTC 2015 24724 urabbit 20 0 9709m 8.9g 1256 S 86 76.0 3:37.94 beam.smp
Thu Apr 16 21:10:53 UTC 2015 24724 urabbit 20 0 9746m 8.9g 1272 S 91 76.2 3:40.29 beam.smp
Thu Apr 16 21:10:56 UTC 2015 24724 urabbit 20 0 9764m 8.9g 1276 S 8 76.3 3:41.74 beam.smp
Thu Apr 16 21:10:58 UTC 2015 24724 urabbit 20 0 9775m 9.0g 1292 S 85 76.4 3:42.91 beam.smp
Thu Apr 16 21:11:01 UTC 2015 24724 urabbit 20 0 9799m 9.0g 1292 S 79 76.6 3:44.57 beam.smp
Thu Apr 16 21:11:04 UTC 2015 24724 urabbit 20 0 9829m 9.0g 1316 S 107 76.8 3:46.36 beam.smp
Thu Apr 16 21:11:07 UTC 2015 24724 urabbit 20 0 0 0 0 Z 82 0.0 3:49.24 beam.smp <defunct>

Loopback interface:

Thu Apr 16 21:09:28 UTC 2015 TX lo: 0 kB/s RX lo: 0 kB/s
Thu Apr 16 21:09:30 UTC 2015 TX lo: 11581 kB/s RX lo: 11581 kB/s
Thu Apr 16 21:09:32 UTC 2015 TX lo: 0 kB/s RX lo: 0 kB/s


'mgmt' interface:

Thu Apr 16 21:09:38 UTC 2015 TX mgmt: 132494 kB/s RX mgmt: 1156 kB/s
Thu Apr 16 21:09:41 UTC 2015 TX mgmt: 141215 kB/s RX mgmt: 1286 kB/s
Thu Apr 16 21:09:43 UTC 2015 TX mgmt: 138487 kB/s RX mgmt: 1222 kB/s
Thu Apr 16 21:09:45 UTC 2015 TX mgmt: 147107 kB/s RX mgmt: 1282 kB/s
Thu Apr 16 21:09:48 UTC 2015 TX mgmt: 61329 kB/s RX mgmt: 633 kB/s
Thu Apr 16 21:09:50 UTC 2015 TX mgmt: 40657 kB/s RX mgmt: 373 kB/s
Thu Apr 16 21:09:52 UTC 2015 TX mgmt: 93964 kB/s RX mgmt: 772 kB/s
Thu Apr 16 21:09:54 UTC 2015 TX mgmt: 26843 kB/s RX mgmt: 230 kB/s
Thu Apr 16 21:09:56 UTC 2015 TX mgmt: 13467 kB/s RX mgmt: 121 kB/s
Thu Apr 16 21:09:59 UTC 2015 TX mgmt: 7103 kB/s RX mgmt: 51 kB/s
Thu Apr 16 21:10:01 UTC 2015 TX mgmt: 17099 kB/s RX mgmt: 145 kB/s
Thu Apr 16 21:10:03 UTC 2015 TX mgmt: 16268 kB/s RX mgmt: 140 kB/s
Thu Apr 16 21:10:05 UTC 2015 TX mgmt: 7208 kB/s RX mgmt: 66 kB/s
Thu Apr 16 21:10:08 UTC 2015 TX mgmt: 14068 kB/s RX mgmt: 119 kB/s
Thu Apr 16 21:10:10 UTC 2015 TX mgmt: 2798 kB/s RX mgmt: 26 kB/s
Thu Apr 16 21:10:13 UTC 2015 TX mgmt: 168 kB/s RX mgmt: 2 kB/s
Thu Apr 16 21:10:15 UTC 2015 TX mgmt: 3250 kB/s RX mgmt: 32 kB/s
Thu Apr 16 21:10:17 UTC 2015 TX mgmt: 1320 kB/s RX mgmt: 41 kB/s
Thu Apr 16 21:10:19 UTC 2015 TX mgmt: 2154 kB/s RX mgmt: 30 kB/s
Thu Apr 16 21:10:22 UTC 2015 TX mgmt: 2443 kB/s RX mgmt: 21 kB/s
Thu Apr 16 21:10:24 UTC 2015 TX mgmt: 1616 kB/s RX mgmt: 15 kB/s
Thu Apr 16 21:10:27 UTC 2015 TX mgmt: 3395 kB/s RX mgmt: 30 kB/s
Thu Apr 16 21:10:29 UTC 2015 TX mgmt: 3862 kB/s RX mgmt: 24 kB/s
Thu Apr 16 21:10:31 UTC 2015 TX mgmt: 3058 kB/s RX mgmt: 30 kB/s
Thu Apr 16 21:10:33 UTC 2015 TX mgmt: 316 kB/s RX mgmt: 3 kB/s
Thu Apr 16 21:10:36 UTC 2015 TX mgmt: 8004 kB/s RX mgmt: 98 kB/s
Thu Apr 16 21:10:38 UTC 2015 TX mgmt: 8548 kB/s RX mgmt: 84 kB/s
Thu Apr 16 21:10:40 UTC 2015 TX mgmt: 3427 kB/s RX mgmt: 30 kB/s
Thu Apr 16 21:10:42 UTC 2015 TX mgmt: 22626 kB/s RX mgmt: 206 kB/s
Thu Apr 16 21:10:44 UTC 2015 TX mgmt: 24337 kB/s RX mgmt: 223 kB/s
Thu Apr 16 21:10:47 UTC 2015 TX mgmt: 13415 kB/s RX mgmt: 126 kB/s
Thu Apr 16 21:10:49 UTC 2015 TX mgmt: 9635 kB/s RX mgmt: 143 kB/s
Thu Apr 16 21:10:51 UTC 2015 TX mgmt: 58734 kB/s RX mgmt: 523 kB/s
Thu Apr 16 21:10:54 UTC 2015 TX mgmt: 56883 kB/s RX mgmt: 435 kB/s
Thu Apr 16 21:10:56 UTC 2015 TX mgmt: 38141 kB/s RX mgmt: 358 kB/s
Thu Apr 16 21:10:58 UTC 2015 TX mgmt: 16653 kB/s RX mgmt: 156 kB/s
Thu Apr 16 21:11:01 UTC 2015 TX mgmt: 23908 kB/s RX mgmt: 207 kB/s
Thu Apr 16 21:11:04 UTC 2015 TX mgmt: 39644 kB/s RX mgmt: 386 kB/s
Thu Apr 16 21:11:08 UTC 2015 TX mgmt: 110407 kB/s RX mgmt: 1015 kB/s
Thu Apr 16 21:11:10 UTC 2015 TX mgmt: 222 kB/s RX mgmt: 438 kB/s
Thu Apr 16 21:11:12 UTC 2015 TX mgmt: 24 kB/s RX mgmt: 32 kB/s



This time, it looks like RMQ gets killed before it can empty the queues and
clean up its memory.

My question for you is: why does RMQ memory jump by about 5-6 GB when a
single message of ~15 MB is published to be fanned out to ~500 queues? Do
you think my analysis of why RMQ crashes is on the mark? In addition, does
anything else strike you as unusual or interesting in the above data?

Regards
Kapil



Michael Klishin

Apr 16, 2015, 7:41:27 PM
to Kapil Goyal, Jean-Sébastien Pédron, rabbitm...@googlegroups.com
 

On 17 April 2015 at 00:41:56, Kapil Goyal (goy...@vmware.com) wrote:
> My question for you is - why does the RMQ memory jump up by about 5-6 GB,
> when a single message of ~15MB is published to be fanned out to ~500
> queues.

I cannot explain the sudden 730 MB => 5.7 GB jump; it sounds excessive enough to
suspect an edge case in the runtime. I can try explaining the related overall
problem, which was at least partially fixed in bug 25850 (3.4.0).

Binaries larger than 64 bytes are stored in a separate reference-counted heap.
If you fan out a single large message to N queues (N Erlang processes), then 500+
processes hold references into that heap.

For a reason unknown to me, those binaries are only collected once every process
that has ever "seen" them is GC'ed [1]. In RabbitMQ, a published message passes
through several processes even if it goes straight to the consumer, without
hitting the message store.

Those processes don't do enough reductions (a relative unit of
process work, used by the VM scheduler) to trigger GCs often.
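
As a crude illustration of that point (a sketch, not a recommendation): forcing a
collection on every process on a running node drops those references and usually
releases the binaries, at the cost of a brief pause:

rabbitmqctl eval '[erlang:garbage_collect(P) || P <- erlang:processes()], ok.'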

So what we did was add a GC after every 1 MB of processed data at the "edges" (shortly
after dispatching parsed frames and just before writing to the socket).

This introduced virtually no overhead: in the Erlang VM there is no shared process heap, so
GC is radically simpler and doesn't require "stopping the world" [2]. Our empirical evidence
from interacting with support suggests that the issue is now significantly less of a problem.

Again, I cannot explain the sudden x8 spike you're observing: perhaps RabbitMQ caused
excessive memory copying when we didn't expect it. I suggest that you try 3.4.x and our Erlang 17 RPM.

PerfTest can help with reproducing the issue and comparing the results.

I'm currently running it with

runperftest.sh -s 15000000 -r 10 -y 4

which means 15 MB messages, 10 per second, with 1 publisher and 4 consumers (larger rates result in PerfTest itself
running out of heap or having pauses that result in messages piling up). I observe roughly 150 MB/s
flowing to and from RabbitMQ, with memory use floating between 80 and 100 MB. RabbitMQ 3.5.1, Erlang 17.5 (64 bit).

I've also tried

./runperftest.sh -s 15000000 -r 10 -y 50

and

./runperftest.sh -s 15000000 -r 3 -y 200

In both cases RabbitMQ memory usage stayed roughly stable and only went up when
consumers GC'ed, which resulted in messages not being consumed for a second or a few.

It may be a good idea to ask on the Erlang mailing list what may have changed since R15B03.

By the way, another useful thing shipped in 3.4 is more detailed memory use breakdown:
https://www.rabbitmq.com/blog/2014/10/30/understanding-memory-use-with-rabbitmq-3-4/

> Do you think my analysis of why RMQ crashes is on the mark? In addition,
> does anything else strike you as unusual or interesting in the above data?

Yes, your analysis makes sense. The 730 MB to 5.7 GB spike definitely does not seem normal. 

1. http://dieswaytoofast.blogspot.ru/2012/12/erlang-binaries-and-garbage-collection.html
2. http://prog21.dadgum.com/16.html

Michael Klishin

Apr 16, 2015, 7:45:27 PM
to Kapil Goyal, Jean-Sébastien Pédron, rabbitm...@googlegroups.com
On 17 April 2015 at 02:41:24, Michael Klishin (mkli...@pivotal.io) wrote:
> Binaries larger than 64 bytes are stored in a separate reference-counted heap.
> If you fan out a single large message to N queues (N Erlang processes), then
> 500+ processes hold references into that heap.
>
> For a reason unknown to me, those binaries are only collected once every
> process that has ever "seen" them is GC'ed [1]. In RabbitMQ, a published
> message passes through several processes even if it goes straight to the
> consumer, without hitting the message store.
>
> Those processes don't do enough reductions (a relative unit of process work,
> used by the VM scheduler) to trigger GCs often.

 Another excellent source of information: section 7.2 in [1].

1. https://s3.amazonaws.com/erlang-in-anger/text.v1.0.3.pdf