vm_memory_high_watermark set: RabbitMQ server is out of memory without clear cause


Mark Engelsman

Jun 6, 2018, 4:43:02 AM
to rabbitmq-users
Hi,

We are running a three-node RabbitMQ cluster and regularly have trouble keeping it up and running. Here is some information about our usage volumes:




At the busiest time of the day, we see a publish rate of around 1,300 messages/s, with all other numbers (connections, channels, consumers, etc.) staying roughly the same.

Last Sunday, the vm_memory_high_watermark alarm was set, so all publishers were blocked from publishing to the cluster.
Memory usage went up like crazy (from 0.5 GB to over 9 GB within 10 minutes), but there were no spikes in other numbers such as channels, number of connections, message rate, etc.




Here is the log of osmrabbit03 (the server with memory issues):

-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

=INFO REPORT==== 3-Jun-2018::16:53:13 ===
connection <0.18188.5690> (10.10.0.210:60616 -> 10.10.0.123:5672): user 'gbadmin' authenticated and granted access to vhost '/'

=WARNING REPORT==== 3-Jun-2018::17:14:09 ===
closing AMQP connection <0.18188.5690> (10.10.0.210:60616 -> 10.10.0.123:5672, vhost: '/', user: 'gbadmin'):
client unexpectedly closed TCP connection

=ERROR REPORT==== 3-Jun-2018::17:15:31 ===
Ranch listener rabbit_web_dispatch_sup_15672 had connection process started with cowboy_protocol:start_link/4 at <0.25283.5808> exit with reason: [{reason,{timeout,{gen_server,call,[<0.572.0>,{fetch,#Fun<rabbit_mgmt_db.22.104142656>, ... (all queues and exchanges)

(... about 10 more error reports like the one above)

=ERROR REPORT==== 3-Jun-2018::17:19:31 ===
Ranch listener rabbit_web_dispatch_sup_15672 had connection process started with cowboy_protocol:start_link/4 at <0.20964.5822> exit with reason: [{reason,{timeout,{gen_server,call,[<0.572.0>,{fetch,#Fun<rabbit_mgmt_db.22.104142656>, ... (all queues and exchanges)

=INFO REPORT==== 3-Jun-2018::17:19:38 ===
vm_memory_high_watermark set. Memory used:7125663744 allowed:6871733043

=WARNING REPORT==== 3-Jun-2018::17:19:38 ===
memory resource limit alarm set on node rabbit@osmrabbit03.

**********************************************************
*** Publishers will be blocked until this alarm clears ***
**********************************************************

=INFO REPORT==== 3-Jun-2018::17:19:39 ===
vm_memory_high_watermark clear. Memory used:3993567232 allowed:6871733043

=WARNING REPORT==== 3-Jun-2018::17:19:39 ===
memory resource limit alarm cleared on node rabbit@osmrabbit03

=WARNING REPORT==== 3-Jun-2018::17:19:39 ===
memory resource limit alarm cleared across the cluster

=ERROR REPORT==== 3-Jun-2018::17:20:01 ===
Ranch listener rabbit_web_dispatch_sup_15672 had connection process started with cowboy_protocol:start_link/4 at <0.14631.5824> exit with reason: [{reason,{timeout,{gen_server,call,[<0.572.0>,{fetch,#Fun<rabbit_mgmt_db.22.104142656>, ... (all queues and exchanges)
=ERROR REPORT==== 3-Jun-2018::17:20:24 ===
Ranch listener rabbit_web_dispatch_sup_15672 had connection process started with cowboy_protocol:start_link/4 at <0.22413.5825> exit with reason: [{reason,{timeout,{gen_server,call,[<0.567.0>,{fetch,#Fun<rabbit_mgmt_db.21.104142656>,[]},60000]}}},{mfa,{rabbit_mgmt_wm_connections,to_json,2}},{stacktrace,[{gen_server,call,3,[{file,"gen_server.erl"},{line,214}]},{rabbit_mgmt_db,submit_cached,2,[{file,"src/rabbit_mgmt_db.erl"},{line,707}]},{rabbit_mgmt_wm_connections,augmented,2,[{file,"src/rabbit_mgmt_wm_connections.erl"},{line,55}]},{rabbit_mgmt_wm_connections,to_json,2,[{file,"src/rabbit_mgmt_wm_connections.erl"},{line,43}]},{cowboy_rest,call,3,[{file,"src/cowboy_rest.erl"},{line,976}]},{cowboy_rest,set_resp_body,2,[{file,"src/cowboy_rest.erl"},{line,858}]},{cowboy_protocol,execute,4,[{file,"src/cowboy_protocol.erl"},{line,442}]}]},{req,[{socket,#Port<0.2067080>},{transport,ranch_tcp},{connection,close},{pid,<0.22413.5825>},{method,<<"GET">>},{version,'HTTP/1.1'},{peer,{{10,10,0,252},38085}},{host,<<"10.10.0.123">>},{host_info,undefined},{port,15672},{path,<<"/api/connections">>},{path_info,undefined},{qs,<<>>},{qs_vals,[]},{bindings,[]},{headers,[{<<"te">>,<<"deflate,gzip;q=0.3">>},{<<"connection">>,<<"TE, close">>},{<<"authorization">>,<<"Basic bW9uaXRvcmluZzpQN0tLTVo2V2tnSHpkUTF2">>},{<<"host">>,<<"10.10.0.123:15672">>},{<<"user-agent">>,<<"check_rabbitmq_connections libwww-perl/6.08">>}]},{p_headers,[{<<"if-modified-since">>,undefined},{<<"if-none-match">>,undefined},{<<"if-unmodified-since">>,undefined},{<<"if-match">>,undefined},{<<"accept">>,undefined},{<<"connection">>,[<<"te">>,<<"close">>]}]},{cookies,undefined},{meta,[{media_type,{<<"application">>,<<"json">>,[]}},{charset,undefined}]},{body_state,waiting},{buffer,<<>>},{multipart,undefined},{resp_compress,true},{resp_state,waiting},{resp_headers,[{<<"vary">>,[<<"accept">>,[<<", ">>,<<"accept-encoding">>],[<<", ">>,<<"origin">>]]},{<<"content-type">>,[<<"application">>,<<"/">>,<<"json">>,<<>>]},{<<"vary">>,<<"origin">>}]},{resp_body,<<>>},{onresponse,#Fun<rabbit_cowboy_middleware.onresponse.4>}]},{state,{context,{user,<<"monitoring">>,[monitoring],[{rabbit_auth_backend_internal,none}]},<<"P7KKMZ6WkgHzdQ1v">>,undefined}}]

=ERROR REPORT==== 3-Jun-2018::17:20:31 ===
Ranch listener rabbit_web_dispatch_sup_15672 had connection process started with cowboy_protocol:start_link/4 at <0.2776.5826> exit with reason: [{reason,{timeout,{gen_server,call,[<0.572.0>,{fetch,#Fun<rabbit_mgmt_db.22.104142656>, ... (all queues and exchanges)
=ERROR REPORT==== 3-Jun-2018::17:21:01 ===
Ranch listener rabbit_web_dispatch_sup_15672 had connection process started with cowboy_protocol:start_link/4 at <0.27697.5827> exit with reason: [{reason,{timeout,{gen_server,call,[<0.572.0>,{fetch,#Fun<rabbit_mgmt_db.22.104142656>, ... (all queues and exchanges)
=ERROR REPORT==== 3-Jun-2018::17:21:31 ===
Ranch listener rabbit_web_dispatch_sup_15672 had connection process started with cowboy_protocol:start_link/4 at <0.15965.5829> exit with reason: [{reason,{timeout,{gen_server,call,[<0.572.0>,{fetch,#Fun<rabbit_mgmt_db.22.104142656>, ... (all queues and exchanges)
=ERROR REPORT==== 3-Jun-2018::17:21:36 ===
Ranch listener rabbit_web_dispatch_sup_15672 had connection process started with cowboy_protocol:start_link/4 at <0.24446.5829> exit with reason: [{reason,{timeout,{gen_server,call,[<0.567.0>,{fetch,#Fun<rabbit_mgmt_db.21.104142656>,[]},60000]}}},{mfa,{rabbit_mgmt_wm_connections,to_json,2}},{stacktrace,[{gen_server,call,3,[{file,"gen_server.erl"},{line,214}]},{rabbit_mgmt_db,submit_cached,2,[{file,"src/rabbit_mgmt_db.erl"},{line,707}]},{rabbit_mgmt_wm_connections,augmented,2,[{file,"src/rabbit_mgmt_wm_connections.erl"},{line,55}]},{rabbit_mgmt_wm_connections,to_json,2,[{file,"src/rabbit_mgmt_wm_connections.erl"},{line,43}]},{cowboy_rest,call,3,[{file,"src/cowboy_rest.erl"},{line,976}]},{cowboy_rest,set_resp_body,2,[{file,"src/cowboy_rest.erl"},{line,858}]},{cowboy_protocol,execute,4,[{file,"src/cowboy_protocol.erl"},{line,442}]}]},{req,[{socket,#Port<0.2041008>},{transport,ranch_tcp},{connection,close},{pid,<0.24446.5829>},{method,<<"GET">>},{version,'HTTP/1.1'},{peer,{{10,10,0,252},38680}},{host,<<"10.10.0.123">>},{host_info,undefined},{port,15672},{path,<<"/api/connections">>},{path_info,undefined},{qs,<<>>},{qs_vals,[]},{bindings,[]},{headers,[{<<"te">>,<<"deflate,gzip;q=0.3">>},{<<"connection">>,<<"TE, close">>},{<<"authorization">>,<<"Basic bW9uaXRvcmluZzpQN0tLTVo2V2tnSHpkUTF2">>},{<<"host">>,<<"10.10.0.123:15672">>},{<<"user-agent">>,<<"check_rabbitmq_connections libwww-perl/6.08">>}]},{p_headers,[{<<"if-modified-since">>,undefined},{<<"if-none-match">>,undefined},{<<"if-unmodified-since">>,undefined},{<<"if-match">>,undefined},{<<"accept">>,undefined},{<<"connection">>,[<<"te">>,<<"close">>]}]},{cookies,undefined},{meta,[{media_type,{<<"application">>,<<"json">>,[]}},{charset,undefined}]},{body_state,waiting},{buffer,<<>>},{multipart,undefined},{resp_compress,true},{resp_state,waiting},{resp_headers,[{<<"vary">>,[<<"accept">>,[<<", ">>,<<"accept-encoding">>],[<<", ">>,<<"origin">>]]},{<<"content-type">>,[<<"application">>,<<"/">>,<<"json">>,<<>>]},{<<"vary">>,<<"origin">>}]},{resp_body,<<>>},{onresponse,#Fun<rabbit_cowboy_middleware.onresponse.4>}]},{state,{context,{user,<<"monitoring">>,[monitoring],[{rabbit_auth_backend_internal,none}]},<<"P7KKMZ6WkgHzdQ1v">>,undefined}}]

=ERROR REPORT==== 3-Jun-2018::17:22:01 ===
Ranch listener rabbit_web_dispatch_sup_15672 had connection process started with cowboy_protocol:start_link/4 at <0.5944.5831> exit with reason: [{reason,{timeout,{gen_server,call,[<0.572.0>,{fetch,#Fun<rabbit_mgmt_db.22.104142656>, ... (all queues and exchanges)
=INFO REPORT==== 3-Jun-2018::17:22:25 ===
vm_memory_high_watermark set. Memory used:10770513920 allowed:6871733043

=WARNING REPORT==== 3-Jun-2018::17:22:25 ===
memory resource limit alarm set on node rabbit@osmrabbit03.

**********************************************************
*** Publishers will be blocked until this alarm clears ***
**********************************************************

=INFO REPORT==== 3-Jun-2018::17:22:26 ===
vm_memory_high_watermark clear. Memory used:6206062592 allowed:6871733043

=WARNING REPORT==== 3-Jun-2018::17:22:26 ===
memory resource limit alarm cleared on node rabbit@osmrabbit033

-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

After this, we restarted the node that had memory issues:

-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
=INFO REPORT==== 3-Jun-2018::17:40:14 ===
Starting RabbitMQ 3.6.15 on Erlang 20.0
Copyright (C) 2007-2018 Pivotal Software, Inc.
Licensed under the MPL.  See http://www.rabbitmq.com/

=INFO REPORT==== 3-Jun-2018::17:40:14 ===
node           : rabbit@osmrabbit03
home dir       : C:\Windows
config file(s) : c:/Users/ADMINI~1/AppData/Roaming/RabbitMQ/rabbitmq.config
cookie hash    : k4DIRENyx5vP/nEsXKZKKg==
log            : C:/Users/ADMINI~1/AppData/Roaming/RabbitMQ/log/RABBIT~1.LOG
sasl log       : C:/Users/ADMINI~1/AppData/Roaming/RabbitMQ/log/RABBIT~2.LOG
database dir   : c:/Users/ADMINI~1/AppData/Roaming/RabbitMQ/db/RABBIT~1

=INFO REPORT==== 3-Jun-2018::17:40:20 ===
Memory high watermark set to 6553 MiB (6871733043 bytes) of 16383 MiB (17179332608 bytes) total

=INFO REPORT==== 3-Jun-2018::17:40:20 ===
Enabling free disk space monitoring

=INFO REPORT==== 3-Jun-2018::17:40:20 ===
Disk free limit set to 50MB

=INFO REPORT==== 3-Jun-2018::17:40:20 ===
Limiting to approx 8092 file handles (7280 sockets)

=INFO REPORT==== 3-Jun-2018::17:40:20 ===
FHC read buffering:  OFF
FHC write buffering: ON

=INFO REPORT==== 3-Jun-2018::17:40:21 ===
Waiting for Mnesia tables for 30000 ms, 9 retries left

=INFO REPORT==== 3-Jun-2018::17:40:21 ===
Waiting for Mnesia tables for 30000 ms, 9 retries left

=INFO REPORT==== 3-Jun-2018::17:40:21 ===
Waiting for Mnesia tables for 30000 ms, 9 retries left

=INFO REPORT==== 3-Jun-2018::17:40:21 ===
Priority queues enabled, real BQ is rabbit_variable_queue

=INFO REPORT==== 3-Jun-2018::17:40:21 ===
Starting rabbit_node_monitor

=INFO REPORT==== 3-Jun-2018::17:40:21 ===
Management plugin: using rates mode 'basic'

=INFO REPORT==== 3-Jun-2018::17:40:21 ===
msg_store_transient: using rabbit_msg_store_ets_index to provide index

=INFO REPORT==== 3-Jun-2018::17:40:21 ===
msg_store_persistent: using rabbit_msg_store_ets_index to provide index

=WARNING REPORT==== 3-Jun-2018::17:40:21 ===
msg_store_persistent: rebuilding indices from scratch

=INFO REPORT==== 3-Jun-2018::17:40:21 ===
started TCP Listener on [::]:5672

=INFO REPORT==== 3-Jun-2018::17:40:21 ===
started TCP Listener on 0.0.0.0:5672

=INFO REPORT==== 3-Jun-2018::17:40:21 ===
rabbit on node rabbit@osmrabbit01 up

=WARNING REPORT==== 3-Jun-2018::17:40:21 ===
Could not find handle.exe, please install from sysinternals

=INFO REPORT==== 3-Jun-2018::17:40:21 ===
rabbit on node rabbit@osmrabbit02 up

=INFO REPORT==== 3-Jun-2018::17:40:21 ===
Management plugin started. Port: 15672

=INFO REPORT==== 3-Jun-2018::17:40:21 ===
Statistics database started.

=INFO REPORT==== 3-Jun-2018::17:40:21 ===
Server startup complete; 6 plugins started.
 * rabbitmq_management
 * rabbitmq_management_agent
 * rabbitmq_web_dispatch
 * cowboy
 * amqp_client
 * cowlib

-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

Then we encountered the next issue: osmrabbit01 tried to sync with osmrabbit02 (to form the new master/slave setup), but the syncing took forever.
I ran 'rabbitmqctl status' before I restarted osmrabbit01 as well, and this was the result:




After this, publishing worked again, but the queues were corrupt and no data reached the consumers until the next morning (8 a.m.), when we noticed that nothing was coming into our analytical systems.
All queues were corrupt, and I had to force-delete and re-create them.

This memory issue happens far too often (on average more than 1.5 times per month) and is really hurting our business (and my team's reputation).

TL;DR:
How is it possible that this node runs out of memory so quickly without any clear cause? There are no other processes running on that server.
If more information is needed, I'll try to post anything we have. This is really frustrating me. Thanks in advance!






Karl Nilsson

Jun 6, 2018, 9:07:03 AM
to rabbitm...@googlegroups.com
Hi,

The errors you see in the log are timeouts that happen when the management API tries to aggregate connection metrics. How are you monitoring your RabbitMQ cluster? Do you have agents that poll it periodically?

The RabbitMQ top plugin is very good for getting a nice breakdown of memory use by Erlang process. I suggest you enable it, if it isn't already, and report which processes use the most memory. Then we can take it from there.
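(For reference, a minimal sketch of how the plugin can be enabled on Windows, assuming rabbitmq_top ships with your 3.6.x distribution and using the standard rabbitmq-plugins tool in the sbin directory; run it on each node you want to inspect:)

C:\Program Files\RabbitMQ Server\rabbitmq_server-3.6.15\sbin>rabbitmq-plugins.bat enable rabbitmq_top

Once enabled, the management UI gains "Top Processes" and "Top ETS Tables" views under the Admin tab, showing memory use per Erlang process.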


Cheers
Karl


Mark Engelsman

Jun 6, 2018, 10:02:50 AM
to rabbitmq-users
Hi Karl,

Thanks a lot for the tip about the Top plugin! I wasn't aware of it, but it does indeed provide the memory breakdown we are looking for in troubled times.
We monitor our RabbitMQ cluster using a New Relic plugin by 203 Solutions. It's not the best (it crashes regularly), but it does the job while it's running. A better monitoring tool would help us too.

The cluster has been running well since Monday morning, so it's hard to provide memory usage metrics for the problem periods (unless the plugin keeps a memory usage history somewhere, but I estimate the chances of that to be close to nil).

I was hoping that someone had encountered the same issue, with memory usage going up very fast for no obvious reason, and could share some gotchas or fixes.
In that case we could maybe fix it preventively instead of reacting to the next incident.

Nonetheless, thanks a lot! 



On Wednesday, June 6, 2018 at 15:07:03 UTC+2, Karl Nilsson wrote:

Luke Bakken

Jun 6, 2018, 10:24:56 AM
to rabbitmq-users
Hi Mark,

At the time of the memory usage increase, were your consumers keeping up with producers? Looking at your screenshot, at that moment your consumers were just a bit behind your producers.

Do you track the messages_ready and messages_unacknowledged statistics? Another indication that consumers aren't keeping up would be a spike in the io_write_count stat.

https://cdn.rawgit.com/rabbitmq/rabbitmq-management/v3.7.5/priv/www/doc/stats.html

If consumers aren't keeping up or acknowledging messages correctly, RabbitMQ will continue to store those messages in memory until various thresholds are met, and then will start paging them to disk. With a fast enough ingress rate, this can cause the memory watermark to be hit like you saw, blocking publishers.
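(A quick way to spot-check this from the command line, using standard rabbitmqctl queue info items; a sketch, not output from Mark's environment:)

rabbitmqctl.bat list_queues name messages_ready messages_unacknowledged consumers

A queue whose messages_ready or messages_unacknowledged count keeps growing is one where consumers are falling behind.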

Thanks,
Luke

Mark Engelsman

Jun 6, 2018, 10:41:34 AM
to rabbitmq-users
Hi Luke,

Thanks for your response. The consumers can easily handle the volume; we monitor the 'Consumer utilisation' and it's always 95%+.
The screenshot was a bit 'unlucky' in that respect; you can see here (taken two minutes ago) that the two rates are usually equal.


We have also set alarms on the number of messages in the queues. The cluster should still be able to cope with just over 1M messages stored, but an alert fires whenever a queue has held more than 50K messages at any point in the past 5 minutes. That alert didn't trigger, so it was definitely not the number of messages that caused the memory usage.

Thanks for thinking along with me, and for the overview you linked; I didn't know that one either!
Mark


On Wednesday, June 6, 2018 at 16:24:56 UTC+2, Luke Bakken wrote:

Luke Bakken

Jun 6, 2018, 11:12:44 AM
to rabbitmq-users
Hi Mark,

I'm glad you monitor those numbers - so much for the obvious cause.

Basically, nothing seemed out of the ordinary except the memory usage spike on one node?

You say this happens at least once a month. Is the memory spike always on a single node? If so, is it always the same node? Does this happen at about the same day of the month, time of day, etc? I'm just looking for anything consistent :-)

Since you're running on Windows, do you monitor the memory consumption of the erl.exe process itself or the memory usage of the entire server? The reason I ask is that maybe there's a discrepancy between the internal Erlang memory reporting and what the OS actually reports.

Thanks,
Luke

Mark Engelsman

Jun 6, 2018, 11:43:21 AM
to rabbitmq-users
Hi Luke,

Basically, nothing seemed out of the ordinary except the memory usage spike on one node?
Indeed; one node spikes. It's always just one node.

Is the memory spike always on a single node? If so, is it always the same node?
It's always on a single node, yes. Is it always the same node? Yes and no. It's always the master node (I think; I'm 95% sure), which hosts all the (master) queues, but it's not always the same physical machine.
Because we have a three-node setup, the last node to go down is always the one that does basically 'nothing' once the cluster is healthy again; the queues etc. then live on the other two nodes (as master and slave).


Does this happen at about the same day of the month, time of day, etc?
In the evening, our application has the most users online (200K or so) and there are some in-game events then, so it's the busiest time, and it seems fairly obvious that when there are issues, it's around that time.
So yes, normally in the evening. I don't think it has much to do with the load, though, because on 28 out of 30 days a month we have absolutely no problems at all with our cluster. (It would be great if it happened during office hours, but of course it doesn't ;-).)

Since you're running on Windows, do you monitor the memory consumption of the erl.exe process itself or the memory usage of the entire server?
The monitoring covers the entire server, but we know RabbitMQ is using a lot because of the 'vm_memory_high_watermark set' entries we see in the logs.

Just asking these questions already helps us to think in the right direction, thank you.

Mark



On Wednesday, June 6, 2018 at 17:12:44 UTC+2, Luke Bakken wrote:

Luke Bakken

Jun 6, 2018, 12:17:34 PM
to rabbitmq-users
Hi Mark,

With regard to memory use and vm_memory_high_watermark, you may want to read this:

https://www.rabbitmq.com/memory-use.html

The reason I asked about monitoring memory use of your server as opposed to just RabbitMQ is due to how RMQ calculates its own memory usage. In version 3.6.15, RabbitMQ on Windows uses its internal memory statistics to reason about its own memory usage. This might be different enough from the actual memory usage to trigger the memory alarm when in fact your server has plenty of free RAM.

Could you check your monitoring system to see what the actual free memory is for the problematic node during the memory usage spike? Since your server has 16GiB of memory, and RMQ claims to be using 9GiB, I would expect your monitoring system to report about 7GiB free system-wide.
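(One way to compare the broker's own accounting with what the OS reports, assuming the standard rabbitmqctl eval interface; this is a sketch, not an official diagnostic:)

rabbitmqctl.bat eval "erlang:memory()."

This prints the Erlang VM's own view (total, processes, binary, ets, and so on). If that total is far below what Task Manager shows for erl.exe, the difference is typically memory the allocators have reserved but not returned to the OS, rather than a leak inside RabbitMQ itself.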

Thanks,
Luke

Michael Klishin

Jun 6, 2018, 1:48:21 PM
to rabbitm...@googlegroups.com
http://www.rabbitmq.com/memory-use.html should be your first step in investigating anything
related to memory usage. Don't guess, collect data instead.




--
MK

Staff Software Engineer, Pivotal/RabbitMQ

Mark Engelsman

Jun 11, 2018, 3:11:21 AM
to rabbitmq-users
Hi Michael,

I agree, and that's what we try to do too. Last Saturday we were in trouble again (again at the weekend).
The strange thing is that 'osmrabbit02' was our 'main node', the one that held all the master queues. 'osmrabbit03' held the replica (the slave), but 01 ran out of memory really, really fast:


Within 10 minutes, we had serious problems again.
We had installed the top plugin as advised in this topic, but the management plugin was not responding, so we could not see which process was using all of our memory.

As I said, there is nothing else running on this server. The monitoring covers the whole server, but we know the usage is coming from erl.exe:


So the problem is that we can't see which Erlang process is using all of our memory in times of trouble, which is when we need that information the most.
Could upgrading to 3.7.x fix the problem? We're currently running 3.6.15 on Erlang 20.0.

Kind regards,
Mark



On Wednesday, June 6, 2018 at 19:48:21 UTC+2, Michael Klishin wrote:

Mark Engelsman

Jun 11, 2018, 3:14:27 AM
to rabbitmq-users
Hi Luke,

Thank you for thinking along with me. As you can see from the post above, it's not only RabbitMQ reporting high memory usage; the server reports it too (see the second screenshot).

Kind regards and thanks,
Mark

On Wednesday, June 6, 2018 at 18:17:34 UTC+2, Luke Bakken wrote:

Luke Bakken

Jun 11, 2018, 12:19:35 PM
to rabbitmq-users
Hi Mark,

Thanks for that info and screenshot. That's a very strange issue, especially since 01 isn't a queue master. Let me come up with a command you can run to find Erlang VM processes that are using a large amount of memory.

One thing you could try out is enabling background GC to see if it prevents this memory issue from happening again. Here is the relevant configuration:
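(A minimal sketch of that setting in the classic rabbitmq.config Erlang-terms format, assuming the background_gc_enabled and background_gc_target_interval keys that come up later in this thread:)

[
  {rabbit, [
    %% run an additional garbage-collection sweep in the background
    {background_gc_enabled, true},
    %% target interval between sweeps, in milliseconds
    {background_gc_target_interval, 60000}
  ]}
].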


I'm wondering if there is a "rush" of incoming binary data that, for whatever reason, the VM thinks it has to wait to collect. You'll have to restart your nodes after applying that setting.

3.7.X might fix the issue, but it would basically be guessing as we don't know the cause.

Thanks for working with us on this. It's a baffling issue.
Luke

Luke Bakken

Jun 11, 2018, 10:28:05 PM
to rabbitmq-users
Hi Mark,

I checked out the recon library, compiled it on Windows, and attached an archive of the beam files to this response (https://github.com/ferd/recon)

Prior to the next memory event, extract the archive to C:\recon or some other path of your choice.

Then, during the next memory event, please open a "RabbitMQ Command Prompt (sbin dir)" command prompt (it's installed to the start menu) and run the following command (using the path to recon if you chose a different one):

rabbitmqctl.bat eval "code:add_patha(""C:/recon""), recon:bin_leak(5)."

This will provide the top five processes whose memory dropped the most after running garbage collection. If you're able to run that during a memory event we may get some valuable information. If not, don't worry, I'll come up with something else. You can run this any time if you'd like to test it out - it shouldn't add much load to your machine.
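(If bin_leak doesn't point at anything, a related recon call lists the processes currently holding the most memory without forcing a garbage collection first; this is a suggestion on top of Luke's instructions, using the same code path setup:)

rabbitmqctl.bat eval "code:add_patha(""C:/recon""), recon:proc_count(memory, 5)."

recon:proc_count/2 returns the top N processes for the given attribute, here memory.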

Thanks,
Luke
recon.zip

Mark Engelsman

Jun 12, 2018, 8:11:06 AM
to rabbitmq-users
Hi Luke,

Thanks a lot for thinking with me, really appreciate it.
I set 'background_gc_enabled' to true; the interval was already at 60,000 (ms).

I also tried to run recon:bin_leak(5), but it throws errors:

C:\Program Files\RabbitMQ Server\rabbitmq_server-3.6.15\sbin>rabbitmqctl eval "code:add_patha(""C:/recon""), recon:bin_leak(5)."
Error: {badarith,[{recon,'-bin_leak/1-lc$^1/1-1-',1,
                         [{file,"src/recon.erl"},{line,324}]},
                  {erl_eval,do_apply,6,[{file,"erl_eval.erl"},{line,670}]},
                  {rpc,'-handle_call_call/6-fun-0-',5,
                       [{file,"rpc.erl"},{line,197}]}]}

C:\Program Files\RabbitMQ Server\rabbitmq_server-3.6.15\sbin>rabbitmqctl eval "code:add_path(""C:/recon""), recon:bin_leak(5)."
Error: {badarith,[{recon,'-bin_leak/1-lc$^1/1-1-',1,
                         [{file,"src/recon.erl"},{line,324}]},
                  {erl_eval,do_apply,6,[{file,"erl_eval.erl"},{line,670}]},
                  {rpc,'-handle_call_call/6-fun-0-',5,
                       [{file,"rpc.erl"},{line,197}]}]}

C:\Program Files\RabbitMQ Server\rabbitmq_server-3.6.15\sbin>rabbitmqctl.bat eval "code:add_patha(""C:/recon""), recon:bin_leak(5)."
Error: {badarith,[{recon,'-bin_leak/1-lc$^1/1-1-',1,
                         [{file,"src/recon.erl"},{line,324}]},
                  {erl_eval,do_apply,6,[{file,"erl_eval.erl"},{line,670}]},
                  {rpc,'-handle_call_call/6-fun-0-',5,
                       [{file,"rpc.erl"},{line,197}]}]}

C:\Program Files\RabbitMQ Server\rabbitmq_server-3.6.15\sbin>rabbitmqctl.bat eval "code:add_path(""C:/recon""), recon:bin_leak(5)."
Error: {badarith,[{recon,'-bin_leak/1-lc$^1/1-1-',1,
                         [{file,"src/recon.erl"},{line,324}]},
                  {erl_eval,do_apply,6,[{file,"erl_eval.erl"},{line,670}]},
                  {rpc,'-handle_call_call/6-fun-0-',5,
                       [{file,"rpc.erl"},{line,197}]}]}

(I thought add_patha might be a typo for add_path, but I wasn't sure. I tried four different commands, all with the same output.)

Kind regards and thanks again,
Mark

On Tuesday, June 12, 2018 at 04:28:05 UTC+2, Luke Bakken wrote:

Luke Bakken

Jun 12, 2018, 9:43:06 AM
to rabbitmq-users
Hi Mark,

That's surprising! I bet compiling recon yourself will resolve the issue. Here are the steps:

Clone these repositories -



Open a cmd shell, and add Erlang to your PATH:

C:\>set PATH=C:\Program Files\erl9.0\bin;%PATH%

Stay in the same shell, cd to the rebar clone and compile it:

C:\>cd \Users\me\rebar
C:\Users\me\rebar>.\bootstrap.bat

Copy a couple files into your recon clone:

C:\Users\me\rebar>copy /Y rebar C:\Users\me\recon
C:\Users\me\rebar>copy /Y rebar.cmd C:\Users\me\recon

Change to your recon clone, and compile it with rebar.cmd:

C:\>cd \Users\me\recon
C:\Users\me\recon>.\rebar.cmd compile

The files you want are in .\ebin:

C:\Users\me\recon>copy /Y .\ebin\*.* C:\recon

Then, give it another try. I've attached a screenshot of a successful run from my Windows VM.

FYI, code:add_patha/1 adds the directory to the front of the code search path (http://erlang.org/doc/man/code.html#add_patha-1).

Thanks -
Luke
recon-success.png

Mark Engelsman

Jun 12, 2018, 11:03:20 AM
to rabbitmq-users
Hi Luke,

Alright, thanks. I'll give it a try.

By the way, after setting 'background_gc_enabled' to true, I no longer see rabbit_mgmt_db_cache_connections and rabbit_mgmt_db_cache_queues at the top of the process list. The queue cache uses about 2.8 MB now, where it was over 19 MB before. I suspect it was some management-related process that used all of the memory (which would also be in line with last Saturday, when the node that was doing nothing ran out of memory). I have good hopes that this will fix our problems.

I'll try to install recon, following the how-to you provided. Thanks!

Kind regards,
Mark 

On Tuesday, June 12, 2018 at 15:43:06 UTC+2, Luke Bakken wrote:

Mark Engelsman

Jun 13, 2018, 8:20:07 AM
to rabbitmq-users

Hi Luke,

I did all the steps, but got the same result:


Pretty strange, because the compilation all went fine.

By the way, I installed (a trial version of) WombatOAM for monitoring Erlang and RabbitMQ. Do you have any experience with this tool, and any tips on where to look that could point us in the right direction for this particular issue? I know it's a long shot, but I also know your knowledge of RabbitMQ and Erlang is much greater than mine, so it's worth a try.

Thanks!

Mark




On Tuesday, June 12, 2018 at 15:43:06 UTC+2, Luke Bakken wrote:

Luke Bakken

Jun 13, 2018, 8:27:06 AM
to rabbitmq-users
Hi Mark,

I don't have any experience with WombatOAM, but maybe someone else who reads this list does.

I will investigate the badarith error today to see what may be causing it. I'll have some other commands for you to run.

Thanks,
Luke

Ayanda Dube

Jun 13, 2018, 9:10:07 AM
to rabbitm...@googlegroups.com
Hi Mark

Regarding WombatOAM's capabilities around RabbitMQ, read [1][2]. You can apply similar principles to other pools of exposed metrics (e.g. threshold alarming based on message queue lengths), which could help troubleshoot this problem.

PS: We're closely following this issue as well and are brainstorming workarounds, following recent dialogue with some of our engineers on pinpointing the root cause (which could lead to something more concrete).


Mark Engelsman

Jun 13, 2018, 9:18:38 AM
to rabbitmq-users
Hi Luke,

Don't bother anymore; I got it working now. The output looks like the output you got, so next time we should be able to get some interesting statistics from it.
(I didn't do anything different; I just tried running it again and it worked.)

Thanks!

On Wednesday, June 13, 2018 at 14:27:06 UTC+2, Luke Bakken wrote:

Luke Bakken

Jun 13, 2018, 11:51:55 AM
to rabbitmq-users
Hi Mark,

That is very strange. Let me know if you run into another badarith error and we can get to the bottom of it.

Could you double-check that you are compiling from the master branch of recon? The only place where I think an uncaught badarith error could happen is here, but I have no idea how a non-integer value could be in the Val variable.

Thanks,
Luke

Luke Bakken

Jun 19, 2018, 1:32:07 PM
to rabbitmq-users
Hi Mark,

Do you have any updates from this environment?



Mark Engelsman

Jun 20, 2018, 5:38:30 AM
to rabbitmq-users
Hi Luke,

Since I set 'background_gc_enabled' to true, I haven't run into the 'vm_memory_high_watermark set' issue anymore. That's the good news.
What we did see was that one of our workers, which publishes a lot of messages to RabbitMQ at a certain time of day, became really slow.
We decided to stop publishing from these workers on Friday and saw that the average processing time of the workers dropped sharply.



1: The moment we had downtime again;
2: The moment we enabled the background GC;
3: The moment we stopped publishing to RabbitMQ.

I think the configuration change affects the publishing time (this graph points in that direction).

We still don't know which process used all the memory, but the good news is that we don't run into memory issues anymore.
The bad news is, of course, that we no longer use RabbitMQ for everything we want to. To be honest, I don't really know how to proceed at this point.

Thank you for still showing interest in our issues and your help.

Kind regards,
Mark 

On Tuesday, June 19, 2018 at 19:32:07 UTC+2, Luke Bakken wrote:

Michael Klishin

Jun 20, 2018, 6:46:15 AM
to rabbitm...@googlegroups.com
Background GC is rarely a fundamental solution. Understanding which processes individually use the most memory, and tweaking allocators [1] on Erlang 20.3.8
(20.0 doesn't have certain options that [1] recommends), is what I'd do.

Can a memory usage breakdown [2] *chart* and rabbitmq-top results be posted in this thread?
A screenshot with numbers is quite painful to work with.

Since background GC primarily helps with lazy binary heap collection, it's worth collecting metrics on how large the messages are. There are two time-proven ways to reduce them:

 * Payload compression (publishers should also provide a hint in the content-encoding property so that your consumers can support both compressed and uncompressed payloads during a transition period; see the sketch below)
 * Storing large payloads in a key/value store of some kind and passing around keys
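A sketch of the first option, using the Erlang amqp_client purely for illustration; the exchange name, routing key and payload variable are made up, and any client library can set the same content_encoding property:

%% requires: -include_lib("amqp_client/include/amqp_client.hrl").
Payload = zlib:gzip(term_to_binary(SimulationResult)),
Props   = #'P_basic'{content_encoding = <<"gzip">>,
                     delivery_mode    = 2},            %% 2 = persistent
Publish = #'basic.publish'{exchange    = <<"simulations">>,
                           routing_key = <<"league.results">>},
amqp_channel:cast(Channel, Publish, #amqp_msg{props = Props, payload = Payload}).

Consumers can then check the content_encoding property and only gunzip when it says "gzip", so compressed and uncompressed producers can coexist during the transition.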



Michael Klishin

Jun 20, 2018, 6:47:56 AM
to rabbitm...@googlegroups.com
3.7.x also has `rabbitmq-diagnostics memory_breakdown`, which reports the percentage of each category
as well as absolute values, but the only version mentioned in this thread that I could find is 3.6.15.
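(For example, once on 3.7.x, a sketch assuming the standard -n flag for targeting a node:)

rabbitmq-diagnostics.bat memory_breakdown -n rabbit@osmrabbit03

That shows the same categories as the memory section of rabbitmqctl status, but with percentages, which makes it easier to spot which category is growing.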

--
MK

Staff Software Engineer, Pivotal/RabbitMQ

Michael Klishin

Jun 20, 2018, 6:56:36 AM
to rabbitm...@googlegroups.com
Also, are your applications using durable queues and publishing messages as persistent? This can be seen
on the queue page in the management UI (both the queue properties and the message breakdown: RAM vs. disk).

If publishers explicitly tell RabbitMQ to keep messages in memory [1], it will increase GC pressure and likely change the behavior
of the allocators. You may want to consider making the queues lazy in that case [2], until the apps can be updated to not publish messages
as transient.
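(Making queues lazy does not require an application change; it can be applied with a policy. A sketch, assuming every queue should match; note the doubled quotes needed on Windows:)

rabbitmqctl.bat set_policy lazy-queues "^" "{""queue-mode"":""lazy""}" --apply-to queues

Lazy queues move message bodies to disk as early as possible, which keeps per-queue memory small at the cost of extra disk I/O.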

Luke Bakken

Jun 20, 2018, 11:41:50 AM
to rabbitmq-users
Hi Mark,

It's not immediately apparent what the red and blue lines in your chart represent. Is it memory consumption? For now, I'm assuming it's the memory stat from RabbitMQ.

Just so I understand your chart and the steps you took:

* At point #1, you had a node failure due to hitting the memory watermark. You restarted one node, which caused the dip in the chart.

* Memory was increasing, so background GC was enabled. Since enabling this setting requires a node restart, I'm wondering why we don't see a dip like #1.

* #3 shows what happened when you stopped a worker that publishes a lot of messages to RabbitMQ - you can see the memory consumption drop (at least, that's what it looks like).

> one of our workers, that publishes a lot of messages to RabbitMQ at a certain time, became really slow.

What is "a lot of messages" exactly? What size is each message? What does "really slow" mean in comparison to the expected publish rate? What happened to other publishers and consumers?

Thanks,
Luke

On Wednesday, June 20, 2018 at 2:38:30 AM UTC-7, Mark Engelsman wrote:

Mark Engelsman

Jun 21, 2018, 3:59:38 AM
to rabbitmq-users
Hi Michael,

Thank you for your tips. I will upgrade to 3.7.x next week, when we have planned downtime.

I tried to provide as much memory usage information as possible. Some of it is still in screenshots, because the layout got all messed up when I copied the tables here.

osmrabbit01:

{memory,
     [{connection_readers,2929864},
      {connection_writers,173416},
      {connection_channels,1346816},
      {connection_other,2960136},
      {queue_procs,25109360},
      {queue_slave_procs,0},
      {plugins,17234376},
      {other_proc,12157888},
      {metrics,1386872},
      {mgmt_db,15314408},
      {mnesia,385232},
      {other_ets,2436472},
      {binary,36164256},
      {msg_index,816480},
      {code,25094855},
      {atom,1156305},
      {other_system,15955792},
      {allocated_unused,212211776},
      {reserved_unallocated,0},
      {total,372834304}]},
 {alarms,[]},



osmrabbit02:
{memory,
     [{connection_readers,2512152},
      {connection_writers,196704},
      {connection_channels,1353392},
      {connection_other,2556928},
      {queue_procs,672128},
      {queue_slave_procs,17646048},
      {plugins,3576376},
      {other_proc,11022560},
      {metrics,1290888},
      {mgmt_db,3902504},
      {mnesia,382984},
      {other_ets,5984880},
      {binary,24335712},
      {msg_index,799936},
      {code,24980007},
      {atom,1041593},
      {other_system,12712616},
      {allocated_unused,138329232},
      {reserved_unallocated,0},
      {total,250150912}]},
 {alarms,[]},


osmrabbit03:
{memory,
     [{connection_readers,2531136},
      {connection_writers,49792},
      {connection_channels,205448},
      {connection_other,2511960},
      {queue_procs,44720},
      {queue_slave_procs,0},
      {plugins,21592888},
      {other_proc,351472},
      {metrics,1278392},
      {mgmt_db,18566400},
      {mnesia,393544},
      {other_ets,2326944},
      {binary,9210112},
      {msg_index,44688},
      {code,24980007},
      {atom,1041593},
      {other_system,11419272},
      {allocated_unused,50842096},
      {reserved_unallocated,0},
      {total,147390464}]},
 {alarms,[]},



Regarding the size of the messages, I hope this provides some insight:


I will answer your other questions (and also Luke's message) later today; I have a meeting with some colleagues about RabbitMQ now. Thank you very much for your time, and if you need more information, let me know and I'll try to provide whatever insight into RabbitMQ and Erlang I can.

Kind regards,
Mark Engelsman


On Wednesday, June 20, 2018 at 12:46:15 UTC+2, Michael Klishin wrote:

Mark Engelsman

Jun 21, 2018, 6:58:25 AM
to rabbitmq-users
Hi Luke,

Sorry about being unclear on this. We use RabbitMQ to send data from our simulations to an external storage system.
In our system (a game), we simulate the outcome of matches once a day. This starts around 16:00 and ends around 22:00.
This means our load is not spread evenly over the day, but peaks during the time window I just mentioned.

What you see in the graph is the average simulation time per league. The moment we had downtime, we didn't publish anything (publishers were blocked because of the memory issues). At point 3, we disabled sending simulation data to RabbitMQ, so we are missing data in our external storage system from that point on.

You can see that when we disabled publishing to RabbitMQ, the simulation per league was quick again.

Kind regards,
Mark 

On Wednesday, June 20, 2018 at 17:41:50 UTC+2, Luke Bakken wrote:

Michael Klishin

Jun 21, 2018, 8:57:30 AM
to rabbitm...@googlegroups.com
All those values are pretty low and nothing really stands out (at the moment of capture, of course).

I suspect that tweaking runtime allocator flags (or running on 3.7.6 with Erlang 20.3, where a more efficient combination is used by default)
could help. Something might be causing the allocators to try allocating very large blocks, but without Recon
data it is impossible to guess what that might be.


Luke Bakken

Jul 2, 2018, 10:41:00 AM
to rabbitmq-users
Hi Mark,

Did you get a chance to upgrade? I'm wondering if the updated memory allocator flags helped your situation out.

Thanks,
Luke

Mark Engelsman

Jul 6, 2018, 2:41:38 AM
to rabbitmq-users
Hi Luke,

Sorry for the delayed response.
Last week I upgraded RabbitMQ to 3.7.6.
We haven't run into the memory issues anymore since I enabled the garbage collector in the background, and after the upgrade we still don't have any issues. background_gc_enabled is still set to true.

I think we will leave it like this, because it looks like it's finally a stable platform. 

What do you mean by the 'updated memory allocator flags'?

Thanks a lot for your help (and also @Michael Klishin). Really appreciated!

Mark

On Monday, July 2, 2018 at 16:41:00 UTC+2, Luke Bakken wrote:

Michael Klishin

Jul 6, 2018, 5:46:21 AM
to rabbitm...@googlegroups.com
Hi Mark,

It's OK to keep background GC on if it makes a difference for you; that's why we kept it around,
even though on some workloads it has a negative effect [on latency and productive CPU utilization].

RabbitMQ 3.7.6 uses a new set of runtime memory allocator flags if they are available (Erlang 20.2.1 or later, IIRC);
see [1].

