Sudden increase in RAM usage, memory alarm, 3.6.14 (Erlang 19.3) cluster on Windows


Kevin Freeman

Jan 5, 2018, 7:52:15 PM
to rabbitmq-users
Has anyone else observed a dramatic rise in RAM consumption with 3.6.14?  Posting here in case this is not a random occurrence.

A 3-node 3.6.14 cluster (Windows 2012R2, 64GB RAM, 10GB swap per node) had been running stably for 20 days, with per-node RAM usage below 3GB. The memory threshold is at the default (about 25GB on the 64GB servers).

Two of the nodes exhibited a sharp rise in RAM usage, from 2GB to 30GB over a span of 5 minutes, and the memory alarm was triggered. On one of the impacted nodes the rabbit service restarted (the event log contained "RabbitMQ: Restarted erlang machine."); monitoring tools show that erl.exe RAM usage on that node exceeded 40GB during the event. After the service restart, RAM usage has remained at the typical level. The other impacted node did not auto-restart; after a manual service restart its RAM usage has also dropped back to typical levels.

At the time of the event rabbitmq_top was not active, and rabbitmqctl.bat status on the impacted nodes returned:
Status of node rabbit@QUEPR1
Error: unable to connect to node rabbit@QUEPR1: nodedown

At the time of day of the event the cluster was receiving about 2500 messages/sec and delivering about 5000/sec. Monitoring tools showed no queues with a notable ready or unacked count in the minutes prior to the event. The impacted nodes also showed a spike in disk I/O, from a baseline of about 700 ops/sec to over 25,000 ops/sec during the event.
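As a sanity check on the "about 25GB" threshold mentioned above (this calculation is mine, not from any log): RabbitMQ's default vm_memory_high_watermark is 0.4, i.e. the alarm fires at 40% of installed RAM.

```python
# Back-of-the-envelope check of the default memory alarm threshold.
# RabbitMQ's vm_memory_high_watermark defaults to 0.4 (40% of RAM).
DEFAULT_WATERMARK = 0.4

def memory_alarm_threshold_gb(total_ram_gb, watermark=DEFAULT_WATERMARK):
    """RAM usage (in GB) at which the memory alarm fires."""
    return total_ram_gb * watermark

print(memory_alarm_threshold_gb(64))  # 25.6 -- matches the ~25GB observed
```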

Could this event have been caused by garbage collection? What else could cause a sharp rise in RAM usage (and an even sharper spike in disk I/O) in a short period of time? I realize the data provided are thin ... I have included a screenshot of the 1-day overview of the cluster.

The Rabbit logs contain memory alarm messages but no other errors. The SASL log contains the entries below, which I do not understand. In the log, 13:39 is near the time of the initial memory alarm, 13:48 and 13:52 correspond to erl.exe RAM spikes over 40GB, and 14:02 is the time the service was restarted.

=SUPERVISOR REPORT==== 4-Jan-2018::13:39:12 ===
     Supervisor: {<0.11889.2949>,amqp_channel_sup_sup}
     Context:    shutdown_error
     Reason:     shutdown
     Offender:   [{nb_children,1},
                  {name,channel_sup},
                  {mfargs,
                      {amqp_channel_sup,start_link,
                          [direct,<0.28377.2948>,
                           <<"<rab...@QUEPR1.1.28377.2948>">>]}},
                  {restart_type,temporary},
                  {shutdown,infinity},
                  {child_type,supervisor}]


=SUPERVISOR REPORT==== 4-Jan-2018::13:48:12 ===
     Supervisor: {<0.28526.4447>,amqp_channel_sup_sup}
     Context:    shutdown_error
     Reason:     shutdown
     Offender:   [{nb_children,1},
                  {name,channel_sup},
                  {mfargs,
                      {amqp_channel_sup,start_link,
                          [direct,<0.2645.5047>,
                           <<"<rab...@QUEPR1.1.2645.5047>">>]}},
                  {restart_type,temporary},
                  {shutdown,infinity},
                  {child_type,supervisor}]


=SUPERVISOR REPORT==== 4-Jan-2018::13:52:39 ===
     Supervisor: {<0.13527.6029>,rabbit_channel_sup_sup}
     Context:    shutdown_error
     Reason:     shutdown
     Offender:   [{nb_children,1},
                  {name,channel_sup},
                  {mfargs,{rabbit_channel_sup,start_link,[]}},
                  {restart_type,temporary},
                  {shutdown,infinity},
                  {child_type,supervisor}]


=SUPERVISOR REPORT==== 4-Jan-2018::14:02:12 ===
     Supervisor: {<0.32711.0>,rabbit_channel_sup_sup}
     Context:    shutdown_error
     Reason:     shutdown
     Offender:   [{nb_children,2},
                  {name,channel_sup},
                  {mfargs,{rabbit_channel_sup,start_link,[]}},
                  {restart_type,temporary},
                  {shutdown,infinity},
                  {child_type,supervisor}]


=SUPERVISOR REPORT==== 4-Jan-2018::14:02:15 ===
     Supervisor: {<0.285.1>,rabbit_channel_sup_sup}
     Context:    shutdown_error
     Reason:     shutdown
     Offender:   [{nb_children,1},
                  {name,channel_sup},
                  {mfargs,{rabbit_channel_sup,start_link,[]}},
                  {restart_type,temporary},
                  {shutdown,infinity},
                  {child_type,supervisor}]

[Attachment: Screen Shot 2018-01-05 at 1.26.40 PM.png]

Michael Klishin

Jan 7, 2018, 11:33:17 AM
to rabbitm...@googlegroups.com
Only looking at a number of metrics (such as the memory breakdown and the total number of messages enqueued) can provide an informed answer.
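Not commands from the original post, but on 3.6.x a memory breakdown along those lines can be captured on the affected node with something like the following (rabbitmq_top adds per-process detail if enabled):

```
REM Node status, including memory broken down by category
rabbitmqctl.bat status

REM The same memory breakdown straight from the VM
rabbitmqctl.bat eval "rabbit_vm:memory()."

REM Per-queue depth and memory footprint
rabbitmqctl.bat list_queues name messages messages_ready messages_unacknowledged memory
```

Capturing these periodically (e.g. once a minute) before the next event would show which category ballooned.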




--
MK

Staff Software Engineer, Pivotal/RabbitMQ

Luke Bakken

Jan 7, 2018, 5:53:15 PM
to rabbitmq-users
Hi Kevin -

Your statement makes it sound like you recently upgraded. If so, what version of RabbitMQ were you using prior to 3.6.14?


"The impacted nodes did show a spike in disk I/O during the event, from a baseline of about 700/sec to over 25000/sec during the event."

When available RAM decreases, RabbitMQ pages enqueued messages to disk, which I suspect is why you saw the increased disk I/O. Note that this is just a guess; without monitoring-system or RabbitMQ stats it's impossible to be sure. During the event, did any consumer processes stop or drastically slow down?
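To put rough numbers on that guess (assuming the defaults: vm_memory_high_watermark = 0.4 and vm_memory_high_watermark_paging_ratio = 0.5), paging to disk would begin well below the alarm threshold on your nodes:

```python
# Illustration of when RabbitMQ starts paging messages to disk,
# assuming default settings: the memory alarm fires at 40% of RAM,
# and paging begins at 50% of that watermark.
WATERMARK = 0.4       # vm_memory_high_watermark default
PAGING_RATIO = 0.5    # vm_memory_high_watermark_paging_ratio default

def paging_threshold_gb(total_ram_gb):
    """RAM usage (in GB) at which message paging to disk begins."""
    return total_ram_gb * WATERMARK * PAGING_RATIO

print(paging_threshold_gb(64))  # 12.8 GB on the 64GB nodes described
```

So a rapid climb from 2GB toward 30GB would cross the ~12.8GB paging threshold first, consistent with the disk I/O spike preceding the alarm.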

Luke