Has anyone else observed a dramatic rise in RAM consumption with 3.6.14? Posting here in case this is not a random occurrence.
A 3-node 3.6.14 cluster (Windows 2012R2, 64GB RAM, 10GB swap per node) had been running stably for 20 days, with per-node RAM usage below 3GB. The memory threshold is at the default (about 25GB on these 64GB servers). Two of the nodes exhibited a sharp rise in RAM usage, from 2GB to 30GB over a span of about 5 minutes, and the memory alarm was triggered. On one of the impacted nodes the RabbitMQ service restarted itself (the event log contained "RabbitMQ: Restarted erlang machine."); monitoring shows that erl.exe RAM usage on that node exceeded 40GB during the event. After the service restart, RAM usage has remained at the typical level. The other impacted node did not auto-restart, but after a manual service restart its RAM usage also dropped back to typical levels. At the time of the event rabbitmq_top was not enabled, and rabbitmqctl.bat status on the impacted nodes returned:
At the time of the event the cluster was receiving about 2,500 messages/sec and delivering about 5,000/sec. Monitoring showed no queues with a notable ready or unacked count in the minutes prior to the event. The impacted nodes did, however, show a spike in disk I/O during the event, from a baseline of about 700 operations/sec to over 25,000/sec.
Could this event have been caused by garbage collection? What else could cause such a sharp rise in RAM usage (and an even sharper spike in disk I/O) in such a short period of time? I realize the data provided are thin ... I have included a screenshot of the 1-day overview of the cluster.
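For what it's worth, here is a sketch of the commands I intend to run if this recurs, to capture a memory breakdown while the spike is in progress. This assumes a 3.6.x broker on Windows (rabbitmqctl.bat from the sbin directory); I believe the recon library ships with 3.6.x, but treat the recon calls as an assumption.

```shell
# Broker's own memory breakdown (queues, binaries, connections, etc.)
rabbitmqctl.bat status

# Erlang VM memory by category: processes, binary, ets, atom, ...
rabbitmqctl.bat eval "erlang:memory()."

# Top 10 Erlang processes by memory (assumes recon is bundled, as I believe it is in 3.6.x)
rabbitmqctl.bat eval "recon:proc_count(memory, 10)."

# Memory the VM's allocators hold, including what is reserved but unused;
# a large gap versus erlang:memory() would point at allocator fragmentation
# rather than process (e.g. GC-related) growth
rabbitmqctl.bat eval "recon_alloc:memory(allocated)."
```

If the binary category dominates in erlang:memory(), that would suggest message payloads held in refc binaries rather than process heaps.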
The Rabbit logs contain memory alarm messages but no other errors. The SASL log contains the entries below, which I do not understand. In the log, 13:39 is near the time of the initial memory alarm, 13:48 and 13:52 correspond to erl.exe RAM spikes over 40GB, and 14:02 is when the service was restarted.
=SUPERVISOR REPORT==== 4-Jan-2018::13:39:12 ===
Supervisor: {<0.11889.2949>,amqp_channel_sup_sup}
Context: shutdown_error
Reason: shutdown
Offender: [{nb_children,1},
{name,channel_sup},
{mfargs,
{amqp_channel_sup,start_link,
[direct,<0.28377.2948>,
<<"<rab...@QUEPR1.1.28377.2948>">>]}},
{restart_type,temporary},
{shutdown,infinity},
{child_type,supervisor}]
=SUPERVISOR REPORT==== 4-Jan-2018::13:48:12 ===
Supervisor: {<0.28526.4447>,amqp_channel_sup_sup}
Context: shutdown_error
Reason: shutdown
Offender: [{nb_children,1},
{name,channel_sup},
{mfargs,
{amqp_channel_sup,start_link,
[direct,<0.2645.5047>,
<<"<rab...@QUEPR1.1.2645.5047>">>]}},
{restart_type,temporary},
{shutdown,infinity},
{child_type,supervisor}]
=SUPERVISOR REPORT==== 4-Jan-2018::13:52:39 ===
Supervisor: {<0.13527.6029>,rabbit_channel_sup_sup}
Context: shutdown_error
Reason: shutdown
Offender: [{nb_children,1},
{name,channel_sup},
{mfargs,{rabbit_channel_sup,start_link,[]}},
{restart_type,temporary},
{shutdown,infinity},
{child_type,supervisor}]
=SUPERVISOR REPORT==== 4-Jan-2018::14:02:12 ===
Supervisor: {<0.32711.0>,rabbit_channel_sup_sup}
Context: shutdown_error
Reason: shutdown
Offender: [{nb_children,2},
{name,channel_sup},
{mfargs,{rabbit_channel_sup,start_link,[]}},
{restart_type,temporary},
{shutdown,infinity},
{child_type,supervisor}]
=SUPERVISOR REPORT==== 4-Jan-2018::14:02:15 ===
Supervisor: {<0.285.1>,rabbit_channel_sup_sup}
Context: shutdown_error
Reason: shutdown
Offender: [{nb_children,1},
{name,channel_sup},
{mfargs,{rabbit_channel_sup,start_link,[]}},
{restart_type,temporary},
{shutdown,infinity},
{child_type,supervisor}]