RabbitMQ cluster nodes constantly crashing after several hours of uptime due to out of memory


Marcus Kröger

Mar 7, 2016, 11:26:54 AM
to rabbitmq-users
Hi

we are running RabbitMQ 3.6.0 with Erlang 18 in a cluster with 2 nodes.

Configuration is as follows:

[{rabbit, [
  {cluster_partition_handling, autoheal},
  {disk_free_limit, 100000000},
  {ssl_listeners, [{"0.0.0.0",50000}]},
  {collect_statistics_interval, 10000},
  {vm_memory_high_watermark, {absolute, "8000MiB"}},
  {vm_memory_high_watermark_paging_ratio, 0.6},
  {auth_backends, [rabbit_auth_backend_internal,rabbit_auth_backend_ldap]},
  {ssl_options, [
   {cacertfile, "/srv/rabbitmq/config/xxx/ssl.crt/xxx.pem"},
   {certfile, "/srv/rabbitmq/config/xxx/ssl.crt/xxxcrt"},
   {keyfile, "/srv/rabbitmq/config/xxx/ssl.key/xxx.key"},
   {verify,verify_peer},
   {fail_if_no_peer_cert,true}
  ]}
 ]},
 {rabbitmq_management, [
  {http_log_dir, "/var/log/rabbitmq/xxx/management"},
  {listener, [{port, 15672}]},
  {redirect_old_port, false},
  {rates_mode,none}
 ]}
].

Even though there is not much load on the system, we very often see the message
"The management statistics database currently has a queue of x events to process. If this number keeps increasing, so will the memory used by the management plugin." in the admin GUI. The number stays in the range of 1000 to 5000 for hours, but then it grows to 2 million and above. When this happens, the cluster node no longer responds. There is basically no change in load.

We have been seeing this issue since we migrated from 3.4.2 to 3.6.0. Before that, the same setup ran for many years without any issue.

We have already set rates_mode to none, without success. Is there anything else we could look at specifically which could cause this kind of issue?
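Since the symptom is node memory growth, one way to watch for it from outside the admin GUI is to poll the management HTTP API for per-node memory and alarm state. A minimal sketch, assuming the `/api/nodes` endpoint and the `mem_used`/`mem_limit`/`mem_alarm` fields of the 3.6.x management API; host and credentials are placeholders:

```python
import json
import urllib.request

def summarize_nodes(payload: str):
    """Extract name, memory use, and alarm state from an /api/nodes response body."""
    return [(n["name"], n["mem_used"], n["mem_limit"], n["mem_alarm"])
            for n in json.loads(payload)]

def fetch_nodes(base_url: str, auth_header: str) -> str:
    """GET /api/nodes from a running management API; returns the raw JSON body."""
    req = urllib.request.Request(base_url + "/api/nodes")
    req.add_header("Authorization", auth_header)
    with urllib.request.urlopen(req) as resp:
        return resp.read().decode()

# Canned response in the same shape, for illustration (values from this thread's logs):
sample = ('[{"name": "rabbit@node1", "mem_used": 9229658632, '
          '"mem_limit": 8388608000, "mem_alarm": true}]')
for name, used, limit, alarm in summarize_nodes(sample):
    print(f"{name}: {used} of {limit} bytes, mem_alarm={alarm}")
```

Polling this from a cron job would catch the runaway growth before the node becomes unresponsive.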

Michael Klishin

Mar 7, 2016, 11:29:42 AM
to rabbitm...@googlegroups.com, Marcus Kröger
On 7 March 2016 at 19:26:56, Marcus Kröger (marcus....@gmail.com) wrote:
> Is there anything else we could look at specifically which could
> cause this kind of issue?

https://github.com/rabbitmq/rabbitmq-management/issues/41, which will be in 3.6.2.

If you can tell me what type of package you use, I’d be happy to build one from stable branch
and send you off-list.

A 3.6.2 preview release should ship this week.
--
MK

Staff Software Engineer, Pivotal/RabbitMQ


Marcus Kröger

Mar 7, 2016, 12:14:14 PM
to rabbitmq-users, marcus....@gmail.com
Hi Michael

we are compiling from source, as we are using zLinux on a mainframe.

cheers
Marcus

Marcus Kröger

Mar 7, 2016, 12:15:36 PM
to rabbitmq-users, marcus....@gmail.com
would the disabling of the management plugin be a workaround for now?

We would still be able to collect monitoring data via rabbitmqctl.

cheers
Marcus
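
The rabbitmqctl route can be scripted; a minimal sketch, assuming `rabbitmqctl` is on PATH and using the standard `list_queues` columns (the parsing helper is ours, for illustration):

```python
import subprocess

def parse_list_queues(output: str):
    """Parse tab-separated output of `rabbitmqctl list_queues name messages memory`."""
    rows = []
    for line in output.strip().splitlines():
        name, messages, memory = line.split("\t")
        rows.append((name, int(messages), int(memory)))
    return rows

def list_queues():
    """Run rabbitmqctl with -q to suppress the banner; requires a running node."""
    out = subprocess.run(
        ["rabbitmqctl", "-q", "list_queues", "name", "messages", "memory"],
        capture_output=True, text=True, check=True).stdout
    return parse_list_queues(out)

# Canned output in the same format, for illustration:
print(parse_list_queues("orders\t12\t34816\nevents\t0\t9048"))
# [('orders', 12, 34816), ('events', 0, 9048)]
```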


Michael Klishin

Mar 7, 2016, 12:28:40 PM
to rabbitm...@googlegroups.com, marcus....@gmail.com

> On 7 mar 2016, at 20:15, Marcus Kröger <marcus....@gmail.com> wrote:
>
> would the disabling of the management plugin be a workaround for now?

It would. Can you run a generic UNIX binary build?

Marcus Kröger

Mar 7, 2016, 12:31:08 PM
to rabbitmq-users, marcus....@gmail.com
Running a generic UNIX binary build in our environment is not possible.

So, by disabling the management plugin we would not run into this bug at all and we could wait for 3.6.2 being officially released?

cheers
Marcus

Marcus Kröger

Mar 7, 2016, 12:46:18 PM
to rabbitmq-users, marcus....@gmail.com
And in addition, is this only happening in a cluster setup?

Michael Klishin

Mar 7, 2016, 12:52:15 PM
to rabbitm...@googlegroups.com, marcus....@gmail.com

> On 7 mar 2016, at 20:31, Marcus Kröger <marcus....@gmail.com> wrote:
>
> So, by disabling the management plugin we would not run into this bug at all and we could wait for 3.6.2 being officially released?

Yes. Or you could build the tip of stable from source
if that's what you do.

Michael Klishin

Mar 7, 2016, 12:52:51 PM
to rabbitm...@googlegroups.com, marcus....@gmail.com

> On 7 mar 2016, at 20:46, Marcus Kröger <marcus....@gmail.com> wrote:
>
> And in addition, is this only happening in a cluster setup?

Please see rabbitmq/rabbitmq-management#41.

Marcus Kröger

Mar 8, 2016, 4:46:28 AM
to rabbitmq-users, marcus....@gmail.com
Hi Michael,

thx for the fast replies. We would need the src package for

rabbitmq-server-generic-unix

We will take this package and compile a server package ourselves using

Erlang 18.2.1 - http://www.erlang.org/download/otp_src_18.2.1.tar
RabbitMQ 3.6.2 (pre final) - rabbitmq-server-generic-unix-3.6.2.tar.xz

Michael Klishin

Mar 8, 2016, 4:49:51 AM
to rabbitm...@googlegroups.com, Marcus Kröger
On 8 March 2016 at 12:46:31, Marcus Kröger (marcus....@gmail.com) wrote:
> thx for the fast replies. We would need the src package for
>
> rabbitmq-server-generic-unix
>
> We will take this package a compile a server packge ourselves
> using
>
> Erlang 18.2.1 - http://www.erlang.org/download/otp_src_18.2.1.tar
> RabbitMQ 3.6.2 (pre final) - rabbitmq-server-generic-unix-3.6.2.tar.xz

Why do you need to build your own binary package if I may ask? 

Marcus Kröger

Mar 8, 2016, 5:11:49 AM
to rabbitmq-users, marcus....@gmail.com
Hi

because we run on a mainframe using

SUSE Linux Enterprise Server 11 (s390x)
VERSION = 11
PATCHLEVEL = 3

with kernel

"Linux 3.0.101-0.31-default  s390x s390x s390x GNU/Linux"

regards
Marcus

Michael Klishin

Mar 8, 2016, 5:15:27 AM
to rabbitm...@googlegroups.com, Marcus Kröger
On 8 March 2016 at 13:11:51, Marcus Kröger (marcus....@gmail.com) wrote:
> because we run on a mainframe using
>
> SUSE Linux Enterprise Server 11 (s390x)
> VERSION = 11
> PATCHLEVEL = 3

RabbitMQ has no native code, so provided you have a supported Erlang version, the generic UNIX binary
package should work just fine.

We’ll publish a preview of 3.6.2, including source tarballs, later this week. 

Marcus Kröger

Mar 8, 2016, 5:23:31 AM
to Michael Klishin, rabbitm...@googlegroups.com
Hi Michael,

well, our zLinux department would like to stick to the current process of using a source package and creating their own "installation" package.

Would it be possible that you send me a src package in advance?

regards
Marcus

Marcus Kröger

Mar 8, 2016, 5:59:07 AM
to Michael Klishin, rabbitm...@googlegroups.com
Hi Michael,

we "disabled" the cluster by only running one node.

After running with this single node, that node crashed as well, which indicates that the issue

https://github.com/rabbitmq/rabbitmq-management/issues/41

cannot be the reason for this shutdown.

What do we see:

The rabbit node behaves "normally" for several hours and then, all of a sudden, it uses up all memory and becomes unresponsive. It then crashes with out of memory.

This is what we see in the logs

=INFO REPORT==== 8-Mar-2016::11:21:13 ===
vm_memory_high_watermark set. Memory used:9229658632 allowed:8388608000

=WARNING REPORT==== 8-Mar-2016::11:21:13 ===
memory resource limit alarm set on node 'xxxx@xxx'.

**********************************************************
*** Publishers will be blocked until this alarm clears ***
**********************************************************

=INFO REPORT==== 8-Mar-2016::11:21:14 ===
vm_memory_high_watermark clear. Memory used:4829491912 allowed:8388608000

=WARNING REPORT==== 8-Mar-2016::11:21:14 ===
memory resource limit alarm cleared on node 'xxx@xxx'

=WARNING REPORT==== 8-Mar-2016::11:21:14 ===
memory resource limit alarm cleared across the cluster

=INFO REPORT==== 8-Mar-2016::11:26:44 ===
Starting RabbitMQ 3.6.0 on Erlang 18
Copyright (C) 2007-2015 Pivotal Software, Inc.
Licensed under the MPL.  See http://www.rabbitmq.com/

=INFO REPORT==== 8-Mar-2016::11:26:44 ===
node           : xxx@xxx
home dir       : /home/rabbitmq
config file(s) : /srv/rabbitmq/config/ComXervPower_B_production/rabbitmq.config
cookie hash    : +Pemgog0Dm+Lv7C9ZOD5dQ==
log            : /var/log/rabbitmq/ComXervPower_B_production/ComXervPower_B_production.log
sasl log       : /var/log/rabbitmq/ComXervPower_B_production/ComXervPower_B_production-sasl.log
database dir   : /srv/rabbitmq/data/ComXervPower_B_production

=INFO REPORT==== 8-Mar-2016::11:26:47 ===
Memory limit set to 8000MB of 15083MB total.

=INFO REPORT==== 8-Mar-2016::11:26:47 ===
Disk free limit set to 100MB

=INFO REPORT==== 8-Mar-2016::11:26:47 ===
Limiting to approx 29900 file handles (26908 sockets)

=INFO REPORT==== 8-Mar-2016::11:26:47 ===
FHC read buffering:  OFF
FHC write buffering: ON

=INFO REPORT==== 8-Mar-2016::11:26:47 ===
Priority queues enabled, real BQ is rabbit_variable_queue

=INFO REPORT==== 8-Mar-2016::11:26:47 ===
Management plugin: using rates mode 'none'

=INFO REPORT==== 8-Mar-2016::11:26:47 ===
msg_store_transient: using rabbit_msg_store_ets_index to provide index

=INFO REPORT==== 8-Mar-2016::11:26:47 ===
msg_store_persistent: using rabbit_msg_store_ets_index to provide index

=WARNING REPORT==== 8-Mar-2016::11:26:47 ===
msg_store_persistent: rebuilding indices from scratch

=WARNING REPORT==== 8-Mar-2016::11:26:47 ===
Mnesia('xxx@xxx'): ** WARNING ** Mnesia is overloaded: {dump_log,write_threshold}

=WARNING REPORT==== 8-Mar-2016::11:26:47 ===
Mnesia('xxx@xxx'): ** WARNING ** Mnesia is overloaded: {dump_log,write_threshold}
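
As an aside, the alarm numbers above are consistent with the configured absolute watermark: "8000MiB" is 8000 × 1024² bytes, exactly the "allowed" figure in the log, and the "used" figure is roughly 800 MiB above it. A quick check:

```python
MIB = 1024 ** 2

allowed = 8000 * MIB   # vm_memory_high_watermark, {absolute, "8000MiB"}
used = 9229658632      # "Memory used" from the alarm log line above

print(allowed)                  # 8388608000, matching "allowed:" in the log
print((used - allowed) // MIB)  # 802 -> roughly 802 MiB over the watermark
```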


any information would be appreciated

regards
Marcus

Marcus Kröger

Mar 8, 2016, 6:00:17 AM
to Michael Klishin, rabbitm...@googlegroups.com
config file(s) : /srv/rabbitmq/config/xxx/rabbitmq.config
cookie hash    : +Pemgog0Dm+Lv7C9ZOD5dQ==
log            : /var/log/rabbitmq/xxx/xxx.log
sasl log       : /var/log/rabbitmq/xxx/xxx-sasl.log
database dir   : /srv/rabbitmq/data/xxx

Michael Klishin

Mar 8, 2016, 6:07:03 AM
to Marcus Kröger, rabbitm...@googlegroups.com
On 8 March 2016 at 14:00:15, Marcus Kröger (marcus....@gmail.com) wrote:
> we "disabled" the cluster by only running one node.
>
> After running with this single node only this node crashed as
> well which indicates that the issue
>
> https://github.com/rabbitmq/rabbitmq-management/issues/41
>
> cannot be the reason for this shutdown.

Marcus,

"The management statistics database currently has a queue of x events to process.” says, which was
in your original report, IS an indication of management#41. There is no ambiguity in the message.

I did not recommend going from 2 nodes to 1. I clarified that disabling management UI is what you want.

The log provided has no evidence that the node terminated, or why it might have happened. See the SASL log and syslog
for clues. 

Michael Klishin

Mar 8, 2016, 6:19:38 AM
to Marcus Kröger, rabbitm...@googlegroups.com
On 8 March 2016 at 14:00:15, Marcus Kröger (marcus....@gmail.com) wrote:
> =WARNING REPORT==== 8-Mar-2016::11:21:13 ===
> memory resource limit alarm set on node 'xxxx@xxx'.
>
> **********************************************************
> *** Publishers will be blocked until this alarm clears ***
> **********************************************************

What protocols do you use? Are STOMP and MQTT among them?

Marcus Kröger

Mar 8, 2016, 6:26:49 AM
to Michael Klishin, rabbitm...@googlegroups.com
Hi Michael,

no, we don't use STOMP or MQTT.

There are no entries in the SASL log or in syslog, which indeed is very weird.

We also cannot take down the management plugin, as we need it for our HTTP requests towards the rabbit node.

Currently we are intending either to go back to the old version or to the latest version 3.6.2, if you could provide us with the source package.

regards
Marcus

Michael Klishin

Mar 8, 2016, 6:27:53 AM
to Marcus Kröger, rabbitm...@googlegroups.com
 On 8 March 2016 at 14:26:47, Marcus Kröger (marcus....@gmail.com) wrote:
> no, we don't use stomp nor mqtt.
>
> there are no entries in sasl log nor syslog which indeed is very
> weard.
>
> We also cannot take down the management plugin as we need it for
> our http request towards the rabbit node.

Then going back to 1 node won’t change anything.

Marcus Kröger

Mar 8, 2016, 6:31:30 AM
to Michael Klishin, rabbitm...@googlegroups.com
we are currently trying to find out the reason by reducing the complexity of the setup.

So, the issue we are facing is definitely not caused by the cluster setup.

Either we try 3.6.2, even though it is a pre-release, to check whether the issue is fixed, or we go back to RabbitMQ 3.2.3.

regards
Marcus