rabbitmqctl reset hangs

413 views
Skip to first unread message

Zev Benjamin

unread,
Feb 2, 2015, 8:20:46 PM2/2/15
to rabbitm...@googlegroups.com
Hi,

We have a production RabbitMQ cluster where we would like to remove a node.  We first ran `rabbitmqctl stop_app`, which seemed to complete cleanly.  The logs show the "Stopping RabbitMQ" line followed by "stopped TCP Listener on [::]:5672" followed by client disconnects.  We then ran `rabbitmqctl reset`.  We see that "Resetting Rabbit" was logged, but the command hung.  Other nodes still saw the node we were trying to remove as part of the cluster.  There is nothing else in the logs that indicate any sort of failure.  Right after the reset hung, calls to queue.bind also stopped completing (clients never received a bind-ok), but messages were otherwise accepted and delivered just fine.  We subsequently tried invoking `rabbitmqctl forget_cluster_node` from another node, but that also hung.  

At this point we tried bringing the node back online, instead.  We ran `rabbitmqctl start_app`, but that failed with:
  • Error: {badarg,[{erlang,register,[rabbit_boot,<5289.23326.934>],[]},
  •                 {rabbit,start_it,1,[{file,"src/rabbit.erl"},{line,350}]},
  •                 {rpc,'-handle_call_call/6-fun-0-',5,
  •                      [{file,"rpc.erl"},{line,205}]}]}

We were able to successfully bring it back up by killing the `erl` process and starting it again. At that point, we got bind-ok messages for the outstanding queue.bind calls and the cluster returned to normal operation.

Any ideas for why the reset would have hung or ways to diagnose further what is going on? We tried the same procedure in our development cluster and the node left the cluster just fine.


Thanks,
Zev

Michael Klishin

unread,
Feb 2, 2015, 8:58:11 PM2/2/15
to rabbitm...@googlegroups.com, Zev Benjamin
On 3 February 2015 at 04:20:48, Zev Benjamin (z...@dropbox.com) wrote:
> Any ideas for why the reset would have hung or ways to diagnose
> further what is going on? We tried the same procedure in our development
> cluster and the node left the cluster just fine.

Zev,

To suggest anything specific we need to know a couple of things.

1. What's in RabbitMQ log files? 
2. What version do you use?
--
MK

Staff Software Engineer, Pivotal/RabbitMQ

Zev Benjamin

unread,
Feb 2, 2015, 9:55:25 PM2/2/15
to rabbitm...@googlegroups.com, z...@dropbox.com
We are running RabbitMQ 3.2.0.

Logs from the node we were trying to remove:
```
=INFO REPORT==== 2-Feb-2015::21:56:05 === Stopping RabbitMQ
=INFO REPORT==== 2-Feb-2015::21:56:05 === stopped TCP Listener on [::]:5672
=ERROR REPORT==== 2-Feb-2015::21:56:05 === AMQP connection <0.32207.933> (running), channel 0 - error: {amqp_error,connection_forced,             "broker forced connection closure with reason 'shutdown'",none}
[...]
=ERROR REPORT==== 2-Feb-2015::21:56:08 === AMQP connection <0.18413.922> (running), channel 0 - error: {amqp_error,connection_forced,             "broker forced connection closure with reason 'shutdown'",none}
=INFO REPORT==== 2-Feb-2015::21:56:15 === Resetting Rabbit
=ERROR REPORT==== 2-Feb-2015::22:06:13 === ** Connection attempt from disallowed node 'rabbitmqctl82162@hostA' **
=ERROR REPORT==== 2-Feb-2015::22:06:56 === ** Connection attempt from disallowed node 'rabbitmqctl82527@hostA' **
=ERROR REPORT==== 2-Feb-2015::22:07:02 === ** Connection attempt from disallowed node 'rabbitmqctl82711@hostA' **
=INFO REPORT==== 2-Feb-2015::22:08:49 === Starting RabbitMQ 3.2.0 on Erlang R16B Copyright (C) 2007-2013 GoPivotal, Inc. Licensed under the MPL.  See http://www.rabbitmq.com/
=INFO REPORT==== 2-Feb-2015::22:08:49 === node           : rabbit@hostA home dir       : /home/rabbitmq config file(s) : (none) cookie hash    : [redacted] log            : /var/log/
rabbitmq/rab...@hostA.log sasl log       : /var/log/rabbitmq/rab...@hostA-sasl.log database dir   : /var/lib/rabbitmq/rabbit@hostA
=INFO REPORT==== 2-Feb-2015::22:08:49 === Limiting to approx 9900 file handles (8908 sockets)
=INFO REPORT==== 2-Feb-2015::22:13:36 === Starting RabbitMQ 3.2.0 on Erlang R16B Copyright (C) 2007-2013 GoPivotal, Inc. Licensed under the MPL.  See http://www.rabbitmq.com/
=INFO REPORT==== 2-Feb-2015::22:13:36 === node           : rabbit@hostA home dir       : /home/rabbitmq config file(s) : /etc/rabbitmq/rabbitmq.config cookie hash    : [redacted] log
            : /var/log/rabbitmq/rab...@hostA.log sasl log       : /var/log/rabbitmq/rab...@hostA-sasl.log database dir   : /var/lib/rabbitmq/rabbit@hostA
=INFO REPORT==== 2-Feb-2015::22:13:36 === Limiting to approx 9900 file handles (8908 sockets)
=INFO REPORT==== 2-Feb-2015::22:13:49 === Memory limit set to 72471MB of 96628MB total.
=INFO REPORT==== 2-Feb-2015::22:13:49 === Disk free limit set to 50MB
=INFO REPORT==== 2-Feb-2015::22:13:49 === msg_store_transient: using rabbit_msg_store_ets_index to provide index
=INFO REPORT==== 2-Feb-2015::22:13:49 === msg_store_persistent: using rabbit_msg_store_ets_index to provide index
=WARNING REPORT==== 2-Feb-2015::22:13:49 === msg_store_persistent: recovery terms differ from present rebuilding indices from scratch
=INFO REPORT==== 2-Feb-2015::22:14:17 === Adding mirror of queue 'queue.28660c0a.54cc2ea9.4b2c.0' in vhost '/' on node 'rabbit@hostA': <0.13839.3>
[...]
=INFO REPORT==== 2-Feb-2015::22:14:18 === Adding mirror of queue 'queue.b5640c0a.54cff278.6fd4.0' in vhost '/' on node 'rabbit@hostA': <0.16780.3>
=INFO REPORT==== 2-Feb-2015::22:14:18 === started TCP Listener on [::]:5672
=INFO REPORT==== 2-Feb-2015::22:14:18 === rabbit on node 'rabbit@hostB' up
=INFO REPORT==== 2-Feb-2015::22:14:18 === rabbit on node 'rabbit@hostC' up
=INFO REPORT==== 2-Feb-2015::22:14:18 === rabbit on node 'rabbit@hostD' up
=INFO REPORT==== 2-Feb-2015::22:14:18 === rabbit on node 'rabbit@hostE' up
=INFO REPORT==== 2-Feb-2015::22:14:18 === rabbit on node 'rabbit@hostF' up
=INFO REPORT==== 2-Feb-2015::22:14:18 === Management plugin started. Port: 15672
=INFO REPORT==== 2-Feb-2015::22:14:18 === Server startup complete; 6 plugins started.  * amqp_client  * mochiweb  * rabbitmq_management  * rabbitmq_management_agent  * rabbitmq_web_dispatch  * webmachine
```

From the node we tried to run forget_cluster_node:
```
=INFO REPORT==== 2-Feb-2015::21:56:07 === Adding mirror of queue 'queue.b8640c0a.54cff277.695e.0' in vhost '/' on node 'rabbit@hostE': <2992.19092.927>
[...]
=INFO REPORT==== 2-Feb-2015::21:56:08 === Adding mirror of queue 'queue.28660c0a.54cff278.7d61.0' in vhost '/' on node 'rabbit@hostF': <2989.6474.1119>
=INFO REPORT==== 2-Feb-2015::21:56:08 === Mirrored-queue (queue 'queue.b5640c0a.54cff277.6fd1.0' in vhost '/'): Slave <'rabbit@hostB'.3.18327.0> saw deaths of mirrors <'rabbit@hostA'.2.22436.934>
=INFO REPORT==== 2-Feb-2015::21:56:08 === Mirrored-queue (queue 'queue.b8640c0a.54cff277.6960.0' in vhost '/'): Master <'rabbit@hostB'.3.18288.0> saw deaths of mirrors <'rabbit@hostA'.2.22430.934>
=INFO REPORT==== 2-Feb-2015::21:56:08 === Mirrored-queue (queue 'queue.b8640c0a.54cff277.695e.0' in vhost '/'): Master <'rabbit@hostB'.3.18222.0> saw deaths of mirrors <'rabbit@hostA'.2.22422.934>
=INFO REPORT==== 2-Feb-2015::21:56:08 === Mirrored-queue (queue 'queue.b8640c0a.54cff277.695d.0' in vhost '/'): Slave <'rabbit@hostB'.3.18224.0> saw deaths of mirrors <'rabbit@hostA'.2.22415.934>
=INFO REPORT==== 2-Feb-2015::21:56:08 === Mirrored-queue (queue 'queue.b5640c0a.54cff278.6fd4.0' in vhost '/'): Slave <'rabbit@hostB'.3.18562.0> saw deaths of mirrors <'rabbit@hostA'.2.22515.934>
=INFO REPORT==== 2-Feb-2015::21:56:10 === rabbit on node 'rabbit@hostA' down
=INFO REPORT==== 2-Feb-2015::22:01:18 === Removing node 'rabbit@hostA' from cluster
=WARNING REPORT==== 2-Feb-2015::22:02:03 === Mnesia('rabbit@hostB'): ** WARNING ** Mnesia is overloaded: {dump_log,time_threshold}
=WARNING REPORT==== 2-Feb-2015::22:05:03 === Mnesia('rabbit@hostB'): ** WARNING ** Mnesia is overloaded: {dump_log,time_threshold}
=WARNING REPORT==== 2-Feb-2015::22:08:03 === Mnesia('rabbit@hostB'): ** WARNING ** Mnesia is overloaded: {dump_log,time_threshold}
=WARNING REPORT==== 2-Feb-2015::22:11:03 === Mnesia('rabbit@hostB'): ** WARNING ** Mnesia is overloaded: {dump_log,time_threshold}
=INFO REPORT==== 2-Feb-2015::22:14:18 === rabbit on node 'rabbit@hostA' up
```

From one of the other nodes (the rest are similar, with varying numbers of mirroring messages and Mnesia overload messages):
```
=INFO REPORT==== 2-Feb-2015::21:56:07 === Adding mirror of queue 'queue.33660c0a.54cff277.70c4.0' in vhost '/' on node 'rabbit@hostE': <2991.19425.927>
=INFO REPORT==== 2-Feb-2015::21:56:07 === Adding mirror of queue 'queue.33660c0a.54cff277.70c4.0' in vhost '/' on node 'rabbit@hostD': <2989.12273.1138>
=INFO REPORT==== 2-Feb-2015::21:56:08 === Adding mirror of queue 'queue.b8640c0a.54cff277.6967.0' in vhost '/' on node 'rabbit@hostE': <2991.19433.927>
=INFO REPORT==== 2-Feb-2015::21:56:08 === Adding mirror of queue 'queue.b8640c0a.54cff277.6967.0' in vhost '/' on node 'rabbit@hostD': <2989.12282.1138>
=INFO REPORT==== 2-Feb-2015::21:56:08 === Adding mirror of queue 'queue.33660c0a.54cff278.70c5.0' in vhost '/' on node 'rabbit@hostE': <2991.19503.927>
=INFO REPORT==== 2-Feb-2015::21:56:08 === Adding mirror of queue 'queue.33660c0a.54cff278.70c5.0' in vhost '/' on node 'rabbit@hostD': <2989.12305.1138>
=INFO REPORT==== 2-Feb-2015::21:56:08 === Mirrored-queue (queue 'queue.33660c0a.54cc2a21.3ca8.0' in vhost '/'): Master <'rabbit@hostC'.3.12219.836> saw deaths of mirrors <'rabbit@hostA'.2.14628.857>
=INFO REPORT==== 2-Feb-2015::21:56:08 === Mirrored-queue (queue 'queue.b8640c0a.54cff277.6961.0' in vhost '/'): Slave <'rabbit@hostC'.3.18832.910> saw deaths of mirrors <'rabbit@hostA'.2.22452.934>
=INFO REPORT==== 2-Feb-2015::21:56:08 === Mirrored-queue (queue 'queue.b5640c0a.54cc2a57.3f1c.0' in vhost '/'): Master <'rabbit@hostC'.3.13195.836> saw deaths of mirrors <'rabbit@hostA'.2.15527.857>
=INFO REPORT==== 2-Feb-2015::21:56:08 === Mirrored-queue (queue 'queue.28660c0a.54cff277.7d5f.0' in vhost '/'): Master <'rabbit@hostC'.3.18916.910> saw deaths of mirrors <'rabbit@hostA'.2.22476.934>
=INFO REPORT==== 2-Feb-2015::21:56:08 === Mirrored-queue (queue 'queue.b8640c0a.54cc2e7b.4c3d.0' in vhost '/'): Master <'rabbit@hostC'.3.24561.836> saw deaths of mirrors <'rabbit@hostA'.2.27350.857>
=INFO REPORT==== 2-Feb-2015::21:56:08 === Mirrored-queue (queue 'queue.b8640c0a.54cc2e7b.4c2d.0' in vhost '/'): Slave <'rabbit@hostC'.3.24457.836> saw deaths of mirrors <'rabbit@hostA'.2.27249.857>
=INFO REPORT==== 2-Feb-2015::21:56:08 === Mirrored-queue (queue 'queue.28660c0a.54cc2ea9.4b34.0' in vhost '/'): Master <'rabbit@hostC'.3.26210.836> saw deaths of mirrors <'rabbit@hostA'.2.29274.857>
=INFO REPORT==== 2-Feb-2015::21:56:08 === Adding mirror of queue 'queue.b5640c0a.54cff278.6fd7.0' in vhost '/' on node 'rabbit@hostE': <2991.19538.927>
=INFO REPORT==== 2-Feb-2015::21:56:08 === Mirrored-queue (queue 'queue.28660c0a.54cc2ea9.4b2c.0' in vhost '/'): Master <'rabbit@hostC'.3.26150.836> saw deaths of mirrors <'rabbit@hostA'.2.29210.857>
=INFO REPORT==== 2-Feb-2015::21:56:08 === Mirrored-queue (queue 'queue.28660c0a.54cc2ea9.4b29.0' in vhost '/'): Master <'rabbit@hostC'.3.26116.836> saw deaths of mirrors <'rabbit@hostA'.2.29204.857>
=INFO REPORT==== 2-Feb-2015::21:56:08 === Mirrored-queue (queue 'queue.b5640c0a.54cff277.6fd3.0' in vhost '/'): Slave <'rabbit@hostC'.3.18956.910> saw deaths of mirrors <'rabbit@hostA'.2.22478.934>
=INFO REPORT==== 2-Feb-2015::21:56:08 === Mirrored-queue (queue 'queue.b8640c0a.54cc2e7a.4c27.0' in vhost '/'): Slave <'rabbit@hostC'.3.24363.836> saw deaths of mirrors <'rabbit@hostA'.2.27196.857>
=INFO REPORT==== 2-Feb-2015::21:56:08 === Mirrored-queue (queue 'queue.b5640c0a.54cc2a5a.3f59.0' in vhost '/'): Master <'rabbit@hostC'.3.13366.836> saw deaths of mirrors <'rabbit@hostA'.2.15697.857>
=INFO REPORT==== 2-Feb-2015::21:56:08 === Adding mirror of queue 'queue.b5640c0a.54cff278.6fd7.0' in vhost '/' on node 'rabbit@hostF': <2988.6499.1119>
=INFO REPORT==== 2-Feb-2015::21:56:09 === Mirrored-queue (queue 'queue.b5640c0a.54cc2a58.3f31.0' in vhost '/'): Slave <'rabbit@hostC'.3.13271.836> saw deaths of mirrors <'rabbit@hostA'.2.15580.857>
=INFO REPORT==== 2-Feb-2015::21:56:09 === Mirrored-queue (queue 'queue.b5640c0a.54cc2a58.3f31.0' in vhost '/'): Promoting slave <'rabbit@hostC'.3.13271.836> to master
=INFO REPORT==== 2-Feb-2015::21:56:09 === Mirrored-queue (queue 'queue.33660c0a.54cc2a23.3cdd.0' in vhost '/'): Slave <'rabbit@hostC'.3.12471.836> saw deaths of mirrors <'rabbit@hostA'.2.14775.857>
=INFO REPORT==== 2-Feb-2015::21:56:09 === Mirrored-queue (queue 'queue.33660c0a.54cc2a23.3cdd.0' in vhost '/'): Promoting slave <'rabbit@hostC'.3.12471.836> to master
=INFO REPORT==== 2-Feb-2015::21:56:09 === Mirrored-queue (queue 'queue.b8640c0a.54cc2e7c.4c4f.0' in vhost '/'): Slave <'rabbit@hostC'.3.24834.836> saw deaths of mirrors <'rabbit@hostA'.2.27603.857>
=INFO REPORT==== 2-Feb-2015::21:56:09 === Mirrored-queue (queue 'queue.b8640c0a.54cc2e7c.4c4f.0' in vhost '/'): Promoting slave <'rabbit@hostC'.3.24834.836> to master
=INFO REPORT==== 2-Feb-2015::21:56:09 === Mirrored-queue (queue 'queue.b8640c0a.54cc2e7c.4c4f.0' in vhost '/'): Slave <'rabbit@hostC'.3.24834.836> saw deaths of mirrors <'rabbit@hostA'.2.27603.857>
=INFO REPORT==== 2-Feb-2015::21:56:09 === Mirrored-queue (queue 'queue.b8640c0a.54cc2e7c.4c4f.0' in vhost '/'): Promoting slave <'rabbit@hostC'.3.24834.836> to master
=INFO REPORT==== 2-Feb-2015::21:56:10 === rabbit on node 'rabbit@hostA' down
=ERROR REPORT==== 2-Feb-2015::21:56:15 === connection <0.28716.836>, channel 1 - soft error: {amqp_error,not_found,             "no queue 'queue.28660c0a.54cc2ea8.4b26.0' in vhost '/'",             'queue.declare'}
=WARNING REPORT==== 2-Feb-2015::22:00:38 === Mnesia('rabbit@hostC'): ** WARNING ** Mnesia is overloaded: {dump_log,time_threshold}
=WARNING REPORT==== 2-Feb-2015::22:03:38 === Mnesia('rabbit@hostC'): ** WARNING ** Mnesia is overloaded: {dump_log,time_threshold}
=WARNING REPORT==== 2-Feb-2015::22:06:38 === Mnesia('rabbit@hostC'): ** WARNING ** Mnesia is overloaded: {dump_log,time_threshold}
=WARNING REPORT==== 2-Feb-2015::22:09:38 === Mnesia('rabbit@hostC'): ** WARNING ** Mnesia is overloaded: {dump_log,time_threshold}
=WARNING REPORT==== 2-Feb-2015::22:12:38 === Mnesia('rabbit@hostC'): ** WARNING ** Mnesia is overloaded: {dump_log,time_threshold}
=INFO REPORT==== 2-Feb-2015::22:14:18 === rabbit on node 'rabbit@hostA' up
```

Michael Klishin

unread,
Feb 2, 2015, 10:03:09 PM2/2/15
to rabbitm...@googlegroups.com, Zev Benjamin
On 3 February 2015 at 05:55:28, Zev Benjamin (z...@dropbox.com) wrote:
> =ERROR REPORT==== 2-Feb-2015::22:06:56 === ** Connection
> attempt from disallowed node 'rabbitmqctl82527@hostA' **
> =ERROR REPORT==== 2-Feb-2015::22:07:02 === ** Connection
> attempt from disallowed node 'rabbitmqctl82711@hostA' **

This means that rabbitmqctl used a different Erlang cookie from the server.
Perhaps it was executed with a different effective user? 

On a vaguely related note, I'd suggest upgrading when you can, as 3.2.0 is 14 releases behind,
and does not include improved rabbitmqctl diagnostics output among other things.

See http://www.rabbitmq.com/changelog.html and Upgrading Clusters on https://www.rabbitmq.com/clustering.html.

Zev Benjamin

unread,
Feb 2, 2015, 10:43:41 PM2/2/15
to rabbitm...@googlegroups.com, z...@dropbox.com
On Monday, February 2, 2015 at 7:03:09 PM UTC-8, Michael Klishin wrote:
On 3 February 2015 at 05:55:28, Zev Benjamin (z...@dropbox.com) wrote:
> =ERROR REPORT==== 2-Feb-2015::22:06:56 === ** Connection  
> attempt from disallowed node 'rabbitmqctl82527@hostA' **  
> =ERROR REPORT==== 2-Feb-2015::22:07:02 === ** Connection  
> attempt from disallowed node 'rabbitmqctl82711@hostA' **  

This means that rabbitmqctl used a different Erlang cookie from the server.
Perhaps it was executed with a different effective user? 

I suspect that someone tried to run rabbitmqctl as root while trying to debug.  These appear in the logs several minutes after the reset and unresponsiveness, so we've been considering them unrelated.
 

On a vaguely related note, I'd suggest upgrading when you can, as 3.2.0 is 14 releases behind,
and does not include improved rabbitmqctl diagnostics output among other things.

It is on our todo list.

Zev Benjamin

unread,
Feb 3, 2015, 4:39:37 PM2/3/15
to rabbitm...@googlegroups.com, z...@dropbox.com
We tried running the reset again last night and it went off without a hitch.  We're going to look into prioritizing the upgrade of our cluster.


Zev
Reply all
Reply to author
Forward
0 new messages