[rabbitmq-discuss] rabbitmqctl stall/hang when leaving a cluster

Matt Pietrek

unread,

Feb 21, 2012, 1:49:02 PM2/21/12

to rabbitmq...@lists.rabbitmq.com

I've noticed a pattern here that seems to be consistently reproducible.

In our setup, we have three RabbitMQ 2.71 brokers in a cluster on Ubuntu 10.04. They're all mirroring the same set of queues, but that may or may not be relevant here.

Starting from a healthy state (Management web console showing all brokers, and all queues in sync), I simply run "rabbitmqctl stop_app" for one of the nodes, followed by running "rabbitmqctl start_app" on the same node.

The expected results is that the one broker should drop out of the cluster, then rejoin it. This is indeed what happens when I run against a broker that does *not* have the stats database (as shown by the Web UI).

If I try this action on the node with the stats database, rabbitmqctl waits forever and I have to ctrl-c out. If I then try "rabbitmqctl stop", it errors out, saying that the node is down.

The only way I can get the cluster back up is to shut down the other two actively running nodes, then restart all three nodes.

Known issue? Something I'm overlooking? Searching online isn't turning up anything obvious.

Thanks,

Matt

Simon MacMullen

unread,

Feb 22, 2012, 6:45:10 AM2/22/12

to Matt Pietrek, rabbitmq...@lists.rabbitmq.com

On 21/02/12 18:49, Matt Pietrek wrote:
> If I try this action on the node with the stats database, rabbitmqctl
> waits forever and I have to ctrl-c out. If I then try "rabbitmqctl
> stop", it errors out, saying that the node is down.

Hmm. Needless to say, this does not happen when I try it :(

Does anything show up in the logs on that node at this point?

Cheers, Simon

--
Simon MacMullen
RabbitMQ, VMware
_______________________________________________
rabbitmq-discuss mailing list
rabbitmq...@lists.rabbitmq.com
https://lists.rabbitmq.com/cgi-bin/mailman/listinfo/rabbitmq-discuss

Matt Pietrek

unread,

Feb 22, 2012, 1:40:10 PM2/22/12

to Simon MacMullen, rabbitmq...@lists.rabbitmq.com

Unfortunately, I don't see anything in the logs. I'll try again. Is there anything I can do on my end to gather more information?

Matt

Matt Pietrek

unread,

Feb 22, 2012, 6:05:56 PM2/22/12

to Simon MacMullen, rabbitmq...@lists.rabbitmq.com

I was able to just now repro this again. Because of some layers of scripting, it turns out it's not hanging up when running rabbitmq-server. Rather, it's hanging up when running "rabbitmqctl wait <pidfile>"

The output from rabbitmqctl when run on any node is:

Waiting for rabbit@play ...   << Where 'play' is the node that's restarting>>
pid is 21925 ...                     << Where the PID value is different depending on which node it's run>>

The last few lines of the event log for the 'play' node:

=INFO REPORT==== 22-Feb-2012::14:52:08 ===
Stopping Rabbit

=INFO REPORT==== 22-Feb-2012::14:52:08 ===
    application: rabbitmq_management
    exited: stopped
    type: permanent

=INFO REPORT==== 22-Feb-2012::14:52:08 ===
    application: rabbitmq_management_agent
    exited: stopped
    type: permanent

=INFO REPORT==== 22-Feb-2012::14:52:08 ===
stopped TCP Listener on 0.0.0.0:5672

=INFO REPORT==== 22-Feb-2012::14:52:08 ===
    application: rabbit
    exited: stopped
    type: permanent

=INFO REPORT==== 22-Feb-2012::14:52:08 ===
    application: os_mon
    exited: stopped
    type: permanent

=INFO REPORT==== 22-Feb-2012::14:52:08 ===
    application: mnesia
    exited: stopped
    type: permanent

=INFO REPORT==== 22-Feb-2012::14:52:08 ===
Halting Erlang VM

=INFO REPORT==== 22-Feb-2012::14:52:13 ===
Limiting to approx 924 file handles (829 sockets)

Matt Pietrek

unread,

Feb 22, 2012, 7:30:03 PM2/22/12

to Simon MacMullen, rabbitmq...@lists.rabbitmq.com

Let me add some additional information, and re-summarize what I'm seeing.

In our startup script for RabbitMQ we do the following;

rabbitmq-server -detached
rabbitmqctl status
<Extract the PID from rabbitmqctl status, write to our PIDFILE>
rabbitmqctl wait PIDFILE

On shutdown, we do:

rabbitmqctl stop PIDFILE
rm PIDFILE

In normal circumstances, this works just fine hundreds of times in a row. However as mentioned earlier in the thread, sometimes when restarting the node that had a stats database, the "rabbitmqctl wait" hangs.

Thanks,

Mat

Simon MacMullen

unread,

Feb 23, 2012, 6:52:17 AM2/23/12

to Matt Pietrek, rabbitmq...@lists.rabbitmq.com

I *think* the stats database is a red herring. You say this happens when
restarting?

On 23/02/12 00:30, Matt Pietrek wrote:
> Let me add some additional information, and re-summarize what I'm seeing.
>
> In our startup script for RabbitMQ we do the following;
>
> rabbitmq-server -detached
> rabbitmqctl status
> <Extract the PID from rabbitmqctl status, write to our PIDFILE>

There's a potential race here if an old server is running (maybe about
to shut down?). rabbitmqctl status could pick up the old pid.

> rabbitmqctl wait PIDFILE

However, rabbitmqctl wait should then detect that the pid has died and
fail. Unless the pid gets reused by the OS but that is presumably unlikely.

But rabbitmqctl wait will wait indefinitely as long as the pid is alive
and not a fully functional rabbit node. So I'd check two things:

1) You should fix that race, it can be done safely:

Do not use rabbitmq-server -detached and rabbitmqctl status to get the
pid. Instead set RABBITMQ_PID_FILE and background the rabbitmq-server
script. You will then *definitely* get the right pid since the script
writes its own pid then execs - no race possible.

2) Capture the stdout of rabbitmq-server when you start it - if
rabbitmqctl wait still hangs, see how far it's got / what it's doing.

Cheers, Simon

--
Simon MacMullen
RabbitMQ, VMware

Matt Pietrek

unread,

Feb 23, 2012, 4:00:27 PM2/23/12

to Simon MacMullen, rabbitmq...@lists.rabbitmq.com

| I *think* the stats database is a red herring.

Perhaps. But it's the only correlation that I've seen. That is, I've never seen it happen on node that didn't have the stats database before it shut down.

A little more background context: I'm writing "rolling restart" logic. For each node in the cluster, in sequence, I stop the node, perform update logic (currently nothing), then restart the node.

| You say this happens when restarting?

Yes. Occasionally the node will restart OK, but more often than not, it hangs on the "rabbitmqctl wait"

I modified my script to run rabbitmq-server as a background task. Also, worth noting that these scripts are invoked remotely via Capistrano, so until I prefaced them with nohup, the server would start then immediately exit. The invocation line now looks like this:

nohup rabbitmq-server &

The nohup.out on the failing node ends with:

+---+   +---+
|   |   |   |
|   |   |   |
|   |   |   |
|   +---+   +-------+
|                   |
| RabbitMQ +---+   |
|           |   |   |
|   v2.7.1 +---+   |
|                   |
+-------------------+
AMQP 0-9-1 / 0-9 / 0-8
Copyright (C) 2007-2011 VMware, Inc.
Licensed under the MPL. See http://www.rabbitmq.com/

node           : rabbit@play2
app descriptor : /usr/lib/rabbitmq/lib/rabbitmq_server-2.7.1/sbin/../ebin/rabbit.app
home dir       : /home/mpietrek
config file(s) : /home/mpietrek/work/var/run/rabbitmq.config
cookie hash    : pS5H9kY3Wra/XdLEKT5hgQ==
log            : /home/mpietrek/work/logs/play2.mpietrek.internal.illumita.com/rab...@play2.log
sasl log       : /home/mpietrek/work/logs/play2.mpietrek.internal.illumita.com/rab...@play2-sasl.log
database dir   : /home/mpietrek/work/var/lib/rabbit@play2
erlang version : 5.7.4

-- rabbit boot start
starting file handle cache server                                     ...done
starting worker pool                                                  ...done
starting database                                                     ...

Simon MacMullen

unread,

Feb 24, 2012, 7:22:22 AM2/24/12

to Matt Pietrek, rabbitmq...@lists.rabbitmq.com

On 23/02/12 21:00, Matt Pietrek wrote:
> The nohup.out on the failing node ends with:

<snip>

> starting database ...

So how long are you waiting when determining it's hanging? Less than 30
seconds?

Because that looks like Rabbit is waiting for another cluster node (if
it was not the last to shut down, but is the first to start up, it will
wait for the one that was the last to shut down. But it will only wait
for 30 seconds before spitting out an error. I'm not sure how else you
could get it to stop there *without* any further output though.

Cheers, Simon

--
Simon MacMullen
RabbitMQ, VMware

Matt Pietrek

unread,

Feb 24, 2012, 4:43:14 PM2/24/12

to Simon MacMullen, rabbitmq...@lists.rabbitmq.com

| So how long are you waiting when determining it's hanging? Less than 30 seconds?

Just to be double sure, I let it sit for an hour yesterday. I would have expected a timeout, but it never came.

It's a pretty easy scenario to script and try out. I'd send you my code, but it relies on other internal commands.

There may also be a timing issue. If I put a 10 second delay after restarting one broker, and before stopping the next, it seems to help.

That is:

for x in broker_list:
    stop x
    start x
    sleep(10)

Matt

Matt Pietrek

unread,

Mar 13, 2012, 6:00:30 PM3/13/12

to Simon MacMullen, rabbitmq...@lists.rabbitmq.com

Some other work came up so I needed to drop this thread for a few weeks. However, coming back to it, I can easily reproduce this issue within one or two tries.

In a nutshell, in a clustered environment, simply stop one node, wait a few seconds, then restart it. The last output seen is:

starting database

...

I've let it wait for much longer than 30 seconds and it has never come back.

Any chance this may have been stamped out in RabbitMQ 2.8?

Matt Pietrek

unread,

Mar 14, 2012, 1:06:21 PM3/14/12

to Simon MacMullen, rabbitmq...@lists.rabbitmq.com

Following up on this, I've gone back and looked at all the logs I can think of.

On the node that hangs (stuck on "starting database ..."), here's the console output:

+---+   +---+
|   |   |   |
|   |   |   |
|   |   |   |
|   +---+   +-------+
|                   |
| RabbitMQ +---+   |
|           |   |   |
|   v2.7.1 +---+   |
|                   |
+-------------------+
AMQP 0-9-1 / 0-9 / 0-8
Copyright (C) 2007-2011 VMware, Inc.
Licensed under the MPL. See http://www.rabbitmq.com/

node           : rabbit@play

app descriptor : /usr/lib/rabbitmq/lib/rabbitmq_server-2.7.1/sbin/../ebin/rabbit.app
home dir : /home/mpietrek
config file(s) : /home/mpietrek/work/var/run/rabbitmq.config

cookie hash    : pR5H9kY3Wra/XdLELT5hgQ==
log            : /home/mpietrek/work/logs/play.mpietrek.internal.illumita.com/rab...@play.log
sasl log       : /home/mpietrek/work/logs/play.mpietrek.internal.illumita.com/rab...@play-sasl.log
database dir   : /home/mpietrek/work/var/lib/rabbit@play

erlang version : 5.7.4

-- rabbit boot start
starting file handle cache server ...done
starting worker pool ...done
starting database

And this is the last output in the log file:

=INFO REPORT==== 14-Mar-2012::09:50:33 ===

Limiting to approx 924 file handles (829 sockets)

On the node that's the master (labeled "disc stats" in the Overview tab), there's nothing in the log about the new node joining.

Is there anyplace else I should be looking for clues to assist you? This issue is a pretty big spanner in the works for our rolling upgrade scenario.

Thanks much,

Matt

Reply all

Reply to author

Forward