rabbitmqctl list_channels hangs

ctc...@gmail.com

Mar 4, 2015, 11:57:55 AM
to rabbitm...@googlegroups.com
Hi,

We have a rabbitmq cluster across 3 machines (as part of an HA OpenStack installation). 
If I turn off one of the machines, the rabbitmq cluster appears to cope, but we have found that the command "rabbitmqctl list_channels" hangs, 
and the only way to recover from this appears to be to restart rabbitmq on the remaining nodes.

Running with ERL_INET_GETHOST_DEBUG set:

# rabbitmqctl list_channels
Listing channels ...
inet_gethost[9013] (DEBUG):Saved domainname .
inet_gethost[9013] (DEBUG):Created worker[9014] with fd 3
inet_gethost[9013] (DEBUG):Saved domainname .
inet_gethost[9014] (DEBUG):Worker got request, op = 1, proto = 1, data = ******.
inet_gethost[9014] (DEBUG):Starting gethostbyname(******)
inet_gethost[9014] (DEBUG):gethostbyname OK
^C
Session terminated, killing shell...inet_gethost[9013] (DEBUG):End of file while reading from pipe.
inet_gethost[9013] (DEBUG):Erlang has closed.
...killed.

Any idea what could be happening here, or how to debug it further?

(This is running on CentOS 7, with rabbitmq-server-3.3.5-4.el7.noarch)

Chris

Michael Klishin

Mar 4, 2015, 12:14:57 PM
to rabbitm...@googlegroups.com, ctc...@gmail.com
On 4 March 2015 at 19:57:57, ctc...@gmail.com (ctc...@gmail.com) wrote:
> If I turn off one of the machines, the rabbitmq cluster appears
> to cope, but we have found that the command "rabbitmqctl list_channels"
> hangs,

How much time passes between shutting down one node and trying to list_channels ?

Do you see "node down" in the logs,
as mentioned in https://www.rabbitmq.com/nettick.html?
--
MK

Staff Software Engineer, Pivotal/RabbitMQ

ctc...@gmail.com

Mar 4, 2015, 12:24:48 PM
to rabbitm...@googlegroups.com, ctc...@gmail.com
We have a job that runs every 5 minutes and calls rabbitmqctl list_channels, so anywhere between 0 and 5 minutes. I just ran a test, and the job happened to run about 8 seconds before the "node down" message, and is hanging. I'll try again, but time it better, to allow the "node down" message to appear first.

Chris

 

Michael Klishin

Mar 4, 2015, 12:28:29 PM
to rabbitm...@googlegroups.com, ctc...@gmail.com
On 4 March 2015 at 20:24:50, ctc...@gmail.com (ctc...@gmail.com) wrote:
> We have a job that runs every 5 minutes and calls rabbitmqctl
> list_channels, so anywhere between 0 and 5 minutes. I just ran
> a test, and the job happened to run about 8 seconds before the "node
> down" message, and is hanging. I'll try again, but time it better,
> to allow the "node down" message to appear first.

If the cluster hasn't detected that a node went down, nodes and tools will try to contact it
and wait for a response.

We'll take a look if list_channels could benefit from having a lower timeout. In the meantime you can
adjust kernel.net_ticktime for your cluster (see the link in the earlier email), to 6-15 seconds
(with lower values the risk of false positives becomes fairly high).
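
As a sketch of what that adjustment could look like: net_ticktime is a setting of the Erlang kernel application, so in the classic rabbitmq.config format it goes under a kernel section. The helper name ticktime_config and the value 10 below are only illustrative. The fragment would need merging into any existing config, and as far as I know every node should use the same value and be restarted for it to take effect.

```shell
# Print a candidate rabbitmq.config fragment (Erlang term format) that sets
# kernel.net_ticktime to the number of seconds given as the first argument.
# ticktime_config is a hypothetical helper, not part of RabbitMQ.
ticktime_config() {
    cat <<EOF
[
  {kernel, [{net_ticktime, $1}]}
].
EOF
}

# Example: 10 seconds, picked from the 6-15 second range suggested above.
ticktime_config 10
```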

ctc...@gmail.com

Mar 4, 2015, 12:35:21 PM
to rabbitm...@googlegroups.com, ctc...@gmail.com


On Wednesday, March 4, 2015 at 5:28:29 PM UTC, Michael Klishin wrote:
> On 4 March 2015 at 20:24:50, ctc...@gmail.com (ctc...@gmail.com) wrote:
>> We have a job that runs every 5 minutes and calls rabbitmqctl
>> list_channels, so anywhere between 0 and 5 minutes. I just ran
>> a test, and the job happened to run about 8 seconds before the "node
>> down" message, and is hanging. I'll try again, but time it better,
>> to allow the "node down" message to appear first.
>
> If the cluster hasn't detected that a node went down, nodes and tools will try to contact it
> and wait for a response.

Should it wait forever (as it appears to)?

> We'll take a look if list_channels could benefit from having a lower timeout. In the meantime you can
> adjust kernel.net_ticktime for your cluster (see the link in the earlier email), to 6-15 seconds
> (with lower values the risk of false positives becomes fairly high).

I'll take a look at that, thanks.

ctc...@gmail.com

Mar 5, 2015, 4:30:14 AM
to rabbitm...@googlegroups.com, ctc...@gmail.com
I tried another test, and made sure that "rabbitmqctl list_channels" was only run after the "node down" message had appeared in the rabbitmq log, but the list_channels command still hangs. 

Michael Klishin

Mar 5, 2015, 4:35:24 AM
to rabbitm...@googlegroups.com, ctc...@gmail.com
 On 5 March 2015 at 12:30:16, ctc...@gmail.com (ctc...@gmail.com) wrote:
> I tried another test, and made sure that "rabbitmqctl list_channels"
> was only run after the "node down" message had appeared in the
> rabbitmq log, but the list_channels command still hangs.

Apparently the timeout for rabbitmqctl operations is "infinity". We'll look into making it
a minute or so.

Simon MacMullen

Mar 5, 2015, 5:29:40 AM
to Michael Klishin, rabbitm...@googlegroups.com, ctc...@gmail.com
On 05/03/15 09:35, Michael Klishin wrote:
> On 5 March 2015 at 12:30:16, ctc...@gmail.com (ctc...@gmail.com) wrote:
>> I tried another test, and made sure that "rabbitmqctl list_channels"
>> was only run after the "node down" message had appeared in the
>> rabbitmq log, but the list_channels command still hangs.
>
> Apparently the timeout for rabbitmqctl operations is "infinity". We'll look into making it
> a minute or so.

Whenever we have had timeouts for any RPC-type operations in the past,
someone has had a slow enough server to run into them without anything
actually being stuck.

Anyway, it sounds like something really is stuck here. That warrants
investigation, not papering over with a timeout.

OP, can you invoke:

$ rabbitmqctl eval 'rabbit_diagnostics:maybe_stuck().'

on each node in the cluster and then send us the output, when this
occurs again?
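
If logging in to every machine is inconvenient, one way to collect this from a single host is rabbitmqctl's -n option, which targets a named node. A minimal sketch, assuming the host shares the cluster's Erlang cookie and can reach the other nodes; the function name collect_stuck and the node names in the usage line are hypothetical:

```shell
# collect_stuck OUTDIR NODE...
# Runs the diagnostic against each named node and saves the output to
# OUTDIR/<host part of the node name>.out (e.g. rabbit@node1 -> node1.out).
# A failure on one node is reported on stderr and the loop continues.
collect_stuck() {
    outdir="$1"; shift
    for node in "$@"; do
        rabbitmqctl -n "$node" eval 'rabbit_diagnostics:maybe_stuck().' \
            > "$outdir/${node#*@}.out" 2>&1 \
            || echo "diagnostics failed for $node" >&2
    done
}
```

For example, `collect_stuck /tmp rabbit@node1 rabbit@node2` would leave node1.out and node2.out in /tmp.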

Cheers, Simon

Michael Klishin

Mar 5, 2015, 5:36:36 AM
to rabbitm...@googlegroups.com, Simon MacMullen, ctc...@gmail.com
On 5 March 2015 at 13:29:37, Simon MacMullen (si...@rabbitmq.com) wrote:
> Whenever we have had timeouts for any RPC-type operations in
> the past,
> someone has had a slow enough server to run into them without anything
> actually being stuck.

It can be opt-in. I can certainly see how timeout = infinity would raise eyebrows with some ops
people. 

ctc...@gmail.com

Mar 5, 2015, 6:02:43 AM
to rabbitm...@googlegroups.com, mkli...@pivotal.io, ctc...@gmail.com
I've run that on the other two nodes, the results are attached

thanks, 

Chris 
node1.out
node2.out

Simon MacMullen

Mar 5, 2015, 7:13:22 AM
to ctc...@gmail.com, rabbitm...@googlegroups.com, mkli...@pivotal.io
On 05/03/15 11:02, ctc...@gmail.com wrote:
> I've run that on the other two nodes, the results are attached

Thanks. This looks like bug 26404, fixed in 3.4.0:

26404 prevent queue synchronisation from hanging if there is a very
short partition just as it starts (since 3.1.0)

The release notes are actually a bit over-compressed here: it can be
"queue shutdown" as well as queue synchronisation. See the fix
committed here:

http://hg.rabbitmq.com/rabbitmq-server/rev/f9806ca76a80

Certainly process <5324.10313.1> (queue master for
"scheduler_fanout_c1e2208123ae4e1290b072e79f07c15e" on node 1) is stuck
due to what looks very much like that bug, and various channels are then
stuck waiting for that queue to respond.

Cheers, Simon

ctc...@gmail.com

Mar 5, 2015, 7:29:58 AM
to rabbitm...@googlegroups.com, ctc...@gmail.com, mkli...@pivotal.io
Thanks Simon, I'll try and get the fix and try it out. Until then, is there any way to recover from this, short of restarting the rabbitmq servers?

Chris

 
 

Simon MacMullen

Mar 5, 2015, 7:32:36 AM
to ctc...@gmail.com, rabbitm...@googlegroups.com, mkli...@pivotal.io
On 05/03/15 12:29, ctc...@gmail.com wrote:
> Thanks Simon, I'll try and get the fix and try it out. Until then, is
> there any way to recover from this, short of restarting the rabbitmq
> servers?

Not one that isn't going to be quite painful.

In principle you could kill the stuck queue, but the "rabbitmqctl eval"
gymnastics to do that would be fiddly.

Cheers, Simon

ctc...@gmail.com

Mar 5, 2015, 8:25:04 AM
to rabbitm...@googlegroups.com, ctc...@gmail.com, mkli...@pivotal.io
I tried using rabbitmq-server-3.4.4-1.noarch.rpm from http://www.rabbitmq.com/install-rpm.html and I'm still seeing a hang. In fact it now hangs on "rabbitmqctl list_queues" too :(

I've attached diagnostics output from one of the nodes.

Chris
node1.out

Michael Klishin

Mar 5, 2015, 8:27:06 AM
to rabbitm...@googlegroups.com, ctc...@gmail.com
On 5 March 2015 at 16:25:04, ctc...@gmail.com (ctc...@gmail.com) wrote:
> I tried using rabbitmq-server-3.4.4-1.noarch.rpm from http://www.rabbitmq.com/install-rpm.html
> and I'm still seeing a hang. In fact it now hangs on "rabbitmqctl
> list_queues" too :(

All rabbitmqctl operations share a timeout setting. Even master currently has it set to 'infinity'. 

ctc...@gmail.com

Mar 5, 2015, 10:44:59 AM
to rabbitm...@googlegroups.com, ctc...@gmail.com
The hanging still seems like a bug though, since once the server is in this state all subsequent rabbitmqctl list_channels calls hang.

Chris 

Michael Klishin

Mar 5, 2015, 11:13:19 AM
to ctc...@gmail.com, rabbitm...@googlegroups.com
Have we suggested otherwise?

MK

ctc...@gmail.com

Mar 5, 2015, 11:19:29 AM
to rabbitm...@googlegroups.com, ctc...@gmail.com


On Thursday, March 5, 2015 at 4:13:19 PM UTC, Michael Klishin wrote:
> Have we suggested otherwise?
>
> MK
>
> On 5/3/2015, at 18:44, ctc...@gmail.com wrote:
>> On Thursday, March 5, 2015 at 1:27:06 PM UTC, Michael Klishin wrote:
>>> On 5 March 2015 at 16:25:04, ctc...@gmail.com (ctc...@gmail.com) wrote:
>>>> I tried using rabbitmq-server-3.4.4-1.noarch.rpm from http://www.rabbitmq.com/install-rpm.html
>>>> and I'm still seeing a hang. In fact it now hangs on "rabbitmqctl
>>>> list_queues" too :(
>>>
>>> All rabbitmqctl operations share a timeout setting. Even master currently has it set to 'infinity'.
>>
>> The hanging still seems like a bug though, since once the server is in this state all subsequent rabbitmqctl list_channels calls hang.
>>
>> Chris

I mean that if it is a bug, it's still in 3.4.4, and the fix Simon pointed to doesn't fix it. Do I need to raise a bug report? If so, how do I go about that?

Chris

Michael Klishin

Mar 5, 2015, 11:45:11 AM
to ctc...@gmail.com, rabbitm...@googlegroups.com
We have filed it already. After the 3.5.0 release we will switch to GitHub issues.

MK

ctc...@gmail.com

Mar 5, 2015, 11:46:24 AM
to rabbitm...@googlegroups.com, ctc...@gmail.com
Thanks Michael, any idea when a fix might be available?

Chris 

Michael Klishin

Mar 5, 2015, 11:49:43 AM
to ctc...@gmail.com, rabbitm...@googlegroups.com
Next week with 3.5.0.

MK

Simon MacMullen

Mar 5, 2015, 11:58:15 AM
to ctc...@gmail.com, rabbitm...@googlegroups.com, mkli...@pivotal.io
On 05/03/15 13:25, ctc...@gmail.com wrote:
> I tried using rabbitmq-server-3.4.4-1.noarch.rpm from
> http://www.rabbitmq.com/install-rpm.html and I'm still seeing a hang. In
> fact it now hangs on "rabbitmqctl list_queues" too :(
>
> I've attached diagnostics output from one of the nodes.

Thanks. So I have some idea of what might be going on here, but it
requires quite a carefully / unfortunately timed network partition. So
I'm curious about how you're running into this so easily.

What exactly are you doing? Just powering off machines?

Do you see the string "running_partitioned_network" in your logs?

Cheers, Simon

ctc...@gmail.com

Mar 5, 2015, 12:12:25 PM
to rabbitm...@googlegroups.com, ctc...@gmail.com, mkli...@pivotal.io
Yes, the machines are VMs under KVM and I'm just forcing one machine off (so similar to powering off, I guess). We're trying to set up an HA OpenStack, and we need to be able to cope with such a failure.

I can't see that string in the rabbitmq logs.

Chris

Michael Klishin

Mar 9, 2015, 9:04:29 AM
to rabbitm...@googlegroups.com, ctc...@gmail.com
 On 5 March 2015 at 20:12:26, ctc...@gmail.com (ctc...@gmail.com) wrote:
> Next week with 3.5.0.

After putting together the most minimalistic implementation possible to be included
into 3.5.0 [1], we've decided it is not very useful and needs to be extended in scope
to list unresponsive/unavailable items (channels, queues, etc).

For now, we recommend wrapping non-interactive `rabbitmqctl list_*` invocations in the Linux timeout utility [2].
A more sophisticated improvement in `rabbitmqctl` itself will be in 3.6.0.

1. https://github.com/rabbitmq/rabbitmq-server/pull/61
2. http://linux.die.net/man/1/timeout
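
For a periodic job like the one described earlier in the thread, wrapping the call in timeout might look like the sketch below. The function name run_with_limit and the 60-second limit in the usage line are assumptions for illustration. GNU timeout exits with status 124 when it has to kill the command, which the wrapper uses to log the hang.

```shell
# run_with_limit SECONDS COMMAND [ARGS...]
# Runs COMMAND under timeout(1) from GNU coreutils and returns its exit
# status. If the time limit is hit (status 124), a note is written to stderr.
# run_with_limit is a hypothetical helper name, not a RabbitMQ tool.
run_with_limit() {
    limit="$1"; shift
    timeout "$limit" "$@"
    status=$?
    if [ "$status" -eq 124 ]; then
        echo "timed out after ${limit}s: $*" >&2
    fi
    return "$status"
}
```

A cron job could then call `run_with_limit 60 rabbitmqctl list_channels` and treat exit status 124 as "the broker did not answer in time" rather than blocking indefinitely.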

ctc...@gmail.com

Mar 9, 2015, 11:02:01 AM
to rabbitm...@googlegroups.com, ctc...@gmail.com
Thanks! 

ctc...@gmail.com

Mar 10, 2015, 4:26:43 AM
to rabbitm...@googlegroups.com, ctc...@gmail.com
Just to clarify, will the fix in 3.6.0 give a way to recover from unresponsive/unavailable items without having to restart the whole rabbitmq cluster?

Chris 

Michael Klishin

Mar 10, 2015, 4:42:34 AM
to rabbitm...@googlegroups.com, ctc...@gmail.com
 On 10 March 2015 at 11:26:45, ctc...@gmail.com (ctc...@gmail.com) wrote:
> Just to clarify, will the fix in 3.6.0 give a way to recover from
> unresponsive/unavailable items without having to restart
> the whole rabbitmq cluster?

There will be a way to specify a timeout per item (e.g. channel), so list_* operations won't wait forever.

Entities on unreachable nodes will be marked as such. `rabbitmqctl` can be used to remove a node from the cluster.
How soon connections and channels on unreachable nodes should be considered to be gone forever is orthogonal.
We are certainly happy to make reasonable changes in that area, too, but having timeouts alone should eliminate
the need to force-remove nodes from the cluster.

ctc...@gmail.com

Mar 10, 2015, 5:27:34 AM
to rabbitm...@googlegroups.com, ctc...@gmail.com
The timeout will be useful, but what I am seeing is that after bringing the failed node back into the cluster the rabbitmqctl list_channels command still hangs, and the only way to recover from this seems to be to restart the whole cluster.

Chris 

Michael Klishin

Mar 10, 2015, 5:31:34 AM
to rabbitm...@googlegroups.com, ctc...@gmail.com
 On 10 March 2015 at 12:27:36, ctc...@gmail.com (ctc...@gmail.com) wrote:
> The timeout will be useful, but what I am seeing is that after
> bringing the failed node back into the cluster the rabbitmqctl
> list_channels command still hangs, and the only way to recover
> from this seems to be to restart the whole cluster.

Can you post rabbitmqctl report output when that happens? (after you bring the node back) 

ctc...@gmail.com

Mar 10, 2015, 5:51:06 AM
to rabbitm...@googlegroups.com, ctc...@gmail.com
Not surprisingly, rabbitmqctl report also hangs when trying to list the channels :(
Output attached.

Chris

 
 
rabbitmqctl_report.out

ctc...@gmail.com

Mar 13, 2015, 6:04:48 AM
to rabbitm...@googlegroups.com, ctc...@gmail.com
Any news on this?

Chris 

Michael Klishin

Mar 13, 2015, 7:23:59 AM
to ctc...@gmail.com, rabbitm...@googlegroups.com
When there is news or we need more information, we usually post to the list.

MK