rabbitmqctl list_queues hangs and never completes


Alexander Staubo

Sep 26, 2014, 2:44:25 PM
to rabbitm...@googlegroups.com
We have a cron job that captures metrics by running "rabbitmqctl list_queues". However, it often does not complete, and sometimes it seems to leave behind an orphaned beam.smp process. Eventually the box runs out of memory, naturally. Here is a paste showing some ps output:


There is nothing relevant in the RabbitMQ logs that I can see.

It is probably a side effect of other issues that make RabbitMQ unresponsive. But even if RabbitMQ is unresponsive, I would still like the command to time out rather than hang indefinitely. I have tried wrapping it with the "timeout" command from GNU coreutils, but that does not seem to kill the child process.

Simon MacMullen

Sep 29, 2014, 6:04:35 AM
to rabbitm...@googlegroups.com, Alexander Staubo
Are you running 3.3.5? If so, can you invoke

# rabbitmqctl eval 'rabbit_diagnostics:maybe_stuck().'

when this is happening and post the output somewhere?

Cheers, Simon

Alexander Staubo

Sep 29, 2014, 12:54:49 PM
to rabbitm...@googlegroups.com, Simon MacMullen
> Are you running 3.3.5? If so, can you invoke
>
> # rabbitmqctl eval 'rabbit_diagnostics:maybe_stuck().'
>
> when this is happening and post the output somewhere?

No, we're still on 3.3.1-1, actually. I will upgrade, since the list of changesets seems to include a number of clustering bug fixes.


Alexander Staubo

Jan 16, 2015, 11:43:44 AM
to rabbitm...@googlegroups.com, Simon MacMullen
For what it’s worth, we upgraded to 3.4.2 and this still happens occasionally.

Jean-Sébastien Pédron

Jan 19, 2015, 5:07:14 AM
to rabbitm...@googlegroups.com
On 16.01.2015 17:43, Alexander Staubo wrote:
> For what it’s worth, we upgraded to 3.4.2 and this still happens occasionally.

Hi!

When this happens again, could you please run the command Simon gave you
and post the output?

As a reminder, here it is:

rabbitmqctl eval 'rabbit_diagnostics:maybe_stuck().'

--
Jean-Sébastien Pédron
Pivotal / RabbitMQ

Alexander Staubo

Jan 19, 2015, 5:19:47 PM
to Jean-Sébastien Pédron, rabbitm...@googlegroups.com

> On Jan 19, 2015, at 05:07, Jean-Sébastien Pédron <jean-se...@rabbitmq.com> wrote:
> When this happens again, could you please run the command Simon gave you
> and post the output?

Same on every box:

# sudo rabbitmqctl eval 'rabbit_diagnostics:maybe_stuck().'
There are 715 processes.
Investigated 0 processes this round, 5000ms to go.
Investigated 0 processes this round, 4500ms to go.
Investigated 0 processes this round, 4000ms to go.
Investigated 0 processes this round, 3500ms to go.
Investigated 0 processes this round, 3000ms to go.
Investigated 0 processes this round, 2500ms to go.
Investigated 0 processes this round, 2000ms to go.
Investigated 0 processes this round, 1500ms to go.
Investigated 0 processes this round, 1000ms to go.
Investigated 0 processes this round, 500ms to go.
Found 0 suspicious processes.
ok

# pgrep -f "beam.smp" | wc -l
173

I have tarred up the current log files; I’ll email them to you privately.

Jean-Sébastien Pédron

Jan 20, 2015, 8:45:13 AM
to rabbitm...@googlegroups.com
On 19.01.2015 23:19, Alexander Staubo wrote:
> # pgrep -f "beam.smp" | wc -l
> 173

I guess most of those processes are hung rabbitmqctl processes; could
you please post the command lines of all of them (i.e. the output
of ps(1))?

To help debug this, could you also run the attached script on one of the
nodes where list_queues hangs? The script configures and starts the
debugger.

Then, run rabbitmqctl list_queues and wait half a dozen seconds (let it
hang).

Finally, run the following command to stop the debugger:

rabbitmqctl eval "dbg:stop_clear()."

This will have written a file called /tmp/list_queues-${nodename}.dbg.
Can you send it, please?
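(The attached start_dbg.sh is not reproduced in this archive. Purely as an illustration of the idea, and not the actual script, a file-based dbg trace of the list_queues code path could be set up roughly like this; the traced module/function and file name are assumptions:)

# Illustrative only -- not the contents of the attached start_dbg.sh.
# Start a file-based dbg tracer on the node, then trace calls into
# rabbit_amqqueue:info_all (assumed here to be the code path behind list_queues).
rabbitmqctl eval 'dbg:tracer(port, dbg:trace_port(file, "/tmp/list_queues.dbg")).'
rabbitmqctl eval 'dbg:p(all, call).'
rabbitmqctl eval 'dbg:tp(rabbit_amqqueue, info_all, cx).'
# Reproduce the hang (rabbitmqctl list_queues), then stop the tracer with
# rabbitmqctl eval "dbg:stop_clear()." as described above.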

The logs you sent me show that the cluster suffered a partial
partition. If the network is unreliable, clustering RabbitMQ nodes is
not recommended; the shovel and federation plugins are better suited to
that kind of situation.
start_dbg.sh

Alexander Staubo

Jan 20, 2015, 10:32:56 AM
to Jean-Sébastien Pédron, rabbitm...@googlegroups.com
On Jan 20, 2015, at 08:45, Jean-Sébastien Pédron <jean-se...@rabbitmq.com> wrote:
>
> On 19.01.2015 23:19, Alexander Staubo wrote:
>> # pgrep -f "beam.smp" | wc -l
>> 173
>
> I guess most of those processes are hanged rabbitmqctl processes; could
> you please post the command lines of all those processes (ie. the output
> of ps(1))?

You can see the ps output in my earlier gist: https://gist.github.com/atombender/42e416c42ef4f008a2d2 (the first file). Basically, every “list_queues” invocation starts its own Erlang VM, which ends up as one of those stuck processes. I have since added a timeout command to our monitoring that forcibly kills the command and its children on timeout. But it seems wrong for RabbitMQ not to enforce such a timeout itself.
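(For readers of the archive: a minimal sketch of such a wrapper, assuming a non-interactive cron/monitoring context with util-linux setsid available; the deadline and file path are illustrative, not our actual monitoring code.)

#!/usr/bin/env bash
# Illustrative sketch: run list_queues with a hard deadline and make sure the
# Erlang VM it spawns dies with it. setsid puts rabbitmqctl into its own
# session/process group, so the group id equals $pid (true when launched from
# a non-interactive script such as a cron job).
setsid rabbitmqctl list_queues name messages > /tmp/queue_metrics.txt &
pid=$!

for _ in $(seq 1 30); do
    kill -0 "$pid" 2>/dev/null || exit 0   # finished within the deadline
    sleep 1
done

# Still running after ~30s: SIGKILL the whole process group
# (rabbitmqctl plus any beam.smp it started).
kill -KILL -- "-$pid" 2>/dev/null
exit 1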

> To help debug this, could you also run the attached script on one of the
> nodes where list_queues hangs? The script configures and starts the
> debugger.

Thanks, I will do this the next time the same partition situation arises.

> The logs you sent me show that the cluster sufferred partial
> partitioning. If the network is unreliable, clustering RabbitMQ nodes is
> not recommended. The shovel and federation plugins are better suited for
> that kind of situation.

We’re on Digital Ocean, and we don’t really have network outages, but we do occasionally experience micro-outages where the network chokes for a second or two and then returns to normal. I know that RabbitMQ deals badly with this, but the issues I’m reporting are clearly bugs. RabbitMQ is frequently not able to come back from a partition without manual intervention (e.g., stopping and starting, and/or running manual commands to fix state such as missing bindings).

Laing, Michael

Jan 20, 2015, 10:49:14 AM
to Alexander Staubo, Jean-Sébastien Pédron, rabbitm...@googlegroups.com
Looks like Digital Ocean networking is too unreliable for clustering - or your nodes are overloaded.

ml



Alexander Staubo

Jan 20, 2015, 11:08:51 AM
to Laing, Michael, Jean-Sébastien Pédron, rabbitm...@googlegroups.com
> On Jan 20, 2015, at 10:49, Laing, Michael <michae...@nytimes.com> wrote:
>
> Looks like Digital Ocean networking is too unreliable for clustering - or your nodes are overloaded.

Could very well be. That said, list_queues shouldn’t ever hang.

Michael Klishin

Jan 20, 2015, 11:09:51 AM
to Alexander Staubo, Laing, Michael, Jean-Sébastien Pédron, rabbitm...@googlegroups.com
On 20 January 2015 at 19:08:51, Alexander Staubo (al...@purefiction.net) wrote:
> Could very well be. That said, list_queues shouldn’t ever hang.

Alexander,

I agree that there should be a timeout, presumably for every `rabbitmqctl` operation. We will file a bug.
--
MK

Staff Software Engineer, Pivotal/RabbitMQ

Esmail Fadae

May 25, 2016, 3:55:28 PM
to rabbitmq-users, al...@purefiction.net, michae...@nytimes.com, jean-se...@rabbitmq.com
Apologies for the necropost, but I'm also encountering an indefinite hang in RabbitMQ 3.6.2 on Ubuntu, except in my case it is with `rabbitmqctl list_vhosts`. I'm also getting nothing peculiar from `rabbitmqctl eval 'rabbit_diagnostics:maybe_stuck().'`

Fortunately, the `timeout` shell command appears to be a functioning workaround and should work the same for any other `rabbitmqctl` subcommand; e.g.:
timeout 2 rabbitmqctl list_vhosts
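If the plain SIGTERM ever fails to kill the spawned Erlang VM (as reported earlier in this thread), a more defensive variant is to let GNU timeout escalate to SIGKILL; the durations below are arbitrary examples:
timeout --kill-after=5 10 rabbitmqctl list_vhosts
Here --kill-after sends SIGKILL if the command is still alive 5 seconds after the initial SIGTERM.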

P.S. Related Redhat bug: https://bugzilla.redhat.com/show_bug.cgi?id=1337704.

Michael Klishin

May 25, 2016, 3:59:30 PM
to rabbitm...@googlegroups.com
If you are listing connections, channels, or queues, those can take a while to collect and display in 3.6.x,
depending on how many you have.

3.7.0 will do this in parallel on all nodes and stream results:


Joshua Ma

Aug 19, 2016, 4:44:30 AM
to rabbitmq-users
We just experienced the same thing - it didn't recover until I ran `sudo rabbitmqctl stop` and restarted the server. I ran the maybe_stuck() diagnostics before stopping, though; they're attached to this post.

We also noticed a queue get really large at the same time (it was named celeryev.3a5d6...etc...long...gibberish - we use Celery on top of RabbitMQ): it had 52,500 messages before I ran a policy to cap the queue length.
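(For reference, a length cap like that can be set with a policy along the following lines; the pattern and limit here are illustrative, not the exact policy I applied:)
rabbitmqctl set_policy --apply-to queues celeryev-cap "^celeryev\." '{"max-length": 10000}'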

Any ideas how I can troubleshoot or prevent this? Thanks in advance!

- Josh
maybe_stuck.txt

Michael Klishin

Aug 19, 2016, 5:09:17 AM
to rabbitm...@googlegroups.com
Most of the processes are HTTP API aliveness checks; the others are in the delegate module.
Without knowing anything about the version used or having logs, this makes me think of a few known issues,


some of which come down to timeouts set to infinity in various places. They are being eradicated
with pretty much every release, so schedule an upgrade to 3.6.5 (or 3.6.x in general but not 3.6.2).


Joshua Ma

Aug 19, 2016, 5:12:44 AM
to rabbitm...@googlegroups.com
Hm, we're running 3.6.5, and this is a 2-week-old cluster:
```
Status of node 'rabbit@ip-10-12-5-214' ...
[{pid,29258},
 {running_applications,
     [{rabbitmq_management,"RabbitMQ Management Console","3.6.5"},
      {rabbitmq_web_dispatch,"RabbitMQ Web Dispatcher","3.6.5"},
      {webmachine,"webmachine","1.10.3"},
      {mochiweb,"MochiMedia Web Server","2.13.1"},
      {rabbitmq_management_agent,"RabbitMQ Management Agent","3.6.5"},
      {rabbit,"RabbitMQ","3.6.5"},
      {os_mon,"CPO  CXC 138 46","2.4.1"},
      {inets,"INETS  CXC 138 49","6.3"},
      {amqp_client,"RabbitMQ AMQP Client","3.6.5"},
      {rabbit_common,[],"3.6.5"},
      {xmerl,"XML parser","1.3.11"},
      {syntax_tools,"Syntax tools","2.0"},
      {ssl,"Erlang/OTP SSL application","8.0"},
      {compiler,"ERTS  CXC 138 10","7.0"},
      {mnesia,"MNESIA  CXC 138 12","4.14"},
      {public_key,"Public key infrastructure","1.2"},
      {crypto,"CRYPTO","3.7"},
      {asn1,"The Erlang ASN1 compiler version 4.0.3","4.0.3"},
      {ranch,"Socket acceptor pool for TCP protocols.","1.2.1"},
      {sasl,"SASL  CXC 138 11","3.0"},
      {stdlib,"ERTS  CXC 138 10","3.0"},
      {kernel,"ERTS  CXC 138 10","5.0"}]},
 {os,{unix,linux}},
 {erlang_version,
     "Erlang/OTP 19 [erts-8.0] [source] [64-bit] [async-threads:64] [kernel-poll:true]\n"},
 {memory,
     [{total,52857568},
      {connection_readers,0},
      {connection_writers,0},
      {connection_channels,0},
      {connection_other,277512},
      {queue_procs,528328},
      {queue_slave_procs,1353272},
      {plugins,504008},
      {other_proc,14195664},
      {mnesia,194920},
      {mgmt_db,23488},
      {msg_index,54376},
      {other_ets,1590488},
      {binary,821488},
      {code,24690254},
      {atom,1033401},
      {other_system,7590369}]},
```

I'll take a look through the github issues, thanks!

--
Joshua Ma
Engineering Tech Lead
Benchling | benchling.com

Michael Klishin

Aug 19, 2016, 5:26:52 AM
to rabbitm...@googlegroups.com
Worth mentioning that `rabbitmqctl list_*` commands these days have a --timeout flag
which may be a suitable workaround in some cases.
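For example (the exact flag spelling and placement may differ between rabbitmqctl versions):
rabbitmqctl list_queues --timeout 30 name messages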




Joshua Ma

Aug 19, 2016, 3:52:01 PM
to rabbitm...@googlegroups.com
I took a look at the logs now, and we see that pause_minority was hit:
"""
=ERROR REPORT==== 19-Aug-2016::07:25:04 ===
Partial partition detected:
 * We saw DOWN from rabbit@ip-10-12-5-214
 * We can still see rabbit@ip-10-12-3-36 which can see rabbit@ip-10-12-5-214
 * pause_minority mode enabled
We will therefore pause until the *entire* cluster recovers
"""

What's weird is that I ran cluster_status at the time and it didn't show anything unusual:
"""
josh@ip-10-12-3-36:~$ sudo rabbitmqctl cluster_status
Cluster status of node 'rabbit@ip-10-12-3-36' ...
[{nodes,[{disc,['rabbit@ip-10-12-1-116','rabbit@ip-10-12-3-36',
                'rabbit@ip-10-12-5-214']}]},
 {running_nodes,['rabbit@ip-10-12-1-116','rabbit@ip-10-12-5-214',
                 'rabbit@ip-10-12-3-36']},
 {cluster_name,<<"rabbit@main-rabbitmq-1">>},
 {partitions,[]},
 {alarms,[{'rabbit@ip-10-12-1-116',[]},
          {'rabbit@ip-10-12-5-214',[]},
          {'rabbit@ip-10-12-3-36',[]}]}]
"""

Is there something I can run after the fact to see whether pause_minority has been triggered?
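(One low-tech check is simply to search the node logs for the messages quoted above; the log path below assumes a default Debian/Ubuntu layout and is only illustrative:)
grep -E "Partial partition detected|pause_minority mode enabled" /var/log/rabbitmq/rabbit@*.log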