RabbitMQ - cannot delete queue


Kristian Jensen

May 26, 2015, 2:38:48 PM
to rabbitm...@googlegroups.com, spex...@hotmail.com


We had a network partition and RabbitMQ ended up in "split brain".

After the cluster recovered, I have a queue that I can't delete. In the management interface the queue is just listed with "?", and I'm unable to delete it either from the management interface or from the command line.

I have tried to remove the node "sh-mq-cl1a-04" from the cluster, but the queue remains in the cluster.
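For reference, the commands I tried were along these lines (the queue name below is a placeholder, and the `eval` call uses a version-dependent internal API, so it is only a last resort):

```shell
# Remove the partitioned node from the cluster (run on a surviving node).
rabbitmqctl forget_cluster_node rabbit@sh-mq-cl1a-04

# Check whether the stuck queue is still listed on the default vhost.
rabbitmqctl list_queues -p / name messages

# Last resort on older versions: delete the queue record via an internal
# API call; "stuck-queue" is a placeholder for the actual queue name.
rabbitmqctl eval 'rabbit_amqqueue:internal_delete({resource, <<"/">>, queue, <<"stuck-queue">>}).'
```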



Michael Klishin

May 26, 2015, 7:23:03 PM
to rabbitm...@googlegroups.com, Kristian Jensen
Your queue's master was on the other side of the split then.

This sounds like a problem in earlier versions (before 3.4.x) where
queue processes that failed for whatever reason could not recover.

Restarting the node should help. 
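Restarting just the RabbitMQ application (without restarting the Erlang VM) is usually enough; the node name here is taken from the original post:

```shell
# Stop and start the RabbitMQ application on the affected node.
rabbitmqctl -n rabbit@sh-mq-cl1a-04 stop_app
rabbitmqctl -n rabbit@sh-mq-cl1a-04 start_app
```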
--
MK

Staff Software Engineer, Pivotal/RabbitMQ


Cristian Datculescu

Jun 15, 2015, 4:13:32 PM
to rabbitm...@googlegroups.com, spex...@hotmail.com
Just a small addition: we just had this today on 3.5.3 and restarting the node did not solve the issue.

Michael Klishin

Jun 15, 2015, 4:15:46 PM
to Cristian Datculescu, rabbitm...@googlegroups.com
On 15 June 2015 at 23:13:34, Cristian Datculescu (cristian....@gmail.com) wrote:
> Just a small addition: we just had this today on 3.5.3 and restarting
> the node did not solve the issue.

What is in the log files?

Cristian Datculescu

Jun 15, 2015, 5:21:45 PM
to rabbitm...@googlegroups.com, cristian....@gmail.com
Unfortunately nothing at all. We only managed to solve the issue accidentally, by removing one node (that looked healthy) from the cluster. After that the queue became editable/deletable, etc.

One other thing: after this split we have a lot of other issues: web management barely works on some of the nodes (60-70% of the requests just hang), stopping the app is impossible on some other nodes (we have to kill them manually), and a bunch of other things as well.

There is only one node that works flawlessly, but we are not sure what is up with the other nodes. Tomorrow we will try to rebuild the cluster around the functioning node (we cannot rebuild from scratch), and maybe I will be able to provide more details.

Thanks.

Michael Klishin

Jun 15, 2015, 5:23:47 PM
to Cristian Datculescu, rabbitm...@googlegroups.com
On 16 June 2015 at 00:21:47, Cristian Datculescu (cristian....@gmail.com) wrote:
> Unfortunately nothing at all.

Have you checked the SASL log? Was the queue marked with ? in the UI?

Cristian Datculescu

Jun 15, 2015, 5:34:16 PM
to rabbitm...@googlegroups.com, cristian....@gmail.com
Hello. I checked the logs; the only thing in there, from time to time, is something similar to this:

=CRASH REPORT==== 15-Jun-2015::19:44:15 ===
  crasher:
    initial call: application_master:init/4
    pid: <0.151.0>
    registered_name: []
    exception exit: {{timeout_waiting_for_tables,
                         [rabbit_user,rabbit_user_permission,rabbit_vhost,
                          rabbit_durable_route,rabbit_durable_exchange,
                          rabbit_runtime_parameters,rabbit_durable_queue]},
                     {rabbit,start,[normal,[]]}}
      in function  application_master:init/4 (application_master.erl, line 133)
    ancestors: [<0.150.0>]
    messages: [{'EXIT',<0.152.0>,normal}]
    links: [<0.150.0>,<0.7.0>]
    dictionary: []
    trap_exit: true
    status: running
    heap_size: 1598
    stack_size: 27
    reductions: 200
  neighbours:

Cristian Datculescu

Jun 15, 2015, 5:36:00 PM
to rabbitm...@googlegroups.com
Also the queue was marked with ? in the UI

Michael Klishin

Jun 15, 2015, 5:42:09 PM
to Cristian Datculescu, rabbitm...@googlegroups.com
On 16 June 2015 at 00:36:01, Cristian Datculescu (cristian....@gmail.com) wrote:
> Also the queue was marked with ? in the UI

timeout_waiting_for_tables means that nodes were restarted in random order and may or may not
be related to what you are seeing.

My guess is that you had a durable non-mirrored queue whose node became unavailable
(the Erlang process backing the queue could have failed, too, but then there should have
been at least something in the SASL log). In this case all operations on the queue would
wait for it to come back. There are two cases when the queue will be relocated to another node:

 * It is mirrored and there is a mirror in sync, so it can be promoted
 * The queue is not durable

There is an issue for introducing timeouts. Other than that, you should be using one of
the options above.

Again, this is a guess since there's nothing but timeout_waiting_for_tables in the log. 
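For the mirrored option, a policy along these lines (the policy name and pattern here are just an example) ensures there is an in-sync mirror that can be promoted:

```shell
# Mirror every queue on the default vhost to all nodes and
# synchronise new mirrors automatically.
rabbitmqctl set_policy -p / ha-all ".*" '{"ha-mode":"all","ha-sync-mode":"automatic"}'
```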

Cristian Datculescu

Jun 15, 2015, 5:47:59 PM
to rabbitm...@googlegroups.com, cristian....@gmail.com
The queue was durable and mirrored before we had the network issue (there was a vhost-wide policy for all queues/exchanges to be mirrored); otherwise I could understand the case. The SASL log part is, I think, a little curious as well.

Something that worries me is that the web interface is only working correctly on one node (most requests are failing on the other nodes), and whether this is a sign that the cluster is not healthy (all nodes but one have this issue).

Even newly added nodes have this issue (trying to get a message and reinsert it works only in 20-30% of cases). Also, for these there is no SASL log entry at all, so basically I am a little bit blind to what's happening inside.

Thanks for the answers; I hope tomorrow I can come up with more details to pinpoint the issue.

Michael Klishin

Jun 15, 2015, 5:50:42 PM
to Cristian Datculescu, rabbitm...@googlegroups.com
On 16 June 2015 at 00:48:01, Cristian Datculescu (cristian....@gmail.com) wrote:
> Even newly added nodes have this issue (trying to get a message
> and reinsert it works only 20-30% of cases).

Have you checked if the OS may be paging RabbitMQ out? (with vmstat)

Do you observe memory alarms? Does the queue fail (enter the "? state") after
you add a new node? 
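For example (the exact output will vary per system):

```shell
# Watch swap activity; non-zero si/so columns mean the OS is paging.
vmstat 5

# Node status output includes an {alarms, ...} section listing any
# memory or disk alarms currently in effect.
rabbitmqctl status
```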

Cristian Datculescu

Jun 15, 2015, 5:54:26 PM
to Michael Klishin, rabbitm...@googlegroups.com

No memory alarms, nothing. And the queue is now pretty much fine, no apparent issues with it. Still can't shake the feeling something is wrong.

Michael Klishin

Jun 15, 2015, 5:57:55 PM
to Cristian Datculescu, rabbitm...@googlegroups.com
 On 16 June 2015 at 00:54:25, Cristian Datculescu (cristian....@gmail.com) wrote:
> No memory alarms no nothing. And the queue is now pretty much
> fine, no apparent issues with it. Still can't shake the feeling
> something is wrong.

OK, please monitor the logs and post what seems suspicious.

Cristian Datculescu

Jun 16, 2015, 5:45:24 AM
to rabbitm...@googlegroups.com, cristian....@gmail.com
Hello. We have found a second queue in the same status.

We reduced the cluster to only one node, but the queue is stuck, and we cannot do anything with the vhost either (even worse, it is the "/" vhost).
I don't really understand what is happening, because with only one cluster node the queue should have recovered by now.

The node runs RabbitMQ 3.5.3.

Also no messages in the -sasl logs or any other logs, for that matter.

Cristian Datculescu

Jun 16, 2015, 5:47:53 AM
to rabbitm...@googlegroups.com, cristian....@gmail.com
Maybe important: eventually I got a log message:

=SUPERVISOR REPORT==== 16-Jun-2015::11:45:46 ===
     Supervisor: {<0.20803.75>, rabbit_channel_sup_sup}
     Context:    shutdown_error
     Reason:     shutdown
     Offender:   [{nb_children,1},
                  {name,channel_sup},
                  {mfargs,{rabbit_channel_sup,start_link,[]}},
                  {restart_type,temporary},
                  {shutdown,infinity},
                  {child_type,supervisor}]


anton kropp

Feb 24, 2016, 7:05:25 PM
to rabbitmq-users, cristian....@gmail.com
Are there any suggestions for this issue? I am running 3.5.4 and have encountered the same undeletable question-mark queue.

Michael Klishin

Aug 25, 2016, 7:12:20 PM
to rabbitmq-users, cristian....@gmail.com
Two related threads from this year:


I don't think we have seen this reported since 3.6.0, so please upgrade to 3.6.5.

祝暝

Aug 15, 2017, 9:52:47 AM
to rabbitmq-users, cristian....@gmail.com
I have this issue, too. Erlang 19.2, RabbitMQ 3.6.6. It is a mirrored queue that has an unsynchronised flag, and I cannot click "synchronise" to fix it. Anything I do to this queue is useless.
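In case it helps, synchronisation can also be triggered from the command line (the queue name here is a placeholder):

```shell
# Ask the queue's master to synchronise its mirrors explicitly.
rabbitmqctl sync_queue -p / my-queue
```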
 
On Friday, 26 August 2016 at 7:12:20 AM UTC+8, Michael Klishin wrote:

祝暝

Aug 16, 2017, 8:26:10 AM
to rabbitmq-users, cristian....@gmail.com
After one night, the master node restarted the others, and the stuck queue turned normal. (crying face)

On Tuesday, 15 August 2017 at 9:52:47 PM UTC+8, 祝暝 wrote: