[3.6.0] Can't remove a queue after a cluster fail-over / recovery (queue stuck / corrupted)

Michael Denny

unread,

Apr 14, 2016, 6:46:09 AM4/14/16

to rabbitmq-users

I'm using RabbitMQ v3.6.0 with Erlang Erlang 18.2.1 on Windows (yes on windows, sorry about that, we're moving on linux cluster),

we have a cluster configuration with two nodes A and B with HA enabled on all queues, and everything is persisted on disk (queues and messages).

after stopping one node A in a cluster, with an HA configuration, everything move smooth to the node B, but suddenly a queue (the easynetq error queue that usually has some messages in there) with around 1 hundred messages stopped to work, I can't publish on it, I can't consume from it, I can't synchronize and I can't event remove it.

This is the rabbitmq log about `EasyNetQ_Default_Error_Queue` queue:

=INFO REPORT==== 14-Apr-2016::09:53:02 ===

Mirrored queue 'EasyNetQ_Default_Error_Queue' in vhost '/': Slave <rab...@b0p2-rabdef01.1.9129.2> saw deaths of mirrors <rab...@b0p2-rabdef02.3.30395.1>

=INFO REPORT==== 14-Apr-2016::09:53:02 ===

Mirrored queue 'EasyNetQ_Default_Error_Queue' in vhost '/': Promoting slave <rab...@b0p2-rabdef01.1.9129.2> to master

=WARNING REPORT==== 14-Apr-2016::10:01:39 ===

Mirrored queue 'EasyNetQ_Default_Error_Queue' in vhost '/': Stopping all nodes on master shutdown since no synchronised slave is available

=ERROR REPORT==== 14-Apr-2016::10:08:54 ===

Channel error on connection <0.17962.0> ([::1]:49252 -> [::1]:5672, vhost: '/', user: 'guest'), channel 1:

operation queue.declare caused a channel exception not_found: "home node 'rabbit@b0p2-rabdef01' of durable queue 'EasyNetQ_Default_Error_Queue' in vhost '/' is down or inaccessible"

the last error is when I tried to dump the queue on disk when in fact we were on this corrupted state.

corresponding with the `=WARNING REPORT==== 14-Apr-2016::10:01:39 ===` I've have "supervisor report" and "crash report" on sasl.log (I've attached the full file):

=SUPERVISOR REPORT==== 14-Apr-2016::10:01:34 ===

Supervisor: {<0.391.1164>,rabbit_channel_sup_sup}

Context: shutdown_error

Reason: shutdown

Offender: [{nb_children,2},

{name,channel_sup},

{mfargs,{rabbit_channel_sup,start_link,[]}},

{restart_type,temporary},

{shutdown,infinity},

{child_type,supervisor}]

=CRASH REPORT==== 14-Apr-2016::10:01:40 ===

crasher:

initial call: rabbit_reader:init/4

pid: <0.11029.1155>

registered_name: []

exception exit: channel_termination_timeout

in function rabbit_reader:wait_for_channel_termination/3 (src/rabbit_reader.erl, line 766)

in call from rabbit_reader:send_error_on_channel0_and_close/4 (src/rabbit_reader.erl, line 1503)

in call from rabbit_reader:terminate/2 (src/rabbit_reader.erl, line 615)

in call from rabbit_reader:handle_other/2 (src/rabbit_reader.erl, line 540)

in call from rabbit_reader:mainloop/4 (src/rabbit_reader.erl, line 502)

in call from rabbit_reader:run/1 (src/rabbit_reader.erl, line 427)

in call from rabbit_reader:start_connection/4 (src/rabbit_reader.erl, line 385)

ancestors: [<0.181.1150>,<0.373.0>,<0.372.0>,<0.371.0>,rabbit_sup,

<0.122.0>]

messages: [emit_stats,

{tcp,#Port<0.30851122>,<<8,0,0,0,0,0,0,206>>},

ensure_stats,

{'EXIT',#Port<0.30851122>,normal}]

links: []

dictionary: [{credit_blocked,[]},

{{ch_pid,<0.20995.1163>},{2,#Ref<0.0.892600321.220875>}},

{credit_flow_default_credit,{200,50}},

{process_name,

{rabbit_reader,<<"10.2.1.16:49923 -> 10.2.2.11:5672">>}},

{{channel,2},

{<0.20995.1163>,{method,rabbit_framing_amqp_0_9_1}}}]

trap_exit: true

status: running

heap_size: 2586

stack_size: 27

reductions: 1777974

neighbours:

this is the situation of the queue from the web mgmt:

apart this bad behavior, that I would like to investigate before re-creating the cluster from scratch (probably tonight), I have a couple of questions:

Stopping the service in windows what actually does? Is it terminating the rabbitmq process gracefully?
Is calling `rabbitmqctl stop_app` the best way to stop a node in a cluster while doing ordinary maintenance (system updates/upgrades) instead of stopping the windows service or is the same?
What about in linux? if I have the rabbitmq running in background, `stop_app` and `start_app` are the best command to stop a node? after it is stopped I just kill the process? I'm creating a new cluster in another environment on CentOS 7 and I would like to know how to manage in linux as well.

as I'm writing this I found some old similar post:

Thank you to everyone would help me on this!

rabbit@b0p2-rabdef01-sasl.zip

Michael Denny

unread,

Apr 14, 2016, 6:50:30 AM4/14/16

to rabbitmq-users

I've attached also the sasl.log of the node B (02) it contains this crash report regard the corrupted error queue:

=CRASH REPORT==== 14-Apr-2016::10:31:11 ===

crasher:

initial call: gen:init_it/6

pid: <0.4021.3>

registered_name: []

exception exit: {{badmatch,true},

[{rabbit_amqqueue,'-with/4-fun-0-',5,

[{file,"src/rabbit_amqqueue.erl"},{line,430}]},

{rabbit_channel,handle_method,3,

[{file,"src/rabbit_channel.erl"},{line,1307}]},

{rabbit_channel,handle_cast,2,

[{file,"src/rabbit_channel.erl"},{line,454}]},

{gen_server2,handle_msg,2,

[{file,"src/gen_server2.erl"},{line,1049}]},

{proc_lib,init_p_do_apply,3,

[{file,"proc_lib.erl"},{line,240}]}]}

in function gen_server2:terminate/3 (src/gen_server2.erl, line 1160)

ancestors: [<0.4018.3>,<0.3781.3>,<0.3776.3>,<0.3775.3>,<0.14359.0>,

<0.14358.0>,<0.14357.0>,rabbit_sup,<0.147.0>]

messages: []

links: [<0.4018.3>]

dictionary: [{delegate,delegate_13},

{process_name,

{rabbit_channel,

{<<"10.2.2.63:61127 -> 10.2.2.12:5672">>,2}}},

{permission_cache,

[{{resource,<<"/">>,queue,

<<"EasyNetQ_Default_Error_Queue">>},

configure}]}]

trap_exit: true

status: running

heap_size: 2586

stack_size: 27

reductions: 29027

neighbours:

rabbit@b0p2-rabdef02-sasl.zip

Michael Denny

unread,

Apr 14, 2016, 6:53:37 AM4/14/16

to rabbitmq-users

and also I found this error on node B (02):

=ERROR REPORT==== 14-Apr-2016::10:31:11 ===

** Generic server <0.4021.3> terminating

** Last message in was {'$gen_cast',

{method,

{'queue.declare',0,

<<"EasyNetQ_Default_Error_Queue">>,false,

true,false,false,false,[]},

none,noflow}}

** When Server state == {ch,running,rabbit_framing_amqp_0_9_1,2,<0.3777.3>,

<0.4019.3>,<0.3777.3>,

<<"10.2.2.63:61127 -> 10.2.2.12:5672">>,

{lstate,<0.4020.3>,false},

none,1,

{[],[]},

{user,<<"services">>,

[management],

[{rabbit_auth_backend_internal,none}]},

<<"/">>,<<>>,

{dict,0,16,16,8,80,48,

{[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],

[]},

{{[],[],[],[],[],[],[],[],[],[],[],[],[],[],

[],[]}}},

{state,

{dict,0,16,16,8,80,48,

{[],[],[],[],[],[],[],[],[],[],[],[],[],

[],[],[]},

{{[],[],[],[],[],[],[],[],[],[],[],[],[],

[],[],[]}}},

erlang},

{dict,0,16,16,8,80,48,

{[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],

[]},

{{[],[],[],[],[],[],[],[],[],[],[],[],[],[],

[],[]}}},

{dict,0,16,16,8,80,48,

{[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],

[]},

{{[],[],[],[],[],[],[],[],[],[],[],[],[],[],

[],[]}}},

{set,0,16,16,8,80,48,

{[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],

[]},

{{[],[],[],[],[],[],[],[],[],[],[],[],[],[],

[],[]}}},

<0.3778.3>,

{state,fine,5000,#Ref<0.0.524294.41278>},

false,1,

{{0,nil},{0,nil}},

[],

{{0,nil},{0,nil}},

[{<<"publisher_confirms">>,bool,true},

{<<"exchange_exchange_bindings">>,bool,true},

{<<"basic.nack">>,bool,true},

{<<"consumer_cancel_notify">>,bool,true},

{<<"connection.blocked">>,bool,true},

{<<"authentication_failure_close">>,bool,true}],

none,0,none,flow,[]}

** Reason for termination ==

** {{badmatch,true},

[{rabbit_amqqueue,'-with/4-fun-0-',5,

[{file,"src/rabbit_amqqueue.erl"},{line,430}]},

{rabbit_channel,handle_method,3,

[{file,"src/rabbit_channel.erl"},{line,1307}]},

{rabbit_channel,handle_cast,2,

[{file,"src/rabbit_channel.erl"},{line,454}]},

{gen_server2,handle_msg,2,[{file,"src/gen_server2.erl"},{line,1049}]},

{proc_lib,init_p_do_apply,3,[{file,"proc_lib.erl"},{line,240}]}]}

=ERROR REPORT==== 14-Apr-2016::10:31:11 ===

ERROR: "Error on AMQP connection ~p (~s, vhost: '~s', user: '~s', state: ~p), channel ~p:~n~s~n" - [<0.3777.3>,

<<"10.2.2.63:61127 -> 10.2.2.12:5672">>,

<<"/">>,

<<"services">>,

running,

2,

{{badmatch,

true},

[{rabbit_amqqueue,

'-with/4-fun-0-',

5,

[{file,

"src/rabbit_amqqueue.erl"},

{line,

430}]},

{rabbit_channel,

handle_method,

3,

[{file,

"src/rabbit_channel.erl"},

{line,

1307}]},

{rabbit_channel,

handle_cast,

2,

[{file,

"src/rabbit_channel.erl"},

{line,

454}]},

{gen_server2,

handle_msg,

2,

[{file,

"src/gen_server2.erl"},

{line,

1049}]},

{proc_lib,

init_p_do_apply,

3,

[{file,

"proc_lib.erl"},

{line,

240}]}]}]

=WARNING REPORT==== 14-Apr-2016::10:31:11 ===

Non-AMQP exit reason '{{badmatch,true},

[{rabbit_amqqueue,'-with/4-fun-0-',5,

[{file,"src/rabbit_amqqueue.erl"},{line,430}]},

{rabbit_channel,handle_method,3,

[{file,"src/rabbit_channel.erl"},{line,1307}]},

{rabbit_channel,handle_cast,2,

[{file,"src/rabbit_channel.erl"},{line,454}]},

{gen_server2,handle_msg,2,

[{file,"src/gen_server2.erl"},{line,1049}]},

{proc_lib,init_p_do_apply,3,

[{file,"proc_lib.erl"},{line,240}]}]}'

=ERROR REPORT==== 14-Apr-2016::10:31:39 ===

Channel error on connection <0.3117.3> (10.2.2.32:49488 -> 10.2.2.12:5672, vhost: '/', user: 'services'), channel 1:

operation queue.declare caused a channel exception not_found: "home node 'rabbit@b0p2-rabdef01' of durable queue 'EasyNetQ_Default_Error_Queue' in vhost '/' is down or inaccessible"

Michael Klishin

unread,

Apr 14, 2016, 7:04:32 AM4/14/16

to rabbitm...@googlegroups.com

The error says that there were no mirrors to promote (in sync) when master

had to exit.

Restarting node 395.1 should bring it back.

--
You received this message because you are subscribed to the Google Groups "rabbitmq-users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to rabbitmq-user...@googlegroups.com.
To post to this group, send email to rabbitm...@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

<rab...@b0p2-rabdef01-sasl.zip>

Michael Denny

unread,

Apr 14, 2016, 7:20:15 AM4/14/16

to rabbitm...@googlegroups.com

just to let me understand better, this means that the queue was not in sync, so slave can be promoted to master, but why now it stuck when both nodes are up again?

--
You received this message because you are subscribed to a topic in the Google Groups "rabbitmq-users" group.
To unsubscribe from this topic, visit https://groups.google.com/d/topic/rabbitmq-users/Zxb81qdMLW4/unsubscribe.
To unsubscribe from this group and all its topics, send an email to rabbitmq-user...@googlegroups.com.

To post to this group, send email to rabbitm...@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

--

Michael Denny

Engage IT Services s.r.l.

Software Architect, Trainer and Consultant

Microsoft MVP .NET

More about me

Michael Denny

unread,

Apr 14, 2016, 7:20:37 AM4/14/16

to rabbitm...@googlegroups.com

* so slave CAN'T be promoted to master

Michael Klishin

unread,

Apr 14, 2016, 7:27:32 AM4/14/16

to rabbitm...@googlegroups.com

There is a section on unsynchronised slaves in the mirroring guide. On top

of that you have abnormal queue master termination.

We'd need to find a way to reproduce this to tell what it was.

Michael Denny

unread,

Apr 14, 2016, 7:33:21 AM4/14/16

to rabbitm...@googlegroups.com

probably I need to know more about this:

Stopping the service in windows what actually does? Is it terminating the rabbitmq process gracefully?
Is calling `rabbitmqctl stop_app` the best way to stop a node in a cluster while doing ordinary maintenance (system updates/upgrades) instead of stopping the windows service or is the same?
What about in linux? if I have the rabbitmq running in background, `stop_app` and `start_app` are the best command to stop a node? after it is stopped I just kill the process? I'm creating a new cluster in another environment on CentOS 7 and I would like to know how to manage in linux as well.

> We'd need to find a way to reproduce this to tell what it was

in the mean while I'm trying to understand if I can reproduce this behavior in local.

Michael Klishin

unread,

Apr 14, 2016, 7:36:17 AM4/14/16

to rabbitm...@googlegroups.com

Use service management tools, not rabbitmqctl stop_app

or both. stop_app doesn't manage pid files the same way OS-specific packages do,

and so on.

Service scripts perform clean node shutdown.

Michael Denny

unread,

Apr 14, 2016, 11:04:26 AM4/14/16

to rabbitmq-users

thank you for the clarification!

Michael Denny

unread,

Apr 14, 2016, 11:47:30 AM4/14/16

to rabbitmq-users

BTW I can't also change the policy for that queue, it's in a real corrupted state, I would like to avoid rebooting the node, it will create some service interruption because of the connection moving to the other node :(

Michael Denny

unread,

Apr 14, 2016, 12:44:01 PM4/14/16

to rabbitmq-users

In the event viewer I found this,on both nodes :|

RabbitMQ: StopAction did not terminate erlang. Trying forced kill.

this could have caused the queue to be stuck, but anyway it's strange to me that after every node was rebooted (and anyway this was done one by one) the queue still inaccessible.

Michael Denny

unread,

Apr 14, 2016, 1:06:51 PM4/14/16

to rabbitmq-users

A straight question:

when we reach this point, that could happen because that queue does not have any consumer,

=WARNING REPORT==== 14-Apr-2016::10:01:39 ===

Mirrored queue 'EasyNetQ_Default_Error_Queue' in vhost '/': Stopping all nodes on master shutdown since no synchronised slave is available

when the node that holds the master queue returns alive in the cluster, it is normal that the queue still inaccessible?

Michael Denny

unread,

Apr 14, 2016, 2:35:21 PM4/14/16

to rabbitmq-users

I've rebooted the suggested node (02), and then the stat for that queue disappeared, and in fact the queue was not responding to any commands:

but after starting the node (02), the queue was restored, and now it works fine

BUT all the messages are gone (there were more than 1 hundred messages in there)

to us is not such a problem because it's an error queue and we have those errors also logged on file, but this is a really bad behavior, loosing persistent messages in a persisten queue is not acceptable right?

I've found this logs on the rebooted node (02):

=INFO REPORT==== 14-Apr-2016::20:15:00 ===

Mirrored queue 'EasyNetQ_Default_Error_Queue' in vhost '/': Adding mirror on node 'rabbit@b0p2-rabdef02': <0.5192.0>

=ERROR REPORT==== 14-Apr-2016::20:15:00 ===

Discarding message {'$gen_call',{<0.5194.0>,#Ref<0.0.2.342>},{add_on_right,{4,<0.5194.0>}}} from <0.5194.0> to <0.6423.0> in an old incarnation (1) of this node (2)

=INFO REPORT==== 14-Apr-2016::20:15:03 ===

Mirrored queue 'EasyNetQ_Default_Error_Queue' in vhost '/': Slave <rab...@b0p2-rabdef02.2.5192.0> saw deaths of mirrors <rab...@b0p2-rabdef01.1.9129.2>

=INFO REPORT==== 14-Apr-2016::20:15:03 ===

Mirrored queue 'EasyNetQ_Default_Error_Queue' in vhost '/': Promoting slave <rab...@b0p2-rabdef02.2.5192.0> to master

=WARNING REPORT==== 14-Apr-2016::20:15:04 ===

Mnesia('rabbit@b0p2-rabdef02'): ** WARNING ** Mnesia is overloaded: {dump_log,write_threshold}

=WARNING REPORT==== 14-Apr-2016::20:15:04 ===

Mnesia('rabbit@b0p2-rabdef02'): ** WARNING ** Mnesia is overloaded: {dump_log,write_threshold}

while on node where the queue was master (01) just this:

=INFO REPORT==== 14-Apr-2016::20:14:47 ===

node 'rabbit@b0p2-rabdef02' up

=INFO REPORT==== 14-Apr-2016::20:15:02 ===

rabbit on node 'rabbit@b0p2-rabdef02' up

now the queue is master on the rebooted node (02), before was on (01) with hundred messages, but the policy mirroring is not working anymore, I mean even if the policy is set to mirror on all nodes, the error queue now is only on node (02), and does not replicate anymore on node (01), I'm start thinking the queue on node (01) still exists, and on node (02) has been purged by "Discarding message..." and set as master, the policy can't replicate on node (01) anymore, so now what I can do?

Michael Denny

unread,

Apr 18, 2016, 4:06:00 AM4/18/16

to rabbitmq-users

ok I resolved also the missing syncronized queue by adding a new policy with auto-sync feature, actually I don't know if it's because I've created the new policy itself or because I set the "ha-sync-mode" to "automatic", probably because I've created a brand new policy.

Reply all

Reply to author

Forward