Broken/Unresponsive queues after rabbitmq restart


Fabian Zimmermann

unread,
Aug 12, 2020, 5:28:25 AM8/12/20
to rabbitmq-users
Hi,

we run an OpenStack cluster and have regular issues with broken/unresponsive queues.

There is currently a discussion on the openstack list regarding this issue, so it seems we are not the only users affected by it.

So, what do I mean by "broken/unresponsive"? It's a queue which is

* still there (e.g. the web GUI shows the queue)
* but no longer working/forwarding messages
* not listed in "rabbitmqctl list_queues"
* (sometimes) not even removable (then rabbit_amqqueue:delete_crashed helps)
 
I wrote a small script to detect those broken queues [1].
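The core of such a check can be sketched in a few lines of Python (the function and sample queue names here are illustrative, not taken from the actual script; the real script [1] gathers the two lists from the management API and from rabbitmqctl):

```python
def find_broken_queues(mgmt_queues, ctl_queues):
    """Queues the management API still reports but which are missing
    from 'rabbitmqctl list_queues' are candidates for the broken state."""
    return sorted(set(mgmt_queues) - set(ctl_queues))

# Stand-in data: the first reply queue shows up in the web GUI
# but is absent from list_queues, so it gets flagged.
mgmt = ["rpc_flood_reply_aaa", "rpc_flood_reply_bbb", "amq.gen-xyz"]
ctl = ["rpc_flood_reply_bbb", "amq.gen-xyz"]
print(find_broken_queues(mgmt, ctl))  # ['rpc_flood_reply_aaa']
```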

It seems a similar/related issue was already reported in https://github.com/rabbitmq/rabbitmq-server/issues/1873 .

Because the above bug report is missing a way to reproduce the issue, I tried to write some code to trigger it.

I think I was able to find a way.

Steps to reproduce / How do I reproduce the issue:

* clone the git-repo [1]
* ensure there is a rabbitmq-policy "^(?!amq\.).*"

  ha-mode: all
  ha-sync-mode: automatic
  
* run ./rpc_floodd (on 3 rabbitmq-nodes)
* to generate some load, I run the following cmd on all 3 rabbitmq-nodes:

  for id in $( seq 1 15 ); do ./rpc_flood_client & done

These scripts simulate a "typical" OpenStack service:

* the daemon creates an exchange and listens for messages.
* the client creates a reply_exchange, reply_queue and bindings.
* the client sends a message to the daemon exchange.
* the daemon reads the message and responds to the reply_exchange.
* the client removes the reply_exchange, queue, etc. and loops.
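One client iteration can be sketched against a pika-style channel (the method names follow pika's BlockingChannel; exchange name, routing key and message body are made up for illustration, not taken from the actual scripts). A recording stub stands in for a live broker so the call order is visible without one:

```python
import uuid

def client_round(ch, daemon_exchange="rpc_flood"):
    """One flood-client iteration: create the reply plumbing, send a
    request to the daemon exchange, then tear everything down again."""
    reply = "rpc_flood_reply_%s" % uuid.uuid4()
    ch.exchange_declare(exchange=reply, exchange_type="direct")
    ch.queue_declare(queue=reply, durable=False)
    ch.queue_bind(queue=reply, exchange=reply, routing_key=reply)
    ch.basic_publish(exchange=daemon_exchange, routing_key="", body=b"ping")
    ch.queue_unbind(queue=reply, exchange=reply, routing_key=reply)
    ch.queue_delete(queue=reply)
    ch.exchange_delete(exchange=reply)
    return reply

class StubChannel:
    """Records method names instead of talking to a broker."""
    def __init__(self):
        self.calls = []
    def __getattr__(self, name):
        return lambda **kw: self.calls.append(name)

ch = StubChannel()
client_round(ch)
print(ch.calls)
```

The constant churn of declare/bind/delete per iteration is what makes a node restart in the middle of the loop likely to catch a queue mid-setup or mid-teardown.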

Now just restart (systemctl restart rabbitmq-server) one of the rabbitmq-server services. This is enough to get some of these queues "broken" as described above.
In an OpenStack cluster this may lead to issues, because the services do not get any messages but still think the queue is there and working.

If there is something I may test/change or some information is missing - just give me a hint.

 Fabian Zimmermann


Fabian Zimmermann

unread,
Aug 12, 2020, 5:29:47 AM8/12/20
to rabbitmq-users
RabbitMQ 3.8.6
Erlang 22.3.4.5

 Fabian
 

David Ernstsson

unread,
Aug 12, 2020, 5:47:06 AM8/12/20
to rabbitm...@googlegroups.com
This looks pretty much like the same behaviour we have noted in https://groups.google.com/forum/#!topic/rabbitmq-users/Je7rfu-MWrg


Karl Nilsson

unread,
Aug 12, 2020, 7:04:23 AM8/12/20
to rabbitm...@googlegroups.com
Even logs at default level would be better than nothing.


Karl Nilsson

unread,
Aug 12, 2020, 7:08:15 AM8/12/20
to rabbitm...@googlegroups.com
Hi,

Inspecting the RabbitMQ server logs at debug level may give more clues as to what is going on.


Cheers
Karl


Fabian Zimmermann

unread,
Aug 12, 2020, 10:21:40 AM8/12/20
to rabbitm...@googlegroups.com
Hi,

I repeated the run; here is the output:

-<-
root@control01ve:~# ./check_issue
Start..
Wed Aug 12 16:08:20 CEST 2020
Sleeping 120s
Killing rabbitmq-server
Sleeping 60s
Doing first round
Sleeping 60s
Doing second round
@@ -1,37 +1,22 @@
 rpc_flood_reply_1507587a-18f4-41c5-9a31-dee35556df6b
 rpc_flood_reply_3e66cf31-de97-451e-983b-cbec23a8a19e
 rpc_flood_reply_4fda1435-a0b0-4af1-ad4e-669deff7349f
 rpc_flood_reply_a279b22e-fc4c-4b15-803a-b81f1bd701b4
 rpc_flood_reply_abf6c96b-dcc3-4e12-a158-eeb2b494fc1e
 rpc_flood_reply_bbd852a4-5870-4b60-90f0-08fa85ae7789
 rpc_flood_reply_f2c79425-6ec5-48ef-80e6-4561133cf841
Issue detected
rpc_flood_reply_1507587a-18f4-41c5-9a31-dee35556df6b
ok
rpc_flood_reply_3e66cf31-de97-451e-983b-cbec23a8a19e
ok
rpc_flood_reply_4fda1435-a0b0-4af1-ad4e-669deff7349f
ok
rpc_flood_reply_a279b22e-fc4c-4b15-803a-b81f1bd701b4
ok
rpc_flood_reply_abf6c96b-dcc3-4e12-a158-eeb2b494fc1e
ok
rpc_flood_reply_bbd852a4-5870-4b60-90f0-08fa85ae7789
ok
rpc_flood_reply_f2c79425-6ec5-48ef-80e6-4561133cf841
ok
..End
->-

The queues listed above are the broken ones, so maybe you will find useful information in the debug logs:


Some information I may also add:

* it seems the issue is *not happening* if the queue is a durable queue.
* it seems the broken queues had their master located on the node being killed/restarted.

 Fabian

Fabian Zimmermann

unread,
Aug 14, 2020, 7:54:52 AM8/14/20
to rabbitm...@googlegroups.com
Hi,

just to also answer on this list.

I did some tests in our environment and it seems there are only two ways of running OpenStack + RabbitMQ without issues:

1. run without durable queues and without replication
2. run with durable queues and with replication

I'm able to reproduce these broken bindings / broken queues. The procedure is always the same:

* start load (see above git repo)
* kill one rabbitmq
* check for result.

The result depends on the combination of "durable" and "replication". Possible results are the "broken queues" I observed initially or the "broken bindings" the bug report describes.

 Fabian

Karl Nilsson

unread,
Aug 14, 2020, 10:01:43 AM8/14/20
to rabbitm...@googlegroups.com
Using a transient queue with HA isn't something we recommend, so your observation that replication needs to go with durability is a good one.

Alternatively, you could try quorum queues if you need a durable, persistent and available queue.


Cheers,
Karl


Fabian Zimmermann

unread,
Aug 14, 2020, 1:53:54 PM8/14/20
to rabbitm...@googlegroups.com
Hi,


Karl Nilsson <nk...@vmware.com> schrieb am Fr., 14. Aug. 2020, 16:01:
Using a transient queue with ha isn't something we recommend so your observation that replication needs to go with durability is a good one.

Yes, I read about this in a sentence in your HA docs, but it seems there are a lot of other docs saying the opposite,

which tell you "using durable queues may lead to an inaccessible queue".

I know there is a link to the relevant HA doc explaining this in more detail, but I just want to note:

I think there are a lot of setups out there using non-durable queues with replication.

If you like, I can create a bug report.
If you think this "works as intended", I'm also fine with that.


Alt, you could try to use quorum queues if you need a durable, persistent and available queue.


Well, I could try to create a feature request in oslo.x

Nevertheless, thanks a lot for the fast answers,

 Fabian

Fabian Zimmermann

unread,
Aug 17, 2020, 9:30:25 AM8/17/20
to rabbitm...@googlegroups.com
Hi,

I just increased the load a bit further (100 msg/s, 500 queues, 500 exchanges) and now I'm able to reproduce the issue even with replication *and* durable queues.

* the broken queues are located on the node I killed
* the queues exist, but all counters are "NaN"
* if I try to send a message to the queue (via the web interface), I get: ERR: {"error":"bad_request","reason":"rejected Unable to publish message. Check queue limits."}

Here are the error messages I found in the logs:
--
rab...@control02ve.log:2020-08-17 15:21:21.193 [error] <0.22764.4> Supervisor {<0.22764.4>,rabbit_channel_sup} had child writer started with rabbit_writer:start_link(#Port<0.2309>, 1, 131072, rabbit_framing_amqp_0_9_1, <0.22177.4>, {<<"192.168.2.101:58914 -> 192.168.5.12:5672">>,1}, true) at <0.22765.4> exit with reason noproc in context shutdown_e
--
rab...@control01ve.log:2020-08-17 15:21:57.017 [error] <0.27237.2> Supervisor {<0.27237.2>,amqp_channel_sup_sup} had child channel_sup started with amqp_channel_sup:start_link(direct, <0.27236.2>, <<"<rab...@control01ve.2.27236.2>">>) at undefined exit with reason noproc in context shutdown_error
--

Here are the debug logs - maybe someone is able to point out why this happens?

Kill-Time: 2020-08-17 15:21:13.191

node3: killed

Or any hints on how I may get some useful information?

 Fabian

Fabian Zimmermann

unread,
Aug 17, 2020, 9:31:28 AM8/17/20
to rabbitm...@googlegroups.com
and

* the queues are no longer replicated
* the queues are no longer part of the policy

Fabian Zimmermann

unread,
Aug 17, 2020, 9:43:44 AM8/17/20
to rabbitm...@googlegroups.com
It seems the queue is not replicated successfully?

--
root@control01ve:/var/log/rabbitmq# rabbitmqctl purge_queue --vhost test rpc_flood_reply_02e14801-920d-4e78-9232-e1de80ac9ffe
Purging queue 'rpc_flood_reply_02e14801-920d-4e78-9232-e1de80ac9ffe' in vhost 'test' ...
Error:
{:nodedown, :rabbit@control03ve}
root@control01ve:/var/log/rabbitmq# rabbitmqctl delete_queue --vhost test rpc_flood_reply_02e14801-920d-4e78-9232-e1de80ac9ffe
Deleting queue 'rpc_flood_reply_02e14801-920d-4e78-9232-e1de80ac9ffe' on vhost 'test' ...
Queue was successfully deleted with 0 messages
--

What's the correct way to avoid this?

The queue is durable and a policy exists for replicating all queues.

 Fabian

Wesley Peng

unread,
Aug 17, 2020, 11:09:33 PM8/17/20
to rabbitm...@googlegroups.com


Fabian Zimmermann wrote:
> root@control01ve:/var/log/rabbitmq# rabbitmqctl purge_queue --vhost test
> rpc_flood_reply_02e14801-920d-4e78-9232-e1de80ac9ffe
> Purging queue 'rpc_flood_reply_02e14801-920d-4e78-9232-e1de80ac9ffe' in
> vhost 'test' ...
> Error:
> {:nodedown, :rabbit@control03ve}


What versions (rabbitmq/erlang) are you using?
If you have upgraded to the latest version, does this still happen?

regards.

Fabian Zimmermann

unread,
Aug 18, 2020, 1:24:38 AM8/18/20
to rabbitm...@googlegroups.com
Hi,

> What versions (rabbitmq/erlang) are you using?
> If you have upgraded to the latest version, does this still happen?

RabbitMQ 3.8.6
Erlang 22.3.4.6

Yes, already running the latest version (afaik?) - still happening.

Fabian

Wesley Peng

unread,
Aug 18, 2020, 1:34:24 AM8/18/20
to rabbitm...@googlegroups.com
Hello

Fabian Zimmermann wrote:
> RabbitMQ 3.8.6
> Erlang 22.3.4.6
>
> Yes, already running the latest version (afaik?) - still happening.

You may want to upgrade Erlang to the latest version (OTP 23).

For example, mine is:


OS PID: 12968
OS: Linux
Uptime (seconds): 940969
RabbitMQ version: 3.8.6
Node name: rabbit@ubuntu
Erlang configuration: Erlang/OTP 23 [erts-11.0.2] [source] [64-bit]
[smp:1:1] [ds:1:1:10] [async-threads:64]
Erlang processes: 443 used, 1048576 limit
Scheduler run queue: 1
Cluster heartbeat timeout (net_ticktime): 60

regards.

Fabian Zimmermann

unread,
Aug 18, 2020, 2:14:24 AM8/18/20
to rabbitm...@googlegroups.com
Hi,

just updated to

--
RabbitMQ 3.8.7
Erlang 23.0.3
--

same result:

--
root@control02ve:~# rabbitmqctl purge_queue --vhost test
rpc_flood_reply_d12f5cc7-903e-41a0-932e-8d5b04bb365f
Purging queue 'rpc_flood_reply_d12f5cc7-903e-41a0-932e-8d5b04bb365f'
in vhost 'test' ...
Error:
{:nodedown, :rabbit@control01ve}
root@control02ve:~# rabbitmqctl delete_queue --vhost test
rpc_flood_reply_d12f5cc7-903e-41a0-932e-8d5b04bb365f
Deleting queue 'rpc_flood_reply_d12f5cc7-903e-41a0-932e-8d5b04bb365f'
on vhost 'test' ...
Queue was successfully deleted with 0 messages
--

Fabian

David Ernstsson

unread,
Sep 2, 2020, 7:39:23 AM9/2/20
to rabbitmq-users
Any news or clues on how to solve this issue? We seem to be experiencing a similar problem, and we really don't feel comfortable rolling RabbitMQ into big-scale production without a fix at the moment.

Regards, David

Luke Bakken

unread,
Sep 2, 2020, 8:31:43 AM9/2/20
to rabbitmq-users
Hi Fabian -

I would like to try to reproduce this using your project - https://github.com/devfaz/rabbitmq_debug

However, the instructions are hard to follow. Could you please update the README.md file with exact, concise instructions for how to use the scripts you have provided? Ideally there would be no manual steps necessary because those are error-prone. rabbitmqadmin and rabbitmqctl can be used to automate any interaction with RabbitMQ. I can assist with that if necessary.

Thanks -
Luke

David Ernstsson

unread,
Sep 2, 2020, 12:44:24 PM9/2/20
to rabbitmq-users
Any chance this is the same issue as fixed in  https://github.com/rabbitmq/rabbitmq-server/issues/2437  ? 

Luke Bakken

unread,
Sep 2, 2020, 1:10:59 PM9/2/20
to rabbitmq-users
Hi David,

During testing for that issue I never saw issues with the queues or bindings after restarts.

Thanks,
Luke

Fabian Zimmermann

unread,
Sep 7, 2020, 7:26:41 AM9/7/20
to rabbitm...@googlegroups.com
Hi Luke,

As requested I automated the process as much as possible.

So you should be able to reproduce the issue by:

* install docker
* clone the repo https://github.com/devfaz/rabbitmq_debug
* execute ./run
* wait for the script to complete

The output should end with something like

->-
# wip_detect_broken_queues
##########################
rpc_flood_reply_035a60a5-db69-4d40-a3cb-6b88e1f0a7f0
rpc_flood_reply_0d82d412-5bcd-4008-a197-c149000600c4
....
rpc_flood_reply_fd7e70a5-f9ce-4743-a972-039970d4fd6c
-<-

Just connect to the web interface of the containerized RabbitMQ (use
docker inspect node1 to get the IP) and try to publish a message to the
above queues.

I'm currently able to reproduce this, but it may depend on the
power of your workstation.

my env:
* Intel(R) Core(TM) i7-9700T CPU @ 2.00GHz
* 32 GB RAM

If you are not able to generate enough load (msg/s), just increase
the "clients" in genload.sh.

Fabian

Luke Bakken

unread,
Sep 16, 2020, 4:21:59 PM9/16/20
to rabbitmq-users
Thanks Fabian,

I finally have some time to look into this.

Luke

Luke Bakken

unread,
Sep 17, 2020, 5:22:50 PM9/17/20
to rabbitmq-users
Hi Fabian,

I think I'm seeing the same result as you. At the end of the script, node2 is still offline. Is that expected?

I am very unfamiliar with running RabbitMQ in docker, so I have some questions:
  • I can run docker logs NODE to get the RabbitMQ logs for that node. I'm assuming this is collected from stdout as the RabbitMQ node runs because I can't find a log file in /var/log/rabbitmq within the container.
  • The crash.log file does appear to be in the containers at /var/log/rabbitmq/log/crash.log, but there aren't any entries.
What I'm going to do next is take your reproduction scripts and modify them to not use docker but instead use a local cluster. That will allow me to much more easily debug what is going on.

Thanks,
Luke


Luke Bakken

unread,
Sep 18, 2020, 5:33:47 PM9/18/20
to rabbitmq-users
Hi again Fabian,

I've modified your code to run against a local RabbitMQ cluster:


The cluster is started as follows:

cd rabbitmq-public-umbrella
make co
cd deps/rabbitmq_server_release
make NODES=3 PLUGINS='rabbitmq_management' start-cluster

The nodes will listen on ports 5672, 5673 and 5674.

In testing I do see that the various scripts that report on broken queues do return queues. However, if I disable the step that stops a node, I still see some queues reported as "broken". But they appear normal in the management interface and I can publish to them.

If you have time to discuss this further, that would be great. I'm going to continue reading your code to figure out exactly what is being tested.

Thanks,
Luke

Fabian Zimmermann

unread,
Sep 24, 2020, 6:58:16 AM9/24/20
to rabbitm...@googlegroups.com
Hi,

I just answered your question on GitHub; feel free to contact me directly.

Fabian

Terry Rinck

unread,
Sep 27, 2020, 2:30:54 PM9/27/20
to rabbitm...@googlegroups.com
Are you using queue_master_locator: min-masters?
If so, does setting it to client-local produce the same results?

Ayanda

unread,
Sep 29, 2020, 5:11:43 AM9/29/20
to rabbitmq-users
Try closing the channel after queue and exchange deletion here: https://github.com/devfaz/rabbitmq_debug/blob/master/rpc_flood_client#L21-L22
I think your channels are being abruptly dropped when the container is removed, before queue deletion completes. This leaves some of your queues in a bad/unresponsive state.

Fabian Zimmermann

unread,
Oct 5, 2020, 8:55:40 AM10/5/20
to rabbitm...@googlegroups.com
Hi,

the script is just an attempt to reproduce an issue we have in our prod env (an OpenStack cloud).

In my opinion it's a bug to have queues in this "broken" state.

I think if a queue is part of a replication setup, the queue should only be "defined" once the master->slave replication of the queue has successfully completed.

It seems (from my point of view) the queues are created on the master node, and then the replication setup is attempted. If the master fails before the slave completes the replication setup, the queue is unusable, because the master is gone but the slave is not yet ready to take over.

As a fallback I would even accept "automatic removal" as a good workaround, because this would trigger the recovery functions in the client and recreate the queue. Without it, the client waits indefinitely for messages which will never be delivered (until the master is back).
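The hypothesized race can be captured in a toy model (purely illustrative, not RabbitMQ internals): a queue survives a master crash only if at least one mirror finished its sync before the crash:

```python
def queue_state(master_alive, synced_mirrors):
    """Toy model of the hypothesized race window."""
    if master_alive:
        return "usable"
    if synced_mirrors > 0:
        return "usable"   # a synced mirror can be promoted
    return "broken"       # declared everywhere, but nobody can serve it

# Master killed before any mirror completed the replication setup:
print(queue_state(master_alive=False, synced_mirrors=0))  # broken
# Master killed after at least one mirror synced:
print(queue_state(master_alive=False, synced_mirrors=1))  # usable
```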

 Fabian
