Deleted queues reappear after restarting mirrored nodes using ha-all

Lucas LeBlanc

May 20, 2020, 2:52:36 PM
to rabbitmq-users
Hi. This seems like a dumb question, but I can't find anyone else talking about this, nor any documentation addressing it, and it is easily reproducible. I have a RabbitMQ cluster with 4 nodes set up for file transfer in 65 KB packets. Typically a queue is created per file and deleted after the file has been transferred. All of the queues are governed by the ha-all policy.

As I am new to RabbitMQ, there were naturally bugs as I was developing this file-transfer pipeline, and I sometimes used the rabbitmqadmin CLI to manually delete leftover queues that were empty or no longer needed. This seemed to work well enough, but then in production I saw an issue I hadn't seen before (not actually relevant to this topic, but somehow delivery tags in acknowledgements from the consumer were not being accepted by the broker).

Restarting the nodes addressed that issue, but something else started happening: tens of thousands (about 33k, as it turned out) of these deleted, empty file queues were restored when I restarted the nodes. Now I have to slowly delete them one by one with a script. Is there a policy change I need to make to ensure that when I delete a queue, the mirrors do not restore it when they restart? Is this a bug? It seems very counter-intuitive that every operation except deletion would be synchronized.

Lucas LeBlanc

May 21, 2020, 10:19:55 AM
to rabbitmq-users
I thought this seemed like a simple mistake on my part so I did not provide that much information, but perhaps more detail is in order.
The cluster nodes in question all run RabbitMQ server version 3.8.3-1 on Ubuntu 18.04 (bionic). 
The specifics of the issue go like this: a queue "File.1234" is created by an application and messages are pushed to it. Another application consumes all of those messages and deletes the queue. This happens thousands of times a day, usually around 2,000-6,000. When I restarted the nodes in the order prescribed by the documentation (i.e. the last node shut down is the first node to start up), about 33k deleted, empty queues were restored on the cluster.

I reproduced this in my QA environment with a 3-node cluster as well, same policies. It does not seem to matter whether a given node is the "master" of the queues that are restored. In QA, I restarted each node and deleted queues each time, and every single node restored the same queues. It seems as if deleting a queue does not propagate the deletion to the mirroring nodes, and those nodes think the queues should be restored. I am using these commands to restart the nodes:
sudo rabbitmqctl stop_app
sudo rabbitmqctl start_app

This is the first time I've had to work with RabbitMQ and I'm using it because our old pipeline was simply unfit for our current business, let alone with further scaling. I also have not really worked with Erlang before. As such, my log combing skills for these particular logs are probably not up to snuff, but I did not find anything that seemed significant aside from tons of queue declaration statements. 

Luke Bakken

May 21, 2020, 10:49:14 AM
to rabbitmq-users
Hi Lucas,
 
> It seems as if deleting a queue does not mirror the deletion on mirrored nodes

I'm assuming you tested a cluster restart while your applications were creating and deleting queues. Deletion is not instantaneous and can take some time depending on your cluster load.

Since these queues are short-lived (correct?) I don't see a benefit to mirroring them, really. If you must mirror them, don't mirror to all nodes in your cluster. Mirroring to a quorum of nodes is sufficient. If you have three nodes, just mirror to one other.
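Luke's suggestion translates into an "exactly" policy instead of ha-all. A minimal sketch of the definition, using the standard classic-mirrored-queue policy keys; the policy name "ha-two" and the pattern "^File\." are hypothetical, chosen to match the temporary file queues described above:

```python
import json

# Hypothetical policy definition: mirror to a fixed count of nodes
# ("exactly" mode) rather than to every node in the cluster.
definition = {
    "ha-mode": "exactly",        # mirror to a fixed number of nodes
    "ha-params": 2,              # the queue master plus one mirror
    "ha-sync-mode": "automatic", # sync new mirrors without manual intervention
}

# The equivalent rabbitmqctl invocation (default vhost "/", hypothetical
# policy name and pattern):
command = 'rabbitmqctl set_policy ha-two "^File\\." \'{}\' --apply-to queues'.format(
    json.dumps(definition)
)
print(command)
```

With three nodes, two copies of each queue is a quorum, which is the "just mirror to one other" advice above.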

If your files are small, transferring them via RabbitMQ could be a valid use-case. Otherwise, the recommendation is to upload a file to a blob store and use RabbitMQ to communicate the file's location (via a URI, for instance).
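The blob-store approach described above is often called the claim-check pattern: the message carries only a pointer to the file, not its bytes. A minimal sketch of such a payload; the field names and the URI scheme are illustrative, not part of any RabbitMQ API:

```python
import json

def make_file_notification(file_uri: str, size_bytes: int, checksum: str) -> bytes:
    """Build a small JSON message pointing at a file in external storage,
    instead of streaming the file's contents through the broker."""
    payload = {
        "uri": file_uri,     # where the consumer can fetch the file
        "size": size_bytes,  # lets the consumer sanity-check the download
        "sha256": checksum,  # lets the consumer verify integrity
    }
    return json.dumps(payload).encode("utf-8")

# Hypothetical bucket/path; only ~100 bytes would cross the broker.
body = make_file_notification("s3://docs-bucket/incoming/1234.pdf", 1_048_576, "ab" * 32)
```

The body would then be published as an ordinary message; the consumer downloads the file from storage after receiving it.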

Thanks,
Luke

Lucas LeBlanc

May 21, 2020, 12:00:29 PM
to rabbitmq-users
Actually, the queues in question were deleted anywhere from recently to weeks ago. We don't create anywhere near 33k queues in a single day. I appreciate the recommendation for mirroring, though. We do have a reliable storage volume where the files live, but the files are transferred between applications using RabbitMQ. We serve and process documents (OCR and classification primarily) so files need to move around a lot for processing. It's been plenty fast and reliable enough for our purposes so far, at least in relation to what we had before.

Luke Bakken

May 21, 2020, 2:58:28 PM
to rabbitmq-users
Hi Lucas,

That is strange behavior that I have not seen reported before. If you can provide exact steps we can follow to reproduce the issue, I'm sure we can fix it.

Thanks -
Luke

Lucas LeBlanc

May 21, 2020, 3:29:22 PM
to rabbitmq-users
There is one sticking point that I am not sure of, which is whether queues which were deleted by our .NET application through the client API were restored, or if it is only queues which were manually deleted through the rabbitmqadmin CLI. The application call deletes the queue regardless of whether it is unused or empty. When I tested for this issue in QA, it seemed like it might only be manually-deleted queues, but I don't think I have manually deleted 33k queues in production (I suppose I could be wrong). Here are the steps that reproduce this issue on our servers:

1. Set up a RabbitMQ cluster with 2 or more nodes. The user permissions should be .* in all fields and the vhost is the default /.
2. Set a policy that enforces "ha-all" mirroring on all exchanges and queues (no ha-params or ha-mode specification)
3. Create one or more classic, non-durable queues and push messages to them
4. After the queues have synced, consume all of the messages and delete the queue afterward (you may need to try both through a client library and through the admin CLI)
5. Restart any of the nodes in the cluster individually. If you used the admin CLI, then the queues may not reappear if you restart the node which you executed the delete call(s) from, based on what I have seen.
6. The empty queues which were deleted should reappear in the management interface

Luke Bakken

May 21, 2020, 3:47:05 PM
to rabbitmq-users
Hi Lucas,

You'll have to clarify your policy. If you don't set ha-mode: all then no mirroring will happen. Can you run rabbitmqctl list_policies so I can see what mirroring policy you're using?

Lucas LeBlanc

May 22, 2020, 10:09:14 AM
to rabbitmq-users
Ugh, sorry, I meant to say no ha-sync-mode specification is necessary. You need to specify ha-mode: all.

Lucas LeBlanc

May 22, 2020, 10:11:29 AM
to rabbitmq-users
Listing policies for vhost "/" ...
vhost   name    pattern apply-to        definition      priority
/       ha-all  .*      all     {"ha-mode":"all"}       0

Luke Bakken

May 22, 2020, 10:11:53 AM
to rabbitmq-users
Hello,

OK yep that's what I figured. As expected, everything worked fine in my environment following the steps provided (using rabbitmqadmin to delete my test queue).

In step 5 should I restart all nodes or "any node" (i.e. any one node)? It's not clear.

Thanks -
Luke

Lucas LeBlanc

May 22, 2020, 10:21:38 AM
to rabbitmq-users
It occurred in my environment when I restarted any one of the mirrored nodes.

Luke Bakken

May 22, 2020, 10:44:41 AM
to rabbitmq-users
OK. I'll try a few more times but my guess is that I won't be able to reproduce this.

Mirroring non-durable queues is not recommended for just this reason. It can be unpredictable.

Thanks,
Luke

Lucas LeBlanc

May 22, 2020, 11:11:22 AM
to rabbitmq-users
Can I ask why it is possible to mirror non-durable queues if it is this unpredictable? I don't see how any useful functionality can be gained out of behavior that is so inconsistent. I was under the impression I would be able to have a cluster that is able to serve a given file even if one node goes down, but I think it may be better to simply not mirror them because making them durable would slow down the transfers too much. By the way, I just restarted one of my QA nodes and got several of the deleted temporary File queues to reappear again. 

Luke Bakken

May 22, 2020, 11:37:46 AM
to rabbitmq-users
Hi Lucas,

It has historically been possible to mirror non-durable queues. We are discussing removing the ability to do so.

Durability will not have an effect on mirror synchronization. A durable queue will survive a broker restart. That's it. It does not affect the persistence of the messages in the queue, nor the mirroring characteristics (except to avoid the edge case you can reproduce, and I can't). For messages to survive a broker restart, they must be published as persistent, and the queue must be durable. Yes, this is too many knobs to tweak but again, history.
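The rule above reduces to a small truth table. A sketch of what survives a full broker restart under the classic-queue semantics being described (the function names are illustrative):

```python
def queue_survives_restart(queue_durable: bool) -> bool:
    # A durable queue definition survives a broker restart; a
    # non-durable (transient) queue is deleted with the node.
    return queue_durable

def message_survives_restart(queue_durable: bool, message_persistent: bool) -> bool:
    # A message survives only if BOTH knobs are set: it was published
    # with delivery mode "persistent" AND it sits in a durable queue.
    # Either one alone is not enough.
    return queue_durable and message_persistent

# Non-durable queue, persistent message: the queue (and so the message) is lost.
assert not message_survives_restart(queue_durable=False, message_persistent=True)
# Durable queue, transient message: the queue survives, the message does not.
assert queue_survives_restart(True) and not message_survives_restart(True, False)
```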

If you can script how you are reproducing this issue, it would help me a lot. I suspect there are details missing, or I am not following the exact same steps.

Thanks,
Luke

Luke Bakken

May 22, 2020, 11:40:26 AM
to rabbitmq-users
When I say "unpredictable" yours is only the second time I can remember that someone has had an issue mirroring non-durable queues. It may just be that everyone happens to always use durable queues with mirroring, I don't know.

At any rate, we're discussing it.

Thanks,
Luke

Lucas LeBlanc

May 22, 2020, 12:43:44 PM
to rabbitmq-users
OK. I greatly appreciate the feedback and background information. When I mention that I think durable mode would be unsuitable, I'm talking about documentation I've read stating it can make the message pipeline slower due to writing to disk. I'll have to test durable queues to see how well it works but I imagine the disk space and time cost will be very significant. 

It's difficult for me to script out publishing and consuming messages because I'm using the .NET client API. The messages are not published with any special parameters like persistence. For each file queue, there's a "Create" message that lets the recipient know the file is coming, followed by "Append" messages containing up to 65536 bytes each of binary data, and then a "Close" message. The "Create" message comes on a separate queue, but "Append" and "Close" both go in the temporary file queues I was talking about. After the recipient application gets "Close," it deletes the queue. This is also based on an old file transfer protocol that I basically re-used in RabbitMQ, so it may be (probably is) suboptimal. If you have any suggestions for creating a more portable test, I'm willing to give it a shot.

There are no special commands I am executing on the broker servers, they are simple stop_app and start_app calls. I would suggest creating a large number of queues, then emptying them and deleting them both through a client API and through the admin CLI.
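The Create/Append/Close framing described above can be modeled in a few lines. A toy sketch; the tuple representation and function name are illustrative, not the .NET client API or the actual wire format:

```python
import io

CHUNK_SIZE = 65536  # up to 64 KiB of binary data per "Append" message

def frame_file(stream: io.BufferedIOBase):
    """Toy model of the Create/Append/Close framing: yield the sequence
    of (message_type, payload) pairs published for one file transfer."""
    yield ("Create", b"")            # announces the file (sent on a separate queue)
    while True:
        chunk = stream.read(CHUNK_SIZE)
        if not chunk:
            break
        yield ("Append", chunk)      # one packet of file data
    yield ("Close", b"")             # tells the recipient to finish and delete the queue

# A 100,000-byte file becomes Create, two Appends (65536 + 34464 bytes), Close.
messages = list(frame_file(io.BytesIO(b"x" * 100_000)))
```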

Luke Bakken

May 22, 2020, 2:15:41 PM
to rabbitmq-users
Hi Lucas,

I would be interested to know exactly what documentation states that durable means that messages are written to disk. As I said, durable has nothing to do with individual message persistence. For that, you must publish the messages with "delivery mode" set to "persistent". Every client library has its own way of doing that.

See this, too - https://www.cloudamqp.com/blog/2017-03-14-how-to-persist-messages-during-RabbitMQ-broker-restart.html

Since you're not publishing persistent messages, restarting nodes will delete them from that node. This of course is affected by mirroring, as messages will remain on those nodes and will be synchronized on restart (which takes time!). Since the queues aren't durable, they too will be deleted.

The usual practice is that if you're going to bother with mirroring, you should combine it with durable queues and persistent messages and publisher confirms to get complete reliability. Without confirms you are basically hoping that your message makes it to RabbitMQ and is enqueued. TCP does not guarantee that communication is reliable, only protocols can do that. HTTP does it by sending a response code back. AMQP uses publisher confirms.

In my test, I was using rabbitmqctl shutdown which stops the running VM completely. stop_app and start_app don't do that.

I'll see if I can find time next week to try and reproduce, but since there is a better practice (use durable queues) I have to prioritize other work first.

Thanks -
Luke

Lucas LeBlanc

May 26, 2020, 10:23:17 AM
to rabbitmq-users
Not really appreciating the snippiness as I am not intentionally trying to spout false information at you, but duly noted. For reference, the very first sentence under "Durability" in the RabbitMQ documentation states: "Durable queues are persisted to disk..." so I feel that it is a very reasonable mistake to make.

Michael Klishin

May 26, 2020, 11:32:40 AM
to rabbitm...@googlegroups.com

Lucas,

This is a reasonable mistake to make.

We will update the doc section you are referring to. It currently does mention the difference between queue durability and that of messages, but maybe the distinction should be made clearer and/or earlier.

We are also considering removing the concept of transient messages in a future version. It is a feature of the original protocol, so we could not ignore it, and taking it out is a difficult decision to make.


Luke Bakken

May 26, 2020, 11:40:57 AM
to rabbitmq-users
Hi Lucas,

I asked which document may have been confusing because we try to immediately make improvements if something is incorrect or unclear. The fact that we use the word "persist" to describe durable queues could be changed because "persist" has a specific meaning when it comes to messages. Thanks for pointing that out.

Luke

Lucas LeBlanc

May 28, 2020, 8:18:23 AM
to rabbitmq-users
OK. It is on the Queues page, under "Durability:" https://www.rabbitmq.com/queues.html
This is my first project involving RabbitMQ, and this thread has been quite enlightening for me. I haven't been able to find any "official" documentation on best practices, unless it is embedded within the other pages of documentation describing features. There seems to be the occasional hint or recommendation, but nothing comprehensive. There are blogs and community posts on it as well, but since I am as new as I am, I feel like it would be difficult for me to determine which community recommendations I should actually listen to. If there is a recommended article on it, I would appreciate the tip. Most of the documentation I have read has been very specific to the given feature I am trying to work with or task I have to complete.

Lucas LeBlanc

May 28, 2020, 8:34:10 AM
to rabbitmq-users
It looks like this issue with deleted queues reappearing is not exclusive to the policy setup I was using. I was busy last week with some other code that I recently pushed to production, but since finishing that I went back to QA to test other RabbitMQ queue policy configurations. I took some of Luke's suggestions and made it so that only the permanent queues for sending actionable messages (there are only 2) are mirrored across a quorum of nodes. The temporary file queues are not mirrored any more.

Everything works the way I would expect in terms of our applications, but when I tested restarting a node (again following the order recommended in the upgrade guide: start the node that was shut down last, first), more of these deleted, empty file queues reappeared, including new ones created after the policy changes were made. This time I restarted the nodes using "sudo rabbitmqctl shutdown" instead of "stop_app", and then "sudo systemctl start rabbitmq-server" followed by "sudo rabbitmqctl start_app". Here is the policy:

Listing policies for vhost "/" ...
vhost   name    pattern apply-to        definition      priority
/       ha-quorum       (Action|Task)   queues  {"ha-mode":"exactly","ha-params":2,"ha-sync-mode":"automatic"}      0

Action and Task are the two permanent queues I mentioned.

Lucas LeBlanc

May 28, 2020, 12:20:02 PM
to rabbitmq-users
Here's another pertinent detail: I have tried stopping a node, running the "reset" command, rejoining the cluster, and starting the node. This still causes the empty queues to reappear, even though I would expect the node I just reset to have no queue data at all. Looking through the logs doesn't turn up anything obvious, not even a record of the queues being restored or mirrored.

Luke Bakken

May 28, 2020, 12:31:23 PM
to rabbitmq-users
Hi Lucas,

Thanks for the additional detail. Just to be certain, the queues that re-appear do not match the ha-quorum policy, but are non-durable queues that have this general life cycle -
  • Create classic, non-durable queue
  • Push messages to the queue, consume them
  • Delete the queue either via client library or HTTP API
  • Do the above several times
  • Restart nodes one-at-a-time
When you restart a node, a client application that had been connected to that node must re-connect to another node. Is there a chance that these applications are re-declaring queues when they re-connect, thus making it appear that deleted queues are restored? This is about the only explanation I have for the scenario you describe in which a node reset does not seem to delete queues.

Thanks -
Luke

Lucas LeBlanc

May 29, 2020, 10:32:44 AM
to rabbitmq-users
Yes, Luke, your description of the life cycle is correct. However, the applications in question have no memory of these queues and would not be able to re-declare them. The non-durable file queues are completely transient, with a lifetime only as long as it takes to download the file to our storage volume and register it in our database. My gut tells me this might be related to some other service or OS feature interacting with the RabbitMQ server in an unexpected way. That is often the case when someone has an easily reproducible bug that no one else is experiencing.

Luke Bakken

May 29, 2020, 10:54:53 AM
to rabbitmq-users
Hi Lucas,

You are using the RabbitMQ .NET client in your applications, correct? The reason I ask is that this library has a feature that will restore topology (exchanges, queues, etc) when a client application reconnects. If there is a bug in the library where a queue deletion is not recorded, maybe that is what is going on.

Thanks -
Luke

Lucas LeBlanc

Jun 1, 2020, 11:59:40 AM
to rabbitmq-users
Thank you very much for bringing this feature to light, Luke, I had no idea that this library would do something like this. I will dig deeper into the .NET client library and try to determine if that is the root of the issue.

Lucas LeBlanc

Jun 1, 2020, 4:18:03 PM
to rabbitmq-users
It does indeed seem to be tied to the client applications being active while the broker nodes restart, and the topology recovery feature kicking in. I'm not familiar with exactly how the topology recovery feature stores its queue information, but if I make sure the client services are all shut down while I restart the nodes, then even with the feature turned on, the queues do not reappear. Turning the feature off and restarting all of the client services seems to have alleviated the issue completely. The issue was made stickier by the fact that any of the various services on our network that connect to our brokers could restore these empty queues, not just the one service that consumes the messages from the file queues and deletes them. I stand by my initial post where I mentioned my hunch that this would be a silly mistake... I should have studied the client API more thoroughly. This thread has been very informative.

Luke Bakken

Jun 1, 2020, 5:33:22 PM
to rabbitmq-users
Hi Lucas,

If a .NET client application deletes a queue but that queue is restored by topology recovery at a later point, that is a bug in the .NET client. Please let me know if you see that happen.

Thanks,
Luke

Lucas LeBlanc

Jun 3, 2020, 10:59:48 AM
to rabbitmq-users
There were some queues which were manually deleted by me as well (due to inadequacy in the code on my part causing a few to go undeleted -- I have fixed this, thankfully). I have good reason to believe that the queues being restored were ones that I manually deleted using the CLI. I do not know for sure which of the restored queues were manually deleted and which ones were deleted by the client, since I don't have a record of which ones were deleted. 

One thing I am curious about is how the recovery feature interacts with other services consuming from the same cluster. For example: If Service A and Service B are both connected to a cluster, and Service A deletes some queues through its .NET client API, how would Service B know not to restore those queues in the event that the node hosting those queues goes down immediately afterward?

Michael Klishin

Jun 3, 2020, 11:12:27 AM
to rabbitm...@googlegroups.com

It does not. The connection recovery entity cache is per-connection: only what is deleted on the same connection will be purged from the cache of entities to re-declare on recovery.

Things that are shared are assumed to be long-lived.
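This per-connection cache behavior can be sketched as a toy model (not the actual .NET client code; the class and method names are illustrative). Each connection records only its own declarations and deletions, so a delete issued on one connection never purges another connection's recovery cache:

```python
class RecoveringConnection:
    """Toy model of per-connection topology recovery: the client records
    queues declared on THIS connection and re-declares them after a
    reconnect, unless they were also deleted on THIS connection."""

    def __init__(self, broker_queues: set):
        self.broker = broker_queues   # shared broker state
        self.recorded = set()         # this connection's recovery cache

    def declare_queue(self, name: str):
        self.broker.add(name)
        self.recorded.add(name)

    def delete_queue(self, name: str):
        self.broker.discard(name)
        self.recorded.discard(name)   # purged only on the deleting connection

    def recover(self):
        # On reconnect, re-declare everything still in this cache.
        self.broker.update(self.recorded)

broker = set()
service_a = RecoveringConnection(broker)
service_b = RecoveringConnection(broker)

service_a.declare_queue("File.1234")
service_b.declare_queue("File.1234")  # B also declared the queue at some point
service_a.delete_queue("File.1234")   # deleted via A; B's cache is untouched
service_b.recover()                   # B reconnects after a node restart...
assert "File.1234" in broker          # ...and the deleted queue reappears
```

This matches the symptom in the thread: any service that had ever declared a file queue could resurrect it on reconnect, not just the service that deleted it.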

