Classic queue startup time significantly increased after upgrade to V2


pfri...@gmail.com

Apr 25, 2025, 11:18:05 AM
to rabbitmq-users
We're running recent versions of RabbitMQ (4.0.7) and Erlang (27.3) on Windows Server 2022, and ever since we made the upgrade to RabbitMQ 4.X we have seen significant increases in startup times for the rabbit service. I've found a thread that indicates that this is a byproduct of our classic queues being upgraded to V2 concurrent with the 4.0 upgrade, but there didn't seem to be a resolution of the problem they were seeing. Prior to the upgrade, our queue recovery times were consistently under 1000ms, but now we're seeing recovery times of, at minimum, 300000ms. Here's a snippet of the most recent startup time in our production environment:

2025-04-11 05:50:11.064000+00:00 [info] <0.486.0> Starting message stores for vhost '/'
2025-04-11 05:50:11.065000+00:00 [info] <0.486.0> Started message store of type transient for vhost '/'
2025-04-11 05:50:11.076000+00:00 [info] <0.486.0> Started message store of type persistent for vhost '/'
2025-04-11 06:01:08.834000+00:00 [info] <0.486.0> Recovering 28 queues of type rabbit_classic_queue took 657796ms
2025-04-11 06:01:08.834000+00:00 [info] <0.486.0> Recovering 0 queues of type rabbit_quorum_queue took 0ms
2025-04-11 06:01:08.834000+00:00 [info] <0.486.0> Recovering 0 queues of type rabbit_stream_queue took 0ms
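For anyone comparing recovery times across restarts, here is a minimal Python sketch that pulls the "Recovering N queues of type ... took ...ms" figures out of log lines in the format shown above. The function name `recovery_times` and the regular expression are illustrative assumptions, not part of any RabbitMQ tooling.

```python
import re

# Matches lines such as:
#   ... Recovering 28 queues of type rabbit_classic_queue took 657796ms
RECOVERY_RE = re.compile(
    r"Recovering (?P<count>\d+) queues of type (?P<qtype>\S+) took (?P<ms>\d+)ms"
)


def recovery_times(log_lines):
    """Return (queue_type, queue_count, milliseconds) for each recovery line."""
    results = []
    for line in log_lines:
        match = RECOVERY_RE.search(line)
        if match:
            results.append(
                (match.group("qtype"), int(match.group("count")), int(match.group("ms")))
            )
    return results


sample = [
    "2025-04-11 06:01:08.834000+00:00 [info] <0.486.0> Recovering 28 queues of type rabbit_classic_queue took 657796ms",
    "2025-04-11 06:01:08.834000+00:00 [info] <0.486.0> Recovering 0 queues of type rabbit_quorum_queue took 0ms",
]

for qtype, count, ms in recovery_times(sample):
    print(f"{qtype}: {count} queues in {ms / 1000:.1f}s")
```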

We always stop the queues prior to restarting or upgrading the rabbit service, so there should be a minimal number of messages that need to be recovered when the service starts. Can anyone provide guidance on how we can get these recovery times down to an acceptable level?

Thanks in advance,
Patrick

Michal Kuratczyk

Apr 25, 2025, 11:27:56 AM
to rabbitm...@googlegroups.com
This is definitely not expected, so we need to debug why this happens.

How many messages are in these queues? What's the size of the message?
Can you provide debug-level logs?
Can you provide the output of `ls -lR YOUR_DATA_DIR/msg_stores/`?

Are you able to reproduce this consistently? If you take a fresh RabbitMQ node, publish messages similar to what your applications use, and restart, what happens?


--
You received this message because you are subscribed to the Google Groups "rabbitmq-users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to rabbitmq-user...@googlegroups.com.
To view this discussion visit https://groups.google.com/d/msgid/rabbitmq-users/1f6dfcf0-6b22-4394-a3fa-9644c3c001c1n%40googlegroups.com.


--
Michal
RabbitMQ Team


pfri...@gmail.com

Apr 25, 2025, 3:27:50 PM
to rabbitmq-users
Hello Michal,

The problem is most acutely felt on our Production servers, which I unfortunately cannot access nor get permission to enable debug logging at this time. We see it to a lesser degree in our testing environment, where queue recovery has gone from completing in under a second on the V1 classic queues to 30+ seconds on V2. I've attached the log with as many debug settings enabled as I can, though we're still using the classic configuration schema and I can't find much useful documentation to ensure that I'm logging everything that is available.

This log represents a pretty typical case for us - we have Windows services that consume messages in the queue which we shut down first. This also prevents the queues from accepting new messages, so in this case there weren't any messages pending in the queues. We then stop the rabbit service, at which point the recovery.dets file looks something like this:



recovery.dets.png
In this case, on restarting, classic queue recovery took about 42 seconds. Message size per queue varies, but I think 2-4 KB would be a good approximation.

Please let me know if there's anything else I can provide.

Thanks,
Patrick
Server_Log_Sanitized.txt

Michal Kuratczyk

Apr 28, 2025, 5:48:35 AM
to rabbitm...@googlegroups.com
We can focus on your test env - 40 seconds is already way longer than we'd expect, but we definitely need more information:
* `rabbitmqctl list_queues` after the recovery (how many messages are in those queues)
* the message size is critical - messages above 4 KB are stored differently, so we need to know how many are below and how many are above this threshold (at least roughly)
* do you have transient messages? Is this a mix of transient and persistent messages perhaps?
* `ls -lR msg_stores/`  - not just the top level, we want to see the files in those folders

If you can reproduce it in your test env, can you share the reproduction method? If not, can you share the msg_stores folder?
I'd assume there's no sensitive data in a test env. You can reach out to me directly over email or slack/discord to share it.
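Since `ls -lR` is not available in a stock Windows shell, a recursive listing can be approximated with a short Python sketch like the one below. It only reports relative paths and file sizes, never file contents. The function names are hypothetical, and the directory to pass in is whatever your node's `msg_stores` folder actually is.

```python
import os


def summarize_store(root):
    """Walk `root` and return sorted (relative_path, size_in_bytes) pairs."""
    entries = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full = os.path.join(dirpath, name)
            entries.append((os.path.relpath(full, root), os.path.getsize(full)))
    return sorted(entries)


def rdq_totals(entries):
    """Return (number of .rdq files, their combined size in bytes)."""
    sizes = [size for path, size in entries if path.endswith(".rdq")]
    return len(sizes), sum(sizes)


def print_report(root):
    """Print one line per file, then a summary of the .rdq files."""
    entries = summarize_store(root)
    for path, size in entries:
        print(f"{size:>12}  {path}")
    count, total = rdq_totals(entries)
    print(f"{count} .rdq files, {total / (1024 * 1024):.1f} MiB")
```

Calling `print_report(r"YOUR_DATA_DIR\msg_stores")` with the real path substituted prints a listing that can be pasted into a reply.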

Best,


pfri...@gmail.com

Apr 28, 2025, 6:15:25 PM
to rabbitmq-users
Hello Michal,

I ran several tests today, during which time I observed a lower than typical testing volume from our business teams. This seemed to have no impact on the startup time, with each instance resulting in a recovery time of more than 40 seconds.

1) Running the list_queues command before and after restarting the rabbit service, I see 0 queued messages in either scenario. I also checked most of the other list_queues commands, all of which yielded similar results.
2) I'm having difficulty in obtaining the exact message sizes, though I made several attempts with the various list_queues message_bytes commands. Some of these queues pass smaller messages, but we also utilize them to pass larger XML bodies. I'll try to get better numbers from our Production environment, but from what I've seen thus far most messages are under 4KB.
3) No, we're using 100% durable queues.
4) Do you need the physical files to view? We handle sensitive data with our application, so while sharing data from the test environment would probably be fine I would prefer not to if possible. If you need the files, I'll get those to you tomorrow. With that being said, all of the queue directories contain a single .queue_name file and the msg_store_transient folder containing a single 0KB 0.rdq file. The msg_store_persistent directory is much more interesting, currently containing 38 .rdq files comprising 128MB of data on this server:
msg_store_persistent.png
Our typical process that reproduces this issue is as follows:

1) Shut down the message consuming services in Windows Services app
2) Shut down or restart the RabbitMQ service in the Windows Services app
3) Start the RabbitMQ service in Windows Services, then wait for the system to start responding (the Rabbit startup process now typically brings the server to its knees)

I greatly appreciate your time and insight into this matter.

Thanks,
Patrick

Michal Kuratczyk

Apr 29, 2025, 2:28:54 AM
to rabbitm...@googlegroups.com
The missing part of these reproduction steps is having the state of the node that you have.
Restarting a node, in general, doesn't lead to such a slow startup. Restarting a node with
the data you have does.

If the queues are empty, I'm fairly sure you can solve/workaround this problem by deleting and recreating them.
It is likely the same or a related issue as https://github.com/rabbitmq/rabbitmq-server/discussions/12848.
This discussion focuses on v1->v2 conversion, but some of that code is also used when starting
a node with v2 queues (after the conversion).

If you are in a position to try this in the test env, that'd likely solve your problem and be a data point for us as well.

Of course this is something we'll work on improving - recreating the queues is just a workaround.
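The delete-and-recreate workaround can also be scripted against the management plugin's HTTP API rather than done by hand. Below is a minimal sketch using only the Python standard library. It assumes the management plugin is enabled, and the base URL, credentials, and queue name are placeholders. It recreates a plain durable queue only: bindings and policies are not restored by this code, so anything bound to a non-default exchange needs to be re-bound separately. The `if-empty=true` parameter asks the server to refuse the delete if the queue still holds messages.

```python
import base64
import json
import urllib.parse
import urllib.request


def queue_url(base_url, vhost, name):
    """Build the management API URL for a queue; the vhost '/' becomes %2F."""
    vhost_part = urllib.parse.quote(vhost, safe="")
    name_part = urllib.parse.quote(name, safe="")
    return f"{base_url}/api/queues/{vhost_part}/{name_part}"


def build_requests(base_url, vhost, name, user, password, arguments=None):
    """Return (delete_request, put_request) without sending anything."""
    token = base64.b64encode(f"{user}:{password}".encode()).decode()
    headers = {"Authorization": f"Basic {token}", "Content-Type": "application/json"}
    url = queue_url(base_url, vhost, name)
    # Only delete when the queue is empty, so no messages are lost.
    delete_req = urllib.request.Request(
        url + "?if-empty=true", method="DELETE", headers=headers
    )
    body = json.dumps(
        {"durable": True, "auto_delete": False, "arguments": arguments or {}}
    ).encode()
    put_req = urllib.request.Request(url, data=body, method="PUT", headers=headers)
    return delete_req, put_req


def recreate_queue(base_url, vhost, name, user, password, arguments=None):
    """Delete an empty queue and declare it again as a durable queue."""
    for req in build_requests(base_url, vhost, name, user, password, arguments):
        with urllib.request.urlopen(req) as resp:
            print(req.get_method(), req.full_url, resp.status)
```

A call would look like `recreate_queue("http://localhost:15672", "/", "my_queue", "user", "password")`, with every argument replaced by real values. With TLS enabled on the management listener, the base URL would be an `https` one instead.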

Best,


Patrick Frith

Apr 29, 2025, 5:56:12 PM
to rabbitm...@googlegroups.com
Yes, recreating the queues seems to do the trick - startup time went down to a few dozen milliseconds. Am I correct in assuming that the rabbitmqadmin tool is the best way to easily delete/create queues via the command line in Windows? We typically set up queues manually via the admin interface, but if this is something that we need to do periodically I would greatly prefer to automate the process. I've been trying to get the rabbitmqadmin tool up and running today, but it's having some issues (likely with TLS). Is there any way I can get deeper insight into this via logfiles?

rabbitmqadmin.png
Thanks,
Patrick


Michal Kuratczyk

Apr 30, 2025, 1:48:08 AM
to rabbitm...@googlegroups.com
Do you know how long ago these queues were created? Assuming the issue was exactly the one I linked to,
the problem was that, while the queue was empty, it had a very large message ID counter, meaning it had
processed lots of messages in the past (billions?). On startup, we iterate over all those values to make sure
there are no messages to recover, which is what's taking so long. If these queues were relatively new,
in terms of how many messages they had processed, that sounds like a different issue.

I don't expect you to need to periodically recreate them. It will likely take a long time before they reach a high
enough ID counter for it to matter really. I'd expect the code to be optimized much sooner than that.
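To illustrate why an empty queue with a large counter can still be slow to start, here is a toy Python model of a linear scan over sequence IDs. It is purely an illustration of the cost pattern described above, not RabbitMQ's actual recovery code, and the function and parameter names are made up for the example.

```python
def scan_for_recoverable(next_seq_id, stored_ids):
    """Toy model: check every sequence ID below the counter for a stored message.

    Returns (found_ids, ids_checked). The work done grows with the counter,
    not with the number of messages actually stored.
    """
    found = []
    checked = 0
    for seq_id in range(next_seq_id):
        checked += 1
        if seq_id in stored_ids:
            found.append(seq_id)
    return found, checked


# An empty queue that has handled many messages still costs one check per ID.
for counter in (1_000, 1_000_000):
    found, checked = scan_for_recoverable(counter, stored_ids=set())
    print(f"counter={counter:>9}  recovered={len(found)}  ids_checked={checked}")
```

Recreating a queue resets its counter, which is consistent with startup time dropping sharply after the workaround.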

There should already be something in the logs when rabbitmqadmin fails to connect.

Best,

pfri...@gmail.com

Apr 30, 2025, 8:43:48 PM
to rabbitmq-users
Yes, these queues have been up and running for over a year, and based on our volume they have almost certainly processed billions of messages in that time. It's good to hear that there are optimizations incoming - I'll be looking out for that.

I didn't have much time to work on getting rabbitmqadmin up and running today, but I didn't see anything that looked to be related in the logs. I'll look into it further tomorrow. Thanks again!

Regards,
Patrick
