Recovering queues of type rabbit_classic_queue taking much longer post conversion to v2

vengi...@gmail.com

Apr 16, 2024, 4:12:17 PM

to rabbitmq-users

In late February 2024 we upgraded to version 3.12.12. Prior to this upgrade we were using version 3.11.28. On both versions during a restart of the service I would see messages like this one:

Recovering 829 queues of type rabbit_classic_queue took 1843ms

After the upgrade to 3.12.12 I added this to the rabbitmq.conf file:

classic_queue.default_version = 2

Then I immediately restarted the service. During the startup I saw many messages of the form:

Queue <<queue-name>> in vhost <<vhost-name>> converted XX total messages from v1 to v2

Note: I interpreted this to mean that my queues were being converted (on disk) to version 2.

The issue I am seeing is that the "Recovering" time is much, much longer than before switching to version 2. For example, during a service startup I now see:

Recovering 866 queues of type rabbit_classic_queue took 119253ms

The number of queues grows slightly and the number and size of the messages changes over time, but not drastically, so I really do not believe those things are a factor (especially since I saw this right away after the first restart which immediately followed the change to the rabbitmq.conf file). In this example, this "recovering" time has grown from under 2 seconds to close to 2 minutes. We've since upgraded to version 3.12.13 but this much longer "Recovering" time remains.
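To put the two log lines above side by side, here is a minimal Python sketch that extracts the queue count and elapsed time from "Recovering ... took ...ms" lines and computes the average per queue. The regular expression is an assumption based only on the two lines quoted in this post; other RabbitMQ versions may word the message differently. The function names are made up for illustration.

```python
import re

# Pattern inferred from the two log lines quoted above (an assumption,
# not taken from the RabbitMQ source), e.g.:
#   Recovering 829 queues of type rabbit_classic_queue took 1843ms
RECOVERY_RE = re.compile(r"Recovering (\d+) queues of type (\w+) took (\d+)ms")


def parse_recovery_lines(lines):
    """Return (queue_count, queue_type, elapsed_ms) for each matching line."""
    results = []
    for line in lines:
        match = RECOVERY_RE.search(line)
        if match:
            count, queue_type, elapsed_ms = match.groups()
            results.append((int(count), queue_type, int(elapsed_ms)))
    return results


def per_queue_ms(count, elapsed_ms):
    """Average recovery time per queue, in milliseconds."""
    return elapsed_ms / count


if __name__ == "__main__":
    sample = [
        "Recovering 829 queues of type rabbit_classic_queue took 1843ms",
        "Recovering 866 queues of type rabbit_classic_queue took 119253ms",
    ]
    for count, queue_type, elapsed_ms in parse_recovery_lines(sample):
        print(f"{queue_type}: {count} queues, {elapsed_ms} ms total, "
              f"{per_queue_ms(count, elapsed_ms):.1f} ms per queue")
```

On the two sample lines this works out to about 2.2 ms per queue before the change and about 137.7 ms per queue after it, roughly 65 times slower overall.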

Can anyone shed any light on this for me?

What is happening during this "Recovering" work?

Is there anything obvious that I may have missed when switching to version 2 classic queues?

Is there any belief this would be remedied by upgrading to version 3.13.1?

Thanks in advance.

Dave Diehl

Michal Kuratczyk

Apr 16, 2024, 7:56:45 PM

to rabbitm...@googlegroups.com

Hi,

Queue recovery is the phase where a starting node goes through the files on disk to recreate the in-memory state of the queue, so that it can actually start receiving and dispatching messages.

1. Just to be sure - you don't see the "converted XX total messages from v1 to v2" logs anymore, right? That happened only once and now you see longer recovery time, without conversion?

2. Please run "rabbitmqctl list_queues messages > msg_count.txt" and share the msg_count.txt

3. Please find "msg_stores/vhosts" in your RabbitMQ data folder, run "ls -lR > ls.txt" and share ls.txt as well

4. If you know the message size used in your application - please let me know

We can try to reproduce the problem based on the information above, but ideally you would go through https://www.rabbitmq.com/blog/2022/05/31/flame-graphs and produce a flame graph from your node startup. That way we'd see where the time is spent.

--
You received this message because you are subscribed to the Google Groups "rabbitmq-users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to rabbitmq-user...@googlegroups.com.
To view this discussion on the web, visit https://groups.google.com/d/msgid/rabbitmq-users/321bb293-0a59-460a-9366-0314ba27871dn%40googlegroups.com.

--

Michal

RabbitMQ Team

vengi...@gmail.com

Apr 17, 2024, 9:42:25 AM

to rabbitmq-users

Michal,

Thanks for defining what queue recovery means. You'd think that this would be faster than ever since classic queues v2 are supposed to behave similarly to lazy queues (less data to put into memory). In fact, I should have mentioned this earlier. Many of the queues in my "cd" virtual host had a policy applying queue-mode lazy on them prior to the upgrade to Rabbit 3.12.x. As part of the process back in February of changing the default_version to 2, I also removed the policy adding the lazy queue-mode since that seemed no longer necessary. Was that an incorrect assumption?

Now to answer your questions:

1. Yes, you are correct. I only saw the "converted XX total messages from v1 to v2" messages one time, on the first restart after changing the default_version to 2. I have not seen those messages again on any subsequent RabbitMQ server startup.

2. I have no queues in the default "/" vhost. We have 8 other vhosts that our various applications use. I ran the command you gave me for each vhost and appended the results into one msg_count.txt file. It is attached. I also took the liberty of running the same command with the message_bytes option in case you find that useful. That file is also attached, named msg_bytes.txt.

3. File ls.txt is attached. I also ran "du -sh" on that directory and it reported a size of 2.1G

4. Message sizes are highly variable since we have a variety of applications and functionality. Some are only a few hundred bytes. Some are a few thousand bytes and some are 40 - 50 thousand bytes. Some are in the range of 2 million bytes and some are even larger than that, say 45 million bytes. During a restart most messages are likely under 50 thousand bytes (those are the most common messages that have not been consumed).
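For anyone wanting a quick total from a file like msg_count.txt, here is a small Python sketch. It assumes the file holds one integer per queue, mixed with non-numeric header lines printed by rabbitmqctl for each vhost; that layout is an assumption, so check it against your own output. The function name is made up for illustration.

```python
def sum_counts(lines):
    """Return (queue_count, total) over the purely numeric lines.

    Header lines, blank lines and anything else non-numeric are skipped.
    """
    queues = 0
    total = 0
    for line in lines:
        value = line.strip()
        if value.isdecimal():
            queues += 1
            total += int(value)
    return queues, total


if __name__ == "__main__":
    # Made-up sample data, not the contents of the attached file.
    sample = [
        "Listing queues for vhost cd ...",
        "messages",
        "0",
        "12",
        "",
        "345",
    ]
    queues, total = sum_counts(sample)
    print(f"{queues} queues, {total} messages")
```

The same function works on msg_bytes.txt if it follows the same one-number-per-line layout; pass it the result of `open(path).readlines()`.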

I'll read the information you pointed me to about the flame graphs, but I'm not sure I'll be able to do this on a production server.

dave

msg_bytes.txt

ls.txt

msg_count.txt

Michal Kuratczyk

Apr 17, 2024, 10:12:36 PM

to rabbitm...@googlegroups.com

Your understanding is correct - the "lazy" mode is ignored since 3.12, so removing it from the policy was correct. This slowdown is unlikely to be related specifically to keeping messages in memory, though.

Even though messages are generally not kept in memory in v2, RabbitMQ still needs to check what's on disk when it starts. v2 certainly made a lot of things faster, but it seems like there's a case where we have a regression. I can have a look into this next week.

Running profiling in production is perfectly reasonable (the impact of running `perf` for a moment is very low; you can also use continuous profiling tools like parca.dev). If it's not easy to install the necessary tools in your prod environment, you can reproduce the problem in a different environment and record `perf` there. That's what I need to do if you can't.

Best,

vengi...@gmail.com

Apr 19, 2024, 8:25:15 AM

to rabbitmq-users

Both Linux admins that I need help from are out today; both are expected back on Monday though. So, I hope to be able to copy the queue data (I plan to tarball up the "msg_stores/vhosts" directory) over to our staging server and re-create the issue there, then create this flame graph for you. With luck I'll finish this on Monday or Tuesday of next week.

dave

Loïc Hoguin

Apr 26, 2024, 5:48:46 AM

to rabbitm...@googlegroups.com, vengi...@gmail.com

Hello,

The queues should not take a long time recovering if the node was stopped gracefully. The amount of time it spends on recovery suggests that it goes through the messages to rebuild the index or similar. If the node was stopped gracefully, perhaps there is an issue with the state that it writes to disk when stopping the node (the recovery.dets file, one per vhost). That file is basically empty when the node is running and gets populated on shutdown. If for some reason the node stops before that happens, then it has to recover from scratch on next start.
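One way to check the recovery.dets files described above is to list them with their sizes, once while the node is running and once after a graceful stop. A minimal Python sketch follows. It assumes the files live somewhere under the "msg_stores/vhosts" directory mentioned earlier in this thread, and it searches recursively so it does not depend on the exact subdirectory layout. It only reports sizes; what size counts as "basically empty" is not stated here, so compare the two runs yourself. The function name is made up for illustration.

```python
import sys
from pathlib import Path


def find_recovery_files(vhosts_dir):
    """Return a sorted list of (path, size_in_bytes) for every
    recovery.dets file found anywhere under vhosts_dir."""
    root = Path(vhosts_dir)
    return sorted(
        (str(path), path.stat().st_size)
        for path in root.rglob("recovery.dets")
    )


if __name__ == "__main__":
    if len(sys.argv) != 2:
        # The directory is supplied by the caller; no default path is assumed.
        print("usage: python recovery_files.py <path to msg_stores/vhosts>")
    else:
        for path, size in find_recovery_files(sys.argv[1]):
            print(f"{size:>12} bytes  {path}")
```

If the sizes after a graceful stop look the same as while the node is running, that would be consistent with the shutdown state not being written, which is the scenario described above.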

Either way, having the logs would help, especially with debug logs enabled. It would tell us exactly why it doesn't do a clean recovery.

Cheers,
