rabbitmq unresponsive when many queues are deleted


gmf

Apr 13, 2022, 11:48:28 AM
to rabbitmq-users
Hi,
In a 3-node cluster (8 cores, 32 GB RAM per node; RabbitMQ 3.9.14, Erlang 24.3.3), when many queues are deleted (more than 2 deletions per second), the cluster becomes unresponsive and messages are no longer delivered.
Attached are the Grafana / Prometheus metrics during one of these events (15:00 - 15:25).
Any idea?
Thanks a lot.

rabbitmq-unresponsive.png

Wes Peng

Apr 13, 2022, 5:27:14 PM
to rabbitm...@googlegroups.com
Do you delete the queues externally?
You could set up a policy to let RabbitMQ delete them automatically.

Thanks 


gmf

Apr 14, 2022, 5:28:25 AM
to rabbitmq-users
The service works like this: when a client connects, a dedicated queue is created, and when the client disconnects the queue is removed after 30 seconds by a policy (unless the same client reconnects in the meantime). At certain times many clients disconnect (a few thousand within a few minutes), and the corresponding queues are then deleted, after the 30-second grace period, at a rate of 2-10 per second.
For 15-25 minutes after these deletions, RabbitMQ is no longer able to deliver messages and even the management GUI responds slowly.
The CPU utilization of each node (8 cores, 32GB RAM) never exceeds 30% and the RAM used is around 3 GB.
The linux limits of the process are as follows:

Limit                     Soft Limit           Hard Limit           Units    
Max cpu time              unlimited            unlimited            seconds  
Max file size             unlimited            unlimited            bytes    
Max data size             unlimited            unlimited            bytes    
Max stack size            8388608              unlimited            bytes    
Max core file size        unlimited            unlimited            bytes    
Max resident set          unlimited            unlimited            bytes    
Max processes             unlimited            unlimited            processes
Max open files            1048576              1048576              files    
Max locked memory         65536                65536                bytes    
Max address space         unlimited            unlimited            bytes    
Max file locks            unlimited            unlimited            locks    
Max pending signals       127020               127020               signals  
Max msgqueue size         819200               819200               bytes    
Max nice priority         0                    0                    
Max realtime priority     0                    0                    
Max realtime timeout      unlimited            unlimited            us 


The Erlang runtime launch command looks like this:
/usr/local/lib/erlang/erts-12.3.1/bin/beam.smp -W w -MBas ageffcbf -MHas ageffcbf -MBlmbcs 512 -MHlmbcs 512 -MMmcs 30 -P 1048576 -t 5000000 -stbt db -zdbbl 256000 -sbwt none -sbwtdcpu none -sbwtdio none -sbwt none -sbwtdcpu none -sbwtdio none -B i -- ...

gmf

Apr 14, 2022, 7:34:55 AM
to rabbitmq-users
Attached are the metrics during today's 5-minute event (13:00 - 13:05).
rabbitmq-overview-202204141300.png
rabbitmq-grafana-202204141300.png

david....@gmx.de

Apr 14, 2022, 1:42:45 PM
to rabbitmq-users
Could you give a bit more information on the queues you are deleting:
1. What queue type are you using (classic queues or quorum queues)?
2. What are the queue properties (durable? exclusive? auto-delete? arguments?)

gmf

Apr 15, 2022, 3:40:00 AM
to rabbitmq-users
The queues are classic, non-durable, non-mirrored.
Other properties, set via the userQueue policy, are:
expires: 30000
message-ttl: 1800000
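For reference, a policy like the one described could be declared from the CLI roughly as follows. This is only a sketch: the queue-name pattern "^user\." is a placeholder (the real pattern is not shown in the thread); only the expires and message-ttl values are the ones quoted above.

```shell
# Sketch of the userQueue policy described above; the "^user\." pattern
# is a placeholder, only the expires / message-ttl values are from the thread.
rabbitmqctl set_policy userQueue "^user\." \
  '{"expires":30000,"message-ttl":1800000}' --apply-to queues
```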

(an example attached)

About 8000 - 9000 queues for most of the day.

rabbitmq-queue-202204150911.png

Michal Kuratczyk

Apr 21, 2022, 7:07:40 AM
to rabbitm...@googlegroups.com
Hi,

I'm afraid I'm unable to reproduce the problem. I declared thousands of queues and re-declared them when they started expiring, ran a separate perf-test publishing and consuming at the same time, and had clients reconnecting, all to no avail: my transaction/lock numbers are much lower and my cluster stayed responsive (no significant drop in publish/consume rates for a client running concurrently).

It seems that in your case something else is going on apart from the queue deletions, which would explain the many transaction restarts.
* Do you see many queues declared at the same time?
* Can you think of anything else that could be declaring/deleting objects at the same time?
* Can you share the logs? Perhaps that would tell us what other things happen around that time
* Could you try to build an executable test case that we can use to reproduce the problem? Ideally using https://github.com/rabbitmq/rabbitmq-perf-test as your client
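A reproduction attempt along these lines could start from something like the perf-test invocation below. This is only a hedged sketch: the broker URI, queue-name pattern, and all numbers are guesses that would need tuning to match the production pattern (thousands of auto-expiring per-client queues plus a steady workload).

```shell
# Hypothetical reproduction sketch: many short-lived auto-expiring queues
# plus a steady publish/consume load. URI and all numbers are placeholders.
bin/runjava com.rabbitmq.perf.PerfTest -h amqp://broker-host \
  --queue-pattern 'client-%d' --queue-pattern-from 1 --queue-pattern-to 8000 \
  --queue-args x-expires=30000 \
  -x 1 -y 2 --rate 100 -z 300
```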

Thanks,

Screenshot 2022-04-21 at 12.55.29.png

--
Michał
RabbitMQ team

gmf

Apr 21, 2022, 11:05:47 AM
to rabbitmq-users
Hi,
The rate of queue declarations is always high (100 - 200 per second), and it actually seems to grow along with the mnesia locks and transaction restarts (see attachment).

It is not clear to me what "mnesia transaction coordinators" are: do they correspond to Erlang schedulers?
During the issue there are 8 fully occupied coordinators per node (equal to the number of CPU cores): could it be useful to increase that number?
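As a side note, some of these counters can be sampled directly from mnesia on a node; the keys below are standard mnesia:system_info/1 keys, and comparing a quiet period against an event window might help narrow things down.

```shell
# Sample mnesia transaction/lock state on one node; all keys are standard
# mnesia:system_info/1 keys (held_locks and lock_queue return lock lists).
rabbitmqctl eval '[{K, mnesia:system_info(K)} || K <- [transaction_restarts, transaction_commits, lock_queue, held_locks]].'
```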

Thanks.
rabbitmq-20220421.png

david....@gmx.de

Apr 25, 2022, 12:34:03 PM
to rabbitmq-users
You could try experimenting by setting
mnesia:set_debug_level(debug)
on a RabbitMQ node to see which transactions are restarted and why.
For example, run
rabbitmqctl eval 'mnesia:set_debug_level(debug).'
Ideally you will see log lines printed to stdout when transactions restart. An example output could be:
Mnesia(nonode@nohost): Restarting transaction {tid,4,<0.128.0>}: in 3ms {cyclic,nonode@nohost,{'______GLOBAL_____',same_key},write,write,{tid,3,<0.127.0>}}
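Presumably the level should be reset once the experiment is done, since debug logging is verbose; per the mnesia documentation, none is the default level.

```shell
# Restore the default mnesia debug level after the experiment.
rabbitmqctl eval 'mnesia:set_debug_level(none).'
```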

gmf

May 12, 2022, 12:51:26 PM
to rabbitmq-users
Attached is the log file of a RabbitMQ node after running the suggested command.

I have also found an old GitHub issue that seems related to my problem:
https://github.com/rabbitmq/rabbitmq-server/issues/1513
but that should already be fixed as of RabbitMQ 3.8, and therefore certainly in 3.9.

As already mentioned, during the issue there are 8 fully occupied mnesia transaction coordinators per node (equal to the number of CPU cores): could it be useful to increase the number of Erlang schedulers beyond the number of CPU cores?
Do mnesia transaction coordinators correspond to Erlang schedulers?
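For what it's worth, the number of online schedulers can be checked on a running node with a standard erlang:system_info/1 call; by default it equals the number of logical processors.

```shell
# Check how many Erlang schedulers are online on this node.
rabbitmqctl eval 'erlang:system_info(schedulers_online).'
```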

Thanks
rabbitmq-0-20220512.log