rabbitmq unresponsive when many queues are deleted


gmf

Apr 13, 2022, 11:48:28 AM
to rabbitmq-users
Hi,
In a 3-node cluster (8 cores, 32 GB RAM per node; RabbitMQ 3.9.14, Erlang 24.3.3), when many queues are deleted (more than 2 deletions per second), the cluster becomes unresponsive and messages are no longer delivered.
Attached are the Grafana / Prometheus metrics during one of these events (15:00 - 15:25).
Any idea?
Thanks a lot.

rabbitmq-unresponsive.png

Wes Peng

Apr 13, 2022, 5:27:14 PM
to rabbitm...@googlegroups.com
Do you delete the queues externally?
You could set up a policy to let RabbitMQ delete them automatically.

Thanks 


gmf

Apr 14, 2022, 5:28:25 AM
to rabbitmq-users
The service works like this: when a client connects, a dedicated queue is created, and when the client disconnects the queue is removed after 30 seconds by a policy (unless the same client reconnects in the meantime). At certain times many clients disconnect (a few thousand within a few minutes), and the corresponding queues are then deleted, after the 30-second grace period, at a rate of 2-10 per second.
For 15-25 minutes after these deletions, RabbitMQ is no longer able to deliver messages and even the management GUI responds slowly.
The CPU utilization of each node (8 cores, 32GB RAM) never exceeds 30% and the RAM used is around 3 GB.
The linux limits of the process are as follows:

Limit                     Soft Limit           Hard Limit           Units    
Max cpu time              unlimited            unlimited            seconds  
Max file size             unlimited            unlimited            bytes    
Max data size             unlimited            unlimited            bytes    
Max stack size            8388608              unlimited            bytes    
Max core file size        unlimited            unlimited            bytes    
Max resident set          unlimited            unlimited            bytes    
Max processes             unlimited            unlimited            processes
Max open files            1048576              1048576              files    
Max locked memory         65536                65536                bytes    
Max address space         unlimited            unlimited            bytes    
Max file locks            unlimited            unlimited            locks    
Max pending signals       127020               127020               signals  
Max msgqueue size         819200               819200               bytes    
Max nice priority         0                    0                    
Max realtime priority     0                    0                    
Max realtime timeout      unlimited            unlimited            us 


The Erlang runtime launch command looks like this:
/usr/local/lib/erlang/erts-12.3.1/bin/beam.smp -W w -MBas ageffcbf -MHas ageffcbf -MBlmbcs 512 -MHlmbcs 512 -MMmcs 30 -P 1048576 -t 5000000 -stbt db -zdbbl 256000 -sbwt none -sbwtdcpu none -sbwtdio none -sbwt none -sbwtdcpu none -sbwtdio none -B i -- ...

gmf

Apr 14, 2022, 7:34:55 AM
to rabbitmq-users
Attached are the metrics during today's 5-minute event (13:00 - 13:05).
rabbitmq-overview-202204141300.png
rabbitmq-grafana-202204141300.png

david....@gmx.de

Apr 14, 2022, 1:42:45 PM
to rabbitmq-users
Could you give a bit more information on the queues you are deleting:
1. What queue type are you using (classic queues or quorum queues)?
2. What are the queue properties (durable? exclusive? auto-delete? arguments?)

gmf

Apr 15, 2022, 3:40:00 AM
to rabbitmq-users
The queues are classic, non-durable, non-mirrored.
Other properties, set via the userQueue policy, are:
expires: 30000
message-ttl: 1800000
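For reference, a policy like the one described could be declared from the CLI roughly as follows. This is only a sketch: the queue-name pattern "^user\." is a placeholder (the real pattern is not shown in the thread); only the expires and message-ttl values are the ones quoted above.

```shell
# Sketch of the userQueue policy described above; the "^user\." pattern
# is a placeholder, only the expires / message-ttl values are from the thread.
rabbitmqctl set_policy userQueue "^user\." \
  '{"expires":30000,"message-ttl":1800000}' --apply-to queues
```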

(an example attached)

About 8000 - 9000 queues for most of the day.

rabbitmq-queue-202204150911.png

Michal Kuratczyk

Apr 21, 2022, 7:07:40 AM
to rabbitm...@googlegroups.com
Hi,

I'm afraid I'm unable to reproduce the problem. I declared thousands of queues and re-declared them when they started expiring, ran a separate perf-test publishing and consuming at the same time, and had clients reconnecting, all to no avail: my transaction/lock numbers are much lower and my cluster stayed responsive (no significant drop in publish/consume rates for a client running concurrently).

It seems that in your case something else is going on apart from the queue deletions, which would explain the many transaction restarts.
* Do you see many queues declared at the same time?
* Can you think of anything else that could be declaring/deleting objects at the same time?
* Can you share the logs? Perhaps that would tell us what other things happen around that time
* Could you try to build an executable test case that we can use to reproduce the problem? Ideally using https://github.com/rabbitmq/rabbitmq-perf-test as your client
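A reproduction attempt along these lines could start from something like the perf-test invocation below. This is only a hedged sketch: the broker URI, queue-name pattern, and all numbers are guesses that would need tuning to match the production pattern (thousands of auto-expiring per-client queues plus a steady workload).

```shell
# Hypothetical reproduction sketch: many short-lived auto-expiring queues
# plus a steady publish/consume load. URI and all numbers are placeholders.
bin/runjava com.rabbitmq.perf.PerfTest -h amqp://broker-host \
  --queue-pattern 'client-%d' --queue-pattern-from 1 --queue-pattern-to 8000 \
  --queue-args x-expires=30000 \
  -x 1 -y 2 --rate 100 -z 300
```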

Thanks,

Screenshot 2022-04-21 at 12.55.29.png

--
Michał
RabbitMQ team

gmf

Apr 21, 2022, 11:05:47 AM
to rabbitmq-users
Hi,
The rate of queue declarations is always high (100 - 200 per second), and it actually seems to grow along with the mnesia locks and transaction restarts (see attachment).

It is not clear to me what "mnesia transaction coordinators" are: do they correspond to Erlang schedulers?
During the issue there are 8 fully occupied coordinators per node (equal to the number of CPU cores): could it be useful to increase that number?
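As a side note, some of these counters can be sampled directly from mnesia on a node; the keys below are standard mnesia:system_info/1 keys, and comparing a quiet period against an event window might help narrow things down.

```shell
# Sample mnesia transaction/lock state on one node; all keys are standard
# mnesia:system_info/1 keys (held_locks and lock_queue return lock lists).
rabbitmqctl eval '[{K, mnesia:system_info(K)} || K <- [transaction_restarts, transaction_commits, lock_queue, held_locks]].'
```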

Thanks.
rabbitmq-20220421.png

david....@gmx.de

Apr 25, 2022, 12:34:03 PM
to rabbitmq-users
You could try experimenting by setting
mnesia:set_debug_level(debug)
on a RabbitMQ node to see which transactions are restarted and why.
For example, run
rabbitmqctl eval 'mnesia:set_debug_level(debug).'
Ideally you will see log lines printed to stdout when transactions restart. An example output could be:
Mnesia(nonode@nohost): Restarting transaction {tid,4,<0.128.0>}: in 3ms {cyclic,nonode@nohost,{'______GLOBAL_____',same_key},write,write,{tid,3,<0.127.0>}}
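Presumably the level should be reset once the experiment is done, since debug logging is verbose; per the mnesia documentation, none is the default level.

```shell
# Restore the default mnesia debug level after the experiment.
rabbitmqctl eval 'mnesia:set_debug_level(none).'
```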

gmf

May 12, 2022, 12:51:26 PM
to rabbitmq-users
Attached is the log file of a RabbitMQ node after running the suggested command.

I have also found an old GitHub issue that seems related to my problem:
https://github.com/rabbitmq/rabbitmq-server/issues/1513
but that should already be fixed as of RabbitMQ 3.8, and therefore certainly in 3.9.

As already mentioned, during the issue there are 8 fully occupied mnesia transaction coordinators per node (equal to the number of CPU cores): could it be useful to increase the number of Erlang schedulers beyond the number of CPU cores?
Do mnesia transaction coordinators correspond to Erlang schedulers?
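For what it's worth, the number of online schedulers can be checked on a running node with a standard erlang:system_info/1 call; by default it equals the number of logical processors.

```shell
# Check how many Erlang schedulers are online on this node.
rabbitmqctl eval 'erlang:system_info(schedulers_online).'
```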

Thanks
rabbitmq-0-20220512.log