Hello RabbitMQ users,
We have an intermittent performance-related problem on one of our RabbitMQ servers that hits us a few times per day.
When the problem hits:
-Queue buildup in multiple queues
-Remote federation queue also growing
-Sluggish MB GUI (Web-GUI via Management plugin)
-CLI commands time-out
-Possibly also connection failures on AMQP port 5671, but this has not been verified.
-No obvious resource shortage (OS performance data looks fine).
We discussed this with some of our internal Erlang specialists, ran a short "perf" collection while the problem was present, and reviewed the result with "perf report".
Their conclusion was that we likely suffer from "ETS lock contention".
The problem got worse after we added CPUs (we went from 8 vCPU to 16 vCPU).
We have enabled the Prometheus plugin and collect data every 20s.
We have also done some queue cleanup and changed the configuration to stop collecting rate metrics, in order to lower the impact.
This has helped, but we still see the issue.
I understand that this could be a very tricky problem, but I wonder if anyone has experience with ETS lock contention issues?
Are there specific Prometheus metrics that can help visualize ETS locking?
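To show what we have been looking at so far, here is a minimal Python sketch that filters Prometheus text-format scrape output for series whose name contains "ets" as a separate word. The function name filter_metrics and the "ets" keyword are our own assumptions for illustration. We do not know whether any of the exposed series actually reflects lock contention, so this only narrows down which ETS-related series exist.

```python
import re

# One sample line of the Prometheus text format:
# metric name, optional {labels}, then the value (an optional timestamp is ignored).
SAMPLE_RE = re.compile(r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(\{.*\})?\s+(\S+)')


def filter_metrics(exposition_text, needle="ets"):
    """Return (name, labels, value) for samples whose name has `needle`
    as one of its underscore-separated words.

    `needle` is an assumption: it only finds series that carry "ets" in
    their name, which says nothing about lock contention by itself.
    """
    matches = []
    for line in exposition_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        m = SAMPLE_RE.match(line)
        if m is None:
            continue  # not a sample line we understand
        name, labels, value = m.group(1), m.group(2) or "", m.group(3)
        # Compare whole words so that e.g. "packets" does not match "ets".
        if needle in name.lower().split("_"):
            matches.append((name, labels, float(value)))
    return matches
```

The input text would be the raw body returned by the Prometheus plugin endpoint (port 15692, path /metrics by default), saved from one scrape taken during an incident and one taken during normal operation for comparison.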
----------------------------------
Config details:
RHEL 7, 16 vCPU, 64 GB, running in a VMware VM.
RabbitMQ 3.8.9, Erlang 23.1.1
~3000 queues
~1000 Clients
We use persistent messages and durable queues.
We use classic queues and run on a single node.
A short perf data collection is available if someone wants to have a look.
(perf record -F 99 -p <beam-pid> --call-graph dwarf -- sleep 5)
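For anyone who prefers a quick summary over browsing "perf report", below is a rough Python sketch that reads the text output of "perf script" and counts how many samples have at least one stack frame whose symbol matches a pattern. The function name count_matching_samples and the default pattern are assumptions on our side. We have not verified which symbol names the BEAM actually uses for ETS or lock code, so the pattern will need adjusting. The sketch also assumes the usual "perf script" layout: one header line per sample, followed by indented frame lines, with samples separated by blank lines.

```python
import re
from collections import Counter


def count_matching_samples(perf_script_text, pattern=r"ets|rwmtx|rwlock"):
    """Summarise `perf script` text output.

    Returns (total_samples, matching_samples, symbols), where `symbols`
    counts every frame whose symbol matches `pattern`.
    The default pattern is a guess, not a verified list of BEAM symbols.
    """
    frame_re = re.compile(pattern, re.IGNORECASE)
    total = 0
    matching = 0
    symbols = Counter()
    # Samples are assumed to be separated by blank lines.
    for block in re.split(r"\n\s*\n", perf_script_text.strip()):
        lines = block.splitlines()
        if not lines:
            continue
        total += 1
        hit = False
        for frame in lines[1:]:  # lines[0] is the sample header
            parts = frame.split()
            if len(parts) < 2:
                continue
            # Frame lines look like "<address> <symbol>+<offset> (<dso>)".
            symbol = parts[1].split("+")[0]
            if frame_re.search(symbol):
                symbols[symbol] += 1
                hit = True
        if hit:
            matching += 1
    return total, matching, symbols
```

The input could be produced with "perf script -i perf.data > out.txt" and the file contents passed to the function. A high share of matching samples during an incident, compared with a quiet period, would be a hint in the same direction as the "perf report" review, not proof.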
Best Regards,
Thomas