RabbitMQ 3.8.9 ETS lock contention

thoma...@gmail.com

Aug 2, 2022, 8:54:13 AM
to rabbitmq-users
Hello RabbitMQ users,

We have an intermittent performance-related problem on one of our RabbitMQ servers that hits us a few times per day.

When the problem hits:
- Queue buildup in multiple queues
- Remote federation queue also growing
- Sluggish web GUI (via the Management plugin)
- CLI commands time out
- Possibly also connection failures on the AMQPS port 5671, but this has not been verified
- No obvious resource shortage (OS performance data looks fine)

We discussed this with some of our internal Erlang specialists, ran a short "perf" collection while the problem was present, and reviewed it with "perf report".
The conclusion was that we likely suffer from "ETS lock contention".
The problem got worse after adding CPUs (we went from 8 vCPUs to 16 vCPUs).

We have enabled the Prometheus plugin and collect data every 20 s.
We have also done some queue cleanup and made configuration changes, removing collection of rate metrics, to lower the impact.
This has helped, but we still see the issue.

I understand that this could be a very tricky problem, but I wonder if anyone has experience with ETS lock contention issues?
Are there specific Prometheus metrics that can help visualize ETS locking?
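In case it helps anyone looking at the same thing, here is a minimal sketch of how one could filter the plugin's /metrics output for ETS-related series. The metric names in the sample are assumptions (e.g. erlang_vm_ets_limit is what I believe the plugin exposes); check your own endpoint. It also assumes no label values contain spaces, which holds for simple gauges:

```python
def ets_metrics(exposition_text):
    """Return {metric_name: value} for series whose name contains 'ets'.

    Parses Prometheus text exposition format: '# ...' comment lines are
    skipped, data lines look like 'name{labels} value' or 'name value'.
    """
    result = {}
    for line in exposition_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):  # skip HELP/TYPE/comments
            continue
        name_part, _, value = line.rpartition(" ")
        metric = name_part.split("{", 1)[0]   # drop any {label="..."} part
        if "ets" in metric:
            result[metric] = float(value)
    return result

# Hypothetical sample payload; real metric names may differ per version.
sample = """\
# HELP erlang_vm_ets_limit Maximum number of ETS tables
erlang_vm_ets_limit 50000
erlang_vm_memory_bytes_total{kind="system"} 123456
"""
print(ets_metrics(sample))  # only the ETS-related series survive
```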

----------------------------------
Config details:
RHEL 7, 16 vCPUs, 64 GB RAM, running in a VMware VM.
RabbitMQ 3.8.9, Erlang 23.1.1
~3000 queues
~1000 clients
We use persistent messages and durable queues.

We use classic queues and run on a single node.
A short Perf data collection is available if someone wants to have a look.
(perf record -F 99 -p <beam-pid> --call-graph dwarf -- sleep 5)

Best Regards,

Thomas

perf.data.gz
rabbitmq.config

thoma...@gmail.com

Aug 3, 2022, 3:34:00 AM
to rabbitmq-users
Hi,

We made an attempt with our in-house Erlang experts to reproduce the problem on a lab system.
The lab system runs RabbitMQ 3.8.9 and Erlang 23.1.1, instrumented with the Erlang lcnt lock profiler.

Load was simulated with PerfTest:
bin/runjava com.rabbitmq.perf.PerfTest --queue-pattern 'testqueue.perf-test-%d' \
  --queue-pattern-from 1 --queue-pattern-to 1000 \
  --producers 5 --consumers 1000 \
  -r 10 \
  --size 10000 \
  -f persistent \
  -e thomas_perftest \
  -k perf.test \
  -h amqps://server1:5671


lcnt data was then collected while the system was under load.
Output is pasted below. Unfortunately the formatting is ugly; not sure how to avoid that. Maybe better to attach a file?

There are a few differences from the production system:
  • Most noticeably, we have only 4 CPUs in the lab system.
  • Load is simplified. The production system has more "moving parts".
In our lcnt profiling test, the top ETS table was "db_tab file_handle_cache_stats", with "db_tab rabbit_msg_store_flying" in second place.
Assuming our attempt to replicate is valid, does this bring any ideas?
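For reference, the collection steps looked roughly like the sketch below (a sketch from memory, assuming an emulator built with lock-counting support; function names per the OTP lcnt documentation):

```erlang
%% Sketch: collect lock-counter data while the node is under load.
lcnt:start().                        %% start the lcnt server
lcnt:clear().                        %% reset counters before applying load
timer:sleep(10000).                  %% let the system run under load
lcnt:collect().                      %% snapshot the counters
lcnt:conflicts().                    %% combined view per lock class
lcnt:conflicts([{combine, false}]).  %% per-instance view (e.g. per ETS table)
```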


(rabbit@server1)5> lcnt:conflicts().
lock                          id    #tries  #collisions  collisions [%]  time [us]  duration [%]
-----                        ---   -------  -----------  --------------  ---------  ------------
db_tab                       189   8773773       229458          2.6153   41746866       57.9955
crypto_stat                   41   3719555        47793          1.2849     950710        1.3207
run_queue                      6  16016081       199865          1.2479     612384        0.8507
alcu_allocator                11     20722          317          1.5298     571383        0.7938
proc_main                   2272   9416449        83282          0.8844      24143        0.0335
proc_msgq                   2272  18422087         3981          0.0216       4343        0.0060
port_lock                    123    699588           75          0.0107       3041        0.0042
drv_ev_state                 128    382193            3          0.0008        248        0.0003
proc_status                 2272   7821663           12          0.0002         86        0.0001
mseg                           1      2036            4          0.1965         73        0.0001
dirty_break_point_index        1      7052            6          0.0851          7        0.0000
dirty_run_queue_sleep_list     2     10288            5          0.0486          5        0.0000
port_sched_lock              127   1008275            8          0.0008          3        0.0000

(rabbit@server1)6> lcnt:conflicts([{combine,false}]).
lock            id                          #tries  #collisions  collisions [%]  time [us]  duration [%]
-----           ---                        -------  -----------  --------------  ---------  ------------
db_tab          file_handle_cache_stats    3618990       159041          4.3946   32323196       44.9040
db_tab          rabbit_msg_store_flying    2447071        63564          2.5976    8919204       12.3907
crypto_stat                                3685044        47735          1.2954     950516        1.3205
alcu_allocator  eheap_alloc                   2840          233          8.2042     571126        0.7934
db_tab          rabbit_msg_store_cur_file  1436401         6853          0.4771     504466        0.7008
run_queue       4                          4019841        51776          1.2880     166965        0.2320
run_queue       3                          3908785        49392          1.2636     150395        0.2089
run_queue       1                          4021024        50483          1.2555     150273        0.2088
run_queue       2                          4055976        48207          1.1885     144743        0.2011



Best Regards,

Thomas

Johan Rhodin

Aug 3, 2022, 9:41:35 AM
to rabbitm...@googlegroups.com
This sounds like a really interesting case to troubleshoot, but before we do that you need to upgrade to the latest versions of RabbitMQ and especially Erlang. The Erlang version you are on is old and has known issues, including issues related to nodes "freezing up". So upgrade to Erlang 25 and a matching RabbitMQ and let's see if this problem persists.

/Johan

--
You received this message because you are subscribed to the Google Groups "rabbitmq-users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to rabbitmq-user...@googlegroups.com.
To view this discussion on the web, visit https://groups.google.com/d/msgid/rabbitmq-users/b6d3339f-1fc4-4740-a78d-7180776c4309n%40googlegroups.com.

thoma...@gmail.com

Aug 4, 2022, 4:02:25 AM
to rabbitmq-users
Hi Johan,

Thanks for your reply. 
Believe me, I would like to upgrade, but this is not always easy to do.
The system is in heavy production and serves a lot of people with a multitude of different clients and workflows.
Moving to a new version is always a risk and needs to be tested with our combination of clients, workload, etc. before we move to production.

If you have links to known issues with Erlang 23.1.1 I'd be happy to read up.
If we can identify that this is indeed an Erlang bug, we have a strong case to upgrade.

Best Regards,

Thomas

Nam Le

Aug 4, 2022, 4:22:51 AM
to rabbitm...@googlegroups.com
:D

On Thu, Aug 4, 2022 at 15:02, thoma...@gmail.com <thoma...@gmail.com> wrote:


--
NamLe
Tel: 0905939769

jo...@cloudamqp.com

Aug 5, 2022, 11:56:51 AM
to rabbitmq-users
Yes, I fully understand that upgrading might not be easy to do, but the users/operators/developers on this list can't put in meaningful work when there are known issues that have been fixed in both Erlang and RabbitMQ (see https://github.com/rabbitmq/rabbitmq-server/pull/4324 for one instance of ETS improvements).

I don't have a bug/PR handy for 23.1.x issues, but maybe this can help make the case for upgrading: at CloudAMQP we've seen so many issues with Erlang 23.1.x that we actively warn our users against that version and display a warning sign if a user is running it.

/Johan

thoma...@gmail.com

Aug 9, 2022, 4:07:00 AM
to rabbitmq-users
Hi Johan,

Thanks for your reply. The link you provided is very interesting and seems to match well with what we see.
The release that we have "cooking" in late stages of development is based on 3.10.5 and Erlang 24.3.4.

If I understand correctly, the ETS improvements are included in all 3.10 versions? (We selected 3.10.5 as it was the latest and greatest when work was started.)

I understand that investing time in troubleshooting old releases is not tempting, but for me as an operations technician, initiating an upgrade that may or may not help the problem is equally untempting.
Building a strong case to convince our users/managers to take the needed downtime is half the battle.

Anyway, I really appreciate your help pointing to the ETS issue mentioned above.
BR,
Thomas

thoma...@gmail.com

Aug 9, 2022, 5:50:26 AM
to rabbitmq-users
Hi,

Are the frequent locks on the file_handle_cache_stats ETS table a consequence of having the Prometheus plugin enabled?
Can we improve the situation by disabling the plugin or by modifying "collect_statistics_interval"?

Best Regards,

Thomas

jo...@cloudamqp.com

Aug 10, 2022, 4:49:00 PM
to rabbitmq-users
Yes, the ETS improvements are in all 3.10.x (merged before 3.10.0 was released so you'll get it with 3.10.5).

I'm not sure whether the use of Prometheus should matter, but tuning "collect_statistics_interval" is a good idea, as is "stats_event_max_backlog".
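For a starting point, a setting like collect_statistics_interval goes in the classic Erlang-terms config (rabbitmq.config / advanced.config). A sketch with an illustrative value only; the correct section for stats_event_max_backlog should be verified against the management plugin docs for your version:

```erlang
%% Illustrative sketch, classic rabbitmq.config (Erlang terms) format.
%% collect_statistics_interval is in milliseconds (default 5000);
%% raising it reduces how often stats events are emitted.
[
  {rabbit, [
    {collect_statistics_interval, 30000}
  ]}
].
```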

/Johan

Luke Bakken

Aug 17, 2022, 12:16:58 PM
to rabbitmq-users
Hello,

> I understand that investing time in troubleshooting old releases is not tempting, but for me as an operations technician, initiating an upgrade that may or may not help the problem is equally not tempting.
> Building a strong case to convince our users/managers to take the needed downtime is half the battle

RabbitMQ 3.8.X is completely out of support.

Thanks,
Luke 