Performance degradation after moving up to RHEL8 from CentOS7/RHEL7


neilb...@gmail.com

Oct 3, 2022, 1:37:11 PM
to rabbitmq-users
Hi Folks,

Hopefully this isn't too vague a question, but has anybody experienced RabbitMQ performance degradation when moving from either CentOS7 or RHEL7 up to RHEL8?

Our application uses RabbitMQ as its bus, and in testing with RHEL8.5 and RHEL8.6 we have seen, on identical hardware in a busy system, a general slowdown in publish times to the various queues that we have defined (all durable). Things seemed to improve going from RHEL8.5 to RHEL8.6, but overall it's still noticeably slower than CentOS7.9 or RHEL7.9. The problem has been observed both on bare metal and on AWS EC2 instances. We're typically testing on 4-core (8 vCPU), 32 GB RAM servers/instances.

In simpler, focused tests with a single-threaded publisher we've actually seen better publish-rate performance on RHEL8 than on CentOS7.9/RHEL7.9, but when the system as a whole gets busy (with RabbitMQ itself handling more publishing/consuming across the multiple queues), the slowdown becomes more prominent on the RHEL8.6 system.
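
(For reference, a roughly comparable single-publisher test can also be driven with RabbitMQ's PerfTest tool - the jar name and parameters below are illustrative only:
java -jar perf-test.jar -x 1 -y 1 -u test-queue -f persistent -s 1000 -z 60
i.e. one producer, one consumer, persistent 1 kB messages, running for 60 seconds.)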

In terms of RabbitMQ/Erlang versions, we're also moving up from RabbitMQ 3.9.13 and Erlang 23 to RabbitMQ 3.10.7 and Erlang 25. We have, however, run our system tests using the same older versions of RabbitMQ and Erlang and still see the issue, so it appears to be related to the OS upgrade.

I realise this is all a bit hazy at this stage, but I initially wanted to test the water and see whether anyone had experienced anything similar. We may be in the realms of OS tuning here, but I've struggled to find any online resources with suggestions that have made any difference in our tests. Any suggestions on what sort of things we should be monitoring to help pin this down would also be much appreciated.

Thanks in advance for any replies.

Neil

Michal Kuratczyk

Oct 4, 2022, 3:02:45 AM
to rabbitm...@googlegroups.com
Hi,

Some things you may want to check:
* TCP buffers, especially if you have many connections and/or "large" messages (dozens of kilobytes or more) - see the sketch below
* `rabbitmq-diagnostics memory_breakdown` (a significant increase in the binary memory reported here could indicate a problem with overly large TCP buffers)
* `rabbitmq-diagnostics observer` - even just eyeballing what's at the top (sorted by reductions, mailbox, memory) can suggest what is different
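
For reference, a minimal sketch of capping the per-connection TCP buffers in rabbitmq.conf - the 32 KB values here are purely illustrative, not a recommendation:
tcp_listen_options.sndbuf = 32768
tcp_listen_options.recbuf = 32768
tcp_listen_options.buffer = 32768
Smaller buffers reduce per-connection memory (and the binary total in the breakdown above) at the cost of some throughput, so measure before and after.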

Best,

--
Michał
RabbitMQ team

Neil Billett

Oct 13, 2022, 9:27:34 AM
to rabbitm...@googlegroups.com
Thank you, Michal, for the response and suggestions, and sorry for the slow reply.

We did manage to make some progress on this by using TuneD to change the performance profile of the box.

After some experimentation we found that switching to the latency-performance profile (from the default of throughput-performance on-prem, or virtual-guest on EC2) completely changed the performance of our system for the better: it ironed out the choppy publish rate we were seeing and dramatically improved publishing performance.
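
For anyone else trying this, the switch itself is a one-liner (assuming the tuned service is installed and running):
tuned-adm active                       # show the currently applied profile
tuned-adm profile latency-performance  # switch; takes effect immediately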

Digging into the latency-performance profile itself, we found the changed scheduler settings to be the main instigator of the improvement:
[scheduler]
kernel.sched_migration_cost_ns = 5000000     #increased from 500000
kernel.sched_min_granularity_ns = 3000000    #decreased from 10000000
kernel.sched_wakeup_granularity_ns = 4000000 #decreased from 15000000
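On RHEL8's 4.18 kernel these are plain sysctls, so they can be inspected (or trialled by hand before committing to a profile) along these lines:
sysctl kernel.sched_migration_cost_ns
sysctl -w kernel.sched_wakeup_granularity_ns=4000000
Bear in mind that TuneD will reassert its own values the next time a profile is applied.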
We noticed less improvement from the other settings that are in the latency-performance profile:
[cpu]
force_latency=cstate.id_no_zero:1|3 #added

[sysctl]
vm.dirty_background_ratio = 3       #decreased from 10
vm.dirty_ratio = 10                 #decreased from 40
...but we will be sticking with the latency-performance profile as a whole for our systems.
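
If you only wanted to cherry-pick the scheduler changes, a custom TuneD child profile should also work - a sketch, with "rabbitmq-latency" as a purely illustrative name:
# /etc/tuned/rabbitmq-latency/tuned.conf
[main]
include=throughput-performance

[sysctl]
kernel.sched_migration_cost_ns = 5000000
kernel.sched_min_granularity_ns = 3000000
kernel.sched_wakeup_granularity_ns = 4000000
...activated with `tuned-adm profile rabbitmq-latency`, though we've opted for the stock latency-performance profile instead.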

Hope this is useful for others who may face the same issue.

thanks,

Neil

