Hello,
As the load on our RabbitMQ queues has grown, we have been seeing slow responses from RabbitMQ and are trying to figure out where the bottlenecks are.
The microservices using the RabbitMQ cluster are all Java Spring Boot based and use the Spring AMQP libraries. We have publisher confirms and returns enabled wherever message loss can't be tolerated. One of the services also uses transacted channels to publish messages. Although the load is comparable at different times of the day, the RabbitMQ cluster suddenly becomes slow and unresponsive, and it recovers only after the overall load drops.
When the publishing slowness started, only a couple of services reported the following error message:
"org.springframework.amqp.AmqpResourceNotAvailableException: The channelMax limit is reached. Try later."
Those services could not publish any more messages until the other services had finished publishing. Right after publishing stopped from the previous services in the message processing flow, the errors stopped as well and RabbitMQ started receiving messages again. We tried limiting the channel cache size as suggested in https://github.com/spring-projects/spring-amqp/issues/999 and moving to the latest version of Spring Boot (2.3.0), but that did not seem to help: it resulted in "no available channels" errors under high load.
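To make the trade-off concrete, here is a minimal plain-Java model of a hard channel limit with a checkout timeout. It is only a sketch and not Spring AMQP code: `BoundedChannelPool`, `checkout`, and `release` are hypothetical names, and the limit of 2 and the 50 ms timeout are arbitrary. It assumes that, once a hard limit is enforced, a publisher that cannot get a channel within the timeout fails instead of opening a new one, which would be consistent with the "no available channels" errors we saw.

```java
import java.util.concurrent.Semaphore;
import java.util.concurrent.TimeUnit;

// Hypothetical model of a channel pool with a hard limit; not Spring AMQP's implementation.
class BoundedChannelPool {
    private final Semaphore permits;           // one permit per allowed channel
    private final long checkoutTimeoutMillis;  // how long a publisher waits for a channel

    BoundedChannelPool(int maxChannels, long checkoutTimeoutMillis) {
        this.permits = new Semaphore(maxChannels, true);
        this.checkoutTimeoutMillis = checkoutTimeoutMillis;
    }

    // Returns true if a channel was obtained within the timeout, false otherwise.
    boolean checkout() {
        try {
            return permits.tryAcquire(checkoutTimeoutMillis, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the interrupt for the caller
            return false;
        }
    }

    // Hands a channel back so that a waiting publisher can use it.
    void release() {
        permits.release();
    }

    public static void main(String[] args) {
        BoundedChannelPool pool = new BoundedChannelPool(2, 50);
        System.out.println(pool.checkout()); // true
        System.out.println(pool.checkout()); // true
        System.out.println(pool.checkout()); // false: limit reached, wait timed out
        pool.release();
        System.out.println(pool.checkout()); // true: a channel was returned
    }
}
```

In this model, a limit that is too low turns a channel explosion into checkout timeouts, while no limit lets the channel count grow until the broker's own limit is hit. Either way, the underlying question is why channels are held for so long under load.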
I am adding some details below. Please let me know if you need any other information, such as RabbitMQ logs or configuration that is missing below. Any pointers on how to proceed would be greatly appreciated.
The actuator metrics endpoint shows an explosion of channels for the service having trouble, which does not match the channels reported for that service in the RabbitMQ management console: [screenshot]
Channels reported via the management console: [screenshot]
However, the CPU and memory usage seem to be under control: [screenshot]
No considerable message backlogs observed: [screenshot]
Some spikes in the connections and channels: [screenshot]
Overall message rate: [screenshot]
RabbitMQ cluster config:
We run a 3-node cluster on EC2 instances (dockerized RabbitMQ version 3.7.10)
EC2 type: r5.xlarge
EBS volume size: 1024 GB
channels_max = 2047
"vhosts": [ { "name": "/" } ],
"policies": [
  {
    "vhost": "/",
    "name": "ha-policy",
    "pattern": "^ha\\.",
    "apply-to": "queues",
    "definition": { "ha-mode": "exactly", "ha-params": 2 },
    "priority": 0
  }
]
"queues": [ We have HA, durable queues, some of which are lazy and some of which use priorities (0-2) ]
queue_master_locator = min-masters
cluster_partition_handling = pause_minority
cluster_formation.aws.use_autoscaling_group = true
disk_free_limit.relative = 1.0
collect_statistics_interval = 30000
vm_memory_high_watermark.relative = 0.6
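As a sanity check on the ha-policy pattern above, here is a small Java sketch of which queue names would be picked up by it. RabbitMQ evaluates policy patterns with its own regex engine, so this assumes that Java's `java.util.regex` behaves the same way for this simple pattern. The queue names are hypothetical, not our real ones.

```java
import java.util.regex.Pattern;

// Illustration only: the queue names below are made up for the example.
class HaPolicyPatternCheck {
    // The JSON string "^ha\\." decodes to the regex ^ha\. (literal "ha." at the start).
    static final Pattern HA_PATTERN = Pattern.compile("^ha\\.");

    // find() is used because the pattern is searched for in the name rather than
    // required to match the whole name (assumed to mirror how policies are applied).
    static boolean matchesHaPolicy(String queueName) {
        return HA_PATTERN.matcher(queueName).find();
    }

    public static void main(String[] args) {
        String[] names = { "ha.orders", "orders", "hazard", "orders.ha.retry" };
        for (String name : names) {
            System.out.println(name + " -> " + matchesHaPolicy(name));
        }
        // ha.orders -> true
        // orders -> false
        // hazard -> false
        // orders.ha.retry -> false
    }
}
```

Only queues whose names start with "ha." get mirrored to 2 nodes under this policy. Any other queue lives on a single node.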
Thanks & regards,
Raghu