Finding the bottlenecks under high load - potential channel leaks resulting in failures publishing messages to RabbitMQ


Raghurama Adyanadka

Jul 17, 2020, 8:09:37 AM
to rabbitmq-users
Hello,
As the load on our RabbitMQ queues has increased, we have been seeing slow responses from RabbitMQ and are trying to figure out where the bottlenecks are.
The microservices using the RabbitMQ cluster are all Java Spring Boot based and use the Spring AMQP libraries. We have publisher confirms and returns enabled wherever message loss cannot be tolerated (a sketch of this setup is below), and one of the services also uses transacted channels to publish messages. However, even though the load is comparable at different times of the day, the RabbitMQ cluster suddenly becomes slow and unresponsive, and recovers only after the overall load drops.
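
For reference, here is a minimal sketch of how the confirms and returns are wired up with Spring AMQP; the host name is a placeholder and the exact bean layout is illustrative, not our production configuration:

import org.springframework.amqp.rabbit.connection.CachingConnectionFactory;
import org.springframework.amqp.rabbit.core.RabbitTemplate;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;

@Configuration
public class RabbitConfig {

    @Bean
    public CachingConnectionFactory connectionFactory() {
        CachingConnectionFactory cf = new CachingConnectionFactory("rabbitmq-host");
        // Correlated publisher confirms: the broker acks/nacks each publish.
        cf.setPublisherConfirmType(CachingConnectionFactory.ConfirmType.CORRELATED);
        // Publisher returns: unroutable mandatory messages come back to us.
        cf.setPublisherReturns(true);
        return cf;
    }

    @Bean
    public RabbitTemplate rabbitTemplate(CachingConnectionFactory cf) {
        RabbitTemplate template = new RabbitTemplate(cf);
        template.setMandatory(true); // needed for returns to be delivered
        return template;
    }
}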

When the publishing slowness started, only a couple of services reported the following error: "org.springframework.amqp.AmqpResourceNotAvailableException: The channelMax limit is reached. Try later." Those services could not publish any more messages until the other services upstream in the message processing flow finished publishing; as soon as that publishing stopped, even these errors stopped and RabbitMQ started receiving messages again. We tried limiting the channel cache size as suggested in https://github.com/spring-projects/spring-amqp/issues/999 (sketched below) on the latest Spring Boot (2.3.0), but that did not seem to help: it just produced "no available channels" errors under high load.
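
For completeness, this is roughly what we tried from that issue: when channelCheckoutTimeout is greater than zero, channelCacheSize becomes a hard cap on channels per connection rather than just the number of idle channels kept, and publishers block up to the timeout waiting for a channel. A sketch (the specific numbers are examples only):

CachingConnectionFactory cf = new CachingConnectionFactory("rabbitmq-host");
// With a checkout timeout set, channelCacheSize is an upper limit on
// channels per connection, not just the idle-channel cache size.
cf.setChannelCacheSize(50);
// Publishers wait up to 5s for a free channel; when none frees up in time,
// Spring AMQP fails with the "no available channels" error we saw under load.
cf.setChannelCheckoutTimeout(5000);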

I am adding some details below; please let me know if you need any other information, such as RabbitMQ logs or configuration not included here. Any pointers on how to proceed would be greatly appreciated.


The actuator metrics endpoint shows a channel explosion for the service having trouble, which does not match the channel count reported for the same service via the RabbitMQ management console:


Channels reported via the management console


However, the CPU and memory usage seem to be under control:


No considerable message backlog observed:

Some spikes in connections and channels:


Overall message rate:


RabbitMQ cluster config:
We run a 3-node cluster on EC2 instances (dockerized RabbitMQ version 3.7.10)
EC2 type: r5.xlarge
EBS volume size: 1024 GB

channels_max = 2047

"vhosts": [
{
      "name": "/"
    }
  ],
  "policies": [
    {
      "vhost": "/",
      "name": "ha-policy",
      "pattern": "^ha\\.",
      "apply-to": "queues",
      "definition": {
        "ha-mode": "exactly",
        "ha-params": 2
      },
      "priority": 0
    }
  ]

 "queues": [ 
We have HA queues and durable some of which are lazy, some make use of priorities(0-2)
 ]
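
As an illustration, a durable lazy queue with priorities that matches the ha-policy pattern above could be declared like this in Spring AMQP (the queue name "ha.example" is hypothetical):

import org.springframework.amqp.core.Queue;
import org.springframework.amqp.core.QueueBuilder;

// Durable queue whose name matches the "^ha\." policy pattern above,
// stored in lazy mode and supporting priorities 0-2.
Queue exampleQueue = QueueBuilder.durable("ha.example")
        .withArgument("x-queue-mode", "lazy")
        .withArgument("x-max-priority", 2)
        .build();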

queue_master_locator = min-masters

cluster_partition_handling = pause_minority

cluster_formation.aws.use_autoscaling_group = true

disk_free_limit.relative = 1.0

collect_statistics_interval = 30000

vm_memory_high_watermark.relative = 0.6
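
To cross-check the actuator numbers against what the broker itself sees, the per-connection channel counts can be listed on a node with rabbitmqctl, e.g.:

rabbitmqctl list_connections name channels
rabbitmqctl list_channels connection number user

The first command shows how many channels each connection holds; the second lists each channel with its owning connection.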



Thanks & regards,
Raghu

Raghurama Adyanadka

Jul 19, 2020, 6:43:07 AM
to rabbitmq-users
Well, it turns out our CPU monitoring was not showing the actual instance CPU utilization; CloudWatch showed the real spikes in server CPU usage.

Clearly, the instances need a CPU upgrade, which should make RabbitMQ more responsive and let channels become free sooner for more publishing.
We will certainly try this out, but in general, are there any recommendations for AWS instance types for RabbitMQ servers? Please let me know.

Also, if you have any other explanation for client connections hitting channels_max = 2047 under high load, please let me know.

Thanks,
Raghu