Configuration of cluster:
All the nodes are physical machines and have been running RabbitMQ version 3.7.9 and Erlang 21.1.1 since the day this cluster was created.
Memory is 260 GB, CPU 48 Cores.
Issue:
Nodes of this cluster are crashing randomly and frequently. We have had around 4 crashes in the last 3 weeks.
In the crash dumps, we are seeing errors like the following:
Slogan: binary_alloc: Cannot allocate 25165855 bytes of memory (of type "binary").
Slogan: Absurdly large distribution output data buffer (2696898942 bytes) passed
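To put the sizes in these two slogans in perspective, here is a small Python sketch that pulls the byte counts out of the slogan lines and converts them to MiB. The helper names (slogan_bytes, to_mib) are only illustrative and are not part of any RabbitMQ or Erlang tooling.

```python
import re

# Matches a run of digits followed by the word "bytes", as in the slogans above.
BYTES_RE = re.compile(r"(\d+)\s+bytes")


def slogan_bytes(line):
    """Return the byte count mentioned in a 'Slogan:' line, or None."""
    if not line.startswith("Slogan:"):
        return None
    match = BYTES_RE.search(line)
    return int(match.group(1)) if match else None


def to_mib(n_bytes):
    """Convert a byte count to mebibytes (1 MiB = 1024 * 1024 bytes)."""
    return n_bytes / (1024 ** 2)


# The two slogan lines exactly as they appear in our crash dumps.
slogans = [
    'Slogan: binary_alloc: Cannot allocate 25165855 bytes of memory (of type "binary").',
    "Slogan: Absurdly large distribution output data buffer (2696898942 bytes) passed",
]

for s in slogans:
    n = slogan_bytes(s)
    print(f"{n} bytes = {to_mib(n):.1f} MiB")
```

So the failed binary allocation is only about 24 MiB, while the distribution output buffer in the second slogan is roughly 2.5 GiB.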
The crash dump is huge and I am not sure how I can share it with you. A screenshot of the memory tab from one of the crash dumps clearly shows binary_alloc using more than 20 GB of memory. I have seen this go above 30 GB in other crashes as well. I can provide additional screenshots if needed.
We have plenty of system resources (see the configuration above) on all the nodes of this cluster. We have monitoring in place and we don't get any memory or disk alerts when this happens. Memory utilization remains normal and nothing on the system side appears to be causing this. The high watermark is set at 40%.
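As a rough sanity check on these numbers, the sketch below works out what a 40% watermark means on these machines. It assumes the watermark is the relative setting and that RabbitMQ detects the full 260 GB of RAM; neither is verified here, and the function name is just for illustration.

```python
# Assumptions (not verified): the watermark is the relative setting and
# RabbitMQ detects the full 260 GB of RAM on each node.
TOTAL_RAM_GB = 260
HIGH_WATERMARK = 0.40


def watermark_limit_gb(total_gb, fraction):
    """Memory level (in GB) at which the high-watermark alarm would trigger."""
    return total_gb * fraction


limit_gb = watermark_limit_gb(TOTAL_RAM_GB, HIGH_WATERMARK)

# Compare the binary_alloc sizes seen in the crash dumps against that limit.
for observed_gb in (20, 30):
    share = observed_gb / limit_gb
    print(f"binary_alloc at {observed_gb} GB is {share:.0%} of the {limit_gb:.0f} GB limit")
```

Under those assumptions the alarm threshold is about 104 GB, so even the 30 GB binary_alloc usage seen in the dumps is well below it, which matches the fact that we get no memory alerts.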
Our investigation has not shown that the nodes crash only when this cluster is receiving heavy traffic.
We have another 3-node cluster (physical boxes) with the exact same configuration and have never faced this issue there. We also have other clusters on VMs and have not faced this issue there either.
Additional details:
- This cluster has been running fine for more than a year without any issues, and we have only started seeing the above issues since last month.
- All Queues are durable and messages are persistent.
- It gets traffic mostly from US datacenters, plus a small amount from non-US datacenters.
- All publishers publish to an additional layer of servers called "Shovels". These servers have the shovel plugin installed, and we have configured static shovels to move the published messages to the clusters.
- This cluster has a couple of vhosts and a limited number of queues, around 5-10 in each vhost. Of these, about 3-4 queues across all vhosts receive high traffic.
- These queues receive around 5-6k msgs/sec at peak times, and they grow large at those times, to around 5-10 million messages. On average the rate is 2-3k msgs/sec.
Any help in this regard would be much appreciated. This cluster is one of our most critical clusters and receives very important messages which we can't lose. The consistency and durability of these messages are very important.