Hi everyone,
I have a RabbitMQ cluster running version 3.12.13 with Erlang 25.0.4. Each node is equipped with 252GB of memory and 814GB of local disk. Things were running smoothly for a few weeks, but recently we've started seeing a lot of "busy_dist_port" warning messages in the logs, followed by the nodes hitting the vm_memory_high_watermark.
Our monitoring system indicates spikes in internode communication during these times. I'm wondering if there's any specific area I should be tuning to avoid this high memory usage. Below, I've included a snippet of the logs and important configuration details for reference.
Any insights or suggestions would be greatly appreciated!
Logs:
2024-06-28 07:17:34.336865-05:00 [warning] <0.203.0> rabbit_sysmon_handler busy_dist_port <0.14999.114> [{initial_call,{rabbit_mqtt_reader,init,1}},{erlang,bif_return_trap,2},{message_queue_len,0}] {#Port<0.29>,unknown}
2024-06-28 07:17:35.692013-05:00 [warning] <0.203.0> rabbit_sysmon_handler busy_dist_port <0.420.115> [{initial_call,{rabbit_mqtt_reader,init,1}},{erts_internal,dsend_continue_trap,1},{message_queue_len,1}] {#Port<0.29>,unknown}
2024-06-28 07:17:36.448316-05:00 [warning] <0.203.0> rabbit_sysmon_handler busy_dist_port <0.420.115> [{initial_call,{rabbit_mqtt_reader,init,1}},{erts_internal,dsend_continue_trap,1},{message_queue_len,1}] {#Port<0.29>,unknown}
2024-06-28 07:17:37.341571-05:00 [warning] <0.203.0> rabbit_sysmon_handler busy_dist_port <0.420.115> [{initial_call,{rabbit_mqtt_reader,init,1}},{erts_internal,dsend_continue_trap,1},{message_queue_len,1}] {#Port<0.29>,unknown}
2024-06-28 07:17:38.333252-05:00 [warning] <0.203.0> rabbit_sysmon_handler busy_dist_port <0.17932.114> [{initial_call,{rabbit_mqtt_reader,init,1}},{erlang,bif_return_trap,2},{message_queue_len,1}] {#Port<0.29>,unknown}
2024-06-28 07:17:39.099611-05:00 [warning] <0.470.0> memory resource limit alarm set on node 'rabbit@<hostname>'.
2024-06-28 07:17:39.099611-05:00 [warning] <0.470.0>
2024-06-28 07:17:39.099611-05:00 [warning] <0.470.0> **********************************************************
2024-06-28 07:17:39.099611-05:00 [warning] <0.470.0> *** Publishers will be blocked until this alarm clears ***
2024-06-28 07:17:39.099611-05:00 [warning] <0.470.0> **********************************************************
2024-06-28 07:17:39.099611-05:00 [warning] <0.470.0>
Some of the notable settings are as follows.
Important configurations in RABBITMQ_CONF_ENV_FILE
------------------------------------------------------------------------
# file descriptor
ulimit -n 50000
Important configurations in RABBITMQ_CONFIG_FILE
------------------------------------------------------------------
## Additional network and protocol related configuration
heartbeat = 600
frame_max = 131072
initial_frame_max = 4096
channel_max = 128
## Customising TCP Listener (Socket) Configuration.
tcp_listen_options.backlog = 128
tcp_listen_options.nodelay = false
tcp_listen_options.exit_on_close = false
tcp_listen_options.buffer = 3872198
tcp_listen_options.sndbuf = 3872198
tcp_listen_options.recbuf = 3872198
vm_memory_high_watermark.relative = 0.8
vm_memory_high_watermark_paging_ratio = 0.75
memory_monitor_interval = 2500
disk_free_limit.absolute = 50MB
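If a per-node memory breakdown from the time the alarm is active would help, I can grab one with something like the commands below (assuming the standard CLI tools are available on the node; the --unit flag is optional):
rabbitmq-diagnostics memory_breakdown --unit "MB"
rabbitmq-diagnostics status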
Hi Johan,
Appreciate your response. I will increase the file descriptor limit and the distribution buffer size.
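For reference, this is roughly what I'm planning to change in RABBITMQ_CONF_ENV_FILE. The values below are a first guess rather than something we've validated; if I read the docs correctly, RABBITMQ_DISTRIBUTION_BUFFER_SIZE is expressed in kilobytes and defaults to 128000 (128 MB):
# raise the file descriptor limit
ulimit -n 100000
# raise the inter-node (distribution) buffer size, in kB
RABBITMQ_DISTRIBUTION_BUFFER_SIZE=192000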
It looks like we were publishing large messages (larger than 1 GiB) around the time we experienced the issue.
We also noticed what looks like a memory leak during that time. Attached is the Erlang memory allocator graph for that time frame (when the node reached the vm_memory_high_watermark); you can see that "eheap_alloc" reached ~350 GB around 7:14 and never really released all of that memory. I am also attaching a snippet of the Erlang crash dump to the ticket. We are using RabbitMQ v3.12.13 and Erlang v25.0.4.
On a side note, is there any way to prevent or throttle publishers from sending large messages?
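For example, would setting max_message_size in rabbitmq.conf be the right way to enforce a cap on the broker side? Something like the line below, where the value is in bytes and purely illustrative:
## reject messages larger than ~16 MiB
max_message_size = 16777216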
Thanks