busy_dist_port warning(s) and high mem usage in v3.12.13


Roy

Jun 28, 2024, 8:36:16 PM
to rabbitmq-users

Hi everyone,

I have a RabbitMQ cluster running version 3.12.13 with Erlang 25.0.4. Each node is equipped with 252GB of memory and 814GB of local disk. Things were running smoothly for a few weeks, but recently we've started seeing a lot of "busy_dist_port" warning messages in the logs, followed by the nodes hitting the vm_memory_high_watermark.

Our monitoring system indicates spikes in internode communication during these times. I'm wondering if there's any specific area I should be tuning to avoid this high memory usage. Below, I've included a snippet of the logs and important configuration details for reference.

Any insights or suggestions would be greatly appreciated!

 

Logs:

2024-06-28 07:17:34.336865-05:00 [warning] <0.203.0> rabbit_sysmon_handler busy_dist_port <0.14999.114> [{initial_call,{rabbit_mqtt_reader,init,1}},{erlang,bif_return_trap,2},{message_queue_len,0}] {#Port<0.29>,unknown}

2024-06-28 07:17:35.692013-05:00 [warning] <0.203.0> rabbit_sysmon_handler busy_dist_port <0.420.115> [{initial_call,{rabbit_mqtt_reader,init,1}},{erts_internal,dsend_continue_trap,1},{message_queue_len,1}] {#Port<0.29>,unknown}

2024-06-28 07:17:36.448316-05:00 [warning] <0.203.0> rabbit_sysmon_handler busy_dist_port <0.420.115> [{initial_call,{rabbit_mqtt_reader,init,1}},{erts_internal,dsend_continue_trap,1},{message_queue_len,1}] {#Port<0.29>,unknown}

2024-06-28 07:17:37.341571-05:00 [warning] <0.203.0> rabbit_sysmon_handler busy_dist_port <0.420.115> [{initial_call,{rabbit_mqtt_reader,init,1}},{erts_internal,dsend_continue_trap,1},{message_queue_len,1}] {#Port<0.29>,unknown}

2024-06-28 07:17:38.333252-05:00 [warning] <0.203.0> rabbit_sysmon_handler busy_dist_port <0.17932.114> [{initial_call,{rabbit_mqtt_reader,init,1}},{erlang,bif_return_trap,2},{message_queue_len,1}] {#Port<0.29>,unknown}

2024-06-28 07:17:39.099611-05:00 [warning] <0.470.0> memory resource limit alarm set on node 'rabbit@<hostname>'.

2024-06-28 07:17:39.099611-05:00 [warning] <0.470.0>

2024-06-28 07:17:39.099611-05:00 [warning] <0.470.0> **********************************************************

2024-06-28 07:17:39.099611-05:00 [warning] <0.470.0> *** Publishers will be blocked until this alarm clears ***

2024-06-28 07:17:39.099611-05:00 [warning] <0.470.0> **********************************************************

2024-06-28 07:17:39.099611-05:00 [warning] <0.470.0>

 


Some of the notable settings are as follows.

 

Important configurations in RABBITMQ_CONF_ENV_FILE

------------------------------------------------------------------------

# file descriptor
ulimit -n 50000


Important configuration in RABBITMQ_CONFIG_FILE

------------------------------------------------------------------

## Additional network and protocol related configuration

heartbeat = 600

frame_max = 131072

initial_frame_max = 4096

channel_max = 128

 

## Customising TCP Listener (Socket) Configuration.

tcp_listen_options.backlog = 128

tcp_listen_options.nodelay = false

tcp_listen_options.exit_on_close = false

 

tcp_listen_options.buffer = 3872198

tcp_listen_options.sndbuf = 3872198

tcp_listen_options.recbuf = 3872198

 

vm_memory_high_watermark.relative = 0.8

vm_memory_high_watermark_paging_ratio = 0.75

memory_monitor_interval = 2500

 

disk_free_limit.absolute = 50MB


jo...@cloudamqp.com

Jul 2, 2024, 5:35:44 PM
to rabbitmq-users
Hi,
You can safely increase the FD limit to 1M. You can also increase the distribution buffer (+zdbbl): https://www.rabbitmq.com/docs/runtime#distribution-buffer.
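For example, something along these lines in the env file (the values are illustrative, not tuned recommendations):

# raise the file descriptor limit
ulimit -n 1000000
# distribution buffer busy limit in kB; this maps to the Erlang +zdbbl flag
RABBITMQ_DISTRIBUTION_BUFFER_SIZE=192000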

Are you sending a lot of big messages? (This is one of the most common ways to run into this warning message)
How many queues do you have and of what type?

Note: RabbitMQ 3.12 is out of community support.

/Johan

Roy

Jul 3, 2024, 11:17:20 AM
to rabbitmq-users

Hi Johan,

Appreciate your response. I will increase the file descriptor limit and the distribution buffer size.

It looks like we were sending big messages (larger than 1 GiB) during the time when we experienced the issue.

We also noticed what looks like a memory leak during that time. Attached is the Erlang memory allocator graph for that time frame (when the node reached the vm_memory_high_watermark); you can see that “eheap_alloc” reached ~350 GB around 7:14 and never really released all of that memory. I am also attaching a snippet of the erl crash dump. We are using v3.12.13 and Erlang v25.0.4.

On a side note, is there any way to prevent or throttle publishers from sending large messages?

Thanks
Roy
erlang_mem_allocator_graph.png
erl_crash_dump_snippet.txt

jo...@cloudamqp.com

Jul 3, 2024, 12:32:50 PM
to rabbitmq-users
It should be impossible (or at least very hard) to send 1GiB messages. The max message size is ~512 MiB (536870912 bytes) [0].
eheap_alloc unfortunately doesn't tell much about what was stuck/leaking. If you encounter the situation again you can collect some Erlang-level statistics such as largest mailboxes and biggest memory users:
rabbitmqctl eval 'rabbit_diagnostics:top_memory_use().' 
rabbitmqctl eval 'rabbit_diagnostics:top_binary_refs().'
and
rabbitmqctl eval 'recon:proc_count(message_queue_len, 3).'
The observer can also be used ("rabbitmq-diagnostics observer") to see top memory users, their stack traces, etc.

It is possible to limit message size on the server side (using max_message_size in [0]). Which client are you using? It should be possible to do a check before sending the message.
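For example, in rabbitmq.conf (the value is illustrative):

## cap individual message size at 16 MiB
max_message_size = 16777216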

Again, it would be interesting to know more about your topology (number of queues, types of queues, types of connections; e.g., MQTT is mentioned in the log snippet).


/Johan

Roy

Jul 8, 2024, 1:14:28 PM
to rabbitmq-users

Hi Johan,

The actual message size was less than 1 MB (apologies for the incorrect size, I was referring to the cumulative message size).

We had been using older versions of RabbitMQ (v3.8.3 and v3.11.13) for a long time without any issues, with "classic" queues on those versions.

However, we recently upgraded to RabbitMQ version 3.12.13 (with Erlang v25.0.4) and started using "rabbit_mqtt_qos0_queue" queues. We currently have around 1,000 queues, and our clients use MQTT connections.

Two issues I have noticed with v3.12.13 are:

  1. While I no longer see queuing, we do experience periodic memory spikes.
  2. Messages sent from a specific client are not received in order on the subscriber side.

I was able to reproduce the memory spike issue by publishing 1 MB messages to a single topic from 200 publishers every 0.1 seconds to one node while subscribing to the same topic from another node. Everything seems to be fine as long as the messages sent over the topic are being subscribed to. However, when I shut down the subscribers and continue to publish messages at the same rate, the memory spike occurs.

I am attaching the publisher/subscriber Python scripts that I used to reproduce the issue, along with the commands to run them:

python publisher.py -c <cluster_name> -n <cluster_node_1> -t test_topic -p 200
python subscriber.py -c <cluster_name> -n <cluster_node_2> -t test_topic

Note: You need "
paho.mqtt.client" package, and <cluster_name>_USERNAME, <cluster_name>_PASSWORD, <cluster_name>_VHOST environment variables set, inorder to run the publisher/subscriber scripts.

Thanks

Roy

publisher.py
subscriber.py

Luke Bakken

Jul 8, 2024, 7:23:44 PM
to rabbitmq-users
Hi Roy -

You should read the following blog post, which explains how the rabbit_mqtt_qos0_queue works - https://www.rabbitmq.com/blog/2023/03/21/native-mqtt

When you shut down your subscribers, a memory spike is expected as those messages are kept in memory, especially since you are publishing a very large amount of data.


Thanks,
Luke

Luke Bakken

Jul 9, 2024, 9:08:51 AM
to rabbitmq-users
Roy -

I had my colleague review my response and I wasn't correct. If you are shutting down all of your subscribers for a particular topic, RabbitMQ will delete the associated rabbit_mqtt_qos0_queue.

When you say "memory spike" in this situation, you haven't quantified it. How much are we talking about? 

You haven't provided details about this statement, or a means to reproduce it: "Messages sent from a specific client are not received in order on the subscriber side."

Roy

Jul 9, 2024, 11:10:02 AM
to rabbitmq-users

Hi Luke,

Thanks for your response and I appreciate any help in rectifying this issue. Each publisher in my test sends a message payload with an “index” embedded into the message, and the subscriber checks for the order of this “index” while consuming this message. We are noticing out-of-order indexes being received on the subscriber side when we spin up 200 publishers sending messages of size ~1MB every 0.1 seconds. The number of publishers, message size, and frequency of publishing were set to those numbers to replicate our production usage. Does RabbitMQ guarantee the order of messages?
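For concreteness, the ordering check on the subscriber side is conceptually along these lines (a simplified illustration with an assumed JSON payload layout; the attached subscriber.py is the authoritative version):

import json

last_index = {}

def on_message(client, userdata, msg):
    body = json.loads(msg.payload)     # assumed payload shape: {"publisher_id": ..., "index": ...}
    pub, idx = body["publisher_id"], body["index"]
    if pub in last_index and idx != last_index[pub] + 1:
        print(f"out of order from {pub}: got {idx} after {last_index[pub]}")
    last_index[pub] = idx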

As for the memory spike, we are noticing memory spikes from time to time in production, especially when a large number of clients send messages to a single topic which is then consumed by a single subscriber. One way we could reproduce the same spike in our test environment was to stop the subscriber for a brief period while the publishers continued to publish messages. The memory spike eventually reaches the “vm_memory_high_watermark” and all the publishers get blocked. vm_memory_high_watermark.relative is currently set to 0.8, and roughly 80% (~250 GB out of ~256 GB) of memory was in use before the publishers got blocked.

We tried to collect additional data about the memory usage, and it seems that Erlang’s eheap_alloc (the Erlang process heap allocator) is the one consuming all the memory when the spike occurs, and the memory does not appear to be released afterward.

I am attaching screenshots of RAM and Network usage from two different nodes in production (files: node1_ram_network_usage.png and node2_ram_network_usage.png). The subscriber was initially connected to node1, and then it reconnected to node2 when node1 reached the “vm_memory_high_watermark”, eventually causing node2 to also reach the “vm_memory_high_watermark”.

I am also attaching screenshots of Erlang-memory-allocator charts from node1 and node2 (files: node1_erlang_memory_allocator.png and node2_erlang_memory_allocator.png), which show the breakdown of Erlang's memory usage when we experienced the memory spike.

Please let me know if you need any additional information.

Thanks,
Roy

node2_ram_network_usage.png
node1_ram_network_usage.png
node1_erlang-memory-allocator.png
node2_erlang-memory-allocator.png

Luke Bakken

Jul 9, 2024, 11:13:37 AM
to rabbitmq-users

Thanks for your response and I appreciate any help in rectifying this issue. Each publisher in my test sends a message payload with an “index” embedded into the message, and the subscriber checks for the order of this “index” while consuming this message. We are noticing out-of-order indexes being received on the subscriber side when we spin up 200 publishers sending messages of size ~1MB every 0.1 seconds. The number of publishers, message size, and frequency of publishing were set to those numbers to replicate our production usage. Does RabbitMQ guarantee the order of messages?


How could RabbitMQ ensure message order when 200 publishers are all working in parallel? There is no guarantee of message order going to RabbitMQ.

Thanks for the memory information, I will take a look as time allows. 

Luke Bakken

Jul 9, 2024, 11:15:56 AM
to rabbitmq-users
Could you please attach your complete RabbitMQ configuration files? Thanks.

Luke Bakken

Jul 9, 2024, 11:18:48 AM
to rabbitmq-users

Thanks for your response and I appreciate any help in rectifying this issue. Each publisher in my test sends a message payload with an “index” embedded into the message, and the subscriber checks for the order of this “index” while consuming this message. We are noticing out-of-order indexes being received on the subscriber side when we spin up 200 publishers sending messages of size ~1MB every 0.1 seconds. The number of publishers, message size, and frequency of publishing were set to those numbers to replicate our production usage. Does RabbitMQ guarantee the order of messages?


How could RabbitMQ ensure message order when 200 publishers are all working in parallel? There is no guarantee of message order going to RabbitMQ.

To clarify a little further - order is guaranteed when there is a single publisher and subscriber. Anything else and all bets are off 🙂 This isn't unique to RabbitMQ.

Luke Bakken

Jul 9, 2024, 11:21:14 AM
to rabbitmq-users
Sorry to keep asking questions in multiple messages, but could you be precise in what you mean by "stop the subscriber for a brief period while the publishers continued to publish messages"?

Do you fully stop the subscriber process, which should delete the associated rabbit_mqtt_qos0_queue queue?

Roy

Jul 9, 2024, 12:10:34 PM
to rabbitmq-users

Hi Luke,

I have attached the config files as requested.

Regarding the subscriber process, we do fully stop the subscriber and then restart the process, effectively making a new connection using a different port. We haven’t checked if the associated rabbit_mqtt_qos0_queue queue was deleted during the stop/start of the subscriber.

Additionally, I wanted to mention that the mailbox_soft_limit is set to 200, which is the default value. According to the documentation, this setting should provide some overload protection in large fan-in scenarios by avoiding high memory usage. Unfortunately, this doesn’t seem to be the case in our situation.
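If I have the key name right, that corresponds to the following rabbitmq.conf setting (200 being the documented default):

## assumed key name for the native MQTT overload protection setting
mqtt.mailbox_soft_limit = 200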

Thank you for your assistance

Thanks
cluster2_rabbit_server_1-env.conf
cluster2_rabbit_server_1-advanced.conf
cluster2_rabbit_server_1-plugin.conf
cluster2_rabbit_server_1.conf

Roy

Jul 9, 2024, 1:02:16 PM
to rabbitmq-users
Additionally, regarding message order, we have observed a noticeable increase in the number of out-of-order messages delivered in v3.12.13 compared to v3.11.13. Using the previously provided publisher.py/subscriber.py script(s), we tested by sending ~1MB messages from 80 publishers every 0.1 seconds and consuming those messages from a single subscriber to verify the delivery order. In v3.11.13, 99.9% of messages were delivered in order. However, in v3.12.13, after the first few message indexes, we started seeing messages delivered out of order. Is this expected behavior in v3.12.x due to the native MQTT implementation and the use of the Erlang process mailbox for rabbit_mqtt_qos0_queue? This is a concern because out-of-order delivery in our use case can trigger retransmissions from the client side, resulting in even higher publish rates.

Thanks

Luke Bakken

Jul 9, 2024, 4:21:40 PM
to rabbitmq-users
Hi Roy,

You got lucky with the behavior in 3.11.13. You can't depend on strict ordering when more than one publisher is involved, no matter the RabbitMQ version. You will have to come up with a different solution that does not depend on this behavior.

Thanks,
Luke

Luke Bakken

Jul 10, 2024, 1:23:48 PM
to rabbitmq-users
Hi Roy,

Thanks for providing those scripts. I have committed them here - https://github.com/lukebakken/rabbitmq-users-4AOwZrQyekI

I believe I have been able to reproduce the memory spike issue, and am investigating.

Luke Bakken

Jul 11, 2024, 12:42:06 PM
to rabbitmq-users
Hi Roy,

I have opened a fix for the issue you report - https://github.com/rabbitmq/rabbitmq-server/pull/11676

Note that the issue only happens when the subscriber's TCP connection is abruptly closed (by CTRL-C, for instance). Subscribers that exit cleanly should not cause abrupt memory spikes. You might want to review your production code for subscribers to ensure that all termination paths (other than SIGKILL, of course) result in a clean shutdown of the MQTT connection.
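For example, a paho-mqtt client can be shut down cleanly along these lines (an illustrative sketch, not taken from the attached scripts; paho-mqtt 1.x API assumed):

import signal
import sys
import paho.mqtt.client as mqtt

client = mqtt.Client()

def shutdown(signum, frame):
    client.disconnect()    # send an MQTT DISCONNECT instead of dropping the TCP socket
    client.loop_stop()
    sys.exit(0)

signal.signal(signal.SIGINT, shutdown)    # CTRL-C
signal.signal(signal.SIGTERM, shutdown)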

Thanks,
Luke

Roy

Jul 14, 2024, 11:18:30 PM
to rabbitmq-users

Hi Luke,

Thank you for addressing the reported issue. Could you please let me know if this fix will be included in the next v3.12.x release or in the v3.13.x release?

Thanks

Roy

Luke Bakken

Jul 22, 2024, 9:48:00 AM
to rabbitmq-users
It appears that 3.13.5 has this fix.