We've recently enabled inter-node TLS across some of our non-prod clusters, with the goal of going live with it in production next month. Yesterday we received reports that one of the systems with inter-node TLS enabled was down. The management UI wouldn't load (we just got the Rabbit logo and nothing else), and the cluster's AMQPS listeners appeared to no longer be accepting traffic. The engineer who worked on the issue noticed these warnings in one of the cluster nodes' logs during the outage:
2019-05-16 15:04:01.592 [warning] <0.10288.1> rabbit_sysmon_handler busy_dist_port <0.13512.55> [{initial_call,{erlang,apply,2}},{erlang,bif_return_trap,2},{message_queue_len,0}] <0.16022.2>
2019-05-16 15:04:21.101 [warning] <0.10288.1> rabbit_sysmon_handler busy_dist_port <0.16767.1> [{initial_call,{rabbit_prequeue,init,1}},{erlang,bif_return_trap,2},{message_queue_len,0}] <0.16022.2>
2019-05-16 15:04:35.661 [warning] <0.10288.1> rabbit_sysmon_handler busy_dist_port <0.16764.1> [{initial_call,{rabbit_prequeue,init,1}},{erlang,bif_return_trap,2},{message_queue_len,1}] <0.16022.2>
2019-05-16 15:04:48.602 [warning] <0.10288.1> rabbit_sysmon_handler busy_dist_port <0.16794.1> [{initial_call,{rabbit_prequeue,init,1}},{erlang,bif_return_trap,2},{message_queue_len,0}] <0.16022.2>
2019-05-16 15:04:49.606 [warning] <0.10288.1> rabbit_sysmon_handler busy_dist_port <0.16776.1> [{initial_call,{rabbit_prequeue,init,1}},{erlang,bif_return_trap,2},{message_queue_len,0}] <0.16022.2>
Upon restarting that single node, the cluster recovered and everything worked as expected. Doing a quick search on that warning I came across this page:
https://www.rabbitmq.com/runtime.html#distribution-buffer. From reading through it, it sounds like something (large messages?) was causing the inter-node TCP connection buffer to exceed its default limit, which is what generates that warning. Outside of these warnings being thrown consistently during the outage, nothing else stood out in the logs. Could this have caused the problem we were experiencing?
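If the distribution buffer does turn out to be the bottleneck, my understanding from that page is that its size can be raised via an environment variable (value in kilobytes). Something like the following is what I had in mind trying; the 192000 figure is just an illustrative bump over the documented 128000 kB default, not a tuned recommendation:

# rabbitmq-env.conf
# Raise the inter-node (distribution) buffer above the default 128000 kB (128 MB).
# Per the runtime docs this maps to the Erlang VM's +zdbbl flag.
RABBITMQ_DISTRIBUTION_BUFFER_SIZE=192000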
Also worth noting: this is a 3-node cluster, and ~90% of the queues in it have a policy that enforces ha-mode: all. Based on our logs, I'm fairly sure the node we restarted was the master for a significant majority of those queues at the time the cluster was unavailable.
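For context, the mirroring policy on those queues is roughly along these lines (the policy name and pattern here are illustrative; our actual pattern is more selective, which is why it only covers ~90% of the queues):

# applied once via rabbitmqctl; mirrors matching queues to all nodes
rabbitmqctl set_policy ha-all ".*" '{"ha-mode":"all"}' --apply-to queues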