We've recently enabled inter-node TLS across some of our non-prod clusters, with the goal of going live with it in production next month. Yesterday we received reports that one of the systems with inter-node TLS enabled was down. The management UI wouldn't load (we just got the Rabbit logo and nothing else), and the cluster's AMQPS listeners appeared to no longer be accepting traffic. The engineer who worked on the issue noticed these warnings in one of the cluster nodes' logs during the outage:
2019-05-16 15:04:01.592 [warning] <0.10288.1> rabbit_sysmon_handler busy_dist_port <0.13512.55> [{initial_call,{erlang,apply,2}},{erlang,bif_return_trap,2},{message_queue_len,0}] <0.16022.2>
2019-05-16 15:04:21.101 [warning] <0.10288.1> rabbit_sysmon_handler busy_dist_port <0.16767.1> [{initial_call,{rabbit_prequeue,init,1}},{erlang,bif_return_trap,2},{message_queue_len,0}] <0.16022.2>
2019-05-16 15:04:35.661 [warning] <0.10288.1> rabbit_sysmon_handler busy_dist_port <0.16764.1> [{initial_call,{rabbit_prequeue,init,1}},{erlang,bif_return_trap,2},{message_queue_len,1}] <0.16022.2>
2019-05-16 15:04:48.602 [warning] <0.10288.1> rabbit_sysmon_handler busy_dist_port <0.16794.1> [{initial_call,{rabbit_prequeue,init,1}},{erlang,bif_return_trap,2},{message_queue_len,0}] <0.16022.2>
2019-05-16 15:04:49.606 [warning] <0.10288.1> rabbit_sysmon_handler busy_dist_port <0.16776.1> [{initial_call,{rabbit_prequeue,init,1}},{erlang,bif_return_trap,2},{message_queue_len,0}] <0.16022.2>
Upon restarting that single node, the cluster recovered and everything worked as expected. Doing a quick search on that warning I came across this page:
https://www.rabbitmq.com/runtime.html#distribution-buffer. From reading through it, it sounds like something (large messages?) was causing the inter-node TCP connection buffer to exceed its default limit, which is what generates that warning. Outside of these warnings being thrown consistently during the outage, nothing else stood out in the logs. Could this have caused the problem we were experiencing?
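If the distribution buffer does turn out to be the bottleneck, my understanding from that page is that its size can be raised via an environment variable (value in kilobytes). Something like the following is what I had in mind trying; the 192000 figure is just an illustrative bump over the documented 128000 kB default, not a tuned recommendation:

# rabbitmq-env.conf
# Raise the inter-node (distribution) buffer above the default 128000 kB (128 MB).
# Per the runtime docs this maps to the Erlang VM's +zdbbl flag.
RABBITMQ_DISTRIBUTION_BUFFER_SIZE=192000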
Also worth noting: this is a 3-node cluster, and ~90% of the queues in it have a policy that enforces ha-mode: all. Based on our logs, I'm fairly sure the node we restarted was the master for a significant majority of those queues at the time the cluster was unavailable.
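For context, the mirroring policy on those queues is roughly along these lines (the policy name and pattern here are illustrative; our actual pattern is more selective, which is why it only covers ~90% of the queues):

# applied once via rabbitmqctl; mirrors matching queues to all nodes
rabbitmqctl set_policy ha-all ".*" '{"ha-mode":"all"}' --apply-to queues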