RabbitMQ Node in a cluster crashing (3.8.1)

Sagar Shah

Oct 27, 2020, 5:28:12 PM
to rabbitmq-users
Here are the RabbitMQ cluster details:
Node Count: 3
Version: 3.8.1
Infrastructure: AWS EC2 (each node in different availability zone)

We occasionally see one of the RabbitMQ nodes crash and restart, after which we have to synchronize the queues manually.

In addition, after we synchronize all the queues, we need to restart some (not all) of our microservices (Spring Boot applications with RabbitMQ queue listeners) so that the listeners reconnect to their queues and resume processing messages. (Note: we usually find that some queues have 0 consumers, so we pick the microservices that own those queues for restart.)
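For illustration, the manual check for queues with 0 consumers could be sketched roughly as below. This is only a sketch under assumptions: it assumes the management plugin is enabled and reachable over HTTP, and the function names, URL, and credentials are hypothetical placeholders, not part of our actual setup.

```python
import base64
import json
import urllib.request


def queues_without_consumers(queues):
    """Return (vhost, name) pairs for queues that report zero consumers.

    `queues` is the parsed JSON list returned by the management
    plugin's /api/queues endpoint.
    """
    return [
        (q.get("vhost", "/"), q["name"])
        for q in queues
        if q.get("consumers", 0) == 0
    ]


def fetch_queues(base_url, user, password):
    """Fetch queue descriptions from <base_url>/api/queues using HTTP basic auth.

    base_url, user and password are assumptions supplied by the caller.
    """
    request = urllib.request.Request(f"{base_url}/api/queues")
    token = base64.b64encode(f"{user}:{password}".encode()).decode()
    request.add_header("Authorization", f"Basic {token}")
    with urllib.request.urlopen(request, timeout=10) as response:
        return json.load(response)
```

Calling `queues_without_consumers(fetch_queues("http://localhost:15672", "guest", "guest"))` against a node (placeholder host and credentials) would list the queues whose listeners have not reconnected, which is how we decide which services to restart.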

This happens about once every two weeks. Attached are some of the error logs from the time of a crash.

Any help with further troubleshooting is appreciated.
Please let me know if any further details are needed.

rabbit-error.log

Wesley Peng

Oct 27, 2020, 8:35:55 PM
to rabbitm...@googlegroups.com
On 2020/10/28 5:28 AM, Sagar Shah wrote:
> Further to that, after we synchronize all the queues, we also need to
> restart many (well, not all but some) of our micro services (spring
> boot that has rabbitmq queue listeners) so that listeners make the
> connections with queue and resume processing messages. (Note: We
> generally notice some of the queues have 0 consumers so we pick those
> micro services for restart.)


This is an old issue for RabbitMQ running in the cloud. You should have
monitoring for IaaS metrics such as networking, disk I/O, and memory usage.
Sometimes a network issue or a VM snapshot can cause similar problems.


Regards.

Sagar Shah

Oct 28, 2020, 7:43:52 AM
to rabbitmq-users
Thank you for getting back to me on this issue. We do have monitoring in place, but we still have to go through all the steps listed above to recover, and only reactively.
We are looking for the cause of the crash and a possible fix to prevent it from happening. Is there already a bug/issue tracking this in the RabbitMQ GitHub repository?

I appreciate your suggestions.
