One node in the cluster suddenly became unavailable.


Wong Energy

Apr 6, 2023, 2:59:30 AM
to rabbitmq-users
Dear RabbitMQ Support Team,

I am writing to report an issue with one of the nodes in our cluster. The node is experiencing intermittent unavailability, where it becomes unreachable and then recovers automatically after approximately 1 minute. This issue has occurred multiple times, with the most recent occurrence being on April 3rd, 2023, and other occurrences on March 18th, 2023, and March 2nd, 2023.

According to the metrics exposed by rabbitmq_prometheus, monitoring data for the faulty node was interrupted, along with that of the other nodes.
[Attached images: rabbitmq_prometheus1.png, rabbitmq_prometheus2.png]

Our cluster consists of 3 nodes running on CentOS 7 on Azure with a machine configuration of 8-core CPU, 32GB RAM, and 256GB SSD. The RabbitMQ version is 3.8.12 and the Erlang version is 22.3.

No other applications are deployed on the machines where RabbitMQ runs. According to the monitoring data exposed by node-exporter, there was no high CPU usage before or after the failure (the load stayed at 5 or less), and memory showed no significant fluctuations. Disk I/O did not appear to be a bottleneck.

I have attached the rabbitmqctl environment information and the logs as files. I do not understand what caused this situation, and I would appreciate either an explanation of the cause of the failure or suggestions for preventing it from happening again.

Thank you for your attention, and we look forward to your reply.

Sincerely,

Wong
rabbit@T-MAB1-MQ-01.log
rmq_env.txt

Michal Kuratczyk

Apr 7, 2023, 9:52:25 AM
to rabbitm...@googlegroups.com
1. This is a community support forum. If you want to reach the RabbitMQ Support Team, you need to be a paying customer, and the communication channel is different.

2. RabbitMQ 3.8 has been out of support for a long time.

3. Intermittent issues like that are often caused by excessive communication between cluster nodes. This can be caused by queue mirroring (a deprecated feature) or by monitoring over the Management API, but other things could cause it as well.

Upgrade and move to quorum queues if you are still using mirroring.
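
[Editor's sketch, not part of the original reply] One way to check whether mirroring is still in use is to look for policies whose definition sets `ha-mode`, the key that enables classic queue mirroring. The helper below is a minimal sketch that assumes a list of policy objects in the shape returned by the management HTTP API's `GET /api/policies` (the same information is available from `rabbitmqctl list_policies`). The function name and the sample data are hypothetical.

```python
import json


def mirroring_policies(policies):
    """Return (vhost, name, pattern) for every policy that enables mirroring.

    A policy enables classic queue mirroring when its definition
    contains the "ha-mode" key.
    """
    return [
        (p.get("vhost"), p.get("name"), p.get("pattern"))
        for p in policies
        if "ha-mode" in p.get("definition", {})
    ]


# Hypothetical sample data, shaped like a GET /api/policies response.
sample = json.loads("""
[
  {"vhost": "/", "name": "ha-all", "pattern": ".*",
   "apply-to": "queues",
   "definition": {"ha-mode": "all", "ha-sync-mode": "automatic"},
   "priority": 0},
  {"vhost": "/", "name": "ttl", "pattern": "^tmp\\\\.",
   "apply-to": "queues",
   "definition": {"message-ttl": 60000},
   "priority": 1}
]
""")

for vhost, name, pattern in mirroring_policies(sample):
    print(f"mirroring policy {name!r} in vhost {vhost!r} matches {pattern!r}")
```

Queues matched by such a policy are the candidates for replacement. A quorum queue is declared as a durable queue with the argument `x-queue-type` set to `quorum`; an existing classic queue cannot be converted in place, so it has to be declared anew.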

Best,


--
Michał
RabbitMQ team

Wong Energy

Aug 31, 2023, 4:28:07 AM
to rabbitmq-users
Hi, Michal.
Today, this issue occurred again. How should we define excessive communication between cluster nodes?
I have observed that there are more queues on this node than on the others. The number of "Ready" messages has consistently been around 50,000, with approximately 10,000 messages "Unacked".
Could this be related to the issue?

kjnilsson

Aug 31, 2023, 4:33:57 AM
to rabbitmq-users
If you are still on 3.8, please upgrade; we cannot investigate issues with out-of-support versions.

If you are on a more recent version (3.10+), then please provide full details, including logs covering the incident.
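
[Editor's sketch, not part of the original reply] To share only the log entries that cover an incident, the node log can be trimmed to a time window. The sketch below assumes each log entry starts with a `YYYY-MM-DD HH:MM:SS` timestamp and that lines without a timestamp continue the previous entry; the exact timestamp format depends on the RabbitMQ version and logging configuration. The function name, the 10-minute margin, and the sample lines are all hypothetical.

```python
import re
from datetime import datetime, timedelta

# Assumed entry prefix: "2023-08-31 04:20:05", optionally followed by more.
TIMESTAMP = re.compile(r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})")


def lines_around(lines, incident, margin=timedelta(minutes=10)):
    """Keep log lines whose entry timestamp is within `margin` of `incident`.

    Lines without a leading timestamp are treated as continuations and
    follow the decision made for the entry they belong to.
    """
    keep = False
    selected = []
    for line in lines:
        match = TIMESTAMP.match(line)
        if match:
            ts = datetime.strptime(match.group(1), "%Y-%m-%d %H:%M:%S")
            keep = abs(ts - incident) <= margin
        if keep:
            selected.append(line)
    return selected


# Hypothetical log lines, for illustration only.
log = [
    "2023-08-31 03:00:00.000 [info] <0.1.0> early entry",
    "2023-08-31 04:20:05.123 [error] <0.2.0> hypothetical error",
    "    continuation of the error",
    "2023-08-31 05:00:00.000 [info] <0.3.0> late entry",
]

for line in lines_around(log, datetime(2023, 8, 31, 4, 25)):
    print(line)
```

Running the same window over the logs of all three nodes gives a view of what each node saw around the time one of them became unreachable.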
