RabbitMQ Cluster problems


Peter Drahoš

Dec 8, 2023, 2:47:26 PM
to rabbitmq-users
Hello,
we have an on-premises cluster with 4 nodes on Windows Server 2019 (VMware).
Versions: RabbitMQ 3.12.6, Erlang 26.1.1
All queues are of type quorum.
Config: cluster_partition_handling = pause_minority
Avg Message rates: <5 /s

We started the cluster this year and have had several issues with it. Every issue ended in a reset and clean re-creation of the cluster because applications stopped working correctly.
The last problem was very dangerous: the application published to the exchange successfully, but the message didn't arrive in the bound queue. In the management plugin I saw the correct binding, but in the trace message (from amq.rabbitmq.trace) the queue was missing from routed_queues.
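The check described above (expected binding versus what the trace reports) can be sketched as follows. This is only an illustration: it assumes the headers of a traced publish have already been decoded into a Python dict with `exchange_name` and `routed_queues` keys, as the firehose tracer provides them, and the `EXPECTED` mapping and the `missing_routes` helper are hypothetical names, not part of RabbitMQ.

```python
from typing import Dict, List, Set


def missing_routes(expected: Dict[str, Set[str]], trace_headers: dict) -> List[str]:
    """Return the expected queues that a traced publish did NOT reach."""
    exchange = trace_headers.get("exchange_name", "")
    routed = set(trace_headers.get("routed_queues") or [])
    return sorted(expected.get(exchange, set()) - routed)


# Expected topology, using the exchange and queue names from this thread.
EXPECTED = {"ReportRequest": {"ReportRequest_Queue"}}

# Hypothetical headers of one traced publish where routing failed.
headers = {"exchange_name": "ReportRequest", "routed_queues": []}

print(missing_routes(EXPECTED, headers))  # ['ReportRequest_Queue']
```

A non-empty result for an exchange whose binding is still listed in the management plugin would match the symptom reported here.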

Investigation time (tracing): 2023-12-07 13:22:39.203000+01:00 
Exchange: ReportRequest
Queue: ReportRequest_Queue
Restarts and resets started at: 2023-12-07 14:44:25.170000+01:00

I'm attaching full logs starting from one day before.
There are a lot of log entries like "...may be down, setting pre-vote timeout", followed after 5s by "is back up, cancelling pre-vote timeout". Does this mean there is a problem with the LAN connection between nodes?
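To get a feel for how often these pairs occur, one could count them per log file. The sketch below matches only the two message fragments quoted above; it makes no assumption about the rest of the log line format, and `count_prevote_events` is a hypothetical helper name.

```python
from collections import Counter
from typing import Iterable

DOWN = "may be down, setting pre-vote timeout"
UP = "is back up, cancelling pre-vote timeout"


def count_prevote_events(lines: Iterable[str]) -> Counter:
    """Count 'may be down' and 'is back up' entries in an iterable of log lines."""
    counts: Counter = Counter()
    for line in lines:
        if DOWN in line:
            counts["down"] += 1
        elif UP in line:
            counts["up"] += 1
    return counts


# Placeholder lines built from the fragments quoted in this post.
sample = [
    "...may be down, setting pre-vote timeout",
    "...is back up, cancelling pre-vote timeout",
    "...may be down, setting pre-vote timeout",
]
counts = count_prevote_events(sample)
# A positive difference means some "down" entries had no matching "back up" entry.
print(counts["down"], counts["up"], counts["down"] - counts["up"])  # 2 1 1
```

For a real log, pass an open file object instead of `sample`, since a file iterates line by line.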

Thank you,
Peter

Cluster_logs_cut.zip

Peter Drahoš

Dec 9, 2023, 9:12:10 AM
to rabbitmq-users
I want to add more info.
ReportRequest_Queue, which stopped working, was created more than a month ago and had been working OK.
When I started the investigation, I created and bound a new queue, ReportTest. The new queue worked OK (I also saw it in routed_queues in amq.rabbitmq.trace). Then, in the management plugin, I unbound and re-bound ReportRequest_Queue. After that, ReportRequest_Queue worked OK, but ReportTest stopped working (this was also visible in routed_queues in amq.rabbitmq.trace).
Looks like the cluster got corrupted.
So another question is: how should I monitor the cluster to tell whether it's working OK or is corrupted?
We monitor the cluster nodes with HAProxy, but that only gives running/stopped information, and it seems that's not enough.
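One step beyond running/stopped is to poll the health check endpoints of the management plugin's HTTP API, which answer 200 when a check passes and 503 when it fails. The following is a minimal sketch, not a complete monitoring setup: the host name, port and credentials are placeholders, and `run_health_checks` and `http_status` are hypothetical helper names. These checks may well not catch a routing problem like the one described here; a canary message published through the real exchange and consumed from the bound queue would probably be needed for that.

```python
import base64
import urllib.error
import urllib.request
from typing import Callable, Dict

# Health check names served under /api/health/checks/ by the management plugin.
HEALTH_CHECKS = ("alarms", "local-alarms", "node-is-quorum-critical", "virtual-hosts")


def http_status(url: str, user: str, password: str, timeout: float = 5.0) -> int:
    """GET a URL with basic auth and return the HTTP status (0 if unreachable)."""
    token = base64.b64encode(f"{user}:{password}".encode()).decode()
    request = urllib.request.Request(url, headers={"Authorization": f"Basic {token}"})
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        return err.code  # e.g. 503 when a health check fails
    except OSError:
        return 0  # connection refused, timeout, DNS failure


def run_health_checks(base_url: str, fetch: Callable[[str], int]) -> Dict[str, bool]:
    """Map each health check name to True (passed) or False (failed)."""
    return {
        name: fetch(f"{base_url}/api/health/checks/{name}") == 200
        for name in HEALTH_CHECKS
    }


# Demo with a fake fetcher so the sketch runs without a broker.
# Real use (placeholder credentials):
#   fetch = lambda url: http_status(url, "monitor_user", "secret")
def fake_fetch(url: str) -> int:
    return 503 if url.endswith("node-is-quorum-critical") else 200


print(run_health_checks("http://node1:15672", fake_fetch))
```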

Thank you.

kjnilsson

Jan 15, 2024, 10:29:52 AM
to rabbitmq-users
Looking at the logs it does seem like you have a cluster that isn't particularly stable. What kind of network do you have between the nodes?

Is there a particular reason for using 4 nodes rather than 3 (our usual recommendation)?

Do you use publisher confirms to make sure messages are received at the queue before your applications move on?

What size of messages do you send?

Cheers
Karl

Peter Drahoš

Jan 15, 2024, 11:32:34 AM
to rabbitmq-users
Hi,
thank you for your reply.
We have 2 server room locations connected by fibre optic, with 2 nodes at each location. We want location redundancy (that is the reason for using 4 nodes).
Yes, we use publisher confirms.
Messages are small, 95% are <1kB, maybe 5% are 10-30kB.

A few times I ran some network diagnostics, and ping latency is <1ms.
I ran PerfTest between the 2 locations, simulating our app conditions (4 producers, 4 consumers, 4 msg/s):
id: test 1, time 97,001 s, received: 4,0 msg/s, min/median/75th/95th/99th consumer latency: 46/46/46/47/47 ms

But I understand you wrote that it isn't particularly stable. So maybe I should let it run for 1 day and see the results.
Are these the log entries that indicate an unstable network: "...may be down, setting pre-vote timeout", followed after 5s by "is back up, cancelling pre-vote timeout"?

But the problem I described is that the queue binding disappeared after a few months of working OK (it was created a few months ago).
Maybe there is some network instability, but most of the time the network seems OK, so I don't understand how this can happen.

Thank you,
Peter

kjnilsson

Jan 15, 2024, 12:12:14 PM
to rabbitmq-users
With a quorum queue the default number of members is 3, so it will pick 3 random nodes to spread over. This means that one location will have two members and the other one, so if the former location (with 2 members) goes down or is otherwise unavailable, the queue will not be able to make progress. Something to be aware of.
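The arithmetic behind this can be enumerated directly. The sketch below assumes two locations, A and B, with two nodes each (the node names are made up) and uses the usual majority rule: a queue with n members needs floor(n/2) + 1 of them online.

```python
from itertools import combinations

# Hypothetical layout: two nodes in each of two locations.
LOCATION = {"a1": "A", "a2": "A", "b1": "B", "b2": "B"}


def survives(members, lost_location: str) -> bool:
    """True if a majority of the queue members remains after losing a location."""
    alive = [node for node in members if LOCATION[node] != lost_location]
    return len(alive) >= len(members) // 2 + 1


def fatal_locations(members):
    """Locations whose loss leaves the queue without a majority."""
    return [loc for loc in ("A", "B") if not survives(members, loc)]


# Default of 3 members: every possible placement depends on one location.
for members in combinations(sorted(LOCATION), 3):
    print(members, "stops if this location is lost:", fatal_locations(members))

# 4 members (one per node): losing either location leaves 2 of 4, not a majority.
print(tuple(sorted(LOCATION)), "stops if this location is lost:",
      fatal_locations(tuple(sorted(LOCATION))))
```

With only two locations, no placement lets a quorum queue survive the loss of either one: each 3-member placement depends on the location holding 2 members, and a 4-member queue depends on both.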

The "may be down" messages happen when the quorum queue fault detector detects a potentially down or slow node. The fact that it later cancels its pre-vote suggests that for a while the link was slow and the fault detector heartbeats didn't get through. There can be a few reasons for this: large messages or a large peak of incoming messages overloading the link, or it could simply be that the network itself had an issue.

Which nodes are in which location?

Peter Drahoš

Jan 16, 2024, 7:34:35 AM
to rabbitmq-users
Thanks for pointing me to the geo clustering topic, now I see it's complex. In case of a big incident like the loss of a location, we can accept an RTO in the tens of minutes. The common situation is that when one location goes down, the VMs are started at the other location.
I think with 4 nodes there's a better chance of having the application running sooner (compared to 3). What do you think?
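One way to reason about the 4-versus-3 question is to look at what `cluster_partition_handling = pause_minority` does when the link between the locations is cut. As documented, a node pauses when its side of the partition holds half or fewer of the cluster's nodes. The sketch below only encodes that rule; `pauses` is a hypothetical helper name.

```python
def pauses(partition_size: int, cluster_size: int) -> bool:
    """True if nodes on a partition side of this size pause under pause_minority."""
    return 2 * partition_size <= cluster_size


# 4 nodes split 2 + 2 between the locations: both sides pause.
print("4 nodes, 2/2 split:", pauses(2, 4), pauses(2, 4))  # True True

# 3 nodes split 2 + 1: the side with 2 nodes keeps running, the single node pauses.
print("3 nodes, 2/1 split:", pauses(2, 3), pauses(1, 3))  # False True
```

Under this rule a 2 + 2 layout stops entirely on a link failure or on the loss of either location, while a 2 + 1 layout survives only when the smaller side is the one that is lost.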

Nodes 1 and 3 are at location 1, nodes 2 and 4 at the other.
I think that with our message rate of <5 msg/s, small message size and mostly empty queues, the issue will be the network.
In the logs I see "may be down" also between nodes at the same location. I spoke with our admin today and he thinks it could be related to VMs being migrated between hosts in case of overload and so on. He will adjust the settings to minimize this.
How can I test whether our exact environment (mainly the network) is suitable for the cluster?

Peter Drahoš

Jan 19, 2024, 10:10:17 AM
to rabbitmq-users
After the change to the VM migration behaviour, nothing changed in the "may be down" events.
Over the past 2 days I ran a ping test between node1 and the others (1-2, 1-3, 1-4): PowerShell ping with 1kB of data every 1s.
Results for the 1-2 (or 1-4) connection: 3 times 12ms, avg. 0.2ms; for the 1-3 connection: 1 time 500ms, 2 times 10ms, avg. 1.1ms.
In this period, there were 15 "may be down" events on node1.
One "may be down" event occurred when there was that 500ms latency.
All other events occurred while latency was stable at around 0-2ms.
From the results I don't see that the network is particularly unstable; the servers and hosts have enough resources, and the msg rate is very low, without peaks.
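The comparison described above, matching each "may be down" event against the ping samples around it, could be automated roughly as follows. The timestamps and latencies in the demo are invented for illustration and are not the measured values; `max_latency_near` is a hypothetical helper, and the 5s window is simply borrowed from the gap seen between the "may be down" and "is back up" entries.

```python
from datetime import datetime, timedelta
from typing import List, Optional, Tuple

Sample = Tuple[datetime, float]  # (time of ping, round-trip latency in ms)


def max_latency_near(
    event: datetime, samples: List[Sample], window: timedelta = timedelta(seconds=5)
) -> Optional[float]:
    """Worst ping latency within +/- window of an event, or None if no samples."""
    nearby = [ms for ts, ms in samples if abs(ts - event) <= window]
    return max(nearby) if nearby else None


# Invented data: one ping per second at 0.2 ms, with a single 500 ms spike.
t0 = datetime(2024, 1, 18, 12, 0, 0)
samples = [(t0 + timedelta(seconds=i), 0.2) for i in range(60)]
samples[30] = (t0 + timedelta(seconds=30), 500.0)

# Invented event times: one during a quiet stretch, one next to the spike.
events = [t0 + timedelta(seconds=10), t0 + timedelta(seconds=32)]
for event in events:
    print(event.time(), max_latency_near(event, samples))
```

Note that a probe sent once per second can miss stalls shorter than the probe interval, so events with no visible latency spike do not by themselves rule out the network.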
Do you have some suggestions?

Peter Drahoš

Apr 25, 2024, 10:04:12 AM
to rabbitmq-users
Hello, 
I want to get back to this issue because it happened again 2 days ago.
Exactly the same progression, after the same time of approx. 3-4 months: we diagnosed that one queue had lost its binding, even though it had been configured and working from the beginning. The issue later happened on many queues.
First we diagnosed one node as corrupted; after the reset of this node, the issue spread to the other nodes (we tested publishing to individual nodes). We needed to reset the whole cluster.
I didn't receive answers to my questions from last time and I have new ones:
1. How can I test whether our exact environment (mainly the network) is suitable for the cluster?
I have to say, I don't believe such partial outages in communication can cause the loss of a binding that had been working for months.
2. We use RabbitMQ 3.12.6 and Erlang 26.1.1. Is this issue solved in the latest version?
3. It is strange that this issue does not happen to other users. I suppose the majority of production environments run on Linux, so is migration to Linux the only possible solution to our problem?

Thank you,
Peter  
 

Luke Bakken

Apr 26, 2024, 1:17:42 PM
to rabbitmq-users
Hi Peter,

You say everything was working "for months" ("queue binding disappeared after few months of working ok"), and then you started seeing these issues. Something must have changed in your environment - either software versions, workload, network reliability, etc. Hopefully you have monitoring in place so you can compare a stretch of time where everything worked normally to when you experience these issues.

At this point, all we can do is provide suggestions. Here is what I would suggest:
  • There is no need that I can see at this time to switch to Linux.
  • You should run a 3-node cluster in each of your "server rooms", and if you need messages to be moved or copied between these clusters, use Federation.
Please remember that the support you get on this mailing list is free-of-charge, as is RabbitMQ, Erlang and other RabbitMQ components. Nobody is under any obligation to respond to your questions. If this is an urgent issue, support is available - https://www.rabbitmq.com/contact#paid-support

Thanks,
Luke