Autoheal issue

Sven Spudat

Jun 3, 2022, 1:54:45 PM
to rabbitmq-users
Hi,
We recently ran into an issue with a 3-node RabbitMQ 3.8.5 cluster (Erlang 23.0) after a brief power outage of about 5 minutes.
The outage brought down the local network but not the VM hosts running RabbitMQ, so the nodes kept running but could not see each other
(the cluster partition handling strategy was auto-heal).

My understanding is that after the network comes back, the nodes agree on a winner, the "losers" restart, and everything continues as before.
It looks like something went wrong, because afterwards the cluster was in an inconsistent state.
One node was reported as having experienced an error and being unresponsive, but it still showed up as hosting a queue (with 0 mirrors and NaN counters).
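
To spell out what I expected to happen, here is a small Python sketch of the winner selection as I understand it from the partitions documentation: the partition with the most client connections wins, a tie goes to the partition with the most nodes, and any remaining tie is broken in an unspecified way. The function name, node names, and connection counts are made up for illustration; this is a toy model, not RabbitMQ's actual code.

```python
from typing import Dict, FrozenSet, List


def pick_autoheal_winner(
    partitions: List[FrozenSet[str]],
    client_connections: Dict[str, int],
) -> FrozenSet[str]:
    """Toy model of autoheal winner selection (hypothetical helper)."""

    def score(partition: FrozenSet[str]):
        # Primary criterion: total client connections in the partition.
        clients = sum(client_connections.get(node, 0) for node in partition)
        # Secondary criterion: number of nodes in the partition.
        return (clients, len(partition))

    # max() keeps the first best candidate; the real tie-break is unspecified.
    return max(partitions, key=score)


# Our case: every node was isolated, so there were three one-node partitions.
# The connection counts below are invented.
partitions = [
    frozenset({"rabbit@a"}),
    frozenset({"rabbit@b"}),
    frozenset({"rabbit@c"}),
]
connections = {"rabbit@a": 4, "rabbit@b": 9, "rabbit@c": 1}

winner = pick_autoheal_winner(partitions, connections)
losers = [p for p in partitions if p != winner]
print(sorted(winner), len(losers))  # prints: ['rabbit@b'] 2
```

In this model two of the three nodes would have to restart after the outage, which is more than in the 1-out-of-3 case we had seen before.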

I have seen auto-heal work well when 1 out of 3 nodes goes offline, but to be honest, we have never had this particular case, where none of the nodes could see any of the others.
Should the cluster be able to recover in this case?
Is there a better strategy than auto-heal for this case?
Would the use of quorum queues have changed anything here?

On top of that, after the network was back we experienced another issue that I am still struggling to understand.
We have an incoming shovel to this cluster that publishes to an exchange with 2 bound queues. One of these queues was on the "unresponsive" node with 0 mirrors.
We normally get a couple hundred messages via this shovel, but after the network was back we got as many as 2 million in a short period of time, eventually leading to a memory alarm.
On the sending cluster I noticed that the "redelivered" counter was very high, so it looks like the same message(s) were sent over and over again?

The shovel uses on-confirm ack mode, so I guess the messages were sent (and also received) but never acked, and therefore resent over and over?
I am not 100% clear on when the acknowledgement is sent to the publisher, so I would be grateful if someone could help me out here.
Is it the case that the receiver acknowledges a message when it has been routed to all bound queues? In that case I could understand it, because one of the bound queues was unresponsive.
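
To make my assumption concrete, here is a toy Python model of what I think happened. It assumes that the destination broker only confirms a publish once every queue the message was routed to has accepted it, and that the shovel only acks the source message after that confirm, so an unconfirmed message gets requeued and redelivered on each retry. All class and function names are hypothetical; this is not the shovel plugin's code, just a sketch of my guess.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ToyQueue:
    """Stand-in for a destination queue (hypothetical, not a RabbitMQ API)."""
    name: str
    responsive: bool = True
    messages: List[str] = field(default_factory=list)

    def accept(self, msg: str) -> bool:
        # An unresponsive queue never stores or accepts the message.
        if self.responsive:
            self.messages.append(msg)
        return self.responsive


def publish_with_confirm(msg: str, bound_queues: List[ToyQueue]) -> bool:
    # Assumption: every bound queue is offered the message, and the
    # confirm is only sent if all of them accepted it.
    results = [queue.accept(msg) for queue in bound_queues]
    return all(results)


def run_shovel(msg: str, bound_queues: List[ToyQueue], max_attempts: int) -> int:
    """Return how often the source delivered msg before giving up or being acked."""
    deliveries = 0
    for _ in range(max_attempts):
        deliveries += 1
        if publish_with_confirm(msg, bound_queues):
            break  # confirm received, source message is acked
        # No confirm: assume the message is requeued and redelivered.
    return deliveries


healthy = ToyQueue("q1")
stuck = ToyQueue("q2", responsive=False)
deliveries = run_shovel("m1", [healthy, stuck], max_attempts=5)
print(deliveries, len(healthy.messages), len(stuck.messages))  # prints: 5 5 0
```

If this guess is right, a single source message ends up in the healthy queue once per retry, which would match both the high "redelivered" counter and the flood of messages we saw.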


Thanks and best regards
Sven