We have a fairly basic HA policy applied across all vHosts in our Rabbit cluster. We apply it to all queues and exchanges in each vHost and set the following config keys:
ha-mode: all
ha-sync-mode: automatic
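In rabbitmqctl terms, the policy is applied roughly like this (the policy name "ha-all" and the catch-all pattern are illustrative; the actual name in our cluster may differ):

```shell
# Apply the HA policy to every queue and exchange in the vHost.
# "ha-all" and ".*" are placeholders for our real policy name/pattern.
rabbitmqctl set_policy -p MyVHostName ha-all ".*" \
  '{"ha-mode":"all","ha-sync-mode":"automatic"}' \
  --apply-to all
```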
We recently had an incident with our RabbitMQ cluster where a number of our queues (across all vHosts) had lost that HA policy. In the queue overview, the Features column for the affected queues no longer listed the HA policy, and the queues were not in a Running or Idle state. They also listed only two of our four nodes (one as the queue's node and one as a mirror), and I am fairly sure they were all unsynchronised.
We think that this started when a couple of our Rabbit cluster nodes crashed and restarted, but we are not sure how to prove that yet (the logs aren't clear to me).
The fix came from creating a temporary HA policy with the same keys, limited to a single queue via a match pattern and given a higher priority. It applied immediately to that queue, but it also re-applied the original HA policy to all the queues that had been missing it!
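Concretely, the temporary policy looked something like this (the policy name "temp-ha" and priority 10 are from memory, so treat them as approximate):

```shell
# Temporary higher-priority policy scoped to the one broken queue.
rabbitmqctl set_policy -p MyVHostName temp-ha "^nsb\.delay-level-27$" \
  '{"ha-mode":"all","ha-sync-mode":"automatic"}' \
  --priority 10 --apply-to queues

# Once the queues re-synced we removed the temporary policy again:
rabbitmqctl clear_policy -p MyVHostName temp-ha
```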
Like I said, the logs are not that revealing to me, so I would like to know what would indicate the original cause of the HA policy's removal. I have pasted a couple of log excerpts below, but I'm honestly not sure how relevant they are and they don't tell me a lot.
This is from around the start of the incident and relates to one of the affected queues (nsb.delay-level is an NServiceBus system queue type):
2021-09-05 21:09:08.807 [warning] <0.8691.16> Mirrored queue 'nsb.delay-level-27' in vhost 'MyVHostName': Unable to start queue mirror on node ''rabbit@node-04''. Target virtual host is not running: {vhost_supervisor_not_running,<<"MyVHostName">>}
Followed by this:
2021-09-05 21:11:09.923 [warning] <0.18672.0> Mirrored queue 'nsb.delay-level-27' in vhost 'MyVHostName': Stopping all nodes on master shutdown since no synchronised mirror (replica) is available
Then this:
2021-09-05 21:11:24.941 [warning] <0.18672.0> Mirrored queue 'nsb.delay-level-27' in vhost 'MyVHostName': Missing 'DOWN' message from <7777.10434.338> in node 'rabbit@node-01'
We then had a particular .NET client that relied on that queue at startup; it threw the following error every time it tried (and failed) to start:
2021-09-06 09:42:11.121 [error] <0.9850.320> Channel error on connection <0.10514.320> (ip.address.client:53169 -> ip.address.rabbit:5671, vhost: 'MyVHostName', user: 'MyUser'), channel 1:operation queue.declare caused a channel exception not_found: failed to perform operation on queue 'nsb.delay-level-27' in vhost 'MyVHostName' due to timeout
If anyone could give me some guidance on generally troubleshooting this that'd be amazing.
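For context, this is roughly how I have been inspecting policy and mirror state so far (column names taken from the rabbitmqctl docs, so correct me if any are off):

```shell
# List the policies currently defined in the vHost.
rabbitmqctl list_policies -p MyVHostName

# Show each queue's effective policy, state, master pid, and mirrors.
rabbitmqctl list_queues -p MyVHostName \
  name policy state pid slave_pids synchronised_slave_pids
```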