Hi,
We have a few clusters with 3 nodes (RabbitMQ 3.11.5, Erlang 25.2), mostly with queues and streams.
A few times in the past, after a cluster restart, we saw a single quorum queue that shows "NaN" for all its stats, and running `rabbitmq-queues quorum_status` on it times out.
Usually the queue recovers after a few minutes, but today it had not recovered after more than 30 minutes, so we had to delete it.
We run the cluster with debug logging, and we saw the following logs on one of the nodes, which was the leader according to the management UI:
```
08:53:24.872775+00:00 [debug] <0.743.0> queue '*BAD_QUEUE*' in vhost '/': ra_log:init recovered last_index_term {152506801,14} first index 128548952
08:53:24.873471+00:00 [debug] <0.743.0> queue '*BAD_QUEUE*' in vhost '/': post_init -> recover in term: 14 machine version: 3
08:53:24.873534+00:00 [debug] <0.743.0> queue '*BAD_QUEUE*' in vhost '/': recovering state machine version 3:3 from index 128548951 to 152506801
08:54:35.333787+00:00 [debug] <0.743.0> queue '*BAD_QUEUE*' in vhost '/': enabling ra cluster changes in 14
```
(There was no "recovery of state machine ... took XXXms" line in the log.)
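For scale, here is a rough Python sketch of the arithmetic on the log lines above. It assumes the gap between the "recovering state machine" line and the "enabling ra cluster changes" line approximates the replay time; that is only my reading of the log, not something the log confirms. The helper names (`parse_ts`, `replay_summary`) are made up for this example.

```python
import re
from datetime import datetime

# The two relevant lines, copied from the debug log above.
LOG = """\
08:53:24.873534+00:00 [debug] <0.743.0> queue '*BAD_QUEUE*' in vhost '/': recovering state machine version 3:3 from index 128548951 to 152506801
08:54:35.333787+00:00 [debug] <0.743.0> queue '*BAD_QUEUE*' in vhost '/': enabling ra cluster changes in 14
"""

TS = re.compile(r"^(\d{2}:\d{2}:\d{2}\.\d{6})")
RANGE = re.compile(r"from index (\d+) to (\d+)")


def parse_ts(line):
    """Parse the leading HH:MM:SS.ffffff timestamp (same-day assumption)."""
    return datetime.strptime(TS.match(line).group(1), "%H:%M:%S.%f")


def replay_summary(log):
    """Return (entries_to_replay, seconds_between_the_two_lines)."""
    lines = log.strip().splitlines()
    start_line = next(l for l in lines if "recovering state machine" in l)
    end_line = next(l for l in lines if "enabling ra cluster changes" in l)
    first, last = map(int, RANGE.search(start_line).groups())
    entries = last - first
    # Assumption: the second line marks (roughly) the end of the replay.
    elapsed = (parse_ts(end_line) - parse_ts(start_line)).total_seconds()
    return entries, elapsed


entries, elapsed = replay_summary(LOG)
print(f"{entries} entries in {elapsed:.1f}s")
```

So the node appears to replay about 24 million log entries over roughly 70 seconds, even though the queue is usually empty.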
The queue is usually empty, with a constant rate of 10 messages/sec in and out, and a prefetch of 2000 for the single consumer using it.
Is there any way to debug this further? Is there an option that can be set to ensure commits?
Thanks,
Ohad