Quorum queue not recovering after cluster restart

Ohad Ravid

Jan 1, 2023, 12:16:40 PM
to rabbitmq-users
Hi,

We have a few clusters with 3 nodes (RabbitMQ 3.11.5, Erlang 25.2), mostly with queues and streams.
A few times in the past, after a cluster restart, we saw a single quorum queue that showed "NaN" for all its stats, and running `rabbitmq-queues quorum_status` on it timed out.

Usually the queue recovers after a few minutes, but today it failed to do so after more than 30 minutes, so we had to delete it.

We ran the cluster with debug logging, and we saw the following logs on one of the nodes, which was the leader according to the management UI:

```
08:53:24.872775+00:00 [debug] <0.743.0> queue '*BAD_QUEUE*' in vhost '/': ra_log:init recovered last_index_term {152506801,14} first index 128548952
08:53:24.873471+00:00 [debug] <0.743.0> queue '*BAD_QUEUE*' in vhost '/': post_init -> recover in term: 14 machine version: 3
08:53:24.873534+00:00 [debug] <0.743.0> queue '*BAD_QUEUE*' in vhost '/': recovering state machine version 3:3 from index 128548951 to 152506801
08:54:35.333787+00:00 [debug] <0.743.0> queue '*BAD_QUEUE*' in vhost '/': enabling ra cluster changes in 14
```
(There was no "recovery of state machine ... took XXXms" line after these.)

The queue is usually empty, with a constant rate of 10 messages/sec in/out and a prefetch of 2000 for the single consumer using it.

Is there any way to debug this further? Is there an option that can be set to ensure commits?

Thanks,
Ohad

Karl Nilsson

Jan 1, 2023, 2:34:03 PM
to rabbitm...@googlegroups.com
It probably took a long time to recover the ~24M log entries. 
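
As a rough check on that figure, the replay range can be read straight off the `recovering state machine` debug line quoted in the original post. This is only a sketch in Python: the regex and the helper names `replay_range` and `timestamp` are illustrative assumptions, not part of any RabbitMQ tooling, and the elapsed time is just the gap to the next log line, not a confirmed recovery duration.

```python
import re
from datetime import datetime

# Debug lines copied verbatim from the original post.
RECOVER_LINE = (
    "08:53:24.873534+00:00 [debug] <0.743.0> queue '*BAD_QUEUE*' in vhost '/': "
    "recovering state machine version 3:3 from index 128548951 to 152506801"
)
NEXT_LINE = (
    "08:54:35.333787+00:00 [debug] <0.743.0> queue '*BAD_QUEUE*' in vhost '/': "
    "enabling ra cluster changes in 14"
)


def replay_range(line):
    """Return (start, end, count) parsed from a 'recovering state machine' line."""
    m = re.search(r"from index (\d+) to (\d+)", line)
    if m is None:
        raise ValueError("not a state machine recovery line")
    start, end = int(m.group(1)), int(m.group(2))
    return start, end, end - start


def timestamp(line):
    """Parse the leading HH:MM:SS.ffffff time, ignoring the UTC offset."""
    return datetime.strptime(line.split("+", 1)[0], "%H:%M:%S.%f")


start, end, entries = replay_range(RECOVER_LINE)
elapsed = (timestamp(NEXT_LINE) - timestamp(RECOVER_LINE)).total_seconds()

print(f"entries to replay: {entries:,}")
print(f"gap to next log line: {elapsed:.1f}s")
```

The difference between the two indexes is 23,957,850, which is where the "~24M" comes from.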

I’d be curious to know what these entries are given you say the queue is mostly empty.

Tell me more about how the consumers process messages. 

--
Karl Nilsson

Ohad Ravid

Jan 1, 2023, 5:24:01 PM
to rabbitmq-users
The queue is bound to a single exchange (topic, durable, with a routing key) that has a single producer, and the queue has a single consumer.
The consumer is using Rust + Lapin, with a prefetch of 2000 and publisher confirms (it publishes to a different exchange).

Is there any specific info about the queue I can provide?

I'm posting the UI stats from a working instance after about half a day of operation.


Screenshot 2023-01-02 at 0.12.42.png

Screenshot 2023-01-02 at 0.12.55.png
Screenshot 2023-01-02 at 0.13.08.png
Screenshot 2023-01-02 at 0.17.46.png

```
$ rabbitmq-queues quorum_status SomeQueue
Status of quorum queue SomeQueue on node rabbitmq@xxx ...
┌──────────────┬────────────┬───────────┬──────────────┬────────────────┬──────┬─────────────────┐
│ Node Name    │ Raft State │ Log Index │ Commit Index │ Snapshot Index │ Term │ Machine Version │
├──────────────┼────────────┼───────────┼──────────────┼────────────────┼──────┼─────────────────┤
│ rabbitmq@xx1 │ leader     │ 516689    │ 516689       │ 515465         │ 1    │ 3               │
├──────────────┼────────────┼───────────┼──────────────┼────────────────┼──────┼─────────────────┤
│ rabbitmq@xx3 │ follower   │ 516689    │ 516689       │ 514345         │ 1    │ 3               │
├──────────────┼────────────┼───────────┼──────────────┼────────────────┼──────┼─────────────────┤
│ rabbitmq@xx2 │ follower   │ 516689    │ 516689       │ 516563         │ 1    │ 3               │
└──────────────┴────────────┴───────────┴──────────────┴────────────────┴──────┴─────────────────┘
```
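
For comparison with the stuck queue, the number of log entries each member holds past its last snapshot can be worked out from the Log Index and Snapshot Index columns. This is only a sketch in Python, with the values copied by hand from the output above; the `members` dict and the helper name are illustrative assumptions, not RabbitMQ APIs.

```python
# (log_index, snapshot_index) per member, copied from the quorum_status output.
members = {
    "rabbitmq@xx1": (516689, 515465),
    "rabbitmq@xx3": (516689, 514345),
    "rabbitmq@xx2": (516689, 516563),
}


def entries_after_snapshot(log_index, snapshot_index):
    """Entries a member would need to replay on top of its latest snapshot."""
    return log_index - snapshot_index


lag = {node: entries_after_snapshot(*idx) for node, idx in members.items()}

for node, count in lag.items():
    print(f"{node}: {count} entries after last snapshot")
```

On this working instance every member is within a few thousand entries of its snapshot, whereas the bad queue's recovery log covered index 128548951 to 152506801.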