Quorum queue not recovering after cluster restart

Ohad Ravid


Jan 1, 2023, 12:16:40 PM

to rabbitmq-users

Hi,

We have a few clusters with 3 nodes (RabbitMQ 3.11.5, Erlang 25.2), mostly with queues and streams.

A few times in the past, after a cluster restart, we saw a single quorum queue that showed "NaN" for all its stats, and running `rabbitmq-queues quorum_status` against it timed out.

Usually the queue recovers after a few minutes, but today it failed to do so after more than 30 minutes, so we had to delete it.

We run the cluster with debug logging, and we saw the following logs on one of the nodes, which was the leader according to the management UI:

```

08:53:24.872775+00:00 [debug] <0.743.0> queue '*BAD_QUEUE*' in vhost '/': ra_log:init recovered last_index_term {152506801,14} first index 128548952
08:53:24.873471+00:00 [debug] <0.743.0> queue '*BAD_QUEUE*' in vhost '/': post_init -> recover in term: 14 machine version: 3
08:53:24.873534+00:00 [debug] <0.743.0> queue '*BAD_QUEUE*' in vhost '/': recovering state machine version 3:3 from index 128548951 to 152506801
08:54:35.333787+00:00 [debug] <0.743.0> queue '*BAD_QUEUE*' in vhost '/': enabling ra cluster changes in 14

```

(There was no "recovery of state machine ... took XXXms" line.)

The queue is usually empty, with a constant rate of 10 messages/sec in/out and a prefetch of 2000 for the single consumer using it.

Is there any way to debug this further? Is there an option that can be set to ensure commits?

Thanks,

Ohad

Karl Nilsson


Jan 1, 2023, 2:34:03 PM

to rabbitm...@googlegroups.com

It probably took a long time to recover the ~24M log entries.
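As a rough check of the ~24M figure, the index range in the "recovering state machine" log line quoted earlier can be subtracted directly. This is only a minimal sketch: the helper name `entries_to_replay` is hypothetical, and it assumes the two indexes in that line bound the range of entries replayed during recovery.

```python
import re

# Log fragment copied from the debug output posted earlier in the thread.
LOG_LINE = (
    "recovering state machine version 3:3 "
    "from index 128548951 to 152506801"
)


def entries_to_replay(line: str) -> int:
    """Return the size of the index range named in a recovery log line.

    Assumption: the range "from index A to B" is what gets replayed,
    so the count is simply B - A.
    """
    match = re.search(r"from index (\d+) to (\d+)", line)
    if match is None:
        raise ValueError("no index range found in log line")
    first, last = int(match.group(1)), int(match.group(2))
    return last - first


if __name__ == "__main__":
    print(entries_to_replay(LOG_LINE))  # 23957850
```

That is just under 24 million entries between the recovery start index and the last index.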

I’d be curious to know what these entries are given you say the queue is mostly empty.

Tell me more about how the consumers process messages.

--

Karl Nilsson

Ohad Ravid


Jan 1, 2023, 5:24:01 PM

to rabbitmq-users

The queue is bound to a single exchange (topic, durable, with a routing key, which has a single producer), and a single consumer.

The consumer is using Rust + Lapin, with 2000 prefetch & publisher confirms (it publishes to a different exchange).

Is there any specific info about the queue I can provide?

I'm posting the stats from a working instance after about half a day of operation.

```

$ rabbitmq-queues quorum_status SomeQueue
Status of quorum queue SomeQueue on node rabbitmq@xxx ...
┌──────────────┬────────────┬───────────┬──────────────┬────────────────┬──────┬─────────────────┐
│ Node Name    │ Raft State │ Log Index │ Commit Index │ Snapshot Index │ Term │ Machine Version │
├──────────────┼────────────┼───────────┼──────────────┼────────────────┼──────┼─────────────────┤
│ rabbitmq@xx1 │ leader     │ 516689    │ 516689       │ 515465         │ 1    │ 3               │
├──────────────┼────────────┼───────────┼──────────────┼────────────────┼──────┼─────────────────┤
│ rabbitmq@xx3 │ follower   │ 516689    │ 516689       │ 514345         │ 1    │ 3               │
├──────────────┼────────────┼───────────┼──────────────┼────────────────┼──────┼─────────────────┤
│ rabbitmq@xx2 │ follower   │ 516689    │ 516689       │ 516563         │ 1    │ 3               │
└──────────────┴────────────┴───────────┴──────────────┴────────────────┴──────┴─────────────────┘

```
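For comparison with the broken queue, here is a small sketch that computes, for each member in the output above, how far the log index is ahead of the snapshot index. The function name is hypothetical, and treating that gap as an approximation of what would be replayed on restart is my assumption, based only on the "recovering state machine ... from index ... to ..." log line.

```python
# (node, log index, snapshot index), copied from the quorum_status output above.
MEMBERS = [
    ("rabbitmq@xx1", 516689, 515465),
    ("rabbitmq@xx3", 516689, 514345),
    ("rabbitmq@xx2", 516689, 516563),
]


def entries_since_snapshot(members):
    """Map each node to (log index - snapshot index).

    Assumption: this gap approximates the number of log entries that
    are not yet covered by a snapshot on that member.
    """
    return {node: log_index - snapshot_index
            for node, log_index, snapshot_index in members}


if __name__ == "__main__":
    for node, gap in entries_since_snapshot(MEMBERS).items():
        print(f"{node}: {gap} entries since last snapshot")
```

On this working instance the gaps are 1224, 2344 and 126 entries, so snapshots appear to be taken regularly here, unlike on the queue that had to replay millions of entries.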
