unusable queues on one RabbitMQ cluster node after restart


Vilius Šumskas

Mar 15, 2024, 4:18:34 AM
to rabbitm...@googlegroups.com

Hello,

 

I'm searching for information to understand how some RabbitMQ internals work, so this post will be a bit long, but please bear with me.

 

We had an incident on our production cluster. Basically, we had to update our Kubernetes version, so we restarted all Kubernetes nodes one at a time. As part of this, a 3-node RabbitMQ cluster was also restarted one node at a time. The cluster is running classic mirrored queues (yes, I know, we need to migrate away from those).

 

Timeline of events

2024-03-13 04:36:47 restart initiated on node-2

2024-03-13 04:36:52 node-2 is completely down (CORRECT “Mnesia message store stopped” messages in the log)

2024-03-13 04:39:41 node-2 is starting up

2024-03-13 04:41:04 node-2 is detected as healthy

2024-03-13 04:47:35 restart initiated on node-1

2024-03-13 04:47:41 node-1 is completely down (CORRECT “Mnesia message store stopped” messages in the log)

2024-03-13 04:50:36 node-1 is starting up

2024-03-13 04:52:43 node-1 is detected as healthy

2024-03-13 04:58:27 restart initiated on node-0

2024-03-13 05:01:28 node-0 is completely down (NO “Mnesia message store stopped” log messages for some reason)

2024-03-13 05:02:24 node-0 is starting up

2024-03-13 05:11:11 node-0 is detected as healthy (note big delay in startup time of this node)

 

Problem investigation and a fix

Now the problem is that after node-0 was rebooted, some queues (~1600 out of ~8850 in total) were unusable. By unusable I mean that the RabbitMQ Management UI showed them as running only on node-0, and metrics for those queues showed NaN. Applications could not use those queues either. I tried to list all queues with “rabbitmqctl list_queues” from multiple cluster nodes, but it only returned ~7250 “good” queues.
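For reference, the listing I ran was essentially the following (a rough sketch; the column choice is mine for illustration and the available columns may differ between versions):

# columns chosen for illustration; "state", "pid" and "slave_pids" are classic-queue info items
rabbitmqctl list_queues -p / name state pid slave_pids

The ~1600 “bad” queues simply never appeared in this output, no matter which node I ran it from.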

 

I also tried deleting some “bad” queues from the management UI and recreating them. This helped: the queue was recreated on all cluster nodes again with proper leadership. Not sure if this is important, but when I deleted “bad” queues, the logs on node-0 reported:

 

2024-03-13 05:23:49.793160+00:00 [warning] <0.21270.187> Queue 0016732.1d96ff4 in vhost / has its master node is down and no mirrors available or eligible for promotion. Forcing queue deletion.

 

Since there were thousands of “bad” queues, and I didn’t find any way to delete just those, I disconnected all clients and set a global policy with expires=1 to delete all of them. Strangely enough, the policy deleted only the “good” queues.
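For completeness, the policy was set along these lines (the policy name is just something I picked; “expires” is the standard queue TTL key, in milliseconds):

# hedged sketch of the expiration policy I applied, not the exact command I typed
rabbitmqctl set_policy --apply-to queues expire-everything ".*" '{"expires":1}'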

 

Finally, I restarted node-0 with “rabbitmqctl stop_app/start_app”, waited for the expiration policy to delete all remaining queues, removed the expiration policy, and then imported all queue metadata from a JSON definitions backup. This fixed everything. Applications could connect and send messages, etc.
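Roughly, the recovery sequence was as follows (the definitions file path is a placeholder, and the policy name matches the one above):

rabbitmqctl stop_app
rabbitmqctl start_app
# ...wait for the "expires" policy to delete the remaining queues...
rabbitmqctl clear_policy expire-everything
# placeholder path; this was our own JSON definitions backup
rabbitmqctl import_definitions /path/to/definitions-backup.json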

 

Possible explanation(?)

I’m looking for anyone who knows RabbitMQ internals better to confirm or deny my assumptions about possible reasons for this to happen. I suspect that:

  1. Before the restart cycle, the cluster had queue leaders and connections unbalanced. The main indication of this is that CPU and memory usage on node-0 was much higher than on node-1 and node-2.
  2. Something bad happened on node-0 during shutdown, or maybe there was just not enough time for it to shut down (terminationGracePeriod is set to 3 minutes in Kubernetes), so the Mnesia store was corrupted.

For example, these strange log messages during the startup of node-0 indicate that something is not right:

2024-03-13 05:02:43.729093+00:00 [warning] <0.16042.6> Message store "628WB79CIFDYO9LJI6DKMI09L/msg_store_persistent": rebuilding indices from scratch

2024-03-13 05:09:27.593755+00:00 [info] <0.28149.4> Recovering 7071 queues of type rabbit_classic_queue took 408759ms

  3. Since node-0 started in some funky state, the queues I saw running only on node-0 were the ones which didn’t complete failover during node-0’s shutdown.

 

Are my assumptions correct?

 

Questions

  1. Is there a way to avoid such an incident in the future? Maybe I should set terminationGracePeriod in Kubernetes to a much larger value? If yes, should I assume that Kubernetes must not only wait for the “Mnesia message store stopped” log message but also for some kind of log message on the other nodes indicating that the queues failed over and new leaders were fully elected?
  2. Does the standard “rabbitmq-diagnostics check_running” check mirrored queue status? Is it a sufficient health check to say that the node is completely ready and the other nodes can be restarted?
  3. If such a case happens in the future, is there anything else we could do to bring those “bad” queues up without deleting all queues? If the RabbitMQ Management UI shows them and can delete them, there should be at least some kind of tool which lists them in the CLI, right?

 

 

Below I have tried to gather all relevant detailed information from the logs:

https://p.defau.lt/?bJ0qGB_Tyjd_XUtqqseW6Q – node-0 logs

https://p.defau.lt/?BGbVxmarE1e3_aob00pJmw – node-1 logs

https://p.defau.lt/?69tXbxbHbm1D_6iO6qWreQ – node-2 logs

 

 

--

   Best Regards,

 

    Vilius Šumskas

    Rivile

    IT manager

    +370 614 75713

 

Michal Kuratczyk

Mar 15, 2024, 5:32:51 AM
to rabbitm...@googlegroups.com
Take a look at how the https://github.com/rabbitmq/cluster-operator deploys RabbitMQ. While I can't guarantee nothing like that would happen,
there are some precautions against such issues. Specifically, it sets a preStop hook that calls
`rabbitmq-upgrade await_online_quorum_plus_one` and `rabbitmq-upgrade await_online_synchronized_mirror`,
with a very long timeout.
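From memory, the hook boils down to something like this (paraphrased; the exact flags and timeout value are best checked in the operator source):

# sketch from memory, not verbatim from the operator
rabbitmq-upgrade await_online_quorum_plus_one -t 604800
rabbitmq-upgrade await_online_synchronized_mirror -t 604800

604800 seconds is one week, which is the "very long timeout" I mentioned.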

Best,



--
Michal
RabbitMQ Team


Vilius Šumskas

Mar 15, 2024, 6:30:18 AM
to rabbitm...@googlegroups.com

Thank you for your response, Michal!

 

By this, should I assume that the rabbitmq-upgrade utility can be used for normal restart operations which don't necessarily upgrade the RabbitMQ version?

 

I suppose for classic mirrored queues I should use “rabbitmq-upgrade await_online_synchronized_mirror”, but I didn’t find anything about it in the documentation. Can you confirm this?

I also see that the cluster operator uses “rabbitmq-upgrade drain” in the preStop hook. Do I need it? I thought this was handled automatically, since RabbitMQ handles SIGTERM gracefully, no?

 

In addition, I found this piece of documentation in the operator’s repository https://github.com/rabbitmq/cluster-operator/blob/52540b600e47be237bc2587682bb37c508c82d51/docs/design/20200520-graceful-pod-termination.md which mentions a terminationGracePeriod of one week. Isn’t this too much? I mean, if nodes are not synced within a few hours or even a day, they will probably need to be terminated anyway. I’m just looking for a proper graceful termination timeout value. Maybe there is a formula for how the timeout value should depend on the number of queues?

 

--

    Vilius

Michal Kuratczyk

Mar 15, 2024, 6:53:39 AM
to rabbitm...@googlegroups.com
Yes, preStop is executed on every pod stop, not just upgrade.
Yes, await_online_synchronized_mirror is for mirrored queues.
Draining a node shouldn't hurt. Whether it's necessary is an open discussion. It might become a no-op at some point but who knows.
The one-week timeout is an intentional default - it's basically a way to say "if we'd lose quorum without this node, better keep it around" and force the admin to do something about it.
You may choose a different setting, but as the default we decided to go for safety - we won't stop the node if that'd stop the whole cluster (minority / loss of quorum).



Vilius Šumskas

Mar 15, 2024, 7:27:58 AM
to rabbitm...@googlegroups.com

Digging more deeply into the code through git blame, I found the rationale behind maintenance mode here: https://github.com/rabbitmq/cluster-operator/issues/320

I will add a preStop hook to our RabbitMQ statefulset accordingly.

 

The only question left now is whether my assumptions about why those unusable queues appeared could be correct, and whether there is anything else we could do to resolve it if the issue happens again.

Michal Kuratczyk

Mar 15, 2024, 7:41:35 AM
to rabbitm...@googlegroups.com
The rationale has changed over time, and what operations maintenance mode performs has changed as well.
Opinions vary on whether it should even exist and what it should do.

As for what could have triggered this - I honestly don't know much about classic queue mirroring (when I joined the team,
quorum queues were already under development with a clear goal to replace mirroring).
As far as I'm concerned, no one should be using mirroring at this point...

Best,

Vilius Šumskas

Mar 15, 2024, 8:04:36 AM
to rabbitm...@googlegroups.com

Understood. Anyway, I will add maintenance mode to the preStop hook to match what the Cluster Operator is using, and will keep an eye on whether anything changes in the operator code.

 

Thank you again for your answers, Michal! I will go back to lurking now, in the hope that maybe Luke or somebody else will have more thoughts regarding the last question.

 

--

   Best Regards,

Vilius Šumskas

Mar 19, 2024, 8:47:53 AM
to rabbitm...@googlegroups.com

Hello,

 

just wanted to confirm that I correctly understand the detailed intent of the RabbitMQ Cluster Operator's preStop hook. I see that currently it calls:

“rabbitmq-upgrade await_online_quorum_plus_one; rabbitmq-upgrade await_online_synchronized_mirror; rabbitmq-upgrade drain”

 

Shouldn’t this be:

“rabbitmq-upgrade await_online_quorum_plus_one && rabbitmq-upgrade await_online_synchronized_mirror && rabbitmq-upgrade drain” ?

 

I.e. the drain should not happen unless synchronization has really finished successfully?

 

--

   Best Regards,

 

    Vilius Šumskas

    Rivile

    IT manager

 

Luke Bakken

Mar 19, 2024, 9:57:13 AM
to rabbitmq-users
Hello,

That does look like a problem. Please feel free to open an issue in the repository -

Vilius Šumskas

Mar 19, 2024, 10:04:18 AM
to rabbitm...@googlegroups.com