Hello,
I’m looking for information to understand how some RabbitMQ internals work, so this post will be a bit long, but please bear with me.
We had an incident on our production cluster. We had to update our Kubernetes version, so we restarted all Kubernetes nodes one at a time. As a result, our 3-node RabbitMQ cluster was also restarted one node at a time. The cluster is running classic mirrored queues (yes, I know, we need to migrate away from those).
Timeline of events
2024-03-13 04:36:47 restart initiated on node-2
2024-03-13 04:36:52 node-2 is completely down (the expected “Mnesia message store stopped” messages are present in the log)
2024-03-13 04:39:41 node-2 is starting up
2024-03-13 04:41:04 node-2 is detected as healthy
2024-03-13 04:47:35 restart initiated on node-1
2024-03-13 04:47:41 node-1 is completely down (the expected “Mnesia message store stopped” messages are present in the log)
2024-03-13 04:50:36 node-1 is starting up
2024-03-13 04:52:43 node-1 is detected as healthy
2024-03-13 04:58:27 restart initiated on node-0
2024-03-13 05:01:28 node-0 is completely down (no “Mnesia message store stopped” messages in the log, for some reason)
2024-03-13 05:02:24 node-0 is starting up
2024-03-13 05:11:11 node-0 is detected as healthy (note the much longer startup time of this node)
Problem investigation and a fix
Now, the problem is that after node-0 was rebooted, some queues (~1600 out of ~8850 in total) were unusable. By unusable I mean that the RabbitMQ Management UI showed them as running only on node-0, and metrics for those queues showed NaN. Applications could not use those queues either. I tried to list all queues with “rabbitmqctl list_queues” from multiple cluster nodes, but it only returned the ~7250 “good” queues.
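For reference, this is roughly the kind of listing I ran on each node (a sketch; the exact columns are just an example, and the “bad” queues were simply missing from the output):

    # state, pid and synchronised_slave_pids show where the queue process
    # and its in-sync mirrors live for classic mirrored queues
    rabbitmqctl list_queues -p / name state pid synchronised_slave_pids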
I also tried to delete some “bad” queues from the management UI and recreate them. This helped: the queue was recreated on all cluster nodes again, with proper leadership. Not sure if this is important, but when I deleted the “bad” queues, the logs on node-0 reported:
2024-03-13 05:23:49.793160+00:00 [warning] <0.21270.187> Queue 0016732.1d96ff4 in vhost / has its master node is down and no mirrors available or eligible for promotion. Forcing queue deletion.
Since there were thousands of “bad” queues, and I didn’t find any way to delete just those, I disconnected all clients and set a global expires=1 policy to delete all of them. Strangely enough, the policy deleted only the “good” queues.
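Something along these lines (a sketch; the policy name is a placeholder and the default vhost is assumed):

    # apply a 1 ms queue TTL to every queue in the vhost, so any queue
    # without consumers or recent use is deleted almost immediately
    rabbitmqctl set_policy -p / expire-everything ".*" '{"expires":1}' --apply-to queues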
Finally, I restarted node-0 with “rabbitmqctl stop_app/start_app”, waited for the expiration policy to delete all remaining queues, removed the expiration policy, and then imported all queue metadata from a JSON definitions backup. This fixed everything: applications could connect and send messages again.
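Roughly this sequence (a sketch; the policy name and file path are placeholders):

    # on node-0
    rabbitmqctl stop_app
    rabbitmqctl start_app
    # once the remaining queues have expired, drop the temporary policy
    rabbitmqctl clear_policy -p / expire-everything
    # re-create queues, exchanges and bindings from the definitions backup
    rabbitmqctl import_definitions /path/to/definitions-backup.json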
Possible explanation(?)
I’m looking for anyone who knows RabbitMQ internals better to confirm or deny my assumptions about possible reasons for this to happen. My suspicion is that node-0 did not shut down cleanly (note the missing “Mnesia message store stopped” messages above), and that during its long recovery on startup the queues mastered on node-0 were left without mirrors eligible for promotion.
For example, these strange log messages during the startup of node-0 indicate that something was not right:
2024-03-13 05:02:43.729093+00:00 [warning] <0.16042.6> Message store "628WB79CIFDYO9LJI6DKMI09L/msg_store_persistent": rebuilding indices from scratch
2024-03-13 05:09:27.593755+00:00 [info] <0.28149.4> Recovering 7071 queues of type rabbit_classic_queue took 408759ms
Are my assumptions correct?
Questions
Below I have tried to gather all the relevant detailed information from the logs:
https://p.defau.lt/?bJ0qGB_Tyjd_XUtqqseW6Q – node-0 logs
https://p.defau.lt/?BGbVxmarE1e3_aob00pJmw – node-1 logs
https://p.defau.lt/?69tXbxbHbm1D_6iO6qWreQ – node-2 logs
--
Best Regards,
Vilius Šumskas
Rivile
IT manager
Thank you for your response, Michal!
By this, should I assume that the rabbitmq-upgrade utility can also be used for normal restart operations that don’t necessarily upgrade the RabbitMQ version?
I suppose for classic mirrored queues I should use “rabbitmq-upgrade await_online_synchronized_mirror”, but I didn’t find anything about it in the documentation. Can you confirm this?
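As I understand it, that command would wait for roughly the same condition I could check manually with something like this (a sketch, default vhost assumed):

    # for classic mirrored queues, synchronised_slave_pids lists the mirrors
    # that are fully in sync and therefore eligible for promotion
    rabbitmqctl list_queues -p / name slave_pids synchronised_slave_pids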
I also see that the Cluster Operator uses “rabbitmq-upgrade drain” in a preStop hook. Do I need it? I thought this was handled automatically, since RabbitMQ handles SIGTERM gracefully, no?
In addition, I found this piece of documentation in the operator’s repository https://github.com/rabbitmq/cluster-operator/blob/52540b600e47be237bc2587682bb37c508c82d51/docs/design/20200520-graceful-pod-termination.md which mentions a terminationGracePeriod of one week. Isn’t that too much? I mean, if nodes are not synced within a few hours or even one day, they will probably need to be terminated anyway. I’m just looking for a proper termination grace period value. Maybe there is a formula for how the timeout value should depend on the number of queues?
--
Vilius
Digging more deeply into the code with git blame, I found the rationale behind maintenance mode here: https://github.com/rabbitmq/cluster-operator/issues/320
I will add a preStop hook to our RabbitMQ StatefulSet accordingly.
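Something like this is what I have in mind (a sketch only; the container name, grace period value and bash wrapper are my assumptions, not necessarily what the operator generates):

    # excerpt from the StatefulSet pod template
    spec:
      terminationGracePeriodSeconds: 604800  # placeholder: one week, as in the operator docs
      containers:
        - name: rabbitmq
          lifecycle:
            preStop:
              exec:
                command:
                  - /bin/bash
                  - -c
                  - >-
                    rabbitmq-upgrade await_online_quorum_plus_one;
                    rabbitmq-upgrade await_online_synchronized_mirror;
                    rabbitmq-upgrade drain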
The only question left now is whether my assumptions about why those unusable queues appeared could be correct, and whether there is anything else we could do to resolve it if the issue happens again?
Understood. Anyway, I will add maintenance mode to the preStop hook to match what the Cluster Operator is using, and will keep an eye on whether anything changes in the operator code.
Thank you again for your answers, Michal! I will go back to lurking now, in the hope that maybe Luke or somebody else will have more thoughts on the last question.
--
Best Regards,
Vilius
Hello,
I just wanted to confirm whether I correctly understand the intent of the RabbitMQ Cluster Operator’s preStop hook. I see that it currently calls:
“rabbitmq-upgrade await_online_quorum_plus_one; rabbitmq-upgrade await_online_synchronized_mirror; rabbitmq-upgrade drain”
Shouldn’t this be:
“rabbitmq-upgrade await_online_quorum_plus_one && rabbitmq-upgrade await_online_synchronized_mirror && rabbitmq-upgrade drain” ?
I.e., the drain should not happen unless synchronization has actually finished successfully?
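Just to spell out my understanding of the difference:

    # with ';' each command runs regardless of the previous exit code,
    # so 'drain' would still run even if a sync check fails or times out
    rabbitmq-upgrade await_online_synchronized_mirror; rabbitmq-upgrade drain

    # with '&&' the chain stops at the first non-zero exit code,
    # so 'drain' only runs once the checks have actually succeeded
    rabbitmq-upgrade await_online_synchronized_mirror && rabbitmq-upgrade drain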
--
Best Regards,
Vilius Šumskas
Rivile
IT manager