Hello RabbitMQ users,
We have the following issue in our project.
We are running a dockerized RabbitMQ 3.11.19 with Erlang 25.3.2.2 in a three-node cluster configuration.
Recently we performed a fresh cluster installation on one of our systems, but one of the nodes could not start and got stuck in a restart loop.
The error in the logs that causes the boot failure is this:
May 17, 2024 @ 02:25:04.035 +00:00 2024-05-17 02:25:04.035184+00:00 [error] <0.234.0>
May 17, 2024 @ 02:25:04.035 +00:00 2024-05-17 02:25:04.034917+00:00 [error] <0.234.0> Found lock file at /var/data/schema_upgrade_lock.
May 17, 2024 @ 02:25:04.035 +00:00 2024-05-17 02:25:04.034917+00:00 [error] <0.234.0> Database backup path: /var/data-upgrade-backup
May 17, 2024 @ 02:25:04.035 +00:00 2024-05-17 02:25:04.034917+00:00 [error] <0.234.0> Either previous upgrade is in progress or has failed.
May 17, 2024 @ 02:25:04.035 +00:00
May 17, 2024 @ 02:25:04.035 +00:00 BOOT FAILED
May 17, 2024 @ 02:25:04.035 +00:00 Error during startup: {error,previous_upgrade_failed}
May 17, 2024 @ 02:25:04.036 +00:00
It seems the initial upgrade failed and, for some reason, the upgrade lock file was never removed.
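For anyone who wants to check the same state on their own node, here is a minimal sketch of what we look at inside the container. It only tests for the lock file and the backup directory named in the log above. The function name `inspect_upgrade_state` is ours, and the layout (lock file inside the data directory, backup directory next to it with an `-upgrade-backup` suffix) is an assumption taken from our log lines, not from RabbitMQ documentation.

```python
from pathlib import Path


def inspect_upgrade_state(data_dir: str) -> dict:
    """Report whether the upgrade lock file and the backup directory exist.

    Assumes the layout seen in our logs: <data_dir>/schema_upgrade_lock
    and a sibling directory named <data_dir>-upgrade-backup.
    """
    data = Path(data_dir)
    lock_file = data / "schema_upgrade_lock"
    # Sibling of the data directory, e.g. /var/data -> /var/data-upgrade-backup
    backup_dir = data.with_name(data.name + "-upgrade-backup")
    return {
        "lock_file": str(lock_file),
        "lock_present": lock_file.exists(),
        "backup_dir": str(backup_dir),
        "backup_present": backup_dir.is_dir(),
    }
```

On our node this would be called as `inspect_upgrade_state("/var/data")`. It is read-only and does not remove anything.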
Looking further back in the logs, we saw that the per-vhost message store upgrade step failed with this error:
May 17, 2024 @ 02:24:41.702 +00:00 2024-05-17 02:24:41.693240+00:00 [info] <0.229.0> message_store upgrades: Applying rabbit_variable_queue:move_messages_to_vhost_store
May 17, 2024 @ 02:24:41.702 +00:00 2024-05-17 02:24:41.694818+00:00 [error] <0.229.0> Queue index directory '/var/data/queues/A3D5GXC8PA3TMSF0RIAC3JGEJ' not found for queue '***' in vhost '/'
May 17, 2024 @ 02:24:41.702 +00:00 2024-05-17 02:24:41.700471+00:00 [error] <0.924.0> initial call: rabbit_msg_store:init/1
May 17, 2024 @ 02:24:41.702 +00:00 2024-05-17 02:24:41.700471+00:00 [error] <0.924.0> {gen_server,call,
May 17, 2024 @ 02:24:41.702 +00:00 2024-05-17 02:24:41.700471+00:00 [error] <0.924.0> registered_name: []
May 17, 2024 @ 02:24:41.702 +00:00 2024-05-17 02:24:41.700471+00:00 [error] <0.924.0> {join_local,
May 17, 2024 @ 02:24:41.702 +00:00 2024-05-17 02:24:41.700471+00:00 [error] <0.924.0> msg_store_persistent},
May 17, 2024 @ 02:24:41.702 +00:00 2024-05-17 02:24:41.700471+00:00 [error] <0.924.0> <0.924.0>},
May 17, 2024 @ 02:24:41.702 +00:00 2024-05-17 02:24:41.700471+00:00 [error] <0.924.0> crasher:
May 17, 2024 @ 02:24:41.702 +00:00 2024-05-17 02:24:41.700471+00:00 [error] <0.924.0> exception exit: {noproc,
May 17, 2024 @ 02:24:41.702 +00:00 2024-05-17 02:24:41.693659+00:00 [info] <0.229.0> message_store upgrades: Moving messages to per-vhost message store
To me it looks like an unhandled race condition in the per-vhost message store migration caused the upgrade to crash, leaving the lock file in place.
But what could cause the "Queue index directory '/var/data/queues/A3D5GXC8PA3TMSF0RIAC3JGEJ' not found for queue '***' in vhost '/'" error in the first place?
Could it be that the queue was being declared at the same time the migration was running? Would deferring queue declaration from clients help?
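To make the last question concrete, this is roughly what we mean by deferring declaration on the client side. It is only a sketch: `is_node_ready` and `declare_queues` are hypothetical callables that our clients would have to supply (for example a health check against the node and the actual AMQP queue declarations), and we do not know yet whether this would avoid the failure.

```python
import time
from typing import Callable


def declare_when_ready(
    is_node_ready: Callable[[], bool],
    declare_queues: Callable[[], None],
    timeout_s: float = 60.0,
    interval_s: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> bool:
    """Poll a readiness check and declare queues only once it passes.

    Returns True if the queues were declared, False if the timeout
    expired before the node reported ready.
    """
    deadline = clock() + timeout_s
    while True:
        if is_node_ready():
            # Node reports ready: only now do we touch queue state.
            declare_queues()
            return True
        if clock() >= deadline:
            return False
        sleep(interval_s)
```

The `sleep` and `clock` parameters exist only so the loop can be exercised without real waiting.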
Any ideas or help would be greatly appreciated,
Radu.