RabbitMQ node in endless restart loop after fresh installation


Radu Marian

Jun 3, 2024, 2:20:13 AM
to rabbitmq-users
Hello rabbitmq users,

We have the following issue in our project.

We are using a dockerized RabbitMQ 3.11.19 with Erlang 25.3.2.2 in a three-node cluster configuration.

Recently we performed a fresh cluster installation on one of our systems, but one of the nodes could not start up and was stuck in a restart loop.

The error we saw in the logs causing the boot failure is this:

May 17, 2024 @ 02:25:04.035 +00:00 2024-05-17 02:25:04.035184+00:00 [error] <0.234.0>

May 17, 2024 @ 02:25:04.035 +00:00 2024-05-17 02:25:04.034917+00:00 [error] <0.234.0> Found lock file at /var/data/schema_upgrade_lock.

May 17, 2024 @ 02:25:04.035 +00:00 2024-05-17 02:25:04.034917+00:00 [error] <0.234.0>             Database backup path: /var/data-upgrade-backup

May 17, 2024 @ 02:25:04.035 +00:00 2024-05-17 02:25:04.034917+00:00 [error] <0.234.0>             Either previous upgrade is in progress or has failed.

May 17, 2024 @ 02:25:04.035 +00:00

May 17, 2024 @ 02:25:04.035 +00:00 BOOT FAILED

May 17, 2024 @ 02:25:04.035 +00:00 Error during startup: {error,previous_upgrade_failed}

May 17, 2024 @ 02:25:04.036 +00:00

It seems the initial upgrade failed and the upgrade lock file was somehow not removed.

Looking further back in the logs, we saw that the per-vhost message store upgrade step failed with this error:

May 17, 2024 @ 02:24:41.702 +00:00 2024-05-17 02:24:41.693240+00:00 [info] <0.229.0> message_store upgrades: Applying rabbit_variable_queue:move_messages_to_vhost_store

May 17, 2024 @ 02:24:41.702 +00:00 2024-05-17 02:24:41.694818+00:00 [error] <0.229.0> Queue index directory '/var/data/queues/A3D5GXC8PA3TMSF0RIAC3JGEJ' not found for queue '***' in vhost '/'

May 17, 2024 @ 02:24:41.702 +00:00 2024-05-17 02:24:41.700471+00:00 [error] <0.924.0>     initial call: rabbit_msg_store:init/1

May 17, 2024 @ 02:24:41.702 +00:00 2024-05-17 02:24:41.700471+00:00 [error] <0.924.0>                         {gen_server,call,

May 17, 2024 @ 02:24:41.702 +00:00 2024-05-17 02:24:41.700471+00:00 [error] <0.924.0>     registered_name: []

May 17, 2024 @ 02:24:41.702 +00:00 2024-05-17 02:24:41.700471+00:00 [error] <0.924.0>                              {join_local,

May 17, 2024 @ 02:24:41.702 +00:00 2024-05-17 02:24:41.700471+00:00 [error] <0.924.0>                                      msg_store_persistent},

May 17, 2024 @ 02:24:41.702 +00:00 2024-05-17 02:24:41.700471+00:00 [error] <0.924.0>                                  <0.924.0>},

May 17, 2024 @ 02:24:41.702 +00:00 2024-05-17 02:24:41.700471+00:00 [error] <0.924.0>   crasher:

May 17, 2024 @ 02:24:41.702 +00:00 2024-05-17 02:24:41.700471+00:00 [error] <0.924.0>     exception exit: {noproc,

May 17, 2024 @ 02:24:41.702 +00:00 2024-05-17 02:24:41.693659+00:00 [info] <0.229.0> message_store upgrades: Moving messages to per-vhost message store

To me it looks like an unhandled race condition in the vhost message store migration caused the upgrade to crash, leaving the lock file in place.

But what could cause the error "Queue index directory '/var/data/queues/A3D5GXC8PA3TMSF0RIAC3JGEJ' not found for queue '***' in vhost '/'" in the first place?

Could it be that the queue was being declared while the migration was running? Would deferring queue declaration from clients help?

Any ideas/help is greatly appreciated,
Radu.

Luke Bakken

Jun 3, 2024, 8:52:22 AM
to rabbitmq-users
Hello,

Thanks for using RabbitMQ. Please note that our community support policy excludes providing assistance for out-of-support versions of RabbitMQ (https://github.com/rabbitmq/rabbitmq-server/blob/main/COMMUNITY_SUPPORT.md#who-is-eligible-for-community-support).

Having said that, if this truly were a fresh installation, there would not be a lock file, nor a queue index issue.

My guess is that you are using stale data from a previous installation somehow. Check your volumes.
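
One way to check a volume before first boot could look like the sketch below. It is only an illustration, not part of RabbitMQ: the helper names are hypothetical, and it assumes the data directory is /var/data and the lock file is named schema_upgrade_lock, as in the log lines quoted above. On a truly fresh installation the directory should be missing or empty.

```python
from pathlib import Path
import sys

# Assumption: lock file name taken from the log message quoted in this thread.
LOCK_FILE_NAME = "schema_upgrade_lock"


def find_stale_data(data_dir):
    """Return the sorted names of entries already present in data_dir.

    An empty list means the directory is missing or empty, which is what
    a fresh installation should look like before the node's first boot.
    """
    root = Path(data_dir)
    if not root.is_dir():
        return []
    return sorted(entry.name for entry in root.iterdir())


def has_upgrade_lock(data_dir):
    """Return True if a leftover upgrade lock file exists in data_dir."""
    return (Path(data_dir) / LOCK_FILE_NAME).is_file()


if __name__ == "__main__":
    # Assumption: /var/data is the node's data directory (from the logs).
    target = sys.argv[1] if len(sys.argv) > 1 else "/var/data"
    leftovers = find_stale_data(target)
    if leftovers:
        print(f"{target} is not empty: {leftovers}")
    if has_upgrade_lock(target):
        print(f"Found leftover {LOCK_FILE_NAME} in {target}")
    if not leftovers:
        print(f"{target} looks clean")
```

Run it against the mounted volume of each node before starting the container; any output other than "looks clean" suggests data left over from an earlier deployment.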
