RabbitMQ upgrade from 3.13.3 to 4.0.3 is failing


Vishnu Vardhan

Nov 6, 2024, 7:55:21 PM
to rabbitmq-users
Hi team,

I am currently running rabbitmq-3.13.3 with erlang-26.2.5 and want to upgrade to rabbitmq-4.0.3 with erlang-26.2.5.5.
The documentation (https://www.rabbitmq.com/docs/upgrade) says that all feature flags must be enabled before performing an upgrade, so I enabled all the available feature flags:

bash-4.4$ rabbitmqctl list_feature_flags
Listing feature flags ...
name    state
classic_mirrored_queue_version  enabled
classic_queue_type_delivery_support     enabled
detailed_queues_endpoint        enabled
direct_exchange_routing_v2      enabled
drop_unroutable_metric  enabled
empty_basic_get_metric  enabled
feature_flags_v2        enabled
implicit_default_bindings       enabled
khepri_db       enabled
listener_records_in_ets enabled
maintenance_mode_status enabled
message_containers      enabled
message_containers_deaths_v2    enabled
quorum_queue    enabled
quorum_queue_non_voters enabled
restart_streams enabled
stream_filtering        enabled
stream_queue    enabled
stream_sac_coordinator_unblock_group    enabled
stream_single_active_consumer   enabled
stream_update_config_command    enabled
tracking_records_in_ets enabled
user_limits     enabled
virtual_host_metadata   enabled

I then performed the rolling upgrade, but the pod restart is failing with this error in the logs:

BOOT FAILED
2024-11-07 00:39:14.834736+00:00 [error] <0.255.0>
2024-11-07 00:39:14.834736+00:00 [error] <0.255.0> BOOT FAILED
2024-11-07 00:39:14.834736+00:00 [error] <0.255.0> ===========
2024-11-07 00:39:14.834736+00:00 [error] <0.255.0> Error during startup: {error,{user_already_exists,<<"defaultuser">>}}
2024-11-07 00:39:14.834736+00:00 [error] <0.255.0>
===========
Error during startup: {error,{user_already_exists,<<"defaultuser">>}}

2024-11-07 00:39:15.836312+00:00 [info] <0.410.0> Virtual host '/' is stopping
2024-11-07 00:39:15.836509+00:00 [info] <0.436.0> Closing all connections in vhost '/' on node 'rabbit@rabbit-crmq-2' because the vhost is stopping
2024-11-07 00:39:15.836660+00:00 [info] <0.422.0> Stopping message store for directory '/var/lib/rabbitmq/mnesia/rabbit@rabbit-crmq-2/msg_stores/vhosts/628WB79CIFDYO9LJI6DKMI09L/msg_store_persistent'
2024-11-07 00:39:15.844437+00:00 [info] <0.422.0> Message store for directory '/var/lib/rabbitmq/mnesia/rabbit@rabbit-crmq-2/msg_stores/vhosts/628WB79CIFDYO9LJI6DKMI09L/msg_store_persistent' is stopped
2024-11-07 00:39:15.844624+00:00 [info] <0.418.0> Stopping message store for directory '/var/lib/rabbitmq/mnesia/rabbit@rabbit-crmq-2/msg_stores/vhosts/628WB79CIFDYO9LJI6DKMI09L/msg_store_transient'
2024-11-07 00:39:15.836312+00:00 [error] <0.254.0>   crasher:
2024-11-07 00:39:15.836312+00:00 [error] <0.254.0>     initial call: application_master:init/4
2024-11-07 00:39:15.836312+00:00 [error] <0.254.0>     pid: <0.254.0>
2024-11-07 00:39:15.836312+00:00 [error] <0.254.0>     registered_name: []
2024-11-07 00:39:15.836312+00:00 [error] <0.254.0>     exception exit: {{user_already_exists,<<"defaultuser">>},
2024-11-07 00:39:15.836312+00:00 [error] <0.254.0>                      {rabbit,start,[normal,[]]}}
2024-11-07 00:39:15.836312+00:00 [error] <0.254.0>       in function  application_master:init/4 (application_master.erl, line 142)
2024-11-07 00:39:15.836312+00:00 [error] <0.254.0>     ancestors: [<0.253.0>]
2024-11-07 00:39:15.836312+00:00 [error] <0.254.0>     message_queue_len: 1
2024-11-07 00:39:15.836312+00:00 [error] <0.254.0>     messages: [{'EXIT',<0.255.0>,normal}]
2024-11-07 00:39:15.836312+00:00 [error] <0.254.0>     links: [<0.253.0>,<0.44.0>]
2024-11-07 00:39:15.836312+00:00 [error] <0.254.0>     dictionary: []
2024-11-07 00:39:15.836312+00:00 [error] <0.254.0>     trap_exit: true
2024-11-07 00:39:15.836312+00:00 [error] <0.254.0>     status: running
2024-11-07 00:39:15.836312+00:00 [error] <0.254.0>     heap_size: 2586
2024-11-07 00:39:15.836312+00:00 [error] <0.254.0>     stack_size: 28
2024-11-07 00:39:15.836312+00:00 [error] <0.254.0>     reductions: 176
2024-11-07 00:39:15.836312+00:00 [error] <0.254.0>   neighbours:
2024-11-07 00:39:15.836312+00:00 [error] <0.254.0>
2024-11-07 00:39:15.847420+00:00 [notice] <0.44.0> Application rabbit exited with reason: {{user_already_exists,<<"defaultuser">>},{rabbit,start,[normal,[]]}}
Runtime terminating during boot (terminating)

Please clarify the issue and what can be done to resolve it.
Thanks in advance.

Michal Kuratczyk

Nov 7, 2024, 2:57:01 AM
to rabbitm...@googlegroups.com
I'm afraid you enabled too many feature flags. Khepri was only experimental in 3.13, there is a warning about this:
https://www.rabbitmq.com/docs/upgrade#rabbitmq-version-upgradability

`rabbitmqctl enable_feature_flag all` does not enable experimental flags, so you must have done that explicitly.
In more recent 3.13 patch releases we made it harder to accidentally enable experimental feature flags (not only does it need to
be done explicitly, but an additional command-line flag is required and there are warnings in the Management UI).
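
As a rough illustration of the check described above, the following Python sketch parses `rabbitmqctl list_feature_flags` output in the two-column format shown earlier in this thread and reports experimental flags that are already enabled. The names `EXPERIMENTAL_IN_3_13`, `parse_feature_flags` and `blocking_flags` are hypothetical, and the set of experimental flags is an assumption based only on what this thread says about 3.13 (`khepri_db`); it is not a list taken from RabbitMQ itself.

```python
# Hypothetical pre-upgrade check. EXPERIMENTAL_IN_3_13 is an assumption
# based on this thread, not a list provided by RabbitMQ.
EXPERIMENTAL_IN_3_13 = {"khepri_db"}


def parse_feature_flags(output: str) -> dict[str, str]:
    """Parse `rabbitmqctl list_feature_flags` output into {name: state}."""
    flags = {}
    for line in output.splitlines():
        parts = line.split()
        # Keep only "<name> <state>" rows; this skips the banner and header.
        if len(parts) == 2 and parts[1] in ("enabled", "disabled"):
            flags[parts[0]] = parts[1]
    return flags


def blocking_flags(output: str) -> list[str]:
    """Return experimental flags that are enabled in the given output."""
    flags = parse_feature_flags(output)
    return sorted(
        name
        for name, state in flags.items()
        if name in EXPERIMENTAL_IN_3_13 and state == "enabled"
    )


sample = """Listing feature flags ...
name    state
khepri_db       enabled
quorum_queue    enabled
stream_queue    enabled
"""
print(blocking_flags(sample))
```

A non-empty result on a 3.13.x node would correspond to the situation described in this reply: an experimental flag was enabled explicitly.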

I'm afraid your cluster is currently:
* using an experimental feature (Khepri is considered stable in 4.0 but not in 3.13)
* not upgradable to 4.0
* stuck on Khepri: there's no option to "disable Khepri" (to go back to Mnesia)

Your options include:
* or some manual migration (export definitions, consume messages, delete the cluster, deploy a new cluster, import definitions or something along those lines)
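
The export and import steps of such a manual migration could be scripted roughly as follows. This is only a sketch: `migration_commands`, `run_step` and the path `/tmp/definitions.json` are hypothetical, and the consume, delete and deploy steps are deliberately left out because they depend on the environment. Definitions do not contain messages, which is why the messages have to be consumed separately.

```python
import subprocess


def migration_commands(definitions_file: str) -> dict[str, list[str]]:
    """Build the rabbitmqctl invocations for the export and import steps."""
    return {
        # Run against the old cluster before it is deleted.
        "export": ["rabbitmqctl", "export_definitions", definitions_file],
        # Run against the freshly deployed cluster.
        "import": ["rabbitmqctl", "import_definitions", definitions_file],
    }


def run_step(step: str, definitions_file: str, dry_run: bool = True) -> list[str]:
    """Print the command for a step and execute it unless dry_run is set."""
    cmd = migration_commands(definitions_file)[step]
    print(" ".join(cmd))
    if not dry_run:
        # check=True raises CalledProcessError if rabbitmqctl fails.
        subprocess.run(cmd, check=True)
    return cmd


# Dry run: only prints the command that would be executed.
run_step("export", "/tmp/definitions.json")
```

Keeping `dry_run=True` as the default makes it possible to review the commands before anything touches a cluster.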

Best,

--
Michal
RabbitMQ Team

Vishnu Vardhan

Nov 7, 2024, 4:32:27 AM
to rabbitmq-users
Thanks for the reply.
Yes, I had explicitly enabled the Khepri feature flag in the previous deployment.

This time I did not enable it in rabbitmq-3.13.3 and upgraded to rabbitmq-4.0.3. The upgrade went fine and the application was stable.
Now when I roll back to rabbitmq-3.13.3, the pod restart fails with this error:


BOOT FAILED
===========
Exception during startup:

2024-11-07 09:20:06.220832+00:00 [error] <0.255.0>
throw:{timeout,{rabbitmq_metadata,'rabbit@rabbit-crmq-2'}}
2024-11-07 09:20:06.220832+00:00 [error] <0.255.0> BOOT FAILED
2024-11-07 09:20:06.220832+00:00 [error] <0.255.0> ===========
2024-11-07 09:20:06.220832+00:00 [error] <0.255.0> Exception during startup:
2024-11-07 09:20:06.220832+00:00 [error] <0.255.0>
2024-11-07 09:20:06.220832+00:00 [error] <0.255.0> throw:{timeout,{rabbitmq_metadata,'rabbit@rabbit-crmq-2'}}
2024-11-07 09:20:06.220832+00:00 [error] <0.255.0>
2024-11-07 09:20:06.220832+00:00 [error] <0.255.0>     rabbit_khepri:-register_projections/0-lc$^9/1-0-/1, line 1078
2024-11-07 09:20:06.220832+00:00 [error] <0.255.0>     rabbit_khepri:register_projections/0, line 1079
2024-11-07 09:20:06.220832+00:00 [error] <0.255.0>     rabbit_khepri:setup/1, line 255
2024-11-07 09:20:06.220832+00:00 [error] <0.255.0>     rabbit:run_prelaunch_second_phase/0, line 379
2024-11-07 09:20:06.220832+00:00 [error] <0.255.0>     rabbit:start/2, line 893
2024-11-07 09:20:06.220832+00:00 [error] <0.255.0>     application_master:start_it_old/4, line 293
2024-11-07 09:20:06.220832+00:00 [error] <0.255.0>

    rabbit_khepri:-register_projections/0-lc$^9/1-0-/1, line 1078
    rabbit_khepri:register_projections/0, line 1079
    rabbit_khepri:setup/1, line 255
    rabbit:run_prelaunch_second_phase/0, line 379
    rabbit:start/2, line 893
    application_master:start_it_old/4, line 293

2024-11-07 09:20:07.222484+00:00 [error] <0.254.0>   crasher:

Runtime terminating during boot (terminating)
2024-11-07 09:20:07.222484+00:00 [error] <0.254.0>     initial call: application_master:init/4
2024-11-07 09:20:07.222484+00:00 [error] <0.254.0>     pid: <0.254.0>
2024-11-07 09:20:07.222484+00:00 [error] <0.254.0>     registered_name: []
2024-11-07 09:20:07.222484+00:00 [error] <0.254.0>     exception exit: {{timeout,{rabbitmq_metadata,'rabbit@rabbit-crmq-2'}},
2024-11-07 09:20:07.222484+00:00 [error] <0.254.0>                      {rabbit,start,[normal,[]]}}
2024-11-07 09:20:07.222484+00:00 [error] <0.254.0>       in function  application_master:init/4 (application_master.erl, line 142)
2024-11-07 09:20:07.222484+00:00 [error] <0.254.0>     ancestors: [<0.253.0>]
2024-11-07 09:20:07.222484+00:00 [error] <0.254.0>     message_queue_len: 1
2024-11-07 09:20:07.222484+00:00 [error] <0.254.0>     messages: [{'EXIT',<0.255.0>,normal}]
2024-11-07 09:20:07.222484+00:00 [error] <0.254.0>     links: [<0.253.0>,<0.44.0>]
2024-11-07 09:20:07.222484+00:00 [error] <0.254.0>     dictionary: []
2024-11-07 09:20:07.222484+00:00 [error] <0.254.0>     trap_exit: true
2024-11-07 09:20:07.222484+00:00 [error] <0.254.0>     status: running
2024-11-07 09:20:07.222484+00:00 [error] <0.254.0>     heap_size: 376
2024-11-07 09:20:07.222484+00:00 [error] <0.254.0>     stack_size: 28
2024-11-07 09:20:07.222484+00:00 [error] <0.254.0>     reductions: 169
2024-11-07 09:20:07.222484+00:00 [error] <0.254.0>   neighbours:
2024-11-07 09:20:07.222484+00:00 [error] <0.254.0>
2024-11-07 09:20:07.230011+00:00 [notice] <0.44.0> Application rabbit exited with reason: {{timeout,{rabbitmq_metadata,'rabbit@rabbit-crmq-2'}},{rabbit,start,[normal,[]]}}



In both versions, 3.13.3 and 4.0.3, khepri_db is disabled by default and we left it unchanged, so I am not sure why the error logs mention Khepri.
Please clarify.

Jean-Sébastien Pédron

Nov 7, 2024, 4:53:08 AM
to rabbitmq-users
Hi!

Downgrades are not officially supported. They work most of the time, but this is not something we test or are careful about.

In this specific case, the way the Khepri library is started changed in many incompatible ways between 3.13.x and 4.0.x (given the experimental nature of the Khepri integration in 3.13.x). The library is always initialized, regardless of whether RabbitMQ uses it (i.e. regardless of the state of the `khepri_db` feature flag). That is because if the feature flag is enabled from another node in a cluster, that node has to reach all instances of Khepri to initialize the cluster at the Khepri level.

To sum up:
- The Khepri library is always initialized regardless of the `khepri_db` state, in both RabbitMQ 3.13.x and 4.0.x.
- Khepri is only used by RabbitMQ once the `khepri_db` feature flag is enabled.
- Khepri support in RabbitMQ 3.13.x is at a very early, experimental stage, and it is impossible to upgrade to 4.0+ once it is enabled in 3.13.x.
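
The two boot failures quoted in this thread can be told apart mechanically from their error terms. The Python sketch below does that with two patterns copied from the logs above. `SIGNATURES` and `classify_boot_failure` are hypothetical names, and the explanations only paraphrase what was reported in this thread; they are not official diagnostics and other causes for the same errors may exist.

```python
import re

# Error terms copied from the logs in this thread. The explanations are a
# paraphrase of the replies here, not an exhaustive or official mapping.
SIGNATURES = [
    (
        re.compile(r"\{user_already_exists,"),
        "seen when upgrading to 4.0.x after khepri_db was enabled on 3.13.x",
    ),
    (
        re.compile(r"\{timeout,\{rabbitmq_metadata,"),
        "seen when downgrading from 4.0.x to 3.13.x",
    ),
]


def classify_boot_failure(log_text: str) -> str:
    """Return the first matching explanation, or 'unknown' if none match."""
    for pattern, explanation in SIGNATURES:
        if pattern.search(log_text):
            return explanation
    return "unknown"


upgrade_log = 'Error during startup: {error,{user_already_exists,<<"defaultuser">>}}'
downgrade_log = "throw:{timeout,{rabbitmq_metadata,'rabbit@rabbit-crmq-2'}}"
print(classify_boot_failure(upgrade_log))
print(classify_boot_failure(downgrade_log))
```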

I see that the documentation you pointed to is unclear and in fact incorrect: only stable feature flags have to be enabled, not all of them. I will fix that today.

What would have helped you in this situation otherwise? I would happily improve the docs to clarify things.

Thank you!

--
Jean-Sébastien Pédron
RabbitMQ core team

Swathi Mocharla

Nov 7, 2024, 5:26:44 AM
to rabbitmq-users
Hi Jean,
Thank you for your prompt response. As already mentioned, we have a use case to upgrade from 3.13.3 to 3.13.7 or 4.0. We do not want to enable khepri_db. The upgrade works as expected with the "khepri_db" feature flag disabled.

Is there anything we can do to get a successful rollback? We always run into "rabbit_khepri:-register_projections/0". The 'deps/rabbit/src/rabbit_khepri.erl' module mentions unregister_projections etc. Is there something we can do manually on our end to perform a rollback successfully?

rabbit_khepri:-register_projections/0-lc$^9/1-0-/1, line 1078
    rabbit_khepri:register_projections/0, line 1079
    rabbit_khepri:setup/1, line 255
    rabbit:run_prelaunch_second_phase/0, line 379
    rabbit:start/2, line 893
    application_master:start_it_old/4, line 293

Thanks,
Swathi

Jean-Sébastien Pédron

Nov 7, 2024, 5:58:24 AM
to rabbitmq-users
No, not really, unfortunately. There won’t be new FOSS releases on the 3.13.x branch either to mitigate this situation.

Why do you need the ability to downgrade in place? You could still use a blue-green deployment instead: