Incompatible feature flags on cluster upgrade from 3.12.14 to 3.13.6?

Guillaume Connan

Aug 1, 2024, 11:12:25 AM
to rabbitmq-users
Hi,

We are running into a problem that seems to be caused by a feature flag (FF) incompatibility when upgrading a RabbitMQ cluster from version 3.12.14 to 3.13.6, with Erlang 26.2.5.2 on both sides.

On the 3.12.14 nodes (rabbitmq-node2 and rabbitmq-node3), all FFs are enabled before any new node is started:
[root@rabbitmq-node2 ~]# rabbitmqctl list_feature_flags
Listing feature flags ...
name    state
classic_mirrored_queue_version  enabled
classic_queue_type_delivery_support     enabled
direct_exchange_routing_v2      enabled
drop_unroutable_metric  enabled
empty_basic_get_metric  enabled
feature_flags_v2        enabled
implicit_default_bindings       enabled
listener_records_in_ets enabled
maintenance_mode_status enabled
quorum_queue    enabled
restart_streams enabled
stream_queue    enabled
stream_sac_coordinator_unblock_group    enabled
stream_single_active_consumer   enabled
tracking_records_in_ets enabled
user_limits     enabled
virtual_host_metadata   enabled


(screenshot attached: ff31214.png)

When starting the 3.13.6 node (rabbitmq-node1), we have the following warnings in the log file:
{"time":"2024-08-01 16:07:14.562202+02:00","level":"warning","msg":"Feature flags: nodes `rabbit@rabbitmq-node1` and `rabbit@rabbitmq-node3` are incompatible","line":446,"pid":"<0.256.0>","file":"rabbit_ff_controller.erl","domain":"rabbitmq.feature_flags","mfa":["rabbit_ff_controller","check_node_compatibility_task1",4]}
{"time":"2024-08-01 16:07:14.562254+02:00","level":"warning","msg":"Peer discovery: could not auto-cluster with node 'rabbit@rabbitmq-node3': {error,incompatible_feature_flags}","line":980,"pid":"<0.256.0>","file":"rabbit_peer_discovery.erl","domain":"rabbitmq.peer_discovery","mfa":["rabbit_peer_discovery","join_selected_node_locked",2]}


Digging into the log file on the 3.13.6 node, we can see a difference in the feature flag inventory of the nodes:

  • 3.12.14
  states_per_node =>
      #{'rabbit@rabbitmq-node3' =>
            #{classic_mirrored_queue_version => true,quorum_queue => true,
              stream_queue => true,implicit_default_bindings => true,
              virtual_host_metadata => true,maintenance_mode_status => true,
              user_limits => true,stream_single_active_consumer => true,
              feature_flags_v2 => true,direct_exchange_routing_v2 => true,
              listener_records_in_ets => true,tracking_records_in_ets => true,
              classic_queue_type_delivery_support => true,
              restart_streams => true,
              stream_sac_coordinator_unblock_group => true,
              empty_basic_get_metric => true,drop_unroutable_metric => true},
        'rabbit@rabbitmq-node2' =>
            #{classic_mirrored_queue_version => true,quorum_queue => true,
              stream_queue => true,implicit_default_bindings => true,
              virtual_host_metadata => true,maintenance_mode_status => true,
              user_limits => true,stream_single_active_consumer => true,
              feature_flags_v2 => true,direct_exchange_routing_v2 => true,
              listener_records_in_ets => true,tracking_records_in_ets => true,
              classic_queue_type_delivery_support => true,
              restart_streams => true,
              stream_sac_coordinator_unblock_group => true,
              empty_basic_get_metric => true,drop_unroutable_metric => true}}

  • 3.13.6
  states_per_node =>
      #{'rabbit@rabbitmq-node1' =>
            #{message_containers_deaths_v2 => true,message_containers => true,
              transient_nonexcl_queues => false,global_qos => false,
              classic_mirrored_queue_version => true,quorum_queue => true,
              stream_queue => true,implicit_default_bindings => true,
              virtual_host_metadata => true,maintenance_mode_status => true,
              user_limits => true,stream_single_active_consumer => true,
              feature_flags_v2 => true,direct_exchange_routing_v2 => true,
              listener_records_in_ets => true,tracking_records_in_ets => true,
              classic_queue_type_delivery_support => true,
              restart_streams => true,
              stream_sac_coordinator_unblock_group => true,
              stream_filtering => true,khepri_db => false,
              classic_queue_mirroring => false,ram_node_type => false,
              stream_update_config_command => true,
              quorum_queue_non_voters => true,
              detailed_queues_endpoint => true,
              management_metrics_collection => false,
              empty_basic_get_metric => true,drop_unroutable_metric => true}}

These 3.13-specific FFs are enabled right from the start:
  • message_containers_deaths_v2
  • message_containers
  • stream_filtering
  • stream_update_config_command
  • quorum_queue_non_voters

Booting and clustering do work if we explicitly list only the 3.12.14 FFs on the 3.13.6 node in the rabbitmq-env.conf file, but it is not optimal:
RABBITMQ_FEATURE_FLAGS=classic_mirrored_queue_version,classic_queue_type_delivery_support,direct_exchange_routing_v2,drop_unroutable_metric,empty_basic_get_metric,feature_flags_v2,implicit_default_bindings,listener_records_in_ets,maintenance_mode_status,quorum_queue,restart_streams,stream_queue,stream_sac_coordinator_unblock_group,stream_single_active_consumer,tracking_records_in_ets,user_limits,virtual_host_metadata
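Rather than maintaining that list by hand, we could presumably generate it from a running 3.12.14 node just before starting the new one (a sketch, assuming the old node is reachable with the shared cookie and that the "name state" column selectors behave as shown):
# build RABBITMQ_FEATURE_FLAGS from whatever is currently enabled on an existing 3.12.14 node
FLAGS=$(rabbitmqctl -q -n rabbit@rabbitmq-node2 list_feature_flags name state | awk '$2 == "enabled" {print $1}' | sort | paste -sd, -)
echo "RABBITMQ_FEATURE_FLAGS=${FLAGS}" >> /etc/rabbitmq/rabbitmq-env.conf
But that still means pinning the flag set with hand-rolled tooling, which is what we would like to avoid.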

(screenshot attached: ff3136mixed.png)

Did we miss something here?

You can find complete debug log files attached.

Thanks for your help!

Regards,
Guillaume
journalctl-stdout.log
rabbit.log

Luke Bakken

Aug 1, 2024, 11:46:15 AM
to rabbitmq-users
Hello,

Would you mind providing a detailed explanation of how you are performing the upgrade, including all commands run and their output?

Guillaume Connan

Aug 1, 2024, 12:46:56 PM
to rabbitmq-users
Hello Luke,

Of course, yes!

We basically perform grow-then-shrink cluster upgrades on AWS EC2 instances (no autoscaling): we start a new EC2 instance based on a pre-packaged AMI that contains the target RabbitMQ version (installed through the official RabbitMQ RPM), download the RabbitMQ configuration at startup (it is exactly the same file for versions 3.12.14 and 3.13.6, attached if needed), and then start the service:
echo "RABBITMQ_USE_LONGNAME=true" > /etc/rabbitmq/rabbitmq-env.conf && \
echo "RABBITMQ_NODENAME=rabbit@`/usr/bin/hostname -I`" >> /etc/rabbitmq/rabbitmq-env.conf && \
mkdir -m 755 -p /var/lib/rabbitmq/mnesia && \
chown -R rabbitmq:rabbitmq /etc/rabbitmq /var/lib/rabbitmq && \
chmod 600 /etc/rabbitmq/* /var/lib/rabbitmq/.erlang.cookie && \
systemctl start rabbitmq-server.service



[root@rabbitmq-node1 ~]# ls -al /etc/rabbitmq /var/lib/rabbitmq
/etc/rabbitmq:
total 28
drwxr-sr-x.  2 rabbitmq rabbitmq    75 Aug  1 11:31 .
drwxr-xr-x. 86 root     root     16384 Aug  1 11:31 ..
-rw-------.  1 rabbitmq rabbitmq   232 Aug  1 15:24 enabled_plugins
-rw-------.  1 rabbitmq rabbitmq   489 Aug  1 16:46 rabbitmq-env.conf
-rw-------.  1 rabbitmq rabbitmq   909 Aug  1 14:33 rabbitmq.conf

/var/lib/rabbitmq:
total 20
drwxr-xr-x.  3 rabbitmq rabbitmq    42 Aug  1 16:46 .
drwxr-xr-x. 30 root     root     16384 Aug  1 11:31 ..
-rw-------.  1 rabbitmq rabbitmq    49 Jul 31 19:34 .erlang.cookie
drwxr-x---.  4 rabbitmq rabbitmq   151 Aug  1 16:46 mnesia


After the new node is fully initialized (all queues are in sync, no active alarms, ...), we terminate the old one, and so on.
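"Fully initialized" is checked with a few CLI probes against the new node and the node about to be removed, roughly along these lines (a simplified sketch):
rabbitmqctl -q await_startup                               # new node has finished booting
rabbitmqctl -q cluster_status                              # new node shows up among the running nodes
rabbitmq-diagnostics -q check_alarms                       # no active resource alarms
rabbitmq-diagnostics -q check_if_node_is_quorum_critical   # run on the old node before terminating it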
At the end of the cluster upgrade, we enable all new stable FF via the management API.
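(For the record, the CLI equivalent of that last step should be something like:
# enable every stable feature flag known to the now fully upgraded cluster
rabbitmqctl enable_feature_flag all
but we drive it through the HTTP API.)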

We have been using the same approach and scripts since RabbitMQ 3.7 without issues... until now :-)

Note: at boot, the 3.13.6 node loops on peer discovery for about 1 minute before starting as a standalone node:
{"time":"2024-08-01 18:37:33.669723+02:00","level":"error","msg":"Peer discovery: could not discover and join another node; proceeding as a standalone node","line":259,"pid":"<0.256.0>","file":"rabbit_peer_discovery.erl","domain":"rabbitmq.peer_discovery","mfa":["rabbit_peer_discovery","retry_sync_desired_cluster",3]}

Hope this helps!
rabbitmq.conf
enabled_plugins
rabbitmq-env.conf

Michael Klishin

Aug 1, 2024, 1:53:32 PM
to rabbitmq-users
Grow-then-shrink upgrades are very explicitly documented as dangerous [1]
and as requiring a careful set of steps to avoid permanently losing replica identities
(or rather, to replace them carefully with new identities without losing quorum).

Peer discovery has changed in 3.13.

1. https://www.rabbitmq.com/docs/upgrade#grow-then-shrink-upgrades

Guillaume Connan

Aug 2, 2024, 5:34:11 AM
to rabbitmq-users
Highly discouraged, yet still possible :-)

We are planning to switch to in-place upgrades in the near future.

From what I understand, it doesn't seem to be related to peer discovery fundamentals, because the other nodes are correctly discovered here and the warnings come from rabbit_ff_controller.

The inventory module detects an incompatibility between the 2 versions because the 3.13.6 node seems to start with all the 3.13 stable FFs already enabled, some of which are obviously unknown to 3.12 nodes.

Shouldn't the 3.13.6 node wait until it knows the compatibility level of the existing nodes before activating its FFs accordingly?

Besides that, the new problematic FFs are just stable, not even required.
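That is easy to double-check on the 3.13.6 node (assuming the stability column is available there):
# expectation: the new flags all report "stable", none of them "required"
rabbitmqctl -q list_feature_flags name state stability | grep -E 'message_containers|stream_filtering|stream_update_config_command|quorum_queue_non_voters'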

As I wrote, clustering works as expected when the new stable 3.13 FFs are explicitly left out of RABBITMQ_FEATURE_FLAGS before starting the 3.13.6 node.