Incompatible feature flags on cluster upgrade from 3.12.14 to 3.13.6?

Guillaume Connan

Aug 1, 2024, 11:12:25 AM
to rabbitmq-users
Hi,

We are running into a problem that seems to be caused by a feature flag (FF) incompatibility when upgrading a RabbitMQ cluster from version 3.12.14 to 3.13.6, with Erlang 26.2.5.2 on both sides.

On the 3.12.14 nodes (rabbitmq-node2 and rabbitmq-node3), all FFs are enabled before any new node is started:
[root@rabbitmq-node2 ~]# rabbitmqctl list_feature_flags
Listing feature flags ...
name    state
classic_mirrored_queue_version  enabled
classic_queue_type_delivery_support     enabled
direct_exchange_routing_v2      enabled
drop_unroutable_metric  enabled
empty_basic_get_metric  enabled
feature_flags_v2        enabled
implicit_default_bindings       enabled
listener_records_in_ets enabled
maintenance_mode_status enabled
quorum_queue    enabled
restart_streams enabled
stream_queue    enabled
stream_sac_coordinator_unblock_group    enabled
stream_single_active_consumer   enabled
tracking_records_in_ets enabled
user_limits     enabled
virtual_host_metadata   enabled


(screenshot attached: ff31214.png)

When starting the 3.13.6 node (rabbitmq-node1), we have the following warnings in the log file:
{"time":"2024-08-01 16:07:14.562202+02:00","level":"warning","msg":"Feature flags: nodes `rabbit@rabbitmq-node1` and `rabbit@rabbitmq-node3` are incompatible","line":446,"pid":"<0.256.0>","file":"rabbit_ff_controller.erl","domain":"rabbitmq.feature_flags","mfa":["rabbit_ff_controller","check_node_compatibility_task1",4]}
{"time":"2024-08-01 16:07:14.562254+02:00","level":"warning","msg":"Peer discovery: could not auto-cluster with node 'rabbit@rabbitmq-node3': {error,incompatible_feature_flags}","line":980,"pid":"<0.256.0>","file":"rabbit_peer_discovery.erl","domain":"rabbitmq.peer_discovery","mfa":["rabbit_peer_discovery","join_selected_node_locked",2]}


Digging into the log file on the 3.13.6 node, we can see a difference in the feature flag inventory of the nodes:

  • 3.12.14
  states_per_node =>
      #{'rabbit@rabbitmq-node3' =>
            #{classic_mirrored_queue_version => true,quorum_queue => true,
              stream_queue => true,implicit_default_bindings => true,
              virtual_host_metadata => true,maintenance_mode_status => true,
              user_limits => true,stream_single_active_consumer => true,
              feature_flags_v2 => true,direct_exchange_routing_v2 => true,
              listener_records_in_ets => true,tracking_records_in_ets => true,
              classic_queue_type_delivery_support => true,
              restart_streams => true,
              stream_sac_coordinator_unblock_group => true,
              empty_basic_get_metric => true,drop_unroutable_metric => true},
        'rabbit@rabbitmq-node2' =>
            #{classic_mirrored_queue_version => true,quorum_queue => true,
              stream_queue => true,implicit_default_bindings => true,
              virtual_host_metadata => true,maintenance_mode_status => true,
              user_limits => true,stream_single_active_consumer => true,
              feature_flags_v2 => true,direct_exchange_routing_v2 => true,
              listener_records_in_ets => true,tracking_records_in_ets => true,
              classic_queue_type_delivery_support => true,
              restart_streams => true,
              stream_sac_coordinator_unblock_group => true,
              empty_basic_get_metric => true,drop_unroutable_metric => true}}

  • 3.13.6
  states_per_node =>
      #{'rabbit@rabbitmq-node1' =>
            #{message_containers_deaths_v2 => true,message_containers => true,
              transient_nonexcl_queues => false,global_qos => false,
              classic_mirrored_queue_version => true,quorum_queue => true,
              stream_queue => true,implicit_default_bindings => true,
              virtual_host_metadata => true,maintenance_mode_status => true,
              user_limits => true,stream_single_active_consumer => true,
              feature_flags_v2 => true,direct_exchange_routing_v2 => true,
              listener_records_in_ets => true,tracking_records_in_ets => true,
              classic_queue_type_delivery_support => true,
              restart_streams => true,
              stream_sac_coordinator_unblock_group => true,
              stream_filtering => true,khepri_db => false,
              classic_queue_mirroring => false,ram_node_type => false,
              stream_update_config_command => true,
              quorum_queue_non_voters => true,
              detailed_queues_endpoint => true,
              management_metrics_collection => false,
              empty_basic_get_metric => true,drop_unroutable_metric => true}}

These 3.13-specific FFs are enabled right from the start:
  • message_containers_deaths_v2
  • message_containers
  • stream_filtering
  • stream_update_config_command
  • quorum_queue_non_voters

Booting and clustering do work if we explicitly list only the 3.12.14 FFs on the 3.13.6 node in the rabbitmq-env.conf file, but it is not optimal:
RABBITMQ_FEATURE_FLAGS=classic_mirrored_queue_version,classic_queue_type_delivery_support,direct_exchange_routing_v2,drop_unroutable_metric,empty_basic_get_metric,feature_flags_v2,implicit_default_bindings,listener_records_in_ets,maintenance_mode_status,quorum_queue,restart_streams,stream_queue,stream_sac_coordinator_unblock_group,stream_single_active_consumer,tracking_records_in_ets,user_limits,virtual_host_metadata
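Rather than maintaining that list by hand, we could presumably generate it from a running 3.12.14 node just before starting the new one (a sketch, assuming the old node is reachable with the shared cookie and that the "name state" column selectors behave as shown):
# build RABBITMQ_FEATURE_FLAGS from whatever is currently enabled on an existing 3.12.14 node
FLAGS=$(rabbitmqctl -q -n rabbit@rabbitmq-node2 list_feature_flags name state | awk '$2 == "enabled" {print $1}' | sort | paste -sd, -)
echo "RABBITMQ_FEATURE_FLAGS=${FLAGS}" >> /etc/rabbitmq/rabbitmq-env.conf
But that still means pinning the flag set with hand-rolled tooling, which is what we would like to avoid.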

(screenshot attached: ff3136mixed.png)

Did we miss something here?

You can find complete debug log files attached.

Thanks for your help!

Regards,
Guillaume
journalctl-stdout.log
rabbit.log

Luke Bakken

Aug 1, 2024, 11:46:15 AM
to rabbitmq-users
Hello,

Would you mind providing a detailed explanation of how you are performing the upgrade, including all commands run and their output?

Guillaume Connan

Aug 1, 2024, 12:46:56 PM
to rabbitmq-users
Hello Luke,

Of course, yes!

We basically perform grow-then-shrink cluster upgrades on AWS EC2 instances (no autoscaling): we start a new EC2 instance based on a pre-packaged AMI that contains the target RabbitMQ version (installed through the official RabbitMQ RPM), download the RabbitMQ configuration at startup (it is exactly the same file for versions 3.12.14 and 3.13.6, attached if needed), and then start the service:
echo "RABBITMQ_USE_LONGNAME=true" > /etc/rabbitmq/rabbitmq-env.conf && \
echo "RABBITMQ_NODENAME=rabbit@`/usr/bin/hostname -I`" >> /etc/rabbitmq/rabbitmq-env.conf && \
mkdir -m 755 -p /var/lib/rabbitmq/mnesia && \
chown -R rabbitmq:rabbitmq /etc/rabbitmq /var/lib/rabbitmq && \
chmod 600 /etc/rabbitmq/* /var/lib/rabbitmq/.erlang.cookie && \
systemctl start rabbitmq-server.service



[root@rabbitmq-node1 ~]# ls -al /etc/rabbitmq /var/lib/rabbitmq
/etc/rabbitmq:
total 28
drwxr-sr-x.  2 rabbitmq rabbitmq    75 Aug  1 11:31 .
drwxr-xr-x. 86 root     root     16384 Aug  1 11:31 ..
-rw-------.  1 rabbitmq rabbitmq   232 Aug  1 15:24 enabled_plugins
-rw-------.  1 rabbitmq rabbitmq   489 Aug  1 16:46 rabbitmq-env.conf
-rw-------.  1 rabbitmq rabbitmq   909 Aug  1 14:33 rabbitmq.conf

/var/lib/rabbitmq:
total 20
drwxr-xr-x.  3 rabbitmq rabbitmq    42 Aug  1 16:46 .
drwxr-xr-x. 30 root     root     16384 Aug  1 11:31 ..
-rw-------.  1 rabbitmq rabbitmq    49 Jul 31 19:34 .erlang.cookie
drwxr-x---.  4 rabbitmq rabbitmq   151 Aug  1 16:46 mnesia


After the new node is fully initialized (all queues are in sync, no active alarms, ...), we terminate the old one, and so on.
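"Fully initialized" is checked with a few CLI probes against the new node and the node about to be removed, roughly along these lines (a simplified sketch):
rabbitmqctl -q await_startup                               # new node has finished booting
rabbitmqctl -q cluster_status                              # new node shows up among the running nodes
rabbitmq-diagnostics -q check_alarms                       # no active resource alarms
rabbitmq-diagnostics -q check_if_node_is_quorum_critical   # run on the old node before terminating it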
At the end of the cluster upgrade, we enable all new stable FF via the management API.
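(For the record, the CLI equivalent of that last step should be something like:
# enable every stable feature flag known to the now fully upgraded cluster
rabbitmqctl enable_feature_flag all
but we drive it through the HTTP API.)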

We have been using the same approach and scripts since RabbitMQ 3.7 without issues... until now :-)

Note: at boot, the 3.13.6 node loops on peer discovery for about 1 minute before starting as a standalone node:
{"time":"2024-08-01 18:37:33.669723+02:00","level":"error","msg":"Peer discovery: could not discover and join another node; proceeding as a standalone node","line":259,"pid":"<0.256.0>","file":"rabbit_peer_discovery.erl","domain":"rabbitmq.peer_discovery","mfa":["rabbit_peer_discovery","retry_sync_desired_cluster",3]}

Hope this helps!
rabbitmq.conf
enabled_plugins
rabbitmq-env.conf

Michael Klishin

Aug 1, 2024, 1:53:32 PM
to rabbitmq-users
Grow-then-shrink upgrades are very explicitly documented as dangerous [1]
and as requiring a careful set of steps to avoid permanently losing replica identities
(or rather, to replace them carefully with new identities without losing quorum).

Peer discovery has changed in 3.13.

1. https://www.rabbitmq.com/docs/upgrade#grow-then-shrink-upgrades

Guillaume Connan

Aug 2, 2024, 5:34:11 AM
to rabbitmq-users
Highly discouraged, yet still possible :-)

We are planning to switch to in-place upgrades in the near future.

From what I understand, it doesn't seem to be related to peer discovery fundamentals, because the other nodes are correctly discovered here and the warnings come from rabbit_ff_controller.

The inventory module detects an incompatibility between the 2 versions because the 3.13.6 node seems to start with all the 3.13 stable FFs already enabled, some of which are obviously unknown to 3.12 nodes.

Shouldn't the 3.13.6 node wait until it knows the compatibility level of the existing nodes before activating its FFs accordingly?

Besides that, the new problematic FFs are just stable, not even required.
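That is easy to double-check on the 3.13.6 node (assuming the stability column is available there):
# expectation: the new flags all report "stable", none of them "required"
rabbitmqctl -q list_feature_flags name state stability | grep -E 'message_containers|stream_filtering|stream_update_config_command|quorum_queue_non_voters'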

As I wrote, clustering works as expected when the new stable 3.13 FFs are explicitly left out of RABBITMQ_FEATURE_FLAGS before starting the 3.13.6 node.