Hi, I run a Rabbit cluster using the official Docker image:
https://hub.docker.com/_/rabbitmq/
When attempting to upgrade from v3.8.9 to v3.8.12, the new node fails to join the existing cluster because of the new `user_limits` feature flag.
In the past, our upgrade process would go like so:
1. Spin up a new node using the new version of the Docker image:

   ```shell
   docker run -d --name rabbitmq --hostname $HOSTNAME --restart=always \
     -p 4369:4369 -p 5672:5672 -p 15672:15672 -p 15692:15692 \
     -p 25672:25672 -p 35672-35682:35672-35682 \
     -e RABBITMQ_ERLANG_COOKIE='${erlang_cookie}' \
     -v /root/data:/var/lib/rabbitmq -v /root/conf/:/etc/rabbitmq -v /root/bin:/tmp/bin \
     rabbitmq:${rabbitmq_version}-management
   ```

2. Run a script that makes the new node join the existing cluster.
3. Wait for the cluster to go green with the new node.
4. Shut down one of the old nodes.
5. Repeat until all nodes are on the new version.
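For context, step 2 boils down to the usual `rabbitmqctl` join sequence. Below is a minimal, hypothetical sketch of such a script (not the exact one we use). It assumes the container is named `rabbitmq`, as in the `docker run` command, and takes the seed node name as an argument (`rabbit@old-node-1` below is a placeholder). Setting `DRY_RUN=1` only prints the commands.

```shell
#!/usr/bin/env bash
# Hypothetical sketch of step 2: make a freshly started node join an
# existing cluster. The seed node name is a placeholder argument.
set -euo pipefail

# Print the command instead of running it when DRY_RUN=1.
run() {
  if [[ "${DRY_RUN:-0}" == "1" ]]; then
    echo "$*"
  else
    "$@"
  fi
}

join_cluster() {
  local seed_node="$1"
  local container="${2:-rabbitmq}"   # container name from the docker run above
  # The app must be stopped and the node reset before it can join.
  run docker exec "$container" rabbitmqctl stop_app
  run docker exec "$container" rabbitmqctl reset
  run docker exec "$container" rabbitmqctl join_cluster "$seed_node"
  run docker exec "$container" rabbitmqctl start_app
}
```

`DRY_RUN=1 join_cluster rabbit@old-node-1` prints the four `docker exec` commands. With a 3.8.12 node and a 3.8.9 cluster, the `join_cluster` call is presumably the one that reports `incompatible_feature_flags`.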
This process served us well until v3.8.10, which introduced a new `user_limits` feature flag. Step 2 now fails with `Error: incompatible_feature_flags`. Upgrading the existing nodes in place isn't really an option either, because Docker images are immutable: you run the 3.8.9 image and you're stuck with it.
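One way to confirm which flag the join trips over is to compare the flag lists from both sides. A minimal sketch, assuming each list was saved to a file as tab-separated `name<TAB>state` lines with any header row removed (for example from `rabbitmqctl -q list_feature_flags name state`, run once on the new node and once on an old one). The function name is made up for illustration.

```shell
#!/usr/bin/env bash
# Sketch: print feature flags that are enabled on the new node but absent
# from the old cluster's list. Input format is an assumption: one
# "name<TAB>state" line per flag, no header row.
set -euo pipefail

flags_unknown_to_cluster() {
  local new_node_file="$1" cluster_file="$2"
  # comm -23 keeps lines that appear only in the first (sorted) input.
  comm -23 \
    <(awk -F'\t' '$2 == "enabled" { print $1 }' "$new_node_file" | sort -u) \
    <(awk -F'\t' '{ print $1 }' "$cluster_file" | sort -u)
}
```

Any flag it prints is enabled on the new node but unknown to the old cluster; for 3.8.9 versus 3.8.12 we would expect `user_limits` to show up.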
As a one-off workaround, we upgraded each existing node in place, by hand:

```shell
# SSH to the instance and open a shell in the container:
ssh -i ~/.ssh/<key> ec2-user@<ip>
docker exec -ti rabbitmq bash

# Install some tools:
apt-get update; apt-get install --no-install-recommends wget xz-utils

# Download RabbitMQ

# Put the node into maintenance mode:
rabbitmq-upgrade drain

# Stop the RabbitMQ server.
# NOTE: This kicks us out of the container, so we have to log back in.
rabbitmqctl stop
docker exec -ti rabbitmq bash

# Install:
tar --extract --file "/usr/local/src/rabbitmq-3.8.12.tar.xz" --directory "/opt/rabbitmq" --strip-components 1
chown -R rabbitmq:rabbitmq "/opt/rabbitmq"

# Configure:
export RABBITMQ_VERSION=3.8.12
sed -i 's/^SYS_PREFIX=.*$/SYS_PREFIX=/' "/opt/rabbitmq/sbin/rabbitmq-defaults"

# Start RabbitMQ:
rabbitmqctl start_app

# Verify:
rabbitmqctl version
rabbitmqctl cluster_status

# Exit maintenance mode:
rabbitmq-upgrade revive

# Re-balance queues (may need to run this a couple of times):
rabbitmq-upgrade post_upgrade
rabbitmq-queues rebalance "all"
```
This worked fine. Once all existing nodes were manually updated, we were able to go back to the old method and spin up new nodes using Docker. However, it feels dirty and goes against everything that running in Docker is supposed to provide. It's tedious and error-prone, and our images are no longer "immutable."
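Until there is a cleaner answer, the manual steps can at least be replayed from the host instead of typed into a shell by hand. This sketch mirrors those steps one-to-one (it does not make the procedure any safer). It assumes the container is named `rabbitmq` and that the release tarball is already at `/usr/local/src/rabbitmq-<version>.tar.xz` inside it, since the download command is not shown. The `export RABBITMQ_VERSION=...` line is dropped because each `docker exec` starts a fresh shell. Function names are hypothetical.

```shell
#!/usr/bin/env bash
# Sketch: replay the manual in-place upgrade of one node from the host.
# Assumptions: container name defaults to "rabbitmq", and the tarball is
# already at /usr/local/src/rabbitmq-<version>.tar.xz inside the container.
set -euo pipefail

# Print the command instead of running it when DRY_RUN=1.
run() {
  if [[ "${DRY_RUN:-0}" == "1" ]]; then echo "$*"; else "$@"; fi
}

# Run a shell snippet inside the given container.
in_container() {
  local container="$1" snippet="$2"
  run docker exec "$container" bash -c "$snippet"
}

upgrade_in_place() {
  local version="$1" container="${2:-rabbitmq}"
  local tarball="/usr/local/src/rabbitmq-${version}.tar.xz"

  # Put the node into maintenance mode.
  in_container "$container" "rabbitmq-upgrade drain"

  # Stopping the node ends the container's main process, so this exec may
  # exit non-zero; --restart=always brings the container back up.
  in_container "$container" "rabbitmqctl stop" || true
  until run docker exec "$container" true; do sleep 2; done

  # Install the new release over /opt/rabbitmq and fix ownership.
  in_container "$container" "tar --extract --file '$tarball' --directory /opt/rabbitmq --strip-components 1 && chown -R rabbitmq:rabbitmq /opt/rabbitmq"
  # Same SYS_PREFIX tweak as in the manual steps.
  in_container "$container" "sed -i 's/^SYS_PREFIX=.*\$/SYS_PREFIX=/' /opt/rabbitmq/sbin/rabbitmq-defaults"

  # Start, verify, leave maintenance mode, run post-upgrade tasks.
  in_container "$container" "rabbitmqctl start_app && rabbitmqctl version && rabbitmqctl cluster_status"
  in_container "$container" "rabbitmq-upgrade revive && rabbitmq-upgrade post_upgrade"
  in_container "$container" "rabbitmq-queues rebalance all"
}
```

`DRY_RUN=1 upgrade_in_place 3.8.12` prints the eight commands instead of running them. The rebalance may still need repeating, just as in the manual version.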
I'm wondering if it would be possible to add a feature to start a new node with a feature flag disabled by default. This would allow a node running a newer version to join an older cluster. Once all nodes are updated (and only then), the new feature flag could be enabled.
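The second half of that idea, enabling the flag only after every node is upgraded, can be gated on the operator side today. A minimal sketch, assuming the node versions have already been collected somehow (for example from `rabbitmqctl status` on each node) and are passed in as arguments. The function names are hypothetical; `rabbitmqctl enable_feature_flag` is the real command.

```shell
#!/usr/bin/env bash
# Sketch: enable a feature flag only when every node reports the target
# version. How the versions are gathered is left to the caller (assumption).
set -euo pipefail

# Print the command instead of running it when DRY_RUN=1.
run() {
  if [[ "${DRY_RUN:-0}" == "1" ]]; then echo "$*"; else "$@"; fi
}

# Succeeds only if at least one version is given and all equal the target.
all_on_version() {
  local target="$1"; shift
  (( $# > 0 )) || return 1
  local v
  for v in "$@"; do
    [[ "$v" == "$target" ]] || return 1
  done
}

# Usage: enable_flag_when_ready <flag> <target-version> <node-version>...
enable_flag_when_ready() {
  local flag="$1" target="$2"; shift 2
  if all_on_version "$target" "$@"; then
    run rabbitmqctl enable_feature_flag "$flag"
  else
    echo "not all nodes are on $target yet; leaving $flag disabled" >&2
    return 1
  fi
}
```

The exact-match check is deliberately strict: any node on a version other than the target, older or newer, blocks enabling the flag.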
Or, I'm wondering if there's some other magic to upgrading when using Docker for all of our nodes. Any help would be appreciated. Thanks!