Rolling upgrade from 3.7.18 to 3.8.0 failed: error due to feature flags enabled on the upgraded node during the upgrade process


JK

Oct 4, 2019, 5:32:08 PM
to rabbitmq-users
Hello Fellow DuckMQ Users (it is almost duck season, right? 😉),

I was testing upgrade paths to 3.8.0 and ran into a problem with a two-node 3.7.18 cluster, sandboxvm and mastervm, running on CentOS 7. I stopped rmq on mastervm and upgraded it to 3.8.0, then started rmq, but it failed to start and join the cluster:

2019-10-04 14:58:26.724 [info] <0.2210.0> Successfully stopped RabbitMQ and its dependencies
2019-10-04 14:58:26.724 [info] <0.2210.0> Halting Erlang VM with the following applications:
    lager
    observer_cli
    recon
    ranch
    ssl
    public_key
    asn1
    stdout_formatter
    credentials_obfuscation
    inets
    xmerl
    tools
    crypto
    jsx
    goldrush
    compiler
    syntax_tools
    sasl
    stdlib
    kernel
2019-10-04 14:58:55.183 [info] <0.8.0> Log file opened with Lager
2019-10-04 14:58:57.564 [info] <0.8.0> Feature flags: list of feature flags found:
2019-10-04 14:58:57.564 [info] <0.8.0> Feature flags:   [x] drop_unroutable_metric
2019-10-04 14:58:57.564 [info] <0.8.0> Feature flags:   [x] empty_basic_get_metric
2019-10-04 14:58:57.564 [info] <0.8.0> Feature flags:   [x] implicit_default_bindings
2019-10-04 14:58:57.564 [info] <0.8.0> Feature flags:   [x] quorum_queue
2019-10-04 14:58:57.564 [info] <0.8.0> Feature flags:   [x] virtual_host_metadata
2019-10-04 14:58:57.564 [info] <0.8.0> Feature flags: feature flag states written to disk: yes
2019-10-04 14:58:57.649 [info] <0.43.0> Application mnesia exited with reason: stopped
2019-10-04 14:58:57.668 [error] <0.8.0> Feature flags: node `rabbit@sandboxvm` is INCOMPATIBLE: feature flags enabled locally are not supported remotely
2019-10-04 14:58:57.669 [error] <0.8.0>
Error description:
    init:do_boot/3
    init:start_em/1
    rabbit:start_it/1 line 465
    rabbit:'-boot/0-fun-0-'/0 line 318
    rabbit_mnesia:check_cluster_consistency/0 line 702
throw:{error,incompatible_feature_flags}
Log file(s) (may contain more information):
   /var/log/rabbitmq/rabbit@mastervm.log
   /var/log/rabbitmq/rabbit@mastervm_upgrade.log

So the upgrade process enables some feature flags by default and renders that node incompatible with the cluster? That's a bit broken. Was there something I missed in the upgrade process to disable that? Both nodes are on Erlang 22.1.1, and both were on 3.7.18 before the one node was upgraded. The install process on that node was as follows:

[root@mastervm ~]# systemctl stop rabbitmq-server
[root@mastervm ~]# yum upgrade /home/<user>/rabbitmq-server-3.8.0-1.el7.noarch.rpm
[root@mastervm ~]# systemctl start rabbitmq-server

It was fairly simple; at that point the node should have started up, but it failed to start, with the above error message spamming the log. There is nothing meaningful in the upgrade log at all:

2019-10-04 14:58:55.177 [info] <0.8.0> Log file opened with Lager
2019-10-04 14:59:11.237 [info] <0.8.0> Log file opened with Lager
2019-10-04 14:59:28.318 [info] <0.8.0> Log file opened with Lager
2019-10-04 14:59:44.476 [info] <0.8.0> Log file opened with Lager
2019-10-04 15:00:00.212 [info] <0.8.0> Log file opened with Lager
2019-10-04 15:00:16.241 [info] <0.8.0> Log file opened with Lager

I double-checked, and nothing in the upgrade docs mentions any additional steps to keep those flags from being enabled when upgrading. In fact, the relevant section is fairly sparse:

Rolling Upgrades
Rolling upgrades are possible only between compatible RabbitMQ and Erlang versions.
Starting RabbitMQ 3.8
RabbitMQ 3.8.0 comes with a feature flags subsystem which is responsible for determining if two versions of RabbitMQ are compatible. If they are, then two nodes with different versions can live in the same cluster: this allows a rolling upgrade of cluster members without shutting down the cluster entirely.
The upgrade from RabbitMQ 3.7.x to 3.8.x is also permitted, but not from older minor or major versions.
To learn more, please read the feature flags documentation.

And nothing in the feature flag docs regarding steps to take for a rolling upgrade. At least nothing that the blurb in the upgrade docs doesn't already mention. Ok, no big deal, I'll just disable the feature flags on the upgraded node. Oh...
How to Disable Feature Flags
It is impossible to disable a feature flag once it is enabled. 

Well, ok then. No mixed version cluster for me today, womp womp...
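
For anyone else poking at this, a 3.8 node can at least report its flag states; this is from memory, so the exact output format may differ, and a 3.7.18 node won't know the command at all:

   [root@mastervm ~]# rabbitmqctl list_feature_flags
   # prints a table of flag names and states, e.g.
   #   name          state
   #   quorum_queue  enabled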

I can recover this node just fine by reverting it to 3.7.18 with a clean state and rejoining, or even by just failing forward with the other node. I created this cluster for the sole purpose of testing a rolling upgrade, so no big deal there, but this was not a successful rolling upgrade experience. If there is something missing from that procedure, please let me know and I'll be happy to retest.
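
For reference, the revert path I have in mind looks roughly like this; it assumes the default data dir and that the 3.7.18 package is still available to yum, and the rm is only acceptable because this is a throwaway test cluster:

   [root@mastervm ~]# systemctl stop rabbitmq-server
   [root@mastervm ~]# yum downgrade rabbitmq-server-3.7.18-1.el7.noarch
   [root@mastervm ~]# rm -rf /var/lib/rabbitmq/mnesia/*   # wipe node state, test box only!
   [root@mastervm ~]# systemctl start rabbitmq-server
   [root@mastervm ~]# rabbitmqctl stop_app
   [root@mastervm ~]# rabbitmqctl join_cluster rabbit@sandboxvm
   [root@mastervm ~]# rabbitmqctl start_app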

Cheers,
-J

JK

Oct 4, 2019, 6:18:12 PM
to rabbitmq-users
FYI for anyone running into this: failing forward worked fine to get the cluster back. Essentially it became a full-stop upgrade: I made sure mastervm was fully down, then stopped and upgraded sandboxvm, started sandboxvm, then started mastervm, and both nodes were happy. This cluster is on 3.8.0 now, but as I mentioned before, it was for testing the rolling upgrade process.
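
For the record, the sequence was roughly this, using the same hosts as above:

   # mastervm was already on 3.8.0 and refusing to start; make sure it is fully down
   [root@mastervm ~]# systemctl stop rabbitmq-server

   # stop and upgrade the remaining 3.7.18 node
   [root@sandboxvm ~]# systemctl stop rabbitmq-server
   [root@sandboxvm ~]# yum upgrade /home/<user>/rabbitmq-server-3.8.0-1.el7.noarch.rpm
   [root@sandboxvm ~]# systemctl start rabbitmq-server

   # sandboxvm was the last node running, so it comes back up first; then start mastervm
   [root@mastervm ~]# systemctl start rabbitmq-server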

-J

Michael Klishin

Oct 4, 2019, 8:01:05 PM
to rabbitmq-users
The upgrade process does not enable feature flags. IIRC, at some point we considered doing that if all cluster members support the same set of flags.

Can you try one more time with debug logging enabled and share both the steps in more detail and the log files?
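
One way to get debug logs with the new-style config, assuming the default config location (restart the node afterwards):

   echo 'log.file.level = debug' >> /etc/rabbitmq/rabbitmq.conf
   systemctl restart rabbitmq-server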

Michael Klishin

Oct 5, 2019, 8:00:36 AM
to rabbitmq-users
I could find evidence that other teams at Pivotal have successfully performed rolling upgrades in an automated scenario (obviously our team has done a lot of testing, and there are integration testing pipelines that use mixed-version clusters). That did not involve the deployment tool or operator enabling any feature flags.

So we need full logs and steps to reproduce.


--
Staff Software Engineer, Pivotal/RabbitMQ

Michael Klishin

Oct 7, 2019, 9:44:32 AM
to rabbitmq-users
A colleague of mine figured out the condition you are looking at [1].

If a node is upgraded (as in, its code is replaced) without being restarted, a check for whether the feature flags module
is present on that node will cause the module to be loaded from disk, even though the running process is otherwise
entirely unaware of feature flags and has no relevant environment or configuration variables set up.

Our tests upgraded and restarted nodes one by one instead of upgrading all of them and then restarting them one by one.
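
To see whether the module has been loaded into a running VM, something like this should work; rabbitmqctl eval runs an Erlang expression on the node, and code:is_loaded/1 only inspects, unlike the ensure_loaded-style check described above, which is what pulls the module in from disk:

   rabbitmqctl eval 'code:is_loaded(rabbit_feature_flags).'
   # => false on an untouched 3.7 node; {file, "..."} once the 3.8 module has been loaded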

--
MK

Staff Software Engineer, Pivotal/RabbitMQ

JK

Oct 7, 2019, 3:40:37 PM
to rabbitmq-users
Hello Michael,

Thanks for checking, but you can disregard this. I should have done another test before I posted; I cannot reproduce it. I wiped both hosts clean for a completely fresh rmq 3.7.18 install, clearing out all log files, config files, etc. beforehand. After the initial user and cluster setup, I did the upgrade as stated above and had no problems at all during the install. Lather, rinse, repeat: still no issues. There had to be some cruft left over from a previous test with 3.8.0 unrelated to upgrading to it. My apologies!

Thanks,
-J

Michael Klishin

Oct 7, 2019, 5:25:57 PM
to rabbitmq-users
No, we have identified a real, though unlikely, scenario where a rolling upgrade is not currently possible, thanks to your report as well as a couple of others on this list.

Here are the steps:

 * Form a cluster of pre-3.7.18 nodes
 * Install 3.7.18 or 3.8.0 on all of them *without restarting the nodes*
 * Begin restarting nodes one by one

Most systems are upgraded slightly differently (both orderings are sketched below):

 * Form a cluster of pre-3.7.18 nodes
 * Install 3.7.18 or 3.8.0 on node A, restart it
 * Install 3.7.18 or 3.8.0 on node B, restart it
 * and so on
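
In shell terms, roughly, with hypothetical hostnames and the package assumed to be available on each host:

   # Ordering that can hit the problem: replace the code everywhere first...
   for h in nodeA nodeB nodeC; do
     ssh "$h" 'yum -y upgrade rabbitmq-server-3.8.0-1.el7.noarch.rpm'
   done
   # ...then restart one by one
   for h in nodeA nodeB nodeC; do
     ssh "$h" 'systemctl restart rabbitmq-server'
   done

   # The more common ordering: upgrade AND restart each node before moving on
   for h in nodeA nodeB nodeC; do
     ssh "$h" 'yum -y upgrade rabbitmq-server-3.8.0-1.el7.noarch.rpm && systemctl restart rabbitmq-server'
   done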

[1] contains more details and there's a patch that is undergoing QA (including by one of the reporters).



JK

Oct 7, 2019, 8:47:26 PM
to rabbitmq-users
Hi Michael,

I see, I'm glad my post was at least of some use! As far as why I ran into that the first time, looking over the other issues you linked to jogged my memory.

The first time I upgraded to 3.8.0 was from a 3.6.x install that had a test cluster configured on the same VMs for something else I was breaking :) In that upgrade, I did not shut down any nodes before issuing yum upgrade. I assume, like the other posts, that is when the flags got set. It should be noted that doing this with yum doesn't work as expected; it is disruptive on currently running nodes, and the first node I updated the package on dropped from the cluster and didn't rejoin until I upgraded the other. What I was expecting was that I'd update the packages and then restart the nodes, like in the reproduction steps you provided.

My steps were going to be:

 * Form a cluster of 3.6.x nodes.
 * Install 3.8.0 on all of them without restarting nodes.
 * Restart nodes one by one.

But what ended up happening:

 * Form a cluster of 3.6.x nodes.
 * Yum update on the first node, then watch that node drop unexpectedly from the cluster and panic about what happened.
 * Stop panicking because it's not prod and update the next node.
 * Profit.

Since that procedure didn't work, I failed forward and everything was fine. I was not expecting a mixed-version cluster, though (3.6.x and 3.8.0), nor was I trying the rolling upgrade; I was just trying to be clever and minimize downtime per node for when I upgrade prod. After I botched that, I decided to actually test the rolling upgrade process. I downgraded to 3.7.18 and cleaned out that first 3.8.0 install, but I was a bit sloppy and only nuked the mnesia dir (not prod, so I didn't care about being thorough). The flag settings were likely hiding out somewhere from that first upgrade attempt (see the cleanup sketch below). When I tried that first rolling upgrade test, it was done the same way as the second set of steps you provided, but I'm guessing those leftover flags are what made it fail. After that, and after posting here, I made sure to start with a clean environment each time and had no issues.
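
If my understanding of the 3.8 layout is right, the enabled-flags state lives in a <node>-feature_flags file that sits next to, not inside, the node's mnesia directory, which would explain how it survived my cleanup. A more thorough wipe would be something like this (default paths and node rabbit@mastervm assumed; double check before deleting anything):

   [root@mastervm ~]# systemctl stop rabbitmq-server
   [root@mastervm ~]# rm -rf /var/lib/rabbitmq/mnesia/rabbit@mastervm
   [root@mastervm ~]# rm -f /var/lib/rabbitmq/mnesia/rabbit@mastervm-feature_flags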

Thanks again, and HTH everyone with the upgrade!
-J