3.7.12 -> 3.7.18 upgrades failing with "feature_flags_file_not_set"


Terry Rinck

Oct 3, 2019, 11:08:35 AM
to rabbitmq-users
We saw a peculiar issue upgrading from 3.7.12 -> 3.7.18.

This was conducted on a four-node RabbitMQ cluster running 3.7.12, with no changes made to config files or settings prior to the upgrade.
Only 3 vhosts live on the broker, with no traffic. This is a testing/integration cluster.

- all nodes are up
- deployed 3.7.18 package to all nodes
- stopped 1 node

on startup:

09:50:29 sys-rabbit-defau:Waiting for RabbitMQ server to initialize...
2019-10-03 09:50:30.204 [info] <0.7.0> Log file opened with Lager
2019-10-03 09:50:34.499 [info] <0.7.0> Feature flags: list of feature flags found:
2019-10-03 09:50:34.499 [info] <0.7.0> Feature flags: feature flag states written to disk: yes
2019-10-03 09:50:34.775 [info] <0.42.0> Application mnesia exited with reason: stopped
2019-10-03 09:50:34.802 [error] <0.7.0> 
Error description:
    lists:foldl/3 line 1263
    rabbit_mnesia:check_cluster_consistency/2 line 685
    rabbit_mnesia:check_consistency/5 line 876
    rabbit_mnesia:check_rabbit_consistency/2 line 945
    rabbit_feature_flags:check_node_compatibility/2 line 1711
    rabbit_feature_flags:exchange_feature_flags_from_unknown_apps/2 line 1901
    rabbit_feature_flags:fetch_remote_feature_flags_from_apps_unknown_locally/2 line 1909
    rabbit_feature_flags:query_remote_feature_flags/3 line 1845
error:{case_clause,feature_flags_file_not_set}
...
09:50:34 sys-rabbit:
09:50:34 sys-rabbit:BOOT FAILED
09:50:34 sys-rabbit:===========
09:50:34 sys-rabbit:
09:50:34 sys-rabbit:Error description:
09:50:34 sys-rabbit:    lists:foldl/3 line 1263
09:50:34 sys-rabbit:    rabbit_mnesia:check_cluster_consistency/2 line 685
09:50:34 sys-rabbit:    rabbit_mnesia:check_consistency/5 line 876
09:50:34 sys-rabbit:    rabbit_mnesia:check_rabbit_consistency/2 line 945
09:50:34 sys-rabbit:    rabbit_feature_flags:check_node_compatibility/2 line 1711
09:50:34 sys-rabbit:    rabbit_feature_flags:exchange_feature_flags_from_unknown_apps/2 line 1901
09:50:34 sys-rabbit:    rabbit_feature_flags:fetch_remote_feature_flags_from_apps_unknown_locally/2 line 1909
09:50:34 sys-rabbit:    rabbit_feature_flags:query_remote_feature_flags/3 line 1845
09:50:34 sys-rabbit:error:{case_clause,feature_flags_file_not_set}
...
09:50:35 sys-rabbit:{"init terminating in do_boot",{case_clause,feature_flags_file_not_set}}~
09:50:35 sys-rabbit:init terminating in do_boot ({case_clause,feature_flags_file_not_set})~
...
09:50:36 procmgr: sys-rabbit: end of task 57604 0.

So the 1st thing that jumped out is "feature_flags_file_not_set"

What the heck is the 'Feature Flags File'? Nothing about this in the upgrade doc. 

We checked the source code and found in rabbitmq-server a reference to RABBITMQ_FEATURE_FLAGS_FILE env var.

So in rabbitmq-env.conf, we set:
'export RABBITMQ_FEATURE_FLAGS_FILE="/FILE/PATH" '
and created an empty file "/FILE/PATH"

now we get error:

10:07:50 sys-rabbit-defau:Waiting for RabbitMQ server to initialize...
2019-10-03 10:07:50.858 [info] <0.7.0> Log file opened with Lager
2019-10-03 10:07:51.453 [error] <0.7.0> 
Error description:
    init:do_boot/3
    init:start_em/1
    rabbit:start_it/1 line 491
    rabbit:'-boot/0-fun-0-'/0 line 338
    rabbit_feature_flags:initialize_registry/1 line 781
    rabbit_feature_flags:read_enabled_feature_flags_list/0 line 1285
    rabbit_feature_flags:try_to_read_enabled_feature_flags_list/0 line 1299
error:{case_clause,{ok,[]}}
....
10:07:51 sys-rabbit:
10:07:51 sys-rabbit:BOOT FAILED
10:07:51 sys-rabbit:===========
10:07:51 sys-rabbit:
10:07:51 sys-rabbit:Error description:
10:07:51 sys-rabbit:    init:do_boot/3
10:07:51 sys-rabbit:    init:start_em/1
10:07:51 sys-rabbit:    rabbit:start_it/1 line 491
10:07:51 sys-rabbit:    rabbit:'-boot/0-fun-0-'/0 line 338
10:07:51 sys-rabbit:    rabbit_feature_flags:initialize_registry/1 line 781
10:07:51 sys-rabbit:    rabbit_feature_flags:read_enabled_feature_flags_list/0 line 1285
10:07:51 sys-rabbit:    rabbit_feature_flags:try_to_read_enabled_feature_flags_list/0 line 1299
10:07:51 sys-rabbit:error:{case_clause,{ok,[]}}
...
10:07:52 sys-rabbit:{"init terminating in do_boot",{case_clause,{ok,[]}}}~
10:07:52 sys-rabbit:init terminating in do_boot ({case_clause,{ok,[]}})~
10:07:52 sys-rabbit:~
10:07:52 procmgr: sys-rabbit: end of task 84603 0.

Since there's nothing about the feature flag file in the docs, there is no info about its format or contents, so we did more digging into the source.
There's a comment in rabbit_feature_flags.erl
%% If the file is missing, we consider the list of enabled
 %% feature flags to be empty.
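
One possible reading of the `{case_clause,{ok,[]}}` error above is that a zero-byte file parses successfully but yields no terms, while the reader expects exactly one term (a list of enabled flags). Under that assumption, which is not confirmed anywhere in this thread, a file meant to say "no flags enabled" would contain an empty Erlang list followed by a period. A minimal sketch, with a hypothetical helper name and untested against a broker:

```shell
# Hypothetical helper: write a feature flags file that holds a single Erlang
# term (an empty list) instead of leaving the file with zero bytes.
# ASSUMPTION: the file is parsed as a sequence of Erlang terms and exactly
# one term, the list of enabled flags, is expected.
write_empty_feature_flags_file() {
  printf '[].\n' > "$1"
}
```

Called as `write_empty_feature_flags_file /FILE/PATH`, this leaves a three-character file rather than an empty one. Deleting the file, as the quoted comment suggests, is the other way to express an empty list.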

We delete the file, and get the same error as the 1st attempt:

10:19:21 sys-rabbit:Error description:
10:19:21 sys-rabbit:    lists:foldl/3 line 1263
10:19:21 sys-rabbit:    rabbit_mnesia:check_cluster_consistency/2 line 685
10:19:21 sys-rabbit:    rabbit_mnesia:check_consistency/5 line 876
10:19:21 sys-rabbit:    rabbit_mnesia:check_rabbit_consistency/2 line 945
10:19:21 sys-rabbit:    rabbit_feature_flags:check_node_compatibility/2 line 1711
10:19:21 sys-rabbit:    rabbit_feature_flags:exchange_feature_flags_from_unknown_apps/2 line 1901
10:19:21 sys-rabbit:    rabbit_feature_flags:fetch_remote_feature_flags_from_apps_unknown_locally/2 line 1909
10:19:21 sys-rabbit:    rabbit_feature_flags:query_remote_feature_flags/3 line 1845
10:19:21 sys-rabbit:error:{case_clause,feature_flags_file_not_set}
...
10:19:21 sys-rabbit:
10:19:22 sys-rabbit:{"init terminating in do_boot",{case_clause,feature_flags_file_not_set}}~
10:19:22 sys-rabbit:init terminating in do_boot ({case_clause,feature_flags_file_not_set})~
10:19:22 sys-rabbit:~

But this time we noticed "query_remote_feature_flags" hmmm interesting.

So now we run "rabbitmqctl eval 'application:set_env(rabbit, feature_flags_file, "/FILE/PATH").'" on the other three nodes.

"/FILE/PATH" obviously a substitute for the actual path. 

$ rabbitmqctl eval 'rpc:multicall(application, get_env, [rabbit, feature_flags_file]).'                                                      
{[{ok,'/FILE/PATH'},
  {ok,'/FILE/PATH'},
  {ok,'/FILE/PATH'}],
 []}
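
The workaround above can be sketched as a small script that builds the `eval` expression once and runs it against each remaining node. The helper names and the node names in the usage line are hypothetical; `-n` is `rabbitmqctl`'s node-name option. It assumes the path contains no double quotes or backslashes, since the path is pasted into an Erlang string unescaped.

```shell
# Builds the expression passed to `rabbitmqctl eval`. The path is emitted as
# an Erlang string (double quotes); a single-quoted path would be read by
# Erlang as an atom instead.
feature_flags_set_env_expr() {
  printf 'application:set_env(rabbit, feature_flags_file, "%s").' "$1"
}

# Runs the expression on every node listed after the path argument.
set_feature_flags_file_on_nodes() {
  path=$1
  shift
  for node in "$@"; do
    rabbitmqctl -n "$node" eval "$(feature_flags_set_env_expr "$path")"
  done
}
```

Usage would look like `set_feature_flags_file_on_nodes /FILE/PATH rabbit@node2 rabbit@node3 rabbit@node4`. The setting only lives in the running VM, so it is lost when a node restarts.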


At this point, with 'feature_flags_file' set in the running VM on the other three nodes and the export in place in our local rabbitmq-env.conf (but with no actual file on disk), the 1st node starts:


10:34:36 sys-rabbit:  ##  ##
10:34:36 sys-rabbit:  ##  ##      RabbitMQ 3.7.18. Copyright (C) 2007-2019 Pivotal Software, Inc.
10:34:36 sys-rabbit:  ##########  Licensed under the MPL.  See https://www.rabbitmq.com/
10:34:36 sys-rabbit:  ######  ##
10:34:36 sys-rabbit:  ##########  Logs: REDACTED
10:34:36 sys-rabbit:                    REDACTED
10:34:36 sys-rabbit:
10:34:36 sys-rabbit:              Starting broker...



We can easily recreate this. Please let us know what other details might be needed to help address the issue. 

Terry Rinck

Oct 3, 2019, 11:30:56 AM
to rabbitmq-users
We proceeded with the upgrade, without adding the export noted above to rabbitmq-env.conf.

The brokers restarted without issue, however, the feature flag file is automatically set to a location in the mnesia directory

And for sake of thoroughness, we bounced the 1st node again. 


$ rabbitmqctl eval 'rpc:multicall(application, get_env, [rabbit, feature_flags_file]).'
{[{ok,"... /mnesia/sys-rabbit@rqsh1x-ob-569-feature_flags"},
  {ok,"... /mnesia/sys-rabbit@rqsh1x-pw-648-feature_flags"},
  {ok,".... etc/rabbitmq/dev/feature-flags"},  <--- The node with the modified env.conf
  {ok,"... mnesia/sys-rabbit@rqsh1x-pw-162-feature_flags"}],


So it seems like the 1st node wigs-out because no one else knows about feature flags yet?



Michael Klishin

Oct 3, 2019, 11:48:56 AM
to rabbitmq-users
Thanks for the analysis, Terry.

We have seen this specific combination mentioned three or so times in the last 24 hours, can it be different members of the
same team?

We can take another look and perhaps provide a better default for the feature flag file in 3.7.x. What surprises me is that
with a decent amount of feedback from those who moved to 3.8 and 3.7.18, this was not reported elsewhere
(assuming that the recent uptick in reports is coming from the same org).

I'm trying to understand how these two steps really went:

- deployed 3.7.18 package to all nodes
- stopped 1 node

Did you install the package on all nodes? Or did you simply unpack e.g. a generic UNIX build
(which, unlike, say, the Debian package, is not started automatically)?


Terry Rinck

Oct 3, 2019, 12:03:19 PM
to rabbitmq-users
Thanks for the quick response, Michael,

If others are reporting this, it is not from my managed services team.


"- deployed 3.7.18 package to all nodes "

This is in the Debian style you mention.
We unpack the tar from upstream and then package it in a custom manner for deployment in our env.
The files are staged, re-linking the rabbitmq-server script and all others, to the new package location.
The nodes then await a manual restart to roll in the new version.

Jean-Sébastien Pédron

Oct 3, 2019, 12:34:08 PM
to rabbitm...@googlegroups.com
On 03/10/2019 17:08, Terry Rinck wrote:
> We saw a peculiar issue upgrading from 3.7.12 -> 3.7.18.
>
> This was conducted on a four node rabbit cluster v 3.7.12, no changes
> made to config files or settings prior to upgrade. 
> only 3 vhosts living on the broker, with no traffic. This is a
> testing/integration cluster. 
>
> - all nodes are up
> - deployed 3.7.18 package to all nodes
> - stopped 1 node
Hi Terry!

With your steps, I could finally reproduce the issue locally. I don't
know yet exactly what differs from the other things I tried, but I will
keep you posted as I debug this.

Thanks!

--
Jean-Sébastien Pédron
Pivotal / RabbitMQ


Jean-Sébastien Pédron

Oct 3, 2019, 12:52:29 PM
to rabbitm...@googlegroups.com
On 03/10/2019 18:33, Jean-Sébastien Pédron wrote:
> I don't know yet exactly what differs from the other things I tried,
> (...)

The first hypothesis is that in your case (and for anyone using a package
manager like dpkg(1)), package files are overwritten/added/removed
while the service is running.

This is usually perfectly fine. But here, the restarted node contacts
one of the other cluster members and here is what happens:

- It runs the new feature flags module (`rabbit_feature_flags`) to
query the remote node's known feature flags.

- The remote node is a 3.7.12 one without the `rabbit_feature_flags`
module initially. However the 3.7.18 files are already deployed on
that remote node: thus, Erlang simply loads it to satisfy the call.

- Unfortunately, that 3.7.18 module runs in the context of a 3.7.12 node
and therefore, the context is totally off. The answer from the remote
node is unexpected for the node restarting and looks like a
mis-configured situation. That's why it errors out.
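
One way to check this hypothesis on a node that has not been restarted yet is to compare the version of the running `rabbit` application with whether `rabbit_feature_flags` has already been loaded into that VM. The sketch below only classifies the two replies. The helper name and labels are hypothetical, and the reply formats in the comments are the usual Erlang return shapes for these calls rather than output captured from this cluster.

```shell
# Replies are expected to come from:
#   rabbitmqctl -n NODE eval 'application:get_key(rabbit, vsn).'
#   rabbitmqctl -n NODE eval 'code:is_loaded(rabbit_feature_flags).'
# code:is_loaded/1 only inspects the VM; it does not load the module itself.
#
# Hypothetical helper. Arguments: old version, vsn reply, is_loaded reply.
classify_node() {
  old_version=$1
  vsn_reply=$2      # e.g. {ok,"3.7.12"}
  loaded_reply=$3   # false, or {file,"..."} once the module is loaded
  case "$vsn_reply" in
    *"\"$old_version\""*) ;;                 # still running the old release
    *) echo "not-old"; return 0 ;;
  esac
  case "$loaded_reply" in
    false) echo "old-clean" ;;               # new module not loaded yet
    *)     echo "old-with-new-module" ;;     # new code inside the old node
  esac
}
```

A node reported as `old-with-new-module` would match the situation described above: 3.7.18 code loaded on demand inside a VM that booted as 3.7.12.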

Jean-Sébastien Pédron

Oct 7, 2019, 6:08:48 AM
to rabbitm...@googlegroups.com
Hi Terry!

Could you please try the attached patch?

If it is more convenient, I committed it in a branch:
https://github.com/rabbitmq/rabbitmq-server/commit/fbdef3c51acbc8d221e734f9f29683ba38a01250
0001-rabbit_feature_flags-Prevent-load-of-the-module-on-p.patch

Terry Rinck

Oct 7, 2019, 10:08:32 AM
to rabbitm...@googlegroups.com
Thanks Jean-Sébastien, 
We'll try to give this a test today.

--
You received this message because you are subscribed to the Google Groups "rabbitmq-users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to rabbitmq-user...@googlegroups.com.
To view this discussion on the web, visit https://groups.google.com/d/msgid/rabbitmq-users/797e9c57-570a-9a13-a6b3-34d8857187b7%40rabbitmq.com.

Michael Klishin

Oct 7, 2019, 10:14:01 AM
to rabbitmq-users
We can produce a one-off build for you. What package type would you prefer? (Debian, RPM, generic UNIX?)



--
MK

Staff Software Engineer, Pivotal/RabbitMQ

Terry Rinck

Oct 7, 2019, 12:55:13 PM
to rabbitmq-users
Awesome! Generic Unix please.



Michael Klishin

Oct 7, 2019, 8:10:11 PM
to rabbitmq-users
This build should have the patch [1]. The link will be valid for one week.



Terry Rinck

Oct 8, 2019, 11:15:20 AM
to rabbitmq-users
Corporate firewall is blocking this link.

Michael Klishin

Oct 8, 2019, 12:05:26 PM
to rabbitmq-users


Michael Klishin

Oct 8, 2019, 1:48:43 PM
to rabbitmq-users
I cannot share this file to Terry because, according to Google Drive,

> Shared with [email].
> Error: The administrator for [a financial services corp] has disabled the ability to receive items from outside their domain.

So Terry's employer has pretty draconian firewall rules for downloads.

I'm not sure what else our team can do other than merging the PR on faith and producing a preview release on GitHub.

Terry Rinck

Oct 8, 2019, 2:52:59 PM
to rabbitm...@googlegroups.com
The effort is much appreciated. 
However, after a meeting with my team today we've decided we need to put in place some additional tools before any further upgrading can take place. Perhaps by then this fix will be merged into a later 3.7.x release. 

Aside from that, your help and speed of resolution to this issue was amazing. Thank you!

Michael Klishin

Oct 8, 2019, 4:33:31 PM
to rabbitmq-users
Thank you.

We plan to ship 3.7.20 and 3.8.1 in about a week with preview releases dropping tomorrow or so.

