concurrent "rabbitmqctl join_cluster + start_app" from 2 nodes fails sporadically


Kari Niemi

May 2, 2022, 7:19:20 AM
to rabbitmq-users
Hi,

In our CI rounds, "rabbitmqctl join_cluster + start_app" from 2 nodes fails sporadically. Only a small percentage of CI builds fail with this problem. The symptoms are the same as in earlier, unresolved discussions/bug reports.

The symptoms that we are seeing:
  • "rabbitmqctl join_cluster + start_app" fails sometimes when two nodes are joining the cluster at the same moment (we are not sure whether the root cause is the concurrent join, though)
  • one of the nodes is listed twice in the error message (is this a separate problem, or a symptom/cause of the first one?)
We set up the cluster using Ansible:
  • all 3 nodes in parallel (on clean virtual machines):
    • install the rpm
    • configure NODENAME in rabbitmq-env.conf
    • start the service
    • create a new rabbit user (and delete the guest user) using rabbitmqctl
  • 2 nodes in parallel (see the sketch after this list):
    • rabbitmqctl stop_app
    • rabbitmqctl join_cluster rabbit@rbn-1
    • rabbitmqctl start_app  # this fails sporadically
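
For reference, a minimal shell sketch of the join sequence each joining node runs (rbn-2 and rbn-3 execute this at roughly the same moment against rbn-1):
```
#!/usr/bin/env bash
# Minimal sketch of the per-node join sequence; in our Ansible setup
# rbn-2 and rbn-3 run this at roughly the same time against rbn-1.
set -euo pipefail

rabbitmqctl stop_app
rabbitmqctl join_cluster rabbit@rbn-1
rabbitmqctl start_app   # fails sporadically with the mnesia error below
```
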
The error message from rabbitmqctl:
```
{:rabbit, {{:badmatch, {:error, {{:shutdown, {:failed_to_start_child, :mnesia_kernel_sup, :killed}}, {:mnesia_app, :start, [:normal, []]}}}}, {:rabbit, :start, [:normal, []]}}}
```

The error message in the rabbit@rbn-2 instance log (in which rbn-3 is, for some reason, listed twice):
```
2022-04-29 17:46:23.008 [error] <0.1384.0> Mnesia('rabbit@rbn-2'): ** ERROR ** (core dumped to file: "/var/lib/rabbitmq/MnesiaCore.rabbit@rbn-2_1651_243583_7909")
 ** FATAL ** Failed to merge schema: {aborted,{combine_error,schema,['rabbit@rbn-3','rabbit@rbn-3','rabbit@rbn-2','rabbit@rbn-1']}}
```

Versions:
  • rabbitmq-server.noarch:3.8.27-1.el7
  • erlang.x86_64:23.3.4.11-1.el7
  • rhel7

Kari Niemi

May 3, 2022, 9:16:30 AM
to rabbitmq-users
A complete "stack-trace" for the crash looks like this:
---
2022-04-29 17:46:22.943 [info] <0.44.0> Application mnesia started on node 'rabbit@rbn-2'
2022-04-29 17:46:22.945 [info] <0.44.0> Application mnesia exited with reason: stopped
2022-04-29 17:46:22.946 [info] <0.44.0> Application mnesia exited with reason: stopped

2022-04-29 17:46:23.008 [error] <0.1384.0> Mnesia('rabbit@rbn-2'): ** ERROR ** (core dumped to file: "/var/lib/rabbitmq/MnesiaCore.rabbit@rbn-2_1651_243583_7909")
 ** FATAL ** Failed to merge schema: {aborted,{combine_error,schema,['rabbit@rbn-3','rabbit@rbn-3','rabbit@rbn-2','rabbit@rbn-1']}}
2022-04-29 17:46:23.292 [debug] <0.1314.0> Lager installed handler lager_backend_throttle into lager_event
2022-04-29 17:46:33.010 [info] <0.1383.0> supervisor: {local,mnesia_sup}, errorContext: start_error, reason: killed, offender: [{pid,undefined},{id,mnesia_kernel_sup},{mfargs,{mnesia_kernel_sup,start,[]}},{restart_type,permanent},{shutdown,infinity},{child_type,supervisor}]
2022-04-29 17:46:33.010 [error] <0.1383.0> Supervisor mnesia_sup had child mnesia_kernel_sup started with mnesia_kernel_sup:start() at undefined exit with reason killed in context start_error
2022-04-29 17:46:33.010 [error] <0.1390.0> ** Generic server mnesia_recover terminating
** Last message in was {'EXIT',<0.1386.0>,killed}
** When Server state == {state,<0.1386.0>,undefined,undefined,undefined,0,false,true,[]}
** Reason for termination ==
** killed
2022-04-29 17:46:33.011 [error] <0.1388.0> ** Generic server mnesia_subscr terminating
** Last message in was {'EXIT',<0.1386.0>,killed}
** When Server state == {state,<0.1386.0>,#Ref<0.1445991763.2944532481.89008>}
** Reason for termination ==
** killed
2022-04-29 17:46:33.011 [error] <0.1390.0> CRASH REPORT Process mnesia_recover with 0 neighbours exited with reason: killed in gen_server:decode_msg/9 line 475
2022-04-29 17:46:33.011 [error] <0.1388.0> CRASH REPORT Process mnesia_subscr with 0 neighbours exited with reason: killed in gen_server:decode_msg/9 line 475
2022-04-29 17:46:33.011 [error] <0.1387.0> ** Generic server mnesia_monitor terminating
** Last message in was {'EXIT',<0.1386.0>,killed}
** When Server state == {state,<0.1386.0>,[],[],true,[],undefined,[],[]}
** Reason for termination ==
** killed
2022-04-29 17:46:33.014 [error] <0.1387.0> CRASH REPORT Process mnesia_monitor with 0 neighbours exited with reason: killed in gen_server:decode_msg/9 line 475
2022-04-29 17:46:33.014 [error] <0.1381.0> CRASH REPORT Process <0.1381.0> with 0 neighbours exited with reason: {{shutdown,{failed_to_start_child,mnesia_kernel_sup,killed}},{mnesia_app,start,[normal,[]]}} in application_master:init/4 line 138
2022-04-29 17:46:33.014 [info] <0.44.0> Application mnesia exited with reason: {{shutdown,{failed_to_start_child,mnesia_kernel_sup,killed}},{mnesia_app,start,[normal,[]]}}
2022-04-29 17:46:33.014 [error] <0.1306.0>
2022-04-29 17:46:33.014 [error] <0.1306.0> BOOT FAILED
2022-04-29 17:46:33.015 [error] <0.1306.0> ===========
2022-04-29 17:46:33.015 [error] <0.1306.0> Exception during startup:
2022-04-29 17:46:33.015 [error] <0.1306.0>
2022-04-29 17:46:33.015 [error] <0.1306.0>     application_master:start_it_old/4 line 277
2022-04-29 17:46:33.015 [error] <0.1306.0>     rabbit:start/2 line 819
2022-04-29 17:46:33.015 [error] <0.1306.0>     rabbit:run_prelaunch_second_phase/0 line 349
2022-04-29 17:46:33.015 [error] <0.1306.0> error:{badmatch,
2022-04-29 17:46:33.015 [error] <0.1306.0>           {error,
2022-04-29 17:46:33.015 [error] <0.1306.0>               {{shutdown,{failed_to_start_child,mnesia_kernel_sup,killed}},
2022-04-29 17:46:33.015 [error] <0.1306.0>                {mnesia_app,start,[normal,[]]}}}}
2022-04-29 17:46:33.015 [error] <0.1306.0>
2022-04-29 17:46:34.016 [error] <0.1305.0> CRASH REPORT Process <0.1305.0> with 0 neighbours exited with reason: {{badmatch,{error,{{shutdown,{failed_to_start_child,mnesia_kernel_sup,killed}},{mnesia_app,start,[normal,[]]}}}},{rabbit,start,[normal,[]]}} in application_master:init/4 line 138
2022-04-29 17:46:34.016 [info] <0.44.0> Application rabbit exited with reason: {{badmatch,{error,{{shutdown,{failed_to_start_child,mnesia_kernel_sup,killed}},{mnesia_app,start,[normal,[]]}}}},{rabbit,start,[normal,[]]}}
2022-04-29 17:46:34.019 [info] <0.44.0> Application sysmon_handler exited with reason: stopped
2022-04-29 17:46:34.022 [info] <0.44.0> Application ra exited with reason: stopped
2022-04-29 17:46:34.024 [info] <0.44.0> Application os_mon exited with reason: stopped

---

Kari Niemi

May 4, 2022, 3:14:07 AM
to rabbitmq-users
Further investigation:
 * The man page of rabbitmqctl says under "join_cluster" that "Before clustering, the node is reset, so be careful when using this command."
 * Still, the documentation conflicts with this and insists that the user should do the reset manually: https://www.rabbitmq.com/clustering.html#creating
 * We are currently not performing a reset; we are not sure if we should.
 * Reproducing the failure: I ran a loop over our RabbitMQ installation (remove everything, then re-install again; rough sketch below). I was already losing hope of reproducing the issue, but at round #94 it failed with the discussed error. A round takes ~1 min, so I certainly did not have my eye on the logs for the full 1.5 hours :-D
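
The loop itself was roughly this (a sketch only; "site.yml" and the tags are placeholders for our actual playbook and roles):
```
#!/usr/bin/env bash
# Rough sketch of the reproduction loop; "site.yml" and the tags are
# placeholders for our actual Ansible playbook that wipes and re-installs
# the three nodes and forms the cluster.
set -euo pipefail

for round in $(seq 1 200); do
  echo "=== round ${round} ==="
  ansible-playbook site.yml --tags "rabbitmq_wipe,rabbitmq_install" \
    || { echo "cluster formation failed on round ${round}"; exit 1; }
done
```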

Kari Niemi

May 5, 2022, 2:19:06 AM
to rabbitmq-users
Further investigation:
* I tried adding "rabbitmqctl reset" between stop_app and join_cluster (sketch after this list). I still got the problem after a few tens of test rounds.
  * I happened to run "rabbitmqctl cluster_status" on one of the nodes, and that was when the problem occurred.
* I tried a few more rounds and noticed that the problem might be hit more often if "rabbitmqctl cluster_status" is run during the join_cluster.
  * We usually don't run it - and in CI we never run it during cluster formation.
  * It's not apparent whether:
    1) the ~10 s window in the crash cases just makes it more likely that a "rabbitmqctl cluster_status" call happens to land in that window (and there is no cause-effect relation);
    2) "rabbitmqctl cluster_status" merely adds load / changes the timing so the problem happens more often, but does not actually logically interfere with the join_cluster operations;
    3) "rabbitmqctl cluster_status" actually performs some action similar to the concurrent join_cluster operations, and therefore triggers the very same problem more often; or
    4) there is no relation at all.
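
For clarity, this is roughly the sequence I was testing (a sketch only; the cluster_status loop on rbn-1 is there just to change the timing and is not part of our normal setup):
```
#!/usr/bin/env bash
# Sketch of the modified join sequence (now with an explicit reset),
# executed on rbn-2 and rbn-3 in parallel. While this runs, a loop of
#   while true; do rabbitmqctl cluster_status; sleep 1; done
# on rbn-1 seems to make the failure more likely.
set -euo pipefail

rabbitmqctl stop_app
rabbitmqctl reset
rabbitmqctl join_cluster rabbit@rbn-1
rabbitmqctl start_app   # still fails sporadically
```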

I started this discussion because the rabbitmq-server GitHub project asks that bug reports not be raised there directly, but that issues first be discussed and analysed here. Waiting for someone to join the discussion.

Kari Niemi

May 11, 2022, 2:22:52 AM
to rabbitmq-users
Any comments? How many defects were revealed here?

Kari Niemi

Jun 16, 2022, 4:50:41 AM
to rabbitmq-users
None?

Luke Bakken

Jun 16, 2022, 10:14:12 AM
to rabbitmq-users
Hi Kari,

I don't think we ever test concurrent join_cluster operations. Thank you for your detailed analysis. If you'd like, please feel free to open an issue at https://github.com/rabbitmq/rabbitmq-server

However, in a future version of RabbitMQ, clustering will be managed by our rabbitmq/khepri project rather than Mnesia, so we may not spend time fixing this issue.

In general, we recommend that only one join_cluster operation happens at a time.
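
For example, something along these lines (a rough sketch, using the node names from this thread) drives the joins one node at a time from a single host and waits for each node to finish starting before the next join begins:
```
#!/usr/bin/env bash
# Rough sketch: join nodes to the cluster one at a time instead of in
# parallel, waiting for each node to finish starting before the next.
set -euo pipefail

for node in rabbit@rbn-2 rabbit@rbn-3; do
  rabbitmqctl -n "$node" stop_app
  rabbitmqctl -n "$node" join_cluster rabbit@rbn-1
  rabbitmqctl -n "$node" start_app
  rabbitmqctl -n "$node" await_startup
done
```
In an Ansible setup the same effect can be achieved by running the play that performs the join with serial: 1 rather than on both nodes in parallel.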

The root cause of this issue may actually be something in mnesia itself - either a bug or a limitation.

Thanks,
Luke