concurrent "rabbitmqctl join_cluster + start_app" from 2 nodes fails sporadically


Kari Niemi

May 2, 2022, 7:19:20 AM
to rabbitmq-users
Hi,

In our CI rounds, "rabbitmqctl join_cluster + start_app" from 2 nodes fails sporadically. Only a small percentage of CI builds fail with this problem. The symptoms are the same as in earlier, unresolved discussions/bug reports.

The symptoms that we are seeing:
  • "rabbitmqctl join_cluster + start_app" fails sometimes when two nodes are joining the cluster at the same moment (we are not sure whether the root cause is the concurrent join, though)
  • one of the nodes is listed twice in the error message (is this a separate problem, or a symptom/cause of the first one?)
We set up the cluster using Ansible:
  • all 3 nodes in parallel (on clean virtual machines):
    • install the rpm
    • configure NODENAME in rabbitmq-env.conf
    • start the service
    • create a new rabbit user (and delete the guest user) using rabbitmqctl
  • 2 nodes in parallel (see the sketch after this list):
    • rabbitmqctl stop_app
    • rabbitmqctl join_cluster rabbit@rbn-1
    • rabbitmqctl start_app  # this fails sporadically
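
For reference, a minimal shell sketch of the join sequence each joining node runs (rbn-2 and rbn-3 execute this at roughly the same moment against rbn-1):
```
#!/usr/bin/env bash
# Minimal sketch of the per-node join sequence; in our Ansible setup
# rbn-2 and rbn-3 run this at roughly the same time against rbn-1.
set -euo pipefail

rabbitmqctl stop_app
rabbitmqctl join_cluster rabbit@rbn-1
rabbitmqctl start_app   # fails sporadically with the mnesia error below
```
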
The error message from rabbitmqctl:
```
{:rabbit, {{:badmatch, {:error, {{:shutdown, {:failed_to_start_child, :mnesia_kernel_sup, :killed}}, {:mnesia_app, :start, [:normal, []]}}}}, {:rabbit, :start, [:normal, []]}}}
```

The error message in the rabbit@rbn-2 instance log (in which rbn-3 is, for some reason, listed twice):
```
2022-04-29 17:46:23.008 [error] <0.1384.0> Mnesia('rabbit@rbn-2'): ** ERROR ** (core dumped to file: "/var/lib/rabbitmq/MnesiaCore.rabbit@rbn-2_1651_243583_7909")
 ** FATAL ** Failed to merge schema: {aborted,{combine_error,schema,['rabbit@rbn-3','rabbit@rbn-3','rabbit@rbn-2','rabbit@rbn-1']}}
```

Versions:
  • rabbitmq-server.noarch:3.8.27-1.el7
  • erlang.x86_64:23.3.4.11-1.el7
  • rhel7

Kari Niemi

May 3, 2022, 9:16:30 AM
to rabbitmq-users
A complete "stack-trace" for the crash looks like this:
---
2022-04-29 17:46:22.943 [info] <0.44.0> Application mnesia started on node 'rabbit@rbn-2'
2022-04-29 17:46:22.945 [info] <0.44.0> Application mnesia exited with reason: stopped
2022-04-29 17:46:22.946 [info] <0.44.0> Application mnesia exited with reason: stopped

2022-04-29 17:46:23.008 [error] <0.1384.0> Mnesia('rabbit@rbn-2'): ** ERROR ** (core dumped to file: "/var/lib/rabbitmq/MnesiaCore.rabbit@rbn-2_1651_243583_7909")
 ** FATAL ** Failed to merge schema: {aborted,{combine_error,schema,['rabbit@rbn-3','rabbit@rbn-3','rabbit@rbn-2','rabbit@rbn-1']}}
2022-04-29 17:46:23.292 [debug] <0.1314.0> Lager installed handler lager_backend_throttle into lager_event
2022-04-29 17:46:33.010 [info] <0.1383.0> supervisor: {local,mnesia_sup}, errorContext: start_error, reason: killed, offender: [{pid,undefined},{id,mnesia_kernel_sup},{mfargs,{mnesia_kernel_sup,start,[]}},{restart_type,permanent},{shutdown,infinity},{child_type,supervisor}]
2022-04-29 17:46:33.010 [error] <0.1383.0> Supervisor mnesia_sup had child mnesia_kernel_sup started with mnesia_kernel_sup:start() at undefined exit with reason killed in context start_error
2022-04-29 17:46:33.010 [error] <0.1390.0> ** Generic server mnesia_recover terminating
** Last message in was {'EXIT',<0.1386.0>,killed}
** When Server state == {state,<0.1386.0>,undefined,undefined,undefined,0,false,true,[]}
** Reason for termination ==
** killed
2022-04-29 17:46:33.011 [error] <0.1388.0> ** Generic server mnesia_subscr terminating
** Last message in was {'EXIT',<0.1386.0>,killed}
** When Server state == {state,<0.1386.0>,#Ref<0.1445991763.2944532481.89008>}
** Reason for termination ==
** killed
2022-04-29 17:46:33.011 [error] <0.1390.0> CRASH REPORT Process mnesia_recover with 0 neighbours exited with reason: killed in gen_server:decode_msg/9 line 475
2022-04-29 17:46:33.011 [error] <0.1388.0> CRASH REPORT Process mnesia_subscr with 0 neighbours exited with reason: killed in gen_server:decode_msg/9 line 475
2022-04-29 17:46:33.011 [error] <0.1387.0> ** Generic server mnesia_monitor terminating
** Last message in was {'EXIT',<0.1386.0>,killed}
** When Server state == {state,<0.1386.0>,[],[],true,[],undefined,[],[]}
** Reason for termination ==
** killed
2022-04-29 17:46:33.014 [error] <0.1387.0> CRASH REPORT Process mnesia_monitor with 0 neighbours exited with reason: killed in gen_server:decode_msg/9 line 475
2022-04-29 17:46:33.014 [error] <0.1381.0> CRASH REPORT Process <0.1381.0> with 0 neighbours exited with reason: {{shutdown,{failed_to_start_child,mnesia_kernel_sup,killed}},{mnesia_app,start,[normal,[]]}} in application_master:init/4 line 138
2022-04-29 17:46:33.014 [info] <0.44.0> Application mnesia exited with reason: {{shutdown,{failed_to_start_child,mnesia_kernel_sup,killed}},{mnesia_app,start,[normal,[]]}}
2022-04-29 17:46:33.014 [error] <0.1306.0>
2022-04-29 17:46:33.014 [error] <0.1306.0> BOOT FAILED
2022-04-29 17:46:33.015 [error] <0.1306.0> ===========
2022-04-29 17:46:33.015 [error] <0.1306.0> Exception during startup:
2022-04-29 17:46:33.015 [error] <0.1306.0>
2022-04-29 17:46:33.015 [error] <0.1306.0>     application_master:start_it_old/4 line 277
2022-04-29 17:46:33.015 [error] <0.1306.0>     rabbit:start/2 line 819
2022-04-29 17:46:33.015 [error] <0.1306.0>     rabbit:run_prelaunch_second_phase/0 line 349
2022-04-29 17:46:33.015 [error] <0.1306.0> error:{badmatch,
2022-04-29 17:46:33.015 [error] <0.1306.0>           {error,
2022-04-29 17:46:33.015 [error] <0.1306.0>               {{shutdown,{failed_to_start_child,mnesia_kernel_sup,killed}},
2022-04-29 17:46:33.015 [error] <0.1306.0>                {mnesia_app,start,[normal,[]]}}}}
2022-04-29 17:46:33.015 [error] <0.1306.0>
2022-04-29 17:46:34.016 [error] <0.1305.0> CRASH REPORT Process <0.1305.0> with 0 neighbours exited with reason: {{badmatch,{error,{{shutdown,{failed_to_start_child,mnesia_kernel_sup,killed}},{mnesia_app,start,[normal,[]]}}}},{rabbit,start,[normal,[]]}} in application_master:init/4 line 138
2022-04-29 17:46:34.016 [info] <0.44.0> Application rabbit exited with reason: {{badmatch,{error,{{shutdown,{failed_to_start_child,mnesia_kernel_sup,killed}},{mnesia_app,start,[normal,[]]}}}},{rabbit,start,[normal,[]]}}
2022-04-29 17:46:34.019 [info] <0.44.0> Application sysmon_handler exited with reason: stopped
2022-04-29 17:46:34.022 [info] <0.44.0> Application ra exited with reason: stopped
2022-04-29 17:46:34.024 [info] <0.44.0> Application os_mon exited with reason: stopped

---

Kari Niemi

May 4, 2022, 3:14:07 AM
to rabbitmq-users
Further investigation:
 * The man page of rabbitmqctl says under "join_cluster" that "Before clustering, the node is reset, so be careful when using this command."
 * Still, the documentation conflicts with this and insists that the user should do the reset manually: https://www.rabbitmq.com/clustering.html#creating
 * We are currently not performing a reset; we are not sure if we should.
 * Reproducing the failure: I ran a loop over our RabbitMQ installation (remove everything, then re-install again; rough sketch below). I was already losing hope of reproducing the issue, but at round #94 it failed with the discussed error. A round takes ~1 min, so I certainly did not have my eye on the logs for the full 1.5 hours :-D
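
The loop itself was roughly this (a sketch only; "site.yml" and the tags are placeholders for our actual playbook and roles):
```
#!/usr/bin/env bash
# Rough sketch of the reproduction loop; "site.yml" and the tags are
# placeholders for our actual Ansible playbook that wipes and re-installs
# the three nodes and forms the cluster.
set -euo pipefail

for round in $(seq 1 200); do
  echo "=== round ${round} ==="
  ansible-playbook site.yml --tags "rabbitmq_wipe,rabbitmq_install" \
    || { echo "cluster formation failed on round ${round}"; exit 1; }
done
```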

Kari Niemi

May 5, 2022, 2:19:06 AM
to rabbitmq-users
Further investigation:
* I tried adding "rabbitmqctl reset" between stop_app and join_cluster (sketch after this list). I still got the problem after a few tens of test rounds.
  * I happened to run "rabbitmqctl cluster_status" on one of the nodes, and that was when the problem occurred.
* I tried a few more rounds and noticed that the problem might be hit more often if "rabbitmqctl cluster_status" is run during the join_cluster.
  * We usually don't run it - and in CI we never run it during cluster formation.
  * It's not apparent whether:
    1) the ~10 s window in the crash cases just makes it more likely that a "rabbitmqctl cluster_status" call happens to land in that window (and there is no cause-effect relation);
    2) "rabbitmqctl cluster_status" merely adds load / changes the timing so the problem happens more often, but does not actually logically interfere with the join_cluster operations;
    3) "rabbitmqctl cluster_status" actually performs some action similar to the concurrent join_cluster operations, and therefore triggers the very same problem more often; or
    4) there is no relation at all.
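
For clarity, this is roughly the sequence I was testing (a sketch only; the cluster_status loop on rbn-1 is there just to change the timing and is not part of our normal setup):
```
#!/usr/bin/env bash
# Sketch of the modified join sequence (now with an explicit reset),
# executed on rbn-2 and rbn-3 in parallel. While this runs, a loop of
#   while true; do rabbitmqctl cluster_status; sleep 1; done
# on rbn-1 seems to make the failure more likely.
set -euo pipefail

rabbitmqctl stop_app
rabbitmqctl reset
rabbitmqctl join_cluster rabbit@rbn-1
rabbitmqctl start_app   # still fails sporadically
```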

I started this discussion because the rabbitmq-server GitHub project asks that bug reports not be raised there directly, but that issues first be discussed and analysed here. Waiting for someone to join the discussion.

Kari Niemi

May 11, 2022, 2:22:52 AM
to rabbitmq-users
Any comments? How many defects were revealed here?

Kari Niemi

Jun 16, 2022, 4:50:41 AM
to rabbitmq-users
None?

Luke Bakken

Jun 16, 2022, 10:14:12 AM
to rabbitmq-users
Hi Kari,

I don't think we ever test concurrent join_cluster operations. Thank you for your detailed analysis. If you'd like, please feel free to open an issue at https://github.com/rabbitmq/rabbitmq-server

However, in a future version of RabbitMQ, clustering will be managed by our rabbitmq/khepri project rather than Mnesia, so we may not spend time fixing this issue.

In general, we recommend that only one join_cluster operation happens at a time.
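
For example, something along these lines (a rough sketch, using the node names from this thread) drives the joins one node at a time from a single host and waits for each node to finish starting before the next join begins:
```
#!/usr/bin/env bash
# Rough sketch: join nodes to the cluster one at a time instead of in
# parallel, waiting for each node to finish starting before the next.
set -euo pipefail

for node in rabbit@rbn-2 rabbit@rbn-3; do
  rabbitmqctl -n "$node" stop_app
  rabbitmqctl -n "$node" join_cluster rabbit@rbn-1
  rabbitmqctl -n "$node" start_app
  rabbitmqctl -n "$node" await_startup
done
```
In an Ansible setup the same effect can be achieved by running the play that performs the join with serial: 1 rather than on both nodes in parallel.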

The root cause of this issue may actually be something in mnesia itself - either a bug or a limitation.

Thanks,
Luke