RabbitMQ Cluster - Node core dumps on join


schhibber

Aug 27, 2015, 3:37:50 PM
to rabbitmq-users
Hi, I am looking for some input or help diagnosing an issue we have been running into for a while now.

We run a number of RMQ clusters that cycle daily in AWS. When we bootstrap a cluster, we occasionally get instances that fail to join it: during the join, Rabbit core dumps. To restate: this only happens occasionally, and I suspect we are hitting some kind of race condition when multiple nodes try to join the cluster at the same time. That said, it would be nice to have our clusters come up in a proper state consistently.

We are using the following on all the nodes:
RabbitMQ 3.5.3, Erlang R16B03

One node is strictly a management node, and all other systems join the cluster via the management node.
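
For reference, a join via the management node boils down to the standard rabbitmqctl stop_app / join_cluster / start_app sequence. Below is a minimal sketch of that sequence with staggered retries (the hostname, attempt count, and back-off are placeholders, and this is not our actual bootstrap script); retries would only ride out transient failures, not fix the underlying core dump.

#!/usr/bin/env bash
# Sketch: join this node to the cluster via the management node, retrying
# with back-off so that simultaneous joiners are less likely to collide.
set -euo pipefail

MGMT_NODE="rabbit@mgmt-host"   # placeholder management-node name
MAX_ATTEMPTS=5
joined=0

rabbitmqctl stop_app

for attempt in $(seq 1 "$MAX_ATTEMPTS"); do
    if rabbitmqctl join_cluster "$MGMT_NODE"; then
        joined=1
        break
    fi
    echo "join attempt ${attempt} failed; resetting and backing off" >&2
    rabbitmqctl reset            # clear any half-merged Mnesia state
    sleep $(( attempt * 5 ))     # stagger retries across nodes
done

[ "$joined" -eq 1 ] || { echo "could not join ${MGMT_NODE}" >&2; exit 1; }

rabbitmqctl start_app
rabbitmqctl cluster_status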

Once again, if anyone can provide some input or help, it would be appreciated. If you have questions or requests for more information, let me know and I will gladly dig it up.
I have posted the logs from an instance that crashed below; if needed, I can get someone access to the core dump.

Thank you,
Sono

=INFO REPORT==== 27-Aug-2015::13:40:29 ===
Clustering with ['rabbit@ip-10-53-33-11'] as disc node

=ERROR REPORT==== 27-Aug-2015::13:40:30 ===
Mnesia('rabbit@ip-10-53-61-49'): ** ERROR ** (core dumped to file: "/var/lib/rabbitmq/MnesiaCore.rabbit@ip-10-53-61-49_1440_682830_510419")
 ** FATAL ** Failed to merge schema: {aborted,
                                      {combine_error,schema,
                                       ['rabbit@ip-10-53-39-121',
                                        'rabbit@ip-10-53-61-49',
                                        'rabbit@ip-10-53-39-121',
                                        'rabbit@ip-10-53-33-11']}}

=ERROR REPORT==== 27-Aug-2015::13:40:40 ===
** Generic server mnesia_monitor terminating
** Last message in was {'EXIT',<0.721.0>,killed}
** When Server state == {state,<0.721.0>,[],[],true,[],undefined,[]}
** Reason for termination ==
** killed

=ERROR REPORT==== 27-Aug-2015::13:40:40 ===
** Generic server mnesia_recover terminating
** Last message in was {'EXIT',<0.721.0>,killed}
** When Server state == {state,<0.721.0>,undefined,undefined,undefined,0,
                               false,true,[]}
** Reason for termination ==
** killed

=ERROR REPORT==== 27-Aug-2015::13:40:40 ===
** Generic server mnesia_snmp_sup terminating
** Last message in was {'EXIT',<0.721.0>,killed}
** When Server state == {state,
                            {local,mnesia_snmp_sup},
                            simple_one_for_one,
                            [{child,undefined,mnesia_snmp_sup,
                                 {mnesia_snmp_hook,start,[]},
                                 transient,3000,worker,
                                 [mnesia_snmp_sup,mnesia_snmp_hook,
                                  supervisor]}],
                            undefined,0,86400000,[],mnesia_snmp_sup,[]}
** Reason for termination ==
** killed

=ERROR REPORT==== 27-Aug-2015::13:40:40 ===
** Generic server mnesia_subscr terminating
** Last message in was {'EXIT',<0.721.0>,killed}
** When Server state == {state,<0.721.0>,696381}
** Reason for termination ==
** killed

=INFO REPORT==== 27-Aug-2015::13:40:40 ===
Error description:
   {could_not_start,mnesia,
       {{shutdown,{failed_to_start_child,mnesia_kernel_sup,killed}},
        {mnesia_sup,start,[normal,[]]}}}

Log files (may contain more information):
   /var/log/rabbitmq/rab...@ip-10-53-61-49.log
   /var/log/rabbitmq/rab...@ip-10-53-61-49-sasl.log

=ERROR REPORT==== 27-Aug-2015::13:40:42 ===
Mnesia('rabbit@ip-10-53-61-49'): ** ERROR ** (core dumped to file: "/var/lib/rabbitmq/MnesiaCore.rabbit@ip-10-53-61-49_1440_682842_523320")
 ** FATAL ** mnesia_tm crashed: {killed,
                                 {gen_server,call,
                                  [<0.722.0>,
                                   {close_log,latest_log},
                                   infinity]}} state: [<0.721.0>]

Michael Klishin

Aug 27, 2015, 4:02:30 PM
to rabbitm...@googlegroups.com
Is this in a completely new cluster? If so, I'd try Erlang 18.0, as this seems to be an issue in a part of OTP.
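
To confirm what a node is actually running, rabbitmqctl can evaluate an expression on it, e.g.:

rabbitmqctl eval 'erlang:system_info(otp_release).'

which prints the OTP release ("R16B03", "18", and so on).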

The rabbitmq-clusterer and autocluster plugins may also be worth investigating. They do not replace Mnesia for cluster membership, but they make it easier to automate cluster formation and to (re)start nodes in any order.
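
For autocluster, installation should amount to dropping the .ez files into the plugins directory and enabling the plugin on each node (the plugin name below is from its README, as are the AWS-specific settings, so verify against it):

rabbitmq-plugins enable autocluster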

MK

schhibber

Aug 27, 2015, 5:50:43 PM
to rabbitmq-users
Yeah, it's a completely new cluster. Each time we spin up, we recreate the entire cluster.

Going to give the latest Erlang package a try and see how it goes!