Auto clustering timeout issue


Skrzypek, Jonathan

Apr 18, 2016, 9:16:31 AM
to rabbitm...@googlegroups.com

Hi,

 

I'm facing an issue with auto-clustering.

When deploying a cluster with a rabbitmq.config that lists the nodes correctly, auto-clustering sometimes appears to time out.

My nodes don't all come up at the same time; it can take 10 to 15 seconds for all of them to be up and running.

Here is an example of a 3-node cluster where I end up with a 2-node cluster and one node left on its own.

 

Logs from node that ends up not being clustered (NODE_2) :

 

=INFO REPORT==== 18-Apr-2016::08:59:30 ===
Database directory at /local/data/rabbitmq/NODE_2/mnesia is empty. Initialising from scratch...

=WARNING REPORT==== 18-Apr-2016::08:59:30 ===
Could not auto-cluster with NODE_1@rabbit-004: {badrpc,nodedown}

=WARNING REPORT==== 18-Apr-2016::08:59:30 ===
Could not auto-cluster with NODE_3@rabbit-003: {error,tables_not_present}

=WARNING REPORT==== 18-Apr-2016::08:59:30 ===
Could not find any node for auto-clustering from: ['NODE_1@rabbit-004',
                                                   'NODE_3@rabbit-003',
                                                   'NODE_2@rabbit-006']
Starting blank node...

 

=INFO REPORT==== 18-Apr-2016::08:59:30 ===
Priority queues enabled, real BQ is rabbit_variable_queue

=INFO REPORT==== 18-Apr-2016::08:59:30 ===
Management plugin: using rates mode 'basic'

=INFO REPORT==== 18-Apr-2016::08:59:30 ===
msg_store_transient: using rabbit_msg_store_ets_index to provide index

=INFO REPORT==== 18-Apr-2016::08:59:30 ===
msg_store_persistent: using rabbit_msg_store_ets_index to provide index

=WARNING REPORT==== 18-Apr-2016::08:59:30 ===
msg_store_persistent: rebuilding indices from scratch

=INFO REPORT==== 18-Apr-2016::08:59:30 ===
Adding vhost '/'

=INFO REPORT==== 18-Apr-2016::08:59:30 ===
Creating user 'guest'

=INFO REPORT==== 18-Apr-2016::08:59:30 ===
Setting user tags for user 'guest' to [administrator]

=INFO REPORT==== 18-Apr-2016::08:59:30 ===
Setting permissions for 'guest' in '/' to '.*', '.*', '.*'

=INFO REPORT==== 18-Apr-2016::08:59:31 ===
started TCP Listener on 0.0.0.0:29000

=INFO REPORT==== 18-Apr-2016::08:59:31 ===
Management plugin started. Port: 29001

=INFO REPORT==== 18-Apr-2016::08:59:31 ===
Server startup complete; 6 plugins started.
* rabbitmq_management
* rabbitmq_web_dispatch
* webmachine
* mochiweb
* rabbitmq_management_agent
* amqp_client

=INFO REPORT==== 18-Apr-2016::08:59:45 ===
node 'NODE_1@rabbit-004' up

 

What's interesting is that the node does detect NODE_1 coming up, yet it still doesn't join the cluster.

 

I noticed this in the rabbitmq-sasl log:

 

=SUPERVISOR REPORT==== 18-Apr-2016::08:59:31 ===
     Supervisor: {local,rabbit_mgmt_sup}
     Context:    child_terminated
     Reason:     {shutdown,[{{already_started,<3032.360.0>},
                             {child,undefined,rabbit_mgmt_db,
                                    {rabbit_mgmt_db,start_link,[]},
                                    permanent,4294967295,worker,
                                    [rabbit_mgmt_db]}}]}
     Offender:   [{pid,<0.356.0>},
                  {name,mirroring},
                  {mfargs,
                      {mirrored_supervisor,start_internal,
                          [rabbit_mgmt_sup,
                           #Fun<rabbit_misc.execute_mnesia_transaction.1>,
                           [{rabbit_mgmt_db,
                                {rabbit_mgmt_db,start_link,[]},
                                permanent,4294967295,worker,
                                [rabbit_mgmt_db]}]]}},
                  {restart_type,permanent},
                  {shutdown,4294967295},
                  {child_type,worker}]

 

 

=SUPERVISOR REPORT==== 18-Apr-2016::08:59:31 ===
     Supervisor: {local,rabbit_mgmt_sup}
     Context:    shutdown
     Reason:     reached_max_restart_intensity
     Offender:   [{pid,<0.356.0>},
                  {name,mirroring},
                  {mfargs,
                      {mirrored_supervisor,start_internal,
                          [rabbit_mgmt_sup,
                           #Fun<rabbit_misc.execute_mnesia_transaction.1>,
                           [{rabbit_mgmt_db,
                                {rabbit_mgmt_db,start_link,[]},
                                permanent,4294967295,worker,
                                [rabbit_mgmt_db]}]]}},
                  {restart_type,permanent},
                  {shutdown,4294967295},
                  {child_type,worker}]

 

In the same cluster, NODE_3 also started by logging:

 

=WARNING REPORT==== 18-Apr-2016::08:59:30 ===
Could not auto-cluster with NODE_1@rabbit-004: {badrpc,nodedown}

=WARNING REPORT==== 18-Apr-2016::08:59:30 ===
Could not auto-cluster with NODE_2@rabbit-006: {error,tables_not_present}

=WARNING REPORT==== 18-Apr-2016::08:59:30 ===
Could not find any node for auto-clustering from: ['NODE_1@rabbit-004',
                                                   'NODE_3@rabbit-003',
                                                   'NODE_2@rabbit-006']
Starting blank node...

 

But then, later in the logs:

 

=INFO REPORT==== 18-Apr-2016::08:59:45 ===
node 'NODE_1@rabbit-004' up

=INFO REPORT==== 18-Apr-2016::08:59:45 ===
rabbit on node 'NODE_1@rabbit-004' up

 

NODE_1 is the one that started last and, as such, clustered properly with NODE_3.

 

I tried setting {kernel, [{net_setuptime, 30}]}, but I'm not sure whether that's the timeout auto-clustering uses when waiting for the mnesia tables?
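As a possible deployment-side workaround, I'm thinking of having the deploy script poll each peer before starting the local broker, since auto-clustering appears to make a single pass over the configured nodes at boot. A rough sketch only: `probe_peer` is a stub standing in for a real check such as `rabbitmqctl -n <peer> status` (here it pretends the peer comes up on the 3rd attempt):

```shell
#!/bin/sh
# Sketch: poll a peer until it answers, bounded by a maximum number of
# attempts, before starting the local broker. probe_peer is a stand-in
# for a real reachability check; this stub succeeds once ATTEMPT >= 3.
probe_peer() {
    [ "${ATTEMPT:-0}" -ge 3 ]
}

wait_for_peer() {
    peer=$1
    max_attempts=$2
    ATTEMPT=0
    while [ "$ATTEMPT" -lt "$max_attempts" ]; do
        if probe_peer "$peer"; then
            echo "peer $peer reachable after $ATTEMPT retries"
            return 0
        fi
        ATTEMPT=$((ATTEMPT + 1))
        # sleep 1  # a real script would pause between probes
    done
    echo "timed out waiting for $peer"
    return 1
}

wait_for_peer 'NODE_1@rabbit-004' 10
```

With something like this in front of each broker start, the first auto-cluster pass should find at least one peer already up instead of starting a blank node.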

 

rabbitmq.config for all nodes:

 

[
 {rabbit, [{tcp_listeners, [29000]},
           {collect_statistics_interval, 30000},
           {loopback_users, []},
           {cluster_nodes, {['NODE_1@rabbit-004', 'NODE_3@rabbit-003', 'NODE_2@rabbit-006'], disc}},
           {cluster_partition_handling, pause_minority},
           {disk_free_limit, 1073741824},
           {vm_memory_high_watermark, 0.4}
          ]},
 {rabbitmq_management, [{listener, [{port, 29001}]}]},
 {kernel, [{net_setuptime, 30}]}
].

 

Any ideas?

Michael Klishin

Apr 18, 2016, 9:44:36 AM
to rabbitm...@googlegroups.com

--
You received this message because you are subscribed to the Google Groups "rabbitmq-users" group.



--
MK

Staff Software Engineer, Pivotal/RabbitMQ

Skrzypek, Jonathan

Apr 18, 2016, 10:15:59 AM
to rabbitm...@googlegroups.com

OK, so this explains the entries in the rabbitmq-sasl log.

This seems to have been fixed in 3.5.7, though? I'm running 3.5.7 already.

Skrzypek, Jonathan

Apr 18, 2016, 2:09:23 PM
to rabbitm...@googlegroups.com

Sent too fast, sorry.

I'm running 3.5.7 already but still seeing the issue.

The GitHub issue says it happens when the nodes have communicated beforehand, but in my case they should be fresh nodes with a fresh, empty mnesia.

Michael Klishin

Apr 18, 2016, 2:44:12 PM
to rabbitm...@googlegroups.com
FWIW we haven't seen this reported against 3.6.x and the solution is basically
to let a one-off process fail if another node already hosts the stats DB, instead
of failing the entire plugin process tree initialisation.

Skrzypek, Jonathan

Apr 19, 2016, 5:46:31 AM
to rabbitm...@googlegroups.com

Alright, this happens when I run tests where a sequence of brokers with the same name is installed and uninstalled on the same box.

Every time, the local broker files are deleted, so the node starts with a fresh mnesia, but EPMD is still running in the background.

So even if the node is "fresh", it's virtually the same node with the same Erlang node name, so EPMD might consider it the same node.

Could this be linked?
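For reference, this is the kind of pre-deploy check I could add. A sketch only: `epmd -names` and `epmd -kill` are standard Erlang/OTP epmd flags, and the guard keeps it harmless on hosts without Erlang installed:

```shell
#!/bin/sh
# Sketch of a pre-deploy EPMD check: if an epmd from a previous install is
# still running, it may hold state for an old node of the same name. Listing
# registered names makes any stale state visible; stopping epmd (only when
# no other Erlang nodes depend on it) clears it.
if command -v epmd >/dev/null 2>&1; then
    epmd -names || true   # list node names registered with the local epmd
    # epmd -kill          # uncomment to stop epmd once no live nodes remain
    status="epmd present, names listed above"
else
    status="epmd not found on this host, nothing to clean up"
fi
echo "$status"
```

This would at least confirm whether a leftover registration is involved before each redeploy.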

 

I will revert to 3.5.0 to see whether it occurs there too.

I unfortunately haven't packaged 3.6.X yet, and it might take a while before I can leverage it, as it requires Erlang 18 too.

 

Thanks for the pointers

Michael Klishin

Apr 19, 2016, 5:58:44 AM
to rabbitm...@googlegroups.com
3.6.x does not require Erlang 18. See http://rabbitmq.com/which-erlang.html

Your hypothesis sounds plausible. We need to think if there is a workaround.

Skrzypek, Jonathan

Apr 22, 2016, 10:05:10 AM
to rabbitm...@googlegroups.com

I'm hitting the issue quite frequently, and killing EPMD before every broker deployment is not really a viable option.

We can't really build and deploy 3.6.X anytime soon, so if you think it's fixed in that major release, do you have specific patches in mind?

We could try to patch 3.5.7 with the corresponding changes.
