Hi,
I’m facing an issue with auto-clustering.
When I deploy a cluster with rabbitmq.config listing the nodes correctly, auto-clustering sometimes appears to time out.
The nodes don’t all come up at the same time; it can take 10 to 15 seconds for all of them to be up and running.
Here is an example of a 3-node cluster where I end up with a 2-node cluster and one node left on its own.
Logs from the node that ends up unclustered (NODE_2):
=INFO REPORT==== 18-Apr-2016::08:59:30 ===
Database directory at /local/data/rabbitmq/NODE_2/mnesia is empty. Initialising from scratch...
=WARNING REPORT==== 18-Apr-2016::08:59:30 ===
Could not auto-cluster with NODE_1@rabbit-004: {badrpc,nodedown}
=WARNING REPORT==== 18-Apr-2016::08:59:30 ===
Could not auto-cluster with NODE_3@rabbit-003: {error,tables_not_present}
=WARNING REPORT==== 18-Apr-2016::08:59:30 ===
Could not find any node for auto-clustering from: ['NODE_1@rabbit-004', 'NODE_3@rabbit-003', 'NODE_2@rabbit-006']
Starting blank node...
=INFO REPORT==== 18-Apr-2016::08:59:30 ===
Priority queues enabled, real BQ is rabbit_variable_queue
=INFO REPORT==== 18-Apr-2016::08:59:30 ===
Management plugin: using rates mode 'basic'
=INFO REPORT==== 18-Apr-2016::08:59:30 ===
msg_store_transient: using rabbit_msg_store_ets_index to provide index
=INFO REPORT==== 18-Apr-2016::08:59:30 ===
msg_store_persistent: using rabbit_msg_store_ets_index to provide index
=WARNING REPORT==== 18-Apr-2016::08:59:30 ===
msg_store_persistent: rebuilding indices from scratch
=INFO REPORT==== 18-Apr-2016::08:59:30 ===
Adding vhost '/'
=INFO REPORT==== 18-Apr-2016::08:59:30 ===
Creating user 'guest'
=INFO REPORT==== 18-Apr-2016::08:59:30 ===
Setting user tags for user 'guest' to [administrator]
=INFO REPORT==== 18-Apr-2016::08:59:30 ===
Setting permissions for 'guest' in '/' to '.*', '.*', '.*'
=INFO REPORT==== 18-Apr-2016::08:59:31 ===
started TCP Listener on 0.0.0.0:29000
=INFO REPORT==== 18-Apr-2016::08:59:31 ===
Management plugin started. Port: 29001
=INFO REPORT==== 18-Apr-2016::08:59:31 ===
Server startup complete; 6 plugins started.
* rabbitmq_management
* rabbitmq_web_dispatch
* webmachine
* mochiweb
* rabbitmq_management_agent
* amqp_client
=INFO REPORT==== 18-Apr-2016::08:59:45 ===
node 'NODE_1@rabbit-004' up
What’s interesting is that it detects that NODE_1 came up, but it still doesn’t join the cluster.
I noticed this in the rabbitmq-sasl log:
=SUPERVISOR REPORT==== 18-Apr-2016::08:59:31 ===
Supervisor: {local,rabbit_mgmt_sup}
Context: child_terminated
Reason: {shutdown,[{{already_started,<3032.360.0>},
{child,undefined,rabbit_mgmt_db,
{rabbit_mgmt_db,start_link,[]},
permanent,4294967295,worker,
[rabbit_mgmt_db]}}]}
Offender: [{pid,<0.356.0>},
{name,mirroring},
{mfargs,
{mirrored_supervisor,start_internal,
[rabbit_mgmt_sup,
#Fun<rabbit_misc.execute_mnesia_transaction.1>,
[{rabbit_mgmt_db,
{rabbit_mgmt_db,start_link,[]},
permanent,4294967295,worker,
[rabbit_mgmt_db]}]]}},
{restart_type,permanent},
{shutdown,4294967295},
{child_type,worker}]
=SUPERVISOR REPORT==== 18-Apr-2016::08:59:31 ===
Supervisor: {local,rabbit_mgmt_sup}
Context: shutdown
Reason: reached_max_restart_intensity
Offender: [{pid,<0.356.0>},
{name,mirroring},
{mfargs,
{mirrored_supervisor,start_internal,
[rabbit_mgmt_sup,
#Fun<rabbit_misc.execute_mnesia_transaction.1>,
[{rabbit_mgmt_db,
{rabbit_mgmt_db,start_link,[]},
permanent,4294967295,worker,
[rabbit_mgmt_db]}]]}},
{restart_type,permanent},
{shutdown,4294967295},
{child_type,worker}]
In the same cluster, NODE_3 also started with:
=WARNING REPORT==== 18-Apr-2016::08:59:30 ===
Could not auto-cluster with NODE_1@rabbit-004: {badrpc,nodedown}
=WARNING REPORT==== 18-Apr-2016::08:59:30 ===
Could not auto-cluster with NODE_2@rabbit-006: {error,tables_not_present}
=WARNING REPORT==== 18-Apr-2016::08:59:30 ===
Could not find any node for auto-clustering from: ['NODE_1@rabbit-004', 'NODE_3@rabbit-003', 'NODE_2@rabbit-006']
Starting blank node...
But then, later in its logs:
=INFO REPORT==== 18-Apr-2016::08:59:45 ===
node 'NODE_1@rabbit-004' up
=INFO REPORT==== 18-Apr-2016::08:59:45 ===
rabbit on node 'NODE_1@rabbit-004' up
NODE_1 is the one that started last, and as such it clustered properly with NODE_3.
I tried setting {kernel, [{net_setuptime, 30}]}, but I’m not sure whether that is the timeout auto-clustering uses while waiting for the Mnesia tables.
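As far as I can tell, net_setuptime only controls how long the Erlang distribution waits while setting up a connection to another node, not how long auto-clustering waits for Mnesia tables. A workaround I’m considering is a start wrapper that waits for the peers’ epmd port (4369 by default) to accept connections before launching the local broker, so auto-clustering finds at least one live peer. The host names are the ones from this cluster; the wrapper itself is hypothetical:

```shell
#!/usr/bin/env bash
# Hypothetical start wrapper: block until a host's epmd port accepts a
# TCP connection, using bash's built-in /dev/tcp (no external tools).
wait_for_port() {  # wait_for_port <host> <port> [timeout_seconds]
  local host=$1 port=$2 timeout=${3:-30} t=0
  until (exec 3<>"/dev/tcp/${host}/${port}") 2>/dev/null; do
    t=$((t + 1))
    [ "$t" -ge "$timeout" ] && return 1  # gave up waiting
    sleep 1
  done
  return 0
}

# Usage sketch (commented out; rabbit-004 and rabbit-003 are the peer
# hosts from this cluster):
# wait_for_port rabbit-004 4369 30
# wait_for_port rabbit-003 4369 30
# rabbitmq-server -detached
```

This doesn’t fix the race, but it should make the “{badrpc,nodedown}” case much less likely at first boot.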
rabbitmq.config for all nodes:
[
  {rabbit, [{tcp_listeners, [29000]},
            {collect_statistics_interval, 30000},
            {loopback_users, []},
            {cluster_nodes, {['NODE_1@rabbit-004', 'NODE_3@rabbit-003', 'NODE_2@rabbit-006'], disc}},
            {cluster_partition_handling, pause_minority},
            {disk_free_limit, 1073741824},
            {vm_memory_high_watermark, 0.4}]},
  {rabbitmq_management, [{listener, [{port, 29001}]}]},
  {kernel, [{net_setuptime, 30}]}
].
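As a stopgap, I’m also thinking about a post-start fallback: if a node came up blank, retry the standard manual-clustering sequence (stop_app / join_cluster / start_app) until a peer is reachable. The rabbitmqctl subcommands below are the documented manual-clustering ones and NODE_1@rabbit-004 is from this cluster; retry() and join_peer() are helpers I made up for the sketch:

```shell
#!/usr/bin/env bash
# Generic helper: run a command up to <attempts> times, sleeping
# <delay> seconds between attempts; succeed on the first success.
retry() {  # retry <attempts> <delay> <cmd> [args...]
  local attempts=$1 delay=$2 i
  shift 2
  for ((i = 1; i <= attempts; i++)); do
    "$@" && return 0
    sleep "$delay"
  done
  return 1
}

# Standard manual-join sequence; the wrapper name is my own.
join_peer() {  # join_peer <peer-node-name>
  rabbitmqctl stop_app &&
    rabbitmqctl join_cluster "$1" &&
    rabbitmqctl start_app
}

# Usage sketch (commented out; requires a running broker):
# retry 10 5 join_peer NODE_1@rabbit-004
```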
Any ideas?
OK, so this explains the logs in rabbitmq-sasl.
This seems to have been fixed in 3.5.7, though? I’m running 3.5.7 already.
Sent too fast, sorry.
I’m running 3.5.7 already, but I’m still seeing the issue.
The GitHub issue says it happens when the nodes have communicated beforehand, but in my case they should be fresh nodes with a fresh, empty Mnesia database.
Alright, this happens when I run tests in which a sequence of brokers with the same name is installed/uninstalled on the same box.
The local broker files are deleted every time, so the node starts with a fresh Mnesia database, but epmd keeps running in the background.
So even if the node is “fresh”, it is effectively the same node with the same Erlang node name, and epmd may consider it the same node.
Could this be linked?
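In case it matters, between install/uninstall cycles I could at least check whether the local epmd still holds a registration for the old node name, and reset it only then. `epmd -names` and `epmd -kill` are standard Erlang/OTP commands; the stale_registration() helper and the exact output match are my own sketch:

```shell
#!/usr/bin/env bash
# Sketch: detect a stale epmd registration left over from a previous
# deployment of the same node name, and reset epmd only in that case.
stale_registration() {  # returns 0 if the local epmd still lists <name>
  command -v epmd >/dev/null || return 1
  # `epmd -names` prints lines like: "name NODE_2 at port 25672"
  epmd -names 2>/dev/null | grep -q "^name $1 at port"
}

if stale_registration NODE_2; then
  echo "stale epmd registration for NODE_2; resetting epmd"
  epmd -kill  # safe only when no live Erlang nodes depend on this epmd
fi
```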
I will revert to 3.5.0 to see whether it occurs there too.
Unfortunately, I haven’t packaged 3.6.x yet, and it might take a while before I can move to it, since it also requires Erlang 18.
Thanks for the pointers.
I’m hitting the issue quite frequently, and making sure to kill epmd before deploying a broker is not really a viable option.
We can’t build and deploy 3.6.x any time soon, so if you think this is fixed in that major version, do you have specific patches in mind?
We could try to backport the corresponding changes onto 3.5.7.