RabbitMQ 3.7.7 sometimes fails to form cluster with rabbit_peer_discovery_classic_config

ferdinan...@gmail.com

Jul 16, 2018, 3:44:21 AM
to rabbitmq-users
Hi,

When initially forming a RabbitMQ cluster using rabbit_peer_discovery_classic_config and starting all nodes (docker containers) at the same time, sometimes only two of three nodes form a cluster while the third remains a standalone node.

Logs
2018-07-13 06:36:19.728 [info] <0.191.0> Configured peer discovery backend: rabbit_peer_discovery_classic_config
2018-07-13 06:36:19.728 [info] <0.191.0> Will try to lock with peer discovery backend rabbit_peer_discovery_classic_config
2018-07-13 06:36:19.728 [info] <0.191.0> Peer discovery backend does not support locking, falling back to randomized delay
2018-07-13 06:36:19.728 [info] <0.191.0> Peer discovery backend rabbit_peer_discovery_classic_config does not support registration, skipping randomized startup delay.
2018-07-13 06:36:19.728 [info] <0.191.0> All discovered existing cluster peers: rabbit@dockerhost2, rabbit@dockerhost1, rabbit@dockerhost0
2018-07-13 06:36:19.729 [info] <0.191.0> Peer nodes we can cluster with: rabbit@dockerhost2, rabbit@dockerhost0
2018-07-13 06:36:19.764 [warning] <0.191.0> Could not auto-cluster with node rabbit@dockerhost2: {error,tables_not_present}
2018-07-13 06:36:19.770 [warning] <0.191.0> Could not auto-cluster with node rabbit@dockerhost0: {error,tables_not_present}
2018-07-13 06:36:19.771 [warning] <0.191.0> Could not successfully contact any node of: rabbit@dockerhost2,rabbit@dockerhost0 (as in Erlang distribution). Starting as a blank standalone node... 

Used versions
RabbitMQ Version: 3.7.7
Erlang Version: 20.2.3

Used config
cluster_formation.peer_discovery_backend = rabbit_peer_discovery_classic_config
log.console.level = debug
cluster_formation.classic_config.nodes.1 = rabbit@dockerhost0
cluster_formation.classic_config.nodes.2 = rabbit@dockerhost1
cluster_formation.classic_config.nodes.3 = rabbit@dockerhost2

The issue looks like a race condition. However, https://www.rabbitmq.com/cluster-formation.html#initial-formation-race-condition states that the rabbit_peer_discovery_classic_config backend avoids race conditions by relying on a pre-configured set of peers. When the start script performs a randomized sleep before starting RabbitMQ, the problem occurs less often, depending on the actual delay.

The problem is much more likely with three nodes than with two, but it has been observed with two nodes as well.
 
Is this normal behavior?

Thanks
Ferdinand

Michael Klishin

Jul 17, 2018, 1:30:11 PM
to rabbitm...@googlegroups.com
The log suggests that the node in question could not sync tables from its peers. It can be a side effect of parallel booting.
Start your containers sequentially or increase the delay range to something like 5-30 seconds [1]. The plugin has a low default
that is least annoying during development and may need adjustment depending on the specific deployment scenario.

You can also enable debug logging [2] to see what delay value is used.

1. Search for "randomized_startup_delay_range" on http://www.rabbitmq.com/cluster-formation.html

--
You received this message because you are subscribed to the Google Groups "rabbitmq-users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to rabbitmq-users+unsubscribe@googlegroups.com.
To post to this group, send email to rabbitmq-users@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.



--
MK

Staff Software Engineer, Pivotal/RabbitMQ

ferdinan...@gmail.com

Jul 19, 2018, 3:21:48 AM
to rabbitmq-users
Configuring randomized_startup_delay_range seems to have no effect when using the rabbit_peer_discovery_classic_config backend (see the attached logs).
The log states:

...Peer discovery backend rabbit_peer_discovery_classic_config does not support registration, skipping randomized startup delay.

The startup delay range was configured to 5-60 seconds, as visible at the top of the log file.

A randomized sleep in the start script seems necessary.
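
For anyone wanting to try the same workaround, a minimal sketch of such a randomized sleep follows. The function name random_delay, the 5-30 second range, and the entrypoint layout in the trailing comment are assumptions for illustration, not taken from the actual start script. It reads from /dev/urandom instead of seeding from the clock, because containers started within the same second could otherwise pick the same delay.

```shell
#!/bin/sh
# Hypothetical helper: print a random whole number of seconds in [min, max].
# Usage: random_delay MIN MAX
random_delay() {
    delay_min=$1
    delay_max=$2
    delay_span=$((delay_max - delay_min + 1))
    # Read two random bytes as one unsigned integer (0..65535) and keep
    # only the digits, since od pads its output with whitespace.
    delay_n=$(od -An -N2 -tu2 /dev/urandom | tr -dc '0-9')
    echo $((delay_min + delay_n % delay_span))
}

# Assumed use in a container start script, before the node boots:
#   sleep "$(random_delay 5 30)"
#   exec rabbitmq-server
```

A wider range lowers the chance that two nodes begin peer discovery at the same moment, at the cost of a slower start.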


Ferdinand
rabbitmq.log

Reid Harrison

Jul 31, 2018, 1:13:03 AM
to rabbitmq-users
I am having this same issue from time to time with 3.7.7. The cluster doesn't always form correctly when using the old cluster_nodes config.

Reid Harrison

Aug 13, 2018, 10:52:06 AM
to rabbitmq-users
Michael, I have the same logs as Ferdinand. Randomized startup delay is not supported with classic config ("Peer discovery backend rabbit_peer_discovery_classic_config does not support registration, skipping randomized startup delay"), and when two cluster nodes enter cluster formation simultaneously, they both start as standalone nodes ("Could not successfully contact any node of: rabbit@host1,rabbit@host2 (as in Erlang distribution). Starting as a blank standalone node...").

Is it feasible to support randomized startup delay for classic cluster config?

Michael Klishin

Aug 13, 2018, 6:14:48 PM
to rabbitm...@googlegroups.com
It can be added but the right thing to do is to add cluster formation retries.
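
To illustrate the idea of retries, here is a rough sketch of a retry loop driven from outside the node. It is a generic shell helper, not how RabbitMQ implements or would implement cluster formation retries. The function name retry, the attempt count, the pause, and the join_peer wrapper in the trailing comment are all hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: run a command until it succeeds or attempts run out.
# Usage: retry MAX_ATTEMPTS PAUSE_SECONDS COMMAND [ARGS...]
retry() {
    retry_max=$1
    retry_pause=$2
    shift 2
    retry_n=1
    while :; do
        # Success on any attempt ends the loop immediately.
        if "$@"; then
            return 0
        fi
        if [ "$retry_n" -ge "$retry_max" ]; then
            echo "giving up after $retry_n attempts: $*" >&2
            return 1
        fi
        retry_n=$((retry_n + 1))
        sleep "$retry_pause"
    done
}

# Assumed use on a node that came up standalone. Note that join_cluster
# resets the node first, so this only suits a blank node:
#   join_peer() {
#       rabbitmqctl stop_app &&
#       rabbitmqctl join_cluster "$1" &&
#       rabbitmqctl start_app
#   }
#   retry 10 5 join_peer rabbit@dockerhost0
```

The helper only retries the command it is given; whether rejoining is safe for a particular node still has to be decided by the operator.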
