Hi,
When a RabbitMQ cluster is first formed using rabbit_peer_discovery_classic_config and all nodes (Docker containers) are started at the same time, sometimes only two of the three nodes form a cluster while the third remains a standalone node.
Logs
2018-07-13 06:36:19.728 [info] <0.191.0> Configured peer discovery backend: rabbit_peer_discovery_classic_config
2018-07-13 06:36:19.728 [info] <0.191.0> Will try to lock with peer discovery backend rabbit_peer_discovery_classic_config
2018-07-13 06:36:19.728 [info] <0.191.0> Peer discovery backend does not support locking, falling back to randomized delay
2018-07-13 06:36:19.728 [info] <0.191.0> Peer discovery backend rabbit_peer_discovery_classic_config does not support registration, skipping randomized startup delay.
2018-07-13 06:36:19.728 [info] <0.191.0> All discovered existing cluster peers: rabbit@dockerhost2, rabbit@dockerhost1, rabbit@dockerhost0
2018-07-13 06:36:19.729 [info] <0.191.0> Peer nodes we can cluster with: rabbit@dockerhost2, rabbit@dockerhost0
2018-07-13 06:36:19.764 [warning] <0.191.0> Could not auto-cluster with node rabbit@dockerhost2: {error,tables_not_present}
2018-07-13 06:36:19.770 [warning] <0.191.0> Could not auto-cluster with node rabbit@dockerhost0: {error,tables_not_present}
2018-07-13 06:36:19.771 [warning] <0.191.0> Could not successfully contact any node of: rabbit@dockerhost2,rabbit@dockerhost0 (as in Erlang distribution). Starting as a blank standalone node...
Used versions
RabbitMQ Version: 3.7.7
Erlang Version: 20.2.3
Used config
cluster_formation.peer_discovery_backend = rabbit_peer_discovery_classic_config
log.console.level = debug
cluster_formation.classic_config.nodes.1 = rabbit@dockerhost0
cluster_formation.classic_config.nodes.2 = rabbit@dockerhost1
cluster_formation.classic_config.nodes.3 = rabbit@dockerhost2
The issue looks like a race condition. However,
https://www.rabbitmq.com/cluster-formation.html#initial-formation-race-condition states that the rabbit_peer_discovery_classic_config backend avoids race conditions by relying on a pre-configured set of peers. Adding a randomized sleep to the start script before starting RabbitMQ makes the problem less likely, depending on the length of the delay.
The problem is much more likely with three nodes than with two, but it has been observed with two nodes as well.
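For reference, the randomized sleep mentioned above can be sketched roughly as follows. This is only an illustration, not the exact start script in use: it assumes Python is available in the container image and that rabbitmq-server is on the PATH. The STARTUP_MAX_DELAY environment variable and the pick_delay helper are names made up for this sketch and are not RabbitMQ settings.

```python
#!/usr/bin/env python3
"""Hypothetical container entrypoint: sleep a random interval, then start RabbitMQ."""
import os
import random
import sys
import time


def pick_delay(max_seconds: float, rng: random.Random) -> float:
    """Return a delay drawn uniformly from [0, max_seconds]."""
    if max_seconds < 0:
        raise ValueError("max_seconds must be non-negative")
    return rng.uniform(0.0, max_seconds)


def main() -> None:
    # STARTUP_MAX_DELAY is an assumed variable name for this sketch only.
    max_delay = float(os.environ.get("STARTUP_MAX_DELAY", "30"))
    # random.Random() seeds itself from the OS, so each container draws a different delay.
    delay = pick_delay(max_delay, random.Random())
    print(f"Delaying RabbitMQ start by {delay:.1f}s", file=sys.stderr, flush=True)
    time.sleep(delay)
    # Replace this process with the broker so signals from Docker reach it directly.
    os.execvp("rabbitmq-server", ["rabbitmq-server"])


if __name__ == "__main__":
    main()
```

A larger maximum delay spreads the node start times further apart, which matches the observation that the failure becomes rarer as the delay grows, but it only reduces the chance of the race rather than removing it.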
Is this normal behavior?
Thanks
Ferdinand