In 3.6.5, when all cluster nodes are restarted, they must be brought back up in a particular order: the node that was stopped last must be started first. The log contains a message that says as much:
> This cluster node was shut down while other nodes were still running.
> To avoid losing data, you should start the other nodes first, then
> start this one. To force this node to start, first invoke
> "rabbitmqctl force_boot". If you do so, any changes made on other
> cluster nodes after this one was shut down may be lost.
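In shell terms, the two recovery paths that message describes look roughly like the sketch below. `rabbitmqctl` and `rabbitmq-server` are the standard RabbitMQ CLI tools; the `recover_with_force_boot` helper is a name made up here for illustration.

```shell
# Option 1 (preferred, no data loss): start the node that was stopped
# last FIRST, then the remaining nodes, e.g. via the service manager:
#   systemctl start rabbitmq-server   # on the last-stopped node, then the others

# Option 2 (last resort; writes made on other nodes after this one
# stopped may be lost): force the stuck node to boot without waiting
# for its peers, exactly as the log message suggests.
recover_with_force_boot() {
  rabbitmqctl force_boot        # run against the stopped node; skips the "wait for peers" check
  rabbitmq-server -detached     # then start the node as usual
}

# Guard so this script is safe to run anywhere:
if command -v rabbitmqctl >/dev/null 2>&1; then
  echo "rabbitmqctl found; recover_with_force_boot is ready to run"
else
  echo "rabbitmqctl not on PATH; illustration only"
fi
```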
Starting with 3.6.7 this is no longer required, except during feature version upgrades.
There are also messages about an inability to contact epmd, the auxiliary daemon (Erlang port mapper) used to resolve inter-node
communication ports:
> attempted to contact: ['rabbit@ip-100-65-5-154','rabbit@ip-100-65-5-19']
>
> rabbit@ip-100-65-5-154:
> * unable to connect to epmd (port 4369) on ip-100-65-5-154: address (cannot connect to host/port)
>
> rabbit@ip-100-65-5-19:
> * unable to connect to epmd (port 4369) on ip-100-65-5-19: address (cannot connect to host/port)
So either the peer nodes were rebooted, down, or unreachable, or something blocked inter-node communication.
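One way to tell those two cases apart is to probe port 4369 directly from the affected node. A minimal bash sketch, with the hostnames taken from the log above and `check_epmd` being a helper name invented here:

```shell
# Returns success if a TCP connection to epmd (port 4369) on $1 succeeds.
# Uses bash's /dev/tcp pseudo-device; the 2-second timeout avoids hanging.
check_epmd() {
  timeout 2 bash -c "exec 3<>/dev/tcp/$1/4369" 2>/dev/null
}

for host in ip-100-65-5-154 ip-100-65-5-19; do
  if check_epmd "$host"; then
    echo "$host: epmd reachable on 4369 -- suspect the node itself"
  else
    echo "$host: epmd unreachable on 4369 -- host down or traffic blocked"
  fi
done
```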
Lastly, there are many instances of
> client unexpectedly closed TCP connection
which were already covered above.
My best guess is that
* The entire cluster was restarted or rebooted, intentionally or by accident
* Since this version requires nodes to be restarted in a certain order, some nodes refused to start, producing the expected errors in the log
* Log exceptions were incorrectly identified as the root cause
* Assuming the operators weren't aware of the node restart order requirement, ad-hoc measures such as "let's restart things N times until they re-cluster" were attempted until one succeeded
The word "crash" is used by the runtime to describe unhandled exceptions and is not an indication of a node going down.
I don't see any evidence of nodes going down, or any exceptions that would indicate a major failure.