I have just attempted once again to upgrade from Erlang 25.1.2 and RabbitMQ 3.11.4, this time to Erlang 26.2.1 and RabbitMQ 3.12.12. After the upgrade we are still seeing the same behaviour as last time: certain tabs on the management UI become unresponsive once the RabbitMQ service on one node is stopped and we either a) log out of the stopped node, or b) turn the node off. The two tabs where we see this behaviour the most are the Overview tab and the Queues and Streams tab. The rabbitmqctl cluster_status command also takes a long time to complete.
I've been looking at the network traffic with Wireshark and have noticed that the epmd traffic on port 4369 behaves quite differently between the old and new versions. With the old version, all RabbitMQ traffic between the running nodes and the stopped node ceases as soon as the RabbitMQ service is stopped. The UI on the running nodes remains responsive. Starting the stopped node causes a couple of epmd exchanges with the other nodes and then normal RabbitMQ traffic resumes over port 25672.
With the newer versions, epmd traffic commences as soon as one node is stopped. The epmd traffic is continuous between the stopped node and the running nodes until I either log out of Windows on the stopped node, or turn the stopped node off completely. When I do either of these two things I observe that epmd traffic then ceases, the running nodes continually attempt to connect to the stopped node on port 4369, and the management UI becomes very unresponsive. I assume that epmd traffic ceases when I log out of Windows on the stopped node because epmd.exe is running in user space and so gets terminated during the logoff process. Responsiveness can be restored by logging back into the stopped node and running a command such as rabbitmq-service stop. Even though the RabbitMQ service is already stopped, this command results in epmd.exe starting back up and resuming communication with the running nodes on port 4369.
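For anyone who wants to compare against their own cluster without Wireshark, here is a rough Python sketch that times a single TCP connection attempt to the epmd port (4369) and the inter-node port (25672) from a running node. It is only a diagnostic aid under my own assumptions: the function names probe and report are made up for this post, and the 3 second timeout is arbitrary. The idea is just to see whether the connect comes back quickly as refused or hangs until the timeout.

```python
import socket
import time


def probe(host, port, timeout=3.0):
    """Try one TCP connect and return (state, seconds_elapsed).

    state is "open", "refused", "timeout", or "error: ...".
    """
    start = time.monotonic()
    try:
        # create_connection resolves the host and connects, honouring the timeout.
        with socket.create_connection((host, port), timeout=timeout):
            state = "open"
    except ConnectionRefusedError:
        # Host answered, but nothing is listening on that port.
        state = "refused"
    except socket.timeout:
        # No answer at all within the timeout (e.g. host powered off).
        state = "timeout"
    except OSError as exc:
        # Anything else: name resolution failure, unreachable network, etc.
        state = "error: " + str(exc)
    return state, time.monotonic() - start


def report(hosts, ports=(4369, 25672), timeout=3.0):
    """Print one line per host/port pair with the state and elapsed time."""
    for host in hosts:
        for port in ports:
            state, elapsed = probe(host, port, timeout)
            print(f"{host}:{port} {state} after {elapsed:.2f}s")
```

Run from one of the running nodes with something like report(["node-b"]), where node-b is a placeholder for the hostname of the stopped node, once while epmd.exe is still running there and once after logging out or powering it off.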
Has anyone else observed this behaviour? My cluster is three nodes running Windows Server 2019. All three nodes are on the same network with no firewall appliances in between. The cluster runs in pause_minority mode. So far I have tried the following, and none of it has fixed the issue:
- Turned off Antivirus
- Turned off Windows firewall
- Turned off encryption on internode cluster communication
- Turned off pause_minority mode
- Completely removed Erlang and RabbitMQ from all nodes, destroyed the RabbitMQ data folders and re-installed from scratch