RabbitMQ cluster issue after Upgrade to v3.12.2


Kannan Sankaran

Aug 22, 2023, 5:43:03 AM
to rabbitmq-users
Hello,

We recently upgraded our RabbitMQ servers from v3.11.4 to v3.12.2 and Erlang from v25.1.2 to v25.3.2.5. The cluster consists of 3 nodes and uses pause_minority mode for cluster partition handling. RabbitMQ runs on 3 separate Windows servers. We have 3 quorum queues.

We observe the following issue in the upgraded environment. The cluster normally functions fine. However, when one of the nodes is taken down by stopping the Windows service, the cluster becomes unresponsive: even when accessed from the management console of another node, the 'Queues and Streams' tab does not open and reports that the server is not available. The other tabs in the management console work, but they also occasionally become unresponsive.

When we run the RabbitMQ cluster status command, it takes a long time to return, but the output does not indicate any issue. The log file also does not contain any obvious errors. As soon as the node is started again, the cluster functions fine, with the quorum queues having 2 followers. We expected the cluster to function as before after the upgrade since we did not change any configuration. We observe the same behaviour in 2 environments. We even tried uninstalling completely and doing a fresh install, but the issue persists. Can someone please advise on this issue?

Thanks

Best regards
Kannan Sankaran 


Martin Hinchy

Aug 23, 2023, 3:21:57 AM
to rabbitmq-users
Also seeing this behaviour. Same versions of Erlang and RabbitMQ. Running on Windows Server 2019. 

After stopping a node, the management UI and RabbitMQ command line tools become very unresponsive. It can take anywhere from 5 to 30 seconds to complete a request that normally takes under a second.

kjnilsson

Aug 23, 2023, 9:38:32 AM
to rabbitmq-users
Hi,

I can't see a slowdown when testing this myself, but I have a suspicion. The cluster status command as well as the queues page in the management UI make a lot of inter-node RPC calls to gather the required information. Each time one of those calls targets the down node, it tries to re-establish the TCP connection. Typically this is quite fast, but there can be cases when it is slower. One case I have seen a few times is when DNS lookup is slow. Please can you check that DNS lookup on your servers isn't taking a long time for the hostnames used?
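
For anyone wanting to run this check, here is a minimal sketch (not part of the original reply) that times name resolution from Python. It measures the operating system resolver, which may not match exactly how the Erlang runtime resolves names, so treat the numbers as a rough indication only. The function name `time_dns_lookup` is an illustrative choice, and `localhost` is a placeholder for your node hostnames.

```python
import socket
import time


def time_dns_lookup(hostname, attempts=3):
    """Resolve hostname several times; return a list of (seconds, error) tuples."""
    results = []
    for _ in range(attempts):
        start = time.perf_counter()
        try:
            socket.getaddrinfo(hostname, None)
            error = None
        except OSError as exc:  # includes socket.gaierror for failed lookups
            error = str(exc)
        results.append((time.perf_counter() - start, error))
    return results


if __name__ == "__main__":
    # Placeholder list: replace "localhost" with the hostnames your nodes use
    # (the part after "rabbit@" in each node name).
    for host in ["localhost"]:
        for elapsed, error in time_dns_lookup(host):
            status = "ok" if error is None else f"failed: {error}"
            print(f"{host}: {elapsed * 1000:.1f} ms ({status})")
```

Running it on each cluster member, once with all nodes up and once with one node stopped, should show whether lookups for the stopped node's hostname are noticeably slower.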

Cheers
Karl

kjnilsson

Aug 23, 2023, 10:26:05 AM
to rabbitmq-users
Another thing you could check is whether there is a difference between just shutting the Windows service down and shutting the server down.

Martin Hinchy

Aug 23, 2023, 6:22:37 PM
to rabbitmq-users
Shutting down the server, as opposed to just shutting down the Windows service, makes things much worse, to the point where the management UI occasionally times out completely and displays an error.

I've now tested this on two different clusters on two separate networks. Both clusters were running Erlang 25.1.2 and RabbitMQ 3.11.4 and this behaviour was not occurring. It only started happening after we upgraded to Erlang 25.3.2.5 and RabbitMQ 3.12.2.

I'm not sure if it matters but our cluster partition handling is set to pause_minority and all of our internode traffic on 25672 is encrypted using TLS certificates.

Martin Hinchy

Aug 23, 2023, 6:23:52 PM
to rabbitmq-users
DNS lookup is working fine on our network.

Martin Hinchy

Aug 23, 2023, 6:44:12 PM
to rabbitmq-users
I removed the TLS certificates from the internode communication so it is now unencrypted and it made no difference to the timeouts.

One observation regarding the management UI: the worst delays occur when clicking on the Overview page and the Queues and Streams page. There is little to no delay when clicking on the Connections, Channels and Admin tabs.
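
To put numbers on the per-tab difference, the management plugin's HTTP API can be timed directly. The sketch below is an illustration rather than anything posted in the thread: it assumes the API is reachable at `http://localhost:15672` with the default `guest`/`guest` credentials (which only work from localhost), and that the `/api/overview`, `/api/queues`, `/api/connections` and `/api/channels` endpoints correspond closely enough to the tabs mentioned above. The helper names are hypothetical.

```python
import base64
import time
import urllib.error
import urllib.request


def build_request(base_url, path, user, password):
    """Build a GET request with HTTP basic auth for the management API."""
    token = base64.b64encode(f"{user}:{password}".encode()).decode()
    return urllib.request.Request(
        base_url.rstrip("/") + path,
        headers={"Authorization": f"Basic {token}"},
    )


def time_endpoint(base_url, path, user, password, timeout=60):
    """Issue one GET request; return (elapsed_seconds, outcome)."""
    request = build_request(base_url, path, user, password)
    start = time.perf_counter()
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            response.read()
            outcome = f"HTTP {response.status}"
    except urllib.error.HTTPError as exc:
        outcome = f"HTTP {exc.code}"
    except OSError as exc:  # connection refused, timeout, DNS failure
        outcome = f"error: {exc}"
    return time.perf_counter() - start, outcome


if __name__ == "__main__":
    # Assumed values: adjust the URL and credentials for your environment.
    base_url, user, password = "http://localhost:15672", "guest", "guest"
    for path in ["/api/overview", "/api/queues", "/api/connections", "/api/channels"]:
        elapsed, outcome = time_endpoint(base_url, path, user, password)
        print(f"{path}: {elapsed:.2f} s ({outcome})")
```

Comparing the timings with all nodes up against the timings with one node stopped would show which endpoints slow down and by how much.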

Martin Hinchy

Jan 11, 2024, 11:02:34 PM
to rabbitmq-users
I have just attempted once again to upgrade from Erlang 25.1.2 and RabbitMQ 3.11.4. This time I tried upgrading to Erlang 26.2.1 and RabbitMQ 3.12.12. After the upgrade we are still seeing the same behaviour as last time. Certain tabs on the management UI become unresponsive once the RabbitMQ service on one node is stopped and we either a) log out of the stopped node, or b) turn the node off. The two tabs where we see this behaviour the most are the Overview tab and the Queues and Streams tab. The rabbitmqctl cluster_status command also takes a long time to complete.
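
A small sketch for measuring how long a CLI command takes, so the delay can be compared before and after stopping a node. This is an illustration, not something from the thread: `time_command` is a hypothetical helper, and the example invocation assumes the RabbitMQ `sbin` directory is on the PATH.

```python
import subprocess
import time


def time_command(argv, timeout=120):
    """Run a command; return (elapsed_seconds, return_code).

    return_code is None if the command did not finish within the timeout.
    """
    start = time.perf_counter()
    try:
        completed = subprocess.run(argv, capture_output=True, timeout=timeout)
        return_code = completed.returncode
    except subprocess.TimeoutExpired:
        return_code = None
    return time.perf_counter() - start, return_code


# Example (assumed invocation; on Windows the CLI tool is a batch file,
# so the name or path may need adjusting for your install):
# elapsed, code = time_command(["rabbitmqctl.bat", "cluster_status"])
# print(f"cluster_status took {elapsed:.1f} s (exit code {code})")
```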

I've been looking at various network traffic with Wireshark and I have noticed that the behaviour of the epmd traffic on 4369 is quite different between the old and new versions. With the old version,  all RabbitMQ traffic between the running nodes and the stopped node ceases as soon as the RabbitMQ service is stopped. The UI on the running nodes remains responsive. Starting the stopped node causes a couple of epmd exchanges with the other nodes and then normal RabbitMQ traffic resumes over 25672. 

With the newer versions, epmd traffic commences as soon as one node is stopped. The epmd traffic is continuous between the stopped node and the running nodes until I either log out of Windows on the stopped node, or turn the stopped node off completely. When I do either of these two things I observe that epmd traffic then ceases, the running nodes continually attempt to connect to the stopped node on 4369, and the management UI becomes very unresponsive. I assume that epmd traffic ceases when I log out of Windows on the stopped node because epmd.exe is running in user space and this then gets terminated during the logoff process. Responsiveness can be restored by logging back into the stopped node and running a command such as rabbitmq-service stop. Even though the RabbitMQ service is already stopped, this command results in epmd.exe starting back up and resuming communication with the running nodes on 4369.
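
The connection attempts described above can be reproduced outside RabbitMQ with a plain TCP probe. The following is a minimal sketch under stated assumptions, not a diagnosis: it times a single connection attempt and reports whether it succeeded, was refused, or timed out. A fast refusal versus a slow timeout on port 4369 may help explain why the delays differ between a stopped service and a powered-off server, but that link is an assumption. `time_tcp_connect` is a hypothetical helper, and `localhost` stands in for the stopped node's hostname.

```python
import socket
import time


def time_tcp_connect(host, port, timeout=5.0):
    """Attempt one TCP connection; return (elapsed_seconds, outcome)."""
    start = time.perf_counter()
    try:
        with socket.create_connection((host, port), timeout=timeout):
            outcome = "connected"
    except socket.timeout:
        outcome = "timed out"
    except ConnectionRefusedError:
        outcome = "refused"
    except OSError as exc:  # unreachable host, failed name lookup, etc.
        outcome = f"error: {exc}"
    return time.perf_counter() - start, outcome


if __name__ == "__main__":
    # Placeholder host: replace with the hostname of the stopped node.
    # 4369 is the epmd port and 25672 the inter-node port named in this thread.
    for port in (4369, 25672):
        elapsed, outcome = time_tcp_connect("localhost", port)
        print(f"port {port}: {outcome} after {elapsed * 1000:.0f} ms")
```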

Has anyone else observed this behaviour? My cluster is three nodes running Windows Server 2019. All three nodes are on the same network with no firewall appliances in between. The cluster runs in pause_minority mode. So far I have tried the following and nothing has fixed this issue:
- Turned off Antivirus
- Turned off Windows firewall
- Turned off encryption on internode cluster communication
- Turned off pause_minority mode
- Completely removed Erlang and RabbitMQ from all nodes, destroyed the RabbitMQ data folders and re-installed from scratch.

kjnilsson

Jan 12, 2024, 4:40:52 AM
to rabbitmq-users
Can you try the latest 3.13 rc? It contains https://github.com/rabbitmq/rabbitmq-server/pull/9874 which may help for the queues page.