Connection autorecovery issue

Roberto Pesce

Aug 31, 2023, 3:12:07 AM

to rabbitm...@googlegroups.com

Hi,

I'm struggling with connection autorecovery during the nightly VM backup performed by the customer's IT department. My configuration is the following:

I have a 4-node cluster behind a load balancer provided by the customer; the RabbitMQ version currently in use is still 3.11.15. I have about 300 quorum queues.

About 10 C# .NET Core processes connect to these queues; each process runs several threads, each consuming a specific queue. The RabbitMQ client NuGet package is updated to the latest version, 6.5.0.

I've tried two different approaches:

- creating one connection per channel/consumer

- creating one connection for several consumers (not exactly one connection per process, but one per "queue family", as I have different connection credentials)

Currently I'm focusing on the second approach, so I have fewer connections to recover in case of problems.

The connection factory is created as follows:

fact = new ConnectionFactory()
{
    HostName = GetConf("Server"),
    Port = int.Parse(GetConf("Port")),
    VirtualHost = GetConf("VirtualHost"),
    UserName = GetConf("User"),
    Password = !useEnc ? GetConf("Password") : CustomSetup.D(GetConf("Password")),
    DispatchConsumersAsync = async,
    AutomaticRecoveryEnabled = true,
    TopologyRecoveryEnabled = true,
    RequestedHeartbeat = TimeSpan.FromSeconds(20)
};

fact.Ssl.Enabled = true;
fact.Ssl.ServerName = fact.HostName;
fact.Ssl.CertificateValidationCallback = RemoteCertificateValidationCallback;
fact.Ssl.Version = System.Security.Authentication.SslProtocols.Tls12;

Everything works fine during normal processing, but the customer's IT performs a backup every night that temporarily suspends and resumes two VMs at the same time. This causes a lot of disconnections and recoveries in the RabbitMQ client, and sometimes some connections don't recover properly. I suspect this is related to the LB continuing to route to the nodes that were suspended/resumed before they have completely recovered, and that when this takes too long the autorecovery stops trying.

My questions:

- Do you have any advice on how this could be detected? In a previous RabbitMQ client version I had an event for failed autorecovery, which is no longer present in version 6.5.0. Is there any other event I could use in my main thread loop to trigger a reconnection?

- As an alternative, would it help in these cases to check _connection.IsOpen in my main loop and reset the connection/channel?

- For this second method, what will happen to connections that have multiple channels? I fear that the first channel will see a broken connection and reset it, then the second channel will also see a broken connection and reset it, breaking the connection object that was just restored for the first channel, and so on.

Thank you

Roberto

Luke Bakken

Aug 31, 2023, 2:50:07 PM

to rabbitmq-users

Hello -

The cluster should have an odd number of nodes (3 or 5) - https://www.rabbitmq.com/clustering.html#node-count

"IT performs a backup every night that temporarily suspends and resumes two VMs at the same time"

This is a pretty bad way to make a backup when running RabbitMQ. Some things to consider:

- Does a backup even need to happen for a RabbitMQ server? You have a cluster for a reason. If the goal is to preserve entities within RabbitMQ (queues, exchanges, vhosts, etc.) then exporting definitions should suffice.
- If a backup must happen, the following should happen, one server at a time:
  - Stop RabbitMQ on a server
  - Pause and back up that server
  - Restart the server, then restart RabbitMQ
  - Wait for the RabbitMQ node to come up, and verify the cluster is healthy
  - Move on to the next server

> Everything works fine during normal processing, but the customer's IT performs a backup every night that temporarily suspends and resumes two VMs at the same time. This causes a lot of disconnections and recoveries in the RabbitMQ client, and sometimes some connections don't recover properly. I suspect this is related to the LB continuing to route to the nodes that were suspended/resumed before they have completely recovered, and that when this takes too long the autorecovery stops trying.

The LB should be configured to check that port 5672 is open and can be connected to. Maybe that's the reason? Here is how I configure haproxy for use with a RabbitMQ cluster - https://github.com/lukebakken/docker-rabbitmq-cluster/blob/main/haproxy.cfg

> - Do you have any advice on how this could be detected? In a previous RabbitMQ client version I had an event for failed autorecovery, which is no longer present in version 6.5.0. Is there any other event I could use in my main thread loop to trigger a reconnection?

Could you point out the exact event name? That could be a good feature to re-add to the client.

> - As an alternative, would it help in these cases to check _connection.IsOpen in my main loop and reset the connection/channel?
> - For this second method, what will happen to connections that have multiple channels? I fear that the first channel will see a broken connection and reset it, then the second channel will also see a broken connection and reset it, breaking the connection object that was just restored for the first channel, and so on.

It might be best for you to handle recovery yourself rather than depend on what the client library provides. If you'd like assistance with that, please start a discussion here:

https://github.com/rabbitmq/rabbitmq-dotnet-client/discussions

Please provide a git repository I can clone, compile, and run to see what you're trying to do. I can then fork that repo and use pull requests to assist you.
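
On the multi-channel concern quoted above (one consumer thread resetting a connection that another thread has just replaced), one possible approach is to put the check-and-replace behind a single lock, so that only the first thread to notice the failure actually reconnects and the others reuse its result. What follows is only a minimal sketch under stated assumptions, not the client library's own mechanism: `SharedConnection<T>` and `GetOpen` are hypothetical names, and the connection type is left generic so the sketch does not depend on RabbitMQ.Client.

```csharp
using System;

// Hypothetical helper, not part of RabbitMQ.Client. It owns one shared
// connection and replaces it at most once per failure, even when several
// consumer threads detect the same broken connection.
public sealed class SharedConnection<T> where T : class
{
    private readonly object _gate = new object();
    private readonly Func<T> _create;        // opens a new connection
    private readonly Func<T, bool> _isOpen;  // reports whether a connection is usable
    private readonly Action<T> _close;       // releases a dead connection
    private T _current;

    public SharedConnection(Func<T> create, Func<T, bool> isOpen, Action<T> close)
    {
        _create = create ?? throw new ArgumentNullException(nameof(create));
        _isOpen = isOpen ?? throw new ArgumentNullException(nameof(isOpen));
        _close = close ?? throw new ArgumentNullException(nameof(close));
    }

    // Returns the current connection if it is open; otherwise replaces it.
    // The open check runs inside the lock, so a thread that arrives after
    // another thread has already reconnected gets the new connection back
    // instead of resetting it again.
    public T GetOpen()
    {
        lock (_gate)
        {
            if (_current != null && _isOpen(_current))
            {
                return _current;
            }

            if (_current != null)
            {
                // Best effort: the old connection is already broken.
                try { _close(_current); } catch (Exception) { }
                _current = null;
            }

            // May throw while the brokers are unreachable; the caller is
            // expected to retry later (for example from its main loop).
            _current = _create();
            return _current;
        }
    }
}
```

With the 6.x client this would presumably be wired up as `new SharedConnection<IConnection>(() => fact.CreateConnection(), c => c.IsOpen, c => c.Dispose())`, with `AutomaticRecoveryEnabled` set to false so that the library's recovery and the manual one do not compete. Each consumer thread would still need to re-create its own channel and consumer whenever `GetOpen()` hands back a different connection object than the one it was using.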

Thanks,

Luke
