Hi,
I'm struggling with connection autorecovery during the nightly VM backup performed by the customer's IT department. My configuration is the following:
I have a 4-node cluster behind a load balancer provided by the customer; the RabbitMQ version currently in use is still 3.11.15. I have about 300 quorum queues.
About 10 C# .NET Core processes connect to these queues; each process has several threads, each one consuming a specific queue. The RabbitMQ client NuGet package is updated to the latest version, 6.5.0.
I've tried two different approaches:
- creating one connection per channel/consumer
- creating one connection for several consumers (not exactly one connection per process, but one per "queue family", as I have different connection credentials)
Currently I'm focusing on the second approach, so that I have fewer connections to recover in case of problems.
The connection factory is created as follows:
    fact = new ConnectionFactory()
    {
        HostName = GetConf("Server"),
        Port = int.Parse(GetConf("Port")),
        VirtualHost = GetConf("VirtualHost"),
        UserName = GetConf("User"),
        Password = !useEnc ? GetConf("Password") : CustomSetup.D(GetConf("Password")),
        DispatchConsumersAsync = async,
        AutomaticRecoveryEnabled = true,
        TopologyRecoveryEnabled = true,
        RequestedHeartbeat = TimeSpan.FromSeconds(20)
    };
    fact.Ssl.Enabled = true;
    fact.Ssl.ServerName = fact.HostName;
    fact.Ssl.CertificateValidationCallback = RemoteCertificateValidationCallback;
    fact.Ssl.Version = System.Security.Authentication.SslProtocols.Tls12;
Everything works fine during normal processing, but every night the customer's IT department runs a backup that temporarily suspends and resumes two VMs at the same time. This causes a lot of disconnections and recoveries in the RabbitMQ client, and sometimes some connections don't recover properly. I suspect this is related to the LB continuing to route to the nodes that were suspended/resumed before they have completely recovered, and that when this takes too long the autorecovery stops trying.
My questions:
- Do you have any advice on how this could be detected? In a previous version of the RabbitMQ client I had an event for failed autorecovery, which is no longer present in version 6.5.0. Is there any other event that I could use in my main thread loop to trigger a reconnection?
- As an alternative, would it help in these cases to check _connection.IsOpen in my main loop and reset the connection/channel?
- For this second method, what will happen to connections that have multiple channels? I fear that the first channel will see a broken connection and reset it, then the second channel will also see a broken connection and reset it, breaking the connection object that was just re-established for the first channel, and so on.
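
To make the last question concrete: one possible way to keep several channel threads from resetting a shared connection one after another is a lock plus a generation counter. This is only a minimal sketch under assumptions. SharedConnectionGuard and all its members are hypothetical names, not part of RabbitMQ.Client, and the three delegates stand in for calls such as fact.CreateConnection(), the IsOpen property and Dispose(). Whether a manual reset like this interacts well with AutomaticRecoveryEnabled is exactly what I'm unsure about.

```csharp
using System;

// Hypothetical helper (not part of RabbitMQ.Client): several consumer threads
// share one connection, and a broken connection is replaced only once, no
// matter how many threads notice the failure.
public sealed class SharedConnectionGuard<T> where T : class
{
    private readonly object _gate = new object();
    private readonly Func<T> _create;        // assumed: () => fact.CreateConnection()
    private readonly Func<T, bool> _isOpen;  // assumed: c => c.IsOpen
    private readonly Action<T> _dispose;     // assumed: c => c.Dispose()
    private T _current;
    private int _generation;

    public SharedConnectionGuard(Func<T> create, Func<T, bool> isOpen, Action<T> dispose)
    {
        _create = create;
        _isOpen = isOpen;
        _dispose = dispose;
        _current = create();
    }

    // Returns the current connection together with its generation number.
    public (T Connection, int Generation) Get()
    {
        lock (_gate) { return (_current, _generation); }
    }

    // Called by a thread that saw a failure on the generation it was using.
    // Returns true if this call replaced the connection, false if another
    // thread already replaced it or the connection is open again.
    public bool ResetIfStale(int seenGeneration)
    {
        lock (_gate)
        {
            if (seenGeneration != _generation) return false; // already replaced
            if (_isOpen(_current)) return false;              // recovered meanwhile

            T fresh = _create();                               // may throw; old one is kept
            try { _dispose(_current); } catch (Exception) { /* best effort */ }
            _current = fresh;
            _generation++;
            return true;
        }
    }
}
```

Each consumer thread would remember the generation it got from Get(), call ResetIfStale(generation) when it sees a closed connection, and then call Get() again to recreate its own channel on whatever connection is current. A second thread arriving with the old generation gets false and does not touch the connection that was just replaced.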
Thank you
Roberto