Unexpected Connection / Channel Shutdowns - New Thread

Jaco De Villiers

Sep 2, 2022, 1:55:36 AM
to rabbitmq-users
Similar to another question here, we get unexpected connection closures over the course of a day.

Here is the information I have:
  • Client Version: <PackageReference Include="RabbitMQ.Client" Version="6.2.4" />
  • Server Version:  RabbitMQ: v3.10.5, Erlang: 24.3.4.1, Linux: Ubuntu 20.04
  • Architecture / Topology: 
    • We have a single self-hosted RabbitMQ instance on AWS, an AWS load balancer, and a minimum of 8 servers connecting. Each server has two RabbitMQ connections (Producer and Consumer), with many producers and consumers sharing the two connections. Most of the channels are long-running, but some live for only a brief period each day.
Common log entries on the server. They come in pairs for each server running at that moment, and it seems to affect all running servers:
client unexpectedly closed TCP connection
closing AMQP connection <0.19350.203> (172.29.17.10:37820 -> 172.27.0.2:5672 - 192.168.64.3-Consumer, vhost: '/', user: 'iiabapi'):

Then the clients seem to get the End of Stream errors:
Failed to publish 2 - Already closed: The AMQP operation was interrupted: AMQP close-reason, initiated by Library, code=0, text='End of stream', classId=0, methodId=0, cause=System.IO.EndOfStreamException: Reached the end of the stream. Possible authentication failure.

We have now changed the following:
  • Removed the load balancers and connect to the RabbitMQ server directly
  • Changed the connection settings to: NetworkRecoveryInterval = TimeSpan.FromSeconds(1); // Reduced from 5 to 1
  • Added code to the publish logic that will try to recreate a model/channel (see below)
Questions:
  • Do we also need to handle disconnect/reconnect manually?
  • If so, is the code snippet below correct/good practice?
  • If it is not a good approach, would it be better to queue messages in memory and then publish them when a reconnected event occurs (I don't see ANY log entries for _model.BasicRecoverOk)?
Thanks in advance
 


// Connection
new ConnectionFactory
{
  HostName = hostName,
  DispatchConsumersAsync = true,
  AutomaticRecoveryEnabled = true,
  TopologyRecoveryEnabled = true,
  RequestedHeartbeat = TimeSpan.FromSeconds(30), // TODO: read from settings
  NetworkRecoveryInterval = TimeSpan.FromSeconds(1) // Reduced from 5 to 1; TODO: read from settings
};

// Publish logic snippet
// The lock ensures that the channel is not used simultaneously by different threads
lock (_locker)
{
  try
  {
    ..........
    _verifyAndRetryOrThrowChannelConnection(_model);

    _model.BasicPublish(
----------------------------------
//  New wait/create
  private void _verifyAndRetryOrThrowChannelConnection(IModel model)
  {
    if (_model.IsOpen)
      return;

    _log.Warn($"Manually reconnecting the publisher model. {_getLogLine()}");

    if (_disposed)
      throw new InvalidOperationException($"Attempting to publish on a disposed instance: {_getLogLine()}");

    var maxAttempts = 11; //TODO: Move to settings
    var attemptCount = 0;

    // NetworkRecoveryInterval is 1s
    // Wait for just more than two seconds before creating a new channel

    while (attemptCount <= maxAttempts)
    {
      if (_model.IsOpen)
        break; // Auto recovery worked
      if (_connectionsProvider.IsProducerConnected && attemptCount > 6)
        break; // Connection recovered, but the channel did NOT
      attemptCount++;
      Thread.Sleep(200); // Needs System.Threading; avoids allocating an undisposed wait handle on every iteration
    }

    if (_model.IsOpen)
      return; // Auto recovery worked

    _log.Warn($"Manually creating the publisher model. {_getLogLine()}");

    try
    {
      _model.CallbackException -= _model_CallbackException;
      _model.BasicRecoverOk -= _model_BasicRecoverOk;
      _model.ModelShutdown -= _model_ModelShutdown;
      if (_confirmPublish)
      {
        _model.BasicAcks -= _model_BasicAcks;
        _model.BasicNacks -= _model_BasicNacks;
      }
    }
    catch (Exception e)
    {
      _log.Error($"Failed to unregister publisher model events. {e.Message}", e);
    }
    _createModel(_connectionsProvider.GetProducerConnectionProvider, _confirmPublish);
  }

Luke Bakken

Sep 3, 2022, 9:25:01 AM
to rabbitmq-users
Thank you for starting a new discussion.

On Thursday, September 1, 2022 at 10:55:36 PM UTC-7 jac...@gmail.com wrote:
Common log entries on the server, they come in pairs for each running server at that moment and it seems it is all the running servers
client unexpectedly closed TCP connection
closing AMQP connection <0.19350.203> (172.29.17.10:37820 -> 172.27.0.2:5672 - 192.168.64.3-Consumer, vhost: '/', user: 'iiabapi'):

This means that either your applications close the connection abruptly or a network device in between does. 
  • Removed the load balancers and connect to the RabbitMQ server directly
With a reliable network and client applications, it should be very rare for you to ever see client unexpectedly closed TCP connection in the RabbitMQ log files.

Questions:
  • Did removing the load balancers have any effect on the frequency of the errors?
  • Are you certain that your applications are behaving correctly and not crashing or experiencing internal exceptions that could cause the connection to close abruptly?
I suggest upgrading to .NET client 6.4.0 and enabling connection and topology recovery (like you've done). The correct pattern is to queue messages to memory (and to disk if you really don't want to lose them) and wait for the connection / model recovery events before publishing again.

We should have an example of doing that so I'll see if I can find time to work on it.
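
A minimal sketch of that buffering pattern, under stated assumptions, might look like the following. BufferedPublisher and its members are hypothetical names and not part of RabbitMQ.Client; the publish delegate stands in for whatever wraps _model.BasicPublish, and OnConnectionLost / OnRecovered are meant to be called from the connection's shutdown and recovery event handlers (in the 6.x client these would presumably be ConnectionShutdown and IAutorecoveringConnection.RecoverySucceeded, but treat that wiring as an assumption to verify). It buffers in memory only and treats "publish did not throw" as success.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper, NOT part of RabbitMQ.Client: holds messages in memory
// while the connection is down and replays them in order after recovery.
public sealed class BufferedPublisher
{
    private readonly object _locker = new object();
    private readonly Queue<byte[]> _pending = new Queue<byte[]>();
    private readonly Action<byte[]> _publish; // assumed to wrap the real publish call
    private bool _connected = true;

    public BufferedPublisher(Action<byte[]> publish)
    {
        _publish = publish ?? throw new ArgumentNullException(nameof(publish));
    }

    public int PendingCount
    {
        get { lock (_locker) return _pending.Count; }
    }

    // Call from the connection shutdown handler.
    public void OnConnectionLost()
    {
        lock (_locker) _connected = false;
    }

    // Call from the connection recovery handler.
    public void OnRecovered()
    {
        lock (_locker)
        {
            _connected = true;
            Flush();
        }
    }

    public void Publish(byte[] body)
    {
        lock (_locker)
        {
            // Always enqueue first so ordering is preserved across outages.
            _pending.Enqueue(body);
            if (_connected) Flush();
        }
    }

    // Caller must hold _locker. A message leaves the queue only after
    // the publish delegate returns without throwing.
    private void Flush()
    {
        while (_pending.Count > 0)
        {
            try
            {
                _publish(_pending.Peek());
            }
            catch (Exception)
            {
                // Simplification: any failure is treated as a lost connection.
                // The message stays queued until OnRecovered is called.
                _connected = false;
                return;
            }
            _pending.Dequeue();
        }
    }
}
```

If publisher confirms are enabled, a stricter variant would remove a message from the buffer only when its ack arrives rather than when the publish call returns. Persisting the buffer to disk, as mentioned above, is left out of this sketch.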

Thanks,
Luke

Jaco De Villiers

Sep 5, 2022, 1:27:20 AM
to rabbitm...@googlegroups.com
Thanks Luke, I will upgrade.

The code is currently in DEV/UAT. It will only deploy to PROD on Thursday this week. We are trying to limit the errors, but that is difficult when they do not occur in DEV/UAT. The connection string change will also go in at the same time.

The only reason we did not wait for the recovery to happen on its own is that we do not see any recovery events. We need to ensure that messages are not lost, and for that reason we are doing a manual recovery attempt.

I am open to any suggestions/help/samples.  Please shout if you have any.  Can we keep the discussion open until I can let you know that the issues have been resolved?




