Connections disappearing exactly 10 minutes after reconnect (without notifying the client)

266 views
Skip to first unread message

urrmurrmur

unread,
Mar 30, 2019, 5:45:14 PM3/30/19
to rabbitmq-users
Hi,

First some context: I am currently involved in a project where we are using RabbitMQ to communicate between a server and several Unity clients. I'm a beginner at RabbitMQ and not very experienced at network communication in general, our RabbitMQ code was written by one of my colleagues who is currently unavailable. What I should really be doing is to spend a couple of days reading up on RabbitMQ and experimenting with it to gain a deeper understanding of the system, but we are on a very tight deadline (project launch is next week) and unfortunately I don't have the time to approach this problem as thoroughly as I should. So I'm hoping to get some help here, because I suspect there is some fundamental flaw in our code that may be obvious to more experienced people.


Ok, here's the issue. Each of our clients creates a connection to the server, that uses an EventingBasicConsumer to process messages which we use to login to our iOS app, which is written in Unity. Occasional message loss is not that important, but our connections really have to stay open. That's crucial. So we have tried to write a double fallback for lost connections: on the one hand, we use the ConnectionFactory's automaticRecoveryEnabled, and on the other, I check _connection.isOpen() every 10 seconds and if it is not, I recreate out entire MessageQueueResource from scratch. We used to have only the automatic reconnect, but we still lost connections so I added the second system because I suspected the client was sometimes not notified when connections died. But it doesn't seem to make much of a difference.

The process always seems to be the same:
  1. A connection is lost for some reason
  2. The automatic recovery tries to reopen the connection. I have no idea if this is successful
  3. The 10 second heartbeat notices the connection is not open, and tries to recreate it
  4. The connection continues to exist on the server and sends and receives heartbeats. It still seems functional (if I try to trigger messages I can see a traffic spike in the management panel).
  5. Exactly 10 minutes after the reconnect, the client suddenly stops sending heartbeats (at least I think so: the traffic in the management panel stops showing traffic from client) and the connection disappears
  6. The client still seems to think the connection is open, because my 10 second heartbeat doesn't trigger a reconnect (and neither does the auto-recovery)
This only happens when connections die naturally, I cannot recreate the problem by manually killing connections on the server. I suspect we are doing something wrong in the way we reconnect. Perhaps we are not disposing everything correctly, which causes the reinitialize the make a semi-working connection that automatically dies after 10 minutes?


I've attached the our MessageQueueResource class which handles a connection and Consumes events, and the relevant parts of our RFIDManager which maintains an instance of MessageQueueResource, tries to recreate it if a connection was lost, and processes the messages.
MessageQueueResource.cs
RFIDManager.cs

Luke Bakken

unread,
Mar 31, 2019, 2:40:38 PM3/31/19
to rabbitmq-users
Hello,

It would be great to know what versions of software you're using - RabbitMQ, Erlang and the RabbitMQ .NET client.

What is logged by RabbitMQ from the time a connection "dies naturally" and during the reconnect attempts?

Thanks -
Luke

urrmurrmur

unread,
Mar 31, 2019, 7:40:08 PM3/31/19
to rabbitmq-users
Hi Luke,

Thanks for replying. We are using RabbitMQ 3.6.10, Erlang 19.2.1 and .NET client 5.1.0.

I'm not sure what logs are available, but I have attached the ones I found. Are there any others I should be looking for? Below is a relevant snippet from an event like the one I described above. The log confirms my hypothesis: a connection is lost (at 19:08:09), restored as intended, and almost exactly 10 minutes later (at 19:18:01), the connection is lost again. Only the second time it doesn't come back. It seems the client suddenly stops sending heartbeats, but I can't figure out why. There are numerous similar events in the log after that one.

Note, we deployed a new version of the app today around 16:00, which is broken in a new way. Now our app hangs entirely when attempting the reconnect, so clearly we're doing something wrong there. I think the most relevant part of the logs are the ones from 30/03 around 18:00 until 31/03 until 16:00.


 
=ERROR REPORT==== 30-Mar-2019::19:08:09 ===
closing AMQP connection <0.26255.1> (84.193.108.71:2306 -> 185.135.12.176:5671 - Wildemanshut 17):
missed heartbeats from client, timeout: 10s

=INFO REPORT==== 30-Mar-2019::19:08:11 ===
accepting AMQP connection <0.27539.1> (84.193.108.71:49236 -> 185.135.12.176:5671)

=INFO REPORT==== 30-Mar-2019::19:08:11 ===
Connection <0.27539.1> (84.193.108.71:49236 -> 185.135.12.176:5671) has a client-provided name: Wildemanshut 17

=INFO REPORT==== 30-Mar-2019::19:08:11 ===
connection <0.27539.1> (84.193.108.71:49236 -> 185.135.12.176:5671 - Wildemanshut 17): user 'bruegelgamebokrijkbe' authenticated and granted access to vhost 'bruegelgamebokrijkbe'

=INFO REPORT==== 30-Mar-2019::19:13:54 ===
accepting AMQP connection <0.27608.1> ([::1]:34294 -> [::1]:5672)

=INFO REPORT==== 30-Mar-2019::19:13:54 ===
connection <0.27608.1> ([::1]:34294 -> [::1]:5672): user 'bruegelgamebokrijkbe' authenticated and granted access to vhost 'bruegelgamebokrijkbe'

=INFO REPORT==== 30-Mar-2019::19:13:54 ===
closing AMQP connection <0.27608.1> ([::1]:34294 -> [::1]:5672, vhost: 'bruegelgamebokrijkbe', user: 'bruegelgamebokrijkbe')

=WARNING REPORT==== 30-Mar-2019::19:15:59 ===
closing AMQP connection <0.27502.1> (94.224.122.132:49473 -> 185.135.12.176:5671 - Testopstelling Jelle 28, vhost: 'bruegelgamebokrijkbe', user: 'bruegelgamebokrijkbe'):
client unexpectedly closed TCP connection

=ERROR REPORT==== 30-Mar-2019::19:18:01 ===
closing AMQP connection <0.27539.1> (84.193.108.71:49236 -> 185.135.12.176:5671 - Wildemanshut 17):
missed heartbeats from client, timeout: 10s
rabbit@intracto-web087-sasl.log
rabbit@intracto-web087.log

Luke Bakken

unread,
Apr 1, 2019, 2:10:13 PM4/1/19
to rabbitmq-users
Hello,

I quickly reviewed the code you provided. You have automatic recovery enabled in MessageQueueResource, but are also trying to implement recovery in RFIDManager via handling the ConnectionShutdown event.

So, my first inclination is to create MessageQueueResource once and let it handle connection recovery. Modify RFIDManager to not listen to the ConnectionShutdown event. If that addresses your issue, remove raising that event entirely.

Thanks,
Luke

urrmurrmur

unread,
Apr 2, 2019, 2:22:53 AM4/2/19
to rabbitmq-users
Hi,

That's what we did initially, and then we noticed our connections dropping over time. We've had several different implementations by now:
  1. Only using the autorecover
  2. Using the Autorecover and reinitialize the MQResource when an OnConnectionShutdown event is detected
  3. NOT using the autorecover and checking every 10 seconds if the connection is still active (with _connection.IsOpen)
  4. Same as 3, but also using _consumer != null and _consumer.IsRunning
  5. Same as 4, but trying the reconnect in a different thread (and autorecover back on)
  6. Just brute forcing it and deleting the whole mqResource every 9.5 minutes and recreating it
  7. Using a button to reconnect manually by destroying the mqResource and recreating it
The core issue is the same in every case: reconnecting works fine initially, as long as a connection has never died by itself on the server. If it has, the first reconnect works, and every next reconnect in the next 10 minutes also works, but the the connection stops receiving heartbeats on the server and dies again without notifying the client (so autorecover does not get triggered). If we try to reconnect after that, the reconnect doesn't work anymore, it just blocks the thread in which it is running indefinitely. Only fully closing and restarting the app works in that case.

It feels like the connection dying doesn't get handled properly if it is not triggered by the app or the force close button on the server. The client still thinks the connection is alive (_connection.IsOpen returns true), but it does stop sending heartbeats after 10 minutes. And we cannot cleanly dispose of everything to make the reconnect work. The only thing I'm not sure of is whether the _consumer.IsRunning also returns true. I don't think it does, I think it knows that the consumer has stopped or is null, but I'm not sure of that.

Regards,

Jeroen

Michael Klishin

unread,
Apr 3, 2019, 1:30:52 PM4/3/19
to rabbitmq-users
Connections are not "dying" here. There's evidence of missed heartbeats detected by the server in the provided log [1].
[1] explains why client won't detect it at the same time. In other words, connection failure detection is asynchronous on both ends.

You can use a lower (say, 10 seconds — but not lower than 5) heartbeat timeout interval to see if it works better for your case.
Heartbeat timeout used to be close to 10 minutes (580 seconds?)
at one point. According to Git log [2] it is 60 seconds in 3.6.10 (which is out of support, by the way).

Automatic connection recovery reacts to connection termination in the client. Closed TCP connection, when detected, is one of the triggers.
A connection.closed by the server is another (this is what happens when connection is force closed using rabbitmqctl or HTTP  API).

We are not aware of anyissues with heartbeat or connection recovery implementation in the 5.x versions of the .NET client, at least nothing is commonly
reported on this list.


--
You received this message because you are subscribed to the Google Groups "rabbitmq-users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to rabbitmq-user...@googlegroups.com.
To post to this group, send email to rabbitm...@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.


--
MK

Staff Software Engineer, Pivotal/RabbitMQ
Reply all
Reply to author
Forward
0 new messages