Connections closing unexpectedly with Rabbit 3.5.0

Michael Watson

Apr 20, 2015, 2:54:39 PM
to rabbitm...@googlegroups.com
We have long-running services that sometimes get no traffic for hours (or days). We are seeing the server forcibly close connections after some period of time (anywhere from 6 to 12 hours, but the timing is not consistent). This does not happen on our older production server (Rabbit 3.0.2), so it does not appear to be environmental. Rather, this is new behavior in some version of the Rabbit server and/or .NET client DLL after 3.0.2 (we're currently running 3.5.0).

In the RabbitMQ log we see information like:

=ERROR REPORT==== 18-Apr-2015::05:50:35 ===
closing AMQP connection <0.348.0> ([::1]:42670 -> [::1]:5672):
{heartbeat_timeout,running}

We've explicitly NOT requested a heartbeat.

We were hoping that a workaround would be to request a 60-second heartbeat on each connection, but even with heartbeats enabled, connections are still periodically terminated. With the heartbeat enabled, at least one (but by no means all) of our connections is terminated overnight with a timeout. This makes it hard to build a nice reproducible test case.

In our application we have dozens of connections, each with multiple channels (most have 3-5, but some have as many as 50). So far we've been unable to determine anything about which connection will be terminated, only that we always see one or more connections terminated at some point.

We've also tried enabling AutomaticRecoveryEnabled, but as far as we can tell that option has no effect: we still get a System.IO.EndOfStreamException in the client when the server closes the connection with "heartbeat_timeout".

Ideally we'd like to use both a 60-second heartbeat (to avoid connections lost due to network configuration, etc.) and auto-recovery, so that if the server closes the connection, the client automatically re-establishes it. So far we've not seen recovery work successfully. Perhaps we've misunderstood the meaning of the client configuration options.
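A minimal C# sketch of the configuration described above, assuming the 3.5-era RabbitMQ.Client, where RequestedHeartbeat is in seconds and AutomaticRecoveryEnabled, NetworkRecoveryInterval and TopologyRecoveryEnabled are settable on ConnectionFactory. The 10-second recovery interval and the shutdown logging are illustrative choices, not taken from this thread. It needs a running broker, and it is not a fix for the premature heartbeat_timeout closes reported here (later in the thread those went away only with a patched client DLL); whether automatic recovery reacts to a server-initiated close may also depend on the client version.

```csharp
using System;
using RabbitMQ.Client;

class HeartbeatWithRecoverySketch
{
    static void Main(string[] args)
    {
        var factory = new ConnectionFactory
        {
            HostName = "localhost",
            UserName = args[0],
            Password = args[1],
            // Ask for a 60-second heartbeat; the effective value is negotiated with the broker.
            RequestedHeartbeat = 60,
            // Try to reopen the connection and its channels after a connection failure...
            AutomaticRecoveryEnabled = true,
            // ...waiting this long between attempts (10 s is an arbitrary example value).
            NetworkRecoveryInterval = TimeSpan.FromSeconds(10),
            // Re-declare exchanges, queues, bindings and consumers after reconnecting.
            TopologyRecoveryEnabled = true
        };

        using (var connection = factory.CreateConnection())
        {
            // Log why the connection went away (e.g. a heartbeat_timeout close from the server).
            connection.ConnectionShutdown += (sender, reason) =>
                Console.WriteLine("Connection shut down: {0} {1} (initiator: {2})",
                    reason.ReplyCode, reason.ReplyText, reason.Initiator);

            // The heartbeat value actually agreed with the broker, in seconds.
            Console.WriteLine("Negotiated heartbeat: {0}s", connection.Heartbeat);
            Console.ReadLine();
        }
    }
}
```

Logging the shutdown reason alongside the negotiated heartbeat makes it easier to correlate client-side EndOfStreamExceptions with the server's heartbeat_timeout entries.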

Here's a sample program that, when left running overnight, reproduces the timeout.  This one has RequestedHeartbeat set to 0:

        static void Main(string[] args)
        {
            :
            :

            var userName = args[0];
            var password = args[1];
            var exchangeName = "TestExchange";
            var queueName = "test";
            var factory = new ConnectionFactory
            {
                HostName = "localhost",
                UserName = userName,
                Password = password,
                //AutomaticRecoveryEnabled = true,
                RequestedHeartbeat = 0      // 0 = the client does not ask for a heartbeat
                //RequestedHeartbeat = 60
            };
            var connection = factory.CreateConnection();
            var channel = connection.CreateModel();
            // Durable topic exchange
            channel.ExchangeDeclare(exchangeName, ExchangeType.Topic, true);

            // Non-durable, non-exclusive, auto-delete queue, bound to "test.#"
            channel.QueueDeclare(queueName, false, false, true, null);
            channel.QueueBind(queueName, exchangeName, queueName + ".#");

            // Set prefetch before consuming: with global = false, basic.qos only applies
            // to consumers started after it (and prefetch is ignored for auto-ack consumers).
            channel.BasicQos(0, 100, false);

            var messageQueue = new SharedQueue<BasicDeliverEventArgs>();
            var consumer = new QueueingBasicConsumer(channel, messageQueue);
            channel.BasicConsume(queueName, true, consumer);    // noAck = true

            Console.WriteLine("Waiting for a message, heartbeat disabled");
            while (true)
            {
                BasicDeliverEventArgs eventArguments;
                // Wait up to 100 ms for a delivery
                if (messageQueue.Dequeue(100, out eventArguments))
                {
                    // do something
                    Console.WriteLine("got a message");
                }
            }
            // Never reached: the loop above does not exit.
            connection.Close();
        }



Michael Watson

Apr 20, 2015, 3:03:31 PM
to rabbitm...@googlegroups.com
Further information:

  • With RequestedHeartbeat set to 0, all connections for long-running services that don't get traffic will die overnight
  • With RequestedHeartbeat set to 60, the timeouts happen far more frequently, but less consistently. Sometimes the timeout happens in just five minutes.

Michael Klishin

Apr 21, 2015, 12:17:21 AM
to Michael Watson, rabbitm...@googlegroups.com
On 20 April 2015 at 22:03:34, Michael Watson (mswa...@gmail.com) wrote:
> Further information:
>
> With RequestedHeartbeat set to 0, all connections for long-running
> services that don't get traffic will die overnight
> With RequestedHeartbeat set to 60, the timeouts happen far more
> frequently, but less consistently. Sometimes the timeout will
> happen in just five minutes.

The .NET client was changed in 3.5.0 to use timers instead of separate threads for heartbeats.

Please capture a traffic session with a relatively low timeout value, e.g. 20, and send it to us (off-list if desired).
--
MK

Staff Software Engineer, Pivotal/RabbitMQ


Michael Watson

Apr 21, 2015, 7:14:24 AM
to Michael Klishin, rabbitm...@googlegroups.com
Thanks, will do!

Where do I configure the 20-second timeout: in the server or in the client?

Michael Klishin

Apr 21, 2015, 8:13:43 AM
to Michael Watson, rabbitm...@googlegroups.com
 On 21 April 2015 at 14:14:23, Michael Watson (mswa...@gmail.com) wrote:
> Where do I configure the timeout to be 20 seconds. Is that in the
> server, or the client?

http://www.rabbitmq.com/heartbeats.html

Since you are trying to evaluate a client's implementation, configure it in the client.
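The heartbeat guide linked above describes how the timeout is negotiated at connection time: if either side proposes 0, the larger (non-zero) value is used; otherwise the smaller of the two wins. That is one plausible reading of why the first post saw heartbeat_timeout even with RequestedHeartbeat = 0: if the broker suggested a non-zero value, heartbeats were enabled anyway. The sketch below models that rule with a hypothetical helper (HeartbeatNegotiation.Negotiate is not a RabbitMQ.Client API, and 580 is only an example server value); check the guide for your exact versions before relying on it.

```csharp
using System;

// Hypothetical helper modelling the negotiation rule from the heartbeat guide.
// It is NOT part of RabbitMQ.Client; values are heartbeat timeouts in seconds.
static class HeartbeatNegotiation
{
    // Returns the negotiated timeout (0 means heartbeats are disabled).
    public static ushort Negotiate(ushort clientRequested, ushort serverSuggested)
    {
        // If either side proposes 0, the larger (i.e. the non-zero) value is used...
        if (clientRequested == 0 || serverSuggested == 0)
            return Math.Max(clientRequested, serverSuggested);
        // ...otherwise the smaller of the two values is used.
        return Math.Min(clientRequested, serverSuggested);
    }

    static void Main()
    {
        // 580 stands in for whatever non-zero value the broker suggests.
        Console.WriteLine(Negotiate(0, 580));   // 580: client "disabled" it, broker's value wins
        Console.WriteLine(Negotiate(60, 580));  // 60
        Console.WriteLine(Negotiate(20, 580));  // 20: a low client value, as suggested for the capture
        Console.WriteLine(Negotiate(0, 0));     // 0: heartbeats actually disabled
    }
}
```

Under this rule, setting the low value (e.g. 20) in the client is enough to make it the effective timeout whenever the broker suggests something larger.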

John Smithson

Apr 23, 2015, 10:20:08 AM
to rabbitm...@googlegroups.com
Looks like we are having the same issue. On our production servers (old version) everything is fine.
But I'm currently deploying our test environment.

I installed version 3.5.1 (Windows Server 2008 R2), and after we started our servers (.NET / client 3.5.1 / running on the same host) we saw that our consumers were reconnecting at random intervals.
In the RabbitMQ logs we found a "heartbeat_timeout" error for every reconnect.

The heartbeat value is 2 minutes. There is no message traffic.
Our consumer code is almost the same as Michael's; the only differences are that we use a regular queue without an exchange and our dequeue timeout is 1000 ms.

We've tried different Erlang versions; that does not seem to affect the situation.
We also tried version 3.5.0 (and I think 3.4.4 as well, though I'm not sure): same error.

I ended up installing version 3.4.3. The servers have been running for 2 days, with no errors spotted so far.

I thought it was because of our environment: our server is a Rabbit veteran that has had almost every version installed over the last 2 years.
But it looks like it's something else; I saw a related question on Stack Overflow.

Michael Klishin

Apr 23, 2015, 10:24:54 AM
to John Smithson, rabbitm...@googlegroups.com
 On 23 April 2015 at 17:20:10, John Smithson (box...@gmail.com) wrote:
> I installed version 3.5.1 (win2008R2) and after we started
> our servers (.net / client 3.5.1 / running on same host) we saw
> that our consumers were reconnecting with random intervals.
> In rabbit logs we've discovered "heartbeat_timeout" error
> for every reconnect.
  
This means that the *client* didn't send heartbeats as expected. I'm investigating this.

John Smithson

Apr 23, 2015, 10:32:29 AM
to rabbitm...@googlegroups.com, box...@gmail.com
Thanks for replying.
Please let me know if you need any assistance.

Michael Klishin

Apr 23, 2015, 10:39:07 AM
to John Smithson, rabbitm...@googlegroups.com
On 23 April 2015 at 17:32:31, John Smithson (box...@gmail.com) wrote:
> We've tried different Erlang versions - seems not affecting
> the situation.
> Also tried version 3.5.0 (I think 3.4.4 as well, not sure though)
> - same error.
>
> I ended up installing version 3.4.3. Servers running for 2 days,
> so far no errors spotted.

John, 

Just to clarify: the versions above refer to the RabbitMQ server, correct?

There were no heartbeat-related changes in RabbitMQ server 3.5. So you can use the 3.5 server
if needed with a 3.4 client (or any client since 2.0, for that matter).

John Smithson

Apr 23, 2015, 11:30:51 AM
to rabbitm...@googlegroups.com, box...@gmail.com
The versions are the same for both server and client.
After installing each RabbitMQ server version, we recompiled our servers against the corresponding client.

I will try updating the server to 3.5.1 while keeping the clients at 3.4.3.
Will return with results, most probably tomorrow.

Thank you.

Michael Klishin

Apr 23, 2015, 6:35:03 PM
to John Smithson, rabbitm...@googlegroups.com
On 23 April 2015 at 18:30:53, John Smithson (box...@gmail.com) wrote:
> I will try to update the server to 3.5.1 and keep clients as 3.4.3.
> Will return with results, most probably tomorrow.

John,

I have a fix for you to try:
https://github.com/rabbitmq/rabbitmq-dotnet-client/issues/68#issuecomment-95735683

Please find attached a DLL built from the branch above.

Thank you.
RabbitMQ.Client.dll

Michael Klishin

Apr 23, 2015, 9:14:43 PM
to John Smithson, rabbitm...@googlegroups.com
On 24 April 2015 at 01:35:00, Michael Klishin (mkli...@pivotal.io) wrote:
> I have a fix for you to try:
> https://github.com/rabbitmq/rabbitmq-dotnet-client/issues/68#issuecomment-95735683
>
> Please find attached a DLL build from the branch above.

I've pushed a few more updates to the branch, so here is a new DLL.
RabbitMQ.Client.dll

John Smithson

Apr 24, 2015, 7:04:23 AM
to rabbitm...@googlegroups.com, box...@gmail.com
Michael,

I ran a few consistent tests:
1) Server v3.5.1 + Client 3.4.3: seems to work fine, but I only waited ~30 minutes.
2) Server v3.5.1 + Client 3.5.1: the problem remains, but it occurred only after an hour. In our previous tests it took about 10-15 minutes.
3) Server v3.5.1 + your fixed client (latest): running for 2+ hours at the moment. Seems fixed.

Going to test it with some minor message traffic and leave it running for the weekend.

Thanks for your efforts.

John Smithson

Apr 27, 2015, 8:59:09 AM
to rabbitm...@googlegroups.com, box...@gmail.com
Michael,

After running over the weekend, everything looks OK.
Nothing suspicious in the logs.

Thank you.