Strange Heartbeat behaviour between client and server

Leendert Dommicent

Oct 18, 2018, 3:55:03 AM
to rabbitmq-users
We are investigating some production problems we had with connection recovery due to network failures.

Our environment:
- Windows operating system
- RabbitMQ 3.7.8
- Official .NET Client 5.1.0
- Heartbeat timeout: 60s
- Reconnection interval: 10s
- Network failure is simulated by pulling the ethernet cord from the client
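
For reference, this is roughly how we configure the connection (a minimal sketch rather than our production code; the host name is a placeholder, and RequestedHeartbeat is in seconds in client 5.x):

using System;
using RabbitMQ.Client;

class ConnectionSetup
{
    static void Main()
    {
        var factory = new ConnectionFactory
        {
            HostName = "broker.example.local",  // placeholder host
            RequestedHeartbeat = 60,            // heartbeat timeout, in seconds
            AutomaticRecoveryEnabled = true,    // reconnect after a network failure
            TopologyRecoveryEnabled = true,     // re-declare queues/consumers on recovery
            NetworkRecoveryInterval = TimeSpan.FromSeconds(10)  // our 10s reconnection interval
        };

        using (IConnection connection = factory.CreateConnection())
        {
            // ... declare queues and start consumers here ...
            Console.ReadLine();
        }
    }
}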

Issue we encounter:
During network recovery we get resource_locked exceptions:
"operation queue.declare caused a channel exception resource_locked: cannot obtain exclusive access to locked queue 'flx.cb49fb54359a466eac06fa5b57bc4386' in vhost 'TNP_PRD'"
And consumers are not recovered.

What we already discovered:
For consuming from certain exchanges we use exclusive, auto-delete queues named "flx.{GUID}". During the network failure the client has already dropped the connection and tries to recover. The server, however, still had the connection open, so when topology recovery tries to create the flx. queue it gets the resource_locked exception and the channel is closed. Because of this, all other recovery fails with an AlreadyClosedException. We are already investigating whether we can change the flx. queues to server-generated names. However, we also want to tune the heartbeat and reconnection interval so the connection is closed on the server before the client tries to recover.
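
For illustration, the declaration pattern we use today looks roughly like this (a sketch, not our exact code):

using System;
using RabbitMQ.Client;

static class FlxQueues
{
    public static string DeclareFlxQueue(IModel channel)
    {
        // Client-named, exclusive, auto-delete queue. If the server still
        // considers the old connection open, re-declaring this exact name
        // during recovery fails with resource_locked and closes the channel.
        string name = "flx." + Guid.NewGuid().ToString("N");
        channel.QueueDeclare(queue: name, durable: false, exclusive: true,
                             autoDelete: true, arguments: null);
        return name;
    }
}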

What we expect is that if the heartbeat timeout is, for example, 20s, both the client and the server send heartbeat frames every 10s, and when 2 frames are not received the connection is dropped. This is however not what we experience during testing. When tracing with Wireshark we see that the client sends a frame every timeout / 4 (5s) and the server every timeout (20s).

Also the dropping of the connection does not seem consistent.

On the client we always receive the following error:
Error: Exception: System.Exception
AMQP close-reason, initiated by Library, code=541, text="Unexpected Exception", classId=0, methodId=0, cause=System.Net.Sockets.SocketException (10060): A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond
   at RabbitMQ.Client.Impl.InboundFrame.ReadFrom(NetworkBinaryReader reader)
   at RabbitMQ.Client.Framing.Impl.Connection.MainLoopIteration()
   at RabbitMQ.Client.Framing.Impl.Connection.MainLoop()

On the server we sometimes receive:
client unexpectedly closed TCP connection
and if the time-out is small enough we receive
missed heartbeats from client, timeout: 4s

We experimented with different timeouts but they do not give consistent timings for when the two sides disconnect (timings at the bottom). For several values the time between the disconnect by the client and the disconnect by the server is almost 10 seconds. When it is over 10 seconds the recovery fails.

Can anybody explain why there is a discrepancy between the frames sent by the client and the frames sent by the server, and why neither complies with the documentation?
Does anybody have advice on how to tune the heartbeat timeout and the reconnection interval so that this can no longer occur?

We can provide more logging if necessary.

Thank you in advance,
Leendert Dommicent

-- 30s --
Last successful heartbeat client: 11:39:17
Last successful heartbeat Server: 11:39:09
Network interruption: 11:39:20
Client disconnect: 11:39:43 -> 26s
Server disconnect: 11:39:58 -> 41s

-- 15s --
Last successful heartbeat client: 10:24:40
Last successful heartbeat Server: 10:24:29
Network interruption: 10:24:44
Client disconnect: 10:24:59 -> 19s
Server disconnect: 10:25:09 -> 40s

Last successful heartbeat client: 10:50:38
Last successful heartbeat Server: 10:50:26
Network interruption: 
Client disconnect: 10:50:57 -> 19s
Server disconnect: 10:51:00 -> 34s

Last successful heartbeat client: 11:24:59
Last successful heartbeat Server: 11:24:55
Network interruption: 11:25:03
Client disconnect: 11:25:22 -> 23s
Server disconnect: 11:25:29 -> 33s

-- 10s --
Last successful heartbeat client: 11:51:09
Last successful heartbeat Server: 11:51:04
Network interruption: 11:51:12
Client disconnect: 11:51:24 -> 15s
Server disconnect: 11:51:33 -> 29s

-- 5s --
Last successful heartbeat client: 09:00:24
Last successful heartbeat Server: 08:47:22
Network interruption: 09:00:25
Client disconnect: 09:00:31 -> 7s
Server disconnect: 09:00:32 -> 10s

Michael Klishin

Oct 18, 2018, 11:11:52 AM
to rabbitm...@googlegroups.com
Before making any conclusions around heartbeats, keep 2 things in mind:

 * Any traffic counts for a heartbeat and the two ends of a connection do not perform checks at exactly the same time.

 * There was a fair amount of confusion during the early days of client library implementations where
   some configured a "timeout" (2 skipped intervals) and some configured an interval. I suspect that's the real reason here.

We do not recommend values lower than 5 seconds [1].

The "client unexpectedly closed TCP connection" is covered in [2].

Lastly, there's a natural race condition between what clients perform during recovery and exclusive queue deletion. It's unusual to see
that with a 10s recovery delay but with a lot of exclusive queues it can happen. Java client introduced recovery operation retries
because there is no way RabbitMQ or clients can avoid it otherwise. This is not a feature available in the .NET client yet.


--
MK

Staff Software Engineer, Pivotal/RabbitMQ

Leendert Dommicent

Oct 19, 2018, 2:19:37 AM
to rabbitmq-users
Hi,

We kept our test setup where we did the Wireshark tracing very clean. There was just one consumer, so the only frames that went over the wire were heartbeat frames; we confirmed this.

Also, I had indeed already read about the confusion in the early days. However, I think this does not explain the behavior we are seeing. If it were that confusion, I would expect that with a timeout of 20s the RabbitMQ server would send heartbeats every 10 seconds and the client every 20 seconds. However, this is not what we see: the server is sending every 20 seconds and the client every 5 seconds. That is why we are confused. I understand that there will always be a difference between the client detecting a dead TCP connection and the server detecting one, but the large gap between the two heartbeat intervals makes the occurrence much more likely in my opinion.

Also, isn't the chance of occurrence determined by both the heartbeat timeout and the recovery delay? I suspect that a 10s recovery delay combined with a heartbeat timeout of 60s still leaves a big chance that this occurs. We can consistently reproduce this with only 1 exclusive queue. We see that, for example, a heartbeat timeout of 15s together with a recovery delay of 10s makes it much less likely, but still not impossible.

I wonder how the connection recovery retries work in the Java client, because when we receive the resource_locked exception, this is a channel-level exception and causes the channel to be closed. So every attempt to recreate objects of that channel will fail. That is why, in our case, all consumers fail to recover because of just one exclusive queue. I think the only solution is that when a channel-level exception is thrown, the whole channel is recreated. Is this something the Java client does?

Thank you for your response and the information,
Leendert Dommicent



On Thursday, October 18, 2018 at 5:11:52 PM UTC+2, Michael Klishin wrote:

Michael Klishin

Oct 19, 2018, 3:33:47 AM
to rabbitm...@googlegroups.com
I'm not sure I understand what the root cause is. Consider putting together an executable example.

Heartbeats and connection recovery delay are orthogonal. The latter kicks in after a connection failure has been detected, whether via heartbeats or any other means.

Luke Bakken

Oct 19, 2018, 11:06:49 AM
to rabbitmq-users
Hi Leendert,

Relatively recently I had to review how heartbeats were implemented in the Pika client (python), so I thought I'd address this statement:

What we expect is that if the heartbeat timeout is, for example, 20s, both the client and the server send heartbeat frames every 10s, and when 2 frames are not received the connection is dropped. This is however not what we experience during testing. When tracing with Wireshark we see that the client sends a frame every timeout / 4 (5s) and the server every timeout (20s).

This is a common source of confusion that I had to re-read to understand when working on Pika. The AMQP spec and our documentation refer to the "heartbeat timeout" as what is negotiated during AMQP connection establishment. So, when you use a value of 60 seconds, that represents a timeout, not an interval. If RabbitMQ does not receive any data or heartbeat frames on a connection for 60 seconds after the last time it received data or a heartbeat frame, that connection's heartbeat timeout has expired and the connection is considered dead.

The "heartbeat interval", i.e. how often a heartbeat frame is sent, is half the negotiated heartbeat timeout. I just ran RabbitMQ 3.7.8 and a single Pika consumer with a negotiated heartbeat timeout of 10 seconds. RabbitMQ sends heartbeat frames every 5 seconds, so I'm wondering what's up in your environment if your application negotiates a 20 second timeout and RabbitMQ is sending every 20 seconds. Could you re-check that?
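
If it helps, the negotiated value can be read back from the connection object (a sketch; if I remember the 5.x .NET client API correctly, IConnection.Heartbeat exposes the negotiated timeout in seconds):

using System;
using RabbitMQ.Client;

class NegotiatedHeartbeatCheck
{
    static void Main()
    {
        var factory = new ConnectionFactory
        {
            HostName = "localhost",   // placeholder
            RequestedHeartbeat = 20   // what the client asks for, in seconds
        };
        using (IConnection connection = factory.CreateConnection())
        {
            // What the client and server actually agreed on:
            Console.WriteLine("negotiated heartbeat timeout: {0}s", connection.Heartbeat);
            Console.WriteLine("expected send interval:       {0}s", connection.Heartbeat / 2);
        }
    }
}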

As for the rest of your question, if you could provide a sample project to reproduce this I'll take a look.

Thanks -
Luke

Leendert Dommicent

Oct 20, 2018, 8:13:48 AM
to rabbitmq-users
Hi,

How you explain it is indeed how I understood it, but it is not what I see in our environments. It is good to know that you do see a correct interval of 10 seconds, so maybe it is related to the .NET client.
I made a sample application in .NET Core that creates a connection and a single consumer on a temporary queue.
You can find the application here: https://1drv.ms/u/s!AkQZLBHyBM41gZZcDoWQmZMAMHgetQ
Just unzip it, adapt the appsettings.json and run the executable on an x64 Windows machine. Pressing enter will terminate the connection and pressing enter again will terminate the application.

I tested it with this application and the latest RabbitMQ inside a Docker container. With a timeout of 60s, I see the server sending heartbeats every 60s instead of the expected 30s.
You can see the graph from the RabbitMQ admin panel here: https://1drv.ms/u/s!AkQZLBHyBM41gZZbURtP9l-hocJoRw. I also confirmed it with Wireshark.
I will try to do the same thing with another client as well.

Big thanks for your help already.
Kind regards,
Leendert Dommicent

On Friday, October 19, 2018 at 5:06:49 PM UTC+2, Luke Bakken wrote:

Leendert Dommicent

Oct 20, 2018, 8:47:39 AM
to rabbitmq-users
I just confirmed that with the JavaScript client I also see a heartbeat every 60s from the server when the timeout is 60s.

Kind regards,
Leendert Dommicent

On Saturday, October 20, 2018 at 2:13:48 PM UTC+2, Leendert Dommicent wrote:

Luke Bakken

Oct 22, 2018, 2:18:21 PM
to rabbitmq-users
Hi Leendert,

I can see the same behavior as you, and actually that's also what I see with my Python code - it's negotiating 5 seconds, not 10.

I'm going to figure out if our docs are inaccurate or if something in RabbitMQ changed.

Thanks,
Luke

Luke Bakken

Oct 22, 2018, 3:14:10 PM
to rabbitmq-users
Hi Leendert,

After studying the heartbeat code in RabbitMQ I can say that it's working correctly. From the docs: https://www.rabbitmq.com/heartbeats.html

Any traffic (e.g. protocol operations, published messages, acknowledgements) counts for a valid heartbeat

So, this means that the TCP ACK that is sent in response to a client heartbeat frame is considered traffic and will prevent a server-side heartbeat frame from being sent. You can see the code that performs this check here: link

Basically, as long as the send_oct stat for a socket is increasing, no heartbeat frame will be sent to the client. In the case of applications like your C# app and my Python app, which only consume and have no activity on the connection other than heartbeats, the send_oct stat increases in such a way that heartbeat frames are sent to the client at an interval equal to the negotiated timeout.

I can't add anything to Michael's explanation with regard to connection recovery. The .NET client should be modified to retry operations in the case of certain failures in the same manner that the Java client does. For now you'll have to handle retries yourself.
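
Until then, a minimal sketch of what such a retry could look like (a hypothetical helper, not client API; it relies on resource_locked mapping to AMQP reply code 405, and it opens a fresh channel per attempt since a channel-level error closes the channel):

using System;
using System.Threading;
using RabbitMQ.Client;
using RabbitMQ.Client.Exceptions;

static class RecoveryRetries
{
    public static string DeclareExclusiveWithRetry(IConnection connection,
                                                   string queueName,
                                                   int maxAttempts,
                                                   TimeSpan delay)
    {
        for (int attempt = 1; ; attempt++)
        {
            IModel channel = connection.CreateModel();
            try
            {
                return channel.QueueDeclare(queue: queueName, durable: false,
                                            exclusive: true, autoDelete: true,
                                            arguments: null).QueueName;
            }
            catch (OperationInterruptedException ex)
                when (ex.ShutdownReason != null
                      && ex.ShutdownReason.ReplyCode == 405  // resource_locked
                      && attempt < maxAttempts)
            {
                channel.Dispose();    // this channel is dead after the channel exception
                Thread.Sleep(delay);  // give the server time to drop the old connection
            }
        }
    }
}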

Thanks,
Luke

Michael Klishin

Oct 23, 2018, 6:35:19 AM
to rabbitmq-users
Luke identified an interesting issue in the heartbeat implementation:

Basically, the heartbeat sender keeps track of socket statistics (sent and received bytes) but didn't account for the fact that its own actions affect them.
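
A toy model of that feedback loop (this only paraphrases the logic; the real sender is Erlang):

using System;

class HeartbeatSenderSimulation
{
    static void Main()
    {
        long sentBytes = 0;  // what the socket reports (send_oct)
        long lastSeen = 0;   // snapshot the sender compares against

        for (int tick = 1; tick <= 8; tick++)  // one tick = Timeout/2 elapsed
        {
            long current = sentBytes;
            if (current == lastSeen)
            {
                Console.WriteLine("tick {0}: idle since last check -> send heartbeat frame", tick);
                sentBytes += 8;  // the frame itself increases send_oct
            }
            else
            {
                Console.WriteLine("tick {0}: saw \"traffic\" -> skip", tick);
            }
            lastSeen = current;  // bug: snapshot taken before our own frame was counted
        }
        // Output: frames at ticks 1, 3, 5, 7 - one per full Timeout instead of
        // one per Timeout/2. The fix is to take the snapshot after sending.
    }
}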

We are QA'ing the PR and expect it to ship in 3.7.9.

Luke Bakken

Oct 23, 2018, 10:31:49 AM
to rabbitmq-users
Leendert -

For what it's worth, this was after I wrote my explanation and thought about it some more.

Leendert Dommicent

Oct 23, 2018, 3:00:26 PM
to rabbitmq-users
Hi,

Our first thought was also that the heartbeats of the client were counting as traffic, but then we would expect no heartbeats from the server at all, because the client is sending every timeout/4.
But if the server counts its own heartbeats as traffic, that indeed explains the behaviour.

Thank you for looking into the problem some more and for providing a fix in the next release.

Kind regards,
Leendert Dommicent

On Tuesday, October 23, 2018 at 4:31:49 PM UTC+2, Luke Bakken wrote:

Michael van der Werve

Jan 15, 2019, 5:16:08 AM
to rabbitmq-users
Hi,

It does not seem to be fully resolved yet. When we test it with 3.7.10, it seems like the first heartbeat is missed.

Overview of what is happening (default heartbeat is set to 10):
- We get a Tune frame with a heartbeat timeout value of 10
- We accept the 10 in the TuneOk frame
- We get a Connection.Open frame, at which point we start a timer to send heartbeats every 5 seconds

We go ahead and send heartbeats every 5 seconds (timeout / 2); RabbitMQ, however, only sends its first heartbeat after the full 10 seconds (the timeout), followed by one every 5 seconds after that (which is the correct behavior). The latter portion is correct, but it seems weird that once the connection is started, the first heartbeat is only sent after the full timeout value.

Any reason why it would be different on a connection that has just been opened? I'm not an Erlang expert, so it's very hard for me to comprehend what is happening in the PR that was submitted and merged to fix the heartbeats, and why it still may not be correct for the first heartbeat.

Sincerely,
Michael van der Werve

Luke Bakken

Jan 15, 2019, 11:15:17 AM
to rabbitmq-users
Hi Michael,

It is correct to send the first heartbeat after 10 seconds, because the process of opening the connection and tuning it sends data over the socket, which is counted as a "heartbeat". Don't forget that data sent over a connection will prevent an actual heartbeat frame from being sent. It is only over completely idle connections that heartbeat frames are actually sent.

Thanks,
Luke

Michael van der Werve

Jan 15, 2019, 1:33:08 PM
to rabbitmq-users
Hey,

Thanks for your reply, but that's not what we see happening. It should send a heartbeat every 5 seconds if the timeout is 10, but if I look at the TCP dump, the first heartbeat comes exactly 10 seconds after the last TCP traffic from the server.

I would expect it to send the first heartbeat after 5 seconds, however, since after the first heartbeat it continues with a 5 second interval normally. If you want, I can attach the dump file or a screenshot of the AMQP traffic; would that help? I am pretty sure the current implementation is still incorrect for the first heartbeat, based on all the testing I did today and the observed behaviour.

Luke Bakken

Jan 15, 2019, 3:02:19 PM
to rabbitmq-users
Hi Michael,

I see the same thing. It's due to the fact that two half intervals must expire to observe no difference in the send_oct statistic, which triggers a heartbeat from RabbitMQ at 10 seconds. Subsequent activity on the socket ensures that they will be sent every 5 seconds. There is no downside to this behavior that I can think of, since clients consider two missed intervals to be indicative of a dead connection.

I might see about changing this last edge case but only if I can do so with minimal modifications.

Thanks,
Luke

Michael Klishin

Jan 15, 2019, 3:06:51 PM
to rabbitm...@googlegroups.com
This discrepancy doesn't have any practical impact according to the explanation, whereas at the beginning of this thread it did: in some cases the server could incorrectly compute/interpret a socket's data transfer stats, meaning the delivery schedule was off for more or less the duration of a connection.

The 2 skipped heartbeats rule exists exactly because expecting a perfectly timely delivery of each frame is unrealistic, so a single skipped one
is too likely to produce false positives.

Specifically in this case, after a connection is open there almost certainly is further traffic on the connection and it will be counted
as a heartbeat.


Michael van der Werve

Jan 15, 2019, 6:10:33 PM
to rabbitmq-users
Good to see that I'm not the only one observing the discrepancy. 

Assume a heartbeat should be sent every 5 seconds: the first at 5 seconds, the second after 5 more seconds, the third after 5 seconds again (so the heartbeats occur at 5, 10 and 15; time 0 is ignored because that is the opening traffic, so no heartbeat is needed). Then assume no further traffic, as nothing more is happening on the connection.

How do you then interpret 'two missed heartbeats'? I had a discussion with a colleague this morning and we can interpret it in one of two ways.
We have seen data, so the first heartbeat is missed at 5 seconds and the second at 10, so when do we consider the connection dead? Immediately, if at 10 we still do not see a heartbeat? Or rather at 15, when the leeway for the second heartbeat has expired and the third one would just be about to arrive? Or maybe not at 10, but certainly by, for example, 12, when we know for sure that the heartbeat that should have arrived at 10 cannot be in transit any more?

On the order of microseconds, our client library decides that the connection should be closed at time 10, because at that moment it thinks it has not received the past two heartbeats. Little does it know that a heartbeat may still be in transit on TCP; but since the first heartbeat was never sent, no traffic has actually been observed for two heartbeat intervals (10 seconds). We have currently worked around this by giving it the leeway of an extra interval, so we instead only close the connection right before the third heartbeat would arrive.

Are we just misinterpreting the spec and should we give more leeway to the heartbeats, or is this an actual bug? (Unlikely as that may be, because usually there is more data, and after the initial period everything actually works correctly.)
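
For reference, our workaround looks roughly like this (a sketch; our client is not the RabbitMQ .NET one and all names here are illustrative):

using System;

class HeartbeatWatchdog
{
    private readonly TimeSpan _timeout;  // negotiated heartbeat timeout
    private DateTime _lastTraffic = DateTime.UtcNow;

    public HeartbeatWatchdog(TimeSpan negotiatedTimeout)
    {
        _timeout = negotiatedTimeout;
    }

    public void OnInboundTraffic()  // call on any received frame, heartbeat or not
    {
        _lastTraffic = DateTime.UtcNow;
    }

    public bool ConnectionIsDead()
    {
        // Three half-intervals of leeway instead of two: close just before the
        // third heartbeat would be due, not the moment the second one is late.
        TimeSpan deadAfter = TimeSpan.FromTicks(_timeout.Ticks * 3 / 2);
        return DateTime.UtcNow - _lastTraffic > deadAfter;
    }
}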

Michael Klishin

Jan 15, 2019, 6:58:11 PM
to rabbitm...@googlegroups.com
RabbitMQ server and most popular clients would consider the peer to be unavailable at mark 10. Some clients
chose to send heartbeats more frequently than Timeout/2, e.g. at Timeout/2.2 intervals, to avoid the race condition.

Outside of various client test suites, it's extremely rare to see connections that connect and then do nothing. So if heartbeat
delivery starts at mark 10, it still makes no difference in practice.

As Luke mentioned, we would investigate if we can make delivery start earlier (possibly even unconditionally once after
connection handshake completes) but if more than trivial changes would be required, it's not worth the effort. This is one area
of the code that is much more involved than it sounds and test automation in practice doesn't go beyond a couple of really basic tests,
so it's also annoyingly time consuming.

Michael Klishin

Jan 15, 2019, 7:30:15 PM
to rabbitm...@googlegroups.com
Luke and I have found an inconsistency with how the negotiated timeout (from connection.tune-ok) is used by RabbitMQ. It will
send heartbeats at Timeout/2 intervals but check if the peer is down at Timeout intervals and forgive one missed interval.
We'll address this for 3.8.0.

In this specific case, however, even then you'd have a natural race condition between RabbitMQ detecting connection unavailability, cleaning up connection resources (including the exclusive queues), and client recovery. We've seen some more involved issues in this area with other clients. There is simply no way to completely address this unless exclusive access operations block and wait for resource availability, which would be very surprising to the user and would ruin any chance of decent throughput where exclusive queues are used.

What you can do is switch to server-named queues. They will get a new name after recovery (this case is specifically handled by the .NET client's recovery code), and the broker's GUID generator guarantees that no name collisions will take place (we haven't had any issue with it since day 1, IIRC).

Since you are using exclusive queues, you by definition only use those queues for transient data you can afford to lose, so server-naming makes a lot of sense.
Most importantly, this is a solution you can switch to very quickly without waiting for new RabbitMQ releases and it fundamentally addresses/sidesteps the problem.
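
In the .NET client the switch is roughly this (a sketch; the exchange name and routing key are placeholders):

using RabbitMQ.Client;

static class ServerNamedQueues
{
    public static string DeclareAndBind(IModel channel, string exchange)
    {
        // An empty queue name asks the broker to generate a unique one
        // (e.g. "amq.gen-..."), so a re-declare after recovery can never
        // collide with a queue left over from the old connection.
        QueueDeclareOk ok = channel.QueueDeclare(queue: "", durable: false,
                                                 exclusive: true, autoDelete: true,
                                                 arguments: null);
        channel.QueueBind(queue: ok.QueueName, exchange: exchange,
                          routingKey: "#", arguments: null);
        return ok.QueueName;
    }
}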

HTH.

On Thu, Oct 18, 2018 at 10:55 AM Leendert Dommicent <dommicent...@gmail.com> wrote:

Michael Klishin

Jan 15, 2019, 7:35:03 PM
to rabbitm...@googlegroups.com
It's also worth mentioning that TCP keepalives can be used as a solid alternative to heartbeats in AMQP 0-9-1 and the other messaging protocols we support [2]. They are not discouraged; they just require kernel TCP stack tuning, which is not an option in some environments.

Of course, even that won't address the race condition between exclusive queue removal and a concurrently recovering client, but I figured it should be mentioned here for completeness. Again, the best part is that it's something you can adopt without waiting for any new releases.

Michael Klishin

Jan 15, 2019, 7:50:32 PM
to rabbitm...@googlegroups.com
First change is in [1].

Attached is a filtered Wireshark capture I used to QA it. Highlighted are the connection.open-ok and server-sent heartbeat frames.

rabbitmq-common-293 pcapng.png
