Missed heartbeats from client, timeout: 30s

Sushil Chaudhary

Jun 15, 2017, 4:18:03 PM
to rabbitmq-users
Hi All,


We have a weird problem with a RabbitMQ cluster in AWS.

- We have a 3-node RabbitMQ cluster behind a classic ELB in AWS. The cluster is configured to use only TLS in both directions, and the plain TCP port is disabled. The ELB does not have any SSL certificate deployed. We have a Java client that connects to the RabbitMQ cluster using the Route 53 URL of the ELB. The Java client sets heartbeat = 30 seconds for its RabbitMQ connection. The client is able to publish messages to RabbitMQ and consume from it, so everything works perfectly fine when we have traffic.

When there is no traffic, RabbitMQ starts logging "Missed heartbeats from client, timeout: 30s". At the same time, the client reports a similar exception:

Caused by: com.rabbitmq.client.MissedHeartbeatException: Heartbeat missing with heartbeat = 30 seconds
    at com.rabbitmq.client.impl.AMQConnection.handleSocketTimeout(AMQConnection.java:723) ~[amqp-client-4.0.2.jar!/:4.0.2]
    at com.rabbitmq.client.impl.AMQConnection.readFrame(AMQConnection.java:642) ~[amqp-client-4.0.2.jar!/:4.0.2]
    at com.rabbitmq.client.impl.AMQConnection.access$300(AMQConnection.java:47) ~[amqp-client-4.0.2.jar!/:4.0.2]
    at com.rabbitmq.client.impl.AMQConnection$MainLoop.run(AMQConnection.java:572) ~[amqp-client-4.0.2.jar!/:4.0.2]
    ... 1 more

Surprisingly, this behavior is random. Even though connections were idle for long periods, it did not happen for a whole day, but the next day it happened at times. Today, again, we do not see any such exception. Our heartbeat setting looks fine, and when there is a new message the client is able to reconnect to RabbitMQ automatically. But sometimes it throws "Missed heartbeats from client, timeout: 30s".

We do not see any reason to suspect network glitches or connectivity issues, as no messages are being lost or failing. Only the heartbeat is failing.

Any idea what is wrong? We appreciate any help!


Thanks,
Sushil

Michael Klishin

Jun 15, 2017, 4:22:02 PM
to rabbitm...@googlegroups.com
Check ELB idle connection timeouts. Load balancers are known to close connections
that go idle for a while, and heartbeats have the side effect of keeping such connections
alive if the heartbeat period (in AMQP 0-9-1, roughly one half of the timeout) is shorter
than the load balancer's idle timeout.

And, of course, you can reduce your heartbeat timeout to, say, 10-20 seconds.
This will produce traffic on client connections every 5-10 seconds but is not too
low (won't result in false positives).

--
You received this message because you are subscribed to the Google Groups "rabbitmq-users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to rabbitmq-users+unsubscribe@googlegroups.com.
To post to this group, send email to rabbitmq-users@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.



--
MK

Staff Software Engineer, Pivotal/RabbitMQ

Sushil Chaudhary

Jun 15, 2017, 4:28:01 PM
to rabbitmq-users
MK,
thanks for the response. I forgot to mention that we also have an ELB timeout set up. The ELB idle timeout is set to 60 seconds, while the RabbitMQ cluster has its heartbeat set to 30 seconds.

We can lower the heartbeat to 10-20 seconds, but we are not sure that will help.

Michael Klishin

Jun 15, 2017, 5:08:31 PM
to rabbitm...@googlegroups.com
That shouldn't be necessary in theory.

Can you post server logs around the time a client detects a heartbeat timeout (1 minute before and after)?
Can it be that you have a timeout set up for "plain" TCP connection but this client uses TLS (and thus a different port)?

Have you verified e.g. in the management UI that clients actually use the heartbeat timeout you believe
them to use?



Sushil Chaudhary

Jun 15, 2017, 5:43:33 PM
to rabbitmq-users


We have the ELB timeout set up for TCP port 5671, and the client connects to the same port using TLS. Below are the settings from the ELB:


5671 (TCP) forwarding to 5671 (TCP)

Stickiness options not available for TCP protocols

Attributes
Idle timeout: 60 seconds
Access logs: Disabled
Cross-Zone Load Balancing: Enabled

Below are the log messages from RabbitMQ. We did see the error message yesterday, but there is no error message today.

=ERROR REPORT==== 14-Jun-2017::15:36:42 ===
closing AMQP connection <0.10562.0> (10.200.122.186:3680 -> 10.200.123.238:5671):
missed heartbeats from client, timeout: 30s

=ERROR REPORT==== 14-Jun-2017::15:36:42 ===
closing AMQP connection <0.10574.0> (10.200.122.186:3686 -> 10.200.123.238:5671):
missed heartbeats from client, timeout: 30s

=ERROR REPORT==== 14-Jun-2017::15:36:43 ===
closing AMQP connection <0.10586.0> (10.200.122.186:3692 -> 10.200.123.238:5671):
missed heartbeats from client, timeout: 30s

=ERROR REPORT==== 14-Jun-2017::15:36:43 ===
closing AMQP connection <0.10609.0> (10.200.122.186:3698 -> 10.200.123.238:5671):
missed heartbeats from client, timeout: 30s

=ERROR REPORT==== 14-Jun-2017::15:36:43 ===
closing AMQP connection <0.10621.0> (10.200.122.186:3704 -> 10.200.123.238:5671):
missed heartbeats from client, timeout: 30s

=ERROR REPORT==== 14-Jun-2017::15:36:43 ===
closing AMQP connection <0.10634.0> (10.200.122.186:3710 -> 10.200.123.238:5671):
missed heartbeats from client, timeout: 30s

=ERROR REPORT==== 14-Jun-2017::15:36:43 ===
closing AMQP connection <0.10646.0> (10.200.122.186:3716 -> 10.200.123.238:5671):
missed heartbeats from client, timeout: 30s

=ERROR REPORT==== 14-Jun-2017::15:36:43 ===
closing AMQP connection <0.10658.0> (10.200.122.186:3722 -> 10.200.123.238:5671):
missed heartbeats from client, timeout: 30s

=INFO REPORT==== 14-Jun-2017::16:00:39 ===
accepting AMQP connection <0.17850.43> (10.200.122.7:35270 -> 10.200.123.238:5671)

=INFO REPORT==== 14-Jun-2017::16:00:40 ===
accepting AMQP connection <0.17898.43> (10.200.123.76:40513 -> 10.200.123.238:5671)





Michael Klishin

Jun 15, 2017, 6:33:57 PM
to rabbitm...@googlegroups.com
If both client and server report missed heartbeats around the same time, I'm inclined to think that

 * Either the heartbeat setting is not actually what it should be
 * This is not an issue with ELB settings but a genuine network connection quality issue

Only inspecting [1] a tcpdump traffic capture can really tell, so take one and have an ops
person familiar with TCP take a look.



Sushil Chaudhary

Jun 15, 2017, 8:19:01 PM
to rabbitm...@googlegroups.com
MK,

Thanks for the response. While we are validating TCP connectivity, what else should we check to make sure the heartbeat setting is in place? Also, our traffic is as low as 2-3 messages an hour. That inclines me to think that if it were a heartbeat setting issue, it would show up consistently, almost all the time.

Also, earlier we were using plain TCP instead of TLS between the client and the RabbitMQ server, and we never saw this issue. I read that any packet or sequence loss with TLS will cause a connection failure, while TCP is more tolerant. Could that be the reason it is popping up now? Just a thought.


Regards,
Sushil



Michael Klishin

Jun 15, 2017, 8:27:51 PM
to rabbitm...@googlegroups.com
There is a column that lists effective heartbeat timeout on the connections page in the management UI.

TLS is a layer on top of TCP. TLS has no effect on client or server's heartbeat detection and *both*
ends of a connection detect a heartbeat timeout roughly at the same time. An "immediately closed" connection
would result in different messages.

I don't remember any heartbeat implementation bugs or changes in the Java client or server
in a long time. Some other libraries had known issues in certain release series (e.g. .NET in 3.4.x IIRC).
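
The effective heartbeat timeout shown in the management UI can probably also be read programmatically. The sketch below is only an illustration: it assumes the management plugin's HTTP API is reachable at http://localhost:15672 with guest:guest credentials (both placeholders), and that each entry returned by GET /api/connections carries a numeric "timeout" field holding the negotiated heartbeat timeout in seconds. The class name and the regex-based extraction are made up for this example, and it needs Java 11 or later for java.net.http.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Base64;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of any RabbitMQ library.
final class EffectiveHeartbeatCheck {

    // Matches "timeout": <number> in the JSON text.
    // A regex keeps the sketch dependency-free; a real tool should use a JSON parser.
    private static final Pattern TIMEOUT = Pattern.compile("\"timeout\"\\s*:\\s*(\\d+)");

    // Returns every numeric "timeout" value found in the response body, in order.
    static List<Integer> extractTimeouts(String json) {
        List<Integer> timeouts = new ArrayList<>();
        Matcher m = TIMEOUT.matcher(json);
        while (m.find()) {
            timeouts.add(Integer.parseInt(m.group(1)));
        }
        return timeouts;
    }

    public static void main(String[] args) throws Exception {
        // Assumed defaults; pass the base URL and user:password as arguments to override.
        String baseUrl = args.length > 0 ? args[0] : "http://localhost:15672";
        String credentials = args.length > 1 ? args[1] : "guest:guest";
        String auth = Base64.getEncoder()
                .encodeToString(credentials.getBytes(StandardCharsets.UTF_8));

        HttpRequest request = HttpRequest.newBuilder(URI.create(baseUrl + "/api/connections"))
                .header("Authorization", "Basic " + auth)
                .GET()
                .build();
        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString());

        System.out.println("Negotiated heartbeat timeouts (seconds): "
                + extractTimeouts(response.body()));
    }
}
```

If the values printed for the Java client's connections differ from the 30 seconds the client requests, the heartbeat setting is not what it is believed to be.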


Sushil Chaudhary

Jun 18, 2017, 11:42:04 PM
to rabbitm...@googlegroups.com
MK, does Wireshark work on a remote Linux machine in AWS? It seems to be a desktop tool.


Michael Klishin

Jun 18, 2017, 11:44:30 PM
to rabbitm...@googlegroups.com
Wireshark doesn't but tcpdump does. You can open tcpdump captures with Wireshark
since they use the same library (libpcap) under the hood.



Sushil Chaudhary

Feb 15, 2018, 12:43:02 PM
to rabbitmq-users
Michael,
We found the issue: it was the classic ELB we have been using in AWS in front of RabbitMQ. The ELB goes through maintenance and gets replaced by a new underlying instance every week or every 10 days. We are going to use an NLB to replace it.

Also, if a connection is idle, what are good values for the idle connection timeout on the ELB versus the heartbeat, to avoid the heartbeat error message? Is an idle timeout of 120 seconds with a RabbitMQ heartbeat of 30 seconds good enough?



Michael Klishin

Feb 15, 2018, 1:44:54 PM
to rabbitm...@googlegroups.com
Hi Sushil,

Thank you for reporting back.

120 seconds of inactivity on the load balancer with a 30 second RabbitMQ client
heartbeat timeout sounds like plenty of head room: with the 30s value you’ll see
both client and server exchanging heartbeats roughly every 15 seconds.

So those settings seem reasonable to me.
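
The head room arithmetic can be sketched in a few lines of Java. This is only an illustration under the assumption stated in this thread, that each peer sends a heartbeat frame about every half of the negotiated timeout; the class and method names are made up for the example and are not part of the RabbitMQ client API.

```java
import java.util.Locale;

// Hypothetical helper, not part of any RabbitMQ library.
final class HeartbeatHeadroom {

    // Approximate gap between heartbeat frames: half of the negotiated timeout.
    static double frameIntervalSeconds(int heartbeatTimeoutSeconds) {
        return heartbeatTimeoutSeconds / 2.0;
    }

    // How many heartbeat frames fit into one load balancer idle window.
    static double framesPerIdleWindow(int heartbeatTimeoutSeconds, int lbIdleTimeoutSeconds) {
        return lbIdleTimeoutSeconds / frameIntervalSeconds(heartbeatTimeoutSeconds);
    }

    // Heartbeats keep an idle connection open only if they are enabled (timeout > 0)
    // and frames arrive more often than the load balancer's idle timeout.
    static boolean keepsConnectionAlive(int heartbeatTimeoutSeconds, int lbIdleTimeoutSeconds) {
        return heartbeatTimeoutSeconds > 0
                && frameIntervalSeconds(heartbeatTimeoutSeconds) < lbIdleTimeoutSeconds;
    }

    public static void main(String[] args) {
        // {heartbeat timeout, load balancer idle timeout}, values taken from this thread.
        int[][] cases = { {30, 60}, {30, 120}, {20, 60}, {10, 60} };
        for (int[] c : cases) {
            System.out.printf(Locale.ROOT,
                    "heartbeat=%ds idle=%ds -> frame every %.1fs, %.1f frames per idle window, ok=%b%n",
                    c[0], c[1],
                    frameIntervalSeconds(c[0]),
                    framesPerIdleWindow(c[0], c[1]),
                    keepsConnectionAlive(c[0], c[1]));
        }
    }
}
```

With a 30 second heartbeat and a 120 second idle timeout, this works out to a frame about every 15 seconds, or roughly eight frames per idle window. The check says nothing about load balancer instances being replaced, which is a separate cause of dropped connections.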

MK

Stanley Lemberger

Feb 15, 2018, 4:30:24 PM
to rabbitmq-users
We had the same problem. I got rid of the ELB and have not had a problem since. 
The ELB can swap out the VMs under it when load gets too high. This will cause a timeout.
A long garbage collection will cause the same problem.