rabbitmq 'heartbeats missed' errors with rabbitmq-server-3.7


po...@platform9.com

Nov 7, 2018, 9:07:48 PM
to rabbitmq-users
Hi Michael,

I need some help with heartbeat-related errors in our rabbitmq server.
It looks like the old rabbitmq server (version 3.3) had heartbeat=600. We recently upgraded to v3.7, which has heartbeat=60.
Now we are seeing very frequent connection drops from the server because it doesn't receive timely heartbeats from the clients.
Is that merely due to the sudden decrease in the default from 600s to 60s, or are there known issues at scale with rabbitmq-3.7?

I should also mention that rabbitmq-3.7.7 seems to behave slightly better than rabbitmq-3.7.8 in this respect. We also see {handshake_timeout,frame_header} errors with rabbitmq-3.7.8.

Even after setting the below in /etc/rabbitmq/rabbitmq.conf, we are seeing frequent errors for 'missed heartbeats from client'.
# heartbeat = 150

# rabbitmqctl status

Status of node rabbit@localhost ...
[{pid,20537},
 {running_applications,
     [{rabbitmq_management,"RabbitMQ Management Console","3.7.7"},
      {rabbitmq_web_dispatch,"RabbitMQ Web Dispatcher","3.7.7"},
      {cowboy,"Small, fast, modern HTTP server.","2.2.2"},
      {rabbitmq_management_agent,"RabbitMQ Management Agent","3.7.7"},
      {rabbit,"RabbitMQ","3.7.7"},
      {amqp_client,"RabbitMQ AMQP Client","3.7.7"},
      {rabbit_common,
          "Modules shared by rabbitmq-server and rabbitmq-erlang-client",
          "3.7.7"},
      {ranch_proxy_protocol,"Ranch Proxy Protocol Transport","1.5.0"},
      {ranch,"Socket acceptor pool for TCP protocols.","1.5.0"},
      {ssl,"Erlang/OTP SSL application","8.2.4"},
      {public_key,"Public key infrastructure","1.5.2"},
      {asn1,"The Erlang ASN1 compiler version 5.0.5","5.0.5"},
      {cowlib,"Support library for manipulating Web protocols.","2.1.0"},
      {inets,"INETS  CXC 138 49","6.5"},
      {xmerl,"XML parser","1.3.16"},
      {os_mon,"CPO  CXC 138 46","2.4.4"},
      {jsx,"a streaming, evented json parsing toolkit","2.8.2"},
      {recon,"Diagnostic tools for production use","2.3.2"},
      {crypto,"CRYPTO","4.2.1"},
      {mnesia,"MNESIA  CXC 138 12","4.15.3"},
      {lager,"Erlang logging framework","3.6.3"},
      {goldrush,"Erlang event stream processor","0.1.9"},
      {compiler,"ERTS  CXC 138 10","7.1.5"},
      {syntax_tools,"Syntax tools","2.1.4"},
      {syslog,"An RFC 3164 and RFC 5424 compliant logging framework.","3.4.2"},
      {sasl,"SASL  CXC 138 11","3.1.1"},
      {stdlib,"ERTS  CXC 138 10","3.4.4"},
      {kernel,"ERTS  CXC 138 10","5.4.3"}]},
 {os,{unix,linux}},
 {erlang_version,
     "Erlang/OTP 20 [erts-9.3] [source] [64-bit] [smp:8:8] [ds:8:8:10] [async-threads:128] [hipe] [kernel-poll:true]\n"},........


rabbitmq error logs for reference:
--
2018-11-08 01:18:18.412 [warning] <0.9290.9> closing AMQP connection <0.9290.9> (127.0.0.1:42328 -> 127.0.0.1:5672):
missed heartbeats from client, timeout: 150s
--
2018-11-08 01:25:27.127 [warning] <0.25332.10> closing AMQP connection <0.25332.10> (127.0.0.1:37254 -> 127.0.0.1:5672):
missed heartbeats from client, timeout: 10s
--
2018-11-08 01:30:48.821 [warning] <0.1874.11> closing AMQP connection <0.1874.11> (127.0.0.1:53752 -> 127.0.0.1:5672):
missed heartbeats from client, timeout: 150s
--
2018-11-08 01:48:12.655 [warning] <0.19929.0> closing AMQP connection <0.19929.0> (127.0.0.1:34596 -> 127.0.0.1:5672):
missed heartbeats from client, timeout: 10s
--
2018-11-08 01:53:06.259 [warning] <0.30429.11> closing AMQP connection <0.30429.11> (127.0.0.1:39848 -> 127.0.0.1:5672):
missed heartbeats from client, timeout: 150s


We are using rabbitmq in our OpenStack deployment and seeing these errors very frequently since upgrading from rabbitmq-3.3 to rabbitmq-3.7.

Also, some services seem to be using the new 150s heartbeat (default was 60s), but there are still some errors with 'timeout: 10s'. How does the heartbeat setting work for both client and server?

Appreciate any pointers you may provide. Please let me know if you need packet capture output or any other info for this.

Michael Klishin

Nov 7, 2018, 9:17:33 PM
to rabbitm...@googlegroups.com
This is a mailing list, not my personal address (and this is not an invitation to use my personal address), so please address it as such.

I cannot possibly know. The logs clearly state that the heartbeat interval was negotiated to 150s for some connections
and 10s for others.

We recently identified an interesting implementation issue [1] that prevented RabbitMQ from sending heartbeats
but that should be irrelevant for this case since it's the server that detects missed client heartbeats.

You need to take a close look at what clients you use, how they are configured, whether they can be overloaded and so on.




--
MK

Staff Software Engineer, Pivotal/RabbitMQ

po...@platform9.com

Nov 7, 2018, 9:39:09 PM
to rabbitmq-users
Sorry about that! I did look at the pull request [1] you linked.
Thanks for sharing that! I will try upgrading to version 3.7.9 just in case and see if that helps.

On the client side, we haven't changed anything and the load is pretty much constant. I'll have to see if anything on the oslo.messaging side is causing an issue in that case.

Michael Klishin

Nov 7, 2018, 10:55:48 PM
to rabbitm...@googlegroups.com
Note that oslo.messaging used to use a client, Kombu, that did not implement heartbeats at all (I think that was addressed in 2016, though).
If you don't use oslo.messaging as part of OpenStack we highly recommend switching to Pika or RabbitPy.

po...@platform9.com

Nov 8, 2018, 4:53:50 PM
to rabbitmq-users
We are currently using oslo.messaging-5.10.1 in OpenStack and this was working fine with our older rabbitmq version 3.3.

I am trying to understand how the heartbeat timeout negotiation works if I have the below custom config for rabbitmq and an OpenStack service.

Rabbitmq (server) - /etc/rabbitmq/rabbitmq.conf
heartbeat = 120

OpenStack Nova API (client) - /etc/nova/nova.conf
oslo_messaging_rabbit.heartbeat_rate = 2
oslo_messaging_rabbit.heartbeat_timeout_threshold = 60


Does the lower of the two values (120 and 60 in this case) get selected as the final heartbeat timeout within which heartbeats should be received by both client and server?

Thanks,
Pooja

Luke Bakken

Nov 8, 2018, 5:09:12 PM
to rabbitmq-users
Hi Pooja,

The easiest way to confirm this is to use Wireshark or another packet capture program and watch port 5672.

When your oslo.messaging app starts up, it will go through the AMQP handshake phase. One step involves negotiating values via the Connection.Tune method: RabbitMQ proposes its values in Connection.Tune, the client answers with Connection.Tune-Ok, and you will see the negotiated heartbeat timeout value in that response. I've attached a screenshot of what this looks like in Wireshark.

So, in my case, the negotiated timeout is 60 seconds. RabbitMQ will then send a heartbeat frame every 30 seconds on an idle connection, and will expect to receive a heartbeat or data from the client on that connection every 60 seconds. Note that when RabbitMQ sends data to a client, that data counts as a "heartbeat", so you may not necessarily see a lot of heartbeat frames from RabbitMQ. Clients may behave this way as well, but it's pretty standard for them to always send heartbeats even if data has been sent on a connection.
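
If a capture is hard to get, the negotiated value can usually also be read from the client itself. A minimal sketch, assuming kombu with the pyamqp transport (the URL and credentials are placeholders, and conn.connection reaches into the underlying py-amqp connection):

from kombu import Connection

# ask for 60s; the broker proposes its own value during Connection.Tune and
# the client ends up with the negotiated result
with Connection('amqp://guest:guest@127.0.0.1:5672//', heartbeat=60) as conn:
    conn.connect()
    print(conn.connection.heartbeat)  # negotiated heartbeat timeout in seconds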

What version of Kombu ships with or is required by that version of oslo.messaging? Maybe there is an issue with heartbeats.

Thanks,
Luke
connection-tune-ok.png

po...@platform9.com

Nov 8, 2018, 6:56:55 PM
to rabbitmq-users
Thanks Luke for sharing the wireshark output!

I tried running the capture multiple times in my environment, but to no avail. I do see the AMQP heartbeat messages when I restart my OpenStack service, but nothing like the Connection.Tune response you shared. Based on the logs, though, I can deduce that if the rabbitmq server has a 90s timeout and the client service has 60s, the effective heartbeat timeout is 60s.

2018-11-08 21:51:59.235 [warning] <0.9247.0> closing AMQP connection <0.9247.0> (127.0.0.1:60488 -> 127.0.0.1:5672 - nova-conductor:31133:4c9be1dd-b865-4610-bc9a-2908503a58de):
missed heartbeats from client, timeout: 60s


Currently we are using kombu-3.0.35 with oslo.messaging-5.10.1. Are there any known issues with this version?

I am also going to try switching from kombu to pika driver in oslo.messaging and see if that helps.

Thanks,
Pooja

Michael Klishin

Nov 8, 2018, 7:31:28 PM
to rabbitm...@googlegroups.com
Kombu has heartbeats disabled by default. 

We found that the Pika driver was recently removed from oslo.messaging (basically they have ignored our recommendations).
Kombu is a *terrible* client that I honestly cannot recommend under any circumstances. If you have the option of using Pika, give it a shot.

Kombu has a heartbeat implementation these days via pyamqp, but it is disabled by default. It's 2018 and a RabbitMQ client has trouble supporting heartbeats. Wow.
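
For what it's worth, with kombu enabling heartbeats is not just a connection flag; the application also has to drive them. A rough sketch, assuming the pyamqp transport (connection details are placeholders; oslo.messaging does something similar from a background thread):

import time

from kombu import Connection

conn = Connection('amqp://guest:guest@127.0.0.1:5672//', heartbeat=60)
conn.connect()
try:
    while True:
        # must be called well within the negotiated timeout, otherwise the
        # broker logs "missed heartbeats from client" and drops the connection
        conn.heartbeat_check(rate=2)
        time.sleep(15)
finally:
    conn.release()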


po...@platform9.com

Nov 8, 2018, 8:33:16 PM
to rabbitmq-users
Thanks! I just learnt that as well - unfortunately, with the Rocky release of OpenStack the pika driver has been dropped from oslo.messaging :(

I added some more logs to the kombu driver and do see that heartbeats are enabled in my case.

# rabbitmqctl eval 'application:get_env(rabbit, heartbeat).'
{ok,90}

Client logs:

2018-11-09 01:29:52.314 29510 DEBUG oslo_messaging._drivers.amqpdriver [-] CALL msg_id: 2eac12b885cf4119bec41911e6bcb57d exchange 'nova' topic 'conductor' _send /opt/pf9/nova/lib/python2.7/site-packages/oslo_messaging/_drivers/amqpdriver.py:448
2018-11-09 01:29:52.350 29510 DEBUG oslo_messaging._drivers.amqpdriver [-] received reply msg_id: 2eac12b885cf4119bec41911e6bcb57d __call__ /opt/pf9/nova/lib/python2.7/site-packages/oslo_messaging/_drivers/amqpdriver.py:296
2018-11-09 01:29:52.350 29510 INFO kombu.transport.pyamqp [-] heartbeat_check: connection.heartbeat 60.0
2018-11-09 01:29:56.979 29510 INFO kombu.transport.pyamqp [-] heartbeat_check: connection.heartbeat 60.0
2018-11-09 01:29:57.300 29510 INFO kombu.transport.pyamqp [-] heartbeat_check: connection.heartbeat 60.0

The negotiated heartbeat is 60s, which seems to be the default oslo.messaging heartbeat timeout. So it looks like there is some intermittent bug where kombu does not send heartbeats to the rabbitmq server.

Thanks,
Pooja

Michael Klishin

Nov 8, 2018, 11:51:02 PM
to rabbitm...@googlegroups.com
A quick glance at the Kombu code suggests that it treats the negotiated value as an *interval*, whereas the server treats it as a *timeout* (two missed intervals). This is not unheard of, and several clients have had to address this in the past.

Previously you had non-heartbeat traffic within the 10-minute window; now the window is only 1 minute. That's not enough when both client and server (up to 3.7.9) send heartbeats too infrequently.
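
To put numbers on it (illustrative only, using the 60s value seen in this thread):

negotiated = 60                        # value agreed during Connection.Tune/Tune-Ok
server_send_interval = negotiated / 2  # broker sends a heartbeat roughly every 30s when idle
server_gives_up_after = negotiated     # ~60s of silence (two missed intervals) and it closes
client_safe_interval = negotiated / 2  # a client should send at least this often
# a client that treats 60s as its *send interval* is always one delayed frame
# away from "missed heartbeats from client"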

Michael Klishin

Nov 8, 2018, 11:51:58 PM
to rabbitm...@googlegroups.com
So I suggest that you file an issue for Kombu, and perhaps mention that Pika didn't have this issue and that using Kombu in oslo.messaging goes against team RabbitMQ's recommendations, since it's a Python client we recommend *against*.

po...@platform9.com

Nov 9, 2018, 6:30:57 PM
to rabbitmq-users
Sounds good. Thanks Michael and Luke!

I will go ahead and increase the rabbitmq server's heartbeat timeout to 5 minutes for now and see if that helps.

Also trying to change to the pika driver in parallel.

I am also seeing {handshake_timeout,frame_header} errors in the rabbitmq log and would like to know whether that is also related to the kombu driver.
Does that indicate a server-side or a client-side error?

2018-11-09 23:29:55.385 [warning] <0.2670.8> closing AMQP connection <0.2670.8> (127.0.0.1:34066 -> 127.0.0.1:5672):
{handshake_timeout,frame_header}
2018-11-09 23:29:57.148 [info] <0.2697.8> accepting AMQP connection <0.2697.8> (127.0.0.1:34220 -> 127.0.0.1:5672)
2018-11-09 23:29:57.254 [warning] <0.2681.8> closing AMQP connection <0.2681.8> (127.0.0.1:34074 -> 127.0.0.1:5672):
{handshake_timeout,frame_header}
2018-11-09 23:29:57.912 [warning] <0.2684.8> closing AMQP connection <0.2684.8> (127.0.0.1:34076 -> 127.0.0.1:5672):
{handshake_timeout,frame_header}
2018-11-09 23:29:58.536 [info] <0.2702.8> accepting AMQP connection <0.2702.8> (127.0.0.1:34222 -> 127.0.0.1:5672)
2018-11-09 23:29:58.559 [info] <0.2705.8> accepting AMQP connection <0.2705.8> (127.0.0.1:34224 -> 127.0.0.1:5672)
2018-11-09 23:29:58.714 [warning] <0.2688.8> closing AMQP connection <0.2688.8> (127.0.0.1:34082 -> 127.0.0.1:5672):
{handshake_timeout,frame_header}


Thanks,
Pooja

Luke Bakken

Nov 11, 2018, 10:20:10 AM
to rabbitmq-users
Hi,

That is a strange error. Something is keeping the client from completing the AMQP connection handshake. It may be a bug in kombu, but it could also be the code using kombu that is blocking how it sends data.

Luke

Gabriele Santomaggio

Dec 13, 2018, 4:18:44 PM
to rabbitmq-users


To reproduce this problem:


=ERROR REPORT==== 13-Dec-2018::17:21:44 ===
closing AMQP connection
<0.769.0> (10.0.2.2:41646 -> 10.0.2.15:5671):
{handshake_timeout,frame_header}

=ERROR REPORT==== 13-Dec-2018::17:21:44 ===
closing AMQP connection
<0.765.0> (10.0.2.2:41642 -> 10.0.2.15:5671):
{handshake_timeout,frame_header}

=ERROR REPORT==== 13-Dec-2018::17:21:44 ===
closing AMQP connection
<0.761.0> (10.0.2.2:41640 -> 10.0.2.15:5671):
{handshake_timeout,frame_header}

=ERROR REPORT==== 13-Dec-2018::17:21:44 ===
closing AMQP connection
<0.777.0> (10.0.2.2:41650 -> 10.0.2.15:5671):
{handshake_timeout,frame_header}



It is enough to stress the connections:


# imports assumed: kombu for the client (the thread mentions the kombu client)
from threading import Thread

from kombu import Connection, Producer


def stress():
    # open a TLS connection and a channel, build a producer, then let the
    # context managers tear everything down immediately
    with Connection(ssl=True, userid="test", password="test") as conn:
        with conn.channel() as channel:
            producer = Producer(channel)


if __name__ == "__main__":
    for x in xrange(1, 1000):
        thread_rmq = Thread(target=stress)
        thread_rmq.start()

Same error using the Java client and the kombu client.

By the way, I drastically reduced the errors by tuning the TCP listener parameters:
{tcp_listen_options, [
{backlog, 4096},
{nodelay, true},
{linger, {true,0}},
{exit_on_close, false},
{sndbuf, 32768},
{recbuf, 32768}
]},


I will also try to tune the OS parameters (https://www.rabbitmq.com/networking.html#os-tuning); I should see more improvements.

Erlang 20.3
RabbitMQ 3.6.16

Michael Klishin

Dec 13, 2018, 5:21:33 PM
to rabbitm...@googlegroups.com
There are only so many connections that the TCP stack is willing to keep in the acceptance queue [1] (no relation to RabbitMQ queues)
before it starts rejecting them. This is what the backlog socket/listener parameter controls.

If the above condition is a factor, increasing it to 4096 is indeed expected to help since the default is lower
(1024 according to Ranch docs).
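
For reference, the equivalent in the new-style /etc/rabbitmq/rabbitmq.conf used earlier in this thread should look roughly like this (key name taken from the networking guide; worth double-checking against your version):

tcp_listen_options.backlog = 4096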



Gabriele Santomaggio

Jan 25, 2019, 7:47:10 AM
to rabbitmq-users

po...@platform9.com

Jan 31, 2019, 2:36:52 PM
to rabbitmq-users
Thanks Gabriele and Michael!

I read the thread you suggested below, but unfortunately in our case we do not have SSL enabled in the rabbitmq config, so the issue seems to be different?

I was also just browsing through the recent changelogs for rabbitmq-common and came across the below change:

Is this something that would help fix the 'missed heartbeats from client' errors that we see in rabbitmq-server logs occasionally?

Thanks,
Pooja

Michael Klishin

Feb 4, 2019, 5:06:05 PM
to rabbitm...@googlegroups.com
While [1] can help, quite often the issue is on the client end (and the much shorter default timeout value has exposed such issues).
You are definitely encouraged to run 3.7.11 over 3.3.x.

Just like in the thread linked to in [1], a traffic capture is the ultimate source of truth (perhaps in combination with server and client logs).
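
For example, one way to grab such a capture for later inspection in Wireshark (interface and port here are assumptions about the setup):

# tcpdump -i any -s 0 -w amqp.pcap 'port 5672'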
