Heartbeat and connection_closed_abruptly logging message

1,872 views
Skip to first unread message

Michael Denny

unread,
Aug 5, 2014, 7:36:43 AM8/5/14
to rabbitm...@googlegroups.com
Hi everyone, 

Do you know if RabbitMQ writes a specific log message if the connection drops because of a heartbeat missing?

I just trying to find some clarification on the connection life-cycle because I'm experiences dropping connection without apparent motivation:

some of these

=ERROR REPORT==== 5-Aug-2014::12:22:17 ===
closing AMQP connection <0.16252.69> ([::1]:58277 -> [::1]:5672):
{inet_error,econnaborted}

and these:

=WARNING REPORT==== 2-Aug-2014::18:57:59 ===
closing AMQP connection <0.26038.56> ([::1]:53399 -> [::1]:5672):
connection_closed_abruptly

That means the connection is dropped without closing the connection properly, of course the application was up and running, so I thought a lack of resource, but I'm not sure of that, so now I'm thinking on heartbeat, even if as far as I known it should run on 2 separated thread, on for read and one for write, so actually highly difficult to stop those, if the heartbeat stops it is quite sure to say that server lacking of resources, otherwise a bug of the client library (but I don't think so).


Thanks
Michael

Michael Klishin

unread,
Aug 5, 2014, 7:44:04 AM8/5/14
to rabbitm...@googlegroups.com, Michael Denny
 On 5 August 2014 at 15:36:55, Michael Denny (micd...@gmail.com) wrote:
> > Do you know if RabbitMQ writes a specific log message if the connection
> drops because of a heartbeat missing?

It should, search for heartbeat_timeout.

> I just trying to find some clarification on the connection life-cycle
> because I'm experiences dropping connection without apparent
> motivation:
>
> some of these
>
> =ERROR REPORT==== 5-Aug-2014(http://airmail.calendar/2014-08-05%2012:00:00%20GMT+4)::12:22:17
> ===
> closing AMQP connection <0.16252.69> ([::1]:58277 -> [::1]:5672):
> {inet_error,econnaborted}

This suggests ECONNABORTED, which may be due to a (TCP) timeout.

> and these:
>
> =WARNING REPORT==== 2-Aug-2014(http://airmail.calendar/2014-08-02%2012:00:00%20GMT+4)::18:57:59
> ===
> closing AMQP connection <0.26038.56> ([::1]:53399 -> [::1]:5672):
> connection_closed_abruptly

This means a TCP connection was closed w/o properly closing AMQP 0-9-1 connection.
I'd expect this to only happen when the peer (client) explicitly closes TCP
connection one way or another.

> now I'm thinking
> on heartbeat, even if as far as I known it should run on 2 separated
> thread, on for read and one for write, so actually highly difficult
> to stop those, if the heartbeat stops it is quite sure to say that
> server lacking of resources, otherwise a bug of the client library
> (but I don't think so).

Heartbeat threads may be starved of resources but that's fairly unlikely.
Do you collect resource load metrics for your apps? What do they suggest?

If RabbitMQ does not log heartbeat_timeout messages, it's likely something else,
e.g. a load balancer or router between clients and RabbitMQ misbehaving.
--
MK

Staff Software Engineer, Pivotal/RabbitMQ

Michael Denny

unread,
Aug 5, 2014, 7:54:00 AM8/5/14
to Michael Klishin, rabbitm...@googlegroups.com
Thank you very much for quickly reply!

I also see some heartbeat_timeout:

=ERROR REPORT==== 26-Jul-2014::10:30:14 ===
closing AMQP connection <0.1799.12> ([::1]:54624 -> [::1]:5672):
{heartbeat_timeout,running}

but they are really really infrequent, and happened many days ago, anyway could be symptom of high cpu-usage, actually I don't have specific metrics for that days.

The connections are all local, everything is installed on the same machine, in fact the yesterday and today metrics suggest that the server is quite lack of resources, actually not so easy to say, because it happens in a burst scenario, so only sometimes.

I'm in a very small testing environment, so with your thoughts and mine analysis I can say that it should be because of the environment, not RabbitMQ or application stuff.


--
Michael Denny
Software Architect, Trainer and Consultant
Microsoft MVP - Most Valuable Professional - Visual C#

Michael Klishin

unread,
Aug 5, 2014, 7:57:23 AM8/5/14
to Michael Denny, rabbitm...@googlegroups.com


On 5 August 2014 at 15:54:00, Michael Denny (micd...@gmail.com) wrote:
> > The connections are all local, everything is installed on the
> same machine, in fact the yesterday and today metrics suggest
> that the server is quite lack of resources, actually not so easy
> to say, because it happens in a burst scenario, so only sometimes.
>
> I'm in a very small testing environment, so with your thoughts
> and mine analysis I can say that it should be because of the environment,
> not RabbitMQ or application stuff.

Michael,

What's your configured heartbeat timeout value and heap size?
For thread starvation to be a cause it has to be really low (like 2-10 seconds).

GC pauses are more likely but it depends on what's the size of VM heap. 

Michael Denny

unread,
Aug 5, 2014, 8:02:26 AM8/5/14
to Michael Klishin, rabbitm...@googlegroups.com
the timeout value is 10 seconds

rabbitmq is using 119 MB of 1.4GB high watermark.





--
You received this message because you are subscribed to a topic in the Google Groups "rabbitmq-users" group.
To unsubscribe from this topic, visit https://groups.google.com/d/topic/rabbitmq-users/jCs-YChdkAA/unsubscribe.
To unsubscribe from this group and all its topics, send an email to rabbitmq-user...@googlegroups.com.
To post to this group, send an email to rabbitm...@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Michael Klishin

unread,
Aug 5, 2014, 8:05:12 AM8/5/14
to Michael Denny, rabbitm...@googlegroups.com


On 5 August 2014 at 16:02:25, Michael Denny (micd...@gmail.com) wrote:
> > the timeout value is 10 seconds
>
> rabbitmq is using 119 MB of 1.4GB high watermark.

I was asking about your app's heap, since it's RabbitMQ that detects missed
heartbeats ;)

The Erlang VM is generally much better than other reasonably common runtimes
at ensuring the processes get a chance to run even under resource congestion, plus
GC pauses is virtually not an issue (because every process has its own heap).
So I think if there are GC pauses causing the problem, they are no on Rabbit's end.

Michael Denny

unread,
Aug 5, 2014, 8:23:02 AM8/5/14
to Michael Klishin, rabbitm...@googlegroups.com
I'm using EasyNetQ, and the rest of the applications uses 2 GB of the 3.5 GB of the server, erl process instead use around 150 MB, then there's 37% of free memory on the server at the moment, and I still seeing some "connection_closed_abruptly", no "heartbeat_timeout" at the moment.

The 10 seconds heartbeat timeout could become a problem? There are only 27 connections to that server, everyone's local. So there's 54 threads, Am I right? 27 read, 27 write on heartbeat stuff. The server has around 1140 threads.




--
You received this message because you are subscribed to a topic in the Google Groups "rabbitmq-users" group.
To unsubscribe from this topic, visit https://groups.google.com/d/topic/rabbitmq-users/jCs-YChdkAA/unsubscribe.
To unsubscribe from this group and all its topics, send an email to rabbitmq-user...@googlegroups.com.
To post to this group, send an email to rabbitm...@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Michael Klishin

unread,
Aug 5, 2014, 8:26:01 AM8/5/14
to Michael Denny, rabbitm...@googlegroups.com
On 5 August 2014 at 16:23:02, Michael Denny (micd...@gmail.com) wrote:
> > The 10 seconds heartbeat timeout could become a problem? 

With local clients, shouldn't be. I wonder if abrupt connection closure issues
can be due to EasyNetQ task processing throwing exceptions for some reason? Not familiar
with EasyNetQ to tell but unhandled failures can lead to unexpectedly closed connections.

I wonder if there's a way to make EasyNetQ log more.

> There
> are only 27 connections to that server, everyone's local. So
> there's 54 threads, Am I right? 27 read, 27 write on heartbeat
> stuff. The server has around 1140 threads.

1140 Erlang  processes? That's nothing.

Michael Denny

unread,
Aug 5, 2014, 8:30:01 AM8/5/14
to Michael Klishin, rabbitm...@googlegroups.com
no, that's the whole number of system threads.

the erlang details are:

file descriptors: 73 (8192 available)
socket descriptors: 29 (7280 available)
erlang processes: 4988 (1048576 available)
memory: 116 MB (1.4GB high watermark)
disk space: 97GB (48Mb low watermark)
Uptime: 13d 0h
Type: Disc, Stats

connections: 27
Channels: 895
Exchanges: 273
Queues: 1031
Consumers: 865




Michael Klishin

unread,
Aug 5, 2014, 8:32:19 AM8/5/14
to Michael Denny, rabbitm...@googlegroups.com
On 5 August 2014 at 16:30:01, Michael Denny (micd...@gmail.com) wrote:
> > no, that's the whole number of system threads.

That's not a high number (for moderate hardware) but not low either.
So it can be thread starvation then, worth investigating. 

Michael Denny

unread,
Aug 5, 2014, 8:33:51 AM8/5/14
to Michael Klishin, rabbitm...@googlegroups.com
> I wonder if abrupt connection closure issues
> can be due to EasyNetQ task processing throwing exceptions for some reason? Not familiar
> with EasyNetQ to tell but unhandled failures can lead to unexpectedly closed connections.

I can deep look at the source code of easynetq, but where can be an unhandled exception close the connection? in the HandleBasicDeliver for instance? In those IBasicConsumer handler?


Michael Klishin

unread,
Aug 5, 2014, 8:36:47 AM8/5/14
to Michael Denny, rabbitm...@googlegroups.com


On 5 August 2014 at 16:33:57, Michael Denny (micd...@gmail.com) wrote:
> > I can deep look at the source code of easynetq, but where can be
> an unhandled exception close the connection? in the HandleBasicDeliver
> for instance? In those IBasicConsumer handler?

Nothing in the RabbitMQ .NET client should close the connection when
a consumer exception isn't uncaught but something in EasyNetQ could
do this (e.g. to reconnect and reset its state).

I'll ask EasyNetQ maintainer to chime in ;)

Michael Denny

unread,
Aug 5, 2014, 8:42:55 AM8/5/14
to Michael Klishin, rabbitm...@googlegroups.com
I'm already in touch with Mike :) I'm collaborating on the project as well, even if actually I don't know all about it, but I don't think we reconnect on unhandled exception, because every exception should be intercepted and managed. But I will further investigate with Mike if the problem persists also in a bigger environment, actually I'm the only one complaining about this, so I would expect to be much more a problem related to the resources, but of course would be great understand why I saw a TCP connection drop. Maybe windows sending some tcp reset?! What do you use to investigate if is the operating system killing connections?

I really appreciated your help :)

thanks again!

Michael Klishin

unread,
Aug 5, 2014, 8:45:40 AM8/5/14
to Michael Denny, rabbitm...@googlegroups.com
On 5 August 2014 at 16:42:54, Michael Denny (micd...@gmail.com) wrote:
> > What do you use to investigate if is the operating system killing
> connections?

Wireshark, which should work on Windows, too:
http://www.wireshark.org/docs/wsug_html_chunked/ChBuildInstallWinInstall.html

Michael Denny

unread,
Aug 5, 2014, 1:17:24 PM8/5/14
to Michael Klishin, rabbitm...@googlegroups.com
Now I've reinstalled everything, and updated from rabbitmq server 3.3.1 to 3.3.4 and erlang from 17 to 17.1

The "connection_closed_abruptly" seems to be stopped, but I think because a better connection handling from erlang or rabbitmq, or because of something that is not happening anymore in that system against tcp connection.

Anyway sometimes the server seems to stuck, start taking long time to process few message, we're talking about 10/20 messages per second, and for some reason cpu near to 0%, network near to 0%, and busy disk before the harakiri

Immagine in linea 1 sometimes the cpu and network doesn't die in that way, but I still  Immagine in linea 2

Then here's some questions :)

A bottlenecked disk could slow down publish queuing a lot on disk, isn't it?

Could this also slow down the consumer? or if the consumer is ready to consume (idle), can receive the message before is committed to disk? I'm asking this because I use the EasyNetQ RPC pattern, that consume to a reply queue and timeouts if no message arrive in 10 sec, so I'm wondering if a bottlenecked disk can slow down the reply process. Reply queues are transient, auto-delete and has exclusive owner. We reply by publish on that queue passing a DeliveryMode=0 so i suppose is transient by default (I've already asked Mike if this should be better implemented by choosing a default DeliveryMode of 1 or 2, not 0 that is the byte default).

In general what can happen if there's a disk bottleneck?

I also see messages in unack state for long, really strange all this:

Immagine in linea 1
all those messages are RPC style.

In the meanwhile, there's any suggestion for disks? like allocation unit size? Should I set it to more that the default 4KB?


Many Thanks!

Michael Klishin

unread,
Aug 5, 2014, 1:26:35 PM8/5/14
to Michael Denny, rabbitm...@googlegroups.com
On 5 August 2014 at 21:17:32, Michael Denny (micd...@gmail.com) wrote:
> > A bottlenecked disk could slow down publish queuing a lot on
> disk, isn't it?

For persistent messages, yes.

> Could this also slow down the consumer? or if the consumer is ready
> to consume (idle), can receive the message before is committed
> to disk? 

It can slow down message delivery to consumers if said messages have to
be read from disk when the disk is overloaded.

> I'm asking this because I use the EasyNetQ RPC pattern,
> that consume to a reply queue and timeouts if no message arrive
> in 10 sec, so I'm wondering if a bottlenecked disk can slow down
> the reply process. Reply queues are transient, auto-delete
> and has exclusive owner. We reply by publish on that queue passing
> a DeliveryMode=0 so i suppose is transient by default (I've already
> asked Mike if this should be better implemented by choosing a
> default DeliveryMode of 1 or 2, not 0 that is the byte default).

There is no delivery mode = 0, only 1 (transient) and 2 (persistent).

> In general what can happen if there's a disk bottleneck?

Your entire OS can start swapping, slowing everything down (not just RabbitMQ,
every single running process). This would happen if you run out of available RAM
on the machine.

> I also see messages in unack state for long, really strange all
> this:
>
>
> all those messages are RPC style.

Heavy swapping can indeed result in such delays.

> In the meanwhile, there's any suggestion for disks? like allocation
> unit size? Should I set it to more that the default 4KB?

That won't help much if you're swapping. If you are, add RAM. If not,
but disk is really the bottleneck, either reduce disk load (e.g. by
using transient messages) or move to an SSD.

Michael Denny

unread,
Aug 5, 2014, 1:44:29 PM8/5/14
to Michael Klishin, rabbitm...@googlegroups.com
There is no delivery mode = 0, only 1 (transient) and 2 (persistent).

If you pass an invalid delivery mode like 0, what will be choose from rabbitmq? persistent or non-persistent?

Your entire OS can start swapping,

I have 1.1 GB of free RAM, so I don't think high disk usage is because of swapping. I will investigate which process is, but in the meanwhile I would like to understand why rabbitmq blocks when disk is loaded, just to figure out which point is slowing down, for just managing 10/20 message per sec :(

I will also move rabbitmq in a dedicated disk to see if something change.




Michael Klishin

unread,
Aug 5, 2014, 8:37:16 PM8/5/14
to Michael Denny, rabbitm...@googlegroups.com
 On 5 August 2014 at 21:44:41, Michael Denny (micd...@gmail.com) wrote:
> > If you pass an invalid delivery mode like 0, what will be choose
> from rabbitmq? persistent or non-persistent?

You will get an error:

connection <0.605.0>, channel 2 - soft error:
{amqp_error,precondition_failed,"invalid message: {delivery_mode_unknown,0}",
            'basic.publish'}


> > Your entire OS can start swapping,
>
> I have 1.1 GB of free RAM, so I don't think high disk usage is because
> of swapping. I will investigate which process is, but in the meanwhile
> I would like to understand why rabbitmq blocks when disk is loaded,
> just to figure out which point is slowing down, for just managing
> 10/20(http://airmail.calendar/2014-10-20%2012:00:00%20GMT+4)
> message per sec :(

Are there any blocked connection messages in the log?

Michael Denny

unread,
Aug 6, 2014, 3:48:45 AM8/6/14
to rabbitm...@googlegroups.com, micd...@gmail.com
It is possible that RabbitMQ Client fix invalid delivery mode? Because in EasyNetQ we wrongly send 0, but no error happens

Immagine in linea 2

Michael Klishin

unread,
Aug 7, 2014, 8:06:26 AM8/7/14
to Michael Denny, rabbitm...@googlegroups.com
On 5 August 2014 at 21:44:29, Michael Denny (micd...@gmail.com) wrote:
> > If you pass an invalid delivery mode like 0, what will be choose
> from rabbitmq? persistent or non-persistent?

The primary way of setting delivery mode is via BasicProperties#SetPersistent,
which cannot set it to 0.

I'd need to add a test for setting it to 0 to give a correct answer .

Michael Denny

unread,
Aug 11, 2014, 4:42:06 PM8/11/14
to Michael Klishin, rabbitm...@googlegroups.com
I've fixed the missing set of DeliveryMode in EasyNetQ for RPC calls, so I don't need you further investigate over that concern.

Thank you very much for your time!

P.S.: the timeouts problem at the moment they seems to be an azure hosting problem, I can't confirm it 100%, because I didn't requested support yet, but running in our development and production environments, rabbitmq and the software in general. does not have that symptoms.
Reply all
Reply to author
Forward
0 new messages