Heartbeat and connection_closed_abruptly logging message

Michael Denny

unread,

Aug 5, 2014, 7:36:43 AM8/5/14

to rabbitm...@googlegroups.com

Hi everyone,

Do you know if RabbitMQ writes a specific log message if the connection drops because of a heartbeat missing?

I just trying to find some clarification on the connection life-cycle because I'm experiences dropping connection without apparent motivation:

some of these

=ERROR REPORT==== 5-Aug-2014::12:22:17 ===

closing AMQP connection <0.16252.69> ([::1]:58277 -> [::1]:5672):

{inet_error,econnaborted}

and these:

=WARNING REPORT==== 2-Aug-2014::18:57:59 ===

closing AMQP connection <0.26038.56> ([::1]:53399 -> [::1]:5672):

connection_closed_abruptly

That means the connection is dropped without closing the connection properly, of course the application was up and running, so I thought a lack of resource, but I'm not sure of that, so now I'm thinking on heartbeat, even if as far as I known it should run on 2 separated thread, on for read and one for write, so actually highly difficult to stop those, if the heartbeat stops it is quite sure to say that server lacking of resources, otherwise a bug of the client library (but I don't think so).

Thanks

Michael

Michael Klishin

unread,

Aug 5, 2014, 7:44:04 AM8/5/14

to rabbitm...@googlegroups.com, Michael Denny

On 5 August 2014 at 15:36:55, Michael Denny (micd...@gmail.com) wrote:
> > Do you know if RabbitMQ writes a specific log message if the connection
> drops because of a heartbeat missing?

It should, search for heartbeat_timeout.

> I just trying to find some clarification on the connection life-cycle
> because I'm experiences dropping connection without apparent
> motivation:
>
> some of these
>

> =ERROR REPORT==== 5-Aug-2014(http://airmail.calendar/2014-08-05%2012:00:00%20GMT+4)::12:22:17

> ===
> closing AMQP connection <0.16252.69> ([::1]:58277 -> [::1]:5672):
> {inet_error,econnaborted}

This suggests ECONNABORTED, which may be due to a (TCP) timeout.

> and these:
>
> =WARNING REPORT==== 2-Aug-2014(http://airmail.calendar/2014-08-02%2012:00:00%20GMT+4)::18:57:59

> ===
> closing AMQP connection <0.26038.56> ([::1]:53399 -> [::1]:5672):
> connection_closed_abruptly

This means a TCP connection was closed w/o properly closing AMQP 0-9-1 connection.
I'd expect this to only happen when the peer (client) explicitly closes TCP
connection one way or another.

> now I'm thinking
> on heartbeat, even if as far as I known it should run on 2 separated
> thread, on for read and one for write, so actually highly difficult
> to stop those, if the heartbeat stops it is quite sure to say that
> server lacking of resources, otherwise a bug of the client library
> (but I don't think so).

Heartbeat threads may be starved of resources but that's fairly unlikely.
Do you collect resource load metrics for your apps? What do they suggest?

If RabbitMQ does not log heartbeat_timeout messages, it's likely something else,
e.g. a load balancer or router between clients and RabbitMQ misbehaving.
--
MK

Staff Software Engineer, Pivotal/RabbitMQ

Michael Denny

unread,

Aug 5, 2014, 7:54:00 AM8/5/14

to Michael Klishin, rabbitm...@googlegroups.com

Thank you very much for quickly reply!

I also see some heartbeat_timeout:

=ERROR REPORT==== 26-Jul-2014::10:30:14 ===

closing AMQP connection <0.1799.12> ([::1]:54624 -> [::1]:5672):

{heartbeat_timeout,running}

but they are really really infrequent, and happened many days ago, anyway could be symptom of high cpu-usage, actually I don't have specific metrics for that days.

The connections are all local, everything is installed on the same machine, in fact the yesterday and today metrics suggest that the server is quite lack of resources, actually not so easy to say, because it happens in a burst scenario, so only sometimes.

I'm in a very small testing environment, so with your thoughts and mine analysis I can say that it should be because of the environment, not RabbitMQ or application stuff.

--

Michael Denny

Software Architect, Trainer and Consultant

Microsoft MVP - Most Valuable Professional - Visual C#

http://mvp.microsoft.com/mvp/Michael%20Denny-5000735

about me: http://about.me/micdenny

Michael Klishin

unread,

Aug 5, 2014, 7:57:23 AM8/5/14

to Michael Denny, rabbitm...@googlegroups.com

On 5 August 2014 at 15:54:00, Michael Denny (micd...@gmail.com) wrote:
> > The connections are all local, everything is installed on the
> same machine, in fact the yesterday and today metrics suggest
> that the server is quite lack of resources, actually not so easy
> to say, because it happens in a burst scenario, so only sometimes.
>
> I'm in a very small testing environment, so with your thoughts
> and mine analysis I can say that it should be because of the environment,
> not RabbitMQ or application stuff.

Michael,

What's your configured heartbeat timeout value and heap size?
For thread starvation to be a cause it has to be really low (like 2-10 seconds).

GC pauses are more likely but it depends on what's the size of VM heap.

Michael Denny

unread,

Aug 5, 2014, 8:02:26 AM8/5/14

to Michael Klishin, rabbitm...@googlegroups.com

the timeout value is 10 seconds

rabbitmq is using 119 MB of 1.4GB high watermark.

--

You received this message because you are subscribed to a topic in the Google Groups "rabbitmq-users" group.
To unsubscribe from this topic, visit https://groups.google.com/d/topic/rabbitmq-users/jCs-YChdkAA/unsubscribe.
To unsubscribe from this group and all its topics, send an email to rabbitmq-user...@googlegroups.com.
To post to this group, send an email to rabbitm...@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Michael Klishin

unread,

Aug 5, 2014, 8:05:12 AM8/5/14

to Michael Denny, rabbitm...@googlegroups.com

On 5 August 2014 at 16:02:25, Michael Denny (micd...@gmail.com) wrote:
> > the timeout value is 10 seconds
>
> rabbitmq is using 119 MB of 1.4GB high watermark.

I was asking about your app's heap, since it's RabbitMQ that detects missed
heartbeats ;)

The Erlang VM is generally much better than other reasonably common runtimes
at ensuring the processes get a chance to run even under resource congestion, plus
GC pauses is virtually not an issue (because every process has its own heap).
So I think if there are GC pauses causing the problem, they are no on Rabbit's end.

Michael Denny

unread,

Aug 5, 2014, 8:23:02 AM8/5/14

to Michael Klishin, rabbitm...@googlegroups.com

I'm using EasyNetQ, and the rest of the applications uses 2 GB of the 3.5 GB of the server, erl process instead use around 150 MB, then there's 37% of free memory on the server at the moment, and I still seeing some "connection_closed_abruptly", no "heartbeat_timeout" at the moment.

The 10 seconds heartbeat timeout could become a problem? There are only 27 connections to that server, everyone's local. So there's 54 threads, Am I right? 27 read, 27 write on heartbeat stuff. The server has around 1140 threads.

--
You received this message because you are subscribed to a topic in the Google Groups "rabbitmq-users" group.
To unsubscribe from this topic, visit https://groups.google.com/d/topic/rabbitmq-users/jCs-YChdkAA/unsubscribe.
To unsubscribe from this group and all its topics, send an email to rabbitmq-user...@googlegroups.com.
To post to this group, send an email to rabbitm...@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Michael Klishin

unread,

Aug 5, 2014, 8:26:01 AM8/5/14

to Michael Denny, rabbitm...@googlegroups.com

On 5 August 2014 at 16:23:02, Michael Denny (micd...@gmail.com) wrote:
> > The 10 seconds heartbeat timeout could become a problem?

With local clients, shouldn't be. I wonder if abrupt connection closure issues
can be due to EasyNetQ task processing throwing exceptions for some reason? Not familiar
with EasyNetQ to tell but unhandled failures can lead to unexpectedly closed connections.

I wonder if there's a way to make EasyNetQ log more.

> There
> are only 27 connections to that server, everyone's local. So
> there's 54 threads, Am I right? 27 read, 27 write on heartbeat
> stuff. The server has around 1140 threads.

1140 Erlang processes? That's nothing.

Michael Denny

unread,

Aug 5, 2014, 8:30:01 AM8/5/14

to Michael Klishin, rabbitm...@googlegroups.com

no, that's the whole number of system threads.

the erlang details are:

file descriptors: 73 (8192 available)

socket descriptors: 29 (7280 available)

erlang processes: 4988 (1048576 available)

memory: 116 MB (1.4GB high watermark)

disk space: 97GB (48Mb low watermark)

Uptime: 13d 0h

Type: Disc, Stats

connections: 27

Channels: 895

Exchanges: 273

Queues: 1031

Consumers: 865

Michael Klishin

unread,

Aug 5, 2014, 8:32:19 AM8/5/14

to Michael Denny, rabbitm...@googlegroups.com

On 5 August 2014 at 16:30:01, Michael Denny (micd...@gmail.com) wrote:
> > no, that's the whole number of system threads.

That's not a high number (for moderate hardware) but not low either.
So it can be thread starvation then, worth investigating.

Michael Denny

unread,

Aug 5, 2014, 8:33:51 AM8/5/14

to Michael Klishin, rabbitm...@googlegroups.com

> I wonder if abrupt connection closure issues
> can be due to EasyNetQ task processing throwing exceptions for some reason? Not familiar
> with EasyNetQ to tell but unhandled failures can lead to unexpectedly closed connections.

I can deep look at the source code of easynetq, but where can be an unhandled exception close the connection? in the HandleBasicDeliver for instance? In those IBasicConsumer handler?

Michael Klishin

unread,

Aug 5, 2014, 8:36:47 AM8/5/14

to Michael Denny, rabbitm...@googlegroups.com

On 5 August 2014 at 16:33:57, Michael Denny (micd...@gmail.com) wrote:
> > I can deep look at the source code of easynetq, but where can be
> an unhandled exception close the connection? in the HandleBasicDeliver
> for instance? In those IBasicConsumer handler?

Nothing in the RabbitMQ .NET client should close the connection when
a consumer exception isn't uncaught but something in EasyNetQ could
do this (e.g. to reconnect and reset its state).

I'll ask EasyNetQ maintainer to chime in ;)

Michael Denny

unread,

Aug 5, 2014, 8:42:55 AM8/5/14

to Michael Klishin, rabbitm...@googlegroups.com

I'm already in touch with Mike :) I'm collaborating on the project as well, even if actually I don't know all about it, but I don't think we reconnect on unhandled exception, because every exception should be intercepted and managed. But I will further investigate with Mike if the problem persists also in a bigger environment, actually I'm the only one complaining about this, so I would expect to be much more a problem related to the resources, but of course would be great understand why I saw a TCP connection drop. Maybe windows sending some tcp reset?! What do you use to investigate if is the operating system killing connections?

I really appreciated your help :)

thanks again!

Michael Klishin

unread,

Aug 5, 2014, 8:45:40 AM8/5/14

to Michael Denny, rabbitm...@googlegroups.com

On 5 August 2014 at 16:42:54, Michael Denny (micd...@gmail.com) wrote:
> > What do you use to investigate if is the operating system killing
> connections?

Wireshark, which should work on Windows, too:
http://www.wireshark.org/docs/wsug_html_chunked/ChBuildInstallWinInstall.html

Michael Denny

unread,

Aug 5, 2014, 1:17:24 PM8/5/14

to Michael Klishin, rabbitm...@googlegroups.com

Now I've reinstalled everything, and updated from rabbitmq server 3.3.1 to 3.3.4 and erlang from 17 to 17.1

The "connection_closed_abruptly" seems to be stopped, but I think because a better connection handling from erlang or rabbitmq, or because of something that is not happening anymore in that system against tcp connection.

Anyway sometimes the server seems to stuck, start taking long time to process few message, we're talking about 10/20 messages per second, and for some reason cpu near to 0%, network near to 0%, and busy disk before the harakiri

sometimes the cpu and network doesn't die in that way, but I still Immagine in linea 2

Then here's some questions :)

A bottlenecked disk could slow down publish queuing a lot on disk, isn't it?

Could this also slow down the consumer? or if the consumer is ready to consume (idle), can receive the message before is committed to disk? I'm asking this because I use the EasyNetQ RPC pattern, that consume to a reply queue and timeouts if no message arrive in 10 sec, so I'm wondering if a bottlenecked disk can slow down the reply process. Reply queues are transient, auto-delete and has exclusive owner. We reply by publish on that queue passing a DeliveryMode=0 so i suppose is transient by default (I've already asked Mike if this should be better implemented by choosing a default DeliveryMode of 1 or 2, not 0 that is the byte default).

In general what can happen if there's a disk bottleneck?

I also see messages in unack state for long, really strange all this:

all those messages are RPC style.

In the meanwhile, there's any suggestion for disks? like allocation unit size? Should I set it to more that the default 4KB?

Many Thanks!

Michael Klishin

unread,

Aug 5, 2014, 1:26:35 PM8/5/14

to Michael Denny, rabbitm...@googlegroups.com

On 5 August 2014 at 21:17:32, Michael Denny (micd...@gmail.com) wrote:
> > A bottlenecked disk could slow down publish queuing a lot on
> disk, isn't it?

For persistent messages, yes.

> Could this also slow down the consumer? or if the consumer is ready
> to consume (idle), can receive the message before is committed
> to disk?

It can slow down message delivery to consumers if said messages have to
be read from disk when the disk is overloaded.

> I'm asking this because I use the EasyNetQ RPC pattern,
> that consume to a reply queue and timeouts if no message arrive
> in 10 sec, so I'm wondering if a bottlenecked disk can slow down
> the reply process. Reply queues are transient, auto-delete
> and has exclusive owner. We reply by publish on that queue passing
> a DeliveryMode=0 so i suppose is transient by default (I've already
> asked Mike if this should be better implemented by choosing a
> default DeliveryMode of 1 or 2, not 0 that is the byte default).

There is no delivery mode = 0, only 1 (transient) and 2 (persistent).

> In general what can happen if there's a disk bottleneck?

Your entire OS can start swapping, slowing everything down (not just RabbitMQ,
every single running process). This would happen if you run out of available RAM
on the machine.

> I also see messages in unack state for long, really strange all
> this:
>
>

> all those messages are RPC style.

Heavy swapping can indeed result in such delays.

> In the meanwhile, there's any suggestion for disks? like allocation
> unit size? Should I set it to more that the default 4KB?

That won't help much if you're swapping. If you are, add RAM. If not,
but disk is really the bottleneck, either reduce disk load (e.g. by
using transient messages) or move to an SSD.

Michael Denny

unread,

Aug 5, 2014, 1:44:29 PM8/5/14

to Michael Klishin, rabbitm...@googlegroups.com

> There is no delivery mode = 0, only 1 (transient) and 2 (persistent).

If you pass an invalid delivery mode like 0, what will be choose from rabbitmq? persistent or non-persistent?

> Your entire OS can start swapping,

I have 1.1 GB of free RAM, so I don't think high disk usage is because of swapping. I will investigate which process is, but in the meanwhile I would like to understand why rabbitmq blocks when disk is loaded, just to figure out which point is slowing down, for just managing 10/20 message per sec :(

I will also move rabbitmq in a dedicated disk to see if something change.

Michael Klishin

unread,

Aug 5, 2014, 8:37:16 PM8/5/14

to Michael Denny, rabbitm...@googlegroups.com

On 5 August 2014 at 21:44:41, Michael Denny (micd...@gmail.com) wrote:
> > If you pass an invalid delivery mode like 0, what will be choose
> from rabbitmq? persistent or non-persistent?

You will get an error:

connection <0.605.0>, channel 2 - soft error:
{amqp_error,precondition_failed,"invalid message: {delivery_mode_unknown,0}",
'basic.publish'}

> > Your entire OS can start swapping,
>
> I have 1.1 GB of free RAM, so I don't think high disk usage is because
> of swapping. I will investigate which process is, but in the meanwhile
> I would like to understand why rabbitmq blocks when disk is loaded,
> just to figure out which point is slowing down, for just managing

> 10/20(http://airmail.calendar/2014-10-20%2012:00:00%20GMT+4)
> message per sec :(

Are there any blocked connection messages in the log?

Michael Denny

unread,

Aug 6, 2014, 3:48:45 AM8/6/14

to rabbitm...@googlegroups.com, micd...@gmail.com

It is possible that RabbitMQ Client fix invalid delivery mode? Because in EasyNetQ we wrongly send 0, but no error happens

Michael Klishin

unread,

Aug 7, 2014, 8:06:26 AM8/7/14

to Michael Denny, rabbitm...@googlegroups.com

On 5 August 2014 at 21:44:29, Michael Denny (micd...@gmail.com) wrote:
> > If you pass an invalid delivery mode like 0, what will be choose
> from rabbitmq? persistent or non-persistent?

The primary way of setting delivery mode is via BasicProperties#SetPersistent,
which cannot set it to 0.

I'd need to add a test for setting it to 0 to give a correct answer .

Michael Denny

unread,

Aug 11, 2014, 4:42:06 PM8/11/14

to Michael Klishin, rabbitm...@googlegroups.com

I've fixed the missing set of DeliveryMode in EasyNetQ for RPC calls, so I don't need you further investigate over that concern.

Thank you very much for your time!

P.S.: the timeouts problem at the moment they seems to be an azure hosting problem, I can't confirm it 100%, because I didn't requested support yet, but running in our development and production environments, rabbitmq and the software in general. does not have that symptoms.

Reply all

Reply to author

Forward