server does not drop connections missing heartbeats

Dmitry Andrianov

Nov 16, 2017, 11:08:21 AM
to rabbitmq-users
Hello.
RabbitMQ 3.6.12 (Erlang 19). I have a few connections that the Management UI reports as

Heartbeat 30s

but I know for sure they are dead. I ran tcpdump filtered on the source port and saw no traffic for a long time.
And yet the server does not close these connections for some reason.
I am not sure whether this behaviour is new or not: previously there was an Amazon classic ELB in front of RabbitMQ that would close the connection itself after 60 seconds.

Thanks

Luke Bakken

Nov 17, 2017, 6:57:40 AM
to rabbitmq-users
Hi Dmitry,

It would be very helpful to tell us which client library you are using (and the version) as each client library implements heartbeats its own way.

Does the client application report any errors? Can you share your tcpdump output showing a heartbeat and then the period after which the heartbeats apparently stopped?

Thanks,
Luke

Dmitry Andrianov

Nov 17, 2017, 10:20:09 PM
to rabbitmq-users
Hello.
"capabilities", as reported by "sudo rabbitmqctl list_connections":
[{"version","3.6.5"},{"platform","Java"},{"information","Licensed under the MPL. See http://www.rabbitmq.com/"},{"capabilities",[{"consumer_cancel_notify",true},{"publisher_confirms",true},{"basic.nack",true},{"authentication_failure_close",true},{"connection.blocked",true},{"exchange_exchange_bindings",true}]},{"copyright","Copyright (c) 2007-2016 Pivotal Software, Inc."},{"product","RabbitMQ"}]

I am not entirely sure why you are asking about the client, because in this particular case the timeout should be detected by the server...

So, fact #1: I cannot reproduce the issue with a simple test. If I start my Java client and then stop it with kill -STOP or Ctrl+Z, or just use ipfilter to block traffic to the server, then after a short time the server detects that, logs "missed heartbeats from client, timeout: 30s" and closes the connection.

Fact #2: even though I cannot reproduce it with my client at will, I still see it happening on the server with real clients from time to time.

$ sudo rabbitmqctl list_connections pid peer_host peer_port state channels protocol auth_mechanism timeout frame_max channel_max recv_oct recv_cnt send_oct send_cnt send_pend connected_at | grep xx.xx.xx.xx
Sat Nov 18 03:03:45 UTC 2017
<rab...@localhost.1.1996.22> xx.xx.xx.xx 48555 running 1 {0,9,1} EXTERNAL 30 131072 0 0 0 0 0 0 1510729737817
<rab...@localhost.1.6751.57> xx.xx.xx.xx 50306 running 1 {0,9,1} EXTERNAL 30 131072 0 0 0 0 0 0 1510840739431
<rab...@localhost.1.31757.78> xx.xx.xx.xx 51428 running 1 {0,9,1} EXTERNAL 30 131072 0 0 0 0 0 0 1510907517645
<rab...@localhost.1.31217.81> xx.xx.xx.xx 51580 running 1 {0,9,1} EXTERNAL 30 131072 0 0 0 0 0 0 1510916989723
<rab...@localhost.1.19912.86> xx.xx.xx.xx 51859 running 1 {0,9,1} EXTERNAL 30 131072 0 0 0 0 0 0 1510931577289
<rab...@localhost.1.6941.93> xx.xx.xx.xx 52172 running 1 {0,9,1} EXTERNAL 30 131072 0 0 0 0 0 0 1510952465489
<rab...@localhost.1.21251.93> xx.xx.xx.xx 52202 running 1 {0,9,1} EXTERNAL 30 131072 0 0 0 0 0 0 1510953829505
<rab...@localhost.1.26585.96> xx.xx.xx.xx 52373 running 1 {0,9,1} EXTERNAL 30 131072 0 0 0 0 0 0 1510963814951
<rab...@localhost.1.29754.97> xx.xx.xx.xx 52400 running 1 {0,9,1} EXTERNAL 30 131072 0 0 0 0 0 0 1510966727121
$

and at the same time

$ sudo netstat -nat | grep xx.xx.xx.xx

so from the OS perspective there is no connection at all...
I saw one of the connections above "turning" into a zombie connection. The first output of rabbitmqctl looked like

xx.xx.xx.xx 52400 running 1 {0,9,1} EXTERNAL 30 131072 0 11268 31 3235 16 0 1510966727121

(note the stats), and then it turned into

51.9.165.227 52400 running 1 {0,9,1} EXTERNAL 30 131072 0 0 0 0 0 0 1510966727121

(all zeroes)
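
The all-zero counters suggest a simple heuristic for spotting such entries in the list_connections output. A minimal Java sketch, assuming the exact column order passed to rabbitmqctl above (16 whitespace-separated columns, with recv_oct, recv_cnt, send_oct and send_cnt at positions 10 to 13, counting from 0). The class name, method name and the `<pid-a>`/`<pid-b>` placeholders are made up for illustration, and a connection that has not exchanged any data yet may also match:

```java
final class ZombieConnectionCheck {
    // Column positions (0-based), assuming the column order used in the command above:
    // pid peer_host peer_port state channels protocol auth_mechanism timeout
    // frame_max channel_max recv_oct recv_cnt send_oct send_cnt send_pend connected_at
    private static final int EXPECTED_COLUMNS = 16;
    private static final int FIRST_COUNTER = 10; // recv_oct
    private static final int LAST_COUNTER = 13;  // send_cnt

    /** Returns true if the line has the expected shape and all four traffic counters are zero. */
    static boolean looksLikeZombie(String line) {
        String[] cols = line.trim().split("\\s+");
        if (cols.length != EXPECTED_COLUMNS) {
            return false; // a timestamp, a prompt, or a differently shaped line
        }
        for (int i = FIRST_COUNTER; i <= LAST_COUNTER; i++) {
            if (!cols[i].equals("0")) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // Hypothetical pids; the remaining columns are copied from the two outputs above.
        String[] lines = {
            "<pid-a> xx.xx.xx.xx 52400 running 1 {0,9,1} EXTERNAL 30 131072 0 11268 31 3235 16 0 1510966727121",
            "<pid-b> xx.xx.xx.xx 52400 running 1 {0,9,1} EXTERNAL 30 131072 0 0 0 0 0 0 1510966727121"
        };
        for (String line : lines) {
            System.out.println(looksLikeZombie(line) + "  " + line);
        }
    }
}
```

The output of rabbitmqctl could be piped into such a check line by line; it only flags candidates and does not prove that the OS-level socket is gone.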

From tcpdump's perspective it looked like the connection closure was initiated by the server:

01:02:23.512938 IP 10.0.175.117.5671 > xx.xx.xx.xx.52400: Flags [.], ack 12847, win 23472, length 0
01:02:23.514126 IP 10.0.175.117.5671 > xx.xx.xx.xx.52400: Flags [P.], seq 3321:3406, ack 12847, win 23472, length 85
01:02:23.514216 IP 10.0.175.117.5671 > xx.xx.xx.xx.52400: Flags [F.], seq 3406, ack 12847, win 23472, length 0
01:02:23.539430 IP xx.xx.xx.xx.52400 > 10.0.175.117.5671: Flags [.], ack 3406, win 583, length 0
01:02:23.545260 IP xx.xx.xx.xx.52400 > 10.0.175.117.5671: Flags [F.], seq 12847, ack 3407, win 583, length 0
01:02:23.545281 IP 10.0.175.117.5671 > xx.xx.xx.xx.52400: Flags [.], ack 12848, win 23472, length 0

At 01:02:23 the RabbitMQ log has only one line:

2017-11-18 01:02:23.514 [error] <0.29752.97> SSL: {connection,{alert,2,20,{"tls_record.erl",488},undefined}}: ssl_connection.erl:861:Fatal error: unexpected message

But it is not exactly the same connection pid that you can see for this host:port in the rabbitmqctl output above....

Hope that helps.

Cheers

Michael Klishin

Nov 20, 2017, 11:29:55 AM
to rabbitm...@googlegroups.com
list_connections will list RabbitMQ connection processes and the TLS implementation message likely logs a TLS socket process or a helper process of some kind.
Sockets have owner processes but I doubt TLS modules ever log them or make any assumptions about them.


hivehome.com



Hive | London | Cambridge | Houston | Toronto
The information contained in or attached to this email is confidential and intended only for the use of the individual(s) to which it is addressed. It may contain information which is confidential and/or covered by legal professional or other privilege. The views expressed in this email are not necessarily the views of Centrica plc, and the company, its directors, officers or employees make no representation or accept any liability for their accuracy or completeness unless expressly stated to the contrary. 
Centrica Connected Home Limited (company no: 5782908), registered in England and Wales with its registered office at Millstream, Maidenhead Road, Windsor, Berkshire SL4 5GD.

--
You received this message because you are subscribed to the Google Groups "rabbitmq-users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to rabbitmq-users+unsubscribe@googlegroups.com.
To post to this group, send email to rabbitmq-users@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.



--
MK

Staff Software Engineer, Pivotal/RabbitMQ

Dmitry Andrianov

Nov 20, 2017, 12:31:40 PM
to rabbitmq-users
Michael,
was it about my last comment that pid in the ssl error and in list_connections do not match?

That particular thing does not bother me too much. My only concern is that the connection numbers are quite inflated because they seem to include these "already dead" connections.
So the initial diagnosis was wrong: it is not that RabbitMQ fails to close them for inactivity. RabbitMQ closed them some time ago but keeps listing them for a long time (and reporting them in the overview via the REST API)...

Cheers

Michael Klishin

Nov 20, 2017, 6:25:53 PM
to rabbitm...@googlegroups.com
"Dead" TCP connections take a while to be detected. That's very old news:

Dmitry Andrianov

Nov 21, 2017, 5:06:13 PM
to rabbitmq-users
Michael, I am not sure I follow.

1. The link describes how heartbeat should be used to detect the dead connections but we DO use heartbeats and the interval is 30 seconds so it should kill the connection after a minute tops.

2. I am not sure if you saw it in my previous reply, but it does not look like a case of a connection dying from a timeout. Instead the server logs

2017-11-18 01:02:23.514 [error] <0.29752.97> SSL: {connection,{alert,2,20,{"tls_record.erl",488},undefined}}: ssl_connection.erl:861:Fatal error: unexpected message

and closes the connection (a FIN is sent to the client and a FIN/ACK is received back).

It really looks like a bug in the server to me that the connection is not removed properly after all that has happened.

Cheers

Dmitry Andrianov

Nov 23, 2017, 10:32:10 AM
to rabbitmq-users
Guys, this really does look like a bug. Do I need to collect any other evidence to prove it?
If so, please let me know what needs to be collected. I have plenty of instances with these dead connections listed.

Thanks

Luke Bakken

Nov 25, 2017, 6:05:59 PM
to rabbitmq-users
Hi Dmitry,

I will try to find some time to investigate this coming week.

Thanks,
Luke

Markevych Alexander

Nov 27, 2017, 10:48:18 AM
to rabbitmq-users
Hello.
We possibly have a similar issue.

RabbitMQ 3.6.12, Erlang 19.3.6.2

Our consumers do not use TLS to connect. We also use a client library, php-amqp, that does not support heartbeats. Our consumer nodes are started and stopped frequently, and a consumer may not close its connection to RabbitMQ.
Some of these connections continue to hang on the server. We think the consumers cannot send acks once they have failed, and we see a growing number of unacked messages.
But if a consumer has failed or exited, can the server detect the failed connection to the powered-off consumer node?

Michael Klishin

Nov 28, 2017, 6:49:55 PM
to rabbitm...@googlegroups.com
That error suggests an exception in Erlang's TLS implementation.

Dmitry Andrianov

Nov 30, 2017, 6:43:50 AM
to rabbitmq-users
Even if Erlang TLS throws, shouldn't RabbitMQ handle it and properly close the connection?

Michael Klishin

Nov 30, 2017, 8:45:06 PM
to rabbitm...@googlegroups.com
RabbitMQ connections terminate on pretty much every unknown exception
or dependent process death they encounter. We need a way to reproduce to say why this may not be the case.

Dmitry Andrianov

Dec 1, 2017, 5:05:54 AM
to rabbitmq-users
Michael, I have no problem with RabbitMQ terminating connections on unknown exceptions. I would do the same.
The problem here is that while the connection is gone from the OS perspective, RabbitMQ keeps listing it with "rabbitmqctl list_connections" for a very long time, if not forever.

I do not know how to reproduce it. I have plenty of connections in that state already, so I can run whatever Erlang voodoo you may want me to run to get additional info, but how to trigger it, I have no clue unfortunately.

Cheers

Michael Klishin

Dec 8, 2017, 1:04:16 PM
to rabbitm...@googlegroups.com
RabbitMQ can only know that a connection is gone if one of the following events happens:

 * The kernel and the runtime tell it by e.g. raising a socket exception (sockets can be "proactive" and send messages to their owning processes in Erlang)
 * A heartbeat timeout is detected [1]

And the latter exists exactly because the former can take forever.
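
The heartbeat rule can be pictured as a small monitor that remembers when the peer was last heard from and declares the connection dead once too much time passes with no traffic. The following is only an illustration in Java of that idea, not RabbitMQ's actual (Erlang) implementation. The class name, the explicit clock arguments and the cut-off of two silent intervals (in line with the "a minute tops" estimate for a 30s heartbeat earlier in the thread) are assumptions:

```java
final class HeartbeatMonitor {
    private final long timeoutMillis;
    private long lastActivityMillis;

    HeartbeatMonitor(long intervalSeconds, long nowMillis) {
        // Assumption: the peer is considered dead after two intervals of silence.
        this.timeoutMillis = 2 * intervalSeconds * 1000;
        this.lastActivityMillis = nowMillis;
    }

    /** Call whenever any frame (heartbeat or not) arrives from the peer. */
    void onTrafficReceived(long nowMillis) {
        lastActivityMillis = nowMillis;
    }

    /** Call periodically; true means the connection should be closed and cleaned up. */
    boolean isTimedOut(long nowMillis) {
        return nowMillis - lastActivityMillis > timeoutMillis;
    }

    public static void main(String[] args) {
        HeartbeatMonitor monitor = new HeartbeatMonitor(30, 0L);
        monitor.onTrafficReceived(10_000L);                 // last frame seen at t = 10s
        System.out.println(monitor.isTimedOut(40_000L));    // 30s of silence
        System.out.println(monitor.isTimedOut(70_001L));    // just over 60s of silence
    }
}
```

The point of such a timer is that it does not depend on the kernel reporting anything about the socket, which is why a connection that stays listed well past this window is surprising.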


Dmitry Andrianov

Dec 14, 2017, 5:00:42 AM
to rabbitmq-users

So what is happening in our case then?
The heartbeats definitely cannot be happening on these dead sockets because they are closed from the OS perspective.

Michael Klishin

Dec 18, 2017, 2:23:51 AM
to rabbitm...@googlegroups.com
I'm not sure what to add to the above. We are not aware of any cases where the heartbeat timeout mechanism would not
kick in.

Dmitry Andrianov

Dec 21, 2017, 6:08:56 AM
to rabbitmq-users
Okay then. Here is how to reproduce.
RabbitMQ 3.6.12, Erlang 19.3.6

The client application is below. Note that it does something extremely dirty that a normal client should not do: it gets access to the raw bytes of the TLS stream and injects an invalid frame there. The point is to trigger the SSL error on the server side.


import java.io.OutputStream;
import java.lang.reflect.Field;

import com.rabbitmq.client.Connection;
import com.rabbitmq.client.ConnectionFactory;
import com.rabbitmq.client.DefaultSaslConfig;
import com.rabbitmq.client.impl.SocketFrameHandler;
import sun.security.ssl.SSLSocketImpl;

public class FAMQPS {
................
    // Reads a private field via reflection; used only to reach the raw socket stream
    private static Object getField(final Object target, final String name) {
        try {
            final Field field = target.getClass().getDeclaredField(name);
            field.setAccessible(true);
            return field.get(target);
        } catch (Exception e) {
            // Keep the original exception as the cause so the failure can be diagnosed
            throw new RuntimeException("Failed to get field '" + name + "' on " + target, e);
        }
    }

    public static void main(String[] args) {
        try {
            final ConnectionFactory factory = new ConnectionFactory();
            factory.setHost(args[0]);
            factory.useSslProtocol(createSslContext());
            factory.setSaslConfig(DefaultSaslConfig.EXTERNAL);
            factory.setAutomaticRecoveryEnabled(false);
            final Connection connection = factory.newConnection();

            System.out.println("connection = " + connection);

            // Get actual TCP socket raw output stream (bypasses the TLS layer)
            final SocketFrameHandler _frameHandler = (SocketFrameHandler) getField(connection, "_frameHandler");
            final SSLSocketImpl _socket = (SSLSocketImpl) getField(_frameHandler, "_socket");
            final OutputStream sockOutput = (OutputStream) getField(_socket, "sockOutput");

            System.out.println("Sleeping for 5 sec");
            Thread.sleep(5000);

            sockOutput.flush();
            // record type = 0x17 - APPLICATION_DATA
            // version = 0x303 - TLS 1.2
            // length = 0x0002
            // then two zero bytes of "payload"
            sockOutput.write(new byte[] {0x17, 0x03, 0x03, 0x00, 0x02, 0x00, 0x00});
            sockOutput.flush();

            System.out.println("Rubbish sent");

            System.out.println("Sleeping for 10 sec");
            Thread.sleep(10000);

            System.out.println("Finished");

            System.exit(0);

        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}



Of course normal clients should not do that, but the point is to demonstrate the problem on the server side.
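
For reference, the seven injected bytes are a 5-byte TLS record header (content type, two version bytes, two length bytes) followed by a 2-byte body. A small sketch that decodes such a header, in Java to match the client above; the class and method names are made up for illustration:

```java
final class TlsRecordHeader {
    final int contentType;   // 0x17 = application_data
    final int versionMajor;  // 3
    final int versionMinor;  // 3 (3.3 on the wire means TLS 1.2)
    final int length;        // number of body bytes that follow the header

    private TlsRecordHeader(int contentType, int versionMajor, int versionMinor, int length) {
        this.contentType = contentType;
        this.versionMajor = versionMajor;
        this.versionMinor = versionMinor;
        this.length = length;
    }

    /** Decodes the first five bytes of a TLS record. */
    static TlsRecordHeader parse(byte[] bytes) {
        if (bytes.length < 5) {
            throw new IllegalArgumentException("A TLS record header is 5 bytes long");
        }
        int type = bytes[0] & 0xFF;
        int major = bytes[1] & 0xFF;
        int minor = bytes[2] & 0xFF;
        int length = ((bytes[3] & 0xFF) << 8) | (bytes[4] & 0xFF); // big-endian 16-bit length
        return new TlsRecordHeader(type, major, minor, length);
    }

    public static void main(String[] args) {
        // The same bytes the test client writes to the raw socket
        byte[] injected = {0x17, 0x03, 0x03, 0x00, 0x02, 0x00, 0x00};
        TlsRecordHeader h = parse(injected);
        System.out.printf("type=0x%02x version=%d.%d length=%d%n",
                h.contentType, h.versionMajor, h.versionMinor, h.length);
    }
}
```

So the header itself is well formed; it is the two-byte body that cannot be decrypted as application data, which seems consistent with the decryption_failed detail in the broker log quoted in this post.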

I start it, and it waits for 5 seconds before injecting that bad payload. During that time, on the broker, I list the connections from the client machine:

$ sudo /usr/sbin/rabbitmqctl list_connections pid peer_host peer_port recv_oct recv_cnt send_oct send_cnt send_pend connected_at ssl_protocol ssl_key_exchange ssl_cipher ssl_hash | grep "10.0.86.50"
<rab...@localhost.1.1142.0> 10.0.86.50 50916 1929 9 2174 5 0 1513853515748 tlsv1.2 ecdhe_rsa aes_256_cbc sha384

All good, the connection is established. Then 5 seconds pass and the client injects that bad frame, which is reflected in the RabbitMQ log file on the broker machine as

2017-12-21 10:51:55.752 [info] <0.1142.0> accepting AMQP connection <0.1142.0> (10.0.86.50:50916 -> 10.0.170.32:5671)
...
2017-12-21 10:52:00.918 [error] <0.1139.0> SSL: {connection,{alert,2,20,{"ssl_cipher.erl",276},decryption_failed}}: ssl_connection.erl:861:Fatal error: unexpected message

And if we repeat the same rabbitmqctl command we will see

$ sudo /usr/sbin/rabbitmqctl list_connections pid peer_host peer_port recv_oct recv_cnt send_oct send_cnt send_pend connected_at ssl_protocol ssl_key_exchange ssl_cipher ssl_hash | grep "10.0.86.50"
<rab...@localhost.1.1142.0> 10.0.86.50 50916 0 0 0 0 0 1513853515748

So the connection is still listed, although all the details are gone now.
And netstat -nt | grep 50916 shows nothing, as the connection is gone.

It may very well be a bug in Erlang or its TLS but to me as an end user it looks like a bug in RabbitMQ.

Cheers

Michael Klishin

Dec 24, 2017, 12:41:48 AM
to rabbitm...@googlegroups.com
Well, injecting invalid frames can definitely confuse the parser and the connection state machine, among other things.
But what I'd expect to happen here is that a TLS alert is raised yet the runtime does not propagate the error to the TLS socket
quickly. That's my best guess anyway.

In 3.7.x you can enable debug log level and see more logs, sometimes from parts of OTP if not RabbitMQ itself (I can't say we added a whole lot of debug
logging to mature subsystems such as connection handling).

Dmitry Andrianov

Jan 3, 2018, 7:17:02 PM
to rabbitmq-users
Happy new year everyone!

Would it be possible for you guys to reproduce the issue given the client source I posted? In my experiments the success rate is 100%.
I even re-checked with the latest RabbitMQ 3.6.14 and Erlang/OTP 20.1.7 and the issue is still reproducible: it leaves these "dead" sessions hanging.
Although the error is logged slightly differently:

RabbitMQ 3.6.12, Erlang 19.3.6

2018-01-04 00:12:11.694 [error] <0.905.2530> SSL: {connection,{alert,2,20,{"ssl_cipher.erl",276},decryption_failed}}: ssl_connection.erl:861:Fatal error: unexpected message

RabbitMQ 3.6.14, Erlang 20.1.7

2018-01-04 00:00:33.118 [info] <0.545.537> TLS server: In state {connection,
                         {alert,2,20,
                             {"ssl_cipher.erl",276},
                             undefined,decryption_failed}} at ssl_connection.erl:848 generated SERVER ALERT: Fatal - Unexpected Message

We cannot switch to 3.7 at this point.

You guys are in a much better position to say whether it is an Erlang issue or not. If I attempt to raise a ticket in their bug tracker, I am sure they will say the problem is on the RabbitMQ side and suggest contacting you instead.

Cheers

Luke Bakken

Jan 4, 2018, 11:15:11 AM
to rabbitmq-users
Hello Dmitry,

Thank you for the code you used to reproduce this issue. I have made a note to visit this when I get a chance.

Luke
--
Staff Software Engineer
Pivotal / RabbitMQ

Michael Klishin

Jan 7, 2018, 12:11:09 PM
to rabbitm...@googlegroups.com
The OTP team are actually pretty helpful when enough information is provided, and fixing TLS implementation
bugs and edge cases is not something they are new to.

For example, Luke has recently filed https://bugs.erlang.org/browse/ERL-539 and we've already seen some progress on it.


Dmitry Andrianov

Jan 7, 2018, 12:59:11 PM
to rabbitmq-users
Michael, the key there is "when enough information is provided".
I am no Erlang developer and I have no idea how to provide them what is needed. Or even how to reproduce it outside of RabbitMQ.

Moreover, from my perspective it looks like a bug in RabbitMQ, not Erlang. Even if the runtime failed to tell you about the socket closure, these connections are still configured with a 30s heartbeat, so after some time RabbitMQ should kill them for lack of activity. And this is not happening.
(There can, of course, be a bug in Erlang in addition to a bug in RabbitMQ but, again, it is difficult to see from where I sit. After all, an error was logged, so it looks like something detected the problem and did something about it.)

Cheers

Michael Klishin

Jan 11, 2018, 11:29:06 AM
to rabbitm...@googlegroups.com
Someone has provided a curious example that may or may not be related directly
but is worth mentioning: https://github.com/rabbitmq/rabbitmq-java-client/issues/341.

The issue there is that a RabbitMQ client library socket can get into a state where it is not usable but
the runtime does not throw any exceptions in the I/O loop, so the client never learns that the socket is dead
and never initiates connection recovery.



Dmitry Andrianov

Jan 11, 2018, 3:50:53 PM
to rabbitmq-users
I read the thread but I am not sure it is related.
It is definitely not a case of transmission freezing because TCP buffers are full and the other side does not consume quickly enough.

Again, the OS-level socket is gone, so both sides exchanged FIN/ACK messages and there is no socket from netstat's perspective.
The only problem is that the connection is still listed by rabbitmqctl list_connections.

May I at least raise an issue on GitHub? :)

Cheers

Michael Klishin

Jan 11, 2018, 4:24:08 PM
to rabbitm...@googlegroups.com
Go ahead :)

Dmitry Andrianov

Jan 11, 2018, 8:48:33 PM
to rabbitmq-users
Here we go: https://github.com/rabbitmq/rabbitmq-server/issues/1474
This is the test app and how exactly the rabbit was installed - https://github.com/dimas/rabbitmq-broker-tls-issue-1474

It is very straightforward and should be super easy to reproduce, but if you prefer I can probably take a snapshot of the EC2 machine where I ran the test and somehow share the AMI with you.

Cheers!

Dmitry Andrianov

Jan 28, 2018, 7:34:45 PM
to rabbitmq-users

Hi. Did you guys have a chance to reproduce it?

I should have mentioned this before, but we started seeing the problem when we switched from the Amazon classic ELB to NLB. So my suspicion is that the load balancer triggers the SSL issue, but then it is all down to RabbitMQ to make it a big deal :)

Cheers

Dmitry Andrianov

Feb 5, 2018, 3:11:17 AM
to rabbitmq-users

Guys, sorry for nagging, but is there any news about this?
Are we really the only people who have reported the issue, and does it not seem to affect others?

Cheers

Michael Klishin

Feb 5, 2018, 7:58:54 AM
to rabbitm...@googlegroups.com
We are not aware of other reports. We will try to allocate some time for this but no promises.

Michael Klishin

Feb 5, 2018, 8:25:10 AM
to rabbitmq-users
There were known scenarios where connections would not detect a TCP socket closure while
in alarmed state. It was addressed in [1] and we haven't
seen that topic brought up on this list more than a few times ever since.

When it was brought up, IIRC switching to TCP keepalives with sensible values [2][3] — since heartbeats only exist
because Linux still uses values that made sense in 1996 — was a functional alternative. Is that something you
can investigate in your case?

1. https://github.com/rabbitmq/rabbitmq-common/pull/31
2. http://www.rabbitmq.com/heartbeats.html#tcp-keepalives
3. http://www.rabbitmq.com/networking.html


Michael Klishin

Feb 6, 2018, 8:14:23 AM
to rabbitm...@googlegroups.com
Here's what our findings so far are:
https://github.com/rabbitmq/rabbitmq-server/issues/1474.

While there is an interesting RabbitMQ connection behavior that arguably should be different,
we can reproduce this behaviour only with some Erlang versions, so there's clearly a difference in TLS socket
behavior.

Dmitry Andrianov

Feb 6, 2018, 1:19:00 PM
to rabbitmq-users
Sorry, Michael, just to clarify: are you suggesting that we try to ditch AMQP-level heartbeats in favour of standard TCP/IP keepalives?

I do not know the answer to that off the top of my head. It will probably work, but it is a substantial investment to test it properly: we need to make sure keepalives are properly supported by all the clients we have, including the cut-down Linux versions. Maybe it really is worth trying, as it would lower our idle traffic a bit.

But what makes you think it will solve the issue described here? As I said, from the OS perspective the connection is closed: it is not present in the netstat output. The connection closure is NOT caused by a missed heartbeat, so changing one heartbeat mechanism to another (keepalive) probably won't do much. It is the unhandled error from the SSL layer that is the problem...

Cheers


Michael Klishin

Feb 6, 2018, 1:58:23 PM
to rabbitm...@googlegroups.com
They are not mutually exclusive.

Anyhow, now that we know what's going on (see https://github.com/rabbitmq/rabbitmq-server/issues/1474)
and contributed a patch for Erlang/OTP, that recommendation no longer stands:

 * https://bugs.erlang.org/browse/ERL-562
 * https://github.com/erlang/otp/pull/1709

Dmitry Andrianov

Feb 7, 2018, 5:36:56 AM
to rabbitmq-users
Yeah, I saw it after replying. That is super cool guys that you got to the bottom of it!
Thanks a lot.