Short-lived stale zombie connection after recovery with short NetworkRecoveryInterval

Austin Nakasone

Oct 9, 2019, 4:38:49 PM

to rabbitmq-users

Hello rabbitmq-users,

I'm running:

RabbitMQ server 3.7.12 with Erlang 21.2.6 (clustered setup)

RabbitMQ client 5.7.3

I'm seeing a possible issue: after a disconnection and recovery, a stale / zombie connection lingers for a short time. Is that a problem?

The issue happens with client config:

factory.setRequestedHeartbeat(60);
factory.setAutomaticRecoveryEnabled(true);
factory.setNetworkRecoveryInterval(5000); // 5s is the default

For example, after the initial connection

- view the admin UI to ensure that the client consumer is connected

- disconnect the client's network cable

- wait for a SocketException or UnknownHostException on the client

- reconnect the client's network cable

- the client successfully recovers as expected

- view the admin UI again: it now shows two consumer connections

- after a while, the original connection is cleared

Is it a problem that there are two connections for a short time before the server cleans up the old one? I suspect this happens because, after the client first disconnects, the server is still waiting for 3 missed heartbeats before it clears the old connection. Since the default recovery interval is only 5s, the client reconnects before that happens, and there are now two connections. During recovery, should the client check for existing connections and close them before opening a new one?

Thanks,

Austin.

P.S. I was able to work around this by calling setNetworkRecoveryInterval(150000), i.e. 150s, or 2.5 times the 60s heartbeat timeout. Setting the heartbeat timeout to 30s and the network recovery interval to 90s works similarly.
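The timing argument above can be sketched as simple arithmetic. This is only a rough model, not anything from the RabbitMQ client: the class and method names are made up for illustration, the time the client itself needs to notice the failure is ignored, and the server-side detection delay is an input you have to estimate yourself (here it is assumed to equal one heartbeat timeout).

```java
import java.time.Duration;

// Hypothetical helper, not part of any RabbitMQ library.
class RecoveryTiming {

    // How long the old and new connections may coexist: the server-side
    // detection delay minus the client's recovery interval, floored at zero.
    static Duration overlap(Duration serverDetectionDelay, Duration recoveryInterval) {
        Duration gap = serverDetectionDelay.minus(recoveryInterval);
        return gap.isNegative() ? Duration.ZERO : gap;
    }

    // Recovery interval derived as a multiple of the heartbeat timeout.
    static Duration recoveryInterval(Duration heartbeatTimeout, double multiplier) {
        return Duration.ofMillis(Math.round(heartbeatTimeout.toMillis() * multiplier));
    }

    public static void main(String[] args) {
        Duration heartbeat = Duration.ofSeconds(60);
        // ASSUMPTION: the server notices the dead peer roughly one
        // heartbeat timeout after the cable is pulled.
        Duration assumedDetection = heartbeat;

        // Default 5s recovery interval: the two connections can overlap.
        System.out.println(overlap(assumedDetection, Duration.ofSeconds(5)).getSeconds()); // 55

        // Recovery interval of 2.5x the heartbeat timeout, as in the workaround.
        Duration longer = recoveryInterval(heartbeat, 2.5);
        System.out.println(longer.toMillis());                              // 150000
        System.out.println(overlap(assumedDetection, longer).getSeconds()); // 0
    }
}
```

The millisecond value from recoveryInterval(...) is what would be passed to factory.setNetworkRecoveryInterval(...). The trade-off is that a longer interval also delays every recovery by that much.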

Michael Klishin

Oct 9, 2019, 4:53:18 PM

to rabbitmq-users

If a connection is detected as unavailable [1], it cannot be "recovered". A new connection cannot be "attached" to an existing TCP connection.

The OS can throw a socket error in the client when you yank the cable but, because of how TCP works [1], the RabbitMQ node won't find out about that for a while. Failure detection with TCP is asymmetric and asynchronous. [1] explains it in quite a bit of detail.

[2] explains what to do when this becomes a mass event (e.g. 1K clients lose connections and reconnect in a short period of time).

[3] can be used as an alternative to heartbeats but the nature of failure detection does not fundamentally change with it.
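As a rough illustration of the TCP keepalive option in [3], the sketch below turns on SO_KEEPALIVE for a plain java.net.Socket using only the JDK. The class and method names are made up for this example. How the socket is handed to the RabbitMQ client is left out: the Java client has a socket configurator hook on ConnectionFactory for this kind of setting, and its documentation has the exact wiring. The keepalive probe timing itself is an OS-level setting, not something this code controls.

```java
import java.io.IOException;
import java.net.Socket;

// Hypothetical helper, not part of any RabbitMQ library.
class KeepAliveSketch {

    // Enables SO_KEEPALIVE so the OS probes an idle peer.
    // Probe interval and count come from kernel settings.
    static Socket enableKeepAlive(Socket socket) throws IOException {
        socket.setKeepAlive(true);
        return socket;
    }

    public static void main(String[] args) throws IOException {
        // An unconnected socket is enough to show the option being set.
        try (Socket socket = new Socket()) {
            enableKeepAlive(socket);
            System.out.println("SO_KEEPALIVE enabled: " + socket.getKeepAlive());
        }
    }
}
```

As the message above says, this changes who sends the probes, not the asymmetric nature of failure detection.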

1. https://www.rabbitmq.com/heartbeats.html

2. https://www.rabbitmq.com/networking.html#dealing-with-high-connection-churn

3. https://www.rabbitmq.com/networking.html#tcp-keepalives

--
You received this message because you are subscribed to the Google Groups "rabbitmq-users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to rabbitmq-user...@googlegroups.com.
To view this discussion on the web, visit https://groups.google.com/d/msgid/rabbitmq-users/43ece308-291f-4a85-af34-50390efc931e%40googlegroups.com.

--

MK

Staff Software Engineer, Pivotal/RabbitMQ

Austin

Oct 10, 2019, 5:48:56 PM

to rabbitmq-users

Hello Michael,

Thank you for the quick reply and the links you posted. Would it be possible, in the future, for the client to send the server some info about the old / bad connection when it attempts to recover? Then, before the server creates the new connection, it could clean up the old one, if it hasn't already been cleaned up due to missed heartbeats.

Thanks,

Austin


Michael Klishin

Oct 10, 2019, 7:34:56 PM

to rabbitmq-users

If there was a stable way to identify a connection, it would be possible with a protocol extension. There isn't enough interest for this feature. If you are experiencing connection churn or need quicker "stale" connection detection, there are existing mechanisms and documentation guides that explain how to cope with/achieve that [1][2].

MQTT has a similar feature, the uniqueness of client ID. It can lead to operationally problematic and confusing behavior where clients with a shared client ID (unintentionally due to the lack of understanding of this feature or a misconfiguration) keep reconnecting and kicking each other out. This solves few problems and causes a lot of confusion.

I personally think it's one of the "be careful about what you wish for" stories in networking protocols and distributed systems.
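The "kicking each other out" behavior can be shown with a toy model. This is not how RabbitMQ or any MQTT broker is implemented. The registry class, the client ID "sensor-1" and the connection labels are all made up for illustration. The only rule modelled is that a new connection using an existing client ID evicts the one that holds it.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy model of "latest connection with a client ID wins". Hypothetical.
class ClientIdTakeover {
    private final Map<String, String> activeByClientId = new HashMap<>();
    private final List<String> evicted = new ArrayList<>();

    // Registers a connection; whoever already holds the client ID is evicted.
    void connect(String clientId, String connectionName) {
        String previous = activeByClientId.put(clientId, connectionName);
        if (previous != null) {
            evicted.add(previous);
        }
    }

    String active(String clientId) {
        return activeByClientId.get(clientId);
    }

    List<String> evicted() {
        return evicted;
    }

    public static void main(String[] args) {
        ClientIdTakeover broker = new ClientIdTakeover();
        // Two clients, A and B, misconfigured with the same client ID.
        broker.connect("sensor-1", "A#1");
        broker.connect("sensor-1", "B#1"); // evicts A#1
        broker.connect("sensor-1", "A#2"); // A reconnects, evicts B#1
        broker.connect("sensor-1", "B#2"); // B reconnects, evicts A#2

        System.out.println(broker.evicted());          // [A#1, B#1, A#2]
        System.out.println(broker.active("sensor-1")); // B#2
    }
}
```

With automatic reconnection on both sides, this loop continues indefinitely, which is the operational confusion described above.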

1. https://www.rabbitmq.com/heartbeats.html#false-positives

2. https://www.rabbitmq.com/networking.html#dealing-with-high-connection-churn


Christopher A Sandstrom

Oct 24, 2019, 2:29:01 AM

to rabbitmq-users

I was looking for this type of post which is related to Network Recovery Interval. Finally I got the solution. Thank you Michael.
