detecting "broken" TCP connections


Alen Vrečko

Nov 29, 2016, 4:29:24 AM
to mechanica...@googlegroups.com
Got a situation where a thread hung on a socket read (old-school
blocking socket I/O code). One side was in TCP ESTABLISHED while the
other was in FIN_WAIT_2. The customer was "upgrading" the switches at
the time this happened.

The thread will never complete. It should get a timeout exception, but
it doesn't. The code does call Socket#setSoTimeout, which should do the
job. My first thought was that there must be a bug in setSoTimeout. I
never had much faith in SO_TIMEOUT, and was not surprised to find a lot
of bug reports related to socketRead0 hangs. It reminded me of this
blog post about a hung Postgres connection [1].
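
For reference, a minimal sketch of what Socket#setSoTimeout is expected to do on a healthy connection: a blocking read on a silent peer should end with a SocketTimeoutException. The class and method names (SoTimeoutDemo, readTimesOut) are made up for illustration, and the loopback setup does not reproduce the hang described above; it only shows the behavior the legacy code was relying on.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.net.InetAddress;
import java.net.ServerSocket;
import java.net.Socket;
import java.net.SocketTimeoutException;

public class SoTimeoutDemo {

    /** Returns true if a blocking read on a silent peer ends with SocketTimeoutException. */
    static boolean readTimesOut(int timeoutMillis) {
        // Loopback server that accepts the connection but never writes anything.
        try (ServerSocket server = new ServerSocket(0, 1, InetAddress.getLoopbackAddress());
             Socket client = new Socket(server.getInetAddress(), server.getLocalPort());
             Socket accepted = server.accept()) {
            client.setSoTimeout(timeoutMillis); // must be set before entering the blocking read
            InputStream in = client.getInputStream();
            try {
                in.read();      // blocks, because the peer stays silent
                return false;   // only reached if data or EOF arrived
            } catch (SocketTimeoutException expected) {
                return true;    // the timeout fired as documented
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("read timed out: " + readTimesOut(200));
    }
}
```

If a check like this passes on the affected host while production reads still hang, that would point away from a plain setSoTimeout bug and toward how the timeout is applied in the legacy code path, though that is speculation without seeing the code.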

I'd use NIO and app-level timeouts, but it is legacy code that I
can't/don't want to touch.

Been thinking of using a custom SocketFactory that wraps the sockets
with some monitoring code. Pretty ugly. It doesn't feel right.
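
A rough sketch of what such a monitoring factory could look like, assuming the legacy code obtains its sockets through a pluggable javax.net.SocketFactory (an assumption, the thread does not say). WatchdogSocketFactory and its hard per-socket deadline are hypothetical; the sketch relies only on the documented behavior that closing a Socket from another thread makes a blocked read throw a SocketException.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.InetAddress;
import java.net.ServerSocket;
import java.net.Socket;
import java.net.SocketException;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import javax.net.SocketFactory;

/** Hypothetical factory: every socket it hands out is force-closed after a hard deadline. */
public class WatchdogSocketFactory extends SocketFactory {
    private final SocketFactory delegate = SocketFactory.getDefault();
    private final ScheduledExecutorService watchdog =
            Executors.newSingleThreadScheduledExecutor(r -> {
                Thread t = new Thread(r, "socket-watchdog");
                t.setDaemon(true); // do not keep the JVM alive just for the watchdog
                return t;
            });
    private final long deadlineMillis;

    public WatchdogSocketFactory(long deadlineMillis) {
        this.deadlineMillis = deadlineMillis;
    }

    /** Schedules a close of the socket once the deadline has passed. */
    private Socket watch(Socket s) {
        watchdog.schedule(() -> {
            try { s.close(); } catch (IOException ignored) { }
        }, deadlineMillis, TimeUnit.MILLISECONDS);
        return s;
    }

    @Override public Socket createSocket(String host, int port) throws IOException {
        return watch(delegate.createSocket(host, port));
    }
    @Override public Socket createSocket(String host, int port, InetAddress local, int localPort)
            throws IOException {
        return watch(delegate.createSocket(host, port, local, localPort));
    }
    @Override public Socket createSocket(InetAddress host, int port) throws IOException {
        return watch(delegate.createSocket(host, port));
    }
    @Override public Socket createSocket(InetAddress host, int port, InetAddress local, int localPort)
            throws IOException {
        return watch(delegate.createSocket(host, port, local, localPort));
    }

    /** Demo: returns true if a read blocked on a silent peer is broken up by the watchdog. */
    static boolean blockedReadIsInterrupted(long deadlineMillis) {
        try (ServerSocket server = new ServerSocket(0, 1, InetAddress.getLoopbackAddress())) {
            Socket client = new WatchdogSocketFactory(deadlineMillis)
                    .createSocket(server.getInetAddress(), server.getLocalPort());
            try (Socket accepted = server.accept()) {
                client.getInputStream().read(); // no SO_TIMEOUT set, so this would block forever
                return false;
            } catch (SocketException closedByWatchdog) {
                return true; // the watchdog closed the socket under the blocked reader
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("blocked read interrupted: " + blockedReadIsInterrupted(200));
    }
}
```

A fixed deadline per socket is deliberately crude: it also kills healthy long-lived connections. A usable monitor would need to track last read/write activity, which is where the wrapping gets ugly.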

Found quite a few discussions about this. But not really any solutions
that don't require app level changes.

Any thoughts? Anybody in a similar boat?

[1] https://tech.zalando.com/blog/hack-to-terminate-tcp-conn-postgres/

Wojciech Kudla

Nov 29, 2016, 4:35:18 AM
to mechanica...@googlegroups.com

Any chance that socket connection is handled by some sort of kernel bypass?
All bets with blocking IO are off when running with onload/offload drivers.



Alen Vrečko

Nov 29, 2016, 4:48:36 AM
to mechanica...@googlegroups.com
No. It is just a typical "off the shelf" Linux setup. Thanks for the insight.

Justin Mason

Nov 29, 2016, 7:00:49 AM
to mechanica...@googlegroups.com
I think that, as the Zalando blog post suggested, you could use OS-level TCP keepalive to probe the connections regularly, so the kernel will eventually notice that the TCP connection is dead: http://tldp.org/HOWTO/TCP-Keepalive-HOWTO/usingkeepalive.html. By default, Linux waits for 2 hours of inactivity before sending the first probe, which is too long for many use cases.

I generally prefer to perform app-level keepalives with app-controlled timeouts and retry settings, but in this case if it's legacy code, a kernel-level sysctl tweak may be more palatable!
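
A small sketch of the kernel-level route, under stated assumptions. The kernel only probes sockets that have SO_KEEPALIVE enabled, so the sysctl values (net.ipv4.tcp_keepalive_time, net.ipv4.tcp_keepalive_intvl, net.ipv4.tcp_keepalive_probes on Linux) matter only once the socket has opted in, for example via Socket#setKeepAlive. The class and method names are made up for illustration, the detection-time formula is an approximation (idle time plus one interval per unanswered probe), and the "tuned" values are hypothetical, not a recommendation.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.Socket;

public class KeepAliveSketch {

    /** Enables SO_KEEPALIVE on the socket; without it the kernel sends no probes. */
    static Socket withKeepAlive(Socket socket) {
        try {
            socket.setKeepAlive(true);
            return socket;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Approximate seconds of silence before the kernel declares the peer dead. */
    static long detectionSeconds(long idleSeconds, long intervalSeconds, int probes) {
        return idleSeconds + intervalSeconds * probes;
    }

    public static void main(String[] args) throws IOException {
        // Unconnected socket, used only to show that the option can be set and read back.
        try (Socket s = withKeepAlive(new Socket())) {
            System.out.println("SO_KEEPALIVE enabled: " + s.getKeepAlive());
        }
        // Linux defaults: 7200 s idle, 75 s between probes, 9 probes.
        System.out.println("defaults: " + detectionSeconds(7200, 75, 9) + " s");
        // Hypothetical tuned values: 60 s idle, 10 s between probes, 5 probes.
        System.out.println("tuned:    " + detectionSeconds(60, 10, 5) + " s");
    }
}
```

With the Linux defaults this works out to 7875 seconds, a bit over two hours and eleven minutes, before a blocked read on a dead connection fails; the hypothetical tuned values bring it down to 110 seconds.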

--j.

Greg Young

Nov 29, 2016, 7:16:42 AM
to mechanica...@googlegroups.com
In my experience protocol-level TCP keepalives don't always work
between implementations. BSD to Windows used to be a primary culprit:
even though keepalives were set, they would not fire in some cases.
Things may be better today. Between two hosts running the same
implementation they should work quite well. Definitely worth testing if
you deal with multiple implementations.
Studying for the Turing test

Alen Vrečko

Dec 1, 2016, 9:17:23 AM
to mechanica...@googlegroups.com
Thank you all for your advice. I dismissed keepalive early on for the
wrong reasons. I thought there must be something better available that
I had missed. I'll go with keepalive.