On Mon, 4 Jun 2012, GH wrote:
> Now what's still puzzling me is as in my original post when the child
> does not close those fds: netstat shows on client side the connection in
> ESTABLISHED state while on server side netstat reports no such
> connection. I would expect that kernel on the server should close the
> socket and recv() on client should return 0. But what I saw is that
> client hangs on recv().

In one case I saw similar symptoms. Both client and server belonged to the
same organization. They had internal NAT. After establishing the
connection and exchanging some data initially, there was no traffic for a
long time. The connection tracking entries in "some" (*) routers between
them had an expiration of 10+ minutes or so. When such an entry was
deleted, both sides remained in ESTABLISHED. As soon as one side wanted to
transmit, it timed out after a while and closed its socket. (The main
problem was that this timeout blocked message transfer for too long.) The
other side could stay in ESTABLISHED indefinitely if it only tried to
read.

(*) We simply could not figure out which router caused the problem, let
alone raise the conntrack entries' lifetime. This was an organizational
problem, not a technical one. Our pain was simply not important enough for
network operations. Modifying the peers (i.e. introducing application-level
pings) didn't appear possible, since we had the source to neither.

I wrote an SSL socket factory in Java that set the SO_KEEPALIVE socket
option on the client side and delegated the "rest" of the work to the real
SSL socket factory. (J2EE provides some way to "preload" such a factory
IIRC.) Simultaneously the OS keepalive interval (pre-probe wait) was
lowered (from the default 2 hours) to 10 minutes. These kept the unknown
conntrack entries alive, and I was admitted into the department of
clandestine enterprise operations. (Just kidding.)
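
For illustration, a minimal sketch of such a delegating factory. This is not
the original code: the class name KeepAliveSSLSocketFactory and the
constructor taking the "real" factory are assumptions made for the example.
It only turns on SO_KEEPALIVE; the probe timing itself still comes from the
OS setting.

```java
import java.io.IOException;
import java.net.InetAddress;
import java.net.Socket;
import javax.net.ssl.SSLSocketFactory;

// Hypothetical wrapper: hands every request to the real factory and
// enables SO_KEEPALIVE on the socket that comes back.
final class KeepAliveSSLSocketFactory extends SSLSocketFactory {
    private final SSLSocketFactory delegate;

    KeepAliveSSLSocketFactory(SSLSocketFactory delegate) {
        this.delegate = delegate;
    }

    // Turn on TCP keepalive; close the socket if the option cannot be set.
    private static Socket keepAlive(Socket s) throws IOException {
        try {
            s.setKeepAlive(true);
            return s;
        } catch (IOException e) {
            s.close();
            throw e;
        }
    }

    @Override
    public String[] getDefaultCipherSuites() {
        return delegate.getDefaultCipherSuites();
    }

    @Override
    public String[] getSupportedCipherSuites() {
        return delegate.getSupportedCipherSuites();
    }

    @Override
    public Socket createSocket() throws IOException {
        return keepAlive(delegate.createSocket());
    }

    @Override
    public Socket createSocket(Socket s, String host, int port, boolean autoClose)
            throws IOException {
        return keepAlive(delegate.createSocket(s, host, port, autoClose));
    }

    @Override
    public Socket createSocket(String host, int port) throws IOException {
        return keepAlive(delegate.createSocket(host, port));
    }

    @Override
    public Socket createSocket(String host, int port, InetAddress localHost, int localPort)
            throws IOException {
        return keepAlive(delegate.createSocket(host, port, localHost, localPort));
    }

    @Override
    public Socket createSocket(InetAddress host, int port) throws IOException {
        return keepAlive(delegate.createSocket(host, port));
    }

    @Override
    public Socket createSocket(InetAddress address, int port,
                               InetAddress localAddress, int localPort) throws IOException {
        return keepAlive(delegate.createSocket(address, port, localAddress, localPort));
    }
}
```

How the factory gets installed depends on the container and is not shown.
The idle time before the first probe is an OS-wide setting; on Linux, for
example, it is the net.ipv4.tcp_keepalive_time sysctl (in seconds, default
7200).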

Laszlo