issue with tcp sockets on linux

GH

May 22, 2012, 9:25:29 AM

Hi all, my application has run into an issue where recv() hangs even though the server has already exited. recv() returns neither data nor an error. More puzzling, netstat on the client side shows the connection in the ESTABLISHED state, while netstat on the server side reports no such connection. Can someone shed some light on this or share any experience? My system info is below. Many thanks.
$ uname -rv
2.6.18-194.el5 #1 SMP Fri Apr 2 14:58:14 EDT 2010

Rainer Weikusat

May 22, 2012, 9:57:18 AM

GH <yzg...@gmail.com> writes:
> Hi all, I've run my application into an issue with recv() hanging
> while the server already exited. recv() is not returning any data or
> error. More puzzling is that netstat shows on client side the
> connection in ESTABLISHED state while on server side netstat reports
> no such connection. Can someone shed some light or share any
> experience on this?

The usual cause of this phenomenon would be that the server started
another program which is still running and inherited the established
socket because FD_CLOEXEC wasn't set.
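
A minimal sketch of setting that flag, written in Python because its fcntl module mirrors the C fcntl(fd, F_SETFD, ...) call. The helper names set_cloexec and has_cloexec are made up for illustration. Note that Python 3.4 and later already create sockets with the flag set; a C program has to do this itself (or pass SOCK_CLOEXEC to socket() on Linux).

```python
import fcntl
import socket


def set_cloexec(sock):
    """Set FD_CLOEXEC so a program started via exec() does not inherit sock."""
    fd = sock.fileno()
    flags = fcntl.fcntl(fd, fcntl.F_GETFD)
    fcntl.fcntl(fd, fcntl.F_SETFD, flags | fcntl.FD_CLOEXEC)


def has_cloexec(sock):
    """Return True if FD_CLOEXEC is currently set on sock."""
    flags = fcntl.fcntl(sock.fileno(), fcntl.F_GETFD)
    return bool(flags & fcntl.FD_CLOEXEC)
```

The flag only matters at exec() time. It does nothing for a plain fork(), where the child always receives a copy of every open descriptor.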

GH

May 22, 2012, 2:24:31 PM

Thank you for your reply. It is very insightful, although my situation is not exactly as you described. My app does not call exec() or anything similar. It does conditionally call fork(), and the child does not close() the open fds. But in all cases the child lives only briefly and then exits.

Rainer Weikusat

May 22, 2012, 3:45:17 PM

GH <yzg...@gmail.com> writes:
> On Tuesday, May 22, 2012 9:57:18 AM UTC-4, Rainer Weikusat wrote:
>> The usual cause of this phenomenon would be that the server started
>> another program which is still running and inherited the established
>> socket because FD_CLOEXEC wasn't set.
>
> Thank you for your reply. It is very insightful although my
> situation is not exactly as you described. My app does not call
> exec() or similar ones. It does conditionally call fork() where the
> child does not close() the open fd's.

[Could you please order quoted text and new text in this way?]

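Since descriptors don't change at all across a fork, that obviously
suffers from the same issue: the connection stays open until every
process holding a copy of the descriptor has closed it or exited.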
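
The effect can be reproduced on a loopback connection. This is a sketch under assumptions, not the original application: the function name and the 0.5 second child lifetime are invented for the demonstration. The parent closes its copy of the accepted socket right after fork(), yet the client's recv() only sees end-of-file once the child, which still holds an inherited copy, exits.

```python
import os
import socket
import time


def eof_delay_with_lingering_child(child_lifetime=0.5):
    """Return (data, seconds) for a client recv() issued after the
    server parent closed its socket while a forked child still held it."""
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.bind(("127.0.0.1", 0))          # any free loopback port
    listener.listen(1)
    client = socket.create_connection(listener.getsockname())
    conn, _ = listener.accept()

    pid = os.fork()
    if pid == 0:
        # Child: inherited conn, never closes it, idles briefly, exits.
        time.sleep(child_lifetime)
        os._exit(0)

    # Parent: from its point of view the connection is finished.
    conn.close()
    listener.close()

    start = time.monotonic()
    data = client.recv(1)                    # blocks until the child exits
    waited = time.monotonic() - start

    client.close()
    os.waitpid(pid, 0)
    return data, waited
```

With the default argument, recv() should return b"" after roughly half a second rather than immediately. Closing conn in the child right after fork() removes the delay.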

On Linux, the netstat -np command can be used to determine which
process keeps your connection open.

GH

Jun 4, 2012, 10:01:11 AM

I agree that the problem arose from fork(), where the child does not close the open fds, and I believe I have a solution for that.

What still puzzles me is what I described in my original post: when the child does not close those fds, netstat on the client side shows the connection in the ESTABLISHED state while netstat on the server side reports no such connection. I would expect the kernel on the server to close the socket and recv() on the client to return 0. But what I saw is that the client hangs on recv().

Ersek, Laszlo

Jun 4, 2012, 6:22:28 PM

On Mon, 4 Jun 2012, GH wrote:

> Now what's still puzzling me is as in my original post when the child
> does not close those fds: netstat shows on client side the connection in
> ESTABLISHED state while on server side netstat reports no such
> connection. I would expect that kernel on the server should close the
> socket and recv() on client should return 0. But what I saw is that
> client hangs on recv().

In one case I saw similar symptoms. Both client and server belonged to the
same organization. They had internal NAT. After establishing the
connection and exchanging some data initially, there was no traffic for a
long time. The connection tracking entries in "some" (*) routers between
them had an expiration of 10+ minutes or so. When such an entry was
deleted, both sides remained in ESTABLISHED. As soon as one side wanted to
transmit, it timed out after a while and closed its socket. (The main
problem was that this timeout blocked message transfer for too long.) The
other side could stay in ESTABLISHED indefinitely if it only tried to
read.

(*) We simply could not figure out which router caused the problem, let
alone raise the conntrack entries' lifetime. This was an organizational
problem, not a technical one. Our pain was simply not important enough for
network operations. Modifying the peers didn't appear possible (i.e.
introducing application-level pings); we had the source to neither.

I wrote an SSL socket factory in Java that set the SO_KEEPALIVE socket
option on the client side and delegated the "rest" of the work to the real
SSL socket factory. (J2EE provides some way to "preload" such a factory
IIRC.) Simultaneously the OS keepalive interval (pre-probe wait) was
lowered (from the default 2 hours) to 10 minutes. These kept the unknown
conntrack entries alive, and I was admitted into the department of
clandestine enterprise operations. (Just kidding.)
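
For readers who can modify the client, a rough sketch of the same idea in Python follows. It is not the Java factory described above. The function name and the default values are assumptions, and the per-socket TCP_KEEPIDLE, TCP_KEEPINTVL and TCP_KEEPCNT options are Linux-specific alternatives to lowering the system-wide keepalive time, so they are only applied when the platform exposes them.

```python
import socket


def enable_keepalive(sock, idle_s=600, interval_s=60, probes=5):
    """Turn on TCP keepalive for sock.

    idle_s:     seconds of silence before the first probe (10 minutes here)
    interval_s: seconds between unanswered probes
    probes:     unanswered probes before the connection is reported dead
    """
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    # Per-socket tuning; without it the system-wide defaults apply.
    if hasattr(socket, "TCP_KEEPIDLE"):
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, idle_s)
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPINTVL, interval_s)
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPCNT, probes)
```

Besides refreshing connection tracking entries along the path, keepalive probes also let a peer that only reads notice a dead connection: once the probes go unanswered, a blocked recv() fails with an error instead of waiting forever.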

Laszlo

telsar

Aug 1, 2012, 4:55:00 PM

Perhaps you do not want to make a blocking call to recv() when there is no
data or event to receive. Check first, for example with select() or
poll(), before calling it.
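
One way to do that check, sketched in Python with select() and a timeout. The helper name recv_with_timeout is hypothetical, and returning None on timeout is just one possible convention.

```python
import select
import socket


def recv_with_timeout(sock, nbytes, timeout_s):
    """Wait up to timeout_s for sock to become readable, then recv().

    Returns None on timeout, b"" if the peer closed the connection,
    otherwise the received bytes.
    """
    readable, _, _ = select.select([sock], [], [], timeout_s)
    if not readable:
        return None
    return sock.recv(nbytes)
```

This keeps the caller from blocking indefinitely, but a timeout alone cannot distinguish an idle peer from a vanished one. That still needs keepalive probes or an application-level ping.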

--
Steal a little and go to jail, steal a lot and become King.