
socket flushing/buffering problem, app hangs on close


WC

Feb 1, 2010, 8:29:23 PM
I've written a TCL app that receives data from a single TCP source and
distributes it to multiple TCP receivers using a very simple ASCII
protocol. The server is non-blocking and uses TCL's event loop. Most of
the receivers are not under my control and sometimes behave poorly. This
means I don't have access to the code or the application, and in some
cases not even to the owners of those applications.

Here is my problem.

TCL has called my writable handler, indicating that a channel is ready
for data. I write data to the channel, but at some point the client
stops reading without closing the connection. TCP's flow control kicks
in and data ends up buffered in the receiver's TCP input buffer, my
host's TCP output buffer, and finally my application's TCL channel
output buffer.

If at this point I connect to another port and issue a command for my
application to shut down, it hangs. I forced a core dump and noticed
that it's hanging in send(). The man page for TCL's close indicates that
TCL will put the channel into blocking mode and attempt to flush any
remaining data; the interpreter does this for each open channel when
exit is called. However, if the TCP stack is not accepting data, the
application will never be able to exit, or even close the channel
without exiting, for that matter. This appears to be a pretty serious
bug. I need to 'kill -9' in order to force an exit... very ugly. It
seems what is needed is an option to the close command that discards any
data buffered in the TCL channel's output buffer and closes the channel.

I coded a small extension in C that closes the OS-specific handle for
the channel and then unregisters the channel from the interpreter. This
causes send() to return -1, but the interpreter doesn't care at that
point and shutdown continues successfully.

Anyone else run into this? Am I totally missing something here?

BTW, I'm using TCL 8.4 on Linux and HP-UX, but from a review of the
current 8.5 API it seems this deadlock could still exist.

Any input/ideas are greatly appreciated,
Wayne

tom.rmadilo

Feb 1, 2010, 9:51:00 PM

Right, so it sounds like you wrote an application which gets stuck...
probably due to poor coding. It also sounds like you ran it in the
background, so you couldn't control it except via signals. The TCP
connection should still time out if you let it sit long enough.

BTW, a channel becomes readable/writable if an error occurs; it is
something of a blunt indicator. In this case it sounds like the
application is simply waiting around to send or receive data. I'm not
sure how this adds up to a bug.

WC

Feb 2, 2010, 1:56:35 AM

Did you even read my post or were you just looking for someone to criticize?

1) Backgrounding does not imply that an application can only be
controlled via signals. In fact I'm using a control socket on another
port, as stated in my message, to send the app a stop message. But this
is beside the point; I'm not sure why you brought it up.

2) You need to go back and study blocking sockets. If the remote end
stops reading data while the TCP buffers on both ends are full and you
attempt to write more data, the writing end will block until the remote
end either begins to read data, thus draining those buffers, or simply
closes the connection. Neither of which is happening. There is no
timeout to wait for; TCP is operating as designed in this case.

3) I know about read/write handlers; I have both installed on these
channels. The write handler is not getting called because the remote end
is not reading, and the read handler is not getting called because the
remote end is neither closing the socket nor sending my application any
data. I know this because netstat on my host system shows around 40K in
the TCP write queue and the connection in the ESTABLISHED state.

Perhaps "bug" is a strong word, it appears that TCL is operating as
designed but there should be a way to close an output channel and
instruct TCL to just discard any data that it has left and not attempt
to send it for the exact reason cited above. It does not sound like a
good design if a remote machine can cause my application to hang while
attempting to close a channel or exit the application simply becuase the
interpreter mandates that it must flush all data from it's queues.

David Gravereaux

Feb 2, 2010, 2:34:09 AM
Can't you just close them manually? Off hand:

foreach sock [chan names sock*] {
    # enables dump on close
    fconfigure $sock -blocking no
    close $sock
}



WC

Feb 2, 2010, 2:52:10 AM
Unfortunately not. If I do this while the application is running and I
leave the socket non-blocking, TCL will return from close immediately
and try to flush the data in the background. So the script layer
"thinks" it's closed, but a file descriptor is forever allocated to the
interpreter. Many opens and closes with the bad server eventually cause
file descriptor starvation in the process.

When the application finally attempts to exit, it hangs, since it is the
interpreter's policy to flush and close all open channels before it
exits. So all those background tasks prevent it from exiting.

If I put the channel in blocking mode as you suggest above, I don't even
get the benefit of the interp attempting to close the channel in the
background. It hangs on the close until the other side reads the data or
terminates the connection, which means that none of my other socket
handlers are being serviced, as they are in the non-blocking scenario.
Essentially the application gives the impression that it is locked at
this point.

I appreciate the suggestion though!
Thanks.

Uwe Klein

Feb 2, 2010, 3:49:48 AM
WC wrote:
> Unfortunately not, if I do this while the application is running and I
> leave the socket non-blocking. TCL will return from close immediately
> and try to flush the data in the background. So the script layer
> "thinks" it's closed but a file descriptor is forever allocated to the
> interpreter. Many opens and closes with the bad server eventually causes
> file descriptor starvation in the process.

N.B.
I once wrote a "self scriptable" ( not tcl ;-) multiplexer in
C for distributing messages ( duplicating, logging ) between different
processes. ( Most processes were run-of-the-mill tty/cmdline oriented programs )
Outgoing problems were handled via SIGPIPE. ( would not have caught an
unresponsive client either )

Limit the problem: don't try to reconnect? A dead client is dead, dead, dead.
Would that work for you?
Limit buffering space?


uwe

PaulWalton

Feb 2, 2010, 4:19:16 AM

Why does this work?

Interp 1:
% socket -server accept 1515
sock5
% proc accept {socket clientAddr clientPort} {
puts "Accepted $socket."
puts $socket "hello"
return
}
% after 60000 exit
after#0
% vwait forever
Accepted sock7.
Accepted sock8.
MacBookPro:~ paul$

Interp 2:
% socket localhost 1515
sock5
% close sock5
% socket localhost 1515
sock5
% close sock5
% exit


I ran 'exit' in Interp 2 before the 'after' was triggered in Interp 1.
As you can see tclsh exits fine for me. Or is there a flaw in this
test? I'm on Mac OS 10.4.

WC

Feb 2, 2010, 9:56:36 AM
PaulWalton wrote:

>
> Why does this work?
>
> Interp 1:
> % socket -server accept 1515
> sock5
> % proc accept {socket clientAddr clientPort} {

> ...
> ...
> ...


>
>
> I ran 'exit' in Interp 2 before the 'after' was triggered in Interp 1.
> As you can see tclsh exits fine for me. Or is there a flaw in this
> test? I'm on Mac OS 10.4.

Hi Paul,

Well, you're close, but it is not a valid test. The TCL IO system's
buffers were able to flush before it exited; you sent a very small
amount of data. Though you didn't do a read in your client app, TCL was
able to clear its IO buffers down the TCP stack, so the OS close
succeeds.

My application is streaming data to a number of clients, and it is not
unusual for it to build up half a meg of data rather quickly. For the
test to be valid, the receiver's TCP input queue needs to be full as
well as the sender's TCP output queue. With modern TCP stacks this can
be several hundred K of data. Only then will TCL begin to buffer data in
its interp's IO buffers, and that will definitely cause TCL to block
when attempting to clear those buffers.

Thanks,
Wayne
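As an aside, the condition Wayne describes is easy to reproduce at the socket level without a misbehaving client. A minimal sketch in Python (chosen only for brevity; the thread's application is Tcl, and nothing here is from the original code) sets up a loopback TCP pair whose reader never calls recv(), then writes non-blocking until the kernel refuses more data, which is exactly the point at which a blocking flush-on-close would hang:

```python
import socket

# Loopback TCP pair; the accepted side never reads, standing in for a
# stalled client.
srv = socket.socket()
srv.bind(("127.0.0.1", 0))
srv.listen(1)
writer = socket.create_connection(srv.getsockname())
reader, _ = srv.accept()          # reader never calls recv()

# Write non-blocking until TCP flow control pushes back: at that point
# both the receiver's input buffer and the sender's output buffer are
# full, and a blocking write (or a flushing close) would hang forever.
writer.setblocking(False)
total = 0
try:
    while True:
        total += writer.send(b"x" * 4096)
except BlockingIOError:
    pass                          # kernel buffers full on both ends

print("kernel accepted", total, "bytes before EWOULDBLOCK")
for s in (writer, reader, srv):
    s.close()
```

The printed total is roughly the "several hundred K" Wayne mentions, depending on the platform's default socket buffer sizes.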

WC

Feb 2, 2010, 10:50:31 AM
Uwe Klein wrote:
>
> n.B.
> I once wrote a "self scriptable" ( not tcl ;-) multiplexer in
> C for distributing messages ( duplicating, logging) between different
> processes. ( Most processes where run of the mill tty/cmdline oriented
> programms )
> Outgoing problems were handled via sigpipe. ( would not have caught an
> unresponsive client either )
>
> Limit the problem. don't try to reconnect? A dead client is dead, dead,
> dead
> Would that work for you?
> Limit buffering space?
>
>
> uwe

LOL, that would work for me... but not for my boss :( We get paid for
the data we send them. Yes, this is an annoying scenario, since the
problem is the customer's application. But it is what it is.

I'm attempting to replace a version of this same application that I
wrote in C a few years ago, but it doesn't make a very strong case if I
need to include a C extension with the script in order to terminate
badly behaving clients :( The C application happily frees its back
queue, sets the TCP linger timer to 0, and closes the socket. It then
reconnects and sends the client data... until the client app stops
responding again.
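The "linger timer to 0" trick mentioned above translates directly to the sockets API: with SO_LINGER enabled and a zero timeout, close() discards any unsent data queued in the kernel and resets the connection instead of blocking to flush it. A hedged sketch in Python (illustrative only; the original application is C):

```python
import socket
import struct

# Loopback pair standing in for a client that never reads.
srv = socket.socket()
srv.bind(("127.0.0.1", 0))
srv.listen(1)
s = socket.create_connection(srv.getsockname())
peer, _ = srv.accept()

# SO_LINGER with l_onoff=1, l_linger=0: close() drops queued data and
# sends a RST rather than waiting for the peer to drain the connection.
s.setsockopt(socket.SOL_SOCKET, socket.SO_LINGER, struct.pack("ii", 1, 0))
onoff, secs = struct.unpack(
    "ii", s.getsockopt(socket.SOL_SOCKET, socket.SO_LINGER, 8))

# Queue data the peer will not read, without blocking ourselves.
s.setblocking(False)
try:
    while True:
        s.send(b"x" * 4096)
except BlockingIOError:
    pass

s.close()    # returns immediately; unsent data is discarded, RST sent
peer.close()
srv.close()
```

This is the semantics the thread wishes Tcl's [close] could opt into.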

Alexandre Ferrieux

Feb 2, 2010, 11:23:12 AM
On Feb 2, 2:29 am, WC <wcu...@cox.net> wrote:
>
> [...]

> will put the channel into blocking mode and attempt to flush the channel
> of any remaining data, the interpreter does this for each open channel
> when exit is called. However if the TCP stack is not accepting data the
> application will never be able to exit or close channels without exiting
> for that matter. This appears to be a pretty serious bug. I need to
> 'kill -9' in order to force an exit... very ugly. Seems like what is
> needed is an option to the close command to discard any data buffered in
> the TCL channel's output buffer and close the channel.
>
> I coded a small extension in C that closes the OS specific handle for
> the channel and the unregisters the channel from the interpreter. This
> causes send() to return -1 but the interpreter doesn't care at that
> point and shutdown continues successfully.
>
> Anyone else run into this? I'm I totally missing something here?
>
> BTW I'm using TCL 8.4 on Linux and HP-UX but a review of the current 8.5
> API it seems like this deadlock could still exist.
>
> Any input/ideas are greatly appreciated,
> Wayne

You are absolutely right: that's a design flaw. If our hands were
free, we'd fix that instantly. The problem is the existing base of Tcl
apps... So we can only extend, not reform. Something like [chan
unflush], or [chan discard].

You can file a TIP for that; however, in the meantime, you can use
the following workaround:

set ff [open "|cat >@ $sok" w]
# do writes on $ff, reads on $sok
# you can still fconfigure $ff -blocking 0

# now assume it's time to close
exec kill -INT [pid $ff]
catch {close $ff}

Ugly, eh? Yup. Just one percent simpler than an extension ;-)

-Alex


tom.rmadilo

Feb 2, 2010, 12:33:35 PM
On Feb 2, 8:23 am, Alexandre Ferrieux <alexandre.ferri...@gmail.com>
wrote:


Why not try using [chan pending]?

In my recent experiment with htclient, I found that the only way to
avoid failure on read (a potential DOS attack) was to read only the
bytes available in the tcl buffer.

The biggest deficit in the Tcl channel code is the lack of timeouts,
but the manpage for [chan puts] indicates that applications should
take care not to push too much data into the output channel with each
writable event.

Until the actual code is posted, it's hard to say whether this is a Tcl
failing, or exactly what the failure is.

PaulWalton

Feb 2, 2010, 1:46:09 PM

Thank you for the explanation.

Wayne

Feb 2, 2010, 1:49:05 PM
On Feb 2, 11:23 am, Alexandre Ferrieux <alexandre.ferri...@gmail.com>
wrote:
>

> You are absolutely right: that's a design flaw. If our hands were
> free, we'd fix that instantly. The problem is the existing base of Tcl
> apps... So we can only extend, not reform. Something like [chan
> unflush], or [chan discard].
>
> You can file a TIP for that; however,  in the meantime, you can use
> the following workaround:
>
>  set ff [open "|cat >@ $sok" w]
>  # do writes on $ff, reads on $sok
>  # you can still fconfigure $ff -blocking 0
>
>  # now assume it's time to close
>  exec kill -INT [pid $ff]
>  catch {close $ff}
>
> Ugly, eh ? Yup. Just one percent simpler than an extension ;-)
>
> -Alex

Alex,

Ugly... but clever my man!! Very Nice:) Do you think piping to
something like netcat would work bi-directionally so I can stick with
a single channel?

Ok, I will file a TIP for that as I feel it is very important to have
some mechanism in place for this condition. Either a separate function
as you mentioned or a flag to close. This would allow backward
compatibility.

close -noflush $chan

Thanks Alex,

Wayne

Wayne

Feb 2, 2010, 2:10:50 PM
On Feb 2, 12:33 pm, "tom.rmadilo" <tom.rmad...@gmail.com> wrote:
>
> Why not try using [chan pending ]?
>
> In my recent experiment with htclient, I found that the only way to
> avoid failure on read (potential DOS attack) was to read only bytes
> available in the tcl buffer.
>
> The biggest deficit in the Tcl channel code is the lack of timeouts,
> but the manpage for [chan puts] indicates that applications should
> take care to not push too much data into the output channel with each
> writable event.
>
> Until the actual code is posted, hard to say this is a tcl failing, or
> exactly what the failure is.

Currently I'm only concerned with the [puts] case as I'm not reading
from clients, unless they close the connection.

Yeah, I saw that in the manpage. The problem with verbiage like that is:
how much is "too much"? In fact, just 1 byte in the interp's IO buffers
will cause the interp to block.

[chan pending] won't help, because by the time there is data in the
interp's buffer it's too late :(

I'm stuck with 8.4 on my production system right now :(... so no [chan]
command for me.

It would be nice if one could turn off TCL buffering completely. I have
tried [fconfigure $chan -buffering none], but in non-blocking mode the
interp still accepts data via [puts], and this is stated in the [puts]
manpage.

I understand they can't change the semantics, since this API has
existed for many releases, but I would have expected that if you
disable buffering and attempt to write to a non-blocking channel which
cannot accept data, [puts] would return just the number of bytes it was
able to write to the channel. In that case the application could handle
the buffering instead of the interp, which is exactly what my C
implementation does.
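The application-side buffering Wayne describes his C implementation doing can be sketched as follows, in Python for brevity. The class and method names are invented for illustration and are not from the original C code: write what the kernel will take, queue the rest in the application, and discard that queue at will when a client misbehaves.

```python
import socket
from collections import deque

class OutQueue:
    """Application-level output buffer for a non-blocking socket."""

    def __init__(self, sock):
        self.sock = sock            # a non-blocking socket
        self.pending = deque()      # data the kernel has not yet accepted

    def write(self, data):
        self.pending.append(data)
        self.pump()

    def pump(self):
        # Call from the writable handler: push as much as the kernel accepts.
        while self.pending:
            try:
                n = self.sock.send(self.pending[0])
            except BlockingIOError:
                return              # kernel buffer full; wait for writable
            if n < len(self.pending[0]):
                self.pending[0] = self.pending[0][n:]
                return              # partial write; keep the remainder
            self.pending.popleft()

    def discard_and_close(self):
        # The operation the thread wants from [close]: drop queued
        # data, then close, regardless of what the peer is doing.
        self.pending.clear()
        self.sock.close()

# socketpair stands in for a TCP connection to a stalled client.
a, b = socket.socketpair()
a.setblocking(False)
q = OutQueue(a)
q.write(b"x" * 1_000_000)           # far more than the kernel will accept
q.discard_and_close()               # instant, no flush attempt
b.close()
```

Because the interpreter's buffers stay empty, nothing can make close or exit block; the queue lives where the application can drop it.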

Alexandre Ferrieux

Feb 2, 2010, 2:27:26 PM
On Feb 2, 7:49 pm, Wayne <wcu...@gmail.com> wrote:
> On Feb 2, 11:23 am, Alexandre Ferrieux <alexandre.ferri...@gmail.com>
> wrote:
>
> > You are absolutely right: that's a design flaw. If our hands were
> > free, we'd fix that instantly. The problem is the existing base of Tcl
> > apps... So we can only extend, not reform. Something like [chan
> > unflush], or [chan discard].
>
> > You can file a TIP for that; however,  in the meantime, you can use
> > the following workaround:
>
> >  set ff [open "|cat >@ $sok" w]
> >  # do writes on $ff, reads on $sok
> >  # you can still fconfigure $ff -blocking 0
>
> >  # now assume it's time to close
> >  exec kill -INT [pid $ff]
> >  catch {close $ff}
>
> > Ugly, eh ? Yup. Just one percent simpler than an extension ;-)
>
> > -Alex
>
> Alex,
>
> Ugly... but clever my man!! Very Nice:) Do you think piping to
> something like netcat would work bi-directionally so I can stick with
> a single channel?

Oh yes of course, if you don't absolutely want Tcl sockets or close
monitoring of the connection attempt, [open "|nc ..." r+] will work
similarly.

> Ok, I will file a TIP for that as I feel it is very important to have
> some mechanism in place for this condition. Either a separate function
> as you mentioned or a flag to close. This would allow backward
> compatibility.
>
> close -noflush $chan

Not sure what you mean by backward compat here, since the current
[close] syntax doesn't have prefix options, but yes, I like this
syntax too :)

-Alex
