
Hitting weird limit in Tcl 8.4.19


Adrian Ho

Aug 25, 2010, 5:13:59 AM
One of my client's Tcl-based servers recently experienced strange
hangups, and I was asked to help them diagnose the issue. I wrote the
scripts below to see what happens to a simple echo server when thousands
of echo clients try to connect to it.

I ran the server script on a Linux box, carefully running "ulimit -n
2048" beforehand to bump up the max number of open files to 2048.

I then ran the client script on a second Linux box with "-instances
1000". The server handled all 1000 connections without any issues,
merrily bouncing messages back and forth.

I then ran the client script on a third Linux box, also with "-instances
1000". To my surprise, the server seemed to lock up after the first 21
connections from the second client. In particular, the "MARK" log line
that the server prints every 10 seconds stops appearing, as if the
server's Tcl event loop simply stopped working.

Questions:

[1] What exactly is happening here? It seems as if I hit a limit related
to 1024 open file descriptors, but as I've already bumped the process FD
limit to 2048 *and* the Tcl event loop stopped working, there's evidently
more than meets the eye.

[2] If this is not a bug on my part, is it a known issue? If so, is it
fixed in a later Tcl version? (It's unlikely that my client will be
willing to upgrade their existing Tcl base, so this is just for my own
reference.)

Thanks much for any light shed on this situation.

Best Regards,
Adrian


======================= test-tcp-cli.tcl ===========================
#!/usr/bin/env bash
# \
exec /export/home/invantest/tools/tcl/8.4.19.20091124/bin/tclsh "$0" "$@"

array set ::config {
-host localhost
-port 5746
-instances 10
}

if { $argc > 0 } {
array set ::config $argv
}

array set ::fp {}
array set ::stats {}

proc Log {msg} {
puts stderr "[clock format [clock seconds] -format {%Y%m%d.%H%M%S}]
$msg"
}

proc ResetStats {i} {
set ::stats(push,${i}) 0
set ::stats(read,${i}) 0
}

proc PrintStats {i} {
Log "REPORT: $i: push $::stats(push,${i}), read $::stats(read,${i})"
ResetStats $i
}

proc ReportStats {i} {
PrintStats $i
after 10000 ReportStats $i
}

proc PushData {i} {
if {[catch {puts $::fp($i) "Test string for $i"; flush $::fp($i)} err]} {
Log "SENDERR: $i: $err"
} else {
incr ::stats(push,$i)
}
}

proc ReadData {i} {
if {[eof $::fp($i)]} {
Log "DISCONN: $i"
catch {close $::fp($i)}
PrintStats $i
after cancel ReportStats $i
OpenSocket $i
} else {
incr ::stats(read,$i)
gets $::fp($i)
after 1 [list PushData $i]
}
}

proc OpenSocket {i} {
if {[catch {set ::fp($i) [socket $::config(-host) $::config(-port)]} err]} {
Log "CONNERR: $i: $err"
} else {
Log "CONNECT: $i"
ResetStats $i
after 10000 ReportStats $i
fileevent $::fp($i) read [list ReadData $i]
puts $::fp($i) "CLIENT $i"
after 1 [list PushData $i]
}
}

set prefix [info hostname]

for {set i 0} {$i < $::config(-instances)} {incr i} {
OpenSocket ${prefix}-${i}
}

vwait ::forever
====================================================================

======================= test-tcp-srv.tcl ===========================
#!/usr/bin/env bash
# \
exec /export/home/invantest/tools/tcl/8.4.19.20091124/bin/tclsh "$0" "$@"

array set ::config {
-disconnect_secs 60
-port 5746
}

if { $argc > 0 } {
array set ::config $argv
}

array set ::fp {}
array set ::cli {}
array set ::stats {}

proc Log {msg} {
puts stderr "[clock format [clock seconds] -format {%Y%m%d.%H%M%S}]
$msg"
}

proc IncomingConn {chan addr port} {
set ::fp($chan) "${addr}:${port}"
ResetStats $chan
fileevent $chan read [list EchoData $chan]
Log "CONNECT: $::fp($chan) ([array size ::fp] total)"
}

proc RegisterClient {name chan} {
set ::cli($chan) $name
Log "REGISTER: $chan = $name"
}

proc ResetStats {chan} {
set ::stats($chan) 0
}

proc PrintStats {chan} {
if {[catch {set ::cli($chan)} cli]} {
Log "REPORT_ERR: $cli"
} else {
Log "REPORT: $cli\($chan\): $::stats(${chan})"
ResetStats $chan
}
}

proc EchoData {chan} {
if {[eof $chan]} {
Log "EOF: $::fp($chan)"
Disconnect $chan
} else {
set line [gets $chan]
switch -glob $line {
"CLIENT *" {
RegisterClient [lindex $line 1] $chan
}
default {
puts $chan $line
flush $chan
incr ::stats($chan)
}
}
}
}

proc Disconnect {chan} {
set conninfo $::fp($chan)
if {[catch {set ::cli($chan)} cli]} {
set cli ""
} else {
set cli "(${cli})"
}
if {[catch {close $chan} err]} {
Log "DISCONN: ${conninfo}${cli} close error: $err"
}
PrintStats $chan
unset ::fp($chan)
catch {unset ::cli($chan)} ;# may not exist if the client never sent "CLIENT"
unset ::stats($chan)
Log "DISCONN: ${conninfo}${cli} ([array size ::fp] total)"
}

proc DisconnectAll {} {
foreach chan [array names ::fp] {
Disconnect $chan
}
after [expr $::config(-disconnect_secs) * 1000] DisconnectAll
}

socket -server IncomingConn $::config(-port)

DisconnectAll

vwait ::forever
====================================================================

Gerald W. Lester

Aug 25, 2010, 7:15:01 AM
Which OS?

>...

--
+------------------------------------------------------------------------+
| Gerald W. Lester, President, KNG Consulting LLC |
| Email: Gerald...@kng-consulting.net |
+------------------------------------------------------------------------+

Alexandre Ferrieux

Aug 25, 2010, 5:56:45 PM
On Aug 25, 11:13 am, Adrian Ho <lexfi...@gmail.com> wrote:
>
> [1] What exactly is happening here?  It seems as if I hit a limit related
> to 1024 open file descriptors, but as I've already bumped the process FD
> limit to 2048 *and* the Tcl event loop stopped working, there's evidently
> more than meets the eye.

That's the well-known problem of the fd_setsize limitation,
effectively putting an upper bound of 1024 on select-based monitoring
(Tcl belongs to this class). So basically, even though you can open
tens of thousands of fds, only 0..1023 will be amenable to fileevent :(

Various ideas have been proposed to circumvent that, mainly with
epoll(), but nothing concrete has been committed yet, sorry. In the
meantime you can do what I do: use epoll() in a separate process and
delegate IO to it.
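
Seen from the Tcl side, the delegation can be as simple as a single pipe
to the helper. A rough sketch (the "epoll-helper" program and its
line-per-event framing are purely illustrative, not an existing tool):

# The hypothetical helper owns the thousands of client sockets and
# multiplexes them with epoll(), writing one line per client event.
# The Tcl process only ever fileevents on this one channel.
set helper [open "|./epoll-helper -port 5746" r+]
fconfigure $helper -buffering line -blocking 0
fileevent $helper readable [list HelperEvent $helper]

proc HelperEvent {chan} {
    if {[gets $chan line] < 0} {
        if {[eof $chan]} { close $chan }
        return
    }
    # Framing is whatever you define between the two processes,
    # e.g. "<client-id> <payload>"; dispatch on it here.
    puts "from helper: $line"
}

vwait ::forever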

-Alex

Adrian Ho

Aug 26, 2010, 2:06:04 AM
On Wed, 25 Aug 2010 06:15:01 -0500, Gerald W. Lester wrote:

> Which OS?

Linux all around.

Best Regards,
Adrian

Adrian Ho

Aug 26, 2010, 2:24:52 AM
On Wed, 25 Aug 2010 14:56:45 -0700, Alexandre Ferrieux wrote:

> On Aug 25, 11:13 am, Adrian Ho <lexfi...@gmail.com> wrote:
>>
>> [1] What exactly is happening here?  It seems as if I hit a limit
>> related to 1024 open file descriptors, but as I've already bumped the
>> process FD limit to 2048 *and* the Tcl event loop stopped working,
>> there's evidently more than meets the eye.
>
> That's the well-known problem of the fd_setsize limitation, effectively
> putting an upper bound of 1024 on select-based monitoring (Tcl belongs
> to this class). So basically, even though you can open tens of thousands
> of fds, only 0..1023 will be amenable to fileevent : (

Thanks Alex! I'd forgotten about the fd_setsize limit myself, having
used poll() in favor of select() for a long time now.

That said, why would this limit b0rk the *Tcl event loop*? At the very
least, I would've expected [fileevent] to throw an error when going over
this limit, rather than a catastrophic failure of all event servicing.

> Various ideas have been proposed to circumvent that, mainly with
> epoll(), but nothing concrete has been committed yet, sorry. In the
> meantime you can do what I do: use epoll() in a separate process and
> delegate IO to it.

I'll certainly keep that in mind. Thanks again!

Best Regards,
Adrian

Donal K. Fellows

Aug 26, 2010, 4:42:41 AM
On 25 Aug, 22:56, Alexandre Ferrieux <alexandre.ferri...@gmail.com> wrote:

> Various ideas have been proposed to circumvent that, mainly with
> epoll(), but nothing concrete has been committed yet, sorry. In the
> meantime you can do what I do: use epoll() in a separate process and
> delegate IO to it.

I started to write an alternate notifier that uses those more recent
APIs but it's rather difficult coding (well, it is if you're only
working on it very part time) so it got pushed on to my back-burner.
Probably permanently. (It doesn't help that I'm using OSX Leopard at
the moment, where many of the non-select() APIs are actually broken
due to limitations in kqueue() w.r.t. terminals. Snow Leopard fixes
that, but I've not upgraded.)

Donal.

Alexandre Ferrieux

Aug 26, 2010, 8:20:19 AM
On Aug 26, 8:24 am, Adrian Ho <lexfi...@gmail.com> wrote:
>
> That said, why would this limit b0rk the *Tcl event loop*?  At the very
> least, I would've expected [fileevent] to throw an error when going over
> this limit, rather than a catastrophic failure of all event servicing.

Yes, I would've expected it too :)
Unfortunately, one minute of source diving shows that one internal
function on the call path through the various abstraction layers ...
is not equipped to report failure :(

void
Tcl_CreateFileHandler(
    int fd,                     /* Handle of stream to watch. */
    int mask,                   /* OR'ed combination of TCL_READABLE,
                                 * TCL_WRITABLE, and TCL_EXCEPTION: indicates
                                 * conditions under which proc should be
                                 * called. */
    Tcl_FileProc *proc,         /* Function to call for each selected
                                 * event. */
    ClientData clientData);     /* Arbitrary data to pass to proc. */

(and it's stubbed, to make things worse...)

One thing that comes to mind, though, is to Tcl_Panic, but I somehow
shied away from proposing the patch since it may break some long-
running application (with an abort()) that formerly "survived" fds
above 1023 being silently ignored...

-Alex

Alexandre Ferrieux

Aug 26, 2010, 8:26:05 AM
On Aug 26, 10:42 am, "Donal K. Fellows" wrote:

Same here :)
I started coding a hybrid notifier that would start its life in
good ol' select() mode due to its economy of fds (yes, epoll()
consumes one fd, which means +33% for the simplest processes), and
switch to epoll() once an fd above a given threshold is given to
fileevent (it might even be a good idea to set this threshold much
lower than 1024 for performance reasons, since the select() method is
riddled with linear scans of the masks).
But I never finished (backburner syndrome), since the "external epoll"
was a *much* simpler solution to my real-life needs :/ Selfish, eh?

-Alex

DTM

Aug 27, 2010, 11:18:11 PM
On Aug 25, 5:13 am, Adrian Ho <lexfi...@gmail.com> wrote:
> One of my client's Tcl-based servers recently experienced strange
> hangups, and I was asked to help them diagnose the issue.  I wrote the
> scripts below to see what happens to a simple echo server when thousands
> of echo clients tried to connect to it.
>
> I ran the server script on a Linux box, carefully running "ulimit -n
> 2048" beforehand to bump up the max number of open files to 2048.
>
> I then ran the client script on a second Linux box with "-instances
> 1000".  The server handled all 1000 connections without any issues,
> merrily bouncing messages back and forth.
>
> I then ran the client script on a third Linux box, also with "-instances
> 1000".  To my surprise, the server seemed to lock up after the first 21
> connections from the second client.  In particular, the "MARK" log line
> that the server prints every 10 seconds stops appearing, as if the
> server's Tcl event loop simply stopped working.
>
> Questions:
>
> [1] What exactly is happening here?  It seems as if I hit a limit related
> to 1024 open file descriptors, but as I've already bumped the process FD
> limit to 2048 *and* the Tcl event loop stopped working, there's evidently
> more than meets the eye.
>
> [2] If this is not a bug on my part, is it a known issue?  If so, is it
> fixed in a later Tcl version?  (It's unlikely that my client will be
> willing to upgrade their existing Tcl base, so this is just for my own
> reference.)
>
Well, to those of us who write client/server applications for Unix/
Linux this is all very expected.

Things seem to hang up after you have opened 1024 sockets on the
server. This happens to be the default maximum number of simultaneously
open sockets on a Linux box. The clients are just waiting for a socket
connection that can't be made since the server ran out of sockets.

The operating system can be reconfigured to allow up to 65536
simultaneously open sockets. However, most programmers just close the
sockets when they are no longer needed and 1024 ends up being enough
sockets.

Dennis LaBelle

Adrian Ho

Aug 30, 2010, 3:05:14 AM
On Fri, 27 Aug 2010 20:18:11 -0700, DTM wrote:

> Things seem to hangs up after you have opened 1024 sockets on the
> server. This happens to be the default maximum number of simultaneous,
> open sockets on a Linux box. The clients are just waiting for a socket
> connection that can't be made since the server ran out of sockets.

Actually, the clients are working just fine (i.e. properly failing to
connect). It's the *server* that locked up, specifically its Tcl event
loop. Also, as I mentioned, I'd already bumped the server process's open
FD limit to 2048, and used two clients so that neither one exceeded 1000
open connections.

I'm convinced that this is actually a Tcl bug, but given Alex's
observations elsewhere in this thread, I hold little hope of this being
fixed soon, if at all.

My major concern at this point is whether this event loop issue is
restricted to registering too many file event sources, or can also happen
under some other event-related conditions, as all my client's Tcl
programs (and my own, for that matter) are completely event-driven. Anyone
in the know care to comment?

Best Regards,
Adrian

Alexandre Ferrieux

Aug 30, 2010, 3:30:16 AM
On Aug 30, 9:05 am, Adrian Ho <lexfi...@gmail.com> wrote:
>
> My major concern at this point is whether this event loop issue is
> restricted to registering too many file event sources, or can also happen
> under some other event-related conditions, as all my client's Tcl
> programs (and my own, for that matter) are completely event-driven.  Any
> in the know care to comment?

Warning: it is not "registering too many fileevents", it can hit you
even if you open 1021 channels (assuming stdin/out/err are already
there), do absolutely nothing with them, and then open a 1022nd one
and add a single fileevent to it!

Summary: no fd above fd_setsize-1 (==1023) can currently be
fileevent'ed on, period.

Note that it is strictly a select() limitation, and its script-level
impact is mostly restricted to fileevents, though you can imagine
contorted situations where an extension uses an internal fd (eg the X
socket in Tk), but comes to life at a point where all fd slots
[0..1023] are in use, so this fd will be >=1024... Contorted indeed:
extensions mostly do their stuff at init time, well before activity
gets a chance to consume the fd space. It can still be a concern if
the extension creates fds later in its life (eg opening a device on
demand). YMMV.
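
If you want to see it for yourself, a minimal repro along these lines
should do (untested sketch; it assumes the process fd limit has already
been raised above 1024, e.g. with "ulimit -n 2048"):

proc Heartbeat {} {
    puts "heartbeat [clock format [clock seconds] -format %H:%M:%S]"
    after 1000 Heartbeat
}
Heartbeat

# Burn fd slots up to 1023 without registering any fileevents on them.
for {set i 0} {$i < 1021} {incr i} {
    lappend ::burned [open /dev/null r]
}

# This channel should land on fd >= 1024; a single fileevent on it is
# enough to make select()-based notification misbehave (the heartbeat
# stops and/or a handler blocks, cf. Adrian's vanishing periodic log line).
set ::extra [open /dev/null r]
fileevent $::extra readable {puts "readable fired on the over-limit channel"}

vwait ::forever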

-Alex

Adrian Ho

Aug 30, 2010, 4:13:29 AM
On Mon, 30 Aug 2010 00:30:16 -0700, Alexandre Ferrieux wrote:

> On Aug 30, 9:05 am, Adrian Ho <lexfi...@gmail.com> wrote:
>>
>> My major concern at this point is whether this event loop issue is
>> restricted to registering too many file event sources, or can also
>> happen under some other event-related conditions, as all my client's
>> Tcl programs (and my own, for that matter) are completely event-driven.
>>  Any in the know care to comment?
>
> Warning: it is not "registering too many fileevents", it can hit you
> even if you open 1021 channels (assuming stdin/out/err are already
> there), do absolutely nothing with them, and then open an 1022nd one and
> add a single fileevent to it !

Oddly enough, I'd just modified my test client to connect but not send
anything, then ran the same test again while strace'ing the server. Sure
enough, here's what strace logged:

5258 15:35:52.064705 gettimeofday({1283153752, 64746}, NULL) = 0
5258 15:35:52.064803 select(1024, [3 4 5 ... 1021 1022 1023], [], [], {6, 71149}) = 1 (in [3], left {6, 71149})
5258 15:35:52.066910 gettimeofday({1283153752, 66967}, NULL) = 0
5258 15:35:52.067069 accept(3, {sa_family=AF_INET, sin_port=htons(35598), sin_addr=inet_addr("192.168.1.8")}, [16]) = 1024
5258 15:35:52.067281 fcntl64(1024, F_SETFD, FD_CLOEXEC) = 0
5258 15:35:52.067624 gettimeofday({1283153752, 67672}, NULL) = 0
5258 15:35:52.067887 write(2, "20100830.153552 CONNECT: 192.168"..., 55) = 55
5258 15:35:52.068081 write(2, "\n", 1) = 1
5258 15:35:52.068199 gettimeofday({1283153752, 68240}, NULL) = 0
5258 15:35:52.068297 select(1025, [3 4 5 ... 1021 1022 1023 1024], [0], [1024], {6, 67655}) = 2 (in [3 1024], left {6, 67655})
5258 15:35:52.070400 gettimeofday({1283153752, 70456}, NULL) = 0
5258 15:35:52.070732 recv(1024,

Looks like the select() call only manipulated the first 1024 fd bits,
hence the call to recv() on an FD which blocks with nothing to read...

So, until/unless Tcl switches to poll() and its ilk, it's clearly
important to enforce a 1024-open-FD limit on the process.

I now have the unhappy task of asking the client to either re-engineer
their protocol to limit the number of simultaneous connections, or write
a poll()-based proxy. 8-)
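
(A crude script-level stopgap I may try in the meantime: refuse
connections in the accept callback before they ever reach [fileevent].
Untested sketch; the 1000 ceiling is just an arbitrary margin below 1024:)

proc IncomingConn {chan addr port} {
    # Keep well clear of FD_SETSIZE: drop the connection before any
    # fileevent is registered on an over-limit fd.
    if {[array size ::fp] >= 1000} {
        Log "REJECT: ${addr}:${port} ([array size ::fp] already open)"
        catch {close $chan}
        return
    }
    set ::fp($chan) "${addr}:${port}"
    ResetStats $chan
    fileevent $chan read [list EchoData $chan]
    Log "CONNECT: $::fp($chan) ([array size ::fp] total)"
}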

Best Regards,
Adrian

Alexandre Ferrieux

Aug 30, 2010, 4:33:33 AM
On Aug 30, 10:13 am, Adrian Ho <lexfi...@gmail.com> wrote:
>
> So, it's clearly important to ensure that, until/unless Tcl switches to
> poll() and its ilk, forcing a 1024-open-FD limit is the right thing to do.

Glad to see you believe me now ;-)
(I agree, firsthand experience is unbeatable)

> I now have the unhappy task of asking the client to either re-engineer
> their protocol to limit the number of simultaneous connections, or write
> a poll()-based proxy.  8-)

By the way, do the established connection contexts have to share a
central process?
If they don't (ie they are independent), then why not just make your
server process an inetd/xinetd delegate? Or even do all this in HTTP
flavour and make it a CGI, served by Apache (for example)?
Granted, this means a fork/exec per client, but avoids the dangerous
fd concentration...

-Alex

Adrian Ho

Aug 30, 2010, 5:07:14 AM
On Mon, 30 Aug 2010 01:33:33 -0700, Alexandre Ferrieux wrote:

> On Aug 30, 10:13 am, Adrian Ho <lexfi...@gmail.com> wrote:
>>
>> So, it's clearly important to ensure that, until/unless Tcl switches to
>> poll() and its ilk, forcing a 1024-open-FD limit is the right thing to
>> do.
>
> Glad to see you believe me now ;-)
> (I agree, firsthand experience is unbeatable)

I never disbelieved you, but I was concerned that it was not the only way
to b0rk the event loop. My strace session was to get some visibility
about what exactly was happening, and unless I misread the output, it
actually wasn't the event loop that "died", but a blocking read that was
incorrectly triggered due to "undefined behavior" in select().

>> I now have the unhappy task of asking the client to either re-engineer
>> their protocol to limit the number of simultaneous connections, or
>> write a poll()-based proxy.  8-)
>
> By the way, do the established connection contexts have to share a
> central process ?

Yes for performance reasons, but I can't say more without violating their
NDA. 8-)

Best Regards,
Adrian

Alexandre Ferrieux

Aug 30, 2010, 5:24:15 AM
On Aug 30, 11:07 am, Adrian Ho <lexfi...@gmail.com> wrote:
>
> > By the way, do the established connection contexts have to share a
> > central process ?
>
> Yes for performance reasons, but I can't say more without violating their
> NDA.  8-)

Then what about mod_tcl in Apache? *Much* smaller startup overhead.
Still not fast enough?

-Alex

Adrian Ho

Aug 30, 2010, 5:37:19 AM
On Mon, 30 Aug 2010 09:07:14 +0000, Adrian Ho wrote:

> I never disbelieved you, but I was concerned that it was not the only
> way to b0rk the event loop. My strace session was to get some
> visibility about what exactly was happening, and unless I misread the
> output, it actually wasn't the event loop that "died", but a blocking
> read that was incorrectly triggered due to "undefined behavior" in
> select().

And to address the obvious: [fconfigure]ing each channel to be non-
blocking on the server side causes the event loop to run normally again,
at the cost of constant (and mostly erroneous) triggering of fileevent
handlers, driving server CPU usage to 100%.
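
(Concretely, the change was nothing more than making each accepted
channel non-blocking, along these lines in IncomingConn:)

proc IncomingConn {chan addr port} {
    set ::fp($chan) "${addr}:${port}"
    fconfigure $chan -blocking 0       ;# the only change from the posted script
    ResetStats $chan
    fileevent $chan read [list EchoData $chan]
    Log "CONNECT: $::fp($chan) ([array size ::fp] total)"
}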

Best Regards,
Adrian

Adrian Ho

Aug 30, 2010, 5:44:11 AM

This, at least, I can safely answer: The client-server protocol isn't
HTTP-based. Using mod_tcl would definitely constitute "protocol re-
engineering".

Best Regards,
Adrian

Alexandre Ferrieux

Aug 30, 2010, 6:04:23 AM

Indeed :)

Then there's TclX's [fork]. This one will save you exec(), Tcl and
application initialization, while keeping your custom TCP-based
protocol intact.

-Alex

Uwe Klein

Aug 30, 2010, 6:16:06 AM
wouldn't mod_tcl be adaptable to other protocol types?

use shared mem? ( but discrete threads/processes with 500 connections each? )
( there used to be a shared memory "driver" for tcl around somewhere )

uwe

Adrian Ho

Aug 31, 2010, 2:18:02 AM

OK, I'm confused. How does [fork] figure into this?

Best Regards,
Adrian

Adrian Ho

Aug 31, 2010, 2:40:00 AM
On Mon, 30 Aug 2010 12:16:06 +0200, Uwe Klein wrote:

> Adrian Ho wrote:
>> On Mon, 30 Aug 2010 02:24:15 -0700, Alexandre Ferrieux wrote:
>>>Then what about mod_tcl in Apache ? *Much* smaller startup overhead.
>>>Still not fast enough ?
>>
>>
>> This, at least, I can safely answer: The client-server protocol isn't
>> HTTP-based. Using mod_tcl would definitely constitute "protocol re-
>> engineering".
> wouldn't mod_tcl be adaptable to other protocol types?

You mean have Apache handle all connection-based stuff but use mod_tcl's
hook capabilities to speak something other than HTTP over those
connections? Does anyone know if that can even be done?

> use shared mem? ( but discrete threads/processes with 500 connections
> each? ) ( there used to be a shared memory "driver" for tcl around
> somewhere )

Thanks for that suggestion. I assume you're thinking of one of the
options listed in http://wiki.tcl.tk/shared-memory ?

Along those lines, I also stumbled upon tcl-mq
(http://tcl-mq.sourceforge.net/), a Tcl POSIX message queue extension,
in case anyone's interested in such stuff.

Best Regards,
Adrian

Alexandre Ferrieux

Aug 31, 2010, 3:07:10 AM

(1) Init your app, including server socket

(2) in the accept callback, [fork]

- on the parent side, close the just-accepted socket and return

- on the child side, close the server socket and do your normal
protocol IO on the accepted socket. You can even do this blockingly if
a single client's requests do not overlap.

This way, for N established clients you have:

- 1 server process sleeping most of the time, with one server socket

- N client-serving children, each with one established socket.

All (N+1) processes thus only have a few open fds. No risk to
[fileevent].
And again, the [fork] overhead is very small if the parent keeps
little state (only read/write memory that actually gets written to
costs anything significant with fork()/vfork()).
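
Something like this, roughly (ServeClient stands for whatever your
per-client protocol handler would be; it's a placeholder, not an
existing proc):

package require Tclx                 ;# provides [fork]

proc IncomingConn {chan addr port} {
    if {[fork] != 0} {
        # Parent: the child owns this client now, so just forget the channel.
        close $chan
        return
    }
    # Child: the listening socket belongs to the parent, not to us.
    close $::listener
    ServeClient $chan $addr $port    ;# placeholder: per-client loop, may block freely
    exit 0
}

set ::listener [socket -server IncomingConn 5746]
vwait ::forever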


Note that this is *very* different from doing it all with Tcl threads,
because threads share the set of open fds, spoiling the whole scheme.

-Alex

Uwe Klein

Aug 31, 2010, 3:21:33 AM
Adrian Ho wrote:
> On Mon, 30 Aug 2010 12:16:06 +0200, Uwe Klein wrote:
>
>
>>Adrian Ho wrote:
>>
>>>On Mon, 30 Aug 2010 02:24:15 -0700, Alexandre Ferrieux wrote:
>>>
>>>>Then what about mod_tcl in Apache ? *Much* smaller startup overhead.
>>>>Still not fast enough ?
>>>
>>>
>>>This, at least, I can safely answer: The client-server protocol isn't
>>>HTTP-based. Using mod_tcl would definitely constitute "protocol re-
>>>engineering".
>>
>>wouldn't mod_tcl be adaptable to other protocol types?
>
>
> You mean have Apache handle all connection-based stuff but use mod_tcl's
> hook capabilities to speak something other than HTTP over those
> connections? Does anyone know if that can even be done?

I haven't done anything with mod_tcl.

That said, one facet of apache is (fast) connection handling.
One facet of mod_tcl is having a tcl interpreter "at the ready" for
connections handed down from apache in rapid succession.
Another facet is handling http. (We can ignore or replace some facets, can't we?)


>
>
>>use shared mem? ( but discrete threads/processes with 500 connections
>>each? ) ( there used to be a shared memory "driver" for tcl around
>>somewhere )
>
>
> Thanks for that suggestion. I assume you're thinking of one of options
> listed in http://wiki.tcl.tk/shared-memory ?
>
> Along those lines, I also stumbled upon tcl-mq (http://tcl-
> mq.sourceforge.net/), a Tcl POSIX message queue extension, in case
> anyone's interested in such stuff.

Thereabouts.

Some questions whose answers may help you:
Do all of your connections look onto a common state?
Is your app multiplayer-like, i.e. most clients connect at some point
and then stay for the session?
Or is it more like web access: constant coming and going?

uwe


>
> Best Regards,
> Adrian

Greg Couch

Aug 31, 2010, 2:28:00 PM
On Aug 30, 1:13 am, Adrian Ho <lexfi...@gmail.com> wrote:
> ....

> Looks like the select() call only manipulated the first 1024 fd bits,
> hence the call to recv() on an FD which blocks with nothing to read...
>
> So, it's clearly important to ensure that, until/unless Tcl switches to
> poll() and its ilk, forcing a 1024-open-FD limit is the right thing to do.
>
> I now have the unhappy task of asking the client to either re-engineer
> their protocol to limit the number of simultaneous connections, or write
> a poll()-based proxy.  8-)
>
> Best Regards,
> Adrian

On Linux, can't you just recompile Tcl with __FD_SETSIZE defined to
2048 (or whatever number of descriptors you need)? By default, the
select system call uses a fixed 1024-bit fd_set on the C library side,
but last I looked (4.2 BSD, for you old timers), the kernel side used
the number-of-file-descriptors argument and expected an array of bits
large enough for that number. The fd_set structure is just a
convenience structure.

Good luck,

Greg

Adrian Ho

Sep 1, 2010, 3:11:35 AM
On Tue, 31 Aug 2010 11:28:00 -0700, Greg Couch wrote:

> On Linux, can't you just recompile Tcl with __FD_SETSIZE defined to 2048
> (or
> whatever number of descriptions you need)? By default, the select
> system call
> uses a fixed 1024 bit fd_set on the C library side, but last I looked
> (4.2 BSD,
> for you old timers), the kernel side used the number of file descriptors
> argument and expected an array of bits large enough for that number. The
> fd_set
> structure is just a convenience structure.

From what I've read (and after a bit of poking around in header files to
confirm it), it requires a pretty dirty hack (see
http://lkml.indiana.edu/hypermail/linux/kernel/0112.1/1401.html for
details), and there's no guarantee that select() will actually respect
the new fd_set size.

In any case, my client has acknowledged that they don't actually need
that many open FDs (it was actually a bug in their code), so I'm
dropping this issue for now. It would be nice to see the Tcl core
switch to poll() at some point, though. 8-)

Best Regards,
Adrian

Alexandre Ferrieux

Sep 1, 2010, 3:14:42 AM
On Sep 1, 9:11 am, Adrian Ho <lexfi...@gmail.com> wrote:
> On Tue, 31 Aug 2010 11:28:00 -0700, Greg Couch wrote:
> > On Linux, can't you just recompile Tcl with __FD_SETSIZE defined to 2048
> > (or
> > whatever number of descriptions you need)?  By default, the select
> > system call
> > uses a fixed 1024 bit fd_set on the C library side, but last I looked
> > (4.2 BSD,
> > for you old timers), the kernel side used the number of file descriptors
> > argument and expected an array of bits large enough for that number. The
> > fd_set
> > structure is just a convenience structure.
>
> From what I've read (and after a bit of poking around in header files to
> confirm it), it requires a pretty dirty hack (seehttp://lkml.indiana.edu/

> hypermail/linux/kernel/0112.1/1401.html for details), and there's no
> guarantee that select() will actually respect the new fd_set size.
>
> In any case, my client has acknowledged that they don't actually need
> that many open FDs (it was actually a bug in their code), so I'm dropping
> this issue for now.  It would be nice to see the Tcl core switch to poll
> () at some point, though.  8-)

OK. For the record, I'd still like to hear your comments about the
TclX [fork] approach ;-)

-Alex

Adrian Ho

Sep 1, 2010, 3:47:08 AM

Ah yes, sorry about that. For the record, the client's app actually
shares state across all its clients as Uwe speculated, so to get the
[fork] approach to work, the slave processes would need to use some other
IPC mechanism to communicate with the master server. Otherwise, it might
have worked pretty well. 8-)

Best Regards,
Adrian
