set ip 192.168.1.220
set port 12345
set blocksize 32768
set buffersize 1000000
--<snip-snip>--
set fd [socket $ip $port]
fconfigure $fd -translation binary -buffersize $buffersize
set buffersize [fconfigure $fd -buffersize]
puts "Connecting to $ip TCP/port $port, buffersize $buffersize, $blocksize bytes/read"
set then [clock clicks -milliseconds]
set bytes 0
while {1} {
    set s [read $fd $blocksize]
    incr bytes [string length $s]
    set now [clock clicks -milliseconds]
    if {$now-$then >= 1000} {
        set seconds [expr {($now-$then)/1000.0}]
        puts "$bytes Bytes in $seconds s, [format %.1f [expr {8*$bytes/$seconds/1024/1024}]] MBit/s"
        set then $now
        set bytes 0
    }
}
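For reference, the rate printed by the loop above is the bit count divided by the elapsed seconds, scaled by 1024*1024. A minimal sketch of that arithmetic as a stand-alone proc follows; the name mbitPerSecond is made up for illustration and is not part of the posted script.

```tcl
# Hypothetical helper (not from the original post): convert a byte count
# and an elapsed time in milliseconds into MBit/s, using the same
# 1024*1024 scaling as the script above.
proc mbitPerSecond {bytes elapsedMs} {
    if {$elapsedMs <= 0} {
        # Avoid dividing by zero when no time has passed.
        return 0.0
    }
    set seconds [expr {$elapsedMs / 1000.0}]
    return [expr {8.0 * $bytes / $seconds / 1024 / 1024}]
}

puts [format %.1f [mbitPerSecond 82575360 1000]]   ;# 630.0
```

Keeping the calculation in one place makes it easier to compare numbers between the blocking loop and an event-driven variant.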
On the same hardware (dual-boot) I get under Linux/Ubuntu
+630MBit/s. i.e. I can read the data without problems.
On the same computer under Windows XP Professional SP3, I get only about
380MBit/s, both with TCL 8.3.3 and ActiveState 8.5.2.0. Taskmanager
shows the network activity at ~40% (which is expected at 1GBit
connection), and CPU at 50-60% (which seems a bit high for simple data
copying?).
Changing the blocksize of the reads does not change the max data rate I
get (using [read $fd 1000000] it even gets worse).
Before writing my own data collection in native-C/C++: is there any
known TCL-related problem which limits the data transfer rates on
Windows? (I could not find anything in the bug database on
sourceforge.)
Thanks
R'
> Before writing my own data collection in native-C/C++: is there any
> known TCL-related problem which limits the data transfer rates on
> Windows? (I could not find anything in the bug database on
> sourceforge.)
Yes, the windows port uses the WSAAsyncSelect method which doesn't have
good performance. Soup it up with the iocpsock extension:
http://sf.net/projects/iocpsock
As a side experiment, I wonder: what kind of performance could we
expect by spawning a netcat instance and reading from the pipe ?
Ralf, if you can try on the very same case, I'm interested in the
results.
-Alex
'netcat' as in http://netcat.sourceforge.net/ ? They don't have a
windows version for download, I'll give it a try in compiling it under
msys/mingw...
R'
Ouch.
| Soup it up with the iocpsock extension:
| http://sf.net/projects/iocpsock
Thanks for the pointer, will try.
R'
Netcat [v1.10] on Linux (different hardware, but I would guess it does
not matter, since the CPU is not the limit) reads the expected 612MBit/s.
Netcat [v1.11] Windows-adjusted sources from securityfocus.com, compiled
with M$ visual studio, reads about 316MBits/s.
Netcat v1.10 'official' sources compiled under Cygwin -- forget it.
40MBit/s.
FYI: Iocpsock TCL extension reads 490MBit/s, better, but not good enough
(see other post).
R'
Better, but not good enough (yet :-)
(Reference: Linux same hardware, 612MBit/s)
Plain TCL: ~390MBit/s
Iocpsock 3.0a3: ~490MBit/s
This is with fd config
-blocking 1 -buffering full -buffersize 1000000
-encoding binary -eofchar {{} {}} -translation {lf lf} -peername
{192.168.1.220 heim.akutech-local.de 5007} -sockname
{192.168.1.200 jaguar 1165} -keepalive 0 -nagle 1 -sendcap {20 0}
-recvmode zero-byte
If I try [-recvmode flow-controlled] which is advertised in the docs as
use-this-for-high-speed, I get no reads at all:
fd config -blocking 1 -buffering full -buffersize 1000000
-encoding binary -eofchar {{} {}} -translation {lf lf} -peername
{192.168.1.220 heim.akutech-local.de 5007} -sockname
{192.168.1.200 jaguar 1168} -keepalive 0 -nagle 1 -sendcap {20 0}
-recvmode flow-controlled
Connecting to 192.168.1.220 TCP/port 5007, buffersize 1000000, 32768 bytes/read
72836 Bytes in 1.0 s, 0.6 MBit/s
0 Bytes in 1.0 s, 0.0 MBit/s
0 Bytes in 1.0 s, 0.0 MBit/s
0 Bytes in 1.0 s, 0.0 MBit/s
Must be doing something wrong here... I also tried -sendcap 25, same results.
Any hints?
R'
Use -blocking 0 and -recvmode {burst-detection 25 25}. With blocking on
you aren't going to get any performance out of it at all.
Changed the fconfigure to
... -blocking 0 -recvmode {burst-detection 25 25} ...
and using the original 'blocking' read loop (no event loop)
=>
fd config -blocking 0 -buffering full -buffersize 1000000 -encoding
binary -eofchar {{} {}} -translation {lf lf} -peername {192.168.1.220
heim.akutech-local.de 5007} -sockname {192.168.1.200 jaguar 1492}
-keepalive 0 -nagle 1 -sendcap {20 0} -recvmode {burst-detection 25 1}
Connecting to 192.168.1.220 TCP/port 5007, buffersize 1000000, 32768 bytes/read
2920 Bytes in 1.0 s, 0.0 MBit/s
0 Bytes in 1.0 s, 0.0 MBit/s
0 Bytes in 1.0 s, 0.0 MBit/s
...
Changed the code to using fileevent, but this triggers only once when
using socket2 (original performance using plain socket):
proc readdata {fd} {
    while {1} {
        set s [read $fd $::blocksize]
        if {[eof $fd]} {
            puts "EOF on $fd"
            close $fd
            return
        }
        if {[string length $s] == 0} {
            puts nothing-read
            break
        }
        incr ::bytes [string length $s]
        set now [clock clicks -milliseconds]
        if {$now-$::then >= 1000} {
            set seconds [expr {($now-$::then)/1000.0}]
            puts "$::bytes Bytes in $seconds s, [format %.1f [expr {8*$::bytes/$seconds/1024/1024}]] MBit/s"
            set ::then $now
            set ::bytes 0
        }
    }
}
fileevent $fd readable [list readdata $fd]
vwait forever
=>
fd config -blocking 0 -buffering full -buffersize 1000000 -encoding
binary -eofchar {{} {}} -translation {lf lf} -peername {192.168.1.220
heim.akutech-local.de 5007} -sockname {192.168.1.200 jaguar 1498}
-keepalive 0 -nagle 1 -sendcap {20 0} -recvmode {burst-detection 25 1}
Connecting to 192.168.1.220 TCP/port 5007, buffersize 1000000, 32768 bytes/read
nothing-read
Hmmm...do you have a short example code how to use this -recvmode?
Do you expect the other recvmodes to be faster than zero-byte?
Should I try different read-blocksize and buffersize?
R'
Doesn't make a difference :-/ I also changed the read blocksize to be
4096 - no bytes read, neither with 'blocking' loop nor with fileevent
based reads. Time for source digging I guess...but this will be
tomorrow.
Thanks anyway, even the 30% performance improvement over plain TCL is a
gain.
R'
When you measured netcat, was it "> NUL:", or output to a file, or
output to the pipe to Tcl ?
(The NUL: redirection should really give the limit performance on the
socket side)
-Alex
> Hmmm...do you have a short example code how to use this -recvmode?
Not really. They were just experiments in trying different algorithms
for different ways of operating overlapped sockets. They shouldn't
cause things to stop working, though.
> Do you expect the other recvmodes to be faster than zero-byte?
Yes. zero-byte uses a single zero-byte buffer for WSARecv()
to just act as a readable alert coming back through the completion port.
The resulting [read] causes another call to WSARecv() in blocking mode
and uses the internal socket buffers rather than the overlapped ones.
Just use this for high concurrency stuff when you need 50K sockets open.
> Should I try different read-blocksize and buffersize?
I wouldn't use any block size for [read] and a smaller -buffersize in
multiples of 4096 (that's the grain of VirtualAlloc FYI)
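A rough sketch of the "multiples of 4096" advice: round a requested -buffersize up to the next 4096-byte boundary before handing it to fconfigure. The proc name roundToPage is hypothetical, and the 4096 default is simply the figure quoted above.

```tcl
# Hypothetical helper: round a size up to the next multiple of the page
# granularity (4096 by default, the figure mentioned above).
proc roundToPage {size {page 4096}} {
    # Integer arithmetic: add page-1, then truncate to a whole number of pages.
    return [expr {(($size + $page - 1) / $page) * $page}]
}

puts [roundToPage 1000000]   ;# 1003520
puts [roundToPage 30000]     ;# 32768
```

It could then be used as: fconfigure $fd -buffersize [roundToPage 30000]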
The problem is in your script. When -blocking is off and you're in a
while loop calling [read] regardless of being notified that it is
readable, you will get empty reads, yet you break out of the loop and
stop. It's only alertable on the first, then becomes polling.
This needs some more error traps, but is closer.
proc readdata {fd} {
    set s [read $fd]
    if {[incr ::bytes [string length $s]] == 0} {
        puts nothing-read
    }
    if {[eof $fd]} {
        puts "EOF on $fd"
        close $fd
        set ::done 1
    }
}
proc readuntildone {fd} {
    fileevent $fd readable [list readdata $fd]
    vwait ::done
    set now [clock clicks -milliseconds]
    if {$now-$::then >= 1000} {
        set seconds [expr {($now-$::then)/1000.0}]
        puts "$::bytes Bytes in $seconds s, [format %.1f [expr {8*$::bytes/$seconds/1024/1024}]] MBit/s"
        set ::then $now
        set ::bytes 0
    }
}
readuntildone $fd
I would also expect a CPU improvement with iocpsock vs. current Tcl.
FWIW, the WSAAsyncSelect in the core is used because that's the only
API available on all Windows versions. Even if we move to the IOCP
stuff for the core, we have to keep the old variant for systems like
Windows CE.
Jeff
If only I had unlimited free time to spend on it. I want to see
WSAEventSelect used as the fallback, with the JobQueue code it'll
need shared with the other windows channel drivers so they can move to
alertable handles. :)
proc readdata {fd} {
    if {[catch {read $fd} s]} {
        puts "Got error on $fd: $s"
        close $fd; set ::done 1; return
    }
    if {[incr ::bytes [string length $s]] == 0} {
        puts nothing-read
    }
    if {[eof $fd]} {
        puts "EOF on $fd"
        close $fd; set ::done 1
    }
}
proc readuntildone {fd} {
    set ::done 0
    fconfigure $fd -blocking off -buffersize 4096
    if {0} {
        fconfigure $fd -recvmode {burst-detection 25 1}
    } else {
        fconfigure $fd -recvmode flow-controlled
Actually, on WinCE systems, the situation is even worse, since
WSAAsyncSelect() itself isn't available in Winsock2 API, and it has to
be emulated (using e.g. WSACreateEvent() & al. API).
Eric
Generally redirected to NUL (/dev/null).
| > Netcat [v1.10] on Linux (different hardware, but I would guess it does
| > not matter, since the CPU is not the limit) reads the expected 612MBit/s.
netcat redirected to /dev/null, manual C-c interrupt after 10 seconds
(since the Linux netcat does not print messages when sending it a signal
via kill, reason unknown). The byte count was close enough to the
expected count.
| > Netcat [v1.11] Windows-adjusted sources from securityfocus.com, compiled
| > with M$ visual studio, reads about 316MBits/s.
netcat redirected to NUL, running in background, sleep 10 seconds, then
sending SIGUSR1 to netcat (which made it print the byte count)
netcat ... > nul & sleep 10 ; kill -USR1 $!
| > FYI: Iocpsock TCL extension reads 490MBit/s, better, but not good
| > enough (see other post).
This was the TCL script which discards data anyway so no redirection
required.
R'
Modified version, I'd like to see the stats every second. Plus I
changed the nothing-read condition in 'readdata' to check the string
length, not the return value of 'incr':
# ====================================
package require Iocpsock
proc readdata {fd} {
    if {[catch {read $fd} s]} {
        puts "Got error on $fd: $s"
        close $fd; set ::done 1; return
    }
    set len [string length $s]
    if {$len == 0} {
        puts nothing-read
    } else {
        puts "read $len bytes"
        incr ::bytes $len
    }
    if {[eof $fd]} {
        puts "EOF on $fd, bytes read so far $::bytes"
        close $fd; set ::done 1
    }
}
proc readuntildone {fd} {
    set ::done 0
    fconfigure $fd -blocking off -buffersize 4096
    if {0} {
        fconfigure $fd -recvmode {burst-detection 25 1}
    } else {
        fconfigure $fd -recvmode flow-controlled
    }
    fileevent $fd readable [list readdata $fd]
}
proc stats {then} {
    set now [clock clicks -milliseconds]
    set seconds [expr {($now-$then)/1000.0}]
    if {$seconds > 0} {
        puts "$::bytes Bytes in $seconds s, [format %.1f [expr {8*$::bytes/$seconds/1024/1024}]] MBit/s"
        set ::bytes 0
    }
    after 1000 [list stats $now]
}
set heimip 192.168.1.220
set port 5007
set bytes 0
set fd [socket2 $heimip $port]
stats [clock clicks -milliseconds]
readuntildone $fd
vwait ::done
# ====================================
With the
fconfigure $fd -recvmode flow-controlled
branch I get
read 4096 bytes
EOF on iocp1712, bytes read so far 4096
4096 Bytes in 1.0 s, 0.0 MBit/s
0 Bytes in 1.0 s, 0.0 MBit/s
0 Bytes in 1.0 s, 0.0 MBit/s
I.e. an immediate EOF after the first successful read.
With the
fconfigure $fd -recvmode {burst-detection 25 1}
branch I get
read 4096 bytes
4096 Bytes in 1.016 s, 0.0 MBit/s
0 Bytes in 1.0 s, 0.0 MBit/s
0 Bytes in 1.0 s, 0.0 MBit/s
0 Bytes in 1.015 s, 0.0 MBit/s
I.e. one successful read, then no more fileevents.
R'
It would be nice if iocpsock was available in the core as optional
package like http or msgcat. Since it is platform-specific there would
not be the need to include multiple binary versions. But of course it
is not a big deal to install it from separate package. Perhaps a
pointer to it in the 'socket' manpage?
R'
> I.e. an immediate EOF after the first successful read.
Very odd. I know this works.
> I.e. one successful read, then no more fileevents.
Ditto. It's probably something simple, as usual.
There is a branch in core source with it. But it is becoming a long
process to incorporate all the new protocol features (ipv6, irda and
bluetooth) and other enhancements.
Is there a source ZIP available, or do I need to download from CVS?
R'
David, a random thought: what about a dichotomy, migrating the
existing ipv4 code to iocp now, and postponing the more exotic
transports ?
-Alex
Do I need a threaded TCL for this to work?
R'
Or a specific TCL version? I'm using 8.3.3, but I see the same results
with ActiveState 8.5.2.0.
R'
1) cvs checkout is needed to get the source:
cvs -d:pserver:anon...@iocpsock.cvs.sourceforge.net:/cvsroot/iocpsock
checkout -P iocpsock
2) threaded tcl isn't needed, but channels can be transferred if you are
using it.
3) there isn't a minimum tcl version that I know of. In reality there
is, and could be as low as the one which introduced channel transforms,
which I think is 8.3. There is a minimum Windows version and that's
Win2K, but there are some ipv6 bugs with it. Best performance is with
XP. It might even work on NT4 (I checked it once), but with no ipv6
support as you'll get an address family error.
> David, a random thought: what about a dichotomy, migrating the
> existing ipv4 code to iocp now, and postponing the more exotic
> transports ?
Doesn't help. We have a bad case of bit-rot. The leg has become
gangrenous and needs amputation for a Steve Austin replacement ;)
Got it.
- as-is, you need at least TCL 8.5, since TCL 8.3.3 does not define
TCL_CHANNEL_VERSION_3 in tcl.h, so compilation of iocpsock_lolevel.c
fails. Since the only place where this is used is to reduce the
channel version to type 2, one could wrap the offending code by a
#ifdef TCL_CHANNEL_VERSION_3, but...
- The E* constants used for error reporting (EWOULDBLOCK, ENOTCONN ...)
are not found properly in the tcl 8.3.3 nor 8.4.16 headers, so you
really need the 8.5+ headers (I used 8.5.2 and 8.5.5).
With TCL 8.3/8.4 I get
.\iocpsock_lolevel.c(811) : error C2065: 'EWOULDBLOCK': nichtdeklarierter Bezeichner
.\iocpsock_lolevel.c(1014) : error C2065: 'ENOTCONN': nichtdeklarierter Bezeichner
...
where "nichtdeklarierter Bezeichner" = undeclared identifier
- Tcl_WinError() is declared in iocpDecls.h as
TCL_EXTERN(CONST char *) Tcl_WinError _ANSI_ARGS_((unsigned int errorCode, Tcl_Interp * interp));
but in a lot of places called as
Tcl_WinError(interp)
This gives a lot of error messages a la
.\iocpsock_lolevel.c(323) : warning C4047: 'Funktion': Anzahl der
Dereferenzierungen bei 'unsigned int' und 'Tcl_Interp *'
unterschiedlich
= different levels of indirection (pointer vs int)
.\iocpsock_lolevel.c(323) : warning C4024: 'Tcl_WinError':
Unterschiedliche Typen fuer formalen uebergebenen Parameter 1
= different types for param 1 (pointer vs int)
.\iocpsock_lolevel.c(323) : error C2198: "Tcl_WinError":
Nicht genuegend Argumente fuer Aufruf.
= not enough arguments (1 vs 2 required)
Since there are so many places, I guess I'm doing something wrong
here, but I just don't grok what...
| 2) threaded tcl isn't needed, but channels can be transfered if you
| are using it.
Ok.
| 3) there isn't a minimum tcl version that I know of. In reality there
| is, and could be as low as the one which introduced channel
| transforms, which I think is 8.3.
See above.
R'
I meant 'run with' more than 'build against'. I just tried to build tcl
8.3.3 tonight and couldn't with VC9. All the compiler options are
different and the compiler blows chunks. I just don't feel like
spending the half hour to fix its makefile.vc.
If I did have a build of 8.3 and the header file, I would expect the
missing defines, though. Found in tclWinPort.h I think. Its location
was always a bug as those defines are used by public functions. A
little header magic could probably fix it, but I'm not sure if it's
worth the time. Works without error with 8.4.19, though. I'm glad that
got fixed.
Building with 8.5 (or 8.4 or 8.6) and backloading into 8.3 should work
just fine. Some people say backloading is evil. I think it's good when
you know what you're doing. See IocpGetTclMaxChannelVer() and
InitSockets() in iocpsock_lolevel.c for how it manipulates the channel
type to the core it loads into.
> - Tcl_WinError() is declared in iocpDecls.h as
> TCL_EXTERN(CONST char *) Tcl_WinError _ANSI_ARGS_((unsigned int errorCode, Tcl_Interp * interp));
> but in a lot of places called as
> Tcl_WinError(interp)
>
> This gives a lot of error messages a la
>
> .\iocpsock_lolevel.c(323) : warning C4047: 'Funktion': Anzahl der
> Dereferenzierungen bei 'unsigned int' und 'Tcl_Interp *'
> unterschiedlich
> = different levels of indirection (pointer vs int)
>
> .\iocpsock_lolevel.c(323) : warning C4024: 'Tcl_WinError':
> Unterschiedliche Typen fuer formalen uebergebenen Parameter 1
> = different types for param 1 (pointer vs int)
>
> .\iocpsock_lolevel.c(323) : error C2198: "Tcl_WinError":
> Nicht genuegend Argumente fuer Aufruf.
> = not enough arguments (1 vs 2 required)
>
> Since there are so many places, I guess I'm doing something wrong
> here, but I just don't grok what...
Some half finished work I forgot to finish, sorry. Do an update, should
be all fixed.
I'll get the chance to test it tomorrow and see if my code is still
working right.
Oh yes, but unless I'm misunderstanding the scope of your work, it
seems the ipv4 part of iocpsock already covers the whole socket code
in the core. So even if we leave out the bionic eye, ear and
teleporting wristwatch, we can still afford a full titanium leg and
hip. And for much less than $3bn, right ? :D
-Alex
Praise to the Stubs interface. :-)
| See IocpGetTclMaxChannelVer() and InitSockets() in iocpsock_lolevel.c
| for how it manipulates the channel type to the core it loads into.
One issue could be that in 8.5.5 the Channel struct is larger than in
earlier TCL versions. But since the channel struct definition is
located inside of iocpsock, it should not matter.
| > - Tcl_WinError()
--<snip-snip>--
| Some half finished work I forgot to finish, sorry. Do an update,
| should be all fixed.
Ok, compiles fine w/o warnings or errors against 8.5.5.
Now testing it against 8.3.3, but don't hold your breath ;-)
Thanks so far.
R'
Could you check whether the following makes sense?
--- iocpsock_lolevel.c 18 Mär 2009 14:49:15 +0100 1.111
+++ iocpsock_lolevel.c 18 Mär 2009 18:48:24 +0100
@@ -948,7 +948,7 @@
buffer = bufPtr->buf;
}
memcpy(*bufPos, buffer, bufPtr->used);
- bytesRead += bufPtr->used;
+ *bytesRead += bufPtr->used;
*bufPos += bufPtr->used;
}
}
R'
Yes that does make sense, committed. You're doing a great job reading
the code, btw. I didn't mean for it to be so difficult to read. It
just turned out that way for the strange way overlapped sockets have
three levels of timing combined with my academic approach for the three
receive modes.
1) initiation of the overlapped call [PostOverlappedXXX()]
2) return of the overlapped call through the completion port [HandleIO()]
3) copying buffers into the channel
Ok. With that change I get a data stream in the flow-controlled case
(otherwise the EOF is seen because the bytecount is left at 0). But the
performance is still poor (130Mbit) compared to the zero-byte condition
with 1MB buffersize (490MBit).
I'm now trying to understand the burst-detection code, but I suspect
there is a logic glitch in there right at the start of the process:
The code posts some initial buffers when the socket is opened
(ws2tcp.c, around #665). At that point, the fconfigure has not yet run,
so the default channel options are used (iocpsock_lolevel.c,
NewSocketInfo(), infoPtr->recvMode = IOCP_RECVMODE_ZERO_BYTE), and thus
a buffer length of 0 is used for those initial buffers.
Then I reconfigure to burst-detection, and then the buffers are read.
At that point the code expects in HandleIO()/OP_READ for the
burst-detection case that the byte-count is > 0 to repost the buffer,
but with 0-byte initial buffers, this will not be true, so the buffers
do not get reposted, and no more data arrive.
With the following, I get a data stream, but I don't know if this is the
correct thing to do (especially with regards to EOF etc):
--- iocpsock_lolevel.c 18 Mär 2009 14:49:15 +0100 1.111
+++ iocpsock_lolevel.c 18 Mär 2009 22:26:22 +0100
@@ -2459,6 +2459,8 @@
InterlockedDecrement(&infoPtr->outstandingRecvCap);
FreeBufferObj(bufPtr);
break;
+ } else if (infoPtr->recvMode == IOCP_RECVMODE_BURST_DETECT) {
+ infoPtr->needRecvRestart = 1;
}
/* Takes buffer ownership. */
Shall we continue this here or rather in private email?
R'
Here is another one that bugged me: if I query the fconfigure settings
of the channel, I expected to get back what I had set. But instead the
second number in the 'burst-detection' was always 0 or 1, regardless to
what I had set. I *think* the following change should be applied: use
the 'outstandingRecvBufferCap' for the second number, not the
outstandingRecvs.
diff -u -r1.112 iocpsock_lolevel.c
--- iocpsock_lolevel.c 18 Mar 2009 19:51:48 -0000 1.112
+++ iocpsock_lolevel.c 18 Mar 2009 21:45:23 -0000
@@ -1453,7 +1453,7 @@
Tcl_DStringAppendElement(dsPtr, "burst-detection");
TclFormatInt(buf, infoPtr->outstandingRecvCap);
Tcl_DStringAppendElement(dsPtr, buf);
- TclFormatInt(buf, infoPtr->outstandingRecvs);
+ TclFormatInt(buf, infoPtr->outstandingRecvBufferCap);
Tcl_DStringAppendElement(dsPtr, buf);
if (len == 0) {
Tcl_DStringEndSublist(dsPtr);
R'
Here's the back-story on it, just FYI. When I first coded the
PostOverlappedRecv() function, I couldn't help but notice that when a
WSARecv has an immediate completion rather than a post, it indicates
that the data came from the internal buffers rather than being written
directly to a WSABUF -- a kind-of low-water mark if you will. With
overlapped mode it's valid to have more than one outstanding WSARecv
call. So why not use that as an indication to increase the outstanding
count to match the incoming flow?
My academic nature took over.
That's the dangerous looking recursion at the end of the
PostOverlappedRecv() function that keeps increasing the outstanding
count until it gets a WSA_IO_PENDING or outstandingRecvCap is hit. And
I just don't fully remember what outstandingRecvBufferCap was for and
where I was going with it.
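To make the idea above concrete, here is a toy model in Tcl. It is only a sketch of the description and not the iocpsock code: each simulated WSARecv either completes immediately or goes pending, and posting continues until a call goes pending or the cap is reached. The proc name postsUntilPending and the "immediate"/"pending" tokens are invented for illustration.

```tcl
# Toy model only (not the real PostOverlappedRecv()): count how many
# receives end up outstanding. $results lists, in order, whether each
# simulated WSARecv completed "immediate"ly or went "pending".
proc postsUntilPending {results cap} {
    set outstanding 0
    foreach r $results {
        if {$outstanding >= $cap} {
            # The cap (outstandingRecvCap in the description) stops growth.
            break
        }
        incr outstanding
        if {$r eq "pending"} {
            # The first pending call means nothing more was buffered.
            break
        }
    }
    return $outstanding
}

puts [postsUntilPending {immediate immediate pending immediate} 25]   ;# 3
puts [postsUntilPending {immediate immediate immediate} 2]            ;# 2
```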
The last time I used this heavily was my website with tclhttpd, but I
ran only in zero-byte mode, so I guess all these bugs in my code went
unfound when I last dove deep into it about 3 years ago.
I'll dig into it tonight and see what I can find.
You're doing a great job on this Ralf. I was convinced it became
unreadable.
Email might be better, but I don't mind the forum either. I might get a
bit delayed anyways as my free time is hard to find these days.
Now that's my type of analogy :) Yeah, I could do that.. with ipv6 as
is now.. And with space given in the changes to [socket]'s options to
allow for space to add on to..
Here's the thing though.. I want to get some of it working first to
find my happy place, then back it down to complete patches.
What's the pin-out for the infra-red spectrometer plug we need to put in
or the optional satellite uplink transponder now even though they
haven't yet been made? Which actually is IrDA and its odd way of
searching for services, and Bluetooth is even more different. And how
does that all fit into a single way of doing lookups, preferably
asynchronously? And in the big picture, there are three types of name
service methodologies: static (dns), dynamic and persistent.
And then there's QoS, and UDP.. UDP is a big one to have, we all want
it and it'll take some hack jobs to get it in. Most notably [gets] and
[read] will not be used; in their place goes something else like [readfrom] that
bypasses the channel buffer space and returns an address, too, or whatever.
Easier add-ons would be older forgotten protocols like AppleTalk,
IPX/SPX, Banyan VINES, ATM, etc.
And there's the WinCE problem. As IOCP doesn't work there nor does
WSAAsyncSelect. Emulating WSAAsyncSelect using WSAEventSelect to keep
the old code is just silly. That needs a whole new alert system melded
into my code that already does overlapped to manage WSAEventSelect. And
it wouldn't be a waste either as the JobQueue code could be reusable for
the other channel drivers so we can convert them to the waitable handles
that they are now (13 years later from Scott Stanton's original 7.6
windows Win95/32s port work) and stop the blocking
ReadFile()/WriteFile() calls in thread pairs per channel open. You know
what I'm talking about?
This is huge! There's serious *BIT-ROT* in the windows code.
http://en.wikipedia.org/wiki/Software_rot
I'm almost wishing I was unemployed to put the time into it. And I'd darn
well better be careful with what I wish for these days.
Please boys, stay on the newsgroup. This gold mine of engineering
decisions would be wasted in a private link; let Google index'em
(considering all the crap it indexes nowadays, you'll be doing it a
favour in terms of SNR).
-Alex
Ok, since Alexandre asked so politely, let's continue here ;-)
I have reduced the bandwidth for now to 150MBit/s on the sender side.
As soon as the connection is established, the remote device sends data
at that rate.
I now compared:
plain TCL
fconfigure $fd -blocking 1 -buffersize 1000000
read $fd 32768
=> 150MBit/s, CPU 30%, ~600 fileevents/second
iocpsock/zero-byte
fconfigure $fd -blocking 0 -buffersize 1000000 -recvmode zero-byte
read $fd
=> 150MBit/s, CPU 20%, ~500 fileevents/second
iocpsock/flow-controlled
fconfigure $fd -blocking 0 -buffersize {4096/32768/65536/1000000} -recvmode flow-controlled
read $fd
4096 => ~120MBit/s, CPU 60%, ~160 fileevents/second
32768 => ~125MBit/s, CPU 55%, ~15-30 fileevents/second
65536 => ~130MBit/s, CPU 55%, ~10 fileevents/second
1000000 => 145MBit/s, CPU 55%, ~250 fileevents/second
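As a cross-check on the figures above, the average number of bytes delivered per fileevent can be worked out from the data rate and the event rate. This is only a sketch; the proc name bytesPerEvent is made up, and it assumes the rates are in the 1024-based MBit units used by the test script.

```tcl
# Hypothetical helper: average bytes handled per fileevent, given a rate
# in MBit/s (1 MBit = 1024*1024 bit, as in the script) and events/second.
proc bytesPerEvent {mbitPerSecond eventsPerSecond} {
    set bytesPerSecond [expr {$mbitPerSecond * 1024 * 1024 / 8.0}]
    return [expr {$bytesPerSecond / $eventsPerSecond}]
}

puts [format %.0f [bytesPerEvent 150 600]]   ;# 32768
```

With 150 MBit/s and roughly 600 events per second this comes out at 32768 bytes, which lines up with the [read $fd 32768] block size of the plain TCL case.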
Right now, the winner is zero-byte with a large buffersize. I have no
idea why flow-controlled does not work at least as well as zero-byte.
If I understand it correctly, the difference between flow-control and
zero-byte is that with flow-control, the buffers posted have buflen
'buffersize' and thus the data are already in the buffers when they get
added to the linked list (i.e. the actual 'read' is done in the
'socket-reader-thread'), whereas with zero-byte only a 0-length
indicator buffer is pushed to the linked list when data are ready to
read, and the actual data transfer (WSARecv) is done in the
'TCL-thread' when IocpInputProc() runs.
| The burst-detection thing is just some horrific experiment mostly.
One that makes sense to me anyway, that's why I considered it useful...
If the data rate is high, it makes sense for the socket-reader thread to
read data as they are available, and store the buffers in the linked
list. The TCL-thread then consumes the buffers from that linked list
when it is able to do so. That compensates for delays in the
TCL-thread.
| That's the dangerous looking recursion at the end of the
| PostOverlappedRecv() function that keeps increasing the outstanding
| count until it gets a WSA_IO_PENDING or outstandingRecvCap is hit. And
| I just don't fully remember what outstandingRecvBufferCap was for and
| where I was going with it.
It seems to me they serve the same concept in different places.
outstandingRecvCap is solely used for the recursion limit in
PostOverlappedRecv() to control the number of receiving buffers posted.
This is really only used in the burst-detection case, anyway.
outstandingRecvBufferCap is used to determine whether the number of
already 'read' buffers in the linked list has reached a maximum
(consumer did not collect the ready buffers fast enough).
This is only used in the burst-detection case, too.
| I'll dig into it tonight and see what I can find.
Holding my breath... :-) Could you elaborate on the
FilterSingleOpRecvBuf()/FilterPartialRecvBufMerge()?
| You're doing a great job on this Ralf. I was convinced it became
| unreadable.
Not at all :-) Once I grokked the concept, and added a few [TM]
printf(:-)s, the fog lifted quickly...
R'
The difference seems to be the additional memcpy() in the
flow-controlled case to copy the data from the posted buffers to the
final destination. If I comment that out, I get the same performance as
with zero-byte. Makes sense, since the other overhead (post/WSARecv) is
the same in both cases. Only with zero-byte, the data is directly
copied into the user buffer, whereas with flow-controlled, it is first
received into the post-buffer, and then copied into the user buffer.
But this would mean that we're getting nowhere with the flow-controlled
path...
R'
<blush> ...nonsense. Forget about that. Failed to set -translation
binary in the flow-control case... The memcpy() takes next to 'no'
time, and flow-controlled with a large buffersize is as good as
zero-byte (but not better yet). More detailed stats follow.
R'
Very interesting. Originally, I didn't know what algorithm to choose.
So, of course, I chose all of them :) The whole intent of overlapped is
to allow your buffers (the WSABUFs you give WSARecv) to be owned by
the kernel and written to directly from the hardware by NFD.sys, then
handed back. This is what MS coins as 'zero-copy'. Yet, Tcl's generic
channel layer is a static space, so we have to copy it anyways.
memcpy speed limited? I thought we were below that. I wonder what
-recvmode {burst-detection 25 75} would do? You could even recompile
and change IOCP_RECV_BUFSIZE in iocpsockInt.h to be something larger
like 65536 or some other multiple of 4096
> Holding my breath... :-) Could you elaborate on the
> FilterSingleOpRecvBuf()/FilterPartialRecvBufMerge()?
Don't hold your breath too long or you'll go blue. I didn't get that
far last night. I did just find a bug in DoRecvBufMerge(), though. I
took out the check for a write error. Write errors should not prevent
reading. IOW, if we had an http/1.0 connection to a web server and I
sent a second request but hadn't yet read the first reply, I still want
to be able to read from the socket after the second write fails, which
it will with ENOTCONN or whatever it is.
FilterSingleOpRecvBuf() checks to make sure a WSABUF with an EOF or a
read error is not appended to the channel buffer space when it already
wrote some data there from a prior loop pass (the while loop in
IocpInputProc). It stops it there, reposts it back to the head for the
next iteration and causes the while loop to break.
FilterPartialRecvBufMerge() allows a WSABUF that is larger than the
channel buffer space to be partially copied, split and reposted back to
the head of infoPtr->llPendingRecv for later.
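The split-and-repost step described above can be pictured with a small Tcl model. This is a sketch of the description only, not the C implementation: the pending list stands in for infoPtr->llPendingRecv, strings stand in for WSABUFs, and the proc name mergePartial is hypothetical.

```tcl
# Toy model only: take the head buffer from a pending list and copy at
# most $space bytes of it. If the head is larger than $space, the
# remainder is put back at the head of the list for a later pass.
# Returns a two-element list: {copiedData newPendingList}.
proc mergePartial {pending space} {
    set head [lindex $pending 0]
    set rest [lrange $pending 1 end]
    if {[string length $head] <= $space} {
        # The whole buffer fits into the available channel space.
        return [list $head $rest]
    }
    set taken [string range $head 0 [expr {$space - 1}]]
    set left  [string range $head $space end]
    return [list $taken [linsert $rest 0 $left]]
}

puts [mergePartial {abcdef gh} 4]   ;# abcd {ef gh}
```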
IocpInputProc() would become *way* simpler if those three recv modes
went away.
As I admitted in the other message, we *are* below that. :-/
Pilot error, forgot to set binary encoding in that specific test case.
More tomorrow.
R'
Ok, no worries. Just for fun, this is a screen shot of some socket fun
I made back in 2003 or so:
http://iocpsock.sourceforge.net/netio.jpg
The window on the right is a remote desktop to "rufus" acting as the
receiver and shows how -sendcap behaves as a poor-man's throttle and how
burst-detection (I used to call it -recvburst) starts to tick upwards
when hit harder.
> If I understand it correctly, the difference between flow-control and
> zero-byte is that with flow-control, the buffers posted have buflen
> 'buffersize' and thus the data are already in the buffers when they get
> added to the linked list (i.e. the actual 'read' is done in the
> 'socket-reader-thread'), whereas with zero-byte only a 0-length
> indicator buffer is pushed to the linked list when data are ready to
> read, and the actual data transfer (WSARecv) is done in the
> 'TCL-thread' when IocpInputProc() runs.
Yeah, that's it. flow-controlled takes just the one WSABUF from
the linked list (infoPtr->llPendingRecv) and copies it right to the channel
buffer. Its size is at most the channel buffer space, depending on how
much the kernel actually wrote to the WSABUF. The replacement WSARecv is
posted after the data is read into Tcl, thus flow control is dependent on
Tcl actually reading data.
Burst-detect allows the count to grow, but the size of the WSABUFs is
hard-coded to what IOCP_RECV_BUFSIZE is set at rather than using the
toRead value from the last call to IocpInputProc(), which is equal to
-buffersize for the channel. The replacement (and growth) happens
behind the scenes in the completion thread to allow for immediate
reaction time. End-to-end flow is still dependent on Tcl reading from
the channel as the caps prevent it from eating the world.
Some big numbers for -recvmode {burst-detect XX XX} might get this to
fly. -recvmode {burst-detect 200 500} ?? max of 200 concurrent
WSARecv calls with 4k buffers each and a max of 500 waiting 4k (2048000
bytes) buffers?
Again, I'll try to dig into it some more and see if it's all working the
way it once was. The -recvmode change from when it's created might be
stuck, and I agree that stuff is a bit odd since I added zero-byte.
I made a few changes over the weekend. A new IocpSetRecvMode() is
called for the setoptionProc now and changes the internal buffering with
a call to setsockopt(SO_RCVBUF). I didn't know the proper buffersize,
though. I can't give it the heavy testing you can as I don't have a
gigabit LAN.
Could you give it a try for me? How's it going so far? Is
'burst-detection' improving flow for you?
Will check.
| Could you give it a try for me? How's it going so far? Is
| 'burst-detection' improving flow for you?
We're preparing for an exhibition, so I currently don't have the time
slices for iocpsock :-( Might take until next week, since a vacation is
also pending :-)...
R'
David,
right now my time for experimenting with iocpsock is very limited,
so no exhaustive testing. :-/
I just did a complete cvs checkout again. The new code seems to 'work'
with burst-detection (I get data), but the data rate in my real
application is still unsatisfying.
My application has the constraints that it checks for data periodically
(via 'after 1', does not use fileevents), and there the reading is not
fast enough with zero-byte, and not better than plain TCL with
flow-controlled.
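For reference, the polling style described above can be sketched roughly as follows. This is only an illustration using core Tcl commands; the names pollOnce, startPolling and ::polled are hypothetical and not taken from the real application.

```tcl
# Hypothetical names: pollOnce, startPolling, ::polled.
set ::polled 0

# Drain whatever is currently buffered on a non-blocking channel
# and return the number of bytes that were read.
proc pollOnce {fd} {
    set s [read $fd]
    set n [string length $s]
    incr ::polled $n
    return $n
}

# Poll once, then re-arm a 1 ms timer; stop and close at EOF.
proc startPolling {fd} {
    pollOnce $fd
    if {[eof $fd]} {
        close $fd
        return
    }
    after 1 [list startPolling $fd]
}
```

It would be used as `fconfigure $fd -translation binary -blocking 0` followed by `startPolling $fd` and an event loop. Each tick has to drain everything that arrived since the previous one, so the achievable rate depends on how often the timer really fires.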
With burst-detection it also works up to a certain threshold
(~20-22MByte/s), but with no different CPU load than flow-controlled (on
a single-CPU Pentium4 with Hyperthreading). If I go above that
threshold, the data seem corrupted (there are sequence numbers in my
data packets, and they get messed up completely).
Since the hardware is going to get delivered, I'm staying with plain TCL
right now in my real application :-(
I will try again with the data flushing script when I get some spare
time.
Two more points:
==================================================
- a small patch to the manpage:
--- man.html 21 Mär 2009 08:20:42 +0100 1.10
+++ man.html 27 Mär 2009 19:05:09 +0100
@@ -139,7 +139,7 @@
return TCL_ERROR;
}
#else
- if (Iocpsock_InitStubs(interp, "3.0", 0 /*exact*/) == TCL_ERROR) {
+ if (Iocpsock_InitStubs(interp, "3.0", 0 /*exact*/) == 0) {
return TCL_ERROR;
}
#endif
==================================================
- I tried to use the iocpsock stubs library in my application, but this
failed due to a null-pointer dereference in
iocpStubLib.c/Iocpsock_InitStubs():
There Tcl_PkgRequireEx() is called and a pointer to the stubs table is
expected to get filled in, but since Tcl_PkgProvideEx() is never called,
the returned pointer is 0, leading to a coredump.
In Iocpsock_Init()/dllmain.c, Tcl_PkgProvideEx() should get called
instead of Tcl_PkgProvide(), but I don't know how to provide the correct
clientData...
R'
This makes a difference, at least for the data-reading simple loop.
Here are the new results for IOCPSOCK_PATCH_LEVEL "3.0a4"
Basically, a data-reading fileevent, which sums up the number of bytes
read and issues a statistic every second. Goal is 612MBit/s.
Channel is fconfigured
-buffersize 1000000 -translation binary -blocking off
and in addition with the -recvmode specified. For comparison at the end
of each line is the TCL socket performance. Numbers are MBit/s and CPU usage.
Windows XP Professional, SP3.
-recvmode   zero-byte    flow-controlled   {burst-detection 250 500}   TCL-socket
            MBit/s CPU   MBit/s CPU        MBit/s CPU                  MBit/s CPU
=================================================================================
tcl 8.3.3   612    35%   518    65%        460    100%                 535    75%
tcl 8.4     612    35%   508    75%        420    100%                 535    75%
tcl 8.5.2   612    35%   500    70%        370    100%                 525    75%
(I have no idea why the plain-TCL MBit/s numbers are higher now than the
week before. I recompiled my TCL 8.3.3 wish main (but not TCL itself),
but the others are the same).
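A fileevent-based measurement like the one described above might look roughly like this. It is a sketch in core Tcl only, not the script that produced the numbers; the proc names (mbitPerSec, onReadable, report, measure) and the globals are hypothetical, and the MBit/s formula is the one from the script in the first post.

```tcl
# Hypothetical names: mbitPerSec, onReadable, report, measure,
# ::bytes, ::then, ::done.

# Convert a byte count over an interval in milliseconds to MBit/s,
# using the 1024*1024 convention of the script in the first post.
proc mbitPerSec {bytes ms} {
    expr {8.0 * $bytes / ($ms / 1000.0) / 1024 / 1024}
}

# fileevent handler: drain the channel and add to the running total.
proc onReadable {fd} {
    incr ::bytes [string length [read $fd]]
    if {[eof $fd]} {
        close $fd
        set ::done 1
    }
}

# Print and reset the statistic, then re-arm for the next second.
proc report {} {
    set now [clock clicks -milliseconds]
    set ms [expr {$now - $::then}]
    if {$ms > 0} {
        puts [format "%d Bytes in %.3f s, %.1f MBit/s" \
            $::bytes [expr {$ms / 1000.0}] [mbitPerSec $::bytes $ms]]
    }
    set ::then $now
    set ::bytes 0
    after 1000 report
}

# Connect, configure the channel as in the test, and run until EOF.
proc measure {host port} {
    set ::bytes 0
    set ::then [clock clicks -milliseconds]
    set fd [socket $host $port]
    fconfigure $fd -buffersize 1000000 -translation binary -blocking off
    # With iocpsock, -recvmode would additionally be set on the channel here.
    fileevent $fd readable [list onReadable $fd]
    after 1000 report
    vwait ::done
}
```

With the address from the first post this would be started as `measure 192.168.1.220 12345`.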
So with iocpsock and zero-byte, I get what I need in terms of raw I/O.
I was confused by the 'switch' in IocpSetRecvMode() which sets a
SO_RCVBUF of the channel buffer size for IOCP_RECVMODE_ZERO_BYTE, but
one of 0 for IOCP_RECVMODE_FLOW_CTRL. If I also set non-zero SO_RCVBUF
for flow-controlled (basically combining IOCP_RECVMODE_FLOW_CTRL and
IOCP_RECVMODE_ZERO_BYTE in IocpSetRecvMode()), I get better performance
of flow-controlled (also 612MBit/s, though it uses 70%CPU instead of
35% in zero-byte).
In addition I played with the IOCP_RECV_BUFSIZE for burst-detection:
Table headers are IOCP_RECV_BUFSIZE/recvcap/recvbufcap:
-recvmode   4096/250/500   32768/30/60   65536/15/30
            MBit/s CPU     MBit/s CPU    MBit/s CPU
====================================================
tcl 8.3.3   460    100%    610    61%    545    90%
tcl 8.4     420    100%    550    75%    480    90%
tcl 8.5.2   370    100%    530    80%    480    90%
I also tried (1024*1024)/2/4, but that eventually stopped receiving data
(not checked why).
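To compare these settings, it can help to look at how many bytes each one allows to be outstanding. The sketch below assumes that every posted or queued buffer is exactly IOCP_RECV_BUFSIZE bytes, as described earlier in the thread for burst-detection; burstBytes is a hypothetical helper, not part of iocpsock.

```tcl
# Hypothetical helper: returns {posted queued} byte caps, assuming each
# concurrent WSARecv and each waiting buffer is exactly bufsize bytes.
proc burstBytes {bufsize recvcap recvbufcap} {
    list [expr {$bufsize * $recvcap}] [expr {$bufsize * $recvbufcap}]
}

# The three configurations from the table plus the (1024*1024)/2/4 attempt.
foreach cfg {{4096 250 500} {32768 30 60} {65536 15 30} {1048576 2 4}} {
    foreach {bufsize recvcap recvbufcap} $cfg break
    foreach {posted queued} [burstBytes $bufsize $recvcap $recvbufcap] break
    puts [format "%7d/%3d/%3d  posted %8d  queued %8d" \
        $bufsize $recvcap $recvbufcap $posted $queued]
}
```

Under that assumption the three table columns all cap at roughly 1 MB posted and 2 MB queued (1024000/2048000, 983040/1966080, 983040/1966080), so the differences in the table come from the buffer size rather than from the total amount of buffering. The 1M variant doubles both caps (2097152/4194304).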
BTW, David, what do you think of allowing the buffer size for -recvmode
burst-detection as an additional third parameter of the fconfigure call
instead of hardcoding it? I see some advantage in larger buffers if the
data rate is high (32k performs best in the table above).
-recvmode burst-detection ?recvcap? ?recvbufcap? ?buffersize?(default 4k)
R'
Turns out that with mode zero-byte there is a gotcha in that you need to
configure the buffersize *before* the recvmode, since the buffersize
determines the SO_RCVBUF buffersize, and that one determines the
performance.
So
fconfigure $fd -recvmode zero-byte
fconfigure $fd -buffersize 1000000
uses 4k SO_RCVBUFs and stumbles, where
fconfigure $fd -buffersize 1000000
fconfigure $fd -recvmode zero-byte
uses 1M SO_RCVBUFs and flies. ;-)
(These go sequential since I really call them from C level via Tcl_SetChannelOption().)
R'
Well, the issue really is: why would you do that?
If you're passing "8.3" to the InitStubs function always, you should only
use 8.3 features AFAIK. TCL_VERSION is the best way of doing it IMO. Tcl
wasn't designed for backloading as I believe others have stated before.
Not that there are a lot of functions that really use this. There are
something like 2 or 3 API functions that use stubs for compatibility in
Tcl. Tk uses stubs more, and it has had some incompatible API changes, as
well as ABI changes. Stubs can help with the ABI changes though, for the
most part, but they add cost in terms of indirection for every API function
call.
-George
I'm glad you caught the chicken/egg problem about the change. I should
have mentioned that it doesn't know the buffersize of the channel until
the channel is created.
Cause it's too hard to build 8.3 with the current VC++ differences since
VC++5.0 was common.
> If you're passing "8.3" to the InitStubs function always, you should only
> use 8.3 features AFAIK.
But that's just it, I am only using the API as it existed for 8.3. I
could possibly use Tcl_SetChannelError that came around in 8.5 I think,
but I don't have the need to use anything special beyond the POSIX errors.
> TCL_VERSION is the best way of doing it IMO.
Not fully true. But this is the accepted way. Take for example the
namespace API changes that happened in 8.5(?): they moved public. By
compiling against 8.5 with TCL_VERSION for the version in
Tcl_InitStubs(), you have a binary that can only load in 8.5 or greater.
If you compiled it against 8.4, the binary can load in 8.4 or greater.
And if you are really creative in your source, you can compile against
8.5, yet load into 8.4 after a bit of #ifdef work to reroute the
namespace stuff back to the private table.
> Tcl
> wasn't designed for backloading as I believe others have stated before.
That's what everyone tells me. I don't know why. I look at where it
falters and adapt my code to make it work.
Just because others like yourself say "you shouldn't do that", doesn't
mean that I can't for my stuff.
In the meantime I also ran the code on a dual core AMD, but the results
are similar: zero-byte is best in terms of CPU usage. Interestingly, on
the dual core (AMD 2x2GHz), all three modes were able to handle the
612MBit load, with CPU usage zero-byte lowest and burst-detection
highest. On my hyperthreaded single core Pentium-4 2.8GHz, only
zero-byte was fast enough (also with lowest CPU).
| Overlapped-I/O is really all about giving the kernel your buffers to
| use its zero-copy feature. Tcl's channels don't really work that way
| as it gives the Tcl_DriverInputProc the allocated space to write to
| then, not before.
Yup. The flow-controlled and burst-detection sound good ("delegate the
hard work of I/O"), but in practice the advantage is lost due to the
necessary copying later in the Tcl_DriverInputProc ('double' I/O
required, once into the receive buffer and again to the TCL buffer).
In that respect zero-byte is better since it writes only once into the
final destination. The actual I/O does not seem to be the bottleneck yet.
R'
And I'm really glad you didn't listen to the others ;-)
R', *still* 8.3.3 after all those years...