On 6/30/21 11:36 PM, Ashok wrote:
> Hi David,
>
> Tx again for your interest and input. With respect to your suggestions,
> as a general comment I'll say I don't plan to spend any time on
> changes to iocp (except bug fixes, of course) wrt the TCP component. It
> is fast enough (anywhere from 2-10 times faster than Tcl's socket,
> depending on traffic loads), stable, and I have other fish (projects) to
> fry.
I concur: 2-10 times faster, more robust, and lower resource usage. Have
your cake and eat it too. Complex code, yeah.
> - Regarding zero-byte copies - this is also suggested in the "Network
> Programming for Windows" book (two decades old but still the reference
> for Winsock). I did implement this as a strawman but found no
> measurable difference in either throughput or CPU. My guess is that the
> reduced byte copies are offset by increased kernel transitions (one to
> post the zero-byte read and a second one to then read from the kernel).
> So added complexity for no measurable gain.
The gain is in lowering the per-socket usage of the non-paged global
memory pool. Good for a web server with many thousands of concurrent
connections, I would think. I also had a user of iocpsock who benefited
from this when matching the socket's SO_RCVBUF to the channel buffer
size. (IIRC)
If only for academic reasons, can we have the code do all the modes?
Here's a more detailed description of the memory limitation as it exists
in IOCPsock. If one runs this in a tclsh shell:
C:\WINNT\system32>tclsh84
% for {set i 0} {$i < 10000} {incr i} {socket2 localhost 80}
% foreach s [file channels] {if {[string match iocp* $s]} {close $s}; after 5}
% exit
With those 10,000 sockets connected (no data transfer), you'll see on
the tclhttpd status page that "General pool bytes in use" will be
1,800,516 bytes and "Special pool bytes in use" will be at 41,755,025 bytes.
Using a zero-byte receive algorithm with the same 10,000-socket
connection script, I can lower the "Special pool bytes in use" to a mere
774,472 bytes. ("Special pool" is the non-paged pool.)
https://techcommunity.microsoft.com/t5/windows-blog-archive/pushing-the-limits-of-windows-paged-and-nonpaged-pool/ba-p/723789
> - Regarding critical sections / thread synchronization - I'm not sure if
> you noticed but the channel locking is based on SRWLocks, not critical
> sections unless you specifically compile for XP which does not support
> the former. Nevertheless, I would not think there is really a need to
> avoid critical sections either. Unless there is contention, they do not
> transition to the kernel (which is expensive) and contention is limited
> because there are at most two threads competing for a channel - the Tcl
> thread owning the channel and the iocp thread. Given how much work the
> Tcl channel subsystem has to do outside of the TCP driver itself, I
> really would not expect much contention.
Years back I did some profiling and found that Tcl's event loop polled
event sources at ~1,500 iterations per second, while the completion
thread ran at least ~3,500 per second. So yes, faster than Tcl could
consume.
> - Regarding atomic lock-less lists, I do not know of any that support
> *queues* as opposed to *stacks* (which is what your link points to).
> Moreover, the list manipulation is associated with other sync'ed state
> as well so making the lists lockfree does not really help as access to
> other state information still has to be sync-ed.
I haven't done a deep read yet :) I have to set up a Windows dev
environment so I can have some fun with your work. You see, this right
here is where I left off around 14 years ago. Let me have my fun.
https://docs.microsoft.com/en-us/windows/win32/sync/interlocked-singly-linked-lists
Yeah, that's incomplete. Off-hand, there are quite a few caveats about
using InterlockedCompare64Exchange128() directly. I think I'm starting
to remember why this was challenging.
Off-hand, there are only two "free-running" operations the completion
thread is essentially doing -- returning completed AcceptEx calls to the
pool, and growing the count of outstanding overlapped WSARecv calls in a
burst condition, provided it is under the limit. Shaving a few
microseconds off those is, I feel, worth the effort of profiling across
a range of test conditions.
When it comes down to it, it isn't the normal-case behavior but how the
use of IOCP responds to abuse from a DDoS attack that interests me.
...and then there's UDP and multicast.
...and all the other LSP types such as Bluetooth and IrDA. Who even
uses NetWare anymore? Is it still even there in Windows 10? AppleTalk?
...and other native HANDLE types that support overlapped I/O, as a
complete overhaul of Tcl's I/O underpinnings on Windows.
This is exciting. I gave up coding ages ago, even as a hobby. I'll
come out of retirement for this.