Sample using winsock2 IOCompletionPort and OVERLAPPED io

Markus Ruettimann

Oct 1, 2003, 7:28:51 AM

Hi all
I'm currently trying to write a small socket lib that uses
IOCompletionPort mechanics. I'm making some progress, but it is very hard
to dig through the Win32 API. The samples I found are not very useful.
Does anybody know a good C++ sample that demonstrates these techniques?

Thanks Markus

--
Using M2, Opera's revolutionary e-mail client: http://www.opera.com/m2/

David Gravereaux

Oct 1, 2003, 8:32:50 AM

Markus Ruettimann <to...@bluewin.ch> wrote:

>Hi all
>I'm currently trying to write a small socket lib that uses
>IOCompletionPort mechanics. I'm making some progress, but it is very hard
>to dig through the Win32 API. The samples I found are not very useful.
>Does anybody know a good C++ sample that demonstrates these techniques?
>
>Thanks Markus

http://cvs.sourceforge.net/viewcvs.py/iocpsock/iocpsock/

It's in straight C, though. Could be some good reading. Start in
ws2tcp.c, then proceed to iocpsock_lolevel.c. All the other ws2*.c files
are unfinished; I can't do message-oriented sockets because the model
this code fits into only works with stream types. See where the completion
thread is started in InitializeIocpSubSystem(), and then
CompletionThreadProc() for all the guts of the completion routine.

The note about 'One thread per CPU' is false. I only need to use one
thread, process wide. I couldn't find a benefit in using more than one
thread as I don't use the completion thread(s) to drive my application.
It runs a tight loop and is just a producer. The rest of the application
is a consumer to what this produces. I found that any calls to
EnterCriticalSection() in HandleIo() are somewhat expensive... The less
locking required, the better. YMMV.
--
David Gravereaux <davy...@pobox.com>
[species: human; planet: earth,milkyway(western spiral arm),alpha sector]

Len Holgate

Oct 1, 2003, 4:34:23 PM

> I'm currently trying to write a small socket lib that uses
> IOCompletionPort mechanics. I'm making some progress, but it is very hard
> to dig through the Win32 API. The samples I found are not very useful.
> Does anybody know a good C++ sample that demonstrates these techniques?

I've written some articles on this and these all come with sample code that
provides a C++ framework for an IOCP based server. The code is intended to
be a reasonably good starting point around 80% of the time - so it may not
be an exact fit for what you need, but we usually find we can customise it
reasonably easily for special situations: less locking, more active
connections, higher throughput, etc.

Anyway, this is what the articles are about; hope you find them useful.

1) A reusable socket server class using IO Completion Ports and WinSock 2
Writing a high performance server that runs on Windows NT and uses sockets
to communicate with the outside world isn't that hard once you dig through
the API references. What's more most of the code is common between all of
the servers that you're likely to want to write. It should be possible to
wrap all of the common code up in some easy to reuse classes.
http://www.jetbyte.com/portfolio-showarticle.asp?articleId=37&catId=1&subcatId=2

2) Business logic processing in a socket server
To maintain performance a socket server shouldn't make any calls that could
block from its IO thread pool. In this article we develop a business logic
thread pool and add this to the server developed in the previous article.
http://www.jetbyte.com/portfolio-showarticle.asp?articleId=38&catId=1&subcatId=2

3) Speeding up socket server connections with AcceptEx
When a server has to deal with lots of short lived client connections it's
advisable to use the Microsoft extension function for WinSock, AcceptEx(),
to accept connections. Creating a socket is a relatively "expensive"
operation and by using AcceptEx() you can create the socket before the
connection occurs rather than as it occurs, thus speeding the establishment
of the connection. What's more, AcceptEx() can perform an initial data read
at the same time as doing the connection establishment which means you can
accept a connection and retrieve data with a single call.
http://www.jetbyte.com/portfolio-showarticle.asp?articleId=39&catId=1&subcatId=2

4) Handling multiple pending socket read and write operations
"How do you handle the problem of multiple pending WSARecv() calls?" is a
common question on the Winsock news groups. It seems that everyone knows
that it's often a good idea to have more than one outstanding read waiting
on a socket and everyone's equally aware that sometimes code doesn't work
right when you do that. This article explains the potential problems with
multiple pending recvs.
http://www.jetbyte.com/portfolio-showarticle.asp?articleId=44&catId=1&subcatId=2

5) Testing socket servers with C#
When you're developing a TCP/IP server application it's easy to test it
poorly. In this article we develop a test framework that does most of the
hard work for you.
http://www.jetbyte.com/portfolio-showarticle.asp?articleId=46&catId=4&subcatId=11

The latest code updates are available here:
http://www.lenholgate.com/archives/000082.html and here:
http://www.lenholgate.com/archives/000088.html

Enjoy.

--
Len Holgate - http://www.lenholgate.com
JetByte Limited - http://www.jetbyte.com
The right code, right now.
Contract Programming and Consulting Services.


SenderX

Oct 1, 2003, 5:12:29 PM

> I've written some articles on this and these all come with sample code that
> provides a C++ framework for an IOCP based server. The code is intended to
> be a reasonably good starting point around 80% of the time - so it may not
> be an exact fit for what you need, but we usually find we can customise it
> reasonably easily for special situations: less locking, more active
> connections, higher throughput, etc.

It's a very straightforward socket server.

I recommend it.


> http://www.lenholgate.com/archives/000082.html

How ya doin Len?

The move to a hashed-lock system is a massive improvement. I have been
pushing people in this group to use one. Now they can see it in action. You
also seem to have added outgoing connections, NICE. Your server ( or peer I
should say ) has greatly improved!
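
For anyone who has not met the term: in the general sense, a hashed-lock scheme keeps a small fixed pool of locks and hashes each object (a per-socket structure, say) to one of them, rather than giving every object its own lock or funnelling everything through one global lock. The sketch below shows only the slot selection, in portable C. The pool size, the bit mixing, and the name lock_slot_for are illustrative assumptions and are not taken from Len's framework.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical pool size; a power of two so the slot can be taken with a mask. */
#define LOCK_POOL_SIZE 64u

/*
 * Map an object's address to a slot in a fixed pool of locks.
 * The same address always yields the same slot, so every thread that
 * touches a given object ends up contending on the same lock, while
 * unrelated objects usually land on different locks.
 */
static size_t lock_slot_for(const void *obj)
{
    uintptr_t key = (uintptr_t)obj;

    /* Heap addresses are aligned, so the lowest bits carry little
     * information; fold some higher bits down before masking. */
    key ^= key >> 4;
    key ^= key >> 12;
    return (size_t)(key & (LOCK_POOL_SIZE - 1u));
}
```

On Windows the pool itself could be an array of CRITICAL_SECTION objects initialised once at startup, with a caller entering the one at index lock_slot_for(perSocketData) before touching that socket's state. Two objects can share a slot, so code holding one pooled lock should avoid taking a second one.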

=P


P.S.

Remember that old AppCore code I sent you? Did that ever crash on you? That
code barely even resembles the new stuff I am creating...

The peer code I am working on now, has the most streamlined and efficient,
mostly lock-free, scheduling system I have ever written.

It uses cohort-scheduling:

http://www.usenix.org/events/usenix02/larus.html

You should really take a look at this, Len. It will improve your server's
performance, for sure.

--
The designer of the experimental, SMP and HyperThread friendly, AppCore
library.

http://AppCore.home.comcast.net


SenderX

Oct 1, 2003, 8:22:16 PM

> It runs a tight loop and is just a producer.

An IOCP thread is a consumer of IOCP completions.

Just don't block in an IOCP thread.

If you need to block, produce the IOCP completion to another thread-pool.


> The less
> locking required, the better. YMMV.

No doubt!

=P

Len Holgate

Oct 2, 2003, 3:52:37 AM

> How ya doin Len?

Pretty good :)

> The move to a hashed-lock system is a massive improvement. I have been
> pushing people in this group to use one. Now they can see it in action.
> You

Our production servers have always used one; I just never got around to
updating the article.

> also seem to have added outgoing connections, NICE. Your server ( or peer
> I should say ) has greatly improved!

The outgoing connection thing was something we never needed before. Then we
had a client come along who needed their servers to talk to one another. It
was an easy addition and means we can use the framework for client side
stuff too, which is handy for us.

> Remember that old AppCore code I sent you? Did that ever crash on you?

Don't think so, but I didn't play with it that much.

> The peer code I am working on now, has the most streamlined and efficient,
> mostly lock-free, scheduling system I have ever written.

Cool.

> You should really take a look at this, Len. It will improve your server's
> performance, for sure.

Once we get a client who needs us to improve the performance I'll take a
look. So far the framework seems to fit most of our clients' needs and I
don't have the time to improve it just for the sake of it (unfortunately).

David Gravereaux

Oct 2, 2003, 7:56:48 AM

"SenderX" <x...@xxx.xxx> wrote:

>> It runs a tight loop and is just a producer.
>
>An IOCP thread is a consumer of IOCP completions.

The thread that runs the completion routine is a producer to the upper
layer. Therefore, I call it a producer. Yeah, it consumes from
winsock... whatever... Is an MPEG decoder a consumer from a file or a
producer to the framing routine? Both.. trick question.

>Just don't block in an IOCP thread.

What I mean to say is that if you use the threads that block in
GetQueuedCompletionStatus on the completion port to then drive the rest
of your protocol after a WSARecv comes in, and subsequently your whole
application, then having multiple threads wait on the completion port is
probably a 'good thing'.

But if you're like me and all you do is sit there sucking winsock dry and
moving those buffers to a number of linkedlists and notifying an upper
layer that will eventually come back with another few threads to flush off
those buffers, multiple threads waiting on the completion port have no
benefit.

>If you need to block, produce the IOCP completion to another thread-pool.

I just don't see a benefit to using threads in that model. If all that the
single thread servicing the completion port does is move buffers around
and replace I/O calls, there's no need to add management for the
out-of-order problem. And even with the out-of-order shuffling added, if
you use multiple threads on GQCS you still have to lock a linked list and
own it for the shuffling, which blocks the other threads you're using from
it. That isn't much concurrency when a common resource is in the way:

thread#1: GQCS -> CR, handle WSARecv ---\
thread#2: GQCS -> CR, handle WSARecv --\_\__ common instream linkedlist
thread#3: GQCS -> CR, handle WSARecv --/ /
thread#4: GQCS -> CR, handle WSARecv ---/

Yikes.. This is all that my WSARecv handler does:

    case OP_READ:

        if (infoPtr->flags & IOCP_CLOSING) {
            FreeBufferObj(bufPtr);
            break;
        }

        IocpPushRecvAlertToTcl(infoPtr, bufPtr);

        if (bytes > 0) {
            /*
             * Create a new buffer object to replace the one that just
             * came in, but use a hard-coded size for now until a
             * method to control the receive buffer size exists.
             *
             * TODO: make an fconfigure for this and store it in the
             * SocketInfo struct.
             */

            newBufPtr = GetBufferObj(infoPtr, IOCP_RECV_BUFSIZE);
            PostOverlappedRecv(infoPtr, newBufPtr, 1);
        }
        break;

thread#1andOnly: GQCS -> CR, handle WSARecv -> instream linkedlist
                 ^--------------------------/

If my completion routine did an 'amount' of work that can run in
isolation, I could see how threads could be beneficial. But with it doing
near nothing before returning to the GQCS, I see that multiple threads on
the GQCS would be inefficient for me. Maybe the thread scheduling magic
that GQCS does wouldn't hurt the efficiency, but I really am doing near
nothing before it has to hit a common resource lock.

>> The less
>> locking required, the better. YMMV.
>
>No doubt!

I want some of that Atomic power. I really want to look into it and have
the lists run lock free. How's that going? When the consumer layer reads
the instream, the list gets locked and the completion routine can't fill
it for those moments. That network code of mine is already 10x faster
than the server it runs in, but 20x,30x would be much more impressive ;)

input 3k connections per sec... app 163 per sec

My backqueue fills fast, the new connections may wait a bit, but no
errors... EVER. It's nice having a huge safety factor. AcceptEx in
overlapped mode is beautiful :)

SenderX

Oct 2, 2003, 7:19:14 PM

> The thread that runs the completion routine is a producer to the upper
> layer. Therefore, I call it a producer. Yeah, it consumes from
> winsock... whatever... Is an MPEG decoder a consumer from a file or a
> producer to the framing routine? Both.. trick question.

;)


> But if you're like me and all you do is sit there sucking winsock dry and
> moving those buffers to a number of linkedlists and notifying an upper
> layer that will eventually come back with another few threads to flush off
> those buffers, multiple threads waiting on the completion port have no
> benefit.

You add the buffers to a linked-list after they come in? I don't think that
would solve the out-of-order problem?

Example:


Producer:

WSARecv #1
WSARecv #2
WSARecv #3


Consumer:

WSARecv #3 - Happens to complete first!
WSARecv #1
WSARecv #2


Now your linked-list would end up as follows:

WSARecv #3 -> WSARecv #1 -> WSARecv #2


Atomically add your buffers to the list before you call WSARecv.
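
One way to picture this, as a minimal sketch rather than anyone's actual code: reserve a slot in posting order before each WSARecv is issued, mark the slot ready when its completion arrives, and let the consumer take only from the head, and only while the head is ready. The names recv_ring, ring_post, ring_complete and ring_take, the fixed ring size, and the absence of any locking are all assumptions made for illustration. With more than one thread touching the ring it would need protecting.

```c
#include <stddef.h>

#define RING_SLOTS 8u   /* hypothetical limit on outstanding receives */

struct recv_slot {
    int   ready;        /* set once the receive for this slot has completed */
    void *buf;          /* buffer that was handed to the receive call */
};

struct recv_ring {
    struct recv_slot slot[RING_SLOTS];
    unsigned head;      /* sequence number the consumer takes next */
    unsigned tail;      /* sequence number the next posted receive gets */
};

static void ring_init(struct recv_ring *r)
{
    r->head = 0;
    r->tail = 0;
}

/* Reserve a slot in posting order. Call this BEFORE issuing the receive.
 * Returns 1 and stores the sequence number, or 0 if the ring is full. */
static int ring_post(struct recv_ring *r, void *buf, unsigned *seq_out)
{
    if (r->tail - r->head >= RING_SLOTS)
        return 0;
    *seq_out = r->tail++;
    r->slot[*seq_out % RING_SLOTS].ready = 0;
    r->slot[*seq_out % RING_SLOTS].buf = buf;
    return 1;
}

/* Mark a receive as completed; completions may be seen in any order. */
static void ring_complete(struct recv_ring *r, unsigned seq)
{
    r->slot[seq % RING_SLOTS].ready = 1;
}

/* Hand out the oldest posted buffer only if it has completed, else NULL. */
static void *ring_take(struct recv_ring *r)
{
    struct recv_slot *s = &r->slot[r->head % RING_SLOTS];

    if (r->head == r->tail || !s->ready)
        return NULL;
    r->head++;
    return s->buf;
}
```

If receive #3 completes first, ring_take keeps returning NULL until #1 has completed, so the consumer never sees data out of posting order. The ready flag is also what tells the consumer that a buffer can be read.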

> I want some of that Atomic power. I really want to look into it and have
> the lists run lock free. How's that going?

Read this whole paper:

http://www.cs.tau.ac.il/~shanir/reading%20group/p73-michael.pdf

The x86 instruction-set is 100% compatible with it.

If you have trouble implementing the algo. in C or C++, I will post my full
working code for it.

This is the best lock-free linked-list algo out there. It works SOOOOO good.
From IBM of course!
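
The algorithm in that paper is more than fits in a post, and the following is not it. For the narrower case David describes, where one side only appends buffers and the other side periodically flushes everything, a much simpler lock-free structure may be enough: push with compare-and-swap, and detach the whole list with a single atomic exchange. This sketch assumes a C11 compiler with <stdatomic.h>. The names buf_node, buf_list, buf_push and buf_take_all are hypothetical.

```c
#include <stdatomic.h>
#include <stddef.h>

struct buf_node {
    struct buf_node *next;
    int              id;    /* stand-in for the real buffer payload */
};

struct buf_list {
    _Atomic(struct buf_node *) head;
};

static void buf_list_init(struct buf_list *l)
{
    atomic_init(&l->head, (struct buf_node *)NULL);
}

/* Producer side: link a node at the head without taking a lock. */
static void buf_push(struct buf_list *l, struct buf_node *n)
{
    struct buf_node *old = atomic_load(&l->head);

    do {
        n->next = old;  /* a failed compare-exchange refreshes old */
    } while (!atomic_compare_exchange_weak(&l->head, &old, n));
}

/* Consumer side: detach everything pushed so far in one atomic step and
 * return it oldest-first. Nodes are never popped one at a time, so the
 * ABA hazard of a general lock-free pop does not come up in this scheme. */
static struct buf_node *buf_take_all(struct buf_list *l)
{
    struct buf_node *n = atomic_exchange(&l->head, (struct buf_node *)NULL);
    struct buf_node *fifo = NULL;

    while (n != NULL) { /* the detached chain is newest-first; reverse it */
        struct buf_node *next = n->next;
        n->next = fifo;
        fifo = n;
        n = next;
    }
    return fifo;
}
```

The completion thread would call buf_push for each buffer that comes in, and the consumer would call buf_take_all and then walk the returned chain at its leisure, so neither side waits on the other. On Win32 the same two steps could be written with InterlockedCompareExchangePointer and InterlockedExchangePointer instead of C11 atomics.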

David Gravereaux

Oct 2, 2003, 7:41:43 PM

"SenderX" <x...@xxx.xxx> wrote:

>Now your linked-list would end up as follows:
>
>WSARecv #3 -> WSARecv #1 -> WSARecv #2
>
>
>Atomically add your buffers to the list before you call WSARecv.

How would I know that they're ready to be consumed?

SenderX

Oct 2, 2003, 8:00:40 PM

Dave, guess what...


I found my "old and crusty" AppCore SocketPeer code on an old CD lying
around.

Here it is:

http://appcore.home.comcast.net/old/AppCore_OLD.zip

It works ( hasn't crashed yet! ;), but has not been through the massive bug
testing that I run all my production code through.

I created an Echo Server, and a HTTP/1.0 Proxy Server so you can play around
with them.

It has ZERO lock-free algos, and its resource locking schema has NOT been
optimized.

But I don't remember it crashing on me, so give it a whirl...


P.S.

Take a look at the Workplace.c file and the:

sys_clbk_acOnWorkplaceWork
sys_acEnsureWorkOrder

functions. They will show you how I keep multi-WSARecvs in order. It's a
pretty simple algo I developed for it.

Take a look at the SocketPeer.c file. That's the Winsock code.

The new AppCore that I am working on uses basically the same ordering algo.

The features for this old code of mine are:

Run on WinNT && Win9x

Post multi WSARecv's and WSASend's

Aggressive AcceptEx reposting algo

Post a boatload of outgoing connects

Post a ton of address resolves

Aggressive Send and Recv graceful shutdowns

Aggressive thread-pooling, with on the fly thread creation.

Integrated application callback system

C++ Wrappers

PerSocket to PerSocket communication

PerSocket grouping

And much more!

This is old code. Imagine what the new AppCore SocketPeer can do!

Muuuhahahahahah!

;)
