
Best messaging model for threaded producer/consumer in win32


Alan R. White, Dec 28, 1999

Hi, I've not much in-depth coding experience on NT, so a quick question on
how I should port my code from unix.

The code follows a standard model where a number of producer threads fill
buffers and place messages on a fifo|circular|ring buffer from which one or
more consumer threads take the data, tag it and send it across the network
to a server app.

In the Windows way of doing things, should I just use the built-in messaging
for queueing up the buffers? The actual message length is short, a couple of
pointers and so on. On unix I abandoned the o/s memory handling routines,
allocated all the memory I needed up front and managed the pools, queues,
mutexes etc. in the code. After some research I can't find a way of doing
mutexes and conditions as neatly (the way I'm used to) on NT (probably my
problem, not NT's). It seems as if just using the built-in messaging might
make my problems go away - the message rate will hopefully be very high - is
this an issue for the system as a whole?

Any guidance appreciated
-- Al

Felix Kasza [MVP], Dec 28, 1999

Al,

> In the windows way of doing things should I just use the
> in built messaging for queueing up the buffers?

If you are referring to the WM_foo / PostMessage() / GetMessage() stuff,
I recommend against it -- you'd have to manage an additional chunk of
memory for every message to keep your data in, and there are faster
solutions anyway.

> On unix I abandoned the o/s memory handling routines,
> allocated all the memory I needed up front and managed
> the pools, queues, mutex etc in the code.

On NT, you want to minimise user/kernel mode transitions. That means
you prefer critical sections, or one of the alternatives discussed in a
recent thread in here (Slava, help me out here -- thanks). Your desire
for speed also means that you don't want to use NT's Heap*() allocator,
except to get large-ish blocks to suballocate. (I still haven't taken
the time to play with Emery Berger's Hoard allocator, but being
optimised for speed in multi-threaded environments sounds like just the
ticket for you.)
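
[Sketch only, not code from this thread: one way the "large-ish blocks to
suballocate" idea can look -- a fixed-size message pool carved out of a single
HeapAlloc() block, protected by a critical section. All names here are made up
for illustration.]

#include <windows.h>

typedef struct MsgNode { struct MsgNode *next; } MsgNode;

typedef struct MsgPool
{
    CRITICAL_SECTION lock;      /* user-mode lock; rarely contended */
    MsgNode *freeList;          /* singly linked list of free slots */
    void *block;                /* the one large allocation         */
} MsgPool;

int PoolInit(MsgPool *p, SIZE_T slotSize, SIZE_T slotCount)
{
    SIZE_T i;
    char *mem;

    if (slotSize < sizeof(MsgNode)) slotSize = sizeof(MsgNode);
    mem = (char *)HeapAlloc(GetProcessHeap(), 0, slotSize * slotCount);
    if (mem == NULL) return 0;

    InitializeCriticalSection(&p->lock);
    p->block = mem;
    p->freeList = NULL;
    for (i = 0; i < slotCount; i++)     /* thread all slots onto the free list */
    {
        MsgNode *n = (MsgNode *)(mem + i * slotSize);
        n->next = p->freeList;
        p->freeList = n;
    }
    return 1;
}

void *PoolGet(MsgPool *p)               /* O(1), no kernel call on the fast path */
{
    MsgNode *n;
    EnterCriticalSection(&p->lock);
    n = p->freeList;
    if (n != NULL) p->freeList = n->next;
    LeaveCriticalSection(&p->lock);
    return n;
}

void PoolPut(MsgPool *p, void *buf)
{
    MsgNode *n = (MsgNode *)buf;
    EnterCriticalSection(&p->lock);
    n->next = p->freeList;
    p->freeList = n;
    LeaveCriticalSection(&p->lock);
}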

--

Cheers,
Felix.

If you post a reply, kindly refrain from emailing it, too.
Please consider migrating to microsoft.public.platformsdk.*
where the MS folks plan on hanging around. See you there!

Slava M. Usov, Dec 29, 1999


Felix Kasza [MVP] <fel...@mvps.org> wrote in message
news:386a2890....@207.46.180.25...

> Al,
>
> > In the windows way of doing things should I just use the
> > in built messaging for queueing up the buffers?
>
> If you are referring to the WM_foo / PostMessage() / GetMessage() stuff,
> I recommend against it -- you'd have to manage an additional chunk of
> memory for every message to keep your data in, and there are faster
> solutions anyway.
>
> > On unix I abandoned the o/s memory handling routines,
> > allocated all the memory I needed up front and managed
> > the pools, queues, mutex etc in the code.
>
> On NT, you want to minimise user/kernel mode transitions. That means
> you prefer critical sections, or one of the alternatives discussed in a
> recent thread in here (Slava, help me out here -- thanks).

Indeed, if you're talking "fast", forget mutexes and semaphores. For simple
mutual exclusion locks, Win32 critical sections are what you want; for
things like one-writer/many-readers there are fast implementations, too. If
your code does not depend on timeouts and all threads are running at the
same priority, Stefan Gustafsson's RW lock is very simple and efficient. Its
code was posted to this forum a couple of months ago. BTW, Stefan, do you
have any updates?

[Felix, have I done my homework well? (-: ]

> Your desire
> for speed also means that you don't want to use NT's Heap*() allocator,
> except to get large-ish blocks to suballocate. (I still haven't taken
> the time to play with Emery Berger's hoard allocator, but being
> optimised for speed in multi-threaded environments sounds like just the
> ticket for you.)

It's really very fast, but since the original poster says he's
already implemented all of the bookkeeping and only needs one huge chunk of
memory, this would amount to duplicated effort and wasted resources -- if the
original implementation is *actually* good at managing things, that is.

--

Slava

Please send any replies to this newsgroup.
microsoft.public.win32.programmer.kernel


Stefan Gustafsson, Dec 29, 1999

> same priority, Stefan Gustafsson's RW lock is very simple and efficient. Its
> code was posted to this forum a couple of months ago. BTW, Stefan, do you
> have any updates?

No, I haven't had any time to spare.

I do not think that RW locks will be much use in this scenario though.

I think the best solution here is probably to use a completion port.

Your producer threads can post messages to the completion port using
PostQueuedCompletionStatus and the consumer threads can retrieve the posted
information using GetQueuedCompletionStatus.

Note that when using PostQueuedCompletionStatus, the system does not assign
any meaning to the three parameters. They are simply passed on to
GetQueuedCompletionStatus.

The beauty of using completion ports is that they will automatically manage
the number of running threads in an optimal way.

You might even be able to rewrite the application so it only uses one set of
threads. If you use asynchronous IO, you can have IO completion packets
posted directly to the completion port. The thread reading from the port can
then get the raw data, assemble it into a complete message and, when the
message is ready, simply make a blocking call to send it over the
network to the server. This will give maximum performance since it avoids
unnecessary context switches.
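
[Sketch only, not code from this thread: the bare PostQueuedCompletionStatus /
GetQueuedCompletionStatus hand-off described above, with the completion key
carrying the message pointer. Function names are made up; error handling is
omitted.]

#include <windows.h>

static HANDLE g_port;

void QueueInit(void)
{
    /* A port with no file handle attached works as a thread-safe queue;
       the last argument caps the number of threads released concurrently
       (0 = number of CPUs). */
    g_port = CreateIoCompletionPort(INVALID_HANDLE_VALUE, NULL, 0, 0);
}

void Produce(void *msg, DWORD len)      /* producer side */
{
    /* The system assigns no meaning to the three values; here the
       completion key is simply the buffer pointer. */
    PostQueuedCompletionStatus(g_port, len, (ULONG_PTR)msg, NULL);
}

void *Consume(DWORD *len)               /* consumer side; blocks until work arrives */
{
    ULONG_PTR key;
    OVERLAPPED *ov;

    if (!GetQueuedCompletionStatus(g_port, len, &key, &ov, INFINITE))
        return NULL;
    return (void *)key;
}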

/SG


Felix Kasza [MVP], Dec 29, 1999

Stefan,

> I think the best solution here is probably to use
> a completion port.

... but it might require a major rewrite. I was under the impression
that Alan was looking for a drop-in replacement for whatever
synchronisation objects he is using right now.

Slava M. Usov, Dec 29, 1999


Felix Kasza [MVP] <fel...@mvps.org> wrote in message
news:386a649b....@207.46.180.25...

> Stefan,
>
> > I think the best solution here is probably to use
> > a completion port.
>
> ... but it might require a major rewrite. I was under the impression
> that Alan was looking for a drop-in replacement for whatever
> synchronisation objects he is using right now.

Actually, I believe, we might get a good insight by comparing the
performance of both the highly optimized circular buffer, running entirely
in user mode, and the design with a completion port... for educational
purposes, because I'm not at all convinced that a completion port can
out-perform a cleverly coded user-mode queue.

Alan R. White, Dec 29, 1999


[snip]

> > ... but it might require a major rewrite. I was under the impression
> > that Alan was looking for a drop-in replacement for whatever
> > synchronisation objects he is using right now.
>
> Actually, I believe, we might get a good insight by comparing the
> performance of both the highly optimized circular buffer, running entirely
> in user mode, and the design with a completion port... for educational
> purposes, because I'm not at all convinced that a completion port can
> out-perform a cleverly coded user-mode queue.

Sounds OK by me if you guys will help me out with measuring the stuff and
getting going with this unfamiliar territory (completion ports - I'll read
up on them next).

As suggested along the way, if I could drop in a simple replacement for the
stuff I'm used to I'll then take up the challenge of recoding. It's pretty
silly talking about this in indirect terms, so here's the code at the heart of
it all - all code quality comments welcome too....it's all pretty standard
stuff though, deliberately avoiding direct mutex lock spins at all times -
more intelligent code would contain a spin-and-back-off algorithm (slated
for 28 versions' time ;-)

typedef struct S_qmBuff
{
    unsigned stVersion;     /* version number */
    unsigned FileId;        /* assigned by mapper */
    unsigned SeqNo;         /* Message Sequencing */
    unsigned Action;        /* File Begin|Write|End */
    char *Data;
    unsigned DataLen;
} qmBuff;

#define qmBuffVersion 1

typedef struct S_Queue_t
{
    pthread_mutex_t qLock;        /* lock for access to queue */
    pthread_cond_t qPopulated;    /* for when q was empty */
    unsigned qPopWaitCount;       /* number of 'get' waits */
    pthread_cond_t qFreeEntry;    /* for when q was full */
    unsigned qEntryWaitCount;     /* number of 'put' waits */
    int qHead;                    /* Pointer to start of queue */
    int qFullCount;               /* How many entries in use */
    int qDepth;                   /* number of entries in queue */
    qmBuff **qBuffer;             /* Pointer to array */
} Queue_t;


/* qmGet: Generic method for obtaining a buffer from a queue
*/
void qmGet(Queue_t *Q, qmBuff **Buffer)
{
    unsigned wait_count = 0;

    pthread_mutex_lock(&Q->qLock);
    while (Q->qFullCount == 0)
    {
        wait_count += 1;
        pthread_cond_wait(&Q->qPopulated, &Q->qLock);
    }

    /* gather some stats on contention for a free buffer */
    Q->qPopWaitCount += wait_count;

    *Buffer = Q->qBuffer[Q->qHead];

    Q->qHead = ( Q->qHead + 1 ) % Q->qDepth;
    Q->qFullCount -= 1;

    /* We've removed a buffer from the queue so tell any waiters there's
       a buffer free */
    pthread_cond_signal(&Q->qFreeEntry);
    pthread_mutex_unlock(&Q->qLock);
}

void qmPut(Queue_t *Q, qmBuff *Buffer)
{
    unsigned wait_count = 0;

    pthread_mutex_lock(&Q->qLock);
    while (Q->qFullCount == Q->qDepth)
    {
        wait_count += 1;
        pthread_cond_wait(&Q->qFreeEntry, &Q->qLock);
    }

    /* gather some stats on contention for a free buffer */
    Q->qEntryWaitCount += wait_count;

    Q->qBuffer[(Q->qHead + Q->qFullCount) % Q->qDepth] = Buffer;
    Q->qFullCount += 1;

    /* We've put a buffer on the queue so tell any waiters there's a
       buffer to be read */
    pthread_cond_signal(&Q->qPopulated);
    pthread_mutex_unlock(&Q->qLock);
}


Chris Becke, Dec 29, 1999

I heard a rumour that I cannot substantiate that SQL 7 is as fast as it is
because the designers replaced the use of IO completion ports (rife with
synchronization objects) with a design involving the use of fibres. As
fibers are co-operatively multitasked (i.e. they must yield control manually)
there can be a much lower synchronization object overhead.

hmmm, not my area of experience though :( They sound like fun...
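
[Sketch only, not from this thread: the basic fiber calls being referred to.
A thread converts itself to a fiber, creates a worker fiber, and switches
between the two explicitly -- scheduling is entirely manual, so no
synchronisation object is needed for the hand-off.]

#include <windows.h>

static LPVOID g_mainFiber;

static VOID CALLBACK WorkerFiber(LPVOID param)
{
    for (;;)
    {
        /* ... do a slice of work on 'param' ... */
        SwitchToFiber(g_mainFiber);     /* yield back explicitly */
    }
}

int main(void)
{
    LPVOID worker;

    g_mainFiber = ConvertThreadToFiber(NULL);    /* current thread becomes a fiber */
    worker = CreateFiber(0, WorkerFiber, NULL);  /* 0 = default stack size */

    SwitchToFiber(worker);   /* run the worker until it yields back */
    SwitchToFiber(worker);   /* ... and again; nothing runs unless switched to */

    DeleteFiber(worker);
    return 0;
}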

Chris.
--

"Slava M. Usov" <stripit...@usa.net> wrote in message
news:uh4g7sjU$GA.204@cppssbbsa05...


>
> Felix Kasza [MVP] <fel...@mvps.org> wrote in message
> news:386a649b....@207.46.180.25...
> > Stefan,
> >
> > > I think the best solution here is probably to use
> > > a completion port.
> >

> > ... but it might require a major rewrite. I was under the impression
> > that Alan was looking for a drop-in replacement for whatever
> > synchronisation objects he is using right now.
>
> Actually, I believe, we might get a good insight by comparing the
> performance of both the highly optimized circular buffer, running entirely
> in user mode, and the design with a completion port... for educational
> purposes, because I'm not at all convinced that a completion port can
> out-perform a cleverly coded user-mode queue.
>

Felix Kasza [MVP], Dec 29, 1999

Slava,

> I'm not at all convinced that a completion port can
> out-perform a cleverly coded user-mode queue.

If you have concurrency control for your user-mode queue, and if you
don't need kernel transitions to fill/drain the queue (i.e., you don't
talk to the outside world), then I am pretty certain that a pure
user-mode solution is faster. But what use is a server that is deaf and
dumb? How do you handle a worker that _has_ to block? Or should we,
instead of blocking, start an overlapped I/O with a completion routine
that enqueues the result (sort of "here you are, go on now")?

Felix Kasza [MVP], Dec 29, 1999

Chris,

> I heard a rumour that I cannot substantiate that SQL 7
> is as fast as it is because the designers replaced the use
> of IO completion ports [...] with a design involving the
> use of fibres.

This would come as a major surprise to me -- the big win when moving up
from SQL Server 4.x was that fibres went out the window ... I think
I'll have to ask around, this is interesting.

Felix Kasza [MVP], Dec 29, 1999

Alan,

the quickest way to adapt this code would be using critical sections.
Like pthreads sync objects, they _usually_ run in user mode, unless the
critsec is already claimed, in which case they wait on a kernel object
(a semaphore). On SMP systems, you get the spin-a-while for free, too.
You will have to replace the pthread_cond_wait() calls, however.
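
[Sketch only, not from this thread: the mutex half of the port, with a
CRITICAL_SECTION standing in for pthread_mutex_t. The condition variables
still need separate treatment, as noted above.]

#include <windows.h>

CRITICAL_SECTION qLock;                   /* was: pthread_mutex_t qLock; */

void LockInit(void)
{
    /* InitializeCriticalSectionAndSpinCount(&qLock, 4000) could be used
       instead to choose an explicit spin count on SMP machines. */
    InitializeCriticalSection(&qLock);
}

void Lock(void)   { EnterCriticalSection(&qLock); }   /* pthread_mutex_lock   */
void Unlock(void) { LeaveCriticalSection(&qLock); }   /* pthread_mutex_unlock */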

Alan R. White, Dec 29, 1999

Felix et al.,

> You will have to replace the pthread_cond_wait() calls, however.
>

Thanks. Looks like I can replace the mutex stuff with
WaitForSingleObject/ReleaseMutex calls or critical sections - easy enough
but I can't find a drop-in for the condition variable paradigm.

Guess a little rethinking is required on how to do this under win32. Perhaps
something hokey with interlocked variables......

Fibers look like a version +38 thing - scheduling is something I'll leave
the o/s to do for now. Interesting though.

Cheers
-- Al

Phil McRevis, Dec 29, 1999

"Alan R. White" <a...@tipper.demon.co.uk> spake the secret code
<946509931.2038.0...@news.demon.co.uk> thusly:

>Guess a little rethinking is required on how to do this under win32. Perhaps
>something hokey with interlocked variables......

An alternative technique would be to use an off-the-shelf POSIX
pthreads library for Win32.
--
http://www.xmission.com/~legalize Legalize Adulthood!
lega...@xmission.com
``Ain't it funny that they all fire the pistol, <URL: http://
at the wrong end of the race?''--PDBT www.eden.com/~thewho>

Slava M. Usov, Dec 30, 1999


Felix Kasza [MVP] <fel...@mvps.org> wrote in message
news:386e7d89....@207.46.180.25...

> Alan,
>
> the quickest way to adapt this code would be using critical sections.
> Like pthreads sync objects, they _usually_ run in user mode, unless the
> critsec is already claimed, in which case they wait on a kernel object
> (a semaphore). On SMP systems, you get the spin-a-while for free, too.
> You will have to replace the pthread_cond_wait() calls, however.

To replace pthread_cond_wait(), you need to replace the qPopulated and
qFreeEntry by manual-reset events, and do something like:

[assuming Q->qLock is a CRITICAL_SECTION, and

pthread_mutex_lock(&Q->qLock);

is replaced by

EnterCriticalSection(&Q->qLock);
]

LeaveCriticalSection(&Q->qLock);
WaitForSingleObject(Q->qPopulated, INFINITE);
EnterCriticalSection(&Q->qLock);

and replace

pthread_cond_signal(&Q->qPopulated);

with

PulseEvent(Q->qPopulated);

Slava M. Usov, Dec 30, 1999


Felix Kasza [MVP] <fel...@mvps.org> wrote in message
news:386b7a8b....@207.46.180.25...

> Slava,
>
> > I'm not at all convinced that a completion port can
> > out-perform a cleverly coded user-mode queue.
>
> If you have concurrency control for your user-mode queue, and if you
> don't need kernel transitions to fill/drain the queue (i.e., you don't
> talk to the outside world), then I am pretty certain that a
> pure-user-mode is faster. But what use is a server that is deaf and
> dumb? How do you handle a worker that _has_ to block? Or should we,
> instead of blocking, start an overlapped I/O with a completion routine
> that enqueues the result (sort of "here you are, go on now")?

Felix, really, I didn't mean to say that "IOCPs are *always* a waste of
time, both coding and execution wise, so avoid them like a plague". It
seemed to me, that in this *particular* case their unconditional use is,
well, arguable. I have a strong suspicion that blindly replacing the queue
with an IOCP will result in sub-optimal performance.

As you said, IOCPs are beneficial when threads have to block... when they
don't, IOCPs aren't really necessary. So I assume both producers and
consumers do block... producers block to get some data and consumers block
to dispatch some data... if we replace the queue with an IOCP, a producer
thread will look like:

for( ;; )
{
    // ...
    ReadFile();                     // or whatever, block here
    PostQueuedCompletionStatus();
    // ...
}

and a consumer will look like

for( ;; )
{
    // ...
    GetQueuedCompletionStatus();
    WriteFile();                    // block here
    // ...
}

while with *proper* use of IOCP one would have a neutral producer/consumer
thread like:

for( ;; )
{
    GetQueuedCompletionStatus();
    if( completed == read )
    {
        WriteFile();    // post async write and keep running
    }
    else if( completed == write )
    {
        ReadFile();     // post async read and keep running
    }
    else
    {
        // do something else
    }
}

Do you see what I mean?

Felix Kasza [MVP], Dec 30, 1999

Slava,

> To replace pthread_cond_wait(), [...] manual-reset events,
> and do something like:

If you are going to wait on a kernel object anyway, you might as well
ditch the critsec for a mutex and call the (atomic)
SignalObjectAndWait() ...

Felix Kasza [MVP], Dec 30, 1999

Slava,

> Do you see what I mean?

Yes, I do. One thing that would worry me, though, is the queue depth on
the IOCP. Does it have limitations? I don't know of any ... and if the
producer runs at full speed (unlikely in a scenario involving I/O and a
two-way protocol, but here?), what happens? Does PQCS() block on
reaching a certain queue depth? Or will it merrily exhaust whatever
memory resources it can lay its hands on?

Ziv Caspi, Dec 30, 1999

On Wed, 29 Dec 1999 23:31:26 -0000, "Alan R. White"
<a...@tipper.demon.co.uk> wrote:
[...]

>but I can't find a drop-in for the condition variable paradigm.

Try Doug Schmidt's ACE implementation of condition variables
(you can find it by doing a Google search on "doug schmidt ace
condition variable"). I haven't used that one, but it looks
like what you're looking for.

HTH

---------------------------------------------
Ziv Caspi
zi...@netvision.net.il

Stefan Gustafsson, Dec 30, 1999

Felix,


I have done some practical experimentation with PQCS().

It seems that the queue for an IOCP does not have any practical limits.

What happens is that it uses up all the *nonpaged pool*; when my program
has allocated 50M from the nonpaged pool, PQCS returns failure. It does NOT
block.

Of course, using up all the nonpaged pool on the machine is a bad idea, so
if it is possible that the producers will run faster than the consumers, you
would have to add some other means of throttling. (Perhaps just a counter
that keeps track of how many items are queued, and a simple Sleep() in the
writer when the number gets too high).
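
[Sketch only, not from this thread: the "counter plus Sleep()" throttle just
described, to keep fast producers from exhausting nonpaged pool. The name
QUEUE_HIGH_WATER and its value are assumptions.]

#include <windows.h>

#define QUEUE_HIGH_WATER 10000

static HANDLE g_port;        /* the completion port used as the queue */
static LONG   g_queued;      /* approximate number of items in flight */

void ThrottledPost(void *msg)
{
    /* Crude back-pressure: sleep while the backlog is too deep. */
    while (g_queued >= QUEUE_HIGH_WATER)
        Sleep(1);

    InterlockedIncrement(&g_queued);
    PostQueuedCompletionStatus(g_port, 0, (ULONG_PTR)msg, NULL);
}

void *ThrottledGet(void)
{
    DWORD bytes;
    ULONG_PTR key;
    OVERLAPPED *ov;

    if (!GetQueuedCompletionStatus(g_port, &bytes, &key, &ov, INFINITE))
        return NULL;
    InterlockedDecrement(&g_queued);
    return (void *)key;
}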

Anyway, I have also measured the performance of IOCP. It is possible to pass
about 100,000 messages per second using multiple writers and multiple
readers (266MHz PII).

This is a very respectable number; a user-mode queue would need very careful
coding to be better. For example, the original UNIX code posted here used
mutual exclusion among writers. If this was implemented the same way on NT
using a critsec, that would mean a context switch for each
posted message (assuming that the writers can run at full speed). This would
give very much lower performance.

I think the only way to do better than an IOCP is to use a separate queue
for each writer and careful user-mode synchronization. This results in much
more complicated coding.

/SG

Felix Kasza [MVP] <fel...@mvps.org> wrote in message

news:386bf41c....@207.46.180.25...

Stefan Gustafsson, Dec 30, 1999


Slava M. Usov <stripit...@usa.net> wrote in message
news:O22TYPmU$GA...@cppssbbsa02.microsoft.com...

>
> Felix Kasza [MVP] <fel...@mvps.org> wrote in message

I totally agree that IOCPs are most useful if readers and writers block. I
would even extend that to: IOCPs are fast enough if your threads are
performing any system calls at all (including non-blocking IO)

The only reason to use an optimized user-mode queue is if your threads
perform their processing entirely in user mode.

I would just like to add some more about the *proper* use of IOCP

First of all, it is perfectly OK for the thread to block while it is sending
out the data. Alan said that he was sending data over a network. This can
easily be done with blocking IO. (Of course, you can use async IO if you
want, but it is more complicated)

Second, I would start a new async read as soon as a previous read is
completed. I would not wait until a write was completed. It is not even
certain that there is a 1-1 correlation between completed reads and writes.

Something like this:

ReadFile();    // async read to start the process

for( ;; )
{
    GetQueuedCompletionStatus();
    if( completed == read )
    {
        // Let us assume that you need to assemble several
        // incoming messages to a single outgoing message
        AddDataToMessage();    // Add data to output message

        ReadFile();            // Start a new async read.

        if (MessageComplete)
            SendMessage();     // Do whatever you want to send the message
                               // (blocking is OK)
    }
}


/SG


Tomas Restrepo, Dec 30, 1999

Stefan,

> What happens is that it uses up all the *nonpaged pool*; when my program
> has allocated 50M from the nonpaged pool, PQCS returns failure. It does
> NOT block.

That's interesting.... Personally, I'd prefer PQCS() blocked.


>
> Of course, using up all the nonpaged pool on the machine is a bad idea, so
> if it is possible that the producers will run faster than the consumers, you
> would have to add some other means of throttling. (Perhaps just a counter
> that keeps track of how many items are queued, and a simple Sleep() in the
> writer when the number gets too high).

Hum.... The problem would be then, what would you consider high :) OTOH, the
method would work fine if you are posting the IO packets, but I'm much more
worried about what happens when the system itself queues up the IO
packets.... Sounds like an IOCP server could potentially facilitate a DOS
attack if the programmer wasn't careful enough to limit the number of
connections made....


--
Tomas Restrepo
http://members.xoom.com/trestrep/

Slava M. Usov, Dec 30, 1999


Stefan Gustafsson <s...@nospam.se> wrote in message
news:#2oMQNrU$GA.259@cppssbbsa05...

[...]

> I totally agree that IOCPs are most useful if readers and writers block. I
> would even extend that to: IOCPs are fast enough if your threads are
> performing any system calls at all (including non-blocking IO)
>
> The only reason to use an optimized user-mode queue is if your threads
> perform their processing entirely in user mode.

Exactly. This statement covers some holes in my speculations.

> I would just like to add some more about the *proper* use of IOCP
>
> First of all, it is perfectly OK for the thread to block while it is sending
> out the data. Alan said that he was sending data over a network. This can
> easily be done with blocking IO. (Of course, you can use async IO if you
> want, but it is more complicated)

Then one would have to have a lot more threads, more context switches, more
everything. If this is tolerable, then, well.

> Second, I would start a new async read as soon as a previous read is
> completed. I would not wait until a write was completed. It is not even
> certain that there is a 1-1 correlation between completed reads an writes.

Apparently this depends on the application. My pseudo-code was only to
demonstrate an idea. If we're speaking "network", then it's very likely that
we have to do a number of reads before receiving a complete message... but
I preferred to ignore such details to make the concept and the point of my
concern clear enough.


> Something like this:
>
> ReadFile(); // async read to start the process
> for( ;; )
> {
> GetQueuedCompletionStatus();
> if( completed == read )
> {
> // Let us assume that you need to assemble several
> // incoming messages to a single outgoing message
> AddDataToMessage(); // Add data to output message
>
> ReadFile(); // Start a new async read.
>
> if (MessageComplete)
> SendMessage(); // Do whatever you want to send the message
> (blocking is OK)
> }
> }
>
>
> /SG

--

Slava M. Usov, Dec 30, 1999


Felix Kasza [MVP] <fel...@mvps.org> wrote in message
news:386af346....@207.46.180.25...

> Slava,
>
> > To replace pthread_cond_wait(), [...] manual-reset events,
> > and do something like:
>
> If you are going to wait on a kernel object anyway, you might as well
> ditch the critsec for a mutex and call the (atomic)
> SignalObjectAndWait() ...

I would not do so, because the thread does not *necessarily* wait every time
it enters the get/put routine... if there is a ready item/free slot it merrily
jumps through the rest of the routine.

However, it is indeed necessary to modify the code I dumped yesterday,
because there is a bug - a race condition. When the critsec is released and
just before WaitForSingleObject() is called, there is a [small] possibility
of a producer sneaking in, putting an item, and pulsing the event, so that
the consumer will be waiting while an item is ready for dequeuing. The code
can be trivially modified:

ResetEvent(Q->qPopulated);


LeaveCriticalSection(&Q->qLock);
WaitForSingleObject(Q->qPopulated, INFINITE);
EnterCriticalSection(&Q->qLock);

and PulseEvent() to be replaced by SetEvent().

But this may and will often result in excessive calls of SetEvent() and
ResetEvent(), which I believe must be avoided. This can be circumvented by
adding four more unsigned fields to the queue structure:

unsigned qFreeEntry_state;
unsigned qFreeEntry_waiters;
unsigned qPopulated_state;
unsigned qPopulated_waiters;


*_state must match the state of the corresponding event: if the event is
set, the state var is non-zero, and vice versa.

The code then:

[consumer willing to wait on qPopulated event]

if( Q->qPopulated_state )
{
    ResetEvent(Q->qPopulated);
    Q->qPopulated_state = 0;
}
Q->qPopulated_waiters++;

LeaveCriticalSection(&Q->qLock);
WaitForSingleObject(Q->qPopulated, INFINITE);
EnterCriticalSection(&Q->qLock);

Q->qPopulated_waiters--;

[end consumer wait]

and SetEvent() in the producer:

if( Q->qPopulated_waiters && !Q->qPopulated_state )
{
    SetEvent(Q->qPopulated);
    Q->qPopulated_state = -1;
}

Of course, apparently there should be orthogonal code for qFreeEntry* guys.

Stefan Gustafsson, Dec 30, 1999

> >
> > First of all, it is perfectly OK for the thread to block while it is sending
> > out the data. Alan said that he was sending data over a network. This can
> > easily be done with blocking IO. (Of course, you can use async IO if you
> > want, but it is more complicated)
>
> Then one would have to have a lot more threads, more context switches, more
> everything. If this is tolerable, then, well.

Well, obviously non-blocking IO is the most efficient on NT. My point is
that you do not HAVE to use non-blocking IO when you use IOCPs. Sometimes it
is MUCH easier to write blocking code. The performance difference is often
very small.

As a matter of fact, if you use non-blocking IO only, you need just one
thread per CPU. You then have no need for the advanced features of IOCPs.
One of the main reasons for having IOCPs is that they allow blocking code.

>
> > Second, I would start a new async read as soon as a previous read is
> > completed. I would not wait until a write was completed. It is not even
> > certain that there is a 1-1 correlation between completed reads and writes.
>
> Apparently this depends on the application. My pseudo-code was only to
> demonstrate an idea. If we're speaking "network", then it's very likely that
> we have to do a number of reads before receiving a complete message... but
> I preferred to ignore such details to make the concept and the point of my
> concern clear enough.
>

Sure, nothing wrong with leaving out details in pseudocode. I think your
code probably did a good job of explaining to Alan how IOCPs are *supposed*
to be used.

I just thought your code missed a couple of good points about IOCPs:

1. It is OK to block! IOCPs are designed to handle blocking threads
gracefully. As soon as one worker blocks, another is released from GQCS().
This simplifies the coding.

2. By starting the next read only when a write was completed, your code did not
achieve maximum performance. There is no reason why you would want to delay
reading the next block of data from disk (or whatever) until a network write
had been completed.


/SG


Felix Kasza [MVP], Dec 30, 1999

Tomas,

> Sounds like an IOCP server could potentially facilitate a DOS
> attack

Nope -- in the classic IOCP case (a TCP-based server), received packets
outside the current receive window are discarded by the IP stack.

Felix Kasza [MVP], Dec 30, 1999

Stefan,

> It seems that the queue for a IOCP does not have any practical limits.

Thanks for your experiment -- I had planned on doing something similar,
but my problem is the Copious Free Time, as usual.

If I did the math right, then your result comes out to an average of
just over 2,500 cycles per (enqueue plus dequeue). I agree that it
would be hard to do better by hand -- and hard enough to not be worth my
while; it's cheaper to buy faster hardware.

Ziv Caspi, Dec 30, 1999


Stefan Gustafsson <s...@nospam.se> wrote in message
news:#DO7Q8qU$GA.78@cppssbbsa05...

> Anyway, I have also measured the performance of IOCP. It is possible to pass
> about 100,000 messages per second using multiple writers and multiple
> readers (266MHz PII).
>
> This is a very respectable number; a user-mode queue would need very careful
> coding to be better. For example, the original UNIX code posted here used
> mutual exclusion among writers. If this was implemented the same way on NT
> using a critsec, that would mean a context switch for each
> posted message (assuming that the writers can run at full speed). This would
> give very much lower performance.
>
> I think the only way to do better than an IOCP is to use a separate queue
> for each writer and careful user-mode synchronization. This results in much
> more complicated coding.

Perhaps Doug Schmidt can expand on this. In ACE v5.0, he implemented a
specialized NT version of the message queue (which was originally based on
condition variables) using IOCP. Reportedly, this has been done to reduce the
handle resources required by a queue, but perhaps he has carried out
measurements comparing the two implementations.

Ziv.


Slava M. Usov, Dec 30, 1999


Stefan Gustafsson <s...@nospam.se> wrote in message
news:OuGJPNtU$GA.266@cppssbbsa05...

[...]

> Well, obviously non-blocking IO is the most efficient on NT. My point is
> that you do not HAVE to use non-blocking IO when you use IOCPs. Sometimes it
> is MUCH easier to write blocking code. The performance difference is often
> very small.

Agreed :-)

> As a matter of fact, if you use non-blocking IO only, you need just one
> thread per CPU. You then have no need for the advanced features of IOCPs.

Hmm.. well, using APCs... perhaps... The problem is that even issuing only
async requests, the thread isn't guaranteed to never block. It may very well
block on say a page-fault... although this is very unlikely when the memory
usage is efficient...

Well, you definitely have a point here.

> One of the main reasons for having IOCPs is that they allow blocking code.

Indeed.

> 1. It is OK to block! IOCPs are designed to handle blocking threads
> gracefully. As soon as one worker blocks, another is released from GQCS().
> This simplifies the coding.

Yep.

> 2. By starting the next read only when a write was completed, your code did not
> achieve maximum performance. There is no reason why you would want to delay
> reading the next block of data from disk (or whatever) until a network write
> had been completed.

I'm really sorry if my pseudo-code made this impression. What I tried to
express was: use async reads as much as necessary to get a full message and
then use async writes as much as necessary to dispatch a complete reply. It
does not make sense otherwise, because no server [except telnet, perhaps]
can send a meaningful reply until a full message has been received.

Slava M. Usov, Dec 30, 1999


Tomas Restrepo <tom...@mvps.org> wrote in message
news:#sc$BksU$GA.264@cppssbbsa04...

> Stefan,
> > What happens is that it uses up all the *nonpaged pool*; when my program
> > has allocated 50M from the nonpaged pool, PQCS returns failure. It does
> > NOT block.
>
> That's interesting.... Personally, I'd prefer PQCS() blocked.

A matter of preference :-) I vote for non blocking, because this means more
control.


[...]

> Hum.... The problem would be then, what would you consider high :) OTOH, the
> method would work fine if you are posting the IO packets, but I'm much more
> worried about what happens when the system itself queues up the IO
> packets.... Sounds like an IOCP server could potentially facilitate a DOS
> attack if the programmer wasn't careful enough to limit the number of
> connections made....

In most cases, an IOCP version of a server is no worse than a "one thread per
client" version, because most of the time IOCP servers have only one read
and/or write posted per socket. The only way to develop a DOS attack is to
establish and maintain a lot of full-blown TCP connections, and, I guess, a
"one thread per client" server will go down *much* faster under such conditions.

But take Felix's latest sample on IOCPs: it's invulnerable to such attacks,
although there is a chance that the attacker will consume all of the sockets,
thus mostly achieving the same effect from the viewpoint of legitimate
users. Note that maintaining an "unlimited" number of sockets is only
marginally better, because:

1. There are only 64K ports possible.
2. Incidentally, a process can have no more than 64K handles.

So the number of sockets is always *less* than 64K.

It's possible to fight this DOS by "scavenging" the full list of sockets,
killing every one that has been idle for too long.

Unless, of course, the attacker follows the application protocol and issues
meaningful transactions over the TCP connection [like sending NOOP over
POP3/SMTP and HEAD / over HTTP]. In this case, it's an issue of higher-level
security policy...

Tomas Restrepo, Dec 30, 1999

Felix,

>
> > Sounds like an IOCP server could potentially facilitate a DOS
> > attack
>
> Nope -- in the classic IOCP case (a TCP-based server), received packets
> outside the current receive window are discarded by the IP stack.

Thanks for clarifying that!

Felix Kasza [MVP], Dec 30, 1999

Slava,

tomasr> That's interesting.... Personally, I'd prefer PQCS() blocked.
> A matter of preference :-) I vote for non blocking [...]

And I vote for a flag switching between the two. :-)

Stefan Fleischmann, Dec 30, 1999

Slava wrote:
> > A matter of preference :-) I vote for non blocking [...]

Felix wrote:
> And I vote for a flag switching between the two. :-)

Slava usually votes with *both* hands, so I'm afraid
you are outvoted, Felix... ;-)

(Sorry for intruding into this thread.)

Stefan


Alan R. White, Dec 30, 1999

> Perhaps Doug Schmidt can expand on this. In ACE v5.0, he implemented a
> specialized NT version of the message queue (which was originally based on
> condition variables) using IOCP. Reportedly, this has been done to reduce the
> handle resources required by a queue, but perhaps he has carried out
> measurements comparing the two implementations.
>

Did some research - this is from the ACE v5 release notes

>>>>
ACE_Message_Queue_NT is a new adapter class inherited from
ACE_Message_Queue_Base that implements a message queue using NT's IO
completion ports. As its name implies, this is a specialized version for
NT. It is not as versatile as ACE's general MQ implementation. However, this
implementation consumes fewer NT handles, which in some circumstances may be a
big win (because ACE_Message_Queue uses condition variables and, on NT,
condition variables are simulated and consume 4 handles each.
ACE_Message_Queue_NT only uses one handle and one critical section.)
<<<<

Guess I'll rethink the inter-thread communication from the ground up on NT.
It might be better to rewrite this bit than try to shoe-horn in some compromise
simply because I understand how to code it. I'm sure it will come back and
bite me if I get this bit wrong.

-- Al

Tomas Restrepo, Dec 30, 1999

Slava,

> In most cases, IOCP version of a server is no worse than a "one thread per
> client" version. Because most of the time, IOCP servers have only one read
> and/or write posted per socket. The only way to develop a DOS attack is to
> establish and maintain a lot of full-blown TCP connections, and, I guess, a
> "one thread per client" server will go down *much* faster under such conditions.

OK, perhaps I made a poor choice of words, as I was not specifically talking
about socket based servers. I was more concerned about what could happen
with multiple IOCP based apps running on the same machine (say a socket
server and a DBMS), both with full loads. I'll grant you, this is an
unlikely case, and you'd do much better buying new hardware :)

After trying out Stefan's sample for a while, I did notice it can consume
npp memory quite fast on my machine (which, I'll admit, is an outdated,
obsolete piece of crap <g>). With the out-of-the-box IOCP version, it takes
up about 90% of the npp. Removing the semaphore lock causes the application
to exit in about 10 seconds, after consuming 100% of the npp (around
17.5MB).

Curiously enough, running two simultaneous copies of the sample brings NPP
consumption to 100%, too, but both apps run to completion, even maintaining
around 90% of the throughput of the single instance (and together only
consume around 12.5MB of the NPP).

Tomas Restrepo, Dec 30, 1999

> Slava usually votes with *both* hands, so I'm afraid
> you are outvoted, Felix... ;-)

No, no, no... In that case, I'll side with Felix, and have both options
(I'll even take two sets of PQCS(), one blocking, one non-blocking)!

And Stefan... Why feel sorry? We're allowed to have some fun every once in a
while, aren't we? <g> (and besides, your contributions to these groups are
very much appreciated ...at least by me; I learned quite a lot from Slava's
and your last discussion!).

Tomas Restrepo, Dec 30, 1999

Stefan,
>(Extending them to
> buffer a small struct instead of a single int is left as an exercise :-) )

Diligent reader that I am (yeah right <g>) here's a template-based version
(albeit, untested) that can queue pointers to whatever struct you wish:

template <
    class T,          // type of buffer entries
    int BufferSize    // max number of entries in the queue
> class IOCPBuffer
{
public:
    IOCPBuffer() :
        m_free(BufferSize)
    {
        m_hPort = CreateIoCompletionPort(INVALID_HANDLE_VALUE, 0, 0, 0);
        m_count = 0;
    }
    void put ( T* x )
    {
        // Limit the number of elements in the buffer
        m_free.wait();
        BOOL ok = PostQueuedCompletionStatus(m_hPort, 0, (ULONG_PTR)x,
                                             (LPOVERLAPPED)0);
        assert ( ok );
        InterlockedIncrement ( &m_count );
    }
    T* get ( )
    {
        DWORD bytes;
        OVERLAPPED* ov;
        T* x;
        GetQueuedCompletionStatus(m_hPort, &bytes, (ULONG_PTR*)&x,
                                  &ov, INFINITE);
        InterlockedDecrement ( &m_count );
        m_free.signal();
        return x;
    }
private:
    HANDLE    m_hPort;
    long      m_count;
    Semaphore m_free;   // Counts the number of free slots in the buffer
};
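
[The template above assumes a Semaphore class from Stefan's sample, which is
not reproduced in this thread. A minimal stand-in over the Win32 semaphore --
this is the "simple" kernel-semaphore variant described later in the thread,
not necessarily the class Stefan actually used -- could look like this:]

#include <windows.h>

class Semaphore
{
public:
    explicit Semaphore(long initial)
    {
        // wait() takes a slot, signal() returns one; max count = initial.
        m_h = CreateSemaphore(0, initial, initial, 0);
    }
    ~Semaphore()  { CloseHandle(m_h); }
    void wait()   { WaitForSingleObject(m_h, INFINITE); }
    void signal() { ReleaseSemaphore(m_h, 1, 0); }
private:
    HANDLE m_h;
};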

Tomas Restrepo
tom...@mvps.org


Slava M. Usov, Dec 31, 1999


Stefan Fleischmann <s...@sf-soft.de> wrote in message
news:386BC183...@sf-soft.de...

> Slava wrote:
> > > A matter of preference :-) I vote for non blocking [...]
>
> Felix wrote:
> > And I vote for a flag switching between the two. :-)
>
> Slava usually votes with *both* hands, so I'm afraid
> you are outvoted, Felix... ;-)
>
> (Sorry for intruding into this thread.)

Hehe. If we had some guys from Russian Duma, we would have far more voting
hands than the total number of attendees :-)

Slava M. Usov, Dec 31, 1999


Stefan Gustafsson <s...@nospam.se> wrote in message
news:OyFdSztU$GA.266@cppssbbsa05...

[...]

> If you have a single reader and a single writer, the user-mode buffer can
> deliver about 500k messages/second
> As soon as you add more threads the performance drops drastically, down to
> about 30k messages/sec
>
> The IOCP version is steady at 100k messages regardless of the number of
> threads.

Guess the message is clear :-)

Felix Kasza [MVP], Dec 31, 1999

Stefan,

> Slava usually votes with *both* hands, so I'm afraid
> you are outvoted, Felix... ;-)

I feel underprivileged. To my east, they can vote as many hands as they
manage to raise; to my west, presidents are supplied with cute little
interns; and I sit here, slaving over a keyboard. :-(

Alan R. White, Jan 1, 2000

Thanks for all the pointers. It's all up and running now - a little in-depth
reading after what I learned in this discussion came up with the following
generic solution. The compiled and tested code is on my other machine but
here's what I did in the end (btw if you want to see the code let me know
and I'll get it posted).

Posix mutex = Win32 Mutex
Posix condition = Win32 Event

Instead of pthread_mutex_t and pthread_cond_t, use HANDLE.
Use CreateMutex for the mutex and CreateEvent for the conditions.
Instead of pthread_mutex_lock use WaitForSingleObject
Instead of pthread_cond_wait use SignalObjectAndWait(qLock,qPopulated,..)
but then get the mutex again using WaitForSingleObject as the big difference
is that SignalObjectAndWait does not reacquire the mutex.
Instead of pthread_cond_signal use SetEvent and obviously ReleaseMutex
replaces pthread_mutex_unlock.
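
[Sketch only, not Alan's actual posted code: qmGet() rewritten along the lines
of the mapping above, assuming the Queue_t fields qLock, qPopulated and
qFreeEntry are now HANDLEs (a mutex and two auto-reset events). Note the
re-wait loop, since SignalObjectAndWait() does not reacquire the mutex and the
event may already have been consumed by another consumer.]

void qmGet(Queue_t *Q, qmBuff **Buffer)
{
    WaitForSingleObject(Q->qLock, INFINITE);          /* pthread_mutex_lock   */

    while (Q->qFullCount == 0)
    {
        /* atomically release the mutex and wait for "queue populated" */
        SignalObjectAndWait(Q->qLock, Q->qPopulated, INFINITE, FALSE);
        WaitForSingleObject(Q->qLock, INFINITE);      /* take the mutex back  */
    }

    *Buffer = Q->qBuffer[Q->qHead];
    Q->qHead = (Q->qHead + 1) % Q->qDepth;
    Q->qFullCount -= 1;

    SetEvent(Q->qFreeEntry);                          /* pthread_cond_signal  */
    ReleaseMutex(Q->qLock);                           /* pthread_mutex_unlock */
}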

I was a tad concerned about the scalability after reading all the comments
about IOCPs etc. My reading indicates that the WaitForSingleObject will not
cause a context switch etc if the mutex is available so not a major problem.
I couldn't determine if it spins for a while or not but it certainly chews
CPU using this as the core of a simple file copy program (just a test rig -
overlapped IO would be better for such a small task).

I still need to check out how condition variables are implemented in ACE as
ACE carries a warning about them being simulated and therefore expensive.

Happy 1st of the century
-- Al

Alan R. White <a...@tipper.demon.co.uk> wrote in message
news:946414667.24322.0...@news.demon.co.uk...
> Hi, I've not much in-depth coding experience on NT, so a quick question on
> how I should port my code from unix.
>
> The code follows a standard model where a number of producer threads fill
> buffers and place messages on a fifo|circular|ring buffer from which one or
> more consumer threads take the data, tag it and send it across the network
> to a server app.
>
> In the Windows way of doing things, should I just use the built-in messaging
> for queueing up the buffers? The actual message length is short, a couple of
> pointers and so on. On unix I abandoned the o/s memory handling routines,
> allocated all the memory I needed up front and managed the pools, queues,
> mutexes etc. in the code. After some research I can't find a way of doing
> mutexes and conditions as neatly (the way I'm used to) on NT (probably my
> problem, not NT's). It seems as if just using the built-in messaging might
> make my problems go away - the message rate will hopefully be very high - is
> this an issue for the system as a whole?
>
> Any guidance appreciated
> -- Al

Felix Kasza [MVP], Jan 1, 2000

Al,

> I was a tad concerned about the scalability after reading
> all the comments about IOCPs etc.

You have to realise that some of the participants in this discussion --
I, for one -- are slightly anal-retentive when it comes to scaling. If,
for example, operation X costs 1 msec of CPU time, then a server using
it would have a hard upper limit of 1000 requests per second per CPU.
For some of the things we do, this is Not Good.

> My reading indicates that the WaitForSingleObject will not
> cause a context switch etc if the mutex is available [...]

If a context switch is necessary, it is because a wait is required. In
that case, the overhead is not worth bothering about, compared to the
wait plus ctx switch. We were more discussing the cost of the
transition to kernel-mode (and back) for the immediate-success case.

> I couldn't determine if it spins for a while or not but it
> certainly chews CPU using this as the core of a simple file
> copy program

On single-CPU systems, it doesn't, because (barring an interrupt routine
mixing things up) a spin would keep the current owner from executing and
from eventually releasing the synchronisation object. On SMP systems, I
*think* it doesn't spin either (the critsec, which does spin, is a user-mode
construct) -- but don't hold me to that.

Tomas Restrepo, Jan 1, 2000

Felix,

> > My reading indicates that the WaitForSingleObject will not
> > cause a context switch etc if the mutex is available [...]
>
> If a context switch is necessary, it is because a wait is required. In
> that case, the overhead is not worth bothering about, compared to the
> wait plus ctx switch. We were more discussing the cost of the
> transition to kernel-mode (and back) for the immediate-success case.

FWIW, I spent a long time experimenting with Stefan's sample, and putting in
some code of my own, and I did find one interesting thing:
One of the reasons Stefan's IOCP-based buffer gave such great performance
was because the Semaphore class he used exhibited an interesting behavior: It
never blocked threads until the queue itself was filled up. That meant that
it could return more quickly, not calling WaitForSingleObject(), when the
queue was not full. After a while I decided to try out a simpler semaphore
class, which simply used the semaphore in an inverse way (simple calls to
WFSO() and RS() with a semaphore created with the maximum count set to the
depth of the queue), and found out that in that case, the queue never has a
chance to fully fill up. In fact, the queue never went past a couple hundred
elements, and even then, the read/write count was about 1000 or 2000
messages less than with Stefan's class (around 27000 reads per second).

Felix Kasza [MVP], Jan 1, 2000

Tomas,

> One of the reasons Stefan's IOCP-based buffer gave
> such great performance was because the Semaphore
> class he used posed an interesting behavior

For even more fun, make a few small changes:

eliminate all mentions of m_free (the semaphore). As expected, this has
not-so-good effects on the queue depth, especially if you have more
writers than readers ... but now, add a line to the startup of reader():

SetThreadPriority( GetCurrentThread(), READER_PRIORITY_BUMP );

Set the number of writers to whatever you like (it makes no difference)
and watch the number of messages per second go up by 40% or more. The
queue doesn't overflow as long as the reader keeps going; as long as the
IOCP still has something when reader() calls GQCS(), the queue will be
emptied before it can refill. SMP machines present a more complex
picture, but as long as at least (number_of_CPUs) of the IOCP threads
remain unblocked, the queue will never fill.

Tomas Restrepo, Jan 1, 2000

Felix,

> eliminate all mentions of m_free (the semaphore). As expected, this has
> not-so-good effects on the queue depth, especially if you have more
> writers than readers ... but now, add a line to the startup of reader():
>
> SetThreadPriority( GetCurrentThread(), READER_PRIORITY_BUMP );

Huhh... Sorry if I'm being thick, but I can't find mention of
READER_PRIORITY_BUMP anywhere... Do you mean THREAD_PRIORITY_ABOVE_NORMAL? If
so, I did try it, and the message count does go up. However, I also tried it
_without_ removing the semaphore calls, and performance doesn't seem to
suffer much from it.

Felix Kasza [MVP], Jan 2, 2000

Tomas,

> I can't find mention of READER_PRIORITY_BUMP anywhere

Sorry -- think any small positive value of your liking here. +1 will do
nicely.

Slava M. Usov, Jan 2, 2000

Alan R. White <a...@tipper.demon.co.uk> wrote in message
news:946745100.20207.0...@news.demon.co.uk...

[...]

> Instead of pthread_cond_wait use SignalObjectAndWait(qLock,qPopulated,..)
> but then get the mutex again using WaitForSingleObject as the big difference
> is that SignalObjectAndWait does not reacquire the mutex.

Note that after WaitForSingleObject() has returned, you can *not* assume
that the qPopulated event is still set, so you have to spin. Instead of
SignalObjectAndWait(), I'd recommend ReleaseMutex() and
WaitForMultipleObjects(); but then again, I'd recommend dropping the mutex
altogether and going for a critical section instead... want some performance
figures?

Entering/leaving a critical section is about 80 clock times on PII;
Entering/leaving a mutex is about 2000 clock times on PII.

Yes, that's true, 2000 clocks without any waiting, context switching, etc...
just the pure overhead of transitioning to kernel mode and going back... now
take into account that you'll have to enter/leave the mutex on both enqueue
and dequeue, thus the total *minimum* en/de overhead is 4000 clocks, then
consider that you will occasionally have to wait on and set/reset events, which
are roughly the same overhead. Compare that to 2500 clocks per en/de
achievable with IOCPs... or 500 clocks with light-weight semaphores...
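
[Sketch only, and an assumption about what "light-weight semaphores" could
mean here -- not code from this thread: a counted semaphore that stays in user
mode on the uncontended path and only touches the kernel object when a thread
actually has to wait or be woken.]

#include <windows.h>
#include <limits.h>

typedef struct
{
    LONG   count;       /* > 0: free slots; <= 0: -count threads waiting */
    HANDLE kernelSem;   /* used only on the contended path */
} LwSem;

void LwSemInit(LwSem *s, LONG initial)
{
    s->count = initial;
    s->kernelSem = CreateSemaphore(NULL, 0, LONG_MAX, NULL);
}

void LwSemWait(LwSem *s)
{
    if (InterlockedDecrement(&s->count) < 0)       /* no free slot: block */
        WaitForSingleObject(s->kernelSem, INFINITE);
}

void LwSemSignal(LwSem *s)
{
    if (InterlockedIncrement(&s->count) <= 0)      /* someone is waiting: wake one */
        ReleaseSemaphore(s->kernelSem, 1, NULL);
}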

[...]

> I was a tad concerned about the scalability after reading all the comments
> about IOCPs etc.

Figuring what to use, IOCP or [light-weight] mutexes, is up to you, because
there are surely factors that we're not aware of. If the role of the queue,
in your application, is to control throughput and you have lots of
consumers, which do or don't do blocking operations, then you probably
should consider IOCPs seriously; if, contrarily, you have only two or so
consumers that practically never block, think about fast exclusion
mechanisms. On the other hand, if you have already implemented a
throughput/concurrency control mechanism, and are not going to modify it,
then fast exclusion is again what you want to use.

[...]

> I still need to check out how condition variables are implemented in ACE as
> ACE carries a warning about them being simulated and therefore expensive.

As mentioned above, you can emulate condition variables with
WaitForMultipleObjects(); just feed an array consisting of the mutex and the
condition(s). The problem is that the mutex will not be automatically reset,
and you will have to call ReleaseMutex() additionally, and this results in
more overhead plus some potential problems coming from the lack of
atomicity...

Stefan Gustafsson, Jan 2, 2000

Alan

Well, what you have described is a straight-forward translation of your
UNIX-code to NT. There is nothing wrong with that, except that it does not
take NTs special properties into account.

I have not tested your approach, but I know from experience that a buffer
implemented in the manner you describe will never be able to pass more than
about 10k messages per second. (The main problem is that if you have several
threads competing for a mutex, you will get a context switch for each
message)

The IOCP-based buffer I posted recently achieves more than 100k messages/sec.
That means that it consumes 10 times less CPU than your solution.

As Felix so correctly pointed out, I might be a little anal-retentive when
it comes to performance, but you asked for the best approach, and I feel it
is a pity that you settle for an inferior solution.

By the way, please post the code. I can modify my sample to include your
buffer for direct comparison.

/SG

Alan R. White wrote in message
<946745100.20207.0...@news.demon.co.uk>...


>Thanks for all the pointers. It's all up and running now - a little in-depth
>reading after what I learned in this discussion came up with the following
>generic solution. The compiled and tested code is on my other machine but
>here's what I did in the end (btw if you want to see the code let me know
>and I'll get it posted)
>
>Posix mutex = Win32 Mutex
>Posix condition = Win32 Event
>
>Instead of pthread_mutex_t and pthread_cond_t, use HANDLE.
>Use CreateMutex for the mutex and CreateEvent for the conditions.
>Instead of pthread_mutex_lock use WaitForSingleObject
>Instead of pthread_cond_wait use SignalObjectAndWait(qLock,qPopulated,..)
>but then get the mutex again using WaitForSingleObject as the big difference
>is that SignalObjectAndWait does not reacquire the mutex.
>Instead of pthread_cond_signal use SetEvent and obviously ReleaseMutex
>replaces pthread_mutex_unlock.
>
>I was a tad concerned about the scalability after reading all the comments
>about IOCPs etc. My reading indicates that the WaitForSingleObject will not
>cause a context switch etc if the mutex is available so not a major problem.
>I couldn't determine if it spins for a while or not but it certainly chews

Stefan Gustafsson, Jan 3, 2000

Felix,

Interesting experiment.

Increasing the priority of the reader might work well in this sample, but it
is not a general-purpose solution. (I am sure that you know this, I just
want to make it clear to everybody else)

Without the m_free semaphore, the queue will fill up as soon as data is
produced faster than it can be consumed. The only way to prevent this is to
add some means of blocking the producers (writers in my sample).

Also, when I increase the reader priority, the performance stays almost
exactly the same. I do NOT see the 40% speedup you talk about. I run on a
single PII machine, did you test it on a SMP box?

An interesting point is that if you watch the number of context
switches/second using perfmon, you will see that with the elevated reader
priority, you get a context switch for each message. It is quite amazing
that these context switches obviously cost nothing in performance. It is a
clear indicator that IOCPs are heavily optimized in the OS.

If you use a critsec or a mutex and you get a context switch for each
message, the performance drops drastically.

/SG

Felix Kasza [MVP] <fel...@mvps.org> wrote in message

news:386e5d51....@207.46.180.25...


> Tomas,
>
> > One of the reasons Stefan's IOCP-based buffer gave
> > such great performance was because the Semaphore
> > class he used posed an interesting behavior
>
> For even more fun, make a few small changes:
>

> eliminate all mentions of m_free (the semaphore). As expected, this has
> not-so-good effects on the queue depth, especially if you have more
> writers than readers ... but now, add a line to the startup of reader():
>
> SetThreadPriority( GetCurrentThread(), READER_PRIORITY_BUMP );
>

> Set the number of writers to whatever you like (it makes no difference)
> and watch the number of messages per second go up by 40% or more. The
> queue doesn't overflow as long as the reader keeps going; as long as the
> IOCP still has something when reader() calls GQCS(), the queue will be
> emptied before it can refill. SMP machines present a more complex
> picture, but as long as at least (number_of_CPUs) of the IOCP threads
> remain unblocked, the queue will never fill.
>

Felix Kasza [MVP], Jan 3, 2000

Stefan,

> Also, when I increase the reader priority, the performance
> stays almost exactly the same. I do NOT see the 40% speedup
> you talk about.

I ran (and just re-ran) my tests on a single-CPU machine (Xeon). The
speedup is due to the complete elimination of m_free (and its associated
kernel transitions); increasing the reader priority just makes sure that
the queue will get emptied faster than it fills up. Oh, and you are
right about IOCPs being optimised; I was a bit surprised myself at how
well this worked.

Alan R. White, Jan 3, 2000

> Note that after WaitForSingleObject() has returned, you can *not* assume
> that the qPopulated event is still set, so you have to spin.

I just posted the code - note that I re-acquire the mutex immediately after
SOAW() completes and loop around until I get qPopulated. (makes the whole
process even less efficient going on your cycle figures).

> SignalObjectAndWait(), I'd recommend ReleaseMutex() and
> WaitForMultipleObjects(); but then again, I'd recommend dropping the mutex
> altogether and going for a critical section instead... want some performance
> figures?
>

Understood - the ability to post user completion packets to IOCPs looks like the
thing to use. I am going to have to get my head around how to control queue
full and empty conditions and have waiters hanging around those conditions.
Maybe I'll work out the potential race conditions with the suggested
critical sections and WFMO() approach for this cut - I suspect that the reduced
(O/S-implemented) atomicity in the approach will lead to racing.

Is there an instruction or other counters I can wrap around the calls to get
the kind of info you have? (is GetTickCount() good enough?)

I also have plans to introduce high and low water marks on these types of
queues - under some runtime conditions you get a thread just waiting for the
next free (or full) entry - this is not good for keeping tape drives
streaming - a far better approach is to block and release larger lumps of
data together or offer some complex self-tuning algorithm to attain the best
average data throughput.

Interesting stuff to do but as with all development you have to leave
certain functionality to later cuts otherwise nothing ever gets out there
;-)

Cheers
-- Al

Slava M. Usov, Jan 3, 2000

Alan R. White <a...@tipper.demon.co.uk> wrote in message
news:946910756.17883.0...@news.demon.co.uk...

> > Note that after WaitForSingleObject() has returned, you can *not* assume
> > that the qPopulated event is still set, so you have to spin.
>
> I just posted the code - note that I re-acquire the mutex immediately after
> SOAW() completes and loop around until I get qPopulated. (makes the whole
> process even less efficient going on your cycle figures).

If you're calling WaitForSingleObject() in the first place, you might as well
call WaitForMultipleObjects(), so the entire

WaitForSingleObject(&Q->qLock, INFINITE);

while (Q->qFullCount == 0)
{
    wait_count += 1;
    SignalObjectAndWait(&Q->qLock, &Q->qPopulated, INFINITE, FALSE);
    WaitForSingleObject(&Q->qLock, INFINITE);
}

will get replaced by

WaitForMultipleObjects(2, Q->qGetGadget, TRUE, INFINITE);

where qGetGadget is HANDLE[2], populated by qLock and qPopulated.

Overhead of WFMO is the same as WFSO, but this is a single call, no spinning
and other costly calls.

Note that both qPopulated and qFreeEntry must be made manual reset to make
everything work, and also you will need to replace

Q->qFullCount -= 1;

by

if( --Q->qFullCount == Q_FULL_LOW_MARK )
    ResetEvent(Q->qPopulated);

if( Q->qDepth - Q->qFullCount == Q_FREE_HIGH_MARK )
    SetEvent(Q->qFreeEntry);

In qmPut(), replace

WaitForSingleObject(&Q->qLock, INFINITE);

while (Q->qFullCount == Q->qDepth)
{
    wait_count += 1;
    SignalObjectAndWait(&Q->qLock, &Q->qFreeEntry, INFINITE, FALSE);
    WaitForSingleObject(&Q->qLock, INFINITE);
}

by

WaitForMultipleObjects(2, Q->qPutGadget, TRUE, INFINITE);

where qPutGadget is an array holding qLock and qFreeEntry,

and replace

Q->qFullCount += 1;

by

if( ++Q->qFullCount == Q_FULL_HIGH_MARK )
    SetEvent(Q->qPopulated);

if( Q->qDepth - Q->qFullCount == Q_FREE_LOW_MARK )
    ResetEvent(Q->qFreeEntry);


This way, the spinning is avoided, race conditions are eliminated, kernel
calls are minimized, plus the low/high marks are maintained [caution here:
if low/high marks are not 0/1 there is a potential problem with unfolding
the queue at termination, so you'll have to give it more thinking].
Unfortunately, wait_count is not maintained, so you may want to replace it
by direct clock/time measuring, see below.

> Is there an instruction or other counters I can wrap around the calls to get
> the kind of info you have? (is GetTickCount() good enough?)

GetTickCount() is only accurate to some 18 ms; QueryPerformanceCounter() has the
same accuracy on most machines, and greater accuracy on SMP, but it has huge
overhead so you should avoid it. Use inline assembly to issue the RDTSC
instruction, which returns the number of clock cycles passed since startup.
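
[Sketch only, not from this thread: reading the cycle counter with VC++ inline
assembly. On later compilers the __rdtsc() intrinsic from <intrin.h> does the
same job without inline assembly.]

unsigned __int64 ReadCycleCounter(void)
{
    unsigned __int64 tsc;
    __asm {
        rdtsc                           ; EDX:EAX = cycles since reset
        mov dword ptr [tsc], eax
        mov dword ptr [tsc + 4], edx
    }
    return tsc;
}

/* Typical use: bracket the operation under test and subtract, e.g.
       unsigned __int64 t0 = ReadCycleCounter();
       EnterCriticalSection(&cs);
       LeaveCriticalSection(&cs);
       cycles = ReadCycleCounter() - t0;
*/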

> Interesting stuff to do but as with all development you have to leave
> certain functionality to later cuts otherwise nothing ever gets out there
> ;-)

Understood.
