
condvar for win32?


Scott McPhillips [MVP]

Sep 10, 2008, 11:25:38 PM
I would appreciate any recommendations for a reliable condvar implementation
for Win32. I can't use the new Vista built-in version because I need
compatibility with older versions of Windows, especially XP.

Thanks for any leads.

--
Scott McPhillips [VC++ MVP]

David Schwartz

Sep 11, 2008, 2:53:34 AM
On Sep 10, 8:25 pm, "Scott McPhillips [MVP]" <org-dot-mvps-at-
scottmcp> wrote:

> I would appreciate any recommendations for a reliable condvar implementation
> for Win32.  I can't use the new Vista built-in version because I need
> compatibility with older versions of Windows, especially XP.
>
> Thanks for any leads.

I strongly urge you not to do this. Condvars are the POSIX way, they
are just not the Win32 way. It is almost always a mistake to try to
make one platform behave like another at such a low level, and you
don't get much lower than synchronization primitives.

What are you doing with condvars? I can tell you the Win32 way to do
whatever that thing is.

A lot of Windows software is crappy because of minimalistic porting.
Windows is not UNIX. You have to port the code that uses the
primitives, the primitives themselves don't port.

DS

Markus Elfring

Sep 11, 2008, 4:40:25 AM
> I would appreciate any recommendations for a reliable condvar
> implementation for Win32.

Is the approach "http://sources.redhat.com/pthreads-win32/" reliable and
complete enough for you?

Regards,
Markus

Chris M. Thomasson

Sep 14, 2008, 6:35:59 PM
"David Schwartz" <dav...@webmaster.com> wrote in message
news:ac3f474a-16c3-488f...@n33g2000pri.googlegroups.com...

On Sep 10, 8:25 pm, "Scott McPhillips [MVP]" <org-dot-mvps-at-
scottmcp> wrote:

> > I would appreciate any recommendations for a reliable condvar
> > implementation
> > for Win32. I can't use the new Vista built-in version because I need
> > compatibility with older versions of Windows, especially XP.
> >
> > Thanks for any leads.

> I strongly urge you not to do this. Condvars are the POSIX way, they
> are just not the Win32 way. It is almost always a mistake to try to
> make one platform behave like another at such a low level, and you
> don't get much lower than synchronization primitives.

> What are you doing with condvars? I can tell you the Win32 way to do
> whatever that thing is.

Contrived consumer-side code:
______________________________________________________________
pthread_mutex_lock(&m);

for (;;) {
    if (! queuea.is_empty()) {
        if (! queueb.is_empty()) {
            if (global_counter.is_equal_to(5)) {
                ia = queuea.pop();
                ib = queueb.pop();
                counter = global_counter.get();
                global_counter.dec();
                break;
            }
        }
    }
    pthread_cond_wait(&c, &m);
}

assert(ia && ib && counter == 5);

/* process the atomically acquired complex event... */

pthread_mutex_unlock(&m);
______________________________________________________________


How can I efficiently implement that in Windows XP? For the consumer-side I
guess one could do something like:
________________________________________________________________
HANDLE mutex;
HANDLE manual_reset_event;

BOOL process_predicate() {
    item_a* ia = NULL;
    item_b* ib = NULL;
    int counter = 0;
    if (! queuea.is_empty()) {
        if (! queueb.is_empty()) {
            if (global_counter.is_equal_to(5)) {
                ia = queuea.pop();
                ib = queueb.pop();
                counter = global_counter.get();
                global_counter.dec();
                return TRUE;
            }
        }
    }
    return FALSE;
}

for (;;) {
    WaitForSingleObject(mutex, INFINITE);
    if (! process_predicate()) {
        SignalObjectAndWait(mutex, manual_reset_event, INFINITE, FALSE);
        continue;
    }
    ResetEvent(manual_reset_event);
    ReleaseMutex(mutex);
    break;
}
________________________________________________________________


producer code would do something like:
________________________________________________________________
BOOL mutate_predicate();

WaitForSingleObject(mutex, INFINITE);
if (mutate_predicate()) {
    SetEvent(manual_reset_event);
} else {
    ResetEvent(manual_reset_event);
}
ReleaseMutex(mutex);
________________________________________________________________


> A lot of Windows software is crappy because of minimalistic porting.
> Windows is not UNIX. You have to port the code that uses the
> primitives, the primitives themselves don't port.

http://technet.microsoft.com/en-us/interopmigration/bb380242.aspx


Chris M. Thomasson

Sep 14, 2008, 6:56:17 PM
Sorry for the triple post; fuc%ing newsreader! ARGHGHGH!

:^/


"Scott McPhillips [MVP]" <org-dot-mvps-at-scottmcp> wrote in message
news:X7OdnVUhKusmElXV...@comcast.com...


>I would appreciate any recommendations for a reliable condvar
>implementation for Win32. I can't use the new Vista built-in version
>because I need compatibility with older versions of Windows, especially XP.

The most straightforward, reliable and correct condvar impl on Windows would
be one in which you use a bin-sema per-thread and manually construct the
waitsets. The only real caveat is SCHED_OTHER; but Windows is not a real-time
OS anyway, so I don't know how bad SCHED_OTHER actually is... Pseudo-code
would look like:
____________________________________________________________________
struct dlink {
    struct dlink* next;
    struct dlink* prev;
};

struct thread {
    struct dlink cvlink;
    HANDLE event;
    LONG refs; /* = 1 at birth */
};

struct condvar {
    CRITICAL_SECTION mutex;
    struct dlink waitset;
};


void thread_refs_inc(
    struct thread* const this
) {
    InterlockedIncrement(&this->refs);
}


void thread_refs_dec(
    struct thread* const this
) {
    if (! InterlockedDecrement(&this->refs)) {
        /* `this' is in a persistent quiescent state */
    }
}


void condvar_wait(
    struct condvar* const this,
    LPCRITICAL_SECTION umutex
) {
    struct thread* this_thread = TlsGetValue(...);
    EnterCriticalSection(&this->mutex);
    thread_refs_inc(this_thread);
    dlink_push(&this->waitset, &this_thread->cvlink);
    LeaveCriticalSection(&this->mutex);
    LeaveCriticalSection(umutex);
    WaitForSingleObject(this_thread->event, INFINITE);
    thread_refs_dec(this_thread);
    EnterCriticalSection(umutex);
}


void condvar_signal(
    struct condvar* const this
) {
    struct thread* this_thread;
    EnterCriticalSection(&this->mutex);
    this_thread = (struct thread*)dlink_trypop(&this->waitset);
    LeaveCriticalSection(&this->mutex);
    if (this_thread) {
        SetEvent(this_thread->event);
    }
}


void condvar_broadcast(
    struct condvar* const this
) {
    struct thread* this_thread;
    EnterCriticalSection(&this->mutex);
    this_thread = (struct thread*)dlink_flush(&this->waitset);
    LeaveCriticalSection(&this->mutex);
    while (this_thread) {
        struct thread* const next_thread =
            (struct thread*)this_thread->cvlink.next;
        SetEvent(this_thread->event);
        this_thread = next_thread;
    }
}
____________________________________________________________________
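
The dlink helpers the pseudo-code relies on (dlink_push, dlink_trypop,
dlink_flush) are not shown above; the bodies below are only a guess at minimal
intrusive-list versions that match how they are used (in particular, the casts
back to struct thread* work because cvlink is the first member):
____________________________________________________________________
/* Hypothetical minimal bodies for the list helpers assumed above. The
   waitset head is the `waitset' member itself (circular when non-empty). */

void dlink_init(struct dlink* const list) {
    list->next = list->prev = list; /* empty circular list */
}

void dlink_push(struct dlink* const list, struct dlink* const node) {
    node->next = list; /* append at the tail */
    node->prev = list->prev;
    list->prev->next = node;
    list->prev = node;
}

struct dlink* dlink_trypop(struct dlink* const list) {
    struct dlink* const node = list->next; /* oldest waiter first */
    if (node == list) return 0; /* waitset is empty */
    list->next = node->next;
    node->next->prev = list;
    return node;
}

struct dlink* dlink_flush(struct dlink* const list) {
    struct dlink* const head = list->next; /* detach the whole waitset */
    if (head == list) return 0;
    list->prev->next = 0; /* NULL-terminate it for the broadcast walker */
    dlink_init(list);
    return head;
}
____________________________________________________________________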


Other than that, well, pthreads-win32 has an algorithm that works...

Dmitriy V'jukov

Sep 15, 2008, 2:03:14 PM
On Sep 15, 2:56 am, "Chris M. Thomasson" <n...@spam.invalid> wrote:

> The most straightforward, reliable and correct condvar impl on windows would
> be one in which you use a bin-sema per-thread and manually construct the
> waitsets. The only real caveat is SCHED_OTHER; but windows is not real-time
> os anyway, so I don't know how bad SCHED_OTHER actually is... Pseudo-code
> would look like:


What do you think about the following implementation?
The main advantage is that I can release a whole generation in one
syscall. But thread descriptors are no longer hard-wired to physical
threads; physical threads basically exchange thread descriptors among
themselves.


struct dlink {
    struct dlink* next;
    struct dlink* prev;

    /************************************************/
    // we need size of list
    size_t size;
    /************************************************/
};

struct thread {
    struct dlink cvlink;

    /************************************************/
    // now here is semaphore
    HANDLE sema;
    /************************************************/

    LONG refs; /* = 1 at birth */
};

struct condvar {
    CRITICAL_SECTION mutex;
    struct dlink waitset;
};


void thread_refs_inc(
    struct thread* const this
) {
    InterlockedIncrement(&this->refs);
}


void thread_refs_dec(
    struct thread* const this
) {
    if (! InterlockedDecrement(&this->refs)) {
        /* `this' is in a persistent quiescent state */
    }
}

void condvar_wait(
    struct condvar* const this,
    LPCRITICAL_SECTION umutex
) {
    struct thread* root;
    struct thread* this_thread = TlsGetValue(...);

    EnterCriticalSection(&this->mutex);
    thread_refs_inc(this_thread);
    dlink_push(&this->waitset, &this_thread->cvlink);

    /************************************************/
    // get first element from dlist
    root = dlink_get_head(this->waitset);
    /************************************************/
    LeaveCriticalSection(&this->mutex);
    LeaveCriticalSection(umutex);
    /************************************************/
    // wait not on this_thread's semaphore, but on root semaphore
    WaitForSingleObject(root->sema, INFINITE);
    EnterCriticalSection(&this->mutex);
    // get and remember new thread descriptor
    this_thread = dlink_pop_tail(root);
    LeaveCriticalSection(&this->mutex);
    TlsSetValue(..., this_thread);
    /************************************************/
    thread_refs_dec(this_thread);
    EnterCriticalSection(umutex);
}


void condvar_signal(
    struct condvar* const this
) {
    struct thread* this_thread;
    EnterCriticalSection(&this->mutex);
    this_thread = (struct thread*)dlink_trypop_last(&this->waitset);
    LeaveCriticalSection(&this->mutex);
    if (this_thread) {
        ReleaseSemaphore(this_thread->sema, 1, 0);
    }
}


void condvar_broadcast(
    struct condvar* const this
) {
    struct thread* this_thread;
    EnterCriticalSection(&this->mutex);
    this_thread = (struct thread*)dlink_flush(&this->waitset);
    LeaveCriticalSection(&this->mutex);

    if (this_thread) {
        /************************************************/
        // releasing whole generation in one syscall
        ReleaseSemaphore(this_thread->sema, dlink_size(this_thread), 0);
        /************************************************/
    }
}


Dmitriy V'jukov
--
Relacy Race Detector: Make your synchronization correct!
http://groups.google.ru/group/relacy/web

David Schwartz

Sep 15, 2008, 4:03:19 PM
On Sep 14, 3:35 pm, "Chris M. Thomasson" <n...@spam.invalid> wrote:

> How can I efficiently implement that in Windows XP? For the consumer-side I
> guess one could do something like:

I wouldn't ever even imagine using a design that backed me into a
corner like this on Windows. I'd queue events with
PostQueuedCompletionStatus and block for events with
GetQueuedCompletionStatus (that is the Windows way). I/O events would
get queued automatically for me by the OS -- a free bonus. And my
number of running threads would auto-tune based on the number of CPUs.
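
For what it's worth, a bare-bones sketch of that pattern (the names submit_job,
struct job and worker are invented here purely for illustration; error handling
is mostly omitted):
________________________________________________________________
#include <windows.h>

struct job { void (*fn)(void*); void* arg; };

static HANDLE g_port; /* one completion port shared by the worker pool */

void init_pool(void) {
    /* last argument 0: let the kernel pick the concurrency level (CPU count) */
    g_port = CreateIoCompletionPort(INVALID_HANDLE_VALUE, NULL, 0, 0);
}

void submit_job(struct job* j) {
    /* hand the job to whichever worker wakes up; the key carries the pointer */
    PostQueuedCompletionStatus(g_port, 0, (ULONG_PTR)j, NULL);
}

DWORD WINAPI worker(LPVOID unused) {
    for (;;) {
        DWORD bytes; ULONG_PTR key; LPOVERLAPPED ov = NULL;
        BOOL ok = GetQueuedCompletionStatus(g_port, &bytes, &key, &ov, INFINITE);
        if (ov == NULL) {
            if (! ok) break; /* port closed; real code would check GetLastError() */
            struct job* j = (struct job*)key; /* packet posted by submit_job() */
            j->fn(j->arg);
        } else {
            /* completion (successful or failed) for a handle associated
               with the port: this is where the free I/O dispatch comes in */
        }
    }
    return 0;
}
________________________________________________________________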

If you have a problem that can't be solved in this way, you need to
explain the *problem*, not one possible contrived solution to it.

DS

Dmitriy V'jukov

Sep 15, 2008, 4:14:26 PM


A general-purpose multi-core support library (something like Intel TBB,
or Cilk, or .NET TPL).
The library creates a worker thread per processor. Every worker thread has
to poll several sources for work: (1) first of all, the distributed work-
stealing scheduler, (2) a bunch of single-producer/single-consumer
message-passing queues for directed messages, and (3) several internal
queues/stacks used for memory management etc.
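
(To make the shape of the problem concrete, one worker's poll loop roughly
looks like the sketch below; every name is invented, and the interesting open
question is what block() should be built on, which is exactly what this thread
is about.)
________________________________________________________________
#include <functional>

// Invented, minimal stand-ins so the sketch compiles; in the real library the
// sources are the work-stealing deque, the spsc channels and internal queues.
struct worker {
    std::function<bool()> pop_local;     // (1) own work-stealing deque
    std::function<bool()> poll_messages; // (2) spsc message channels
    std::function<bool()> poll_internal; // (3) internal queues/stacks
    std::function<bool()> try_steal;     // steal from a sibling worker
    std::function<void()> block;         // how to sleep here is the open question
};

// Each source returns true if it found (and ran) a task.
void worker_loop(worker& w) {
    for (;;) {
        if (w.pop_local())     continue;
        if (w.poll_messages()) continue;
        if (w.poll_internal()) continue;
        if (w.try_steal())     continue;
        w.block(); // eventcount / condvar / IOCP -- see the rest of the thread
    }
}
________________________________________________________________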

Hallvard B Furuseth

Sep 15, 2008, 4:36:06 PM
Dmitriy V'jukov writes:
> On Sep 16, 12:03 am, David Schwartz <dav...@webmaster.com> wrote:
>> On Sep 14, 3:35 pm, "Chris M. Thomasson" <n...@spam.invalid> wrote:
>>
>> I wouldn't ever even imagine using a design that backed me into a
>> corner like this on Windows. (...)

>> If you have a problem that can't be solved in this way, you need to
>> explain the *problem*, not one possible contrived solution to it.
>
> General purpose multi-core support library (something like Intel TBB,
> or Cilk, or .NET TPL).

Have the library macroize it, or at least some common use patterns?
Some macros would be no-ops in the Windows version, others in the
pthread version. I don't know if code using Windows primitives will
have similar enough structure to code using Pthreads though. If so,
it might help with a few macros that take code blocks as arguments.
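
As a trivial illustration of the macro idea (this only covers the easy
lock/unlock case, where nothing is a no-op; the wait/wake patterns are the
hard part and would indeed need macros that take code blocks):
________________________________________________________________
/* Illustration only: per-platform macro pairs over an opaque mutex type. */
#ifdef _WIN32
  #include <windows.h>
  typedef CRITICAL_SECTION port_mutex_t;
  #define PORT_MUTEX_INIT(m)   InitializeCriticalSection(m)
  #define PORT_MUTEX_LOCK(m)   EnterCriticalSection(m)
  #define PORT_MUTEX_UNLOCK(m) LeaveCriticalSection(m)
  #define PORT_MUTEX_FINI(m)   DeleteCriticalSection(m)
#else
  #include <pthread.h>
  typedef pthread_mutex_t port_mutex_t;
  #define PORT_MUTEX_INIT(m)   pthread_mutex_init((m), NULL)
  #define PORT_MUTEX_LOCK(m)   pthread_mutex_lock(m)
  #define PORT_MUTEX_UNLOCK(m) pthread_mutex_unlock(m)
  #define PORT_MUTEX_FINI(m)   pthread_mutex_destroy(m)
#endif
________________________________________________________________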

--
Hallvard

David Schwartz

Sep 15, 2008, 6:18:33 PM
On Sep 15, 1:14 pm, "Dmitriy V'jukov" <dvyu...@gmail.com> wrote:

> > If you have a problem that can't be solved in this way, you need to
> > explain the *problem*, not one possible contrived solution to it.

> General purpose multi-core support library (something like Intel TBB,
> or Cilk, or .NET TPL).
> Library creates worker thread per processor. Every worker thread have
> to poll for work several sources: (1) first of all, distributed work-
> stealing scheduler, (2) a bunch of single-producer/single-consumer
> message passing queues for directed messages and (3) several internal
> queues/stacks used for memory management etc.

Create an IOCP. When work comes in from any source, post a packet to
the CP.

Advantages:

1) I/O works without your having to do anything special to get it to a
thread.

2) You can create more threads than processors and still have the
ideal number of threads running.

Whatever it takes to make this design fit the problem, it will likely
produce a better result.

DS

Chris M. Thomasson

Sep 15, 2008, 7:36:15 PM
"Dmitriy V'jukov" <dvy...@gmail.com> wrote in message
news:a9bdcef8-204c-4ce3...@a1g2000hsb.googlegroups.com...

On Sep 15, 2:56 am, "Chris M. Thomasson" <n...@spam.invalid> wrote:

> > The most straightforward, reliable and correct condvar impl on windows
> > would
> > be one in which you use a bin-sema per-thread and manually construct the
> > waitsets. The only real caveat is SCHED_OTHER; but windows is not
> > real-time
> > os anyway, so I don't know how bad SCHED_OTHER actually is...
> > Pseudo-code
> > would look like:


> What do you think about following implementation?
> The main advantage is that I can release whole generation in one
> syscall. But thread descriptors are no more hard-connected to physical
> threads, physical threads are basically exchanging with thread
> descriptors.

It's definitely a major performance enhancement wrt broadcasting. AFAICT, it
should work out fine. I need to think a little bit more on this. I have an
eventcount which uses the event per-thread scheme and it actually works
great on Windows. However, I am thinking about switching the impl over to
something like the one you posted. Humm... It would be optimizing the
slow-path... I need to think if it's worth it. Well, it probably is.

;^D

Chris M. Thomasson

Sep 15, 2008, 7:45:50 PM
> "David Schwartz" <dav...@webmaster.com> wrote in message
> news:8b9ffc1d-b8fa-4ef4...@a3g2000prm.googlegroups.com...

> Advantages:

There is a disadvantage if you're creating highly robust programs... Back when
I used to create server frameworks on Windows, I never used IOCP for
anything but I/O. This is because, IIRC, PQCS uses a little non-paged
memory, and I did not want to tie that up for anything but I/O. The size of
an IOCP queue is limited by the amount of non-paged memory... My framework
needed to refrain from crashing in tight scenarios. It could handle resource
exhaustion, and even recover from malloc returning NULL...

David Schwartz

Sep 15, 2008, 9:34:23 PM
On Sep 15, 4:45 pm, "Chris M. Thomasson" <n...@spam.invalid> wrote:

> There is a disadvantage if your creating highly robust programs...

No.

> Back when
> I used to create server frameworks on Windows, I never used IOCP for
> anything but I/O. This is because, IIRC, PQCS uses a little non-paged
> memory, and I did not want to tie that up for anything but I/O.

If you have three-hundred tasks waiting for a thread, and you have far
less than three-hundred CPUs, there is no need to queue anything else
to the IOCP. You can use IOCP for job dispatch *when* *it* *matters*
without fear of non-paged pool exhaustion.

> The size of
> an IOCP queue is limited my the amount of non-paged memory...

Which is basically no longer an issue on modern versions of Windows.
But if you have so much queued up to an IOCP that memory is an issue,
there's no advantage to putting more stuff on the queue -- it won't
get done any quicker. You only need to keep jobs you actually need to
schedule to a thread on the IOCP queue.

> My framework
> needed to refrain from crashing in tight scenarios. It could handle recourse
> exhaustion, and even recover from malloc returning NULL...

So if you don't have enough memory to queue something on the IOCP,
don't. That's just the most efficient way to dispatch it, not the only
way. If you want to write robust software, do so. But robust software,
as you noted, sanely handles malloc failure, it doesn't refrain from
calling malloc!

DS

Chris M. Thomasson

Sep 18, 2008, 12:25:37 PM
"David Schwartz" <dav...@webmaster.com> wrote in message
news:96300776-7dbb-41b7...@q26g2000prq.googlegroups.com...

On Sep 15, 4:45 pm, "Chris M. Thomasson" <n...@spam.invalid> wrote:

> > There is a disadvantage if your creating highly robust programs...

> No.


> > Back when
> > I used to create server frameworks on Windows, I never used IOCP for
> > anything but I/O. This is because, IIRC, PQCS uses a little non-paged
> > memory, and I did not want to tie that up for anything but I/O.

> If you have three-hundred tasks waiting for a thread, and you have far
> less than three-hundred CPUs, there is no need to queue anything else
> to the IOCP. You can use IOCP for job dispatch *when* *it* *matters*
> without fear of non-paged pool exhaustion.


> > The size of
> > an IOCP queue is limited my the amount of non-paged memory...

> Which is basically no longer an issue on modern versions of Windows.
> But if you have so much queued up to an IOCP that memory is an issue,
> there's no advantage to putting more stuff on the queue -- it won't
> get done any quicker. You only need to keep jobs you actually need to
> schedule to a thread on the IOCP queue.

http://www.microsoft.com/mspress/books/sampchap/5726a.aspx

I personally would not use IOCP for sending private messages between
threads. I want to maximize the number of connections, and the number of
pending I/O operations the server can handle. If you have 50,000 connections
and each connection has multiple overlapped sends and recvs going, and
perhaps multiple disk reads and writes, well, the non-paged memory becomes a
major issue indeed. I don't want to waste that memory on private messages
that have nothing to do with I/O.


> > My framework
> > needed to refrain from crashing in tight scenarios. It could handle
> > recourse
> > exhaustion, and even recover from malloc returning NULL...

> So if you don't have enough memory to queue something on the IOCP,
> don't.

How do you reliably know that the non-paged memory is exhausted before you
call into IOCP? I guess the best case scenario would be that Winsock calls
fail with WSAENOBUFS. Worst case, well, the system crashes. I have used
several strategies to put my server into a "critical resource alert" mode
which would do some fairly radical things in order to reduce or mellow out
resource usage. Some of the things involved limiting the number of pending
I/O per-socket; deferring requests; partitioning connections into groups and
pausing groups in round-robin fashion; drastic measures like disconnecting
connections; lowering the timeout for I/O activity; destroying data-structure
caches; shutting down other running processes that the admin allowed for;
reducing per-socket buffer sizes; cutting down on pending AcceptEx's;
denying connections; and on and on and on... I even did some special
protocols which could offload connections and session state to other
servers, and did heavy port concentration. Ahh, the good ol' days...


> That's just the most efficient way to dispatch it, not the only
> way. If you want to write robust software, do so. But robust software,
> as you noted, sanely handles malloc failure, it doesn't refrain from
> calling malloc!

malloc returning NULL automatically swung the server into "critical resource
alert" mode...

;^)

Dmitriy V'jukov

Sep 18, 2008, 2:54:38 PM
On Sep 16, 12:36 am, Hallvard B Furuseth <h.b.furus...@usit.uio.no>
wrote:


Portable Eventcount suits well here. No macros. Performance is
perfect.


Dmitriy V'jukov

Dmitriy V'jukov

Sep 18, 2008, 3:07:36 PM


No, thank you, I'd better stay with the eventcount. I don't care whether
it's the Unix way, or the Windows way, or the way of The Dark Side Of The Force. I
only care that I am able to create a distributed (read: high-performance
and scalable) design, that I am able to process events from different
sources, and that I am able to process some events in FIFO order and some
in LIFO order.
Are you able to create a design with IOCP whose overhead per complete
event cycle is as low as a few tens of cycles? Are you able to
create a distributed, scalable design with IOCP? Are you able to process
some messages in LIFO (non-locality-destroying) order?
I don't see how IOCP can fit the problem.


Dmitriy V'jukov

Hallvard B Furuseth

Sep 18, 2008, 4:36:01 PM

If you mean the one you posted, I don't know why you call it "portable"
when you are using a Windows interface. Well, I don't know C++ so there
may be some standard headers I don't know about (you didn't include any
headers or the C++ equivalent), but I kind of doubt C++ would have
standardized an interface which looks just like the Windows interface.

Anyway, I needed the code below to compile with g++ on Linux.
ReleaseSemaphore() in particular doesn't seem like "perfect"
performance. I haven't looked at what the code actually does though,
so maybe you meant one should use something else.

#include <assert.h>
#include <limits.h>
#include <pthread.h>
#include <semaphore.h>
#include <stdlib.h>

typedef pthread_mutex_t CRITICAL_SECTION;
typedef sem_t *HANDLE;
#define E(expr) ((expr) ? assert(0 && (expr)) : (void) 0)
#define InitializeCriticalSection(m) E(pthread_mutex_init(m, NULL))
#define DeleteCriticalSection(m) E(pthread_mutex_destroy(m))
#define EnterCriticalSection(m) E(pthread_mutex_lock(m))
#define LeaveCriticalSection(m) E(pthread_mutex_unlock(m))
#define WaitForSingleObject(s,t) (((t) ? sem_wait : sem_trywait)(s))
enum { INFINITE = -1 };

static inline void ReleaseSemaphore(HANDLE s, int val, int) {
    while (--val >= 0) E(sem_post(s));
}
static inline HANDLE CreateSemaphoreW(int, int, long, int) {
    HANDLE s = (HANDLE) malloc(sizeof *s);
    E(sem_init(s, 0, 0));
    return s;
}
static inline void CloseHandle(HANDLE s) { E(sem_destroy(s)); free(s); }

#ifdef __GNUC__
#define __declspec(x) DECLSPEC_##x
#define DECLSPEC_thread __thread
static pthread_cond_t mm_helper = PTHREAD_COND_INITIALIZER;
#define _mm_mfence() E(pthread_cond_signal(&mm_helper)) //force memory barrier
#endif

--
Hallvard

David Schwartz

Sep 18, 2008, 5:25:57 PM

Chris M. Thomasson wrote:

> I personally would not use IOCP for sending private messages between
> threads. I want to maximize the number of connections, and the number of
> pending I/O operations the server can handle. If you have 50,000 connections
> and each connection has multiple overlapped sends and recvs going, and
> perhaps multiple disk reads and writes, well, the non-paged memory becomes a
> major issue indeed. I don't want to waste that memory on private messages
> that have nothing to do with I/O.

This is complete and utter nonsense. You can use any capability badly.
Pending multiple overlapped sends and receives on 50,000 connections
would fail badly under any I/O model. Yes, IOCP can be misused. Don't
do anything that utterly stupid. Use IOCP sensibly. Use it to schedule
short-term jobs that will be done soon. This is the most common case,
and to fail to heavily optimize the most common case is just plain
dumb.

> > So if you don't have enough memory to queue something on the IOCP,
> > don't.

> How do you reliably know that the non-paged memory is exhausted before you
> call into IOCP? I guess the best case scenario would be that Winsock calls
> fail with WSAENOBUFS. Worst cast, well, the system crashes. I have used
> several strategies to put my server into a "critical resource alert" mode
> which would do some fairly radical things in order to reduce or mellow out
> recourse usage. Some of the things involved limiting the number of pending
> io per-socket; deferring requests; partitioning connections into groups and
> pausing groups in round-robin fashion; drastic measures like disconnecting
> connections; lowering the timeout for io activity; destroying data-structure
> caches; shutting down other running processes that the admin allowed for;
> reducing per-socket buffer sizes; cutting down on pending AcceptEx's;
> denying connections; and on and on and on... I even did some special
> protocols which could offload connections and session state to other
> servers, and did heavy port concentration. Ahh, the good ol days...

If you want the absolute highest possible performance, you have to put in
a lot of effort. But you do it once when you develop your I/O library
and then you have it forever. Many, many real-world applications work
fine without handling all these bizarre contingencies -- but if you
want maximum performance and reliability, you have to do some extra
work to get them.

However, IOCP is the best way to schedule short term jobs on modern
Windows systems and dispatch them to threads. Really.

> > That's just the most efficient way to dispatch it, not the only
> > way. If you want to write robust software, do so. But robust software,
> > as you noted, sanely handles malloc failure, it doesn't refrain from
> > calling malloc!

> malloc returning NULL automatically swung the server into "critical resource
> alert" mode...

Exactly. And you can keep an emergency pool. And you can do load
shedding. And so on.

Maximum performance and maximum reliability is hard to do, and it
takes a lot of effort. But scheduling normal short-term jobs using
IOCP gets you high performance in the very typical case where the most
common state is no pending jobs. By high performance I mean low
latency and the best chances of staying in the "usually no pending
jobs" state.

DS

David Schwartz

Sep 18, 2008, 5:27:46 PM
On Sep 18, 12:07 pm, "Dmitriy V'jukov" <dvyu...@gmail.com> wrote:

> No, thank you, I better stay with eventcount. I don't care whether
> it's Unix way, or Windows way, or way of The Dark Side Of The Force. I
> only care that I am able to create distributed (read high-performance
> and scalable) design, that I am able to process events from different
> sources, that I am able to process some events in FIFO order and some
> in LIFO order.

Okay.

> Do you able to create design which will have as low overhead per
> complete event cycle as few tens of cycles with IOCP? Do you able to
> create distributed scalable design with IOCP? Do you able to process
> some messages in LIFO non locality destroying order?
> I don't see how IOCP can fit the problem.

Yes, you can do all these things with IOCP. Really.

IOCP can be used just like futexes in Linux. You don't have to stick
them in the fast path. And with them in the slow path, you can make
the fast path significantly faster.
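
One concrete reading of "IOCP in the slow path only" -- a sketch of my own, not
David's design: build the blocking primitive so that the uncontended path is a
single interlocked operation, and touch the port only to sleep and wake.
________________________________________________________________
#include <windows.h>

/* Sketch: a counting semaphore whose fast path is one interlocked op.
   The port is dedicated to this semaphore and used purely as a wakeup queue;
   GQCS/PQCS are reached only when a thread must actually sleep or be woken. */
typedef struct {
    volatile LONG count; /* > 0: free permits; <= 0: -count threads sleeping */
    HANDLE port;
} iocp_sema;

void iocp_sema_init(iocp_sema* s, LONG initial) {
    s->count = initial;
    s->port = CreateIoCompletionPort(INVALID_HANDLE_VALUE, NULL, 0, 0);
}

void iocp_sema_acquire(iocp_sema* s) {
    if (InterlockedDecrement(&s->count) < 0) { /* no permit: slow path */
        DWORD bytes; ULONG_PTR key; LPOVERLAPPED ov;
        GetQueuedCompletionStatus(s->port, &bytes, &key, &ov, INFINITE);
    }
}

void iocp_sema_release(iocp_sema* s) {
    if (InterlockedIncrement(&s->count) <= 0) /* somebody is (or will be) sleeping */
        PostQueuedCompletionStatus(s->port, 0, 0, NULL); /* wake exactly one */
}
________________________________________________________________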

DS

Dmitriy V'jukov

Sep 18, 2008, 6:33:41 PM
On 19 Sep, 01:27, David Schwartz <dav...@webmaster.com> wrote:

> > Do you able to create design which will have as low overhead per
> > complete event cycle as few tens of cycles with IOCP? Do you able to
> > create distributed scalable design with IOCP? Do you able to process
> > some messages in LIFO non locality destroying order?
> > I don't see how IOCP can fit the problem.
>
> Yes, you can do all these things with IOCP. Really.
>
> IOCP can be used just like futexes in Linux. You don't have to stick
> them in the fast path. And with them in the slow path, you can make
> the fast path significantly faster.

Can you please briefly outline a design for a distributed scheduler with
IOCP on the slow path which is capable of processing, with low overhead,
some events in FIFO order and some events in LIFO order? Maybe I am just
missing something.

Dmitriy V'jukov

David Schwartz

Sep 18, 2008, 7:28:17 PM
On Sep 18, 3:33 pm, "Dmitriy V'jukov" <dvyu...@gmail.com> wrote:

> Can you, please, briefly outline design of distributed scheduler with
> IOCP on slow-path which is capable of processing with low-overhead
> some events in FIFO order and some events in LIFO order? Maybe I just
> missing something.

You check the 'hot list' of tasks before you check the IOCP. You use
the IOCP to block on when you have an empty work queue (when else
would you block?). Jobs can easily be placed on the head or the tail
of the hot list.

DS

Chris M. Thomasson

Sep 18, 2008, 10:39:39 PM
"David Schwartz" <dav...@webmaster.com> wrote in message
news:e8e5c934-cd43-4680...@a18g2000pra.googlegroups.com...

That can work. You can use IOCP to amortize work in a sense... Simple
pseudo-code which uses a private queue and IOCP as signal:

// producer
1. create private __batch__ of work in private queue.

2. enqueue a single "event" message via PQCS to indicate that the batch of work is
ready.


// consumer
1. check for private batch of work from private queue; if any work is there
process it and goto step 2, if not goto step 2 anyway! :^)


2. wait on IOCP via GQCS


3. if dequeue OVERLAPPED struct indicates event of private work goto step 1,
if I/O goto step 4.


4. process I/O


5. goto step 1

The trick is not to enqueue an IOCP event for EVERY private message! You use
IOCP as a signal to check private queues.

Thus, my original position still stands, David. DON'T USE IOCP FOR A PRIVATE
MESSAGE QUEUE! Use IOCP as a signal to check the private message queue, indeed.


IMVHO, it's not nonsense...

non-paged memory usage is directly proportional to how many connections your
server can handle. Don't waste non-paged memory with private messages.

This is all in my very humble opinion. This is what I choose to do
personally. So be it.


;^|

Chris M. Thomasson

Sep 18, 2008, 10:43:20 PM
"David Schwartz" <dav...@webmaster.com> wrote in message
news:e8e5c934-cd43-4680...@a18g2000pra.googlegroups.com...

The "hot list" is not ICON queue right? You admit that private queuing
should not use IOCP? Please clarify? Do you agree with me or not?

AFAICT, you make a clear separation between hot-list and IOCP. If that is
so, then we agree on the point that private queue should not use IOCP, but
use a separate so-called hit list...

Where am I going wrong?

Chris M. Thomasson

Sep 18, 2008, 11:32:38 PM
"David Schwartz" <dav...@webmaster.com> wrote in message
news:0d4d4097-4c00-4eb1...@i20g2000prf.googlegroups.com...

What if the program needed to wait on a multi-part predicate in an atomic
fashion? You did say that if somebody gave you a requirement, you could do
it the Windows way. Well, how would you resolve the highly contrived problem
I outlined? I think the solution I gave would actually work fine. How would
you do it?

Dmitriy V'jukov

Sep 19, 2008, 1:01:52 AM
On 19 Sep, 06:39, "Chris M. Thomasson" <n...@spam.invalid> wrote:
> "David Schwartz" <dav...@webmaster.com> wrote in message
>
> news:e8e5c934-cd43-4680...@a18g2000pra.googlegroups.com...
> On Sep 18, 3:33 pm, "Dmitriy V'jukov" <dvyu...@gmail.com> wrote:
>
> > > Can you, please, briefly outline design of distributed scheduler with
> > > IOCP on slow-path which is capable of processing with low-overhead
> > > some events in FIFO order and some events in LIFO order? Maybe I just
> > > missing something.
> > You check the 'hot list' of tasks before you check the IOCP. You use
> > the IOCP to block on, when you have an empty work queue (when else
> > would you block?). Jobs can easily be placed on the head or the tail
> > of the host list.
>
> That can work. You can use IOCP to amortize work in a sense... Simple
> pseudo-code which uses a private queue and IOCP as signal:
>
> // producer
> 1. create private __batch__ of work in private queue.

> 2. enqueue single "event" message via PCQS to indicate that batch of work is
> ready.


The producer doesn't create a *batch* of events, it just creates single
events. Must it call PQCS for every event?


> // consumer
> 1. check for private batch of work from private queue; if any work is there
> process it and goto step 2, if not goto step 2 anyway!  :^)
>
> 2. wait on IOCP via GQCS
>
> 3. if dequeue OVERLAPPED struct indicates event of private work goto step 1,
> if I/O goto step 4.
>
> 4. process I/O
>
> 5. goto step 1


Ok. We've made distributed queues. Now we need to bolt load-balancing
(read: work-stealing) onto this. How can I do that?


Dmitriy V'jukov

Dmitriy V'jukov

Sep 19, 2008, 1:05:10 AM
On 19 Sep, 09:01, "Dmitriy V'jukov" <dvyu...@gmail.com> wrote:

> On 19 Sep, 06:39, "Chris M. Thomasson" <n...@spam.invalid> wrote:
>
>
>
> > "David Schwartz" <dav...@webmaster.com> wrote in message
>
> >news:e8e5c934-cd43-4680...@a18g2000pra.googlegroups.com...
> > On Sep 18, 3:33 pm, "Dmitriy V'jukov" <dvyu...@gmail.com> wrote:
>
> > > > Can you, please, briefly outline design of distributed scheduler with
> > > > IOCP on slow-path which is capable of processing with low-overhead
> > > > some events in FIFO order and some events in LIFO order? Maybe I just
> > > > missing something.
> > > You check the 'hot list' of tasks before you check the IOCP. You use
> > > the IOCP to block on, when you have an empty work queue (when else
> > > would you block?). Jobs can easily be placed on the head or the tail
> > > of the host list.
>
> > That can work. You can use IOCP to amortize work in a sense... Simple
> > pseudo-code which uses a private queue and IOCP as signal:
>
> > // producer
> > 1. create private __batch__ of work in private queue.
> > 2. enqueue single "event" message via PCQS to indicate that batch of work is
> > ready.
>
> Producer doesn't create *batch* of events, it's just create single
> events. It must call PQCS for every event?
>
> > // consumer
> > 1. check for private batch of work from private queue; if any work is there
> > process it and goto step 2, if not goto step 2 anyway! :^)

>
> > 2. wait on IOCP via GQCS
>
> > 3. if dequeue OVERLAPPED struct indicates event of private work goto step 1,
> > if I/O goto step 4.
>
> > 4. process I/O
>
> > 5. goto step 1
>
> Ok. We've made distributed queues. Now we need to bolt load-balancing
> (read - work-stealing) to this. How I can make it?


Oh, I forgot to mention: there must also be a single centralized queue for
'root' tasks. Threads must check that queue when they are out of work
on their private queues.


Dmitriy V'jukov

Chris M. Thomasson

Sep 19, 2008, 4:54:27 AM
"Dmitriy V'jukov" <dvy...@gmail.com> wrote in message
news:102cd6fd-5412-48e3...@s50g2000hsb.googlegroups.com...

On 19 Sep, 06:39, "Chris M. Thomasson" <n...@spam.invalid> wrote:
> > "David Schwartz" <dav...@webmaster.com> wrote in message
> >
> > news:e8e5c934-cd43-4680...@a18g2000pra.googlegroups.com...
> > On Sep 18, 3:33 pm, "Dmitriy V'jukov" <dvyu...@gmail.com> wrote:
> >
> > > > Can you, please, briefly outline design of distributed scheduler
> > > > with
> > > > IOCP on slow-path which is capable of processing with low-overhead
> > > > some events in FIFO order and some events in LIFO order? Maybe I
> > > > just
> > > > missing something.
> > > You check the 'hot list' of tasks before you check the IOCP. You use
> > > the IOCP to block on, when you have an empty work queue (when else
> > > would you block?). Jobs can easily be placed on the head or the tail
> > > of the host list.
> >
> > That can work. You can use IOCP to amortize work in a sense... Simple
> > pseudo-code which uses a private queue and IOCP as signal:
> >
> > // producer
> > 1. create private __batch__ of work in private queue.

> > 2. enqueue single "event" message via PCQS to indicate that batch of
> > work is
> > ready.


> Producer doesn't create *batch* of events, it's just create single
> events. It must call PQCS for every event?

Not for every event. It only calls PQCS to signal that a batch of work has
been queued onto a separate private queue. Let's see if I can illustrate with a
little pseudo-code:


Version 1
_____________________________________________________________________
#define PRIVATE_EVENT 0x1UL
#define IO_EVENT 0x2UL

struct iocp_event {
    OVERLAPPED ol;
    unsigned long flags;
    int depth;
    iocp_event(unsigned long flags_, int depth_)
        : ol(), flags(flags_), depth(depth_) {}
};

static thread_safe_queue<work> g_workq;
static HANDLE g_iocp;

void work_producer(int depth) {
    struct iocp_event* const ioev = new iocp_event(PRIVATE_EVENT, depth);
    for (int i = 0; i < depth; ++i) {
        g_workq.push(new work(...));
    }
    PQCS(g_iocp, &ioev->ol);
}

void work_consumer(iocp_event* ioev) {
    work* w;
    for (int i = 0; i < ioev->depth; ++i) {
        w = g_workq.trypop();
        if (! w) { break; }
        w->process();
    }
}

void worker_thread() {
    for (;;) {
        iocp_event* ioev;
        GQCS(g_iocp, &ioev);
        if (ioev->flags & PRIVATE_EVENT) {
            work_consumer(ioev);
            delete ioev;
        } else if (ioev->flags & IO_EVENT) {
            handle_io(ioev);
        }
    }
}
_____________________________________________________________________


Version 2
_____________________________________________________________________
#define IO_EVENT 0x1UL

struct iocp_event {
    OVERLAPPED ol;
    unsigned long flags;
    non_thread_safe_queue<work> workq;
    iocp_event(unsigned long flags_) : ol(), flags(flags_) {}
};

static HANDLE g_iocp;

void work_producer(int depth) {
    struct iocp_event* const ioev = new iocp_event(0);
    for (int i = 0; i < depth; ++i) {
        ioev->workq.push(new work(...));
    }
    PQCS(g_iocp, &ioev->ol);
}

void work_consumer(iocp_event* ioev) {
    work* w;
    while ((w = ioev->workq.trypop())) {
        w->process();
    }
}

void worker_thread() {
    for (;;) {
        iocp_event* ioev;
        GQCS(g_iocp, &ioev);
        if (! ioev->workq.empty()) {
            work_consumer(ioev);
        }
        if (ioev->flags & IO_EVENT) {
            handle_io(ioev);
        } else {
            delete ioev;
        }
    }
}
_____________________________________________________________________


Version 2 can come in handy because you can create a batch of work and I/O
events using the same `iocp_event'. Humm...


> > // consumer
> > 1. check for private batch of work from private queue; if any work is
> > there
> > process it and goto step 2, if not goto step 2 anyway! :^)
> >
> > 2. wait on IOCP via GQCS
> >
> > 3. if dequeue OVERLAPPED struct indicates event of private work goto
> > step 1,
> > if I/O goto step 4.
> >
> > 4. process I/O
> >
> > 5. goto step 1


> Ok. We've made distributed queues. Now we need to bolt load-balancing
> (read - work-stealing) to this. How I can make it?

AFAICT, IOCP already uses a smart work-stealing algorithm internally; it does
a good job of balancing completions across worker threads. Are you writing
about sticking a work-stealing algorithm into a spsc-queue multiplexing
algorithm?

Chris M. Thomasson

Sep 19, 2008, 5:08:44 AM
"David Schwartz" <dav...@webmaster.com> wrote in message
news:34a1aa20-262e-4ff9...@b30g2000prf.googlegroups.com...

>
> Chris M. Thomasson wrote:
>
>> I personally would not use IOCP for sending private messages between
>> threads. I want to maximize the number of connections, and the number of
>> pending I/O operations the server can handle. If you have 50,000
>> connections
>> and each connection has multiple overlapped sends and recvs going, and
>> perhaps multiple disk reads and writes, well, the non-paged memory
>> becomes a
>> major issue indeed. I don't want to waste that memory on private messages
>> that have nothing to do with I/O.
>
> This is complete and utter nonsense. You can use any capability badly.
> Pending multiple overlapped sends and receives on 50,000 connections
> would fail badly under any I/O model. Yes, IOCP can be misused. Don't
> do anything that utterly stupid. Use IOCP sensibly. Use it to schedule
> short-term jobs that will be done soon. This is the most common case,
> and to fail to heavily optimize the most common case is just plain
> dumb.

If the server has enough memory, I want to take advantage of it. Therefore,
if the environment allows me to have multiple pending overlapped operations
at 50,000 connections I will do it. If the resources are overloaded, then I
can scale things way back. My main point is that I don't want to use
precious non-paged memory for anything but I/O. This is my personal
opinion. Humm, perhaps I have to change that line of thinking if I ever
create another I/O framework.

Indeed.


> Many, many real world applications work
> fine without handling all these bizarre contingencies -- but if you
> want maximum performance and reliability, you have to do some extra
> work to get.

Creating servers that are highly fault tolerant can be a very interesting and
fun task. IMHO, it's fun to actually try to overload and crash your own I/O
server frameworks.


> However, IOCP is the best way to schedule short term jobs on modern
> Windows systems and dispatch them to threads. Really.

Well, yeah, you're right; it's just that I personally want to use IOCP for I/O
only. However, as you know, there are some clever ways to use it for non-I/O
tasks and amortize the number of times the application calls PQCS. That
would be okay.


>> > That's just the most efficient way to dispatch it, not the only
>> > way. If you want to write robust software, do so. But robust software,
>> > as you noted, sanely handles malloc failure, it doesn't refrain from
>> > calling malloc!
>
>> malloc returning NULL automatically swung the server into "critical
>> resource
>> alert" mode...
>
> Exactly. And you can keep an emergency pool. And you can do load
> shedding. And so on.
>
> Maximum performance and maximum reliability is hard to do, and it
> takes a lot of effort. But scheduling normal short-term jobs using
> IOCP gets you high performance in the very typical case where the most
> common state is no pending jobs. By high performance I mean low
> latency and the best chances of staying in the "usually no pending
> jobs" state.

IOCP does an excellent job of distributing work evenly across worker threads.
If I ever create a new I/O framework, I will definitely consider using IOCP
for more than just I/O... Thanks, David.

:^)

Dmitriy V'jukov

Sep 19, 2008, 7:15:34 AM
Chris M. Thomasson wrote:

> > This is complete and utter nonsense. You can use any capability badly.
> > Pending multiple overlapped sends and receives on 50,000 connections
> > would fail badly under any I/O model. Yes, IOCP can be misused. Don't
> > do anything that utterly stupid. Use IOCP sensibly. Use it to schedule
> > short-term jobs that will be done soon. This is the most common case,
> > and to fail to heavily optimize the most common case is just plain
> > dumb.
>
> If the server has enough memory, I want to take advantage of it. Therefore,
> if the environment allows me to have multiple pending overlapped operations
> at 50,000 connections I will do it. If the resources are overloaded, then I
> can scale things way back. My main point is that I don't want to use
> precious non-pages memory for anything else but I/O. This is my personal
> opinion. Humm, perhaps I have to change that line of thinking if I ever
> create another I/O framework.

Why do you think that user items in a Completion Port use non-paged memory?
It's quite counter-intuitive to me. The link you posted only says that
socket handles, pending IO requests (IRPs) and unread/unsent data on
sockets in the TCP layer use non-paged memory. There is nothing about
the events themselves in the Completion Port...

Dmitriy V'jukov

Dmitriy V'jukov

Sep 19, 2008, 7:22:35 AM
On Sep 19, 12:54 pm, "Chris M. Thomasson" <n...@spam.invalid> wrote:

> > > // producer
> > > 1. create private __batch__ of work in private queue.
> > > 2. enqueue single "event" message via PCQS to indicate that batch of
> > > work is
> > > ready.
> > Producer doesn't create *batch* of events, it's just create single
> > events. It must call PQCS for every event?
>
> Not for every event. It only calls PQCS to signal that a batch of work has
> been queued onto separate private queue. Lets see if I can illustrate with a
> little pseudo-code:


In my context producers produce only single work items, not batches. I
don't know when a producer will next decide to produce something. I
have no control over that. And I need a solution for this problem.


> AFAICT, IOCP already uses smart work-stealing algorithm internally; it does
> a good job at balancing completion's across worker threads. Are you writing
> about sticking a work-stealing algorithm into a spsc-queue multiplexing
> algorithm?


If we use an IOCP per thread, then it makes no sense to say that IOCP can
do load-balancing.
I need load-balancing for my work items. If work items are transferred
over my own queues (not IOCP), then I need load-balancing for my own
queues.
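
For what it's worth, the minimal shape of a per-worker queue that supports
stealing is sketched below (a naive lock-based version with invented names,
just to show the owner-LIFO / thief-FIFO split; it is not Dmitriy's
implementation, which avoids the lock):
________________________________________________________________
#include <windows.h>
#include <deque>

struct task { virtual void run() = 0; virtual ~task() {} };

// The owner pushes/pops at the back (LIFO, cache-friendly); idle workers
// steal from the front (FIFO), which is where the load-balancing comes from.
class ws_deque {
    CRITICAL_SECTION lock_;
    std::deque<task*> q_;
public:
    ws_deque() { InitializeCriticalSection(&lock_); }
    ~ws_deque() { DeleteCriticalSection(&lock_); }
    void push(task* t) { // owning worker only
        EnterCriticalSection(&lock_);
        q_.push_back(t);
        LeaveCriticalSection(&lock_);
    }
    task* pop() { // owning worker only
        EnterCriticalSection(&lock_);
        task* t = q_.empty() ? 0 : q_.back();
        if (t) q_.pop_back();
        LeaveCriticalSection(&lock_);
        return t;
    }
    task* steal() { // any other worker
        EnterCriticalSection(&lock_);
        task* t = q_.empty() ? 0 : q_.front();
        if (t) q_.pop_front();
        LeaveCriticalSection(&lock_);
        return t;
    }
};
________________________________________________________________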


Dmitriy V'jukov

Chris M. Thomasson

Sep 19, 2008, 7:47:26 AM
"Dmitriy V'jukov" <dvy...@gmail.com> wrote in message
news:466ce6ae-6ce9-4696...@f63g2000hsf.googlegroups.com...

I seem to remember doing an experiment on NT 4.0 a long time ago where I
posted tons of items to the port, and watched the non-paged pool usage start
to grow. I did the server work on NT. I wonder if Microsoft has changed this
behavior.

Chris M. Thomasson

Sep 19, 2008, 7:50:49 AM
"Dmitriy V'jukov" <dvy...@gmail.com> wrote in message
news:0d05a0a6-571b-4c09...@l64g2000hse.googlegroups.com...

On Sep 19, 12:54 pm, "Chris M. Thomasson" <n...@spam.invalid> wrote:

> > > > // producer
> > > > 1. create private __batch__ of work in private queue.
> > > > 2. enqueue single "event" message via PCQS to indicate that batch of
> > > > work is
> > > > ready.
> > > Producer doesn't create *batch* of events, it's just create single
> > > events. It must call PQCS for every event?
> >
> > Not for every event. It only calls PQCS to signal that a batch of work
> > has
> > been queued onto separate private queue. Lets see if I can illustrate
> > with a
> > little pseudo-code:


> In my context producers produce only single work items, not batches. I
> don't know when next time producer will decide to produce something. I
> have no control over that. And I need solution for this problem.

I see.


> > AFAICT, IOCP already uses smart work-stealing algorithm internally; it
> > does
> > a good job at balancing completion's across worker threads. Are you
> > writing
> > about sticking a work-stealing algorithm into a spsc-queue multiplexing
> > algorithm?


> If we will use IOCP per thread, when it makes no sense that IOCP can
> use load-balancing.

Yeah, using an IOCP per-thread defeats the purpose.


> I need load-balancing for my work items. If work items transferred
> over my own queues (not IOCP), then I need load-balancing for my own
> queues.

Right. I think I remember you mentioning that you created an advanced
work-stealing algorithm, and then created Relacy to remove some bugs within
it. How is that going? Does it work?

Chris M. Thomasson

Sep 19, 2008, 8:03:31 AM
"Chris M. Thomasson" <n...@spam.invalid> wrote in message
news:zdMAk.10710$Il....@newsfe09.iad...

Nope. On XP it uses non-paged memory. Here is a full program to demonstrate
it:
________________________________________________________________________
#include <stdio.h>
#include <stdlib.h>
#include <windows.h>


int main() {
    int i;
    HANDLE iocp;

    puts("hit <enter> to create");
    getchar();

    iocp =
        CreateIoCompletionPort(
            INVALID_HANDLE_VALUE,
            NULL,
            0,
            0
        );

    for (i = 0; i < 1000000; ++i) {
        LPOVERLAPPED ol = calloc(1, sizeof(*ol));
        PostQueuedCompletionStatus(
            iocp,
            123,
            i,
            ol
        );
    }

    puts("examine non-paged pool and hit <enter> to destroy");
    getchar();

    for (i = 0; i < 1000000; ++i) {
        DWORD bytes = 0;
        ULONG_PTR key = 0;
        LPOVERLAPPED ol = 0;
        GetQueuedCompletionStatus(
            iocp,
            &bytes,
            &key,
            &ol,
            INFINITE);
        free(ol);
    }

    CloseHandle(iocp);

    puts("examine non-paged pool and hit <enter> to exit");
    getchar();

    return 0;
}
________________________________________________________________________

Run the task manager. Run this program. Find the program in the task manager
list. Examine the non-paged memory pool. Hit enter... Examine the non-paged
memory pool again; it should be bigger. Hit enter. Examine the non-paged
memory pool one last time; it should be smaller.


On my system, this program makes the non-paged memory pool grow to over
39,000 KB!

Ouch!


Chris M. Thomasson

Sep 19, 2008, 8:06:00 AM
"Chris M. Thomasson" <n...@spam.invalid> wrote in message
news:EsMAk.10664$PS3....@newsfe06.iad...
[...]

Can somebody please run this program on Vista? I am curious if Microsoft has
changed it on that OS. Anyway Dmitriy, I totally agree with you that this is
counter-intuitive. It plain sucks! This is why I never use IOCP for anything
but I/O.

;^o

Dmitriy V'jukov

Sep 19, 2008, 9:11:20 AM
On Sep 19, 4:06 pm, "Chris M. Thomasson" <n...@spam.invalid> wrote:

> >> I seem to remember doing an experiment on NT 4.0 a long time ago where I
> >> posted tons of items to the port, and watched the non-paged pool usage
> >> start to grow. I the server work was on NT. I wonder if Microsoft changed
> >> this behavior.
>
> > Nope. On XP it uses non-paged memory. Here is a full program to
> > demonstrate
>
> [...]
>
> Can somebody please run this program on Vista? I am curious if Microsoft has
> changed it on that OS. Anyway Dmitriy, I totally agree with you that this is
> counter-intuitive. It plain sucks! This is why I never use IOCP for anything
> but I/O.


I've run the test. On my XP system the program uses 39M of non-paged pool
too. No more questions :)
I ran the test on Vista too. It also uses 39M.

An interesting detail: I increased the number of events to 2*10^7. On an XP
system with 1GB of memory, the program used up to 197M, then
PostQueuedCompletionStatus() calls failed. On a Vista system with 2GB of
memory, the program used 780M (!) of non-paged pool, and no
PostQueuedCompletionStatus() calls failed. It seems that they implemented
dynamic growing of the non-paged pool in Vista...

I also measured the execution time of the program. A PostQueuedCompletionStatus()/
GetQueuedCompletionStatus() pair takes around 10800 cycles. It can be
used for general-purpose messaging with a large margin to spare.


Dmitriy V'jukov

Dmitriy V'jukov

Sep 19, 2008, 9:13:28 AM
On Sep 19, 3:50 pm, "Chris M. Thomasson" <n...@spam.invalid> wrote:

> > I need load-balancing for my work items. If work items transferred
> > over my own queues (not IOCP), then I need load-balancing for my own
> > queues.
>
> Right. I think I remember you mentioning that you created an advanced
> work-stealing algorithm, and then created Relacy to remove some bugs within
> it. How is that going? Does it work?


Yes, it works. It avoids heavy operations not only in the push()
function but also in the pop() function. But it is suitable only for a
'Cilk-style workload', i.e. parallel slack + busy leaves.


Dmitriy V'jukov

David Schwartz

Sep 19, 2008, 12:02:36 PM
On Sep 18, 8:32 pm, "Chris M. Thomasson" <n...@spam.invalid> wrote:

> What if the program needed to wait on a multi-part predicate in an atomic
> fashion? You did say that if somebody gave you a requirement, you could do
> it the windows way. Well, how would you resolve the highly contrived problem
> I outlined? I think the solution I gave would actually work fine. How would
> you do it?

I wouldn't resolve the highly-contrived problem. I'd construct a
solution that doesn't cause the problem. That problem is a problem
with a proposed solution to a problem, it's not a real-world problem.

For example, real-world problems can't possibly care whether a job is
done first or second in the case where you have a large number of
small, logically-independent jobs. But your problem statement requires
me to care.

DS

Dmitriy V'jukov

Sep 19, 2008, 2:56:00 PM
On Sep 19, 12:36 am, Hallvard B Furuseth <h.b.furus...@usit.uio.no>

wrote:
> Dmitriy V'jukov writes:
> >On Sep 16, 12:36 am, Hallvard B Furuseth <h.b.furus...@usit.uio.no> wrote:
> >> Have the library macroize it, or at least some common use patterns?
> >> Some macros would be no-ops in the Windows version, others in the
> >> pthread version.  I don't know if code using Windows primitives will
> >> have similar enough structure to code using Pthreads though.  If so,
> >> it might help with a few macros that take code blocks as arguments.
>
> > Portable Eventcount suites well here. No macros. Performance is
> > perfect.
>
> If you mean the one you posted, I don't know why you call it "portable"
> when you are using a Windows interface.  Well, I don't know C++ so there
> may be some standard headers I don't know about (you didn't include any
> headers or the C++ equivalent), but I kind of doubt C++ would have
> standardized an interface which looks just like the Windows interface.
>
> Anyway, I needed the code below to compile with g++ on Linux.
> ReleaseSemaphore() in particular doesn't seem like "perfect"
> performance.  I haven't looked at what the code actually does though,
> so maybe you meant one should use something else.


I meant that the algorithm is portable, not the implementation. Sorry for any
confusion.
ReleaseSemaphore() is called by the 'producer' *only* when there is a blocked
'consumer'; that's an unavoidable cost, so to say. Otherwise the producer only
loads 1 variable and makes 1 conditional branch. No mutex
acquisitions, not even atomic RMW operations.
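
For readers who haven't met an eventcount before, its minimal (slow-everywhere)
shape is sketched below; the fast path Dmitriy describes comes from adding a
"waiters" flag so that signaling can skip the lock and the broadcast entirely
when nobody is blocked. This is only an illustration, not his implementation.
________________________________________________________________
#include <pthread.h>

/* Consumer:  for (;;) { if (pred()) break;
                          unsigned c = ec_get(&ec);
                          if (pred()) break;
                          ec_wait(&ec, c); }
   Producer:  make pred() true; ec_signal(&ec); */
typedef struct {
    pthread_mutex_t mtx;
    pthread_cond_t  cvar;
    unsigned        epoch; /* bumped by every signal */
} eventcount;

unsigned ec_get(eventcount* const ec) {
    unsigned e;
    pthread_mutex_lock(&ec->mtx);
    e = ec->epoch;
    pthread_mutex_unlock(&ec->mtx);
    return e;
}

void ec_wait(eventcount* const ec, unsigned const cmp) {
    pthread_mutex_lock(&ec->mtx);
    while (ec->epoch == cmp) /* nothing has happened since ec_get() */
        pthread_cond_wait(&ec->cvar, &ec->mtx);
    pthread_mutex_unlock(&ec->mtx);
}

void ec_signal(eventcount* const ec) {
    pthread_mutex_lock(&ec->mtx);
    ec->epoch++;
    pthread_mutex_unlock(&ec->mtx);
    pthread_cond_broadcast(&ec->cvar);
}
________________________________________________________________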


Dmitriy V'jukov
