Portable eventcount

Dmitriy V'jukov

unread,

Sep 12, 2008, 1:21:46 PM9/12/08

to

This eventcount algorithm is based on semaphore, so it can be easily
implemented in POSIX and/or Windows. Implementation is mostly lock-
free. Wait generations are maintained manually.

class eventcount
{
public:
typedef uint32_t state_t;

eventcount()
{
state_ = waiters_offset;
sema_ = CreateSemaphoreW(0, 0, LONG_MAX, 0);
if (0 == sema_)
throw std::runtime_error("error creating semaphore");
}

~eventcount()
{
CloseHandle(sema_);
}

state_t prepare_wait()
{
return _InterlockedExchangeAdd((long*)&state_, waiters_inc);
}

void retire_wait()
{
_InterlockedExchangeAdd((long*)&state_, -(int)waiters_inc);
}

void wait(state_t cmp)
{
for (;;)
{
state_t cmp0 = state_;
if ((cmp & generation_mask) != (cmp0 & generation_mask))
{
_InterlockedExchangeAdd((long*)&state_, -
(int)waiters_inc);
return;
}
WaitForSingleObject(sema_, INFINITE);
cmp0 = state_;
if ((cmp & generation_mask) != (cmp0 & generation_mask))
return;
ReleaseSemaphore(sema_, 1, 0);
Sleep(0);
}
}

void signal()
{
_mm_mfence();
signal_relaxed();
}

void signal_relaxed()
{
state_t cmp = state_;
int waiters = (int)(cmp & waiters_mask) - (int)waiters_offset;
if (waiters <= 0)
return;
for (;;)
{
state_t xchg = (cmp & ~waiters_mask) + generation_inc +
waiters_offset;
state_t cmp0 = _InterlockedCompareExchange((long*)&state_,
xchg, cmp);
if (cmp0 == cmp)
{
ReleaseSemaphore(sema_, waiters, 0);
return;
}
cmp = cmp0;
waiters = (int)(cmp & waiters_mask) - (int)waiters_offset;
if (waiters <= 0)
return;
}
}

private:
state_t volatile state_;
HANDLE sema_;

static state_t const waiters_inc = 1;
static state_t const waiters_mask = (1 << 20) - 1;
static state_t const waiters_offset = 1 << 19;
static state_t const generation_inc = 1 << 20;
static state_t const generation_mask = ~waiters_mask;

eventcount(eventcount const&);
eventcount& operator = (eventcount const&);
};

************************************************************

Here is a little helper class for blocking threads and usage example:

class eventcount_blocking
{
public:
eventcount_blocking(eventcount& ec)
: ec_(ec)
{
cmp_ = ec_.prepare_wait();
wait_ = false;
}

void wait()
{
assert(false == wait_);
wait_ = true;
ec_.wait(cmp_);
}

~eventcount_blocking()
{
if (false == wait_)
ec_.retire_wait();
}

private:
eventcount& ec_;
eventcount::state_t cmp_;
bool wait_;

eventcount_blocking(eventcount_blocking const&);
eventcount_blocking& operator = (eventcount_blocking const&);
};

************************************************************

Usage example:

queue q;
eventcount ec;

void producer(void* data)
{
q.enqueue(data);
ec.signal(); // or signal_relaxed()
}

void* consumer()
{
void* data = 0;
if (data = q.dequeue())
return data;
for (;;)
{
eventcount_blocking block (ec);
if (data = q.dequeue())
return data;
block.wait();
if (data = q.dequeue())
return data;
}
}

Dmitriy V'jukov
--
Relacy Race Detector: Make your synchronization correct!
http://groups.google.ru/group/relacy/web

Anthony Williams

unread,

Sep 12, 2008, 4:43:21 PM9/12/08

to

"Dmitriy V'jukov" <dvy...@gmail.com> writes:

> void wait(state_t cmp)
> {
> for (;;)
> {
> state_t cmp0 = state_;
> if ((cmp & generation_mask) != (cmp0 & generation_mask))
> {
> _InterlockedExchangeAdd((long*)&state_, -
> (int)waiters_inc);
> return;
> }
> WaitForSingleObject(sema_, INFINITE);
> cmp0 = state_;
> if ((cmp & generation_mask) != (cmp0 & generation_mask))
> return;
> ReleaseSemaphore(sema_, 1, 0);
> Sleep(0);
> }
> }

Even with the Sleep(0) this code can keep stealing it's own signals,
preventing another thread from actually waking. If you have a lot of
threads waiting since the last outstanding notify and only one from
before then the new threads can keep waking up, leaving the thread
that needs to be woken still sleeping.

>
> void signal_relaxed()
> {
> state_t cmp = state_;
> int waiters = (int)(cmp & waiters_mask) - (int)waiters_offset;
> if (waiters <= 0)
> return;
> for (;;)
> {
> state_t xchg = (cmp & ~waiters_mask) + generation_inc +
> waiters_offset;
> state_t cmp0 = _InterlockedCompareExchange((long*)&state_,
> xchg, cmp);
> if (cmp0 == cmp)
> {
> ReleaseSemaphore(sema_, waiters, 0);
> return;
> }
> cmp = cmp0;
> waiters = (int)(cmp & waiters_mask) - (int)waiters_offset;
> if (waiters <= 0)
> return;
> }
> }

This suffers from stolen wakeups.

If there are three threads waiting then a signal will zero the waiters
count and add 3 to the semaphore. If one more thread waits before
those waiters have woken then the waiters count will be
positive. Suppose another thread calls signal: it will see the
non-zero waiter count and proceed into the for(;;) loop. If it gets
pre-empted before it does the CAS and three more threads wait they
will see the not-yet-incremented generation count. When the signalling
thread wakes up and releases 1 thread on the semaphore and increments
the generation, the 4 new threads can now see the new generation count
and steal the wakes intended for the first three threads, which now
remain sleeping until the *next* signal.

Anthony
--
Anthony Williams | Just Software Solutions Ltd
Custom Software Development | http://www.justsoftwaresolutions.co.uk
Registered in England, Company Number 5478976.
Registered Office: 15 Carrallack Mews, St Just, Cornwall, TR19 7UL

Dmitriy V'jukov

unread,

Sep 12, 2008, 7:18:49 PM9/12/08

to

On 13 сент, 00:43, Anthony Williams <anthony....@gmail.com> wrote:

> "Dmitriy V'jukov" <dvyu...@gmail.com> writes:
> > void wait(state_t cmp)
> > {
> > for (;;)
> > {
> > state_t cmp0 = state_;
> > if ((cmp & generation_mask) != (cmp0 & generation_mask))
> > {
> > _InterlockedExchangeAdd((long*)&state_, -
> > (int)waiters_inc);
> > return;
> > }
> > WaitForSingleObject(sema_, INFINITE);
> > cmp0 = state_;
> > if ((cmp & generation_mask) != (cmp0 & generation_mask))
> > return;
> > ReleaseSemaphore(sema_, 1, 0);
> > Sleep(0);
> > }
> > }
>
> Even with the Sleep(0) this code can keep stealing it's own signals,
> preventing another thread from actually waking. If you have a lot of
> threads waiting since the last outstanding notify and only one from
> before then the new threads can keep waking up, leaving the thread
> that needs to be woken still sleeping.

Anthony, thank you for review and for interest.

Yes, I agree that this algorithm suffers from stealing signals. If we
use only one synchronization variable, then I think it's inevitable.
If we will use for example event per thread than we can avoid
stealing. But then signaling thread will have to make N system calls
to wakeup N threads.
My reasoning here is that 'window of inconsistency' is relatively
small. And Sleep(0) must further diminish the effect. So some
temporary bad things are possible, but eventually system must
stabilize, and right threads will consume semaphore count.
I was not evaluating this algorithm in real application, so I can't
say with confidence how things will be in real life... Do you think
that it will be the real problem?

I do not agree here. Waiter *atomically* increments waiters count and
loads current generation. Signaler *atomically* increments generation
and loads waiters count. So in the example you describe either new
waiters must not see increment of generation (i.e. remain waiting) or
signaler must add 3 to semaphore for new waiters (i.e. all 7 threads
will consume semaphore count).

Dmitriy V'jukov

unread,

Sep 12, 2008, 7:27:02 PM9/12/08

to

On 13 сент, 00:43, Anthony Williams <anthony....@gmail.com> wrote:

> "Dmitriy V'jukov" <dvyu...@gmail.com> writes:
> > void wait(state_t cmp)
> > {
> > for (;;)
> > {
> > state_t cmp0 = state_;
> > if ((cmp & generation_mask) != (cmp0 & generation_mask))
> > {
> > _InterlockedExchangeAdd((long*)&state_, -
> > (int)waiters_inc);
> > return;
> > }
> > WaitForSingleObject(sema_, INFINITE);
> > cmp0 = state_;
> > if ((cmp & generation_mask) != (cmp0 & generation_mask))
> > return;
> > ReleaseSemaphore(sema_, 1, 0);
> > Sleep(0);
> > }
> > }
>
> Even with the Sleep(0) this code can keep stealing it's own signals,
> preventing another thread from actually waking. If you have a lot of
> threads waiting since the last outstanding notify and only one from
> before then the new threads can keep waking up, leaving the thread
> that needs to be woken still sleeping.

Btw, there are important algorithms which doesn't require generations
at all. For example, producer-consumer pattern, it's usually
indifferent what consumer will consume the event.
In such situations support of generations can (and will) only slow
down things. If we use condition variables, then generations are
inevitable. But we can patch our custom eventcount so that it will be
'generationless'. generationless_eventcount. What do you think?
We just have to eliminate variable which holds generation number and
eliminate checks of generation number. So any thread which consume
semaphore count is appropriate for us.

Chris M. Thomasson

unread,

Sep 12, 2008, 9:41:30 PM9/12/08

to

"Dmitriy V'jukov" <dvy...@gmail.com> wrote in message
news:cabe0495-9164-4430...@26g2000hsk.googlegroups.com...

On 13 сент, 00:43, Anthony Williams <anthony....@gmail.com> wrote:
> > "Dmitriy V'jukov" <dvyu...@gmail.com> writes:
> > > void wait(state_t cmp)
> > > {
> > > for (;;)
> > > {
> > > state_t cmp0 = state_;
> > > if ((cmp & generation_mask) != (cmp0 & generation_mask))
> > > {
> > > _InterlockedExchangeAdd((long*)&state_, -
> > > (int)waiters_inc);
> > > return;
> > > }
> > > WaitForSingleObject(sema_, INFINITE);
> > > cmp0 = state_;
> > > if ((cmp & generation_mask) != (cmp0 & generation_mask))
> > > return;
> > > ReleaseSemaphore(sema_, 1, 0);
> > > Sleep(0);
> > > }
> > > }
> >
> > Even with the Sleep(0) this code can keep stealing it's own signals,
> > preventing another thread from actually waking. If you have a lot of
> > threads waiting since the last outstanding notify and only one from
> > before then the new threads can keep waking up, leaving the thread
> > that needs to be woken still sleeping.

> Btw, there are important algorithms which doesn't require generations
> at all. For example, producer-consumer pattern, it's usually
> indifferent what consumer will consume the event.

So be it...

However, and IMVHO, if you cannot use an eventcount to create a condvar,
then the EC algorithm is busted; period...

[...]

Dmitriy V'jukov

unread,

Sep 13, 2008, 7:40:25 AM9/13/08

to

On 13 сент, 05:41, "Chris M. Thomasson" <n...@spam.invalid>

> > Btw, there are important algorithms which doesn't require generations
> > at all. For example, producer-consumer pattern, it's usually
> > indifferent what consumer will consume the event.
>
> So be it...
>
> However, and IMVHO, if you cannot use an eventcount to create a condvar,
> then the EC algorithm is busted; period...

Hmmm... I have to think about different name :)

Dmitriy V'jukov

Anthony Williams

unread,

Sep 13, 2008, 5:04:23 PM9/13/08

to

"Dmitriy V'jukov" <dvy...@gmail.com> writes:

> On 13 сент, 00:43, Anthony Williams <anthony....@gmail.com> wrote:
>> "Dmitriy V'jukov" <dvyu...@gmail.com> writes:
>> > void wait(state_t cmp)
>> > {
>> > for (;;)
>> > {
>> > state_t cmp0 = state_;
>> > if ((cmp & generation_mask) != (cmp0 & generation_mask))
>> > {
>> > _InterlockedExchangeAdd((long*)&state_, -
>> > (int)waiters_inc);
>> > return;
>> > }
>> > WaitForSingleObject(sema_, INFINITE);
>> > cmp0 = state_;
>> > if ((cmp & generation_mask) != (cmp0 & generation_mask))
>> > return;
>> > ReleaseSemaphore(sema_, 1, 0);
>> > Sleep(0);
>> > }
>> > }
>>
>> Even with the Sleep(0) this code can keep stealing it's own signals,
>> preventing another thread from actually waking. If you have a lot of
>> threads waiting since the last outstanding notify and only one from
>> before then the new threads can keep waking up, leaving the thread
>> that needs to be woken still sleeping.
>
>
> Anthony, thank you for review and for interest.
>
> Yes, I agree that this algorithm suffers from stealing signals. If we
> use only one synchronization variable, then I think it's inevitable.

That depends.

> If we will use for example event per thread than we can avoid
> stealing. But then signaling thread will have to make N system calls
> to wakeup N threads.

That's true. However, we can change the semaphore with each generation
if we're waking all threads, in which case we don't need to worry
about the generation number. I've got a new condition variable
algorithm that does just that.

> My reasoning here is that 'window of inconsistency' is relatively
> small. And Sleep(0) must further diminish the effect. So some
> temporary bad things are possible, but eventually system must
> stabilize, and right threads will consume semaphore count.
> I was not evaluating this algorithm in real application, so I can't
> say with confidence how things will be in real life... Do you think
> that it will be the real problem?

I think it will primarily be a problem if the thread that should be
notified is a lower priority than the thread that keeps stealing the
notify.

Yes, you're right. I misread the code. If you do a lot of signals
(4096?), the generation count can wrap round though so threads might
see a wrapped generation count if it's waiting long enough.

Dmitriy V'jukov

unread,

Sep 15, 2008, 1:15:08 PM9/15/08

to

On Sep 14, 1:04 am, Anthony Williams <anthony....@gmail.com> wrote:

> > Yes, I agree that this algorithm suffers from stealing signals. If we
> > use only one synchronization variable, then I think it's inevitable.
>
> That depends.

Everything depends :) Can you elaborate what depends and on what
depends? :)

> > If we will use for example event per thread than we can avoid
> > stealing. But then signaling thread will have to make N system calls
> > to wakeup N threads.
>
> That's true. However, we can change the semaphore with each generation
> if we're waking all threads, in which case we don't need to worry
> about the generation number. I've got a new condition variable
> algorithm that does just that.

Can you point to implementation? I'm going to implement the same
thing. I think it's promising idea. It's easy to implement with mutex,
but not so easy with atomic RMW operations.

> > My reasoning here is that 'window of inconsistency' is relatively
> > small. And Sleep(0) must further diminish the effect. So some
> > temporary bad things are possible, but eventually system must
> > stabilize, and right threads will consume semaphore count.
> > I was not evaluating this algorithm in real application, so I can't
> > say with confidence how things will be in real life... Do you think
> > that it will be the real problem?
>
> I think it will primarily be a problem if the thread that should be
> notified is a lower priority than the thread that keeps stealing the
> notify.

Yeah, this can be a problem in some contexts. I spot the same 'bug' in
Intel Threading Building Blocks:
http://softwarecommunity.intel.com/isn/Community/en-US/forums/thread/30263365.aspx
But developers even don't want to insert 'Sleep(0)' into busy loop...

> > I do not agree here. Waiter *atomically* increments waiters count and
> > loads current generation. Signaler *atomically* increments generation
> > and loads waiters count. So in the example you describe either new
> > waiters must not see increment of generation (i.e. remain waiting) or
> > signaler must add 3 to semaphore for new waiters (i.e. all 7 threads
> > will consume semaphore count).
>
> Yes, you're right. I misread the code. If you do a lot of signals
> (4096?), the generation count can wrap round though so threads might
> see a wrapped generation count if it's waiting long enough.

Yes, it's a problem too. One can use double-word CAS to make more
space for generation.
Anyway I'm going to implement eventcount with semaphore per
generation.

Dmitriy V'jukov

unread,

Sep 15, 2008, 2:07:08 PM9/15/08

to

On Sep 15, 9:15 pm, "Dmitriy V'jukov" <dvyu...@gmail.com> wrote:

> > That's true. However, we can change the semaphore with each generation
> > if we're waking all threads, in which case we don't need to worry
> > about the generation number. I've got a new condition variable
> > algorithm that does just that.
>
> Can you point to implementation? I'm going to implement the same
> thing. I think it's promising idea. It's easy to implement with mutex,
> but not so easy with atomic RMW operations.

Here is crude sketch based on mutex:
http://groups.google.com/group/comp.programming.threads/msg/401538a210715ad7?hl=en

The good thing it that there are no dynamic allocations of
synchronization objects nor memory.
I'm thinking whether it's worth doing implementing it with atomic
RMWs...

Anthony Williams

unread,

Sep 15, 2008, 4:35:48 PM9/15/08

to

"Dmitriy V'jukov" <dvy...@gmail.com> writes:

> On Sep 14, 1:04 am, Anthony Williams <anthony....@gmail.com> wrote:
>
>> > Yes, I agree that this algorithm suffers from stealing signals. If we
>> > use only one synchronization variable, then I think it's inevitable.
>>
>> That depends.
>
> Everything depends :) Can you elaborate what depends and on what
> depends? :)

It depends on whether you mean one synchronization variable for the
whole lifetime of the event count, or something else, such as one
variable per generation (for example).

>> > If we will use for example event per thread than we can avoid
>> > stealing. But then signaling thread will have to make N system calls
>> > to wakeup N threads.
>>
>> That's true. However, we can change the semaphore with each generation
>> if we're waking all threads, in which case we don't need to worry
>> about the generation number. I've got a new condition variable
>> algorithm that does just that.
>
> Can you point to implementation? I'm going to implement the same
> thing. I think it's promising idea. It's easy to implement with mutex,
> but not so easy with atomic RMW operations.

struct condvar
{
struct waiter_data
{
waiter_data* next;
waiter_data* prev;
HANDLE broadcast_event;
};

boost::mutex internal_mutex;
waiter_data* waiters;

condvar():
waiters(NULL)
{}

void register_waiter(waiter_data& waiter)
{
boost::lock_guard<boost::mutex> lk(internal_mutex);
waiter.prev=waiters;
if(waiters)
{
waiters->next=&waiter;
waiter.broadcast_event=waiters->broadcast_event;
}
else
{
waiter.broadcast_event=CreateEvent(NULL,true,false,NULL);
}
waiters=&waiter;

}

bool wakeup(waiter_data& waiter)
{
boost::lock_guard<boost::mutex> lk(internal_mutex);
bool last_waiter=true;
if(waiter.prev)
{
waiter.prev->next=waiter.next;
last_waiter=false;
}
if(waiter.next)
{
waiter.next->prev=waiter.prev;
last_waiter=false;
}
if(waiters==&waiter)
{
waiters=waiter.prev;
}
}

void wait(boost::unique_lock<boost::mutex>& external)
{
waiter_data waiter;
register_waiter(waiter);
external.unlock();
WaitForSingleObject(waiter.broadcast_event,INFINITE);
if(wakeup(waiter))
{
CloseHandle(waiter.broadcast_event);
}
external.lock();
}

void broadcast()
{
boost::lock_guard<boost::mutex> lk(internal_mutex);
if(waiters)
{
HANDLE const old_broadcast_event=waiters->broadcast_event;
SetEvent(old_broadcast_event);
waiters=0;
}
}
};

No dynamic allocation, one event per generation. It uses an internal
mutex to protect the waiters list, but a lock-free list would work
too.

The notify logic builds on top of that. Currently I use an event for a
"gate" similar to Alexander Terekhov's algorithm, and a semaphore for
waking threads. You can't wait whilst there's pending notifies, unless
the whole generation has been notified (even if they haven't woken
yet) or you've done a broadcast. Timeouts are easy too.

Dmitriy V'jukov

unread,

Sep 15, 2008, 6:48:51 PM9/15/08

to

On 16 сент, 00:35, Anthony Williams <anthony....@gmail.com> wrote:

> bool wakeup(waiter_data& waiter)
> {
> boost::lock_guard<boost::mutex> lk(internal_mutex);

/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\

It's about that mutex acquisition.
Assume quad-core machine. No, better 2 processors x 4 cores. No-no-no,
better 64-core Niagara.
Assume thread signals condition variable, and 63 threads on 63 cores
simultaneously start executing. First thing all they have to do is to
acquire the same mutex. More than half of that threads can instantly
go back to the black hole (by black hole here I mean the worst enemy
of lock-free programmer - THE KERNEL). Cr@p!
They born just to die in a moment.
Am I a paranoid?

Chris M. Thomasson

unread,

Sep 15, 2008, 7:37:33 PM9/15/08

to

"Dmitriy V'jukov" <dvy...@gmail.com> wrote in message

news:7553b3b0-39f9-4979...@j22g2000hsf.googlegroups.com...

On Sep 15, 9:15 pm, "Dmitriy V'jukov" <dvyu...@gmail.com> wrote:

> > > That's true. However, we can change the semaphore with each generation
> > > if we're waking all threads, in which case we don't need to worry
> > > about the generation number. I've got a new condition variable
> > > algorithm that does just that.
> >
> > Can you point to implementation? I'm going to implement the same
> > thing. I think it's promising idea. It's easy to implement with mutex,
> > but not so easy with atomic RMW operations.

> Here is crude sketch based on mutex:
> http://groups.google.com/group/comp.programming.threads/msg/401538a210715ad7?hl=en

> The good thing it that there are no dynamic allocations of
> synchronization objects nor memory.
> I'm thinking whether it's worth doing implementing it with atomic
> RMWs...

In the context of an eventcount, you would be optimizing a "really"
slow-path... Humm, is it worth it? That is the question indeed...

;^)

Chris M. Thomasson

unread,

Sep 15, 2008, 7:27:28 PM9/15/08

to

"Dmitriy V'jukov" <dvy...@gmail.com> wrote in message

news:80d0e572-396c-435a...@m45g2000hsb.googlegroups.com...

On 16 сент, 00:35, Anthony Williams <anthony....@gmail.com> wrote:

> > bool wakeup(waiter_data& waiter)
> > {
> > boost::lock_guard<boost::mutex> lk(internal_mutex);

> /\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\

> It's about that mutex acquisition.
> Assume quad-core machine. No, better 2 processors x 4 cores. No-no-no,
> better 64-core Niagara.
> Assume thread signals condition variable, and 63 threads on 63 cores
> simultaneously start executing. First thing all they have to do is to
> acquire the same mutex. More than half of that threads can instantly
> go back to the black hole (by black hole here I mean the worst enemy
> of lock-free programmer - THE KERNEL). Cr@p!
> They born just to die in a moment.

Look into wait-morphing...

> Am I a paranoid?

Not really.

Anthony Williams

unread,

Sep 16, 2008, 3:27:37 AM9/16/08

to

"Dmitriy V'jukov" <dvy...@gmail.com> writes:

> On 16 сент, 00:35, Anthony Williams <anthony....@gmail.com> wrote:
>
>> bool wakeup(waiter_data& waiter)
>> {
>> boost::lock_guard<boost::mutex> lk(internal_mutex);
>
> /\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\
>
> It's about that mutex acquisition.
> Assume quad-core machine. No, better 2 processors x 4 cores. No-no-no,
> better 64-core Niagara.
> Assume thread signals condition variable, and 63 threads on 63 cores
> simultaneously start executing. First thing all they have to do is to
> acquire the same mutex. More than half of that threads can instantly
> go back to the black hole (by black hole here I mean the worst enemy
> of lock-free programmer - THE KERNEL). Cr@p!
> They born just to die in a moment.
> Am I a paranoid?

No, you're not paranoid. If we're trying to improve event counts or
cond vars we ought to look at every aspect.

Dmitriy V'jukov

unread,

Sep 16, 2008, 5:51:31 AM9/16/08

to

On Sep 16, 3:27 am, "Chris M. Thomasson" <n...@spam.invalid> wrote:
> "Dmitriy V'jukov" <dvyu...@gmail.com> wrote in message
>
> news:80d0e572-396c-435a...@m45g2000hsb.googlegroups.com...

> On 16 ÓÅÎÔ, 00:35, Anthony Williams <anthony....@gmail.com> wrote:
>
> > > bool wakeup(waiter_data& waiter)
> > > {
> > > boost::lock_guard<boost::mutex> lk(internal_mutex);
> > /\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\
> > It's about that mutex acquisition.
> > Assume quad-core machine. No, better 2 processors x 4 cores. No-no-no,
> > better 64-core Niagara.
> > Assume thread signals condition variable, and 63 threads on 63 cores
> > simultaneously start executing. First thing all they have to do is to
> > acquire the same mutex. More than half of that threads can instantly
> > go back to the black hole (by black hole here I mean the worst enemy
> > of lock-free programmer - THE KERNEL). Cr@p!
> > They born just to die in a moment.
>
> Look into wait-morphing...

Into what? Where I can see it?

Dmitriy V'jukov

unread,

Sep 16, 2008, 5:55:37 AM9/16/08

to

On Sep 16, 11:27 am, Anthony Williams <anthony....@gmail.com> wrote:

> "Dmitriy V'jukov" <dvyu...@gmail.com> writes:
> > On 16 сент, 00:35, Anthony Williams <anthony....@gmail.com> wrote:
>
> >> bool wakeup(waiter_data& waiter)
> >> {
> >> boost::lock_guard<boost::mutex> lk(internal_mutex);
>
> > /\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\
>
> > It's about that mutex acquisition.
> > Assume quad-core machine. No, better 2 processors x 4 cores. No-no-no,
> > better 64-core Niagara.
> > Assume thread signals condition variable, and 63 threads on 63 cores
> > simultaneously start executing. First thing all they have to do is to
> > acquire the same mutex. More than half of that threads can instantly
> > go back to the black hole (by black hole here I mean the worst enemy
> > of lock-free programmer - THE KERNEL). Cr@p!
> > They born just to die in a moment.
> > Am I a paranoid?
>
> No, you're not paranoid. If we're trying to improve event counts or
> cond vars we ought to look at every aspect.

Ok. Then I will post my lock-free version of eventcount/condvar which
doesn't require memory allocation/deallocation and creation/
destruction of synchronization objects, whole generation can be
signaled with 1 syscall. It seems that I will be able to come up with
algorithm which requires only single word CAS.

Dmitriy V'jukov

Chris M. Thomasson

unread,

Sep 16, 2008, 6:49:48 AM9/16/08

to

> > "Dmitriy V'jukov" <dvy...@gmail.com> wrote in message

> > news:b5baa5d0-154c-41be...@e53g2000hsa.googlegroups.com...

> On Sep 16, 3:27 am, "Chris M. Thomasson" <n...@spam.invalid> wrote:
> > "Dmitriy V'jukov" <dvyu...@gmail.com> wrote in message
> >
> > news:80d0e572-396c-435a...@m45g2000hsb.googlegroups.com...

> > On 16 сент, 00:35, Anthony Williams <anthony....@gmail.com> wrote:
> >
> > > > bool wakeup(waiter_data& waiter)
> > > > {
> > > > boost::lock_guard<boost::mutex> lk(internal_mutex);
> > > /\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\
> > > It's about that mutex acquisition.
> > > Assume quad-core machine. No, better 2 processors x 4 cores. No-no-no,
> > > better 64-core Niagara.
> > > Assume thread signals condition variable, and 63 threads on 63 cores
> > > simultaneously start executing. First thing all they have to do is to
> > > acquire the same mutex. More than half of that threads can instantly
> > > go back to the black hole (by black hole here I mean the worst enemy
> > > of lock-free programmer - THE KERNEL). Cr@p!
> > > They born just to die in a moment.
> >
> > Look into wait-morphing...

> Into what?

Atomically transferring the waitset of the condvar to the mutex. 3 threads
are waiting with mutex A... Broadcast occurs; the following is atomic:

thread 1 gains access and threads 2-3 are magically in mutex waitset; no
thundering heard.

> Where I can see it?

I think a version of NPTL futex has wait-morphing...

Dmitriy V'jukov

unread,

Sep 17, 2008, 12:48:24 PM9/17/08

to

On Sep 16, 2:49 pm, "Chris M. Thomasson" <n...@spam.invalid> wrote:

> > > Look into wait-morphing...
> > Into what?
>
> Atomically transferring the waitset of the condvar to the mutex. 3 threads
> are waiting with mutex A... Broadcast occurs; the following is atomic:
>
> thread 1 gains access and threads 2-3 are magically in mutex waitset; no
> thundering heard.
>
> > Where I can see it?
>
> I think a version of NPTL futex has wait-morphing...

I see. It's interesting thing, but as I understand it is applicable
only in kernel, right? At least I don't see how I can implement it...

Dmitriy V'jukov

unread,

Sep 17, 2008, 1:07:00 PM9/17/08

to

On Sep 16, 1:55 pm, "Dmitriy V'jukov" <dvyu...@gmail.com> wrote:

> On Sep 16, 11:27šam, Anthony Williams <anthony....@gmail.com> wrote:
>
>
>
> > "Dmitriy V'jukov" <dvyu...@gmail.com> writes:

> > > On 16 ÓÅÎÔ, 00:35, Anthony Williams <anthony....@gmail.com> wrote:
>
> > >> bool wakeup(waiter_data& waiter)
> > >> {

> > >> š š boost::lock_guard<boost::mutex> lk(internal_mutex);

>
> > > /\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\
>
> > > It's about that mutex acquisition.
> > > Assume quad-core machine. No, better 2 processors x 4 cores. No-no-no,
> > > better 64-core Niagara.
> > > Assume thread signals condition variable, and 63 threads on 63 cores
> > > simultaneously start executing. First thing all they have to do is to
> > > acquire the same mutex. More than half of that threads can instantly
> > > go back to the black hole (by black hole here I mean the worst enemy
> > > of lock-free programmer - THE KERNEL). Cr@p!
> > > They born just to die in a moment.
> > > Am I a paranoid?
>
> > No, you're not paranoid. If we're trying to improve event counts or
> > cond vars we ought to look at every aspect.
>
> Ok. Then I will post my lock-free version of eventcount/condvar which
> doesn't require memory allocation/deallocation and creation/
> destruction of synchronization objects, whole generation can be
> signaled with 1 syscall. It seems that I will be able to come up with
> algorithm which requires only single word CAS.

Lock-free implementation get more and more messy while I am working on
details. So I'm gonna to retire the idea of lock-free eventcount... at
least for now.
Mutex is okey in condvar/eventcount, the only thing I don't like is
mutex acquisition after wait on event/semaphore. But it turns out to
be not easy to use mutex everywhere and create lock-free path only
after wait on event/semaphore.
I have another idea. Now I am working on algorithm which is 100% mutex
based (not counting single unprotected load in eventcount version,
condvar version will be 100% mutex based).
This algorithm have promising properties:
1. No memory allocation/deallocation
2. No kernel object creation/destruction
3. Broadcasting with single syscall
4. No mutex acquisition after wait on semaphore
5. Portable because based on semaphore

Dmitriy V'jukov

Anthony Williams

unread,

Sep 17, 2008, 4:07:56 PM9/17/08

to

"Dmitriy V'jukov" <dvy...@gmail.com> writes:

Yes. I had a go, and it got too messy, so I decided to leave it for now.

> Mutex is okey in condvar/eventcount, the only thing I don't like is
> mutex acquisition after wait on event/semaphore. But it turns out to
> be not easy to use mutex everywhere and create lock-free path only
> after wait on event/semaphore.

I can sympathise here too.

> I have another idea. Now I am working on algorithm which is 100% mutex
> based (not counting single unprotected load in eventcount version,
> condvar version will be 100% mutex based).
> This algorithm have promising properties:
> 1. No memory allocation/deallocation
> 2. No kernel object creation/destruction
> 3. Broadcasting with single syscall
> 4. No mutex acquisition after wait on semaphore
> 5. Portable because based on semaphore

That sounds promising. I hope it pans out.

Chris M. Thomasson

unread,

Sep 18, 2008, 10:06:39 AM9/18/08

to

"Dmitriy V'jukov" <dvy...@gmail.com> wrote in message

news:6c8bf80e-c9db-4680...@b1g2000hsg.googlegroups.com...

I have experimented with something like the following on Windows in the
past:
__________________________________________________________________
class condvar {
HANDLE m_sema;
HANDLE m_event; // manual-reset; init to set
HANDLE m_mutex;
unsigned m_waiters; // = 0
unsigned m_generation; // = 0

public:
void broadcast() {
WaitForSingleObject(m_mutex, INFINITE);
if (! m_generation) {
unsigned const waiters = m_waiters;
if (waiters) {
m_generation = waiters;
m_waiters = 0;
ResetEvent(m_event);
ReleaseSemaphore(m_sema, waiters, NULL);
}
}
ReleaseMutex(m_mutex);
}

void wait(HANDLE umutex) {
HANDLE wset[3] = { umutex, m_mutex, m_sema };
SignalObjectAndWait(umutex, m_mutex, INFINITE, FALSE);
++m_waiters;
SignalObjectAndWait(m_mutex, m_event, INFINITE, FALSE);
WaitForMultipleObjects(wset, 3, TRUE, INFINITE);
--m_generation;
if (! m_generation) {
SetEvent(m_event);
}
ReleaseMutex(m_mutex);
}
};
__________________________________________________________________

Any thoughts?

Chris M. Thomasson

unread,

Sep 18, 2008, 10:27:48 AM9/18/08

to

"Chris M. Thomasson" <n...@spam.invalid> wrote in message
news:7atAk.6966$PS3....@newsfe06.iad...

> "Dmitriy V'jukov" <dvy...@gmail.com> wrote in message
> news:6c8bf80e-c9db-4680...@b1g2000hsg.googlegroups.com...
> On Sep 16, 2:49 pm, "Chris M. Thomasson" <n...@spam.invalid> wrote:
>
>> > > > Look into wait-morphing...
>> > > Into what?
>> >
>> > Atomically transferring the waitset of the condvar to the mutex. 3
>> > threads
>> > are waiting with mutex A... Broadcast occurs; the following is atomic:
>> >
>> > thread 1 gains access and threads 2-3 are magically in mutex waitset;
>> > no
>> > thundering heard.
>> >
>> > > Where I can see it?
>> >
>> > I think a version of NPTL futex has wait-morphing...
>
>> I see. It's interesting thing, but as I understand it is applicable
>> only in kernel, right?

Yeah. Although, if you manually constructed your own waitsets for mutex and
condvar, then it could probably be implemented.

> At least I don't see how I can implement it...
>
> I have experimented with something like the following on Windows in the
> past:
> __________________________________________________________________

> __________________________________________________________________
> Any thoughts?

I don't know for sure if Windows uses wait-morphing for
`WaitForMultipleObjects'. It seems like it should. I am atomically acquiring
the user mutex, the internal mutex and decrementing the internal semaphore
in a single atomic operation. If there is a thundering heard, its going to
be hidden by the internal Windows impl. I hope they would optimize for this
scenario... Humm... I should probably ask this on a Windows Kernel group...

Chris M. Thomasson

unread,

Sep 18, 2008, 10:45:03 AM9/18/08

to

"Chris M. Thomasson" <n...@spam.invalid> wrote in message
news:7atAk.6966$PS3....@newsfe06.iad...

I forgot to add the signal function!

void signal() {

WaitForSingleObject(m_mutex, INFINITE);
if (! m_generation) {
unsigned const waiters = m_waiters;
if (waiters) {

m_generation = 1;
m_waiters = waiters - 1;
ResetEvent(m_event);
ReleaseSemaphore(m_sema, 1, NULL);

Dmitriy V'jukov

unread,

Sep 23, 2008, 7:44:38 AM9/23/08

to

On Sep 18, 12:07 am, Anthony Williams <anthony....@gmail.com> wrote:

> > I have another idea. Now I am working on algorithm which is 100% mutex
> > based (not counting single unprotected load in eventcount version,
> > condvar version will be 100% mutex based).
> > This algorithm have promising properties:
> > 1. No memory allocation/deallocation
> > 2. No kernel object creation/destruction
> > 3. Broadcasting with single syscall
> > 4. No mutex acquisition after wait on semaphore
> > 5. Portable because based on semaphore
>
> That sounds promising. I hope it pans out.

Here is latest version:
http://groups.google.com/group/comp.programming.threads/msg/2ca163be701cf984
What do you think? For the time present I don't see any caveats.

Dmitriy V'jukov

unread,

Sep 23, 2008, 7:48:11 AM9/23/08

to

On Sep 18, 6:06 pm, "Chris M. Thomasson" <n...@spam.invalid> wrote:

> > > > > Look into wait-morphing...
> > > > Into what?
>
> > > Atomically transferring the waitset of the condvar to the mutex. 3
> > > threads
> > > are waiting with mutex A... Broadcast occurs; the following is atomic:
>
> > > thread 1 gains access and threads 2-3 are magically in mutex waitset; no
> > > thundering heard.
>
> > > > Where I can see it?
>
> > > I think a version of NPTL futex has wait-morphing...
> > I see. It's interesting thing, but as I understand it is applicable
> > only in kernel, right? At least I don't see how I can implement it...
>
> I have experimented with something like the following on Windows in the
> past:

> [...]
> Any thoughts?

Hmmm... Cool! With WaitForMultipleObjects() it's possible to implement
wait-morphing in user space.
But it requires mutex to be kernel object too. I think this
annihilates any advantages from wait-morphing.

Dmitriy V'jukov

unread,

Sep 23, 2008, 7:54:29 AM9/23/08

to

On Sep 18, 6:27 pm, "Chris M. Thomasson" <n...@spam.invalid> wrote:

> I don't know for sure if Windows uses wait-morphing for
> `WaitForMultipleObjects'. It seems like it should.

I think that wait-morphing is not applicable to MFMO, because MFMO
waits for several objects *atomically*, not for one object and then
for another.
But it works like MFMO uses wait-morphing, i.e. there are no
intermediate wakeups. So you win an objective.

Dmitriy V'jukov

Chris M. Thomasson

unread,

Oct 26, 2008, 7:33:32 AM10/26/08

to

"Chris M. Thomasson" <n...@spam.invalid> wrote in message
news:7atAk.6966$PS3....@newsfe06.iad...

> "Dmitriy V'jukov" <dvy...@gmail.com> wrote in message
> news:6c8bf80e-c9db-4680...@b1g2000hsg.googlegroups.com...
> On Sep 16, 2:49 pm, "Chris M. Thomasson" <n...@spam.invalid> wrote:
>
>> > > > Look into wait-morphing...
>> > > Into what?
>> >
>> > Atomically transferring the waitset of the condvar to the mutex. 3
>> > threads
>> > are waiting with mutex A... Broadcast occurs; the following is atomic:
>> >
>> > thread 1 gains access and threads 2-3 are magically in mutex waitset;
>> > no
>> > thundering heard.
>> >
>> > > Where I can see it?
>> >
>> > I think a version of NPTL futex has wait-morphing...
>
>> I see. It's interesting thing, but as I understand it is applicable
>> only in kernel, right? At least I don't see how I can implement it...
>
> I have experimented with something like the following on Windows in the
> past:
> __________________________________________________________________

[...]
> __________________________________________________________________
>
>
>
> Any thoughts?

Its busted, but I wanted to show how it breaks in Relacy. I am looking
forward to an implementation of `WaitForMultipleObjects' in the Win32 Relacy
emulation layer.

;^)