This eventcount algorithm is based on semaphore, so it can be easily implemented in POSIX and/or Windows. Implementation is mostly lock- free. Wait generations are maintained manually.
class eventcount { public: typedef uint32_t state_t;
Even with the Sleep(0) this code can keep stealing it's own signals, preventing another thread from actually waking. If you have a lot of threads waiting since the last outstanding notify and only one from before then the new threads can keep waking up, leaving the thread that needs to be woken still sleeping.
If there are three threads waiting then a signal will zero the waiters count and add 3 to the semaphore. If one more thread waits before those waiters have woken then the waiters count will be positive. Suppose another thread calls signal: it will see the non-zero waiter count and proceed into the for(;;) loop. If it gets pre-empted before it does the CAS and three more threads wait they will see the not-yet-incremented generation count. When the signalling thread wakes up and releases 1 thread on the semaphore and increments the generation, the 4 new threads can now see the new generation count and steal the wakes intended for the first three threads, which now remain sleeping until the *next* signal.
Anthony -- Anthony Williams | Just Software Solutions Ltd Custom Software Development | http://www.justsoftwaresolutions.co.uk Registered in England, Company Number 5478976. Registered Office: 15 Carrallack Mews, St Just, Cornwall, TR19 7UL
> Even with the Sleep(0) this code can keep stealing it's own signals, > preventing another thread from actually waking. If you have a lot of > threads waiting since the last outstanding notify and only one from > before then the new threads can keep waking up, leaving the thread > that needs to be woken still sleeping.
Anthony, thank you for review and for interest.
Yes, I agree that this algorithm suffers from stealing signals. If we use only one synchronization variable, then I think it's inevitable. If we will use for example event per thread than we can avoid stealing. But then signaling thread will have to make N system calls to wakeup N threads. My reasoning here is that 'window of inconsistency' is relatively small. And Sleep(0) must further diminish the effect. So some temporary bad things are possible, but eventually system must stabilize, and right threads will consume semaphore count. I was not evaluating this algorithm in real application, so I can't say with confidence how things will be in real life... Do you think that it will be the real problem?
> If there are three threads waiting then a signal will zero the waiters > count and add 3 to the semaphore. If one more thread waits before > those waiters have woken then the waiters count will be > positive. Suppose another thread calls signal: it will see the > non-zero waiter count and proceed into the for(;;) loop. If it gets > pre-empted before it does the CAS and three more threads wait they > will see the not-yet-incremented generation count. When the signalling > thread wakes up and releases 1 thread on the semaphore and increments > the generation, the 4 new threads can now see the new generation count > and steal the wakes intended for the first three threads, which now > remain sleeping until the *next* signal.
I do not agree here. Waiter *atomically* increments waiters count and loads current generation. Signaler *atomically* increments generation and loads waiters count. So in the example you describe either new waiters must not see increment of generation (i.e. remain waiting) or signaler must add 3 to semaphore for new waiters (i.e. all 7 threads will consume semaphore count).
> Even with the Sleep(0) this code can keep stealing it's own signals, > preventing another thread from actually waking. If you have a lot of > threads waiting since the last outstanding notify and only one from > before then the new threads can keep waking up, leaving the thread > that needs to be woken still sleeping.
Btw, there are important algorithms which doesn't require generations at all. For example, producer-consumer pattern, it's usually indifferent what consumer will consume the event. In such situations support of generations can (and will) only slow down things. If we use condition variables, then generations are inevitable. But we can patch our custom eventcount so that it will be 'generationless'. generationless_eventcount. What do you think? We just have to eliminate variable which holds generation number and eliminate checks of generation number. So any thread which consume semaphore count is appropriate for us.
> > Even with the Sleep(0) this code can keep stealing it's own signals, > > preventing another thread from actually waking. If you have a lot of > > threads waiting since the last outstanding notify and only one from > > before then the new threads can keep waking up, leaving the thread > > that needs to be woken still sleeping. > Btw, there are important algorithms which doesn't require generations > at all. For example, producer-consumer pattern, it's usually > indifferent what consumer will consume the event.
So be it...
However, and IMVHO, if you cannot use an eventcount to create a condvar, then the EC algorithm is busted; period...
On 13 ÓÅÎÔ, 05:41, "Chris M. Thomasson" <n...@spam.invalid>
> > Btw, there are important algorithms which doesn't require generations > > at all. For example, producer-consumer pattern, it's usually > > indifferent what consumer will consume the event.
> So be it...
> However, and IMVHO, if you cannot use an eventcount to create a condvar, > then the EC algorithm is busted; period...
>> Even with the Sleep(0) this code can keep stealing it's own signals, >> preventing another thread from actually waking. If you have a lot of >> threads waiting since the last outstanding notify and only one from >> before then the new threads can keep waking up, leaving the thread >> that needs to be woken still sleeping.
> Anthony, thank you for review and for interest.
> Yes, I agree that this algorithm suffers from stealing signals. If we > use only one synchronization variable, then I think it's inevitable.
That depends.
> If we will use for example event per thread than we can avoid > stealing. But then signaling thread will have to make N system calls > to wakeup N threads.
That's true. However, we can change the semaphore with each generation if we're waking all threads, in which case we don't need to worry about the generation number. I've got a new condition variable algorithm that does just that.
> My reasoning here is that 'window of inconsistency' is relatively > small. And Sleep(0) must further diminish the effect. So some > temporary bad things are possible, but eventually system must > stabilize, and right threads will consume semaphore count. > I was not evaluating this algorithm in real application, so I can't > say with confidence how things will be in real life... Do you think > that it will be the real problem?
I think it will primarily be a problem if the thread that should be notified is a lower priority than the thread that keeps stealing the notify.
>> If there are three threads waiting then a signal will zero the waiters >> count and add 3 to the semaphore. If one more thread waits before >> those waiters have woken then the waiters count will be >> positive. Suppose another thread calls signal: it will see the >> non-zero waiter count and proceed into the for(;;) loop. If it gets >> pre-empted before it does the CAS and three more threads wait they >> will see the not-yet-incremented generation count. When the signalling >> thread wakes up and releases 1 thread on the semaphore and increments >> the generation, the 4 new threads can now see the new generation count >> and steal the wakes intended for the first three threads, which now >> remain sleeping until the *next* signal.
> I do not agree here. Waiter *atomically* increments waiters count and > loads current generation. Signaler *atomically* increments generation > and loads waiters count. So in the example you describe either new > waiters must not see increment of generation (i.e. remain waiting) or > signaler must add 3 to semaphore for new waiters (i.e. all 7 threads > will consume semaphore count).
Yes, you're right. I misread the code. If you do a lot of signals (4096?), the generation count can wrap round though so threads might see a wrapped generation count if it's waiting long enough.
Anthony -- Anthony Williams | Just Software Solutions Ltd Custom Software Development | http://www.justsoftwaresolutions.co.uk Registered in England, Company Number 5478976. Registered Office: 15 Carrallack Mews, St Just, Cornwall, TR19 7UL
On Sep 14, 1:04 am, Anthony Williams <anthony....@gmail.com> wrote:
> > Yes, I agree that this algorithm suffers from stealing signals. If we > > use only one synchronization variable, then I think it's inevitable.
> That depends.
Everything depends :) Can you elaborate what depends and on what depends? :)
> > If we will use for example event per thread than we can avoid > > stealing. But then signaling thread will have to make N system calls > > to wakeup N threads.
> That's true. However, we can change the semaphore with each generation > if we're waking all threads, in which case we don't need to worry > about the generation number. I've got a new condition variable > algorithm that does just that.
Can you point to implementation? I'm going to implement the same thing. I think it's promising idea. It's easy to implement with mutex, but not so easy with atomic RMW operations.
> > My reasoning here is that 'window of inconsistency' is relatively > > small. And Sleep(0) must further diminish the effect. So some > > temporary bad things are possible, but eventually system must > > stabilize, and right threads will consume semaphore count. > > I was not evaluating this algorithm in real application, so I can't > > say with confidence how things will be in real life... Do you think > > that it will be the real problem?
> I think it will primarily be a problem if the thread that should be > notified is a lower priority than the thread that keeps stealing the > notify.
> > I do not agree here. Waiter *atomically* increments waiters count and > > loads current generation. Signaler *atomically* increments generation > > and loads waiters count. So in the example you describe either new > > waiters must not see increment of generation (i.e. remain waiting) or > > signaler must add 3 to semaphore for new waiters (i.e. all 7 threads > > will consume semaphore count).
> Yes, you're right. I misread the code. If you do a lot of signals > (4096?), the generation count can wrap round though so threads might > see a wrapped generation count if it's waiting long enough.
Yes, it's a problem too. One can use double-word CAS to make more space for generation. Anyway I'm going to implement eventcount with semaphore per generation.
On Sep 15, 9:15 pm, "Dmitriy V'jukov" <dvyu...@gmail.com> wrote:
> > That's true. However, we can change the semaphore with each generation > > if we're waking all threads, in which case we don't need to worry > > about the generation number. I've got a new condition variable > > algorithm that does just that.
> Can you point to implementation? I'm going to implement the same > thing. I think it's promising idea. It's easy to implement with mutex, > but not so easy with atomic RMW operations.
The good thing it that there are no dynamic allocations of synchronization objects nor memory. I'm thinking whether it's worth doing implementing it with atomic RMWs...
"Dmitriy V'jukov" <dvyu...@gmail.com> writes: > On Sep 14, 1:04 am, Anthony Williams <anthony....@gmail.com> wrote:
>> > Yes, I agree that this algorithm suffers from stealing signals. If we >> > use only one synchronization variable, then I think it's inevitable.
>> That depends.
> Everything depends :) Can you elaborate what depends and on what > depends? :)
It depends on whether you mean one synchronization variable for the whole lifetime of the event count, or something else, such as one variable per generation (for example).
>> > If we will use for example event per thread than we can avoid >> > stealing. But then signaling thread will have to make N system calls >> > to wakeup N threads.
>> That's true. However, we can change the semaphore with each generation >> if we're waking all threads, in which case we don't need to worry >> about the generation number. I've got a new condition variable >> algorithm that does just that.
> Can you point to implementation? I'm going to implement the same > thing. I think it's promising idea. It's easy to implement with mutex, > but not so easy with atomic RMW operations.
No dynamic allocation, one event per generation. It uses an internal mutex to protect the waiters list, but a lock-free list would work too.
The notify logic builds on top of that. Currently I use an event for a "gate" similar to Alexander Terekhov's algorithm, and a semaphore for waking threads. You can't wait whilst there's pending notifies, unless the whole generation has been notified (even if they haven't woken yet) or you've done a broadcast. Timeouts are easy too.
Anthony -- Anthony Williams | Just Software Solutions Ltd Custom Software Development | http://www.justsoftwaresolutions.co.uk Registered in England, Company Number 5478976. Registered Office: 15 Carrallack Mews, St Just, Cornwall, TR19 7UL
It's about that mutex acquisition. Assume quad-core machine. No, better 2 processors x 4 cores. No-no-no, better 64-core Niagara. Assume thread signals condition variable, and 63 threads on 63 cores simultaneously start executing. First thing all they have to do is to acquire the same mutex. More than half of that threads can instantly go back to the black hole (by black hole here I mean the worst enemy of lock-free programmer - THE KERNEL). Cr@p! They born just to die in a moment. Am I a paranoid?
> > > That's true. However, we can change the semaphore with each generation > > > if we're waking all threads, in which case we don't need to worry > > > about the generation number. I've got a new condition variable > > > algorithm that does just that.
> > Can you point to implementation? I'm going to implement the same > > thing. I think it's promising idea. It's easy to implement with mutex, > > but not so easy with atomic RMW operations. > Here is crude sketch based on mutex: > http://groups.google.com/group/comp.programming.threads/msg/401538a21... > The good thing it that there are no dynamic allocations of > synchronization objects nor memory. > I'm thinking whether it's worth doing implementing it with atomic > RMWs...
In the context of an eventcount, you would be optimizing a "really" slow-path... Humm, is it worth it? That is the question indeed...
> > bool wakeup(waiter_data& waiter) > > { > > boost::lock_guard<boost::mutex> lk(internal_mutex); > /\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\ > It's about that mutex acquisition. > Assume quad-core machine. No, better 2 processors x 4 cores. No-no-no, > better 64-core Niagara. > Assume thread signals condition variable, and 63 threads on 63 cores > simultaneously start executing. First thing all they have to do is to > acquire the same mutex. More than half of that threads can instantly > go back to the black hole (by black hole here I mean the worst enemy > of lock-free programmer - THE KERNEL). Cr@p! > They born just to die in a moment.
> It's about that mutex acquisition. > Assume quad-core machine. No, better 2 processors x 4 cores. No-no-no, > better 64-core Niagara. > Assume thread signals condition variable, and 63 threads on 63 cores > simultaneously start executing. First thing all they have to do is to > acquire the same mutex. More than half of that threads can instantly > go back to the black hole (by black hole here I mean the worst enemy > of lock-free programmer - THE KERNEL). Cr@p! > They born just to die in a moment. > Am I a paranoid?
No, you're not paranoid. If we're trying to improve event counts or cond vars we ought to look at every aspect.
Anthony -- Anthony Williams | Just Software Solutions Ltd Custom Software Development | http://www.justsoftwaresolutions.co.uk Registered in England, Company Number 5478976. Registered Office: 15 Carrallack Mews, St Just, Cornwall, TR19 7UL
> > > bool wakeup(waiter_data& waiter) > > > { > > > boost::lock_guard<boost::mutex> lk(internal_mutex); > > /\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\ > > It's about that mutex acquisition. > > Assume quad-core machine. No, better 2 processors x 4 cores. No-no-no, > > better 64-core Niagara. > > Assume thread signals condition variable, and 63 threads on 63 cores > > simultaneously start executing. First thing all they have to do is to > > acquire the same mutex. More than half of that threads can instantly > > go back to the black hole (by black hole here I mean the worst enemy > > of lock-free programmer - THE KERNEL). Cr@p! > > They born just to die in a moment.
> > It's about that mutex acquisition. > > Assume quad-core machine. No, better 2 processors x 4 cores. No-no-no, > > better 64-core Niagara. > > Assume thread signals condition variable, and 63 threads on 63 cores > > simultaneously start executing. First thing all they have to do is to > > acquire the same mutex. More than half of that threads can instantly > > go back to the black hole (by black hole here I mean the worst enemy > > of lock-free programmer - THE KERNEL). Cr@p! > > They born just to die in a moment. > > Am I a paranoid?
> No, you're not paranoid. If we're trying to improve event counts or > cond vars we ought to look at every aspect.
Ok. Then I will post my lock-free version of eventcount/condvar which doesn't require memory allocation/deallocation and creation/ destruction of synchronization objects, whole generation can be signaled with 1 syscall. It seems that I will be able to come up with algorithm which requires only single word CAS.
> > > > bool wakeup(waiter_data& waiter) > > > > { > > > > boost::lock_guard<boost::mutex> lk(internal_mutex); > > > /\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\ > > > It's about that mutex acquisition. > > > Assume quad-core machine. No, better 2 processors x 4 cores. No-no-no, > > > better 64-core Niagara. > > > Assume thread signals condition variable, and 63 threads on 63 cores > > > simultaneously start executing. First thing all they have to do is to > > > acquire the same mutex. More than half of that threads can instantly > > > go back to the black hole (by black hole here I mean the worst enemy > > > of lock-free programmer - THE KERNEL). Cr@p! > > > They born just to die in a moment.
> > Look into wait-morphing... > Into what?
Atomically transferring the waitset of the condvar to the mutex. 3 threads are waiting with mutex A... Broadcast occurs; the following is atomic:
thread 1 gains access and threads 2-3 are magically in mutex waitset; no thundering heard.
> Where I can see it?
I think a version of NPTL futex has wait-morphing...
> > > It's about that mutex acquisition. > > > Assume quad-core machine. No, better 2 processors x 4 cores. No-no-no, > > > better 64-core Niagara. > > > Assume thread signals condition variable, and 63 threads on 63 cores > > > simultaneously start executing. First thing all they have to do is to > > > acquire the same mutex. More than half of that threads can instantly > > > go back to the black hole (by black hole here I mean the worst enemy > > > of lock-free programmer - THE KERNEL). Cr@p! > > > They born just to die in a moment. > > > Am I a paranoid?
> > No, you're not paranoid. If we're trying to improve event counts or > > cond vars we ought to look at every aspect.
> Ok. Then I will post my lock-free version of eventcount/condvar which > doesn't require memory allocation/deallocation and creation/ > destruction of synchronization objects, whole generation can be > signaled with 1 syscall. It seems that I will be able to come up with > algorithm which requires only single word CAS.
Lock-free implementation get more and more messy while I am working on details. So I'm gonna to retire the idea of lock-free eventcount... at least for now. Mutex is okey in condvar/eventcount, the only thing I don't like is mutex acquisition after wait on event/semaphore. But it turns out to be not easy to use mutex everywhere and create lock-free path only after wait on event/semaphore. I have another idea. Now I am working on algorithm which is 100% mutex based (not counting single unprotected load in eventcount version, condvar version will be 100% mutex based). This algorithm have promising properties: 1. No memory allocation/deallocation 2. No kernel object creation/destruction 3. Broadcasting with single syscall 4. No mutex acquisition after wait on semaphore 5. Portable because based on semaphore
>> > > It's about that mutex acquisition. >> > > Assume quad-core machine. No, better 2 processors x 4 cores. No-no-no, >> > > better 64-core Niagara. >> > > Assume thread signals condition variable, and 63 threads on 63 cores >> > > simultaneously start executing. First thing all they have to do is to >> > > acquire the same mutex. More than half of that threads can instantly >> > > go back to the black hole (by black hole here I mean the worst enemy >> > > of lock-free programmer - THE KERNEL). Cr@p! >> > > They born just to die in a moment. >> > > Am I a paranoid?
>> > No, you're not paranoid. If we're trying to improve event counts or >> > cond vars we ought to look at every aspect.
>> Ok. Then I will post my lock-free version of eventcount/condvar which >> doesn't require memory allocation/deallocation and creation/ >> destruction of synchronization objects, whole generation can be >> signaled with 1 syscall. It seems that I will be able to come up with >> algorithm which requires only single word CAS.
> Lock-free implementation get more and more messy while I am working on > details. So I'm gonna to retire the idea of lock-free eventcount... at > least for now.
Yes. I had a go, and it got too messy, so I decided to leave it for now.
> Mutex is okey in condvar/eventcount, the only thing I don't like is > mutex acquisition after wait on event/semaphore. But it turns out to > be not easy to use mutex everywhere and create lock-free path only > after wait on event/semaphore.
I can sympathise here too.
> I have another idea. Now I am working on algorithm which is 100% mutex > based (not counting single unprotected load in eventcount version, > condvar version will be 100% mutex based). > This algorithm have promising properties: > 1. No memory allocation/deallocation > 2. No kernel object creation/destruction > 3. Broadcasting with single syscall > 4. No mutex acquisition after wait on semaphore > 5. Portable because based on semaphore
That sounds promising. I hope it pans out.
Anthony -- Anthony Williams | Just Software Solutions Ltd Custom Software Development | http://www.justsoftwaresolutions.co.uk Registered in England, Company Number 5478976. Registered Office: 15 Carrallack Mews, St Just, Cornwall, TR19 7UL
> > > > Look into wait-morphing... > > > Into what?
> > Atomically transferring the waitset of the condvar to the mutex. 3 > > threads > > are waiting with mutex A... Broadcast occurs; the following is atomic:
> > thread 1 gains access and threads 2-3 are magically in mutex waitset; no > > thundering heard.
> > > Where I can see it?
> > I think a version of NPTL futex has wait-morphing... > I see. It's interesting thing, but as I understand it is applicable > only in kernel, right? At least I don't see how I can implement it...
I have experimented with something like the following on Windows in the past: __________________________________________________________________ class condvar { HANDLE m_sema; HANDLE m_event; // manual-reset; init to set HANDLE m_mutex; unsigned m_waiters; // = 0 unsigned m_generation; // = 0
>> > > > Look into wait-morphing... >> > > Into what?
>> > Atomically transferring the waitset of the condvar to the mutex. 3 >> > threads >> > are waiting with mutex A... Broadcast occurs; the following is atomic:
>> > thread 1 gains access and threads 2-3 are magically in mutex waitset; >> > no >> > thundering heard.
>> > > Where I can see it?
>> > I think a version of NPTL futex has wait-morphing...
>> I see. It's interesting thing, but as I understand it is applicable >> only in kernel, right?
Yeah. Although, if you manually constructed your own waitsets for mutex and condvar, then it could probably be implemented.
> At least I don't see how I can implement it...
> I have experimented with something like the following on Windows in the > past: > __________________________________________________________________ > __________________________________________________________________ > Any thoughts?
I don't know for sure if Windows uses wait-morphing for `WaitForMultipleObjects'. It seems like it should. I am atomically acquiring the user mutex, the internal mutex and decrementing the internal semaphore in a single atomic operation. If there is a thundering heard, its going to be hidden by the internal Windows impl. I hope they would optimize for this scenario... Humm... I should probably ask this on a Windows Kernel group...
>> > > > Look into wait-morphing... >> > > Into what?
>> > Atomically transferring the waitset of the condvar to the mutex. 3 >> > threads >> > are waiting with mutex A... Broadcast occurs; the following is atomic:
>> > thread 1 gains access and threads 2-3 are magically in mutex waitset; >> > no >> > thundering heard.
>> > > Where I can see it?
>> > I think a version of NPTL futex has wait-morphing...
>> I see. It's interesting thing, but as I understand it is applicable >> only in kernel, right? At least I don't see how I can implement it...
> I have experimented with something like the following on Windows in the > past: > __________________________________________________________________ > class condvar { > HANDLE m_sema; > HANDLE m_event; // manual-reset; init to set > HANDLE m_mutex; > unsigned m_waiters; // = 0 > unsigned m_generation; // = 0
On Sep 18, 12:07 am, Anthony Williams <anthony....@gmail.com> wrote:
> > I have another idea. Now I am working on algorithm which is 100% mutex > > based (not counting single unprotected load in eventcount version, > > condvar version will be 100% mutex based). > > This algorithm have promising properties: > > 1. No memory allocation/deallocation > > 2. No kernel object creation/destruction > > 3. Broadcasting with single syscall > > 4. No mutex acquisition after wait on semaphore > > 5. Portable because based on semaphore
> > > > > Look into wait-morphing... > > > > Into what?
> > > Atomically transferring the waitset of the condvar to the mutex. 3 > > > threads > > > are waiting with mutex A... Broadcast occurs; the following is atomic:
> > > thread 1 gains access and threads 2-3 are magically in mutex waitset; no > > > thundering heard.
> > > > Where I can see it?
> > > I think a version of NPTL futex has wait-morphing... > > I see. It's interesting thing, but as I understand it is applicable > > only in kernel, right? At least I don't see how I can implement it...
> I have experimented with something like the following on Windows in the > past: > [...] > Any thoughts?
Hmmm... Cool! With WaitForMultipleObjects() it's possible to implement wait-morphing in user space. But it requires mutex to be kernel object too. I think this annihilates any advantages from wait-morphing.