lock free versus lock-based algorithms

tri

unread,

Oct 17, 2004, 3:03:51 PM10/17/04

to

Hi,

I'm learning more about lock-free versus lock-based algorithms and
their performance on shared data structures amongst mulitple threads.

In particular, I'd like to narrow the discussion to a shared work
queue amongst a dispatcher thread producing work and worker threads
consumming work on a multiprocessor machine. The dispatcher-workers
server I wrote is the one dedicated software application running on
the multiprocessor machine. And let's say, I create as many threads as
processors.

I've done some testing and see that the lock-free algorithm
(java.util.concurrent.ConcurrentLinkedQueue) performs better than the
lock-based (Java's synchronized mechanism).

I'm looking for answers as to why the lock-free algorithm performs
better. Is it because context-switching is costly and the lock-based
algorithm context switches a lot more?

Is it because the lock-free algorithm does a bit of work before doing
the CAS (compare and swap) and this optimistic work helps improve
performance?

The lock-free algorithm does spin, but not on a conventional lock. Is
it that the CAS locking operation performs a lot faster than the
conventional lock operation?

What are some other questions and answers?

Any insights is much appreciated!

Thank You.

Tri

SenderX

unread,

Oct 17, 2004, 3:59:04 PM10/17/04

to

> I've done some testing and see that the lock-free algorithm
> (java.util.concurrent.ConcurrentLinkedQueue) performs better than the
> lock-based (Java's synchronized mechanism).

Yup.

> I'm looking for answers as to why the lock-free algorithm performs
> better. Is it because context-switching is costly and the lock-based
> algorithm context switches a lot more?

That's one reason. Lock-free algorithms also avoid calling into the kernel.
This is another reason why you see a performance increase. A poorly
designed, heavily contended lock-based collection would spend a lot of time
in the kernel managing the waitset of the mutex protecting the shared state.
Also, heavly contended lock-based stuff can suffer from priority inversion
and lock convoy.

> Is
> it that the CAS locking operation performs a lot faster than the
> conventional lock operation?

CAS wrt lock-free algorithm's works on the data structures "entire shared
state". Think of a CAS modifying an anchor to a lock-free stack ( ibm
freelist ). Lock-based would use cas to acquire a mutex "before" it can even
touch its shared state. IMHO, this gives a clear advantage to the lock-free
algorithm...

> The lock-free algorithm does spin, but not on a conventional lock.

Yes, they spin...

But if one thread spins, it means another thread has completed its cas and
made forward progress. "Guaranteed forward progress on cas failure" is
another reason you see a performance improvement.

> What are some other questions and answers?
>
> Any insights is much appreciated!
>

> Tri

Think of embedding a waitset in a lock-free collections anchor. A marriage
between lock-free and lock-based would be an ideal way to introduce a
lock-free collection to a numa environment for instance... A clever mixture
of lock-free and lock-based outperforms a 100% lock-free approach, think of
rcu.

Also, think of better ways of using locks:

http://msdn.microsoft.com/msdnmag/issues/01/08/concur/default.aspx

Joe Seigh

unread,

Oct 17, 2004, 4:37:51 PM10/17/04

to

The CAS loop in the lock-free algorithm isn't looping that much. The primary
reason you're seeing better performance is that the lock-free algorithm doesn't
suspend and resume thread execution the way conventional locks do under contention.
That is a huge amount of overhead. Actual overhead difference between lock-free and
uncontended locks isn't isn't that significant.

Joe Seigh

SenderX

unread,

Oct 17, 2004, 4:37:38 PM10/17/04

to

> I've done some testing and see that the lock-free algorithm
> (java.util.concurrent.ConcurrentLinkedQueue) performs better than the
> lock-based (Java's synchronized mechanism).

Yup.

> I'm looking for answers as to why the lock-free algorithm performs
> better. Is it because context-switching is costly and the lock-based
> algorithm context switches a lot more?

That's one reason. Lock-free algorithms also avoid calling into the kernel.

This is another reason why you see a performance increase. A poorly
designed, heavily contended lock-based collection would spend a lot of time
in the kernel managing the waitset of the mutex protecting the shared state.
Also, heavly contended lock-based stuff can suffer from priority inversion
and lock convoy.

> Is
> it that the CAS locking operation performs a lot faster than the
> conventional lock operation?

CAS wrt lock-free algorithm's works on the data structures "entire shared

state". Think of a CAS modifying an anchor to a lock-free stack ( ibm
freelist ). Lock-based would use cas to acquire a mutex "before" it can even
touch its shared state. IMHO, this gives a clear advantage to the lock-free
algorithm...

> The lock-free algorithm does spin, but not on a conventional lock.

Yes, they spin...

But if one thread spins, it means another thread has completed its cas and
made forward progress. "Guaranteed forward progress on cas failure" is
another reason you see a performance improvement.

> What are some other questions and answers?
>
> Any insights is much appreciated!
>

Ronald Landheer-Cieslak

unread,

Oct 18, 2004, 10:40:15 AM10/18/04

to

Hi tri,

See SenderX and Joe Seigh's responses for a more practical approach :)

tri wrote:
> What are some other questions and answers?

Generally, I'd say lock-free applications are faster not just because
they don't use locks, but because they are inherently better designed:
using locks, you have a whole lot more freedom to do what you want with
the object you're protecting with your lock. Not using locks means
you're restricting the liberty you have to manipulate your object to the
bare minimum of what you can do without breaking any invariants, thus
forcing yourself to re-think your invariants, your design and your
implementation.

Lock-free algorithms come with the added advantage that they are
non-blocking: one thread can not impeed the progress of another thread
by locking a shared resource, nor can one thread change the course of
events in another thread by locking a shared resource (which would
happen if the thread tries to lock, fails, and does something else in
stead). Impeding the progress of another thread or changing the course
of events in it inherently slows down the progress of the entire
process. Non-blocking, lock-free algorithms don't let you do that, so
they are faster (most of the time, anyway).

However, there are a number of problems with non-blocking, lock-free
algorithms that are not wait-free: whereas no single thread will be
blocked indefinitely, no single thread is guaranteed to get its job done
either: two threads trying to change the same value can result in one
changing the value and the other failing to do so repeatedly, ad
infinitum. A locking implementation would allow the developer to
determine the order in which the threads get access to the resource,
this guaranteeing that every thread will eventually get its job done.
Hence, non-blocking, lock-free algorithms are not suitable for *every*
occasion - though in practice they are suitable for the vast majority of
cases where a lock-free algorithm is available..

rlc

Joe Seigh

unread,

Oct 18, 2004, 12:06:53 PM10/18/04

to

Ronald Landheer-Cieslak wrote:
>
> However, there are a number of problems with non-blocking, lock-free
> algorithms that are not wait-free: whereas no single thread will be
> blocked indefinitely, no single thread is guaranteed to get its job done
> either: two threads trying to change the same value can result in one
> changing the value and the other failing to do so repeatedly, ad
> infinitum. A locking implementation would allow the developer to
> determine the order in which the threads get access to the resource,
> this guaranteeing that every thread will eventually get its job done.
> Hence, non-blocking, lock-free algorithms are not suitable for *every*
> occasion - though in practice they are suitable for the vast majority of
> cases where a lock-free algorithm is available..
>

If the contention is high enough to give the interlocked instructions used
in lock-free algorithms problems, it's likely to cause problems for the interlocked
instructions used in lock algorithms as well.

The default scheduling policy for most systems seems to be SCHED_OTHER which doesn't
really guarantee forward progress. You could have situations where threads just don't
get dispatched or get dispatched very sporadically. I suspect the main reason for
making SCHED_OTHER the default is that the performance difference between the adaptive
locks which allow queue jumping and which are used in in SCHED_OTHER and a FIFO lock
used for SCHED_FIFO is really dramatic. IF the OP though there was a significant
difference between the lock-free and locked versions in a Java program, using a
FIFO lock would have had an even larger difference.

Joe Seigh

Ronald Landheer-Cieslak

unread,

Oct 18, 2004, 3:46:19 PM10/18/04

to

Joe Seigh wrote:
> Ronald Landheer-Cieslak wrote:
>> However, there are a number of problems with non-blocking,
>> lock-free algorithms that are not wait-free: whereas no single
>> thread will be blocked indefinitely, no single thread is guaranteed
>> to get its job done either: two threads trying to change the same
>> value can result in one changing the value and the other failing to
>> do so repeatedly, ad infinitum. A locking implementation would
>> allow the developer to determine the order in which the threads get
>> access to the resource, this guaranteeing that every thread will
>> eventually get its job done. Hence, non-blocking, lock-free
>> algorithms are not suitable for *every* occasion - though in
>> practice they are suitable for the vast majority of cases where a
>> lock-free algorithm is available..
> If the contention is high enough to give the interlocked instructions
> used in lock-free algorithms problems, it's likely to cause problems
> for the interlocked instructions used in lock algorithms as well.

Actually, I don't think so: if contention is high enough to cause
problems for the interlocked instructions in a non-blocking algo, that
contention could practically fall away by putting in locks. The
contention would be transferred to racing for the lock, but sometimes,
it's easier to write a lock that handles contention well than to write a
non-blocking algorithm where the single-action-state-change handles
contention well.
All depends on the situation, of course, but that doesn't change that if
you want to make sure every thread gets a chance at a shared resource, a
non-blocking algorithm won't be able to give you that guarantee whereas
one with a FIFO lock will.

> The default scheduling policy for most systems seems to be
> SCHED_OTHER which doesn't really guarantee forward progress. You
> could have situations where threads just don't get dispatched or get
> dispatched very sporadically. I suspect the main reason for making
> SCHED_OTHER the default is that the performance difference between
> the adaptive locks which allow queue jumping and which are used in in
> SCHED_OTHER and a FIFO lock used for SCHED_FIFO is really dramatic.
> IF the OP though there was a significant difference between the
> lock-free and locked versions in a Java program, using a FIFO lock
> would have had an even larger difference.

Of course, but my response wasn't really intended to be practical (which
I said at the very start). SenderX and you gave very nice, practical
responses to the question - I just wanted to add a somewhat wider scope..

rlc

Markus Elfring

unread,

Oct 18, 2004, 5:33:00 PM10/18/04

to

> What are some other questions and answers?

Can the documents from a search help in this discussion?
http://citeseer.ist.psu.edu/cis?q=lock-free+and+lock-based+and+algorithm

Example:
"Characterizing the Performance of Algorithms for Lock-free Objects"
by Theodore Johnson
http://citeseer.ist.psu.edu/30516.html

SenderX

unread,

Oct 20, 2004, 6:55:51 PM10/20/04

to

> Think of embedding a waitset in a lock-free collections anchor. A marriage
> between lock-free and lock-based would be an ideal way to introduce a
> lock-free collection to a numa environment for instance... A clever
> mixture
> of lock-free and lock-based outperforms a 100% lock-free approach, think
> of
> rcu.

For instance, you could do this with a lock-free stack's anchor:

< pseudo-code >

typedef struct node_
{
struct node_ *next;
const void *state;

} node_t;

typedef struct stack_anchor_
{
node_t *front;
unsigned __int32 state;

} volatile stack_anchor_t;

typedef struct stack_
{
stack_anchor_t anchor;
HANDLE waitset;

} stack_t;

/* returns non-zero and updates comprand on failure */
extern int dwcas( volatile void *dest, void *cmp, const void *xchg );

void push( stack_t *_this, node_t *node )
{
stack_anchor_t xchg, cmp;

xchg.front = node;
cmp = _this->anchor;

do
{
node->next = cmp.front;

if ( cmp.state & 0x0000FFFF )
{
xchg.state = cmp.state - 1; /* waiters */
}

else { xchg.state = cmp.state; }

} while ( dwcas( &_this->anchor, &cmp, &xchg ) );

if ( cmp.state & 0x0000FFFF )
{
/* wake one */
if ( ! ReleaseSemaphore
( _this->waitset,
1,
0 ) )
{
assert( 0 ); abort();
}
}
}

node_t* pop( stack_t *_this )
{
stack_anchor_t xchg, cmp = _this->anchor;

for ( ;; )
{
if ( cmp.front ) /* condition predicate */
{
xchg.front = cmp.front->next; /* hazard */
xchg.state = cmp.state + 0x10000; /* aba */
}

else /* stack is empty */
{
xchg.front = cmp.front;
xchg.state = cmp.state + 1; /* waiters */
}

if ( ! dwcas( &_this->anchor, &cmp, &xchg ) )
{
if ( cmp.front ) { break; }

if ( WaitForSingleObject
( _this->waitset,
INFINITE ) != WAIT_OBJECT_0 )
{
assert( 0 ); abort();
}

cmp = _this->anchor;
}
}

return cmp.front;
}

This is an example of how a direct coupling of lock-free and lock-based can
improve performance by removing the need of external signaling state... The
logic for the lock-based signaler is married to the lock-free stack's
anchor. The slow paths are handled by a kernel object.This, in effect,
creates a sort of lock-free condition variable driven stack. You could
further extend this to create adaptable lock-free stacks, that will block
when they have failed their cas too many times. Also, you could allow it to
have timeouts and be deferred cancel safe.

Any thoughts?

Joe Seigh

unread,

Oct 20, 2004, 9:53:21 PM10/20/04

to

SenderX wrote:
>
> > Think of embedding a waitset in a lock-free collections anchor. A marriage
> > between lock-free and lock-based would be an ideal way to introduce a
> > lock-free collection to a numa environment for instance... A clever
> > mixture
> > of lock-free and lock-based outperforms a 100% lock-free approach, think
> > of
> > rcu.
>
> For instance, you could do this with a lock-free stack's anchor:
>

[...]

>
> This is an example of how a direct coupling of lock-free and lock-based can
> improve performance by removing the need of external signaling state... The
> logic for the lock-based signaler is married to the lock-free stack's
> anchor. The slow paths are handled by a kernel object.This, in effect,
> creates a sort of lock-free condition variable driven stack. You could
> further extend this to create adaptable lock-free stacks, that will block
> when they have failed their cas too many times. Also, you could allow it to
> have timeouts and be deferred cancel safe.
>
> Any thoughts?

Well, there's patent 6,636,883 "Mechanism for passing information between queuing
and de-queuing processes" by Ed Zebrowski who I knew when we worked in VM. It
basically dual purposes the queue anchor by queueing waiting threads when queue
is empty of data. I had considered something similar but I decided I'd rather
have the signaling mechanism implementation separate from the collection implementation.
There are additional issues that I want to check out and merging stuff together makes
it hard to isolate them. I have to wait until I get a 2.6 linux kernel to test it
though. Writing testcases to show whether something works better seems to be the
most difficult part. And then convincing everyone else it works better is even more
difficult. I suspect I'll have to write a database again at somepoint.

Joe Seigh

SenderX

unread,

Oct 20, 2004, 10:32:21 PM10/20/04

to

> I had considered something similar but I decided I'd rather
> have the signaling mechanism implementation separate from the collection
> implementation.

Your zero overhead for signalers waitset?

> Writing testcases to show whether something works better seems to be the
> most difficult part.

I think you mentioned that your waitset could be used for lock-free
collections. How would a existing collection make use of your API?

Would it be like wrapping it with a semaphore-like thing?

/* producers */
my_collection->produce();
joes_signal->zero_overhead_post();

/* consumers */
joes_signal->wait();
my_collection->consume();

?

Zero overhead for signalers would be great for algos that can't embed a
waitset in their state...

> And then convincing everyone else it works better is even more
> difficult. I suspect I'll have to write a database again at somepoint.

:)

Joe Seigh

unread,

Oct 21, 2004, 8:08:16 AM10/21/04

to

SenderX wrote:
>
> > I had considered something similar but I decided I'd rather
> > have the signaling mechanism implementation separate from the collection
> > implementation.
>
> Your zero overhead for signalers waitset?

Among other things, yes.

> I think you mentioned that your waitset could be used for lock-free
> collections. How would a existing collection make use of your API?
>
> Would it be like wrapping it with a semaphore-like thing?
>
> /* producers */
> my_collection->produce();
> joes_signal->zero_overhead_post();
>
> /* consumers */
> joes_signal->wait();
> my_collection->consume();
>
> ?

// producers
my_collection->produce();
SetEventCount(ev)

// consumers
if ((p = my_collection->consume()) == NULL) {
w = ResetEventCount(ev);
if ((p = my_collection->consume()) == NULL)
WaitForEventCount(ev, w, INFINITE);
}

Eventcounts are broadcast only so that only works for situations
where that doesn't matter. Otherwise use fast semaphore w/
producer/consumer. Probably same overhead as above if you
don't know how to do evencounts with zero overhead. Also handles
situations where you don't want broadcast, e.g. work queue w/
large thread pool.

The thing to remember is what path you are optimizing. Optimizing
the slow path doesn't buy you all that much though that semaphore
thing is a neat trick.

>
> Zero overhead for signalers would be great for algos that can't embed a
> waitset in their state...
>
> > And then convincing everyone else it works better is even more
> > difficult. I suspect I'll have to write a database again at somepoint.
>
> :)

Selling people on low level api's is difficult. They tend to go full blown
applications which is why there's so much bloatware out there. I wrote a
low level api for constructing database queries. Went nowhere. I then
wrote a primative database using that api and it became fairly popular at the
time. Internal to IBM, you never heard of it.

Joe Seigh

SenderX

unread,

Oct 21, 2004, 7:06:18 PM10/21/04

to

> // consumers
> if ((p = my_collection->consume()) == NULL) {
> w = ResetEventCount(ev);

I assume that this resets the events state, and returns the old, or some
other, value... You in effect get a sort of local reset?

> if ((p = my_collection->consume()) == NULL)
> WaitForEventCount(ev, w, INFINITE);
> }

This seems futex like. w seems to by used to sync the previous local reset
with something in the event count...

> Probably same overhead as above if you
> don't know how to do evencounts with zero overhead.
> Otherwise use fast semaphore w/
> producer/consumer.

A fast semaphore solution is very nice, but the lock-free stack code I
posted outperforms a fast-sema wrapper. Extra atomic ops. Although, your
loop-less fast semaphore is nice because it guarantees you one atomic op for
a post, and one atomic op for a wait when the semaphore count is > 0.

Zero overhead... Humm.

Something like this?

void SetEventCount( ec_t *_this )
{
waiter_t *w = _this->waiters; /* atomic load */

if ( w ) /* check for waiters */
{
waiter_t *n;

/* attempt to get every waiter */
w = InterlockedExchangePointer( &_this, 0 );

while ( w ) /* broadcast */
{
n = w->next;
waiter_signal( w );
waiter_release( w ); /* dec's reference count */
w = n;
}
}
}

Is that even close?

Joe Seigh

unread,

Oct 21, 2004, 9:11:30 PM10/21/04

to

SenderX wrote:
>
> > // consumers
> > if ((p = my_collection->consume()) == NULL) {
> > w = ResetEventCount(ev);

Basically just gets the current eventcount. Reset was just borrowed from
other synchronization api's. It could be called GetEventCount().

>
> I assume that this resets the events state, and returns the old, or some
> other, value... You in effect get a sort of local reset?
>
> > if ((p = my_collection->consume()) == NULL)
> > WaitForEventCount(ev, w, INFINITE);
> > }
>
> This seems futex like. w seems to by used to sync the previous local reset
> with something in the event count...

Futexes can be used to do a straight forward eventcount.

>
> > Probably same overhead as above if you
> > don't know how to do evencounts with zero overhead.
> > Otherwise use fast semaphore w/
> > producer/consumer.
>
> A fast semaphore solution is very nice, but the lock-free stack code I
> posted outperforms a fast-sema wrapper. Extra atomic ops. Although, your
> loop-less fast semaphore is nice because it guarantees you one atomic op for
> a post, and one atomic op for a wait when the semaphore count is > 0.

Yes, you have less atomic ops with your algorithm. I was going for something
that works with other lock-free collections as well. And there are some things
I want to check out that would be limited by restricting it to one data structure.

>
> Zero overhead... Humm.
>
> Something like this?
>

...
>
> Is that even close?

No. But it probably doesn't make that much difference given the double checking
on the consumer side eliminates overhead on consumer side. So, I may just
drop the solution and opt for a more conventional one.

Joe Seigh

unread,

Oct 22, 2004, 11:31:04 AM10/22/04

to

SenderX wrote:
>
> > Think of embedding a waitset in a lock-free collections anchor. A marriage
> > between lock-free and lock-based would be an ideal way to introduce a
> > lock-free collection to a numa environment for instance... A clever
> > mixture
> > of lock-free and lock-based outperforms a 100% lock-free approach, think
> > of
> > rcu.
>
> For instance, you could do this with a lock-free stack's anchor:
>
> < pseudo-code >
>

[...]

You don't have to sacrifice part of your version count field to keep the
number of waiters. Keep the wait count in the ptr field part when there's
no data. When you queue data, swap out the wait count and add it to a running
wait count somewhere else. There's a slight amount of extra overhead but it's
on your slow path. You need 1 bit somewhere to distinquish between data and
wait count but you'll still get at least 31 bits for the version count which
is good.

Joe Seigh

unread,

Oct 22, 2004, 3:29:00 PM10/22/04

to

Joe Seigh wrote:
>
> >
> > For instance, you could do this with a lock-free stack's anchor:
> >
> > < pseudo-code >
> >
> [...]
>
> You don't have to sacrifice part of your version count field to keep the
> number of waiters. Keep the wait count in the ptr field part when there's
> no data. When you queue data, swap out the wait count and add it to a running
> wait count somewhere else. There's a slight amount of extra overhead but it's
> on your slow path. You need 1 bit somewhere to distinquish between data and
> wait count but you'll still get at least 31 bits for the version count which
> is good.
>

Never mind for now. There's a race condition. There may be a way around it but
I'd have to work on it some more.

Joe Seigh

Markus Elfring

unread,

Oct 22, 2004, 8:36:38 PM10/22/04

to

> I'm learning more about lock-free versus lock-based algorithms and
> their performance on shared data structures amongst mulitple threads.

> [...]

> What are some other questions and answers?

Will this thesis for the degree of Doctor of Philosophy by Håkan
Sundell become a recommended reading for the topic of this discussion?
"Efficient and Practical Non-Blocking Data Structures"
http://www.cs.chalmers.se/~phs/phd.pdf (1414 KB)

Regards,
Markus

--

Joe Seigh

unread,

Oct 23, 2004, 11:09:30 AM10/23/04

to

SenderX wrote:
>
[...]

> This is an example of how a direct coupling of lock-free and lock-based can
> improve performance by removing the need of external signaling state... The
> logic for the lock-based signaler is married to the lock-free stack's
> anchor. The slow paths are handled by a kernel object.This, in effect,
> creates a sort of lock-free condition variable driven stack. You could
> further extend this to create adaptable lock-free stacks, that will block
> when they have failed their cas too many times. Also, you could allow it to
> have timeouts and be deferred cancel safe.
>
> Any thoughts?

You don't have to guarantee forward progress so I wouldn't worry about whether
the some threads failing to dequeue data too many times. Most of the lock based
solutions will break long before a lock-free solution will.

Joe Seigh

unread,

Oct 24, 2004, 7:47:07 PM10/24/04

to

I took a real quick look and it appears to be generic RCU pattern with
timing assumptions from using realtime systems to know when it's safe
to free storage. I'm not sure even hard realtime systems can be used
to guarantee safety since it depends on assuming user code wil run in
a certain amount of time.

Joe Seigh

--

SenderX

unread,

Oct 24, 2004, 9:43:45 PM10/24/04

to

> You don't have to sacrifice part of your version count field to keep the
> number of waiters. Keep the wait count in the ptr field part when
> there's
> no data. When you queue data, swap out the wait count and add it to a
> running
> wait count somewhere else. There's a slight amount of extra overhead but
> it's
> on your slow path. You need 1 bit somewhere to distinquish between data
> and
> wait count but you'll still get at least 31 bits for the version count
> which
> is good.

Well, there shoud be a way around the race condition... I guess you could
just append the running count to the end of the anchor, and have it share
the full aba count.

Something like this:

typedef struct node_
{
struct node_ *next;
const void *state;

} node_t;

typedef struct stack_anchor_
{
node_t *front;
unsigned __int32 state;

unsigned __int32 waiters;

} volatile stack_anchor_t;

Now you could use dwcas to keep the front and the waiter atomic due to the
ll/sc semantics of the aba count... This might work:

< pseudo-code >

void push( stack_t *_this, node_t *node )
{

unsigned __int32 waiters;
stack_anchor_t xchg, cmp;

cmp = _this->anchor;

for ( ;; )
{
waiters = this->anchor.waiters;

xchg.state = cmp.state + 1;

if ( ! waiters )
{
node->next = cmp.front;

xchg.front = node;

/* push the node /w aba inc */
if ( ! dwcas ( &_this->anchor, &cmp, &xchg ) )
{
break;
}
}

else
{
cmp.front = (node_t*)cmp.state;
cmp.state = waiters;

xchg.front = (node_t*)xchg.state;

xchg.state = cmp.state - 1;

/* dec waiters /w aba inc */
if ( ! dwcas ( &_this->anchor.state, &cmp, &xchg ) )
{
break;
}

cmp.state = (unsigned __int32)cmp.front;
cmp.front = _this->anchor.front;
}
}

if ( waiters ) /* slow-path */
{
cmp.state = (unsigned __int32)cmp.front;
cmp.front = _this->anchor.front;

xchg.front = node;

do
{
node->next = cmp.front;

xchg.state = cmp.state + 1;
}

/* push the node /w aba inc */

while ( dwcas( &_this->anchor, &cmp, &xchg ) );

/* wake one */

if ( ! ReleaseSemaphore
( _this->waitset,
1,
0 ) )
{
assert( 0 ); abort();
}
}
}

node_t* pop( stack_t *_this )
{

unsigned __int32 waiters;
stack_anchor_t xchg, cmp;

cmp = _this->anchor;

for ( ;; )
{
waiters = this->anchor.waiters;

xchg.state = cmp.state + 1;

if ( cmp.front )
{
node->next = cmp.front;

xchg.front = cmp.front->next;

/* pop the node /w aba inc */
if ( ! dwcas ( &_this->anchor, &cmp, &xchg ) )
{
break;
}
}

else
{
cmp.front = (node_t*)cmp.state;
cmp.state = waiters;

xchg.front = (node_t*)xchg.state;

xchg.state = cmp.state + 1;

/* inc waiters /w aba inc */
if ( ! dwcas ( &_this->anchor.state, &cmp, &xchg ) )
{

if ( WaitForSingleObject
( _this->waitset,
INFINITE ) != WAIT_OBJECT_0 )
{
assert( 0 ); abort();
}

cmp = _this->anchor;
}

else
{
cmp.state = (unsigned __int32)cmp.front;
cmp.front = _this->anchor.front;
}
}
}

return cmp.front;
}

Since the stack_anchor_t::waiters and stack_anchor_t::front fields share
the same aba count, they should stay atomic.

What do you think of this solution?

;)

SenderX

unread,

Oct 24, 2004, 9:55:44 PM10/24/04

to

DOH!.

Ummm, forget that first code I posted!

Look at this one, is a bit cleaner... lol.

;)

void push( stack_t *_this, node_t *node )
{

stack_anchor_t xchg, cmp;

cmp = _this->anchor;

for ( ;; )
{

xchg.state = cmp.state + 1;

if ( ! cmp.waiters )
{
node->next = cmp.front;

xchg.front = node;

/* push the node /w aba inc */
if ( ! dwcas ( &_this->anchor, &cmp, &xchg ) )
{
break;
}
}

else
{
xchg.waiters = cmp.waiters - 1;

/* dec waiters /w aba inc */

if ( ! dwcas ( &_this->anchor.state, &cmp.state, &xchg.state ) )
{
break;
}

cmp.front = _this->anchor.front;
}
}

if ( waiters ) /* slow-path */
{

cmp.front = _this->anchor.front;

xchg.front = node;

do
{
node->next = cmp.front;
xchg.state = cmp.state + 1;
}

/* push the node /w aba inc */
while ( dwcas( &_this->anchor, &cmp, &xchg ) );

/* wake one */
if ( ! ReleaseSemaphore
( _this->waitset,
1,
0 ) )
{
assert( 0 ); abort();
}
}
}

node_t* pop( stack_t *_this )
{

stack_anchor_t xchg, cmp;

cmp = _this->anchor;

for ( ;; )
{

xchg.state = cmp.state + 1;

if ( cmp.front )
{
node->next = cmp.front;

xchg.front = cmp.front->next;

/* pop the node /w aba inc */
if ( ! dwcas ( &_this->anchor, &cmp, &xchg ) )
{
break;
}
}

else
{
xchg.waiters = cmp.waiters + 1;

/* inc waiters /w aba inc */

if ( ! dwcas ( &_this->anchor.state, &cmp.state, &xchg.state ) )

{
if ( WaitForSingleObject
( _this->waitset,
INFINITE ) != WAIT_OBJECT_0 )
{
assert( 0 ); abort();
}

cmp = _this->anchor;
}

else
{

SenderX

unread,

Oct 25, 2004, 5:10:42 PM10/25/04

to

Also, you need to re-update the waiter count on fast-path cas failure for
both push and pop...

push

> if ( ! cmp.waiters )
> {
> node->next = cmp.front;
>
> xchg.front = node;
>
> /* push the node /w aba inc */
> if ( ! dwcas ( &_this->anchor, &cmp, &xchg ) )
> {
> break;
> }

/* right here! */
cmp.waiters = _this->anchor.waiters;

> }

&&

pop

for ( ;; )
{
xchg.state = cmp.state + 1;

if ( cmp.front )
{
node->next = cmp.front;

xchg.front = cmp.front->next;

/* pop the node /w aba inc */
if ( ! dwcas ( &_this->anchor, &cmp, &xchg ) )
{
break;
}

/* and right here! */
cmp.waiters = _this->anchor.waiters;
}

That should work. Sorry for the little bugs, but I just came up with this
algo right off the top of my head...

;)

Do you think this technique might me useful? I am wondering about alignment
issues with dwcas...

Hummm...

Joe Seigh

unread,

Oct 26, 2004, 8:18:29 AM10/26/04

to

SenderX wrote:
>
>
> Do you think this technique might me useful? I am wondering about alignment
> issues with dwcas...
>
> Hummm...

Interlocked stuff usually has to be word, dword, etc... aligned to keep it within
a cache line if the implementation uses cache line locking.

Joe Seigh

H?kan Sundell

unread,

Oct 27, 2004, 1:47:13 PM10/27/04

to

Joe Seigh <jsei...@xemaps.com> wrote in message news:<417A6077...@xemaps.com>...

Thanks for your interest in my thesis. The thesis contains algorithms
of data structures aimed for real-time systems as well as algorithms
of data structures for arbitrary multi-thread systems. The thesis is
thus roughly divided into two parts, the first part (Chapter 2-3)
covers real-time, and the second part (Chapter 4-7) covers general
purpose non-blocking algorithms.

The presented data structures that are designed with real-time systems
in mind are the following:

* Wait-Free Shared Register. Multi-reader/multi-writer single register
that could be implemented on distributed systems using message passing
(i.e. might be suitable for systems that lack shared memory)
* Wait-Free Snapshot. Single-reader/multi-writer that only relies on
atomic read and writes of shared memory and also on timing
characteristics of tasks.

The presented data structures that are designed for arbitrary purposes
and any kind of multi-thread systems (i.e. non real-time), are the
following:

* Lock-Free Skip List data structure. Supports Inserts/Updates/Deletes
and relies only on single-word compare-and-swap. We have showed how
this structure can be used to implement the Lock-Free Priority Queues
as well as the Lock-Free Dictionary abstract data types. These
chapters are based on our previous results published at IPDPS in apr
2003 and SAC in mar 2004 as well as technical reports in jan 2003.

* Lock-Free Doubly Linked List data structure. Supports
Inserts/Deletes at arbitrary positions as well as reliable traversals
(without the need for re-starts in case of concurrent deletions) using
single-word compare-and-swap. We have showed how this structure can be
used to implement the Lock-Free Deque abstract data type. This chapter
is based on our previous results submitted to PODC in feb 2004 and was
made publically available as a technical report in mar 2004.

All of the dynamic lock-free data structures are described in detail
with the lock-free reference counting schemes in mind for the needed
garbage collection. However, if relaxing on some of the properties of
the algorithms (i.e. don't use any back-link pointers in the nodes,
and thus allow re-starts of traversing operations), other more
efficient garbage collection schemes as for example the hazard pointer
method by Michael could be used with simple changes of the algorithms.

Best regards,
Håkan Sundell

--

Joe Seigh

unread,

Oct 29, 2004, 10:31:41 AM10/29/04

to

(reposted with comp.parallel removed as there appears to be a problem with
that newsgroup. My orginal reply will probably show up tomorrow if past
postings are any indication)

H?kan Sundell wrote:
>
> All of the dynamic lock-free data structures are described in detail
> with the lock-free reference counting schemes in mind for the needed
> garbage collection. However, if relaxing on some of the properties of
> the algorithms (i.e. don't use any back-link pointers in the nodes,
> and thus allow re-starts of traversing operations), other more
> efficient garbage collection schemes as for example the hazard pointer
> method by Michael could be used with simple changes of the algorithms.
>

Well there is a lock-free reference counting scheme that works, i.e. with
existing architectures and doesn't require DCAS. Valois's phd thesis doesn't
seem to be publically available so I can't comment on it though I'm sure Detlefs
would have mentioned more about it in his paper if there was more to it.

How does your stuff compare to RCU which is in the Linux kernel?

Joe Seigh

unread,

Oct 29, 2004, 1:18:33 PM10/29/04

to

H?kan Sundell wrote:
> All of the dynamic lock-free data structures are described in detail
> with the lock-free reference counting schemes in mind for the needed
> garbage collection. However, if relaxing on some of the properties of
> the algorithms (i.e. don't use any back-link pointers in the nodes,
> and thus allow re-starts of traversing operations), other more
> efficient garbage collection schemes as for example the hazard pointer
> method by Michael could be used with simple changes of the algorithms.

Well there is a lock-free reference counting scheme that works, i.e. with

existing architectures and doesn't require DCAS. Valois's phd thesis doesn't
seem to be publically available so I can't comment on it though I'm sure Detlefs
would have mentioned more about it in his paper if there was more to it.

How does your stuff compare to RCU which is in the Linux kernel?

Joe Seigh

--