
Interlocked Singly Linked Lists (for Ben or anyone else).


Pops

Nov 14, 2007, 8:35:02 AM
Ben (C++ MVP) recommended using the Interlocked Singly Linked Lists API:

http://msdn2.microsoft.com/en-us/library/ms686962.aspx

as a more efficient synchronization method for a recent thread
discussion in regards to a FIFO queue design.

If Ben is reading or if anyone else is familiar with this API, I have a
few questions:

First, based on what I see, this is for LIFO designs (i.e., stacks)?
Not FIFO. Am I missing something? I only see Push/Pop functions, no
head/tail functions.

Second, can you tell how it works with IPC? Can it work for IPC?

Thanks

--
HLS

Ben Voigt [C++ MVP]

Nov 14, 2007, 9:26:51 AM

"Pops" <du...@nospamnospamnospam.com> wrote in message
news:%23tndYNs...@TK2MSFTNGP04.phx.gbl...

> Ben (C++ MVP) recommended using the Interlocked Singly Linked Lists API:
>
> http://msdn2.microsoft.com/en-us/library/ms686962.aspx
>
> as a more efficient synchronization method for a recent thread discussion
> in regards to a FIFO queue design.
>
> If Ben is reading or if anyone else is familiar with this API, I have a
> few questions:
>
> First, based on what I see, this is for LIFO designs (i.e., stacks)? Not
> FIFO. Am I missing something? I only see Push/Pop functions, no
> head/tail functions.

We were discussing a multiple writer, single consumer queue. For this the
writers each use the InterlockedPushEntrySList function, while the consumer
uses the InterlockedFlushSList function to retrieve the entire queue as an
atomic operation. The consumer is then responsible for iterating through
the list of items retrieved in FIFO order.
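A minimal sketch of that pattern (the work_item type and the process
callback are placeholders of mine; SLIST_ENTRY objects must be aligned to
MEMORY_ALLOCATION_ALIGNMENT for the SList API):

#include <windows.h>

typedef struct work_item_ {
    SLIST_ENTRY entry;   /* must be the first member */
    int payload;         /* placeholder for real data */
} work_item;

/* any number of writer threads */
void produce(PSLIST_HEADER head, work_item* item)
{
    InterlockedPushEntrySList(head, &item->entry);
}

/* the single consumer: claim the whole chain atomically, then
   reverse it, because the SList hands it back in LIFO order */
void consume(PSLIST_HEADER head, void (*process)(work_item*))
{
    PSLIST_ENTRY node = InterlockedFlushSList(head);
    PSLIST_ENTRY fifo = NULL;

    while (node) {       /* reverse LIFO -> FIFO */
        PSLIST_ENTRY next = node->Next;
        node->Next = fifo;
        fifo = node;
        node = next;
    }
    while (fifo) {       /* process in arrival order */
        PSLIST_ENTRY next = fifo->Next;
        process((work_item*)fifo);
        fifo = next;
    }
}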

>
> Second, can you tell how it works with IPC? Can it work for IPC?

Because there is no name assigned, the SList cannot establish an IPC
channel. However, because the caller has control over the placement of all
data structures involved, it should be possible to use it with shared memory.
It would be highly desirable in such cases to map the shared memory at the
same virtual address in each of the processes. If this is not possible, then
you probably can't use the Win32 API implementation, though the underlying
algorithm will still work (using relative addresses / indices instead of
pointers).
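One way to attempt that (a sketch; the fixed base address is an assumption,
and MapViewOfFileEx simply fails if the requested region is unavailable, in
which case you still need the relative-offset fallback):

#include <windows.h>

/* both processes call this with the same name, size, and desired base */
void* map_shared_at(const char* name, SIZE_T size, void* desired_base)
{
    void* view;
    HANDLE map = CreateFileMappingA(INVALID_HANDLE_VALUE, NULL,
                                    PAGE_READWRITE, 0, (DWORD)size, name);
    if (!map) return NULL;
    /* desired_base must be a multiple of the allocation granularity */
    view = MapViewOfFileEx(map, FILE_MAP_ALL_ACCESS, 0, 0,
                           size, desired_base);
    if (!view) CloseHandle(map);
    return view;
}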

>
> Thanks
>
> --
> HLS
>
>
>


Pops

Nov 14, 2007, 10:41:34 AM
Ben Voigt [C++ MVP] wrote:

> We were discussing a multiple writer, single consumer queue. For this the
> writers each use the InterlockedPushEntrySList function, while the consumer
> uses the InterlockedFlushSList function to retrieve the entire queue as an
> atomic operation. The consumer is then responsible for iterating through
> the list of items retrieved in FIFO order.

Gotcha!

>> Second, can you tell how it works with IPC? Can it work for IPC?
>
> Because there is no name assigned, the SList cannot establish an IPC
> channel. However, because the caller has control over the placement of all
> data structures involved, it should be possible to use with shared memory.
> It would be highly desirable in such cases to obtain the same virtual
> address space in each of the processes. If this is not possible, then you
> probably can't use the Win32 API implementation, though the underlying
> algorithm will still work (using relative addresses / indices instead of
> pointers).

Ok. Right.

I was interested to see if it can be applied in our virtual I/O
communication/channel framework, which currently uses RPC. It probably won't
apply to remote-machine operations, but it might prove useful as an
improved alternative for local RPC operations.

My whole mantra these days is to increase client/server performance and
scalability, so this API might just have some purpose for us.


--
HLS

Chris Thomasson

Nov 14, 2007, 12:24:31 PM
"Pops" <du...@nospamnospamnospam.com> wrote in message
news:%23tndYNs...@TK2MSFTNGP04.phx.gbl...

Yes. You can create a node cache in shared memory and allocate/deallocate
your SLIST_ENTRY from there. Something like:


(quick and dirty pseudo-code sketch)
____________________________________________________________

#define L2CACHE_LINESZ() 128
#define CACHE_DEPTH() 1024


typedef struct shm_node_s shm_node;
typedef struct shm_node_data_s shm_node_data;
typedef struct shm_node_cache_s shm_node_cache;

struct shm_node_data_s {
    /* some data */
};

struct shm_node_s {
    SLIST_ENTRY slnode;
    shm_node_data data;
};

struct shm_node_cache_s {
    SLIST_HEADER anchor;
    char pad[L2CACHE_LINESZ() - sizeof(SLIST_HEADER)];
    shm_node nodes[CACHE_DEPTH()];
};


shm_node_cache*
shm_node_cache_create(
    char const* const name
) {
    int i;
    shm_node_cache* const _this =
        /* create file map with name, alloc and align shm */;

    InitializeSListHead(&_this->anchor);

    for (i = 0; i < CACHE_DEPTH(); ++i) {
        InterlockedPushEntrySList(&_this->anchor,
                                  &_this->nodes[i].slnode);
    }

    return _this;
}


shm_node*
shm_node_cache_pop(
    shm_node_cache* const _this
) {
    return (shm_node*)InterlockedPopEntrySList(&_this->anchor);
}

void
shm_node_cache_push(
    shm_node_cache* const _this,
    shm_node* const node
) {
    InterlockedPushEntrySList(&_this->anchor, &node->slnode);
}


Now you can create your queue. Something like:


typedef struct shm_queue_s shm_queue;

struct shm_queue_s {
    SLIST_HEADER anchor;
    char pad[L2CACHE_LINESZ() - sizeof(SLIST_HEADER)];
    shm_node_cache* cache;
};


shm_queue*
shm_queue_create(
    shm_node_cache* const cache,
    char const* const name
) {
    shm_queue* const _this =
        /* create file map with name, alloc and align shm */;
    InitializeSListHead(&_this->anchor);
    _this->cache = cache;
    return _this;
}


int
shm_queue_process(
    shm_queue* const _this,
    void (*fp_process) (shm_node_data*)
) {
    int i = 0;
    shm_node* node = (shm_node*)InterlockedFlushSList(&_this->anchor);

    /* reverse order of node list here to create FIFO */

    while (node) {
        shm_node* const nx = (shm_node*)node->slnode.Next;
        fp_process(&node->data);
        shm_node_cache_push(_this->cache, node);
        ++i;
        node = nx;
    }
    return i;
}


BOOL
shm_queue_push(
    shm_queue* const _this,
    shm_node_data* const data
) {
    shm_node* const node = shm_node_cache_pop(_this->cache);
    if (! node) { return FALSE; }
    node->data = *data;
    InterlockedPushEntrySList(&_this->anchor, &node->slnode);
    return TRUE;
}

____________________________________________________________
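For what it's worth, usage of the sketch above might look like this (the
mapping names are hypothetical, and my_process_fn stands for a user-supplied
void my_process_fn(shm_node_data*) callback):

shm_node_cache* cache = shm_node_cache_create("Local\\node_cache");
shm_queue* queue = shm_queue_create(cache, "Local\\node_queue");

/* producers, in any process that mapped the same shm */
shm_node_data data = { /* ... */ };
shm_queue_push(queue, &data);

/* the consumer drains everything pushed so far, in one atomic grab */
int n = shm_queue_process(queue, my_process_fn);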


Also, for what it's worth, here is how Microsoft implements the SList API
on 32-bit systems:

typedef struct slist_anchor_s slist_anchor;
typedef struct slist_node_s slist_node;

struct slist_node_s {
    slist_node* nx;
};

struct slist_anchor_s {
    slist_node* head;
    /* monotonic counter to avoid the infamous aba problem */
    __int32 aba;
};


void
slist_anchor_init(
    slist_anchor* const _this
) {
    _this->head = 0;
    _this->aba = 0;
}


void
slist_anchor_push(
    slist_anchor* const _this,
    slist_node* node
) {
    slist_node* cmp;
    do {
        cmp = _this->head;
        node->nx = cmp;
    } while (! CAS(&_this->anchor, cmp, node));
}

slist_node*
slist_anchor_pop(
    slist_anchor* const _this
) {
    slist_anchor cmp, xchg;
    do {
        cmp = *_this;
        if (! cmp.head) { return 0; }

        /* hazard! refer to PDR algorithm for more details... */
        xchg.head = cmp.head->nx;

        /* increment ABA */
        xchg.aba = cmp.aba + 1;

    } while (! DWCAS(_this, &cmp, &xchg));

    return cmp.head;
}


Any thoughts?

Chris Thomasson

Nov 14, 2007, 12:27:05 PM

"Pops" <du...@nospamnospamnospam.com> wrote in message
news:%23tndYNs...@TK2MSFTNGP04.phx.gbl...

FWIW, if you want to create a blocking lock-less queue (e.g., consumer waits
while queue is empty), well here is an algorithm I invented a while back:

http://groups.google.de/group/comp.programming.threads/msg/632b6bdc2a137459


__________________________________________________

< pseudo-code >


typedef struct node_
{
    struct node_ *next;
    const void *state;
} node_t;


typedef struct stack_anchor_
{
    node_t *front;
    unsigned __int32 state;
} volatile stack_anchor_t;


typedef struct stack_
{
    stack_anchor_t anchor;
    HANDLE waitset;
} stack_t;


/* returns non-zero and updates comparand on failure */
extern int dwcas( volatile void *dest, void *cmp, const void *xchg );

void push( stack_t *_this, node_t *node )
{
    stack_anchor_t xchg, cmp;

    xchg.front = node;
    cmp = _this->anchor;

    do
    {
        node->next = cmp.front;

        if ( cmp.state & 0x0000FFFF )
        {
            xchg.state = cmp.state - 1; /* waiters */
        }
        else { xchg.state = cmp.state; }

    } while ( dwcas( &_this->anchor, &cmp, &xchg ) );

    if ( cmp.state & 0x0000FFFF )
    {
        /* wake one */
        if ( ! ReleaseSemaphore
               ( _this->waitset,
                 1,
                 0 ) )
        {
            assert( 0 ); abort();
        }
    }
}


node_t* pop( stack_t *_this )
{
    stack_anchor_t xchg, cmp = _this->anchor;

    for ( ;; )
    {
        if ( cmp.front ) /* condition predicate */
        {
            xchg.front = cmp.front->next; /* hazard */
            xchg.state = cmp.state + 0x10000; /* aba */
        }
        else /* stack is empty */
        {
            xchg.front = cmp.front;
            xchg.state = cmp.state + 1; /* waiters */
        }

        if ( ! dwcas( &_this->anchor, &cmp, &xchg ) )
        {
            if ( cmp.front ) { break; }

            if ( WaitForSingleObject
                   ( _this->waitset,
                     INFINITE ) != WAIT_OBJECT_0 )
            {
                assert( 0 ); abort();
            }

            cmp = _this->anchor;
        }
    }

    return cmp.front;
}

___________________________________________________________


Chris Thomasson

Nov 14, 2007, 12:29:23 PM
"Ben Voigt [C++ MVP]" <r...@nospam.nospam> wrote in message
news:uEfnFpsJ...@TK2MSFTNGP02.phx.gbl...

>
> "Pops" <du...@nospamnospamnospam.com> wrote in message
> news:%23tndYNs...@TK2MSFTNGP04.phx.gbl...
>> Ben (C++ MVP) recommended using the Interlocked Singly Linked Lists API:
>>
>> http://msdn2.microsoft.com/en-us/library/ms686962.aspx
>>
>> as a more efficient synchronization method for a recent thread discussion
>> in regards to a FIFO queue design.
>>
>> If Ben is reading or if anyone else is familiar with this API, I have a
>> few questions:

I have a question as well. Do you know if Microsoft uses a pointer
compression algorithm to implement their SList API on 64-bit systems? I
think I remember Neill Clift, who works on the Windows kernel, mentioning
something about that, but I can't seem to find the message on Google right
now.

Chris Thomasson

Nov 14, 2007, 12:55:54 PM
"Chris Thomasson" <cri...@comcast.net> wrote in message
news:Q9adnRAzYrses6ba...@comcast.com...

>
> "Pops" <du...@nospamnospamnospam.com> wrote in message
> news:%23tndYNs...@TK2MSFTNGP04.phx.gbl...
>> Ben (C++ MVP) recommended using the Interlocked Singly Linked Lists API:
>>
>> http://msdn2.microsoft.com/en-us/library/ms686962.aspx
>>
>> as a more efficient synchronization method for a recent thread discussion
>> in regards to a FIFO queue design.
>>
>> If Ben is reading or if anyone else is familiar with this API, I have a
>> few questions:
>>
>> First, based on what I see, this is for LIFO designs (i.e., stacks)? Not
>> FIFO. Am I missing something? I only see Push/Pop functions, no
>> head/tail functions.
>>
>> Second, can you tell how it works with IPC? Can it work for IPC?
>
> FWIW, if you want to create a blocking lock-less queue (e.g., consumer
> waits while queue is empty), well here is an algorithm I invented a while
> back:
>
> http://groups.google.de/group/comp.programming.threads/msg/632b6bdc2a137459
>
[...]

I should point out that the dwcas used in that algorithm uses an IBM-style
implementation: it automatically updates the comparand on failure. Here is
my implementation in x86 (AT&T syntax):

int
np_ac_i686_atomic_dwcas_fence(void*, void*, void*);


.globl np_ac_i686_atomic_dwcas_fence
np_ac_i686_atomic_dwcas_fence:
    pushl %esi
    pushl %ebx
    movl 16(%esp), %esi        # esi = comparand
    movl (%esi), %eax          # eax:edx = expected value
    movl 4(%esi), %edx
    movl 20(%esp), %esi        # esi = exchange
    movl (%esi), %ebx          # ebx:ecx = new value
    movl 4(%esi), %ecx
    movl 12(%esp), %esi        # esi = destination
    lock cmpxchg8b (%esi)      # 64-bit compare-and-swap
    jne np_ac_i686_atomic_dwcas_fence_fail
    xorl %eax, %eax            # success: return 0
    popl %ebx
    popl %esi
    ret

np_ac_i686_atomic_dwcas_fence_fail:
    movl 16(%esp), %esi        # failure: write the observed value
    movl %eax, (%esi)          #   back into the comparand
    movl %edx, 4(%esi)
    movl $1, %eax              # return non-zero
    popl %ebx
    popl %esi
    ret


Pops

Nov 14, 2007, 1:00:56 PM
Ben Voigt [C++ MVP] wrote:

> We were discussing a multiple writer, single consumer queue. For this the
> writers each use the InterlockedPushEntrySList function, while the consumer
> uses the InterlockedFlushSList function to retrieve the entire queue as an
> atomic operation. The consumer is then responsible for iterating through
> the list of items retrieved in FIFO order.

Ben, just a few more questions.

Sorry if I seem dense about this to you, but I would like to get this
clarification.

When I see you throwing in the multiple writer, single consumer queue
premise above, I am now getting the idea that you introduced this
Interlocked Singly Linked List API to me for this particular scenario
only: one consumer.

In other words, is this API thread safe? The docs say "nonblocking
algorithm to provide atomic synchronization", which I presume means it is
using some internal lock-free shared memory method.

But is it protected from multiple threads? I don't see how it can guard
a flush once the initial pointer is read without some lock.

TIA

--
HLS


Chris Thomasson

Nov 14, 2007, 2:16:31 PM
[comp.programming.threads included because the discussion is on lock-free
algorithms, and there's no place like cpt wrt that subject... ;^]


"Pops" <du...@nospamnospamnospam.com> wrote in message

news:%23uMo9hu...@TK2MSFTNGP03.phx.gbl...


> Ben Voigt [C++ MVP] wrote:
>
>> We were discussing a multiple writer, single consumer queue. For this
>> the writers each use the InterlockedPushEntrySList function, while the
>> consumer uses the InterlockedFlushSList function to retrieve the entire
>> queue as an atomic operation. The consumer is then responsible for
>> iterating through the list of items retrieved in FIFO order.

You can use multiple consumers with that particular technique:

http://groups.google.com/group/comp.arch/msg/6ac06e17e5138820

http://groups.google.ca/group/comp.programming.threads/msg/a359bfb41c68b98b


> Ben, just a few more questions.
>
> Sorry if I seem dense about this to you, but I would like to get this
> clarification.
>
> When I see you throwing in the multiple writer, single consumer queue
> premise above, I am now getting the idea that you introduced this
> Interlocked Singly Linked List API to me for this particular scenario
> only: one consumer.
>
> In other words, is this API thread safe? The docs say "nonblocking
> algorithm to provide atomic synchronization", which I presume means it is
> using some internal lock-free shared memory method.
>
> But is it protected from multiple threads?

The API is 100% atomically thread-safe. You can hit the following functions
with any number of threads concurrently:

______________________________________
InterlockedFlushSList
InterlockedPopEntrySList
InterlockedPushEntrySList
QueryDepthSList
______________________________________

BTW, this algorithm has been around for ages, as you probably know. In fact,
the IBM Principles of Operation gives a more detailed description of why it
is atomically thread-safe:

http://publibz.boulder.ibm.com/epubs/pdf/dz9zr002.pdf
(Appendix A/Multi-Processing Examples/Free-Pool Manipulation)


Did that help clear some things up for you?


> I don't see how it can guard a flush once the initial pointer is read
> without some lock.

The flush is implemented with a double-width compare-and-swap function that
can atomically mutate the entire SLIST_HEADER data-structure:

http://groups.google.co.uk/group/comp.programming.threads/browse_frm/thread/b5bedbe04e939f55
(I would advise to read entire thread...)


On x86-32 the instruction is CMPXCHG8B. On some x86-64s the instruction is
CMPXCHG16B. On the Itanium, there is a weird instruction called CMP8XCHG16B.
This kind of instruction is tricky to rely on because you cannot count on it
being available on most 64-bit systems. For instance, I do a lot of work on
the SPARC, which has the CASX instruction that works on a 64-bit datum in
32-bit mode; however, on SPARC64 the CASX instruction only works on a 64-bit
datum. This can be a royal pain in the neck if you're into inventing new
lock-free algorithms:

http://groups.google.com/group/comp.arch/browse_frm/thread/71f8e0094e353e5
(I would advise to read entire thread...)
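For the 32-bit case, here is a minimal sketch of my own (not the actual
Windows code) of an IBM-style DWCAS over an 8-byte pointer+counter anchor,
built on the CMPXCHG8B-backed MSVC intrinsic:

#include <intrin.h>

typedef struct anchor32_ {
    void* head;          /* 4 bytes on Win32 */
    unsigned long aba;   /* 4 bytes: monotonic counter */
} anchor32;              /* 8 bytes total; must be 8-byte aligned */

/* returns non-zero on success; updates *cmp on failure (IBM style) */
static int dwcas32(volatile __int64* dest, anchor32* cmp, anchor32 const* xchg)
{
    __int64 old = *(__int64*)cmp;
    __int64 prev = _InterlockedCompareExchange64(dest,
                                                 *(__int64 const*)xchg, old);
    if (prev == old) { return 1; }
    *(__int64*)cmp = prev;   /* hand the observed value back */
    return 0;
}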

There are some nifty tricks you can do to get around this nightmare. I
outline some of them here:

http://groups.google.com/group/comp.arch/msg/a3ebfe80363a4399
(examples toward end of message...)


Any thoughts on this?

Chris Thomasson

Nov 14, 2007, 2:50:39 PM
"Chris Thomasson" <cri...@comcast.net> wrote in message
news:uqednbJACNZ0sKba...@comcast.com...

> "Pops" <du...@nospamnospamnospam.com> wrote in message
> news:%23tndYNs...@TK2MSFTNGP04.phx.gbl...
[...]

>> Seconds, can you tell how it works with IPC, can it work for IPC?
>
> Yes. You can create a node cache in shared memory and allocate/deallocate
> your SLIST_ENTRY from there. Something like:
>
>
> (quick and dirty pseudo-code sketch)
> ____________________________________________________________
>
[...]
> } while (! CAS(&_this->anchor, cmp, node));
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

should read as:

} while (! CAS(&_this->head, cmp, node));

> }

I forgot to show the flush. Here is how to do that:


slist_node*
slist_anchor_flush(
    slist_anchor* const _this
) {
    return SWAP(&_this->head, 0);
}


The reason I don't need to bump the monotonic aba counter for the flush is
because it's immune to the problem in the first place. Here is why:

LIFO producers are inherently not affected by aba. On the other hand, LIFO
consumers are affected, but only if there is a "compare" type method being
used for synchronization. Using an atomic swap to set the stack anchor
to zero gets around aba wrt LIFO lock-free stacks. Was that similar to
your initial solution?

> slist_node*
> slist_anchor_pop(
>     slist_anchor* const _this
> ) {
>     slist_anchor cmp, xchg;
>     do {
>         cmp = *_this;
>         if (! cmp.head) { return 0; }
>
>         /* hazard! refer to PDR algorithm for more details... */
>         xchg.head = cmp.head->nx;
>
>         /* increment ABA */
>         xchg.aba = cmp.aba + 1;
>
>     } while (! DWCAS(_this, &cmp, &xchg));
>
>     return cmp.head;
> }
>
>
>
>
> Any thoughts?

BTW, would anybody like to see an actual implementation of this in shared
memory? If so, I will stick it on my to-do list.

;^)

Ben Voigt [C++ MVP]

Nov 14, 2007, 4:19:31 PM

"Chris Thomasson" <cri...@comcast.net> wrote in message
news:9c6dnab76-611aba...@comcast.com...

> [comp.programming.threads included because the discussion is on lock-free
> algorithms, and there no place like cpt wrt that subject... ;^]
>
>
> "Pops" <du...@nospamnospamnospam.com> wrote in message
> news:%23uMo9hu...@TK2MSFTNGP03.phx.gbl...
>> Ben Voigt [C++ MVP] wrote:
>>
>>> We were discussing a multiple writer, single consumer queue. For this
>>> the writers each use the InterlockedPushEntrySList function, while the
>>> consumer uses the InterlockedFlushSList function to retrieve the entire
>>> queue as an atomic operation. The consumer is then responsible for
>>> iterating through the list of items retrieved in FIFO order.
>
> You can use multiple consumers with that particular technique:

Yes, but with multiple consumers you

(a) won't get work evenly distributed
(b) ordering becomes rather meaningless, and Pops specified FIFO

[snip]

>> I don't see how it can guard a flush once the initial pointer is read
>> without some lock.
>
> The flush is implemented with double-width compare-and-swap function that
> can atomically mutate the entire SLIST_HEADER data-structure:

AFAICT, you need only pointer-sized CAS in order to make the singly-linked
list work; at least, I wrote my own version in C++ for template-driven
type-safety and used only pointer-sized functions (each element can only be
in the list once, maybe that helps).


Chris Thomasson

Nov 14, 2007, 4:48:20 PM
"Ben Voigt [C++ MVP]" <r...@nospam.nospam> wrote in message
news:u5uxrPwJ...@TK2MSFTNGP05.phx.gbl...

>
> "Chris Thomasson" <cri...@comcast.net> wrote in message
> news:9c6dnab76-611aba...@comcast.com...
>> [comp.programming.threads included because the discussion is on lock-free
>> algorithms, and there no place like cpt wrt that subject... ;^]
>>
>>
>> "Pops" <du...@nospamnospamnospam.com> wrote in message
>> news:%23uMo9hu...@TK2MSFTNGP03.phx.gbl...
>>> Ben Voigt [C++ MVP] wrote:
>>>
>>>> We were discussing a multiple writer, single consumer queue. For this
>>>> the writers each use the InterlockedPushEntrySList function, while the
>>>> consumer uses the InterlockedFlushSList function to retrieve the entire
>>>> queue as an atomic operation. The consumer is then responsible for
>>>> iterating through the list of items retrieved in FIFO order.
>>
>> You can use multiple consumers with that particular technique:
>
> Yes, but with multiple consumers you
>
> (a) won't get work evenly distributed

You could use a work-stealing scheme to provide automatic distribution.

> (b) ordering becomes rather meaningless, and Pops specified FIFO

Yes. Good point.


> [snip]
>
>>> I don't see how it can guard a flush once the initial pointer is read
>>> without some lock.
>>
>> The flush is implemented with double-width compare-and-swap function that
>> can atomically mutate the entire SLIST_HEADER data-structure:
>
> AFAICT, you need only pointer-sized CAS in order to make the singly-linked
> list work, at least I wrote my own version in C++ for template-driven
> type-safety and used only pointer-sized functions (each element can only
> be in the list once, maybe that helps).

Okay. Well, you can't immediately reuse or free nodes. So, if you can
guarantee that a node will never be popped off a list, and reused right away
when there could be other threads concurrently popping, then you're going to
need to take the ABA problem into account. Read here:

http://publibz.boulder.ibm.com/epubs/pdf/dz9zr002.pdf
(Appendix A/Multi-Processing Examples/Free-Pool Manipulation/Second
Paragraph)

This explains the ABA race condition in detail. After reading that, you will
understand why it's important to have some sort of solution to the ABA. DWCAS
is just one way to do it.

You cannot use pointer-sized CAS to remove a node from a lock-free LIFO
unless you have an external method to get around the ABA problem in general.
ABA is what happens when a thread T1 reads A, and then thread T2 uses CAS to
change A to B, then back to A. Thread T1 then performs its CAS, and it will
still succeed. This is fine for some algorithms; however, it's not okay for
others. For instance, if you don't have an external solution for ABA, and
you use pointer-sized CAS to try and pop an item off a lock-free linked
list, well, that's a massive race condition waiting to happen.


It sounds like you already solve ABA by enforcing the fact that a node will
only be in the list one time. However, you cannot deallocate nodes either,
unless you use a lifetime management scheme to get around the hazard in the
pop function:

_________________________________________


slist_node*
slist_anchor_pop(
    slist_anchor* const _this
) {
    slist_anchor cmp, xchg;
    do {
        cmp = *_this;
        if (! cmp.head) { return 0; }

        /* hazard! refer to PDR algorithm for more details... */
        xchg.head = cmp.head->nx;

        /* increment ABA */
        xchg.aba = cmp.aba + 1;

    } while (! DWCAS(_this, &cmp, &xchg));

    return cmp.head;
}
_________________________________________


You need to be able to get around the hazard. Here is a paper that describes
most of the concerns:

http://www.research.ibm.com/people/m/michael/podc-2002.pdf
(read all...)


You can hit the ABA scenario if you pop node A; free node A; allocate a new
node that happens to have the same address as the old node A; and push it
back into the list. This would change the head of the list from A, to
something else, and right back to A.

Chris Thomasson

Nov 14, 2007, 4:52:38 PM
[...]

>> AFAICT, you need only pointer-sized CAS in order to make the
>> singly-linked list work, at least I wrote my own version in C++ for
>> template-driven type-safety and used only pointer-sized functions (each
>> element can only be in the list once, maybe that helps).
>
> Okay. Well, you can't immediately reuse or free nodes.

> So, if you can
^^^^^^^^^^^^^^^^^^^^^^^^

Should read as:

So, if you cannot


> guarantee that a node will never be popped off a list, and reused right
> away when there could be other threads concurrently popping, then you're
> going to need to take the ABA problem into account. Read here:

[...]


___________________________________________________________________
Okay. Well, you can't immediately reuse or free nodes. So, if you _cannot_
guarantee that a node will never be popped off a list, and reused right away
when there could be other threads concurrently popping, then you're going to
need to take the ABA problem into account. Read here:

http://publibz.boulder.ibm.com/epubs/pdf/dz9zr002.pdf
(Appendix A/Multi-Processing Examples/Free-Pool Manipulation/Second
Paragraph)

This explains the ABA race condition in detail. After reading that, you will
understand why it's important to have some sort of solution to the ABA. DWCAS
is just one way to do it.

___________________________________________________________________

Sorry for any confusion!

Ben Voigt [C++ MVP]

Nov 14, 2007, 5:53:00 PM
> This explains the ABA race-condition in detail. After reading that, you
> will
> understand why its important to have some sort of solution to the ABA.
> DWCAS
> is just one way to do it.

After thinking about it for a bit, I do not think ABA is actually a problem.

The only way that InterlockedCompareExchange will succeed is if the old head
pointer is what I have stored in the next-node pointer of the new head,
therefore I have correctly inserted a new node regardless of what
manipulations took place in the meantime.

Please take a look at my queue implementation and let me know if you see any
problem (I realize it isn't truly lock-free until I replace the allocator):

#pragma once

#include "Queuefwd.h"

#pragma managed(push, off)

/**
** \brief thread-safe queue, single element append, all elements retrieved
**
** High-performance lockless queue for multi-threading applications. The
** order of the elements is preserved, but elements cannot be individually
** removed. Intended application is for multiple threads to post work items
** with a single reader.
**
** <center>
** \dotfile squeue-concept.dot "API Concept"
**
** \dotfile squeue-filled.dot "Typical State"
** </center>
**/
template<typename T>
class SQueue climodifier(sealed)
{
    /**
    ** \brief internal linked-list implementation of queue for max
    ** performance, follows ordinary singly-linked-list pattern
    **/
    struct SQueueElement climodifier(sealed)
    {
        /**
        ** \brief pointer to next list element, ala singly-linked-list
        ** pattern
        **
        ** NULL for the last element in the list
        **/
        SQueueElement* pNext;

        /**
        ** \brief data storage pigeonhole, ala singly-linked-list pattern
        **/
        const T data;

        SQueueElement( const T& dataInit )
            : data(dataInit)
        {}

    private:
        SQueueElement( const SQueueElement& );
        SQueueElement& operator=(const SQueueElement&);
    };

    /**
    ** \brief pointer to first list element, ala singly-linked-list pattern
    **
    ** NULL if the list is empty.
    **
    ** Since the elements are stored in reverse order to leverage
    ** the compare-exchange primitive, this first list position holds
    ** the last (most recent) queue element.
    **
    ** \invariant m_pFirst holds nullptr or the address of an SQueueElement.
    **/
    void* volatile m_pFirst;

    //static lenarray<T>* TransferElements(SQueueElement* pStart, size_t startIndex);
public:
    //! \brief Constructs a new empty queue
    SQueue( void ) { m_pFirst = nullptr; }

    //! \brief Adds an element to the end of the queue
    void Add(const T& data);
    //! \brief Retrieves all elements, removing them from the queue
    lenarray<T>* Get(void);
};

/**
** Appends the data parameter to the queue. A copy
** of data is made using T::operator=(const T&)
**
** \param[in] data Value to be added as a new tail element
**/
template<typename T>
void SQueue<T>::Add( const T& data )
{
    SQueueElement* pNewHead = new SQueueElement(data);
    void* pCurrent = m_pFirst;
    void* pOldHead;
    do {
        pOldHead = pCurrent;
        pNewHead->pNext = static_cast<SQueueElement*>(pOldHead);
    } while ((pCurrent = ::InterlockedCompareExchangePointer(&m_pFirst,
                 pNewHead, pOldHead)) != pOldHead);
}

/**
** Fetches all queued items. A copy
** of each element is made using T::operator=(const T&)
**
** \returns a lenarray containing the count of items removed,
** along with the items themselves in order of queueing.
**
** The complexity of this routine is due mainly to the fact
** that the queue is stored with items in reverse order for
** the sake of performance. It's not a bad tradeoff, because
** the extra code is minimal and it lets us avoid locking.
**
** \dotfile squeue-get.dot "Fetching the elements"
**/
template<typename T>
lenarray<T>* SQueue<T>::Get(void)
{
    SQueueElement* pGotTail = static_cast<SQueueElement*>(
        InterlockedExchangePointer(&m_pFirst, (void*)nullptr));

    if (pGotTail == nullptr)
        return nullptr;

    size_t length = 0;
    SQueueElement* pElem = pGotTail;
    do {
        length++;
        pElem = pElem->pNext;
    } while (pElem != nullptr);

    lenarray<T>* pArray = new lenarray<T>(length);
    T* pData = *pArray + length;

    pElem = pGotTail;
    do {
        *(--pData) = pElem->data;
        SQueueElement* pNext = pElem->pNext;
        delete pElem;
        pElem = pNext;
    } while (--length > 0);

    return pArray;
}
#pragma managed(pop)


Chris Thomasson

Nov 14, 2007, 6:15:47 PM
"Ben Voigt [C++ MVP]" <r...@nospam.nospam> wrote in message
news:O2Xn5DxJ...@TK2MSFTNGP03.phx.gbl...

>> This explains the ABA race-condition in detail. After reading that, you
>> will
>> understand why its important to have some sort of solution to the ABA.
>> DWCAS
>> is just one way to do it.
>
> After thinking about it for a bit, I do not think ABA is actually a
> problem.

It is. There are numerous papers that deal with the subject. If you have a
solution for it, that's great. I look forward to looking at the code you
attached.


>
> The only way that InterlockedCompareExchange will succeed is if the old
> head pointer is what I have stored in the next-node pointer of the new
> head, therefore I have correctly inserted a new node regardless of what
> manipulations took place in the meantime.

I suggest that you read the paper by Maged I linked to in the previous post.
The ABA problem is real. At least search the text in the PDF for the term
"ABA", and read the text contained in all the occurrences therein.

> Please take a look at my queue implementation and let me know if you see
> any problem (I realize it isn't truely lock-free until I replace the
> allocator):

Okay. I am going to look at it. I am still tinkering around with my
rwmutex_win class. I want to extend the API to allow for try read/write
functions. (e.g., tryrdlock/wrlock).


Chris Thomasson

Nov 14, 2007, 6:52:49 PM
"Ben Voigt [C++ MVP]" <r...@nospam.nospam> wrote in message
news:O2Xn5DxJ...@TK2MSFTNGP03.phx.gbl...

>> This explains the ABA race-condition in detail. After reading that, you
>> will
>> understand why its important to have some sort of solution to the ABA.
>> DWCAS
>> is just one way to do it.
>
> After thinking about it for a bit, I do not think ABA is actually a
> problem.
>
> The only way that InterlockedCompareExchange will succeed is if the old
> head pointer is what I have stored in the next-node pointer of the new
> head, therefore I have correctly inserted a new node regardless of what
> manipulations took place in the meantime.
>
> Please take a look at my queue implementation and let me know if you see
> any problem (I realize it isn't truely lock-free until I replace the
> allocator):
>
[...]

Good news! ;^)


From what I could gather at first glance, you have solved the ABA
problem by using the pattern described here:

http://groups.google.ca/group/comp.programming.threads/msg/49a5db082507cb87

http://groups.google.ca/group/comp.programming.threads/msg/a359bfb41c68b98b

http://groups.google.ca/group/comp.programming.threads/browse_frm/thread/fafe5f528700407


If you stick to that, you will be able to deallocate nodes and not have to
worry about it. BTW, when you do it this way, the Get function is not only
lock-free... It's wait-free. That's a nice property to strive for sometimes,
as long as your method tries to cut down on overheads.

Humm... Well, FWIW, if you're into low-overhead queuing, I suggest you skim
through the following message and follow the links:

http://groups.google.com/group/comp.arch/msg/a7501814f19731cd


What do you think about using distributed message passing algorithms w/
work-stealing scheme for natural auto-balance? I have a library I can
license... Actually, never mind that.

;^)

m

Nov 14, 2007, 8:43:27 PM
BTW: This forum is not a blog - you should really try to consolidate your
posts ;)

As you say: for LIFO or unordered lists, SCAS is sufficient in the general
case. For FIFO or binary data structures, an MCAS is required (DCAS or
TCAS). Under certain conditions a control variable can be used with CAS to
generate FIFO – the assumption being that fewer than 2^32 (or 2^64)
operations are performed on the list between the time that the control
variable is checked and when the CAS is executed.

Additionally, on Windows, FIFO & LIFO are mostly useless for multiple
producer / multiple consumer modes because the strict ordering that they
imply is lost when thread pre-emption occurs. So other than the vague
benefit of LIFO on memory residence (recently used pages are more likely to
be in the working set), lock-free ordered lists should be avoided.


"Chris Thomasson" <cri...@comcast.net> wrote in message

news:I5-dnZi55r10Faba...@comcast.com...

Chris Thomasson

Nov 14, 2007, 9:05:43 PM
"m" <m@b.c> wrote in message news:eFgEsjyJ...@TK2MSFTNGP02.phx.gbl...

> BTW: This forum is not a blog - you should really try to consolidate your
> posts ;)
>
>
>
> As you say: for LIFO or unordered lists, SCAS is sufficient in the general
> case.

Single CAS wrt LIFO is only good for the method that Ben uses:

http://groups.google.ca/group/comp.programming.threads/msg/49a5db082507cb87

http://groups.google.ca/group/comp.programming.threads/msg/a359bfb41c68b98b

http://groups.google.ca/group/comp.programming.threads/browse_frm/thread/fafe5f528700407

> For FIFO or binary data structures, an MCAS is required (DCAS or TCAS).

> Under certain conditions a control variable can be used with CAS to
> generate FIFO –

You can get FIFO with LIFO using the trick described in the links above. The
control variable is what I call the anchor. The anchor structure is opaque
and ambiguous. You can have an anchor for a LIFO/FIFO/whatever...

> the assumption being that fewer than 2^32 (or 2^64) operations are
> performed on the list between the time that the control variable is
> checked and when the CAS is executed.

Yup. The monotonic ABA counter can roll...


> Additionally, on Windows, FIFO & LIFO are mostly useless for multiple
> producer / multiple consumer modes because the strict ordering that they
> imply is lost when thread pre-emption occurs.

Sure.

> So other than the vague benefit of LIFO on memory residence (recently used
> pages are more likely to be in the working set), lock-free ordered lists
> should be avoided.
>

[...]

If your design cannot tolerate an algorithm, don't use it. I agree.

Chris Thomasson

Nov 14, 2007, 9:09:58 PM
"Chris Thomasson" <cri...@comcast.net> wrote in message
news:T7Kdnd74edSNNaba...@comcast.com...

> "m" <m@b.c> wrote in message news:eFgEsjyJ...@TK2MSFTNGP02.phx.gbl...
>> BTW: This forum is not a blog - you should really try to consolidate your
>> posts ;)
>>
>>
>>
>> As you say: for LIFO or unordered lists, SCAS is sufficient in the
>> general case.
>
> Single CAS wrt LIFO is only good for the method that Ben uses:
>
> http://groups.google.ca/group/comp.programming.threads/msg/49a5db082507cb87
>
> http://groups.google.ca/group/comp.programming.threads/msg/a359bfb41c68b98b
>
> http://groups.google.ca/group/comp.programming.threads/browse_frm/thread/fafe5f528700407
>
>
>
>
>
>> For FIFO or binary data structures, an MCAS is required (DCAS or TCAS).
>> Under certain conditions a control variable can be used with CAS to
>> generate FIFO –

Humm... Are you stamping the nodes with the sequence number of the monotonic
aba counter at time of push and/or pop? Are you sorting them in ascending
order or something?

Chris Thomasson

Nov 14, 2007, 9:22:48 PM
> Humm... Are you stamping the nodes with the sequence number of the
> monotonic aba counter at time of push and/or pop? Are you sorting them in
> ascending order or something?

I meant to say descending order as ascending would be LIFO.

Chris Thomasson

Nov 14, 2007, 10:44:06 PM
"Chris Thomasson" <cri...@comcast.net> wrote in message
news:T7Kdnd74edSNNaba...@comcast.com...

> "m" <m@b.c> wrote in message news:eFgEsjyJ...@TK2MSFTNGP02.phx.gbl...
>> BTW: This forum is not a blog - you should really try to consolidate your
>> posts ;)
>>
>>
>>
>> As you say: for LIFO or unordered lists, SCAS is sufficient in the
>> general case.
>
> Single CAS wrt LIFO is only good for the method that Ben uses:

Well, you can also use single CAS if you provide some other solution to the
ABA problem. The method that Ben's code uses is one of many.


Joe Seigh

Nov 15, 2007, 6:16:57 AM
m wrote:
> BTW: This form is not a blog - you should really try to consolidate your
> posts ;)
>
>
>
> As you say: for LIFO or unordered lists, SCAS is sufficient in the general
> case. For FIFO or binary data structures, an MCAS is required (DCAS or
> TCAS). Under certain conditions a control variable can be used with CAS to
> generate FIFO – the assumption being that fewer than 2^32 (or 2^64)
> operations are performed on the list between the time that the control
> variable is checked and when the CAS is executed.
>
>
>
> Additionally, on Windows, FIFO & LIFO are mostly useless for multiple
> producer / multiple consumer modes because the strict ordering that they
> imply is lost when thread pre-emption occurs. So other than the vague
> benefit of LIFO on memory residence (recently used pages are more likely to
> be in the working set), lock-free ordered lists should be avoided.
>
>
Lock-free is a forward progress guarantee. It has no effect on semantics.
You shouldn't be able to tell the difference between lock-free and locked
queues. The observed order of dequeues will be the same as the order of any
corresponding enqueues that were observed. This is for queues, which is what
is inferred when you use the term producer/consumer. If you're talking
about performing arbitrary functions on an arbitrary collection, e.g. an
ordered list, then they're really just the same as an arbitrary data
structure where the operations on them consist of multiple accesses on
the data. You need some form of synchronization to coordinate those
accesses. This could be lock based, MCAS, or STM.

Thread preemption was only a significant distinction when stuff ran on
single processor systems and only one thread could run at a time. That's
no longer true on multi-cores.

--
Joe Seigh

When you get lemons, you make lemonade.
When you get hardware, you make software.

Pops

Nov 15, 2007, 7:27:18 AM

Thanks for bringing back the focus and a level of talking I can relate
to. :-)

(This is a side note: I believe, at least under WIN32, it is possible to
programmatically delegate your threads to a specific processor. I've never
tried it myself, but I recall a reference to it, in this case under the
guise of potential security conflicts. Maybe that was for hyper-threaded
processors only.)

I should make it clear, especially for Chris, that all our work is 100%
for WIN32 only: very little dependency on its subsystem specifics or
NT-based-only methods, staying clear of APIs with hidden logic,
experimental ideas, and/or fuzzy considerations related to HW.

With our current charter to improve performance/scalability, we are
looking at taking advantage of some of the NT-based only offerings.

Part of the issue is the design decision made in 1996 (so those OS/HW
realities of the day were important) to assure quality, maybe at the
expense of performance. Broadly speaking, our RPC client/server framework
uses critical sections, rw-locks, semaphores, etc., in much of our
multi-context, multi-threaded data queuing, circular buffer, and FIFO/LIFO
synchronization needs.

So there is much room for improvement in many areas.

While I look for those NT-based improvements, I also have to measure
disk vs memory (or both) queuing, local vs remote IPC, context switching
issues, single vs multi vs hyper-threaded, affinity considerations, etc.
Most of the time, we look for and/or design generalized, reusable,
"practical" solutions and methods.

That said, this Interlocked Singly Linked List API seems to be useful, but
it still seems to require a lock-sync when using its FLUSH function, not to
be confused with a PURGE consideration. To me, a FLUSH is a purge if the
data is not available or saved during the flush.

In this case, it provides a pointer to a list. To me, in order to be
thread safe, the function has to make a COPY of the data and provide a new
pointer list; then it can safely purge the internal list it is maintaining.

That is basically all I was looking for here, a confirmation of the
workings of this flush.

The order of the list is only important depending on how and where it is
applied. As suggested by Ben, it is easy enough to transform a LIFO into a
FIFO. But it cannot be applied in areas where FIFO is required if the LIFO
order cannot be guaranteed.

Thanks

--
HLS

J de Boyne Pollard

Nov 15, 2007, 8:52:24 AM
m> BTW: This forum is not a blog - you should really try to
m> consolidate your posts ;)

I notice that before you wrote that, we were getting two messages in
reply to everything; but after you wrote it, we started getting four
messages in reply to everything. (-:

Pops

Nov 15, 2007, 9:44:24 AM

But now you continue to add more junk to the junk already created? Go
figure! Geez, M wasn't even referring to the cross-posting, but rather to
Chris's highly energetic eagerness to assist, with a prolific posting
ability and corrective follow-ups - hence M's consolidation point.

If you have something to contribute, by all means, please do. I started
this thread, the "rest of us" would rather wish you don't screw it up
with junk.

--
HLS

Ben Voigt [C++ MVP]

Nov 15, 2007, 10:29:04 AM

"Chris Thomasson" <cri...@comcast.net> wrote in message
news:ReednVXXxKKhHaba...@comcast.com...

> "Ben Voigt [C++ MVP]" <r...@nospam.nospam> wrote in message
> news:O2Xn5DxJ...@TK2MSFTNGP03.phx.gbl...
>>> This explains the ABA race-condition in detail. After reading that, you
>>> will
>>> understand why its important to have some sort of solution to the ABA.
>>> DWCAS
>>> is just one way to do it.
>>
>> After thinking about it for a bit, I do not think ABA is actually a
>> problem.
>
> It is. There are numerous papers that deal with the subject. If you have a
> soultion for it thats great. I look forward to looking at the code you
> attached.

I meant "a problem for me". Sorry about the overly general statement.


Ben Voigt [C++ MVP]

Nov 15, 2007, 10:28:08 AM

"Chris Thomasson" <cri...@comcast.net> wrote in message
news:I5-dnZi55r10Faba...@comcast.com...

Thanks for the review. I was pretty sure it was correct but it's very nice
to have concurrence from an expert.

As far as I'm concerned, the only (tenable) way to write multi-threaded code
is to have a couple of lock-free thread-safe components (like my squeue)
that fit entirely on a single page, and elsewhere avoid accessing the same
variable from more than one thread. In any non-trivial project, trying to
protect shared data structures any other way is just too complex to reason
about, with too many opportunities for deadlocks and/or race conditions; and
even if those are solved, contention rises to a degree that removes the
benefits of multithreading.

>
> Humm... Well, FWIW, if your into low-overhead queuing, I suggest you skim
> through the following message and follow links:
>
> http://groups.google.com/group/comp.arch/msg/a7501814f19731cd
>

Thanks, I'll take a look.

Chris Thomasson

Nov 15, 2007, 12:23:45 PM
"Pops" <du...@nospamnospamnospam.com> wrote in message
news:eUNbMM4J...@TK2MSFTNGP05.phx.gbl...
> Joe Seigh wrote:
[...]

> That said, this Interlocked Singly List API seems to be useful but it
> still seems to require a lock-sync when using its FLUSH function, not for
> to be confused a PURGE consideration. To me, a FLUSH is a purge if the
> data is not available or saved during the flush.

The reason you don't need an explicit mutual exclusion lock is that you are
using instructions that atomically modify the entire list anchor in a
single atomic operation.


>
> In this case, it provides a pointer to a list. To me, in order to be
> thread safe, the function has to be make a COPY of the data and provide a
> new pointer list, then it can safely purge the internal list it is
> maintaining.

[...]

You do not need COW semantics wrt flushing the contents of a lock-free
LIFO. You simply SWAP the head of the list with NULL, and that's it.
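In sketch form (node and head are stand-ins for whatever your list uses;
Ben's SQueue<T>::Get() earlier in the thread does exactly this):

/* the consumer privately claims the entire chain with one atomic swap;
   nothing is copied, and producers keep pushing onto the now-empty head */
node* chain = (node*)InterlockedExchangePointer((PVOID volatile*)&head, NULL);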

m

Nov 15, 2007, 12:42:42 PM
lol!

"J de Boyne Pollard" <j.deboyn...@tesco.net> wrote in message
news:728fdf8b-9c02-4828...@b32g2000hsa.googlegroups.com...

J de Boyne Pollard

Nov 15, 2007, 1:33:25 PM
m> BTW: This forum is not a blog - you should really try to
m> consolidate your posts ;)

JdeBP> I notice that before you wrote that, we were getting
JdeBP> two messages in reply to everything; but after you
JdeBP> wrote it, we started getting four messages in reply
JdeBP> to everything. (-:

P> [...] Geez, M wasn't even referring to the cross posting [...]

... and neither was I.

m

Nov 15, 2007, 1:31:36 PM
Yes, I am extending the discussion to include all kinds of data structures
besides queues; and yes, there are various ways to protect them. No, I don’t
really want to discuss them in detail; I just wanted to put something out
there to see if I could pique someone’s interest ;)

Thread pre-emption is important on all pre-emptive multi-tasking operating
systems regardless of how many CPUs are present in the system. The
importance of this consideration lies in the fact that between any two
instructions executed by thread A, thread B may have executed a large number
of instructions (either on this CPU or on another one) or none at all. The
reason this is, implicitly, important is that, even though one could use a
lock-free algorithm for nearly anything, they are specifically designed for
HPC scenarios with multiple producers & multiple consumers (or more
generally, multiple modifiers).

Lock-free is not a forward progress guarantee. Lock-free, a misnomer, is an
idiom in which software locks are replaced with hardware locks because they
are, presumably, more efficient. This efficiency is achieved, if by
nothing else, by being smaller (one instruction vs. hundreds) and by
avoiding interaction with the OS scheduler. Lock-free algorithms must
provide a forward progress guarantee, otherwise they would cease to function
under contention, but that guarantee, in and of itself, does not make an
algorithm lock-free and hence does not define the idiom.


"Joe Seigh" <jsei...@xemaps.com> wrote in message
news:JUV_i.7973$WM.837@trndny05...

Chris Thomasson

Nov 15, 2007, 2:37:08 PM
"m" <m@b.c> wrote in message news:uCb9BX7J...@TK2MSFTNGP02.phx.gbl...
[...]

>
>
> Lock-free is not a forward progress guarantee. Lock-free, a misnomer, is
> an idiom in which software locks are replaced with hardware locks because
> they are, presumably, more efficient. This efficiently is achieved, if by
> nothing else, by being smaller (one instruction vs. hundreds) and by
> avoiding interaction with the OS scheduler. Lock-free algorithms must
> provide a forward progress guarantee, otherwise they would cease to
> function under contention, but that guarantee, in and of itself, does not
> make an algorithm lock-free and hence does not define the idiom.
[...]

I think you should look at the differences between wait-free and lock-free.
Wait-free means that the operation will complete within a finite number of
steps; in other words, there are no loops. Lock-free implies conditional
looping. You can get into live-lock under certain scenarios. For instance, a
live-lock can ensue if the reservation made by a load-linked is constantly
invalidated before the store-conditional gets to execute. A PowerPC manual
mentions some of this wrt false sharing messing with the reservation. That's
why STM is so prone to live-lock... STM is like LL/SC with huge reservation
granules.
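The shape of the problem, as a sketch (load_linked/store_conditional are
hypothetical stand-ins for PPC lwarx/stwcx.):

typedef struct node_ { struct node_* next; } node;

extern void* load_linked(void* volatile* addr);
extern int store_conditional(void* volatile* addr, void* value); /* 0 = lost */

void llsc_push(void* volatile* head, node* n)
{
    do {
        n->next = (node*)load_linked(head);
        /* any store into the reservation granule by another CPU, including
           a false-sharing store to a neighboring variable, clears the
           reservation and forces another lap around this loop */
    } while (! store_conditional(head, n));
}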

Chris Thomasson

Nov 15, 2007, 3:14:10 PM
One could theoretically open themselves up to live-lock conditions if they
did not take this into consideration when implementing certain lock-free
algorithms on PPC. For instance, if the anchor of a lock-free LIFO was not
the size of a cache line, and was not aligned on a cache-line boundary, you
could get "false sharing" on the anchor. I don't have much experience with
PPC, but it seems that false sharing could break reservations on the anchor
and adversely affect forward progress, and might cause live-lock-like
conditions...
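One way to guard against that on MSVC (a sketch; 128 is an assumed upper
bound on the cache-line/reservation-granule size):

#include <windows.h>

#define LINESZ 128

/* keep the SList anchor alone on its own cache line so that stores to
   neighboring data cannot false-share with it */
typedef struct padded_anchor_ {
    __declspec(align(128)) SLIST_HEADER anchor;
    char pad[LINESZ - sizeof(SLIST_HEADER)];
} padded_anchor;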

m

Nov 15, 2007, 3:40:36 PM
Lock-free & wait-free are independent idioms which, in certain algorithms,
can overlap. Except that lock-free doesn’t imply conditional looping
(this is an attribute of many algorithms that are lock-free, not a defining
property of the idiom), your definitions are reasonable; so what am I
missing?

The fact that on certain architectures, false-sharing can adversely affect
performance is a consideration in implementation and algorithm choice.
Specifically, don’t choose an algorithm that is hard to implement or
performs badly on your target architecture ;)

"Chris Thomasson" <cri...@comcast.net> wrote in message

news:f7OdndviL--1OqHa...@comcast.com...

Chris Thomasson

Nov 15, 2007, 4:00:17 PM
> "Chris Thomasson" <cri...@comcast.net> wrote in message
> news:f7OdndviL--1OqHa...@comcast.com...
>> One could theoretically open themselves up to live-lock conditions if
>> they
>> did not take this into consideration when implementing certain lock-free
>> algorithms on PPC. For instance, if the anchor of a lock-free LIFO was
>> not
>> the size of a cache-line, and was not aligned on a cache-line boundary
>> you
>> could get "false-sharing" on the anchor. I don't have much experience
>> with
>> PPC, but it seems that false-sharing could break reservations on the
>> anchor
>> and adversely affect forward progress, and might cause live-lock like
>> conditions...

[top-post corrected]


"m" <m@b.c> wrote in message news:OMNiHf8J...@TK2MSFTNGP03.phx.gbl...


> Lock-free & wait-free are independent idioms which, in certain algorithms,
> can overlap. Except that lock-free doesn’t imply conditional looping
> (this is an attribute of many algorithms that are lock-free, not a defining
> property of the idiom),

Fair enough.


> your definitions are reasonable; so what am I missing?

I don't think you're missing anything.


> The fact that on certain architectures, false-sharing can adversely affect
> performance is a consideration in implementation and algorithm choice.
> Specifically, don’t choose an algorithm that is hard to implement or
> performs badly on your target architecture ;)

100% Agreed.

Chris Thomasson

Nov 15, 2007, 11:17:15 PM
"Ben Voigt [C++ MVP]" <r...@nospam.nospam> wrote in message
news:uwEcTw5J...@TK2MSFTNGP03.phx.gbl...

>
> "Chris Thomasson" <cri...@comcast.net> wrote in message
> news:I5-dnZi55r10Faba...@comcast.com...
>> "Ben Voigt [C++ MVP]" <r...@nospam.nospam> wrote in message
>> news:O2Xn5DxJ...@TK2MSFTNGP03.phx.gbl...
[...]

>>>
>>> Please take a look at my queue implementation and let me know if you see
>>> any problem (I realize it isn't truely lock-free until I replace the
>>> allocator):
>>>
>> [...]
>>
>> Good news! ;^)
[...]

> Thanks for the review. I was pretty sure it was correct but it's very
> nice to have concurrence from an expert.

You're welcome. I think your SQueue class is sound wrt the overall concerns
of the ABA problem in general. I also think that the code correctly
addresses the concerns of deallocating nodes from lock-free data-structures.
You have a performance enhancement in the SQueue<T>::Get() function because
it's 100% wait-free, as long as the hardware that Windows happens to be
running on supports an atomic swap instruction; Intel does. SPARC used to
support the SWAP instruction; they deprecated it, and delegated its
functionality over to the CAS instruction on 32-bit systems, and CASX on
64-bit systems. It's kind of funny, IMHO, simply because CASX operates on a
64-bit datum in 64-bit mode, despite its literal name; the way I read
things, CASX implies DWCAS-like operation, which means 128-bit operation
when in 64-bit mode. That seems odd to me, to say the least...

> As far as I'm concerned, the only (tenable) way to write multi-threaded
> code is to have a couple of lock-free thread-safe components (like my
> squeue) that fit entirely on a single page, and elsewhere avoid accessing
> the same variable from more than one thread. In any non-trivial project,
> trying to protect shared data structures any other way is just too complex
> to reason about, with too many opportunities for deadlocks and/or race
> conditions, and even if those are solved, then contention to a degree that
> removes the benefits of multithreading.

Well, deadlocks are caused by the misuse of locking. IMVHO, if a programmer
cannot get lock-based algorithms correct, then they should be nowhere near a
lock-free one, to say the least. The first of the two links provided below
points out some of the basic caveats wrt replacing a lock-based
synchronization point with its non-blocking equivalent. For instance, if you
stick a huge engine in a car with crappy tires, well, then Murphy's Law
dictates that you're going to have a blow-out at high speeds, on a windy
road, with a cliff on one side. Therefore, you must upgrade the tires, and
probably some other components as well... Solving that in an efficient
manner can sometimes turn out to be a fairly long and tedious endeavor;
indeed!


>> Humm... Well, FWIW, if your into low-overhead queuing, I suggest you skim
>> through the following message and follow links:
>>
>> http://groups.google.com/group/comp.arch/msg/a7501814f19731cd
>>
>
> Thanks, I'll take a look.

[...]

I think that you just might find the brief overview of the message-passing
techniques described therein fairly interesting, and perhaps a "bit useful",
as in you may be able to fit it into some of your _existing_ synchronization
schemes with "fairly minimal" effort(s). You can read the following for some
more details on that particular subject:

http://groups.google.com/group/comp.programming.threads/msg/9a5500d831dd2ec7

http://groups.google.com/group/comp.programming.threads/browse_frm/thread/ef3f26bbe1c52e0b


Any thoughts on this?

Tim Roberts

Nov 15, 2007, 11:48:12 PM
"Chris Thomasson" <cri...@comcast.net> wrote:
>
>I have a question as well. Do you know if Microsoft uses a pointer
>compression algorithm to implement their SList API on 64-bit systems? I
>think I remember Neill Clift, who works on the Windows Kernel, mention
>something about that, but I can't seem to find the message on Google right
>now.

I, personally, would be very surprised if they did. The benefit is so
small, and memory is essentially free.

Further, in my experience, most kernel structures use doubly-linked
lists.
--
Tim Roberts, ti...@probo.com, DDK MVP
Providenza & Boekelheide, Inc.

Joe Seigh

Nov 16, 2007, 6:22:34 AM
m wrote:
> Yes I am extending the discussion to include all kinds of data structures
> besides queues; and yes there are various ways to protect them. No I don’t
> really want to discuss them in detail, I just wanted to put something out
> there to see if I could tweak someone’s interest ;)
>
You made a generalization and applied it to the special case.

>
>
> Thread pre-emption is important on all pre-emptive multi-tasking operating
> systems regardless of how many CPUs are present in the system. The
> importance of this consideration lies in the fact that between any two
> instructions executed by thread A, thread B may have executed a large number
> of instructions (either on this CPU or on another one) or none at all. The
> reason this is, implicitly, important is that, even though one could use a
> lock-free algorithm for nearly anything, they are specifically designed for
> HPC scenarios with multiple producers & multiple consumers (or more
> generally, multiple modifiers).


Yes, preemption causes more variation in a thread's rate of progress, but so
what?
In multi-threading you don't make any assumptions about the rate of forward
progress, preemption or not. And it's certainly news to me that lock-free
is mainly important to HPC and not so much for other things.


>
>
>
> Lock-free is not a forward progress guarantee. Lock-free, a misnomer, is an
> idiom in which software locks are replaced with hardware locks because they
> are, presumably, more efficient. This efficiency is achieved, if by
> nothing else, by being smaller (one instruction vs. hundreds) and by
> avoiding interaction with the OS scheduler. Lock-free algorithms must
> provide a forward progress guarantee, otherwise they would cease to function
> under contention, but that guarantee, in and of itself, does not make an
> algorithm lock-free and hence does not define the idiom.
>

The most commonly used definition for lock-free is by Herlihy. He also did
definitions for wait-free and obstruction-free.

Pops

unread,
Nov 16, 2007, 8:25:43 AM11/16/07
to
Ben Voigt [C++ MVP] wrote:

> We were discussing a multiple writer, single consumer queue. For this the
> writers each use the InterlockedPushEntrySList function, while the consumer
> uses the InterlockedFlushSList function to retrieve the entire queue as an
> atomic operation. The consumer is then responsible for iterating through
> the list of items retrieved in FIFO order.
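
A minimal sketch of that push/flush pattern (illustrative names, not code
from either test program discussed below; the SLIST_ENTRY must be the first
member so the cast works, and nodes must be allocated with
MEMORY_ALLOCATION_ALIGNMENT):

#include <windows.h>
#include <malloc.h>
#include <stdio.h>

struct Item {
    SLIST_ENTRY entry;   /* must be first, and 16-byte aligned on x64 */
    int         payload;
};

static SLIST_HEADER g_head;

/* Any number of writer threads may call this concurrently. */
void Produce(int value)
{
    Item* it = (Item*)_aligned_malloc(sizeof(Item),
                                      MEMORY_ALLOCATION_ALIGNMENT);
    it->payload = value;
    InterlockedPushEntrySList(&g_head, &it->entry);
}

/* Single consumer: grab the whole LIFO chain in one atomic
   operation, then reverse it locally to recover FIFO order. */
void ConsumeAll(void)
{
    SLIST_ENTRY* chain = InterlockedFlushSList(&g_head);
    SLIST_ENTRY* fifo  = NULL;
    while (chain) {                  /* reverse in place */
        SLIST_ENTRY* next = chain->Next;
        chain->Next = fifo;
        fifo  = chain;
        chain = next;
    }
    while (fifo) {
        Item* it = (Item*)fifo;      /* entry is the first member */
        fifo = fifo->Next;
        printf("%d\n", it->payload);
        _aligned_free(it);
    }
}

int main(void)
{
    InitializeSListHead(&g_head);
    for (int i = 0; i < 3; ++i)
        Produce(i);
    ConsumeAll();                    /* prints 0 1 2 */
    return 0;
}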

Hi Ben,

I've been spending the last day or so working on this Interlocked Singly
Linked List (ISLL) API and profiled two essentially identical
applications, with one key difference:

PROG-A: MFC Clist FIFO queue w/ my RW and Chris's RW code,
PROG-B: ISLL with no RW code. Only LIFO analyzed here.

Each allows me to play with different parameters:

#define USE_CRITSECT_H // Use my RW or Chris's RW

BOOL PRINT_LINES = FALSE; // Avoid console overhead
DWORD MAX_WRITER_THREADS = 1; // # of writer threads
DWORD MAX_READER_THREADS = 1; // # of reader threads
DWORD AUTO_STOP_COUNT = 1000; // # of writer events
DWORD WRITER_FREQ_MS = 1; // Frequency of writes
DWORD READER_FREQ_MS = 1; // Frequency of reads
DWORD READER_EVENT_THRESHOLD= 0; // # before reader is signaled
DWORD READER_PROCESS_DELAY = 0; // Simulate processing time

Each writer and reader thread runs at a one (1) quantum frequency using a
while( WFSO(ShutdownEvent,1) ) thread-break test. So essentially a 15 ms
frequency of pumping one item into the list, and a 15 ms reader frequency.
The wait can be changed with the XXXX_FREQ_MS parameters.
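
A sketch of that loop (g_hShutdownEvent and PushOneItem are assumed
stand-ins, not actual names from the test programs):

#include <windows.h>

extern HANDLE g_hShutdownEvent;   /* assumed: set by the main thread */
extern DWORD  WRITER_FREQ_MS;     /* the pacing parameter above      */
void PushOneItem(void);           /* hypothetical: pumps one item    */

/* WFSO on the shutdown event doubles as the pacing wait; a 1 ms
   timeout rounds up to one scheduler quantum (~15 ms here). */
DWORD WINAPI WriterThread(LPVOID unused)
{
    while (WaitForSingleObject(g_hShutdownEvent,
                               WRITER_FREQ_MS) == WAIT_TIMEOUT) {
        PushOneItem();
    }
    return 0;   /* shutdown event signaled */
}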

AUTO_STOP_COUNT allows the test to end. The main thread watches this
count before it signals the threads to shut down.

The READER_EVENT_THRESHOLD, if set, allows the queue to build up before
the writer signals the reader event.

For PROG-A with the RW locks, the USE_CRITSECT_H define lets me test
my RW classes or Chris's fine RW class. :-)

The testing is on an Intel Pentium D 2.66 GHz, 1 GB RAM XP/Pro box.

The program's main thread creates the threads. When AUTO_STOP_COUNT is
reached, it signals the shutdown event.

Another console utility is used to set the performance counters to watch
for and log. In this profile, I am only looking at the context-switch rate.

The results:

# PROG-A: USING SSI CRITSECT.H READER/WRITER CLASSES
* Successful Shutdown: 16000 ms
* Total Processed : 1023
* Total Empty Gets : 362

PROG-A Reader: ~117 cs/sec
PROG-A Writer: ~88 cs/sec

# PROG-A: USING CHRIS'S READER/WRITER CLASSES
* Successful Shutdown: 16000 ms
* Total Processed : 1023
* Total Empty Gets : 259

PROG-A Reader: ~77 cs/sec
PROG-A Writer: ~71 cs/sec

So Chris's code offers less context switching. Don't get excited, Chris.
:-) Note that the reader in the first test is wasting time with empty read
events; this comes into play in the additional tests.

# PROG-B: USING ISLL API WITH NO RW LOCKS
* Successful Shutdown: 16000 ms
* Total Processed : 1023
* Total Empty Gets : 279

PROG-B Reader: ~67 cs/sec
PROG-B Writer: ~65 cs/sec

Slightly better again with the ISLL, with about the same waste in reader
events. Note: this is using a PUSH and POP LIFO.

For the next tests, I optimized the synchronization using event-driven
thresholds, READER_EVENT_THRESHOLD = 10; the pump will signal the
reader when 10 items are added to the list.

# PROG-A: USING SSI CRITSECT.H READER/WRITER CLASSES
* Successful Shutdown: 16000 ms
* Total Processed : 1021
* Total Empty Gets : 0

PROG-A Reader: ~6 cs/sec
PROG-A Writer: ~70 cs/sec

A drastic improvement here, by collecting a bucket of items before
processing. Notice there is no waste in reader events.

# PROG-A: USING CHRIS'S READER/WRITER CLASSES
* Successful Shutdown: 16000 ms
* Total Processed : 1021
* Total Empty Gets : 0

PROG-A Reader: ~8 cs/sec
PROG-A Writer: ~66 cs/sec

Essentially the same performance.

# PROG-B: USING ISLL API WITH NO RW LOCKS
* Successful Shutdown: 16000 ms
* Total Processed : 1021
* Total Empty Gets : 0

PROG-B Reader: ~7 cs/sec
PROG-B Writer: ~67 cs/sec

And for the ISLL, essentially the same performance when using a synchronized
event-driven bucket brigade.
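
A sketch of that threshold signaling (names mirror the test parameters;
the event handle and counter are assumed from the harness, not taken from
the actual programs):

#include <windows.h>

extern HANDLE g_hReaderEvent;          /* assumed harness event */
extern DWORD  READER_EVENT_THRESHOLD;  /* the parameter above   */
static volatile LONG g_pending = 0;

/* Called by the writer after each push: signal the reader only once
   a full bucket of items has accumulated, so the reader wakes to a
   batch instead of to every single item. A threshold of 0 degrades
   to signaling on every push. */
void AfterPush(void)
{
    LONG n = InterlockedIncrement(&g_pending);
    if ((DWORD)n >= READER_EVENT_THRESHOLD) {
        InterlockedExchange(&g_pending, 0);
        SetEvent(g_hReaderEvent);      /* reader drains the batch */
    }
}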

I performed many other tests and real-world simulations, like adding
delays for processing time and lowering the frequency of the pump, but
overall I think it is pretty clear to me that there is no big difference
once you take the key optimization and synchronization factors into account.

In regards to the ISLL, I did see a very slight improvement using the
flush when I began to work on a LIFO-to-FIFO transformation. But seeing
that the regular MFC CList-based FIFO performed equally well, to me the
added complexity, the uncertainty of lock-free methods, and the OS
dependency make it impractical at the moment.

Other considerations, such as memory fragmentation, still need to be reviewed.

Anyway, I thought I would share these results.

Thanks

--
HLS

Alexander Grigoriev

unread,
Nov 16, 2007, 9:11:12 AM11/16/07
to
For lack of a double-pointer-sized InterlockedCompareExchange on some 64-bit
platforms, it's necessary to pack more than a pointer into SLIST_HEADER:

#if defined(_WIN64)
typedef union _SLIST_HEADER {
    ULONGLONG Alignment;
    struct {
        ULONGLONG Depth    : 16;  /* item count */
        ULONGLONG Sequence : 9;   /* sequence (ABA) counter -- 9 bits,
                                     per the discussion below */
        ULONGLONG Next     : 39;  /* first entry, stored as an aligned
                                     offset rather than a raw pointer */
    };
} SLIST_HEADER, *PSLIST_HEADER;
#endif

See ntddk.h
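
Illustrative only (not the actual kernel code): because entries are
aligned to MEMORY_ALLOCATION_ALIGNMENT (16 bytes on x64), the low 4 bits
of every entry address are zero, so the 39-bit Next field can span a
43-bit virtual address range.

#define NEXT_FROM_PTR(p) ((ULONGLONG)(p) >> 4)                  /* store */
#define PTR_FROM_NEXT(n) ((PSLIST_ENTRY)((ULONGLONG)(n) << 4))  /* load  */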

"Tim Roberts" <ti...@probo.com> wrote in message
news:a58qj3p6jmkts8vti...@4ax.com...

Ben Voigt [C++ MVP]

unread,
Nov 16, 2007, 11:20:57 PM11/16/07
to

"Pops" <du...@nospamnospamnospam.com> wrote in message
news:%23wNxgRF...@TK2MSFTNGP02.phx.gbl...

Did you try a multiple writer scenario, or reduced waits where you could
have some contention?

In particular, don't use Sleep to simulate processing time; use a loop with
a volatile variable to make sure the compiler doesn't eliminate it.
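
One way to do that, sketched under the assumption that the delay is meant
to model CPU-bound work (the iteration count is a tuning knob, not a unit
of time, so calibrate it against a timer once at startup):

/* Burn CPU with work the optimizer cannot delete: the volatile
   qualifier forces every store, so the loop survives optimization
   and the thread stays runnable and contending instead of sleeping. */
void SimulateProcessing(unsigned long iterations)
{
    volatile unsigned long sink = 0;
    for (unsigned long i = 0; i < iterations; ++i)
        sink += i;
}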

Chris Thomasson

unread,
Nov 17, 2007, 12:28:46 AM11/17/07
to
"Joe Seigh" <jsei...@xemaps.com> wrote in message
news:_3f%i.7212$RR1.5517@trnddc02...

>m wrote:
>> Yes I am extending the discussion to include all kinds of data structures
>> besides queues; and yes there are various ways to protect them. No I don’t
>> really want to discuss them in detail, I just wanted to put something out
>> there to see if I could tweak someone’s interest ;)
>>
> You made a generalization and applied it to the special case.
>>
>>
>> Thread pre-emption is important on all pre-emptive multi-tasking
>> operating systems regardless of how many CPUs are present in the system.
>> The importance of this consideration lies in the fact that between any
>> two instructions executed by thread A, thread B may have executed a large
>> number of instructions (either on this CPU or on another one) or none at
>> all. The reason this is, implicitly, important is that, even though one
>> could use a lock-free algorithm for nearly anything, they are
>> specifically designed for HPC scenarios with multiple producers & multiple
>> consumers (or more generally, multiple modifiers).
>
>
> Yes, preemption causes more variation in a thread's rate of progress, but
> so what?
> In multi-threading you don't make any assumptions about the rate of
> forward
> progress, preemption or not.

Yeah, especially since any predicate that is based on an assumption about the
non-deterministic rate of a thread's forward progress is basically a
race-condition in and of itself. I think it's similar to all the reasons
against implementing RCU with timers; reader threads just might decide to
stay in the active read epoch longer than the timer allows for... Not good.


> And it's certainly news to me that lock-free
> is mainly important to HPC and not so much for other things.

IMHO, I find the reader pattern to be the most useful overall technique.

[...]

Vladimir Petter [MSFT]

unread,
Nov 17, 2007, 1:07:45 AM11/17/07
to
You can find more info on this here http://www.alex-ionescu.com/?p=50
Vladimir Petter. [microsoft]

"Alexander Grigoriev" <al...@earthlink.net> wrote in message
news:#UdcMqFK...@TK2MSFTNGP04.phx.gbl...

Chris Thomasson

unread,
Nov 17, 2007, 12:26:30 PM11/17/07
to

"Vladimir Petter [MSFT]" <vla...@hotmail.com> wrote in message
news:u$CatAOKI...@TK2MSFTNGP04.phx.gbl...


> You can find more info on this here http://www.alex-ionescu.com/?p=50
> Vladimir Petter. [microsoft]


Thanks for that. Humm... So, 64-bit Windows running on x86-64 systems that
lack the CMPXCHG16B instruction uses only 9 bits for an ABA counter; I think
that may be a bit dangerous. That counter will be rolling over quite
frequently under load... Humm:

http://groups.google.com/group/comp.programming.threads/browse_frm/thread/133f764f1bc0c9f4


Any thoughts?

Alexander Grigoriev

unread,
Nov 17, 2007, 5:46:26 PM11/17/07
to

"Chris Thomasson" <cri...@comcast.net> wrote in message
news:BfKdnfd2btBLv6La...@comcast.com...

> "Vladimir Petter [MSFT]" <vla...@hotmail.com> wrote in message
> news:u$CatAOKI...@TK2MSFTNGP04.phx.gbl...
>> You can find more info on this here http://www.alex-ionescu.com/?p=50
>> Vladimir Petter. [microsoft]
>
>
> Thanks for that. Humm... So, 64-bit Windows running on x86-64 systems that
> lack the CMPXCHG16B instruction uses only 9 bits for an ABA counter; I think
> that may be a bit dangerous. That counter will be rolling over quite
> frequently under load... Humm:
>
> http://groups.google.com/group/comp.programming.threads/browse_frm/thread/133f764f1bc0c9f4
>
>
> Any thoughts?

The counter could roll over, but the probability of hitting the same value is
astronomically low. If there is too much ABA contention, the counter will
simply move forward, but not by 512 steps at once.


Chris Thomasson

unread,
Nov 17, 2007, 11:53:17 PM11/17/07
to
"Chris Thomasson" <cri...@comcast.net> wrote in message
news:6budnUf11cP9haDa...@comcast.com...

> "Ben Voigt [C++ MVP]" <r...@nospam.nospam> wrote in message
> news:uwEcTw5J...@TK2MSFTNGP03.phx.gbl...
>>
>> "Chris Thomasson" <cri...@comcast.net> wrote in message
>> news:I5-dnZi55r10Faba...@comcast.com...
>>> "Ben Voigt [C++ MVP]" <r...@nospam.nospam> wrote in message
>>> news:O2Xn5DxJ...@TK2MSFTNGP03.phx.gbl...
> [...]
>>>>
>>>> Please take a look at my queue implementation and let me know if you
>>>> see any problem (I realize it isn't truely lock-free until I replace
>>>> the allocator):
>>>>
>>> [...]
>>>
>>> Good news! ;^)
> [...]
>
>> Thanks for the review. I was pretty sure it was correct but it's very
>> nice to have concurrence from an expert.
>
> You're welcome. I think your SQueue class is sound wrt the overall concerns
> of the ABA problem in general. I also think that the code correctly
> addresses the concerns of deallocating nodes from lock-free data-structures.
> You have a performance enhancement on the SQueue<T>::Get() function
> because it's 100% wait-free as long as the hardware that Windows happens to
> be running on supports an atomic swap instruction; Intel does. SPARC used
> to support the SWAP instruction; they deprecated it, and delegated its
> functionality over to the CAS instruction on 32-bit systems, and CASX on
> 64-bit systems.

http://groups.google.com/group/comp.arch/msg/04cb5e2ca2a7e19a

Damn!
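
For illustration, the swap-based consume that the quote above describes as
wait-free can be sketched like this (assumed names; this is not Ben's
actual SQueue code):

#include <windows.h>

struct Node { struct Node* next; int value; };

/* The consumer's "take everything" step is a single atomic exchange
   of the head with NULL -- one instruction, no retry loop, hence
   wait-free regardless of what the producers are doing. */
struct Node* FlushAll(struct Node* volatile* head)
{
    return (struct Node*)InterlockedExchangePointer(
        (PVOID volatile*)head, NULL);
}

/* Producers stay lock-free (a CAS retry loop, not wait-free). */
void Push(struct Node* volatile* head, struct Node* n)
{
    PVOID old;
    do {
        old = *head;
        n->next = (struct Node*)old;
    } while (InterlockedCompareExchangePointer(
                 (PVOID volatile*)head, n, old) != old);
}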

Pops

unread,
Nov 19, 2007, 2:54:26 AM11/19/07
to
Hi Ben, I hope you enjoyed your weekend. :-) See inline comments.

Ben Voigt [C++ MVP] wrote:

> Did you try a multiple writer scenario, or reduced waits
> where you could have some contention?

Yes.

It was pretty much all the expected issues and results wrt synchronizing
and/or controlling queuing and dequeuing rates, including using the
proper methods for the design at hand, i.e., pop vs. flush.

> Especially don't use Sleep to simulate processing time,
> use a loop with a volatile variable to make sure the compiler
> doesn't eliminate it.

Not sure what you were thinking at the time you wrote this, but it all
depends on what you are simulating. I won't attempt to elaborate too
deeply, but just in case it was misunderstood, I did not intend to imply
that the parameter READER_PROCESS_DELAY, "Simulate processing time,"
exclusively means CPU time, if that was what you were thinking.

To me, the logic to simulate processing time implies doing "things" that
emulate, as near as possible, the real-world behavior of your data-handling
design, which may include considerations such as HW-related
interrupts, natural time-slicing pre-emptions, as well as CPU time, etc.,
all to create situations which will help find contention, deadlock,
timing, synchronization, and optimization issues.

I use various simulation techniques, including combinations of these:
sleeps, yields/pokes, software-based timers, math computations, high
memory loads, file I/O, RPC calls (which are critically important for us,
because this is where most of the bottlenecks might occur), and, for
GUI-related issues, posting or sending messages, message handlers with
simulated delays, etc.

That said, I'm curious. I have various methods for software timers. Can
you provide an example of one using volatile? If we're referring to the
same thing, I wasn't sure if you were thinking of HW-based timers vs.
SW-based timers.

Thanks

--
HLS

Chris Thomasson

unread,
Nov 22, 2007, 1:11:09 AM11/22/07
to
"Alexander Grigoriev" <al...@earthlink.net> wrote in message
news:%23c$YxuWKI...@TK2MSFTNGP06.phx.gbl...

It's good to keep in mind that we are talking about solutions that are
analogous to managed race-conditions in their essence. I think that PDR is a
good method to use wrt addressing the ABA problem in general. The depth of
the counter is directly proportional to the "chance" of the race-condition
not raising its ugly head right in the middle of a critical operation.

Chris Thomasson

unread,
Nov 22, 2007, 1:41:02 AM11/22/07
to
"Alexander Grigoriev" <al...@earthlink.net> wrote in message
news:%23c$YxuWKI...@TK2MSFTNGP06.phx.gbl...

They augment the algorithm with a "counter" (e.g., the counter is inc'd
for every push, and dec'd for every pop). This is not a monotonic counter,
therefore it's not effective for solving ABA in general. You either need
to manage memory and solve ABA externally, or solve it within the state
of a specific algorithm.

Alexander Grigoriev

unread,
Nov 22, 2007, 9:13:22 AM11/22/07
to
The structure we're discussing has a separate 16-bit depth count and a
9-bit sequence count. The sequence count increases monotonically.

"Chris Thomasson" <cri...@comcast.net> wrote in message

news:98OdnXUQ0LWfvtja...@comcast.com...

Chris Thomasson

unread,
Nov 22, 2007, 6:35:29 PM11/22/07
to
"Alexander Grigoriev" <al...@earthlink.net> wrote in message
news:%23HpYWHR...@TK2MSFTNGP05.phx.gbl...

> The structure we're discussing has a separate 16-bit depth count and 9 bit
> sequence count. Sequence count increases monotonically.

Right. So effectively it's only a 9-bit ABA counter. The ABA problem can trip
up the 16-bit depth count by popping and pushing the same item: a depth of 1,
changed to a depth of 0, and back to 1 again; ABA...
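
A sketch of why (hypothetical layout mirroring the bit-fields posted
earlier; a 9-bit sequence wraps every 512 modifications, so a thread
stalled across an exact multiple of 512 operations can see a bit-for-bit
identical head word):

#include <windows.h>

union PackedHead {
    LONG64 raw;
    struct {
        ULONGLONG Depth    : 16;
        ULONGLONG Sequence : 9;   /* the ABA counter in question  */
        ULONGLONG Next     : 39;  /* aligned offset of first item */
    } s;
};

/* One CAS attempt to install a new first entry. */
int TryPush(volatile LONG64* head, union PackedHead oldv, ULONGLONG next)
{
    union PackedHead newv;
    newv.s.Depth    = oldv.s.Depth + 1;
    newv.s.Sequence = oldv.s.Sequence + 1;   /* silently wraps at 512 */
    newv.s.Next     = next;
    /* Succeeds if the head word is unchanged -- or if a wrapped
       sequence merely makes a changed head *look* unchanged (ABA). */
    return InterlockedCompareExchange64(head, newv.raw, oldv.raw)
           == oldv.raw;
}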

Alexander Grigoriev

unread,
Nov 22, 2007, 10:17:04 PM11/22/07
to
This is what the sequence number is for.

"Chris Thomasson" <cri...@comcast.net> wrote in message

news:F-ydnQxWxLBYjdva...@comcast.com...

Chris Thomasson

unread,
Nov 22, 2007, 10:50:11 PM11/22/07
to
"Alexander Grigoriev" <al...@earthlink.net> wrote in message
news:OGWKX%23XLIH...@TK2MSFTNGP05.phx.gbl...

> This is what sequence number is for.

[...]
A 9-bit sequence number is small. Back when IBM did this sort of thing, the
processors were not as fast as they are now...
