http://msdn2.microsoft.com/en-us/library/ms686962.aspx
This was suggested as a more efficient synchronization method in a recent
thread discussion regarding a FIFO queue design.
If Ben is reading or if anyone else is familiar with this API, I have a
few questions:
First, based on what I see, this is for LIFO designs (i.e., stacks), not
FIFO. Am I missing something? I only see Push/Pop functions, no
head/tail functions.
Second, can you tell how it works with IPC? Can it work for IPC?
Thanks
--
HLS
We were discussing a multiple writer, single consumer queue. For this the
writers each use the InterlockedPushEntrySList function, while the consumer
uses the InterlockedFlushSList function to retrieve the entire queue as an
atomic operation. The consumer is then responsible for iterating through
the list of items retrieved in FIFO order.
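A minimal sketch of that pattern (the work_item type and helper names here
are just for illustration, not part of the API):
____________________________________________________________
#include <windows.h>

/* sample payload; nodes must be allocated with at least
   MEMORY_ALLOCATION_ALIGNMENT (e.g., via _aligned_malloc) */
typedef struct work_item_s {
    SLIST_ENTRY entry;        /* must be the first member */
    int payload;
} work_item;

static SLIST_HEADER g_queue;  /* InitializeSListHead(&g_queue) once at startup */

/* any number of writers: one atomic push per item */
void producer_post(work_item* item)
{
    InterlockedPushEntrySList(&g_queue, &item->entry);
}

/* the single consumer: detach the whole list atomically, then reverse
   the LIFO chain so the items are processed in FIFO order */
void consumer_drain(void (*process)(work_item*))
{
    PSLIST_ENTRY e = InterlockedFlushSList(&g_queue);
    PSLIST_ENTRY fifo = NULL;
    while (e) {               /* reverse the chain */
        PSLIST_ENTRY nx = e->Next;
        e->Next = fifo;
        fifo = e;
        e = nx;
    }
    while (fifo) {            /* process oldest-first */
        PSLIST_ENTRY nx = fifo->Next;
        process((work_item*)fifo);
        fifo = nx;
    }
}
____________________________________________________________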
>
> Second, can you tell how it works with IPC? Can it work for IPC?
Because there is no name assigned, the SList cannot establish an IPC
channel. However, because the caller has control over the placement of all
data structures involved, it should be possible to use with shared memory.
It would be highly desirable in such cases to obtain the same virtual
address space in each of the processes. If this is not possible, then you
probably can't use the Win32 API implementation, though the underlying
algorithm will still work (using relative addresses / indices instead of
pointers).
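For example, something along these lines (purely a sketch; every name here
is hypothetical):
____________________________________________________________
#include <windows.h>

#define SHM_NIL ((LONG)-1)    /* "null" expressed as a node index */

/* nodes linked by array index rather than pointer, so the structure
   still works if each process maps the section at a different address */
typedef struct shm_stack_node_s {
    LONG next;                /* index of next node; SHM_NIL terminates */
    int payload;
} shm_stack_node;

typedef struct shm_stack_s {
    volatile LONG head;       /* index of top node; SHM_NIL when empty */
    shm_stack_node nodes[1024];
} shm_stack;                  /* lives inside the shared-memory section */

/* multiple writers: lock-free push of node index i */
void shm_stack_push(shm_stack* s, LONG i)
{
    LONG cmp;
    do {
        cmp = s->head;
        s->nodes[i].next = cmp;
    } while (InterlockedCompareExchange(&s->head, i, cmp) != cmp);
}

/* single consumer: atomically detach the whole chain; this is a swap
   rather than a compare, so ABA is not a concern here */
LONG shm_stack_flush(shm_stack* s)
{
    return InterlockedExchange(&s->head, SHM_NIL);
}
____________________________________________________________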
>
> Thanks
>
> --
> HLS
>
>
>
> We were discussing a multiple writer, single consumer queue. For this the
> writers each use the InterlockedPushEntrySList function, while the consumer
> uses the InterlockedFlushSList function to retrieve the entire queue as an
> atomic operation. The consumer is then responsible for iterating through
> the list of items retrieved in FIFO order.
Gotcha!
>> Second, can you tell how it works with IPC? Can it work for IPC?
>
> Because there is no name assigned, the SList cannot establish an IPC
> channel. However, because the caller has control over the placement of all
> data structures involved, it should be possible to use with shared memory.
> It would be highly desirable in such cases to obtain the same virtual
> address space in each of the processes. If this is not possible, then you
> probably can't use the Win32 API implementation, though the underlying
> algorithm will still work (using relative addresses / indices instead of
> pointers).
Ok. Right.
I was interested to see if it can be applied in our virtual I/O
communication/channel framework, which currently uses RPC. It probably
won't apply to remote machine operations, but it might prove useful as an
improved alternative to local RPC operations.
My whole mantra these days is to increase client/server performance and
scalability, so this API might just have some purpose for us.
--
HLS
Yes. You can create a node cache in shared memory and allocate/deallocate
your SLIST_ENTRY from there. Something like:
(quick and dirty pseudo-code sketch)
____________________________________________________________
#define L2CACHE_LINESZ() 128
#define CACHE_DEPTH() 1024
typedef struct shm_node_s shm_node;
typedef struct shm_node_data_s shm_node_data;
typedef struct shm_node_cache_s shm_node_cache;
struct shm_node_data_s {
/* some data */
};
struct shm_node_s {
SLIST_ENTRY slnode;
shm_node_data data;
};
struct shm_node_cache_s {
SLIST_HEADER anchor;
char pad[L2CACHE_LINESZ() - sizeof(SLIST_HEADER)];
shm_node nodes[CACHE_DEPTH()];
};
shm_node_cache*
shm_node_cache_create(
char const* const name
) {
int i;
shm_node_cache* const _this =
/* create file map with name, alloc and align shm */
InitializeSListHead(&_this->anchor);
for(i = 0; i < CACHE_DEPTH(); ++i) {
InterlockedPushEntrySList(&_this->anchor,
&_this->nodes[i].slnode);
}
return _this;
}
shm_node*
shm_node_cache_pop(
shm_node_cache* const _this
) {
return (shm_node*)InterlockedPopEntrySList(&_this->anchor);
}
void
shm_node_cache_push(
shm_node_cache* const _this,
shm_node* const node
) {
InterlockedPushEntrySList(&_this->anchor, &node->slnode);
}
Now you can create your queue. Something like:
typedef struct shm_queue_s shm_queue;
struct shm_queue_s {
SLIST_HEADER anchor;
char pad[L2CACHE_LINESZ() - sizeof(SLIST_HEADER)];
shm_node_cache* cache;
};
shm_queue*
shm_queue_create(
shm_node_cache* const cache,
char const* const name
) {
shm_queue* const _this =
/* create file map with name, alloc and align shm */
InitializeSListHead(&_this->anchor);
_this->cache = cache;
return _this;
}
int
shm_queue_process(
shm_queue* const _this,
void (*fp_process) (shm_node_data*)
) {
int i = 0;
shm_node* node = (shm_node*)InterlockedFlushSList(&_this->anchor);
/* note: the flushed chain is in LIFO order; reverse it here
first if FIFO processing order is required */
while (node) {
shm_node* const nx = (shm_node*)node->slnode.Next;
fp_process(&node->data);
shm_node_cache_push(_this->cache, node);
++i;
node = nx;
}
return i;
}
BOOL
shm_queue_push(
shm_queue* const _this,
shm_node_data* const data
) {
shm_node* const node = shm_node_cache_pop(_this->cache);
if (! node) { return FALSE; }
node->data = *data;
InterlockedPushEntrySList(&_this->anchor, &node->slnode);
return TRUE;
}
____________________________________________________________
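Usage would look something like this (still pseudo-code; my_process_fn is
hypothetical):
____________________________________________________________
void my_process_fn(shm_node_data* data); /* consumer callback */

shm_node_cache* cache = shm_node_cache_create("Local\\mycache");
shm_queue* q = shm_queue_create(cache, "Local\\myqueue");

/* producers (any thread/process mapping the same sections) */
shm_node_data d;
/* ... fill d ... */
shm_queue_push(q, &d);

/* the single consumer drains and recycles the nodes */
shm_queue_process(q, my_process_fn);
____________________________________________________________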
Also, for what it's worth, here is how Microsoft implements the SList API
under 32-bit systems:
typedef struct slist_anchor_s slist_anchor;
typedef struct slist_node_s slist_node;
struct slist_node_s {
slist_node* nx;
};
struct slist_anchor_s {
slist_node* head;
/* monotonic counter to avoid the infamous aba problem */
__int32 aba;
};
void
slist_anchor_init(
slist_anchor* const _this
) {
_this->head = 0;
_this->aba = 0;
}
void
slist_anchor_push(
slist_anchor* const _this,
slist_node* node
) {
slist_node* cmp;
do {
cmp = _this->head;
node->nx = cmp;
} while (! CAS(&_this->anchor, cmp, node));
}
slist_node*
slist_anchor_pop(
slist_anchor* const _this
) {
slist_anchor cmp, xchg;
do {
cmp = *_this;
if (! cmp.head) { return 0; }
/* hazard! refer to PDR algorithm for more details... */
xchg.head = cmp.head->nx;
/* increment ABA */
xchg.aba = cmp.aba + 1;
} while (! DWCAS(_this, &cmp, &xchg));
return cmp.head;
}
Any thoughts?
FWIW, if you want to create a blocking lock-less queue (e.g., the consumer
waits while the queue is empty), well, here is an algorithm I invented a
while back:
http://groups.google.de/group/comp.programming.threads/msg/632b6bdc2a137459
__________________________________________________
< pseudo-code >
typedef struct node_
{
struct node_ *next;
const void *state;
} node_t;
typedef struct stack_anchor_
{
node_t *front;
unsigned __int32 state;
} volatile stack_anchor_t;
typedef struct stack_
{
stack_anchor_t anchor;
HANDLE waitset;
} stack_t;
/* returns non-zero and updates comparand on failure */
extern int dwcas( volatile void *dest, void *cmp, const void *xchg );
void push( stack_t *_this, node_t *node )
{
stack_anchor_t xchg, cmp;
xchg.front = node;
cmp = _this->anchor;
do
{
node->next = cmp.front;
if ( cmp.state & 0x0000FFFF )
{
xchg.state = cmp.state - 1; /* waiters */
}
else { xchg.state = cmp.state; }
} while ( dwcas( &_this->anchor, &cmp, &xchg ) );
if ( cmp.state & 0x0000FFFF )
{
/* wake one */
if ( ! ReleaseSemaphore
( _this->waitset,
1,
0 ) )
{
assert( 0 ); abort();
}
}
}
node_t* pop( stack_t *_this )
{
stack_anchor_t xchg, cmp = _this->anchor;
for ( ;; )
{
if ( cmp.front ) /* condition predicate */
{
xchg.front = cmp.front->next; /* hazard */
xchg.state = cmp.state + 0x10000; /* aba */
}
else /* stack is empty */
{
xchg.front = cmp.front;
xchg.state = cmp.state + 1; /* waiters */
}
if ( ! dwcas( &_this->anchor, &cmp, &xchg ) )
{
if ( cmp.front ) { break; }
if ( WaitForSingleObject
( _this->waitset,
INFINITE ) != WAIT_OBJECT_0 )
{
assert( 0 ); abort();
}
cmp = _this->anchor;
}
}
return cmp.front;
}
___________________________________________________________
I have a question as well. Do you know if Microsoft uses a pointer
compression algorithm to implement their SList API on 64-bit systems? I
think I remember Neill Clift, who works on the Windows kernel, mentioning
something about that, but I can't seem to find the message on Google right
now.
I should point out that the dwcas used in that algorithm uses an IBM-style
implementation: it automatically updates the comparand on failure. Here is
my implementation in x86 assembly:
int
np_ac_i686_atomic_dwcas_fence(void*, void*, void*);
.globl np_ac_i686_atomic_dwcas_fence
np_ac_i686_atomic_dwcas_fence:
pushl %esi
pushl %ebx
movl 16(%esp), %esi
movl (%esi), %eax
movl 4(%esi), %edx
movl 20(%esp), %esi
movl (%esi), %ebx
movl 4(%esi), %ecx
movl 12(%esp), %esi
lock cmpxchg8b (%esi)
jne np_ac_i686_atomic_dwcas_fence_fail
xorl %eax, %eax
popl %ebx
popl %esi
ret
np_ac_i686_atomic_dwcas_fence_fail:
movl 16(%esp), %esi
movl %eax, (%esi)
movl %edx, 4(%esi)
movl $1, %eax
popl %ebx
popl %esi
ret
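To show the calling convention, here is a rough sketch of how a retry loop
might drive that routine from C (the anchor and node types are illustrative):
____________________________________________________________
typedef struct node_s { struct node_s* next; } node_t;

typedef struct anchor_s {
    node_t* head;
    unsigned __int32 aba;
} anchor_t;

extern int np_ac_i686_atomic_dwcas_fence(void*, void*, void*);

/* the comparand is refreshed on failure, so the retry loop
   never has to re-read *dest itself */
void anchor_push(anchor_t* dest, node_t* node)
{
    anchor_t cmp = *dest, xchg;
    do {
        node->next = cmp.head;  /* link ahead of the current head */
        xchg.head = node;
        xchg.aba = cmp.aba;     /* producers need not bump the counter */
    } while (np_ac_i686_atomic_dwcas_fence(dest, &cmp, &xchg));
}
____________________________________________________________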
> We were discussing a multiple writer, single consumer queue. For this the
> writers each use the InterlockedPushEntrySList function, while the consumer
> uses the InterlockedFlushSList function to retrieve the entire queue as an
> atomic operation. The consumer is then responsible for iterating through
> the list of items retrieved in FIFO order.
Ben, just a few more questions.
Sorry if I seem dense about this to you, but I would like to get this
clarification.
When I see you throwing in the multiple writer, single consumer queue
premise above, I now get the idea that you introduced this
Interlocked Singly Linked List API to me for this particular scenario
only - one consumer.
In other words, is this API thread safe? The docs say "nonblocking
algorithm to provide atomic synchronization" which I presume it is using
some internal lock-free shared memory method.
But is it protected from multiple threads? I don't see how it can guard
a flush once the initial pointer is read without some lock.
TIA
--
HLS
"Pops" <du...@nospamnospamnospam.com> wrote in message
news:%23uMo9hu...@TK2MSFTNGP03.phx.gbl...
> Ben Voigt [C++ MVP] wrote:
>
>> We were discussing a multiple writer, single consumer queue. For this
>> the writers each use the InterlockedPushEntrySList function, while the
>> consumer uses the InterlockedFlushSList function to retrieve the entire
>> queue as an atomic operation. The consumer is then responsible for
>> iterating through the list of items retrieved in FIFO order.
You can use multiple consumers with that particular technique:
http://groups.google.com/group/comp.arch/msg/6ac06e17e5138820
http://groups.google.ca/group/comp.programming.threads/msg/a359bfb41c68b98b
> Ben, just a few more questions.
>
> Sorry if I seem dense about this to you, but I would like to get this
> clarification.
>
> When I see you throwing in the multiple writer, single consumer queue
> premise above, I now get the idea that you introduced this
> Interlocked Singly Linked List API to me for this particular scenario
> only - one consumer.
>
> In other words, is this API thread safe? The docs say "nonblocking
> algorithm to provide atomic synchronization" which I presume it is using
> some internal lock-free shared memory method.
>
> But is it protected from multiple threads?
The API is 100% atomically thread-safe. You can hit the following functions
with any number of threads concurrently:
______________________________________
InterlockedFlushSList
InterlockedPopEntrySList
InterlockedPushEntrySList
QueryDepthSList
______________________________________
BTW, this algorithm has been around for ages, as you probably know. In
fact, the IBM Principles of Operation gives a more detailed description of
why it is atomically thread-safe:
http://publibz.boulder.ibm.com/epubs/pdf/dz9zr002.pdf
(Appendix A/Multi-Processing Examples/Free-Pool Manipulation)
Did that help clear some things up for you?
> I don't see how it can guard a flush once the initial pointer is read
> without some lock.
The flush is implemented with a double-width compare-and-swap function that
can atomically mutate the entire SLIST_HEADER data-structure:
http://groups.google.co.uk/group/comp.programming.threads/browse_frm/thread/b5bedbe04e939f55
(I would advise to read entire thread...)
On x86-32 the instruction is CMPXCHG8B. On some x86-64s the instruction is
CMPXCHG16B. On the Itanium, there is a weird instruction called CMP8XCHG16B.
This instruction is tricky to use because you cannot count on it being
available on most 64-bit systems. For instance, I do a lot of work on the
SPARC, which has the CASX instruction that works on 64-bit datum in 32-bit
mode; however, on SPARC64 the CASX instruction only works on 64-bit datum.
This can be a royal pain in the neck if you're into inventing new lock-free
algorithms:
http://groups.google.com/group/comp.arch/browse_frm/thread/71f8e0094e353e5
(I would advise to read entire thread...)
There are some nifty tricks you can do to get around this nightmare. I
outline some of them here:
http://groups.google.com/group/comp.arch/msg/a3ebfe80363a4399
(examples toward end of message...)
Any thoughts on this?
The CAS in slist_anchor_push should read as:
} while (! CAS(&_this->head, cmp, node));
I forgot to show the flush. Here is how to do that:
slist_node*
slist_anchor_flush(
slist_anchor* const _this
) {
return SWAP(&_this->head, 0);
}
The reason I don't need to bump the monotonic aba counter for the flush is
because it's immune to the problem in the first place. Here is why:
LIFO producers are inherently not affected by ABA. On the other hand, LIFO
consumers are affected, but only if there is a "compare" type method being
used for synchronization. Using an atomic swap to set the stack anchor
to zero gets around ABA wrt LIFO lock-free stacks. Was that similar to
your initial solution?
> slist_node*
> slist_anchor_pop(
> slist_anchor* const _this
> ) {
> slist_anchor cmp, xchg;
> do {
> cmp = *_this;
> if (! cmp.head) { return 0; }
>
> /* hazard! refer to PDR algorithm for more details... */
> xchg.head = cmp.head->nx;
>
> /* increment ABA */
> xchg.aba = cmp.aba + 1;
>
> } while (! DWCAS(_this, &cmp, &xchg));
>
> return cmp.head;
> }
>
>
>
>
> Any thoughts?
BTW, would anybody like to see an actual implementation of this in shared
memory? If so, I will stick it on my to-do list.
;^)
Yes, but with multiple consumers you
(a) won't get work evenly distributed
(b) ordering becomes rather meaningless, and Pops specified FIFO
[snip]
>> I don't see how it can guard a flush once the initial pointer is read
>> without some lock.
>
> The flush is implemented with a double-width compare-and-swap function that
> can atomically mutate the entire SLIST_HEADER data-structure:
AFAICT, you need only pointer-sized CAS in order to make the singly-linked
list work, at least I wrote my own version in C++ for template-driven
type-safety and used only pointer-sized functions (each element can only be
in the list once, maybe that helps).
You could use a work-stealing scheme to provide automatic distribution.
> (b) ordering becomes rather meaningless, and Pops specified FIFO
Yes. Good point.
> [snip]
>
>>> I don't see how it can guard a flush once the initial pointer is read
>>> without some lock.
>>
>> The flush is implemented with a double-width compare-and-swap function that
>> can atomically mutate the entire SLIST_HEADER data-structure:
>
> AFAICT, you need only pointer-sized CAS in order to make the singly-linked
> list work, at least I wrote my own version in C++ for template-driven
> type-safety and used only pointer-sized functions (each element can only
> be in the list once, maybe that helps).
Okay. Well, you can't immediately reuse or free nodes. So, if you can
guarantee that a node will never be popped off a list, and reused right away
when there could be other threads concurrently popping, then you're going to
need to take the ABA problem into account. Read here:
http://publibz.boulder.ibm.com/epubs/pdf/dz9zr002.pdf
(Appendix A/Multi-Processing Examples/Free-Pool Manipulation/Second
Paragraph)
This explains the ABA race-condition in detail. After reading that, you will
understand why it's important to have some sort of solution to ABA. DWCAS
is just one way to do it.
You cannot use pointer-sized CAS to remove a node from a lock-free LIFO
unless you have an external method to get around the ABA problem in general.
ABA is what happens when a thread T1 reads A, and then thread T2 uses CAS to
change A to B, then back to A. Thread T1 then performs its CAS, and it will
still succeed. This is fine for some algorithms; however, it's not okay for
others. For instance, if you don't have an external solution for ABA, and you
use pointer-sized CAS to try and pop an item off a lock-free linked-list,
well, that's a massive race-condition waiting to happen.
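To make that concrete, here is the classic interleaving sketched as a
deliberately broken pointer-sized CAS pop (illustrative code only):
____________________________________________________________
typedef struct node_s { struct node_s* next; } node_t;

/* BROKEN: pointer-sized CAS pop with no ABA protection */
node_t*
broken_pop(
node_t* volatile* head
) {
node_t *cmp, *nx;
do {
cmp = *head;
if (! cmp) { return 0; }
/* T1 reads cmp == A and nx == B here, then stalls... */
nx = cmp->next;
/* ...meanwhile T2 pops A, pops B, frees B, and pushes A back.
*head is A again, so the CAS below still succeeds and installs
the dangling pointer B as the new head. */
} while (InterlockedCompareExchangePointer(
(void* volatile*)head, nx, cmp) != cmp);
return cmp;
}
____________________________________________________________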
It sounds like you already solve ABA by enforcing the fact that a node will
only be in the list one time. However, you cannot deallocate nodes either,
unless you use a lifetime management scheme to get around the hazard in the
pop function:
_________________________________________
slist_node*
slist_anchor_pop(
slist_anchor* const _this
) {
slist_anchor cmp, xchg;
do {
cmp = *_this;
if (! cmp.head) { return 0; }
/* hazard! refer to PDR algorithm for more details... */
xchg.head = cmp.head->nx;
/* increment ABA */
xchg.aba = cmp.aba + 1;
} while (! DWCAS(_this, &cmp, &xchg));
return cmp.head;
}
_________________________________________
You need to be able to get around the hazard. Here is a paper that describes
most of the concerns:
http://www.research.ibm.com/people/m/michael/podc-2002.pdf
(read all...)
You can hit the ABA scenario if you pop node A; free node A; allocate a new
node that happens to have the same address as the old node A; and push it back
into the list. This would change the head of the list from A, to something
else, and right back to A.
> So, if you can
^^^^^^^^^^^^^^^^^^^^^^^^
Should read as:
So, if you cannot
> guarantee that a node will never be popped off a list, and reused right
> away when there could be other threads concurrently popping, then you're
> going to need to take the ABA problem into account. Read here:
[...]
___________________________________________________________________
Okay. Well, you can't immediately reuse or free nodes. So, if you _cannot_
guarantee that a node will never be popped off a list, and reused right away
when there could be other threads concurrently popping, then you're going to
need to take the ABA problem into account. Read here:
http://publibz.boulder.ibm.com/epubs/pdf/dz9zr002.pdf
(Appendix A/Multi-Processing Examples/Free-Pool Manipulation/Second
Paragraph)
This explains the ABA race-condition in detail. After reading that, you will
understand why it's important to have some sort of solution to ABA. DWCAS
is just one way to do it.
___________________________________________________________________
Sorry for any confusion!
After thinking about it for a bit, I do not think ABA is actually a problem.
The only way that InterlockedCompareExchange will succeed is if the old head
pointer is what I have stored in the next-node pointer of the new head,
therefore I have correctly inserted a new node regardless of what
manipulations took place in the meantime.
Please take a look at my queue implementation and let me know if you see any
problem (I realize it isn't truly lock-free until I replace the allocator):
#pragma once
#include "Queuefwd.h"
#pragma managed(push, off)
/**
** \brief thread-safe queue, single element append, all elements retrieved
**
** High-performance lockless queue for multi-threading applications. The
** order of the elements is preserved, but elements cannot be individually
** removed. Intended application is for multiple threads to post work items
** with a single reader.
**
** <center>
** \dotfile squeue-concept.dot "API Concept"
**
** \dotfile squeue-filled.dot "Typical State"
** </center>
**/
template<typename T>
class SQueue climodifier(sealed)
{
    /**
    ** \brief internal linked-list implementation of queue for max
    ** performance, follows ordinary singly-linked-list pattern
    **/
    struct SQueueElement climodifier(sealed)
    {
        /**
        ** \brief pointer to next list element, ala singly-linked-list
        ** pattern
        **
        ** NULL for the last element in the list
        **/
        SQueueElement* pNext;
        /**
        ** \brief data storage pigeonhole, ala singly-linked-list pattern
        **/
        const T data;
        SQueueElement( const T& dataInit )
            : data(dataInit)
        {}
    private:
        SQueueElement( const SQueueElement& );
        SQueueElement& operator=(const SQueueElement&);
    };
    /**
    ** \brief pointer to first list element, ala singly-linked-list pattern
    **
    ** NULL if the list is empty.
    **
    ** Since the elements are stored in reverse order to leverage
    ** the compare-exchange primitive, this first list position holds
    ** the last (most recent) queue element.
    **
    ** \invariant m_pFirst holds nullptr or the address of an SQueueElement.
    **/
    void* volatile m_pFirst;
    //static lenarray<T>* TransferElements(SQueueElement* pStart, size_t startIndex);
public:
    //! \brief Constructs a new empty queue
    SQueue( void ) { m_pFirst = nullptr; }
    //! \brief Adds an element to the end of the queue
    void Add(const T& data);
    //! \brief Retrieves all elements, removing them from the queue
    lenarray<T>* Get(void);
};
/**
** Appends the data parameter to the queue. A copy
** of data is made using T::operator=(const T&)
**
** \param[in] data Value to be added as a new tail element
**/
template<typename T>
void SQueue<T>::Add( const T& data )
{
    SQueueElement* pNewHead = new SQueueElement(data);
    void* pCurrent = m_pFirst;
    void* pOldHead;
    do {
        pOldHead = pCurrent;
        pNewHead->pNext = static_cast<SQueueElement*>(pOldHead);
    } while ((pCurrent = ::InterlockedCompareExchangePointer(&m_pFirst,
                  pNewHead, pOldHead)) != pOldHead);
}
/**
** Fetches all queued items. A copy
** of each element is made using T::operator=(const T&)
**
** \returns a lenarray containing the count of items removed,
** along with the items themselves in order of queueing.
**
** The complexity of this routine is due mainly to the fact
** that the queue is stored with items in reverse order for
** the sake of performance. It's not a bad tradeoff, because
** the extra code is minimal and it lets us avoid locking.
**
** \dotfile squeue-get.dot "Fetching the elements"
**/
template<typename T>
lenarray<T>* SQueue<T>::Get(void)
{
    SQueueElement* pGotTail = static_cast<SQueueElement*>(
        InterlockedExchangePointer(&m_pFirst, (void*)nullptr));
    if (pGotTail == nullptr)
        return nullptr;
    size_t length = 0;
    SQueueElement* pElem = pGotTail;
    do {
        length++;
        pElem = pElem->pNext;
    } while (pElem != nullptr);
    lenarray<T>* pArray = new lenarray<T>(length);
    T* pData = *pArray + length;
    pElem = pGotTail;
    do {
        *(--pData) = pElem->data;
        SQueueElement* pNext = pElem->pNext;
        delete pElem;
        pElem = pNext;
    } while (--length > 0);
    return pArray;
}
#pragma managed(pop)
It is. There are numerous papers that deal with the subject. If you have a
solution for it, that's great. I look forward to looking at the code you
attached.
>
> The only way that InterlockedCompareExchange will succeed is if the old
> head pointer is what I have stored in the next-node pointer of the new
> head, therefore I have correctly inserted a new node regardless of what
> manipulations took place in the meantime.
I suggest that you read the paper by Maged I linked to in the previous post.
The ABA problem is real. At least search the text in the PDF for the term
"ABA", and read the text contained in all the occurrences therein.
> Please take a look at my queue implementation and let me know if you see
> any problem (I realize it isn't truly lock-free until I replace the
> allocator):
Okay. I am going to look at it. I am still tinkering around with my
rwmutex_win class. I want to extend the API to allow for try read/write
functions (e.g., tryrdlock/wrlock).
Good news! ;^)
From what I could gather at first glance, you have solved the ABA
problem by using the pattern described here:
http://groups.google.ca/group/comp.programming.threads/msg/49a5db082507cb87
http://groups.google.ca/group/comp.programming.threads/msg/a359bfb41c68b98b
http://groups.google.ca/group/comp.programming.threads/browse_frm/thread/fafe5f528700407
If you stick to that, you will be able to deallocate nodes and not have to
worry about it. BTW, when you do it this way, the Get function is not only
lock-free... It's wait-free. That's a nice property to strive for sometimes,
as long as your method tries to cut down on overheads.
Humm... Well, FWIW, if you're into low-overhead queuing, I suggest you skim
through the following message and follow the links:
http://groups.google.com/group/comp.arch/msg/a7501814f19731cd
What do you think about using distributed message passing algorithms w/ a
work-stealing scheme for natural auto-balance? I have a library I can
license... Actually, never mind that.
;^)
As you say: for LIFO or unordered lists, SCAS is sufficient in the general
case. For FIFO or binary data structures, an MCAS is required (DCAS or
TCAS). Under certain conditions a control variable can be used with CAS to
generate FIFO – the assumption being that fewer than 2^32 (or 2^64)
operations are performed on the list between the time that the control
variable is checked and when the CAS is executed.
Additionally, on Windows, FIFO & LIFO are mostly useless for multiple
producer / multiple consumer modes because the strict ordering that they
imply is lost when thread pre-emption occurs. So other than the vague
benefit of LIFO on memory residence (recently used pages are more likely to
be in the working set), lock-free ordered lists should be avoided.
"Chris Thomasson" <cri...@comcast.net> wrote in message
news:I5-dnZi55r10Faba...@comcast.com...
Single CAS wrt LIFO is only good for the method that Ben uses:
http://groups.google.ca/group/comp.programming.threads/msg/49a5db082507cb87
http://groups.google.ca/group/comp.programming.threads/msg/a359bfb41c68b98b
http://groups.google.ca/group/comp.programming.threads/browse_frm/thread/fafe5f528700407
> For FIFO or binary data structures, an MCAS is required (DCAS or TCAS).
> Under certain conditions a control variable can be used with CAS to
> generate FIFO –
You can get FIFO with LIFO using the trick described in the links above. The
control variable is what I call the anchor. The anchor structure is opaque
and ambiguous. You can have an anchor for a LIFO/FIFO/whatever...
> the assumption being that fewer than 2^32 (or 2^64) operations are
> performed on the list between the time that the control variable is
> checked and when the CAS is executed.
Yup. The monotonic ABA counter can roll...
> Additionally, on Windows, FIFO & LIFO are mostly useless for multiple
> producer / multiple consumer modes because the strict ordering that they
> imply is lost when thread pre-emption occurs.
Sure.
> So other than the vague benefit of LIFO on memory residence (recently used
> pages are more likely to be in the working set), lock-free ordered lists
> should be avoided.
>
[...]
If your design cannot tolerate an algorithm, don't use it. I agree.
Humm... Are you stamping the nodes with the sequence number of the monotonic
aba counter at time of push and/or pop? Are you sorting them in ascending
order or something?
I meant to say descending order as ascending would be LIFO.
Well, you can also use single CAS if you provide some other solution to the
ABA problem. The method that Ben's code uses is one of many.
Thread preemption was only a significant distinction when stuff ran on
single processor systems and only one thread could run at a time. That's
no longer true on multi-cores.
--
Joe Seigh
When you get lemons, you make lemonade.
When you get hardware, you make software.
Thanks for bringing back the focus and a level of talking I can relate
to. :-)
(This is a side note: I believe, at least under WIN32, it is possible to
programmatically bind your threads to a specific processor. Never
tried it myself, but I recall a reference to it, in this case under the
guise of potential security conflicts. Maybe that was for hyper-threaded
processors only.)
I should make it clear, especially for Chris, that all our work is 100%
for WIN32 only, with very little dependency on its sub-system specifics or
NT-based-only methods, staying clear of APIs with hidden logic,
experimental ideas, and/or fuzzy considerations related to HW.
With our current charter to improve performance/scalability, we are
looking at taking advantage of some of the NT-based only offerings.
Part of the issue is the design decision made in 1996 (so those OS/HW
realities of the day were important) to assure quality, maybe at the
expense of performance. Broadly speaking, our RPC client/server
framework uses critical sections, rw-locks, semaphores, etc., in
much of our multi-context, multi-threaded, data queuing, circular
buffers, FIFO/LIFO, etc., synchronization needs.
So there is much room for improvement in many areas.
While I look for those NT-based improvements, I also have to measure
disk vs memory (or both) queuing, local vs remote IPC, context switching
issues, single vs multi vs hyper-threaded, affinity considerations, etc.
Most of the time, we look for and/or design generalized, reusable,
"practical" solutions and methods.
That said, this Interlocked Singly Linked List API seems to be useful, but
it still seems to require a lock-sync when using its FLUSH function, not
to be confused with a PURGE consideration. To me, a FLUSH is a purge if
the data is not available or saved during the flush.
In this case, it provides a pointer to a list. To me, in order to be
thread safe, the function has to make a COPY of the data and provide
a new pointer list; then it can safely purge the internal list it is
maintaining.
That is basically all I was looking for here: a confirmation of the
workings of this flush.
The order of the list is only important depending on how and where it is
implemented. As suggested by Ben, it would be easy enough to transform a
LIFO into a FIFO. But it cannot be applied in areas where FIFO is required
if the LIFO order cannot be guaranteed.
Thanks
--
HLS
I notice that before you wrote that, we were getting two messages in
reply to everything; but after you wrote it, we started getting four
messages in reply to everything. (-:
But now you continue adding more junk to the junk already created? Go
figure! Geez, M wasn't even referring to the cross-posting, but rather to
Chris's highly energetic eagerness to assist, with a prolific posting
ability and corrective follow-ups, hence M's consolidation point.
If you have something to contribute, by all means, please do. I started
this thread; the "rest of us" would rather you didn't screw it up
with junk.
--
HLS
I meant "a problem for me". Sorry about the overly general statement.
Thanks for the review. I was pretty sure it was correct but it's very nice
to have concurrence from an expert.
As far as I'm concerned, the only (tenable) way to write multi-threaded code
is to have a couple of lock-free thread-safe components (like my squeue)
that fit entirely on a single page, and elsewhere avoid accessing the same
variable from more than one thread. In any non-trivial project, trying to
protect shared data structures any other way is just too complex to reason
about, with too many opportunities for deadlocks and/or race conditions; and
even if those are solved, contention rises to a degree that removes the
benefits of multithreading.
>
> Humm... Well, FWIW, if you're into low-overhead queuing, I suggest you skim
> through the following message and follow links:
>
> http://groups.google.com/group/comp.arch/msg/a7501814f19731cd
>
Thanks, I'll take a look.
> That said, this Interlocked Singly Linked List API seems to be useful, but
> it still seems to require a lock-sync when using its FLUSH function, not
> to be confused with a PURGE consideration. To me, a FLUSH is a purge if the
> data is not available or saved during the flush.
The reason you don't need an explicit mutual exclusion lock is because you
are using instructions that atomically modify the entire list anchor in one
single atomic operation.
>
> In this case, it provides a pointer to a list. To me, in order to be
> thread safe, the function has to make a COPY of the data and provide a
> new pointer list; then it can safely purge the internal list it is
> maintaining.
[...]
You do not need COW semantics wrt flushing the contents of a lock-free
LIFO. You simply SWAP the head of the list with NULL, and that's it.
"J de Boyne Pollard" <j.deboyn...@tesco.net> wrote in message
news:728fdf8b-9c02-4828...@b32g2000hsa.googlegroups.com...
JdeBP> I notice that before you wrote that, we were getting
JdeBP> two messages in reply to everything; but after you
JdeBP> wrote it, we started getting four messages in reply
JdeBP> to everything. (-:
P> [...] Geez, M wasn't even referring to the cross posting [...]
... and neither was I.
Thread pre-emption is important on all pre-emptive multi-tasking operating
systems regardless of how many CPUs are present in the system. The
importance of this consideration lies in the fact that between any two
instructions executed by thread A, thread B may have executed a large number
of instructions (either on this CPU or on another one) or none at all. The
reason this is, implicitly, important is that, even though one could use a
lock-free algorithm for nearly anything, they are specifically designed for
HPC scenarios with multiple producers & multiple consumers (or more
generally, multiple modifiers).
Lock-free is not a forward progress guarantee. Lock-free, a misnomer, is an
idiom in which software locks are replaced with hardware locks because they
are, presumably, more efficient. This efficiency is achieved, if by
nothing else, by being smaller (one instruction vs. hundreds) and by
avoiding interaction with the OS scheduler. Lock-free algorithms must
provide a forward progress guarantee, otherwise they would cease to function
under contention, but that guarantee, in and of itself, does not make an
algorithm lock-free and hence does not define the idiom.
"Joe Seigh" <jsei...@xemaps.com> wrote in message
news:JUV_i.7973$WM.837@trndny05...
I think you should look at the differences between wait-free and lock-free.
Wait-free means that the operation will complete within a finite number of
steps; in other words, there are no loops. Lock-free implies conditional
looping. You can get into live-lock under certain scenarios. For instance, a
live-lock can ensue if the reservation made by a load-linked is constantly
invalidated before the store-conditional gets to execute. A PowerPC manual
mentions some of this wrt false-sharing messing with the reservation. That's
why STM is so prone to live-lock... STM is like LL/SC with huge reservation
granules.
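To illustrate the distinction with the flush/push pair from earlier in the
thread (sketch only; the helper names are mine):
____________________________________________________________
#include <windows.h>

typedef struct node_s { struct node_s* next; } node_t;

/* wait-free: completes in a bounded number of steps; no retry loop */
node_t* flush_wait_free(node_t* volatile* head)
{
    /* one atomic swap, no loop */
    return (node_t*)InterlockedExchangePointer((void* volatile*)head, 0);
}

/* lock-free: some thread always makes progress (some CAS always wins),
   but an individual thread can retry indefinitely under contention */
void push_lock_free(node_t* volatile* head, node_t* node)
{
    node_t* cmp;
    do {
        cmp = *head;
        node->next = cmp;
    } while (InterlockedCompareExchangePointer(
                 (void* volatile*)head, node, cmp) != cmp);
}
____________________________________________________________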
The fact that on certain architectures, false-sharing can adversely affect
performance is a consideration in implementation and algorithm choice.
Specifically, don’t choose an algorithm that is hard to implement or
performs badly on your target architecture ;)
"Chris Thomasson" <cri...@comcast.net> wrote in message
news:f7OdndviL--1OqHa...@comcast.com...
[top-post corrected]
"m" <m@b.c> wrote in message news:OMNiHf8J...@TK2MSFTNGP03.phx.gbl...
> Lock-free & wait-free are independent idioms which, in certain algorithms,
> can overlap. Except that lock-free doesn’t imply conditional looping,
> this is an attribute of many algorithms that are lock-free not a defining
> property of the idiom,
Fair enough.
> your definitions are reasonable; so what am I missing?
I don't think you're missing anything.
> The fact that on certain architectures, false-sharing can adversely affect
> performance is a consideration in implementation and algorithm choice.
> Specifically, don’t choose an algorithm that is hard to implement or
> performs badly on your target architecture ;)
100% Agreed.
> Thanks for the review. I was pretty sure it was correct but it's very
> nice to have concurrence from an expert.
You're welcome. I think your SQueue class is sound wrt the overall concerns
of the ABA problem in general. I also think that the code correctly addresses
the concerns of deallocating nodes from lock-free data-structures. You have a
performance enhancement in the SQueue<T>::Get() function because it's 100%
wait-free, as long as the hardware that Windows happens to be running on
supports an atomic swap instruction; Intel does. SPARC used to support the
SWAP instruction; they deprecated it, and delegated its functionality over
to the CAS instruction on 32-bit systems, and CASX on 64-bit systems. It's
kind of funny IMHO, simply because CASX operates on 64-bit datum in 64-bit
mode, despite its literal name; the way I read things, CASX implies a
DWCAS-like operation, which means 128-bit operation when in 64-bit mode.
That seems odd to me, to say the least...
> As far as I'm concerned, the only (tenable) way to write multi-threaded
> code is to have a couple of lock-free thread-safe components (like my
> squeue) that fit entirely on a single page, and elsewhere avoid accessing
> the same variable from more than one thread. In any non-trivial project,
> trying to protect shared data structures any other way is just too complex
> to reason about, with too many opportunities for deadlocks and/or race
> conditions; and even if those are solved, contention rises to a degree
> that removes the benefits of multithreading.
Well, deadlocks are caused by the misuse of locking. IMVHO, if a programmer
cannot get lock-based algorithms correct, then they should be nowhere near a
lock-free one, to say the least. The first of the two links provided below
points out some of the basic caveats wrt replacing a lock-based
synchronization point with its non-blocking equivalent. For instance, if you
stick a huge engine in a car with crappy tires, well, then Murphy's Law
dictates that you're going to have a blow-out at high speeds, on a windy
road, with a cliff on one side. Therefore, you must upgrade the tires, and
probably some other components as well... Solving that in an efficient
manner can sometimes turn out to be a fairly long and tedious endeavor;
indeed!
>> Humm... Well, FWIW, if you're into low-overhead queuing, I suggest you skim
>> through the following message and follow links:
>>
>> http://groups.google.com/group/comp.arch/msg/a7501814f19731cd
>>
>
> Thanks, I'll take a look.
[...]
I think that you just might find the brief overview of the message-passing
techniques described therein fairly interesting, and perhaps a "bit useful";
as in you may be able to fit it in some of your _existing_ synchronization
schemes with "fairly minimal" effort(s). You can read the following for some
more details on that particular subject:
http://groups.google.com/group/comp.programming.threads/msg/9a5500d831dd2ec7
http://groups.google.com/group/comp.programming.threads/browse_frm/thread/ef3f26bbe1c52e0b
Any thoughts on this?
I, personally, would be very surprised if they did. The benefit is so
small, and memory is essentially free.
Further, in my experience, most kernel structures use doubly-linked
lists.
--
Tim Roberts, ti...@probo.com, DDK MVP
Providenza & Boekelheide, Inc.
Yes, preemption causes more variation in a thread's rate of progress, but so
what?
In multi-threading you don't make any assumptions about the rate of forward
progress, preemption or not. And it's certainly news to me that lock-free
is mainly important to HPC and not so much for other things.
>
>
>
> Lock-free is not a forward progress guarantee. Lock-free, a misnomer, is an
> idiom in which software locks are replaced with hardware locks because they
> are, presumably, more efficient. This efficiency is achieved, if by
> nothing else, by being smaller (one instruction vs. hundreds) and by
> avoiding interaction with the OS scheduler. Lock-free algorithms must
> provide a forward progress guarantee, otherwise they would cease to function
> under contention, but that guarantee, in and of itself, does not make an
> algorithm lock-free and hence does not define the idiom.
>
The most commonly used definition for lock-free is by Herlihy. He also did
definitions for wait-free and obstruction-free.
> We were discussing a multiple writer, single consumer queue. For this the
> writers each use the InterlockedPushEntrySList function, while the consumer
> uses the InterlockedFlushSList function to retrieve the entire queue as an
> atomic operation. The consumer is then responsible for iterating through
> the list of items retrieved in FIFO order.
Hi Ben,
I've been spending the last day or so working on this Interlocked Singly
Linked List (ISLL) API and did a profile of two basically identical
applications with the key difference:
PROG-A: MFC CList FIFO queue w/ my RW and Chris's RW code,
PROG-B: ISLL with no RW code. Only LIFO analyzed here.
Each allows me to play with different parameters:
#define USE_CRITSECT_H // Use my RW or Chris's RW
BOOL PRINT_LINES = FALSE; // Avoid console overhead
DWORD MAX_WRITER_THREADS = 1; // # of writer threads
DWORD MAX_READER_THREADS = 1; // # of reader threads
DWORD AUTO_STOP_COUNT = 1000; // # of writer events
DWORD WRITER_FREQ_MS = 1; // Frequency of writes
DWORD READER_FREQ_MS = 1; // Frequency of reads
DWORD READER_EVENT_THRESHOLD = 0; // # before reader is signaled
DWORD READER_PROCESS_DELAY = 0; // Simulate processing time
Each writer and reader thread has a one (1) quantum frequency using a
while( WFSO(ShutdownEvent,1) ) thread break test. So essentially a 15ms
frequency of pumping 1 item into the list, and a 15ms reader frequency. The
wait can be changed with the XXXX_FREQ_MS parameters.
AUTO_STOP_COUNT allows the test to end. The main thread watches this
count before it signals the threads to shut down.
READER_EVENT_THRESHOLD, if set, will allow the queue to build before
the writer signals the reader event.
For PROG-A with the RW locks, the USE_CRITSECT_H define allows me to test
my RW classes or Chris's fine RW class. :-)
The testing is on an Intel Pentium D 2.66GHz, 1 GB RAM XP/PRO box.
The program's main thread creates the threads. When AUTO_STOP_COUNT is
reached, it signals the shutdown event.
Another console utility is used to set the performance counters to watch
for and log. In this profile, I am only looking at the context switch rate.
The results:
# PROG-A: USING SSI CRITSECT.H READER/WRITER CLASSES
* Successful Shutdown: 16000 ms
* Total Processed : 1023
* Total Empty Gets : 362
PROG-A Reader: ~117 cs/sec
PROG-A Writer: ~88 cs/sec
# PROG-A: USING CHRIS'S READER/WRITER CLASSES
* Successful Shutdown: 16000 ms
* Total Processed : 1023
* Total Empty Gets : 259
PROG-A Reader: ~77 cs/sec
PROG-A Writer: ~71 cs/sec
So Chris's code offers less context switching. Don't get excited, Chris.
:-) Note, the reader in the first test is wasting time with empty read
events. This comes into play in the additional tests.
# PROG-B: USING ISLL API WITH NO RW LOCKS
* Successful Shutdown: 16000 ms
* Total Processed : 1023
* Total Empty Gets : 279
PROG-B Reader: ~67 cs/sec
PROG-B Writer: ~65 cs/sec
Even slightly better with the ISLL, with about the same waste in reader
events. Note: this is using PUSH and POP (LIFO).
For the next tests, I optimized the synchronization using event-driven
thresholds, READER_EVENT_THRESHOLD = 10; the pump will signal the
reader when 10 items are added to the list.
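A hedged sketch of that kind of threshold signaling (all names here are
mine, not from the test program):
____________________________________________________________
#include <windows.h>

#define READER_EVENT_THRESHOLD 10

static SLIST_HEADER g_queue;    /* InitializeSListHead'd at startup */
static LONG g_pending = 0;
static HANDLE g_reader_event;   /* auto-reset event */

/* writer: push the item, and only signal the reader once the
   queue has built up to the threshold */
void writer_post(PSLIST_ENTRY entry)
{
    InterlockedPushEntrySList(&g_queue, entry);
    if (InterlockedIncrement(&g_pending) >= READER_EVENT_THRESHOLD) {
        /* simplified; concurrent writers could race on this reset */
        InterlockedExchange(&g_pending, 0);
        SetEvent(g_reader_event);
    }
}

/* reader: wake on the event, then drain the whole batch at once */
void reader_loop(void (*process)(PSLIST_ENTRY))
{
    PSLIST_ENTRY e, nx;
    for (;;) {
        WaitForSingleObject(g_reader_event, INFINITE);
        e = InterlockedFlushSList(&g_queue);
        while (e) {
            nx = e->Next;
            process(e);
            e = nx;
        }
    }
}
____________________________________________________________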
# PROG-A: USING SSI CRITSECT.H READER/WRITER CLASSES
* Successful Shutdown: 16000 ms
* Total Processed : 1021
* Total Empty Gets : 0
PROG-A Reader: ~6 cs/sec
PROG-A Writer: ~70 cs/sec
Drastic improvement here by collecting a bucket of items before
processing. Notice no waste in reader events.
# PROG-A: USING CHRIS'S READER/WRITER CLASSES
* Successful Shutdown: 16000 ms
* Total Processed : 1021
* Total Empty Gets : 0
PROG-A Reader: ~8 cs/sec
PROG-A Writer: ~66 cs/sec
Essentially the same performance.
# PROG-B: USING ISLL API WITH NO RW LOCKS
* Successful Shutdown: 16000 ms
* Total Processed : 1021
* Total Empty Gets : 0
PROG-B Reader: ~7 cs/sec
PROG-B Writer: ~67 cs/sec
And for ISLL, essentially the same performance when using a synchronized,
event-driven bucket brigade.
I performed many other tests and real-world simulations, like adding
delays for processing time and lowering the frequency of the pump, but
overall I think it is pretty clear to me that there is no real big
difference when you take into account key optimization and synchronization
factors.
In regards to ISLL, I did see a very slight improvement using the
flush when I began to work on a LIFO-to-FIFO transformation. But seeing
that the regular MFC CList-based FIFO performed equally well, to me the
added complexity, the unsureness of lock-free methods, and the OS
dependency make it impractical at the moment.
Other considerations, such as memory fragmentation, need to be reviewed.
Anyway, I thought I would share these results.
Thanks
--
HLS
#if defined(_WIN64)
typedef union _SLIST_HEADER {
    ULONGLONG Alignment;
    struct {
        ULONGLONG Depth : 16;    /* item count */
        ULONGLONG Sequence : 9;  /* the ABA counter discussed below */
        ULONGLONG Next : 39;     /* compressed pointer to the first entry */
    };
} SLIST_HEADER, *PSLIST_HEADER;
#endif
See ntddk.h
"Tim Roberts" <ti...@probo.com> wrote in message
news:a58qj3p6jmkts8vti...@4ax.com...
Did you try a multiple writer scenario, or reduced waits where you could
have some contention?
Especially don't use Sleep to simulate processing time, use a loop with a
volatile variable to make sure the compiler doesn't eliminate it.
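Something along these lines (a rough sketch; the iteration count is
whatever your simulation calls for):
____________________________________________________________
/* busy-work delay the optimizer cannot remove: the volatile
   qualifier forces every store to actually happen */
void simulate_processing(unsigned long iterations)
{
    volatile unsigned long sink = 0;
    unsigned long i;
    for (i = 0; i < iterations; ++i)
        sink += i;
}
____________________________________________________________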
Yeah, especially since any predicate that is based on an assumption about
the non-deterministic rate of a thread's forward progress is basically a
race-condition in and of itself. I think it's similar to all the reasons
against implementing RCU with timers; reader threads just might decide to
stay in the active read epoch longer than the timer allows for... Not good.
> And it's certainly news to me that lock-free
> is mainly important to HPC and not so much for other things.
IMHO, I find the reader pattern to be the most useful overall technique.
[...]
"Alexander Grigoriev" <al...@earthlink.net> wrote in message
news:#UdcMqFK...@TK2MSFTNGP04.phx.gbl...
"Vladimir Petter [MSFT]" <vla...@hotmail.com> wrote in message
news:u$CatAOKI...@TK2MSFTNGP04.phx.gbl...
> You can find more info on this here http://www.alex-ionescu.com/?p=50
> Vladimir Petter. [microsoft]
Thanks for that. Humm... So, 64-bit Windows running on x86-64 systems that
lack the CMPXCHG16B instruction uses only 9 bits for an ABA counter; I think
that may be a bit dangerous. That counter will be rolling over quite
frequently under load... Humm:
http://groups.google.com/group/comp.programming.threads/browse_frm/thread/133f764f1bc0c9f4
Any thoughts?
The counter could roll over, but the probability of hitting the same value
is astronomically low. If there is too much ABA contention, the counter will
simply move forward, but not by 512 at once.
http://groups.google.com/group/comp.arch/msg/04cb5e2ca2a7e19a
Damn!
Ben Voigt [C++ MVP] wrote:
> Did you try a multiple writer scenario, or reduced waits
> where you could have some contention?
Yes.
It was pretty much all the expected issues and results wrt synchronizing
and/or controlling queuing and dequeuing rates, including using the
proper methods for the design at hand, i.e., pop vs a flush.
> Especially don't use Sleep to simulate processing time,
> use a loop with a volatile variable to make sure the compiler
> doesn't eliminate it.
Not sure what you were thinking at the time you wrote this, but it all
depends on what you are simulating. I won't attempt to elaborate too
deeply, but just in case it was misunderstood, I did not intend to imply
that the parameter READER_PROCESS_DELAY "Simulating Processing Time"
exclusively means CPU time, if that is what you were thinking.
To me, the logic to simulate processing time implies doing "things" that
emulate as nearly as possible the real-world behavior of your data handling
design needs, which may include considerations such as HW-related
interrupts, natural time-slicing pre-emptions, as well as CPU time, etc.,
all to create situations which will help find contention, deadlock,
timing, synchronization, and optimization issues, etc.
I use various simulation techniques, including combinations of these:
sleeps, yields/pokes, software-based timers, math computations, high memory
loads, file I/O, RPC calls (which are critically important for us, because
this is where most of the bottlenecks might occur), and, for GUI-related
issues, posting or sending messages, message handlers with simulated
delays, etc.
That said, I'm curious. I have various methods for software timers. Can
you provide an example of one using volatile? If we're referring to the
same thing, I wasn't sure if you were thinking of HW-based timers vs
SW-based timers.
Thanks
--
HLS
It's good to keep in mind that we are talking about solutions that are
analogous to managed race-conditions in their essence. I think that PDR is a
good method to use wrt addressing the ABA problem in general. The depth of
the counter is directly proportional to the "chance" of the race-condition
not raising its ugly head right in the middle of a critical operation.
They augment the algorithm with a "counter" (e.g., a counter inc'd for every
push, and dec'd for every pop). This is not a monotonic counter, therefore
it's not effective for solving ABA in general. You either need to manage
memory and solve ABA externally, or solve it within the state of a specific
algorithm.
"Chris Thomasson" <cri...@comcast.net> wrote in message
news:98OdnXUQ0LWfvtja...@comcast.com...
Right. So effectively it's only a 9-bit ABA counter. The ABA problem can
trip up the 16-bit depth count by popping and pushing the same item: a depth
of 1, changed to a depth of 0, and back to 1 again; ABA...
"Chris Thomasson" <cri...@comcast.net> wrote in message
news:F-ydnQxWxLBYjdva...@comcast.com...
[...]
A 9-bit sequence number is small. Back when IBM did things this way, the
processors were not as fast as they are now...