A Correct, Lock-free FIFO with a single CAS per complete put/get cycle

ch...@megabasic.com

Sep 16, 2008, 11:49:06 PM
to Scalable Synchronization Algorithms
Sounds impossible, but when many threads are putting and only one
thread is pulling, a huge simplification is there for the taking.
It works like this...

The queue is treated like a stack, starting with the head pointer set to
zero. Stacks are really fast and easy, especially for pushing new items
on, because no ABA protection is required (unlike popping items off).
Pushing items onto the stack does have the problem of creating a list
in the wrong order for a FIFO dequeue operation.

But since the receiving thread has no competition, it can grab the
entire list by exchanging the list head pointer with zero, leaving the
list empty for subsequent enqueue operations. This cannot conflict
with any concurrent CAS's, but the list order is still reversed.
However, since the thread exclusively holds the entire list by itself,
it simply reverses all the links, then processes the items in the list
in sequence, in the right order. This is simple, correct, fast and
elegant. Any comments...
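In outline, the scheme above might be sketched with C++11 atomics like this (illustrative names, not the poster's actual code):

```cpp
#include <atomic>

struct Node {
    int value;
    Node* next;
};

// Shared LIFO head; producers push with CAS, the single consumer
// grabs everything with one exchange.
std::atomic<Node*> head{nullptr};

// Producer: CAS-push onto the stack. Push needs no ABA protection
// because we never dereference the old head, only link to it.
void put(Node* n) {
    Node* h = head.load(std::memory_order_relaxed);
    do {
        n->next = h;
    } while (!head.compare_exchange_weak(h, n,
                 std::memory_order_release,
                 std::memory_order_relaxed));
}

// Consumer (single thread): take the whole list in one atomic swap,
// then reverse the links privately to recover FIFO order.
Node* get_all_fifo() {
    Node* lifo = head.exchange(nullptr, std::memory_order_acquire);
    Node* fifo = nullptr;
    while (lifo) {
        Node* next = lifo->next;
        lifo->next = fifo;   // uncontended plain write: the list is private now
        fifo = lifo;
        lifo = next;
    }
    return fifo;             // oldest element first
}
```

The consumer pays one XCHG per batch, not per item, and the producers pay one CAS per put, which is the "single CAS per complete put/get cycle" claim.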

Dmitriy V'jukov

Sep 17, 2008, 4:36:58 AM
to Scalable Synchronization Algorithms
It's a very well-known algorithm for multi-producer/single-consumer fifo
queues:
http://groups.google.ca/group/comp.programming.threads/msg/49a5db082507cb87
http://groups.google.ca/group/comp.programming.threads/msg/a359bfb41c68b98b

The only thing I am concerned about is the 'reversing' operation. In my
benchmarks I've seen substantial performance degradation exactly in the
'reversing' operation.

You can consider my mpsc queue algorithm:
http://groups.google.ru/group/lock-free/browse_frm/thread/55df71b87acb8201
(see the intrusive version)
It has some advantages as well as disadvantages as compared to the
'reversed IBM freelist'.
Producers are wait-free. They execute only XCHG, which is faster than
CAS.
On the fast path the consumer also doesn't execute any atomic RMW. It
has to execute XCHG only rarely (more rarely than in the reversed
freelist).
No need for a 'reversing' operation.
Also it is not prone to ABA.
Also no need for GC.


Dmitriy V'jukov
--
Relacy Race Detector: Make your synchronization correct!
http://groups.google.ru/group/relacy/web

Codewarp

Sep 17, 2008, 3:07:11 PM
to Scalable Synchronization Algorithms
I am surprised by the reversal cost that you mentioned. We are using
this method in a production multi-threaded application framework in C++
for both event processing and for memory-block freelists (for
freeing blocks belonging to other threads). Of course the freelists
are not reversed by the consumer, so there is no reversal cost for
that. But the event processor (or task processor, depending on your
point of view) executes event response code for each node in the
list. The reversal cost is bounded and amortized across all the
events to be processed. Each link reversal is a single, uncontended,
non-atomic memory write, paired with a (much larger) event/task
invocation. This guarantees the reversal cost stays much smaller than
the consumer's contribution. The same links all have to be modified
in any case; this method just modifies them in a different order.
This scales well across multiple CPUs because it places the burden on
the multiple producers and takes it off the single consumer. It may
also have working-set advantages.

Dmitriy V'jukov

Sep 17, 2008, 4:16:16 PM
to Scalable Synchronization Algorithms
In such a context you can't feel the things I am talking about.
If the consumer's contribution is much bigger than the cost of the
dequeue operation, then you can't compare different queue
implementations in such a test. In such a context, I think, it really
doesn't matter how many CASes there are per operation, whether it's 1
CAS per operation or 5 CASes per operation. Try to reduce the consumer's
local work to a few tens of cycles (or even to zero).

Yes, the reverse operation involves only plain memory writes. But the
problem is that every node the consumer has to write to is situated
in a remote cache. So basically the consumer generates a chain of cache
misses, and the next address can't be predicted until the current miss
is satisfied. That's a very bad thing wrt performance; every
cache miss is around 200-300 cycles (on modern x86 processors).


Dmitriy V'jukov

Codewarp

Sep 17, 2008, 5:19:05 PM
to Scalable Synchronization Algorithms
On the consumer end, I have no CASes. But you are absolutely right
about the cache misses and their cost. My performance suffers from
this on long chains. But how much of it is due to the memory writes to
all those cache lines, and how much would remain if we didn't write at
all, but only read?

Since the consumer only performs the reversal for its own private
needs, suppose that instead, it scans the chain and stores all the
pointers on the stack (just for illustration). Then we walk through
the pointers in the right order, process all the events, pop off the
pointer list and the consumer is done. I don't need to use the stack,
I could use a thread-local array for this, but the principle is the
same.
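A sketch of that idea, assuming a minimal node type and using std::vector as the stand-in for the thread-local array (names are illustrative):

```cpp
#include <vector>

struct Node {
    int value;
    Node* next;
};

// Process a privately-held LIFO chain in FIFO order without writing
// to any node: record the pointers in one read-only pass, then walk
// them backwards.
template <class F>
void process_fifo(Node* chain, F handle) {
    std::vector<Node*> ptrs;             // stand-in for a thread-local array
    for (Node* n = chain; n != nullptr; n = n->next)
        ptrs.push_back(n);               // read-only scan, no link writes
    for (auto it = ptrs.rbegin(); it != ptrs.rend(); ++it)
        handle(*it);                     // oldest node first
}
```

Each cache line is still touched once, but only for a read, which is the half of the traffic this idea hopes to save.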

This would seem, in principle, to eliminate over half of the caching
overhead. I like this idea, and it would be easy for me to try, and
see what effect it has on maximum transaction bandwidth.

Dmitriy V'jukov

Sep 17, 2008, 6:55:27 PM
to Scalable Synchronization Algorithms
On Sep 18, 01:19, Codewarp <ch...@megabasic.com> wrote:
> On the consumer end, I have no CASes.  But you are absolutely right
> about the cache misses and their cost.  My performance suffers from
> this on long chains.  But how much of it is due to the memory write to
> all those cache lines, and how much if we don't write at all, but only
> read?


I think there will be no difference whether the thread writes or only
reads. Maybe if the thread writes to the node instantly, it will be even
better. Because if the thread will write to the node anyway some time
later - for example, when it frees the node, or when it enqueues the
node to another queue - it's better to make the write instantly than to
make a read first and then a write some time later.
But these are only my guesses.

Note that an algorithm which doesn't do the reversal has to take all
those cache misses too. But they will be spread over time. I don't know
why, but it seems that 'spread' misses are better than 'concentrated'
misses. To be honest, I made my benchmarks wrt node order reversal in a
very limited context. So maybe the results were affected by something
else. It's just the conclusion I made at that time.

Also a queue which doesn't do the reversal can apply the following
trick. When consuming the current node, it can prefetch the next node,
so that when the next node is needed, it will already be in cache.
Something like:

node* queue::dequeue()
{
    node* result;
    ...
    result = ...;
    ...
    _mm_prefetch((char const*)result->next, _MM_HINT_T0);
    return result;
}


> Since the consumer only performs the reversal for its own private
> needs, suppose that instead, it scans the chain and stores all the
> pointers on the stack (just for illustration).  Then we walk through
> the pointers in the right order, process all the events, pop off the
> pointer list and the consumer is done.  I don't need to use the stack,
> I could use a thread-local array for this, but the principle is the
> same.
>
> This would seem, in principle, to eliminate over half of the caching
> overhead.  I like this idea, and it would be easy for me to try, and
> see what effect it has on maximum transaction bandwidth.


I will keenly appreciate it if you post the results here.


Dmitriy V'jukov

Codewarp

Sep 18, 2008, 12:01:26 AM
to Scalable Synchronization Algorithms
I have been studying your mpscq method and it appears to be the best
algorithm for this operation that I have ever encountered--bravo. In
particular, it would seem to have the best performance on longer
chains, degrading only in the degenerate case of single-node queues.
Its response on any one event is always O(1), without any of the
discontinuities of the reversed-stack method. Also the notion that
its producers can block its consumer is way overstated. Rather, new
nodes are simply not available to the consumer until the push( )
operation has completed. No big shock there, though an sfence may be
required to resolve push( ) results at the point when only a single
node is encountered by the pop( ) call.

I may not pursue the reversed-stack ideas any further after
considering all of this. But when I do try this new algorithm, I
intend to test the performance difference carefully, and will post the
results here.

Codewarp

Sep 18, 2008, 7:44:18 PM
to Scalable Synchronization Algorithms
Dmitriy,

How is the prev->next = n in your push( ) function below made visible
to other processors, without an mfence or some atomic instruction to
force things back in line? The pop( ) doesn't execute any atomic
instructions either, until it's down to 1 node.

void mpscq_push(mpscq_t* self, mpscq_node_t* n) {
    n->next = 0;
    mpscq_node_t* prev = XCHG(&self->head, n);  // serialization-point
    prev->next = n;  // serialization-point wrt consumer
    // <-- mfence here?
}

Dmitriy V'jukov

Sep 19, 2008, 1:10:12 AM
to Scalable Synchronization Algorithms
Nope. mfence and atomic RMWs have nothing to do with visibility.
Please read my conversation with Johan Torp; it's about exactly the
same thing:

http://groups.google.ru/group/comp.programming.threads/tree/browse_frm/thread/9ccc757e7fe7ee2d/8d045d6d726d86e8?rnum=21&_done=%2Fgroup%2Fcomp.programming.threads%2Fbrowse_frm%2Fthread%2F9ccc757e7fe7ee2d%3F#doc_8d045d6d726d86e8


Dmitriy V'jukov

Codewarp

Sep 19, 2008, 3:38:26 PM
to Scalable Synchronization Algorithms

> Nope. mfence and atomic RMWs have nothing to do with visibility.
> Please read my conversation with Johan Torp, it's exactly about the
> same thing:
>
> http://groups.google.ru/group/comp.programming.threads/tree/browse_fr...
>
> Dmitriy V'jukov

Of course! Exactly one thread modifies any particular node.next, and
exactly one thread reads it, so no serialization is necessary. Most
impressive.

Let's try this angle... Suppose that exactly between the XCHG( ) and
the prev->next = n, the producer thread timeslices out, and is then
forcefully terminated by the OS, so that it never comes back. In that
case, the chain of nodes will break at that point in the list, causing
the consumer to stop receiving any further input from any producers.

Granted, this is really an odd-ball case, but it's the only hole I can
find. It might be plausible in some application exit scenario.

Dmitriy V'jukov

Sep 19, 2008, 3:56:48 PM
to Scalable Synchronization Algorithms
On Sep 19, 11:38 pm, Codewarp <ch...@megabasic.com> wrote:

> Let's try this angle... Suppose that exactly between the XCHG( ) and
> the prev-next = n, the producer thread timeslices out, and is then
> forcefully terminated by the OS, so that it never comes back.  In that
> case, the chain of nodes will break at that point in the list, causing
> the consumer to stop receiving any further input from any producers.


Assume a mutex-based queue. Assume a producer or consumer is terminated
inside the critical section. The same effect. This is why in my queue
a producer is blocking (i.e. behaves as if it were based on a mutex) wrt
the consumer. I've described this point in the description - see the
'Disadvantages' section:
http://groups.google.com/group/lock-free/browse_frm/thread/55df71b87acb8201

But producers are wait-free wrt other producers, and the consumer is
wait-free wrt producers.

A totally lock-free (or wait-free) queue can cope with thread
termination. But note that in order to survive thread termination you
need not only a lock-free queue, you need a TOTALLY LOCK-FREE WHOLE
SYSTEM. Because a thread can be terminated inside, for example, a
malloc() call, which usually contains a mutex. And I am usually not
designing algorithms for forward-progress guarantees (lock-free,
wait-free); I usually design for performance/scalability.


Dmitriy V'jukov

Codewarp

Sep 19, 2008, 5:18:12 PM
to Scalable Synchronization Algorithms


On Sep 19, 12:56 pm, "Dmitriy V'jukov" <dvyu...@gmail.com> wrote:
> On Sep 19, 11:38 pm, Codewarp <ch...@megabasic.com> wrote:
>
> Assume mutex-based queue. Assume producer or consumer is terminated
> inside critical section. The same effect. This is why in my queue
> producer is blocking (i.e. behaves like it is based on mutex) wrt
> consumers. I've described this moment in description - see
> 'Disadvantages' section:http://groups.google.com/group/lock-free/browse_frm/thread/55df71b87a...

Yes, I have read your disadvantages, but I think they are overstated.
How, for instance, can a producer thread go into a critical section
between the XCHG and the prev->next = n when push( )ing another
node? It can't, nor can it timeslice away, get locked up and
delayed. The same is true of the consumer. Am I missing something
here??

> Totally lock-free (or wait-free) queue can cope with thread
> termination. But note that in order to overcome thread termination you
> need not only lock-free queue, you need TOTALLY LOCK-FREE WHOLE
> SYSTEM. Because thread can be terminated inside, for example, malloc()
> call, which usually contains mutex. And I am usually not designing
> algorithms for forward progress guarantees (lock-free, wait-free), I
> usually design for performance/scalability.
>
> Dmitriy V'jukov

Performance and scalability is the goal, but as you know, lock-free/
wait-free can provide the means. It was precisely because of
malloc( ) and its infatuation with critical sections that I ended up
developing my own multi-threaded allocator. It was truly
dumbfounding to discover how slow and unacceptable malloc( ) is in the
multi-threaded libraries. Mine uses a per-thread allocator, but you
can release undifferentiated blocks from other threads into the
current allocator as they arise. Local allocation/free is lock-free,
wait-free, atomic-free. Freeing blocks from other threads is lock-free
and wait-free. A local allocation cycle involving random sizes
takes on the order of 200-400 cycles, compared to the 1800 cycles I
have seen from malloc( ) to do the same thing. My dynamic string
implementation is based on it, and it's greased lightning because of
it (along with its move constructors and other things).

As far as a totally lock-free system goes, our framework is completely
lock-free, except for when it has nothing at all to do, in which case it
blocks waiting for a signal that tasks are available. I have yet
to find any way to eliminate this one wait from the system. All
threading within this framework uses thread objects and logical events
managed by the framework, so termination from elsewhere is not an
issue.

Codewarp

Sep 19, 2008, 8:25:18 PM
to Scalable Synchronization Algorithms
Continuing with these concepts... when the producer timeslices out, in
that rare empty void between the XCHG and the prev->next = n
assignment, the consumer doesn't "block", it just finds the list empty
and gets back a zero. How does the consumer know that if it kept
trying, it would find more nodes? The consumer might block, thinking
that it will be signalled when the queue has something new in it, when
in fact it can hold several nodes, with no more nodes to follow, and
appear to hang.

This limitation now seems much more serious than simply a
"disadvantage", suggesting this algorithm may be too good to be true.
Please show me where I am wrong--your algorithm has too much going for
it to let go of unless I really have to. I wouldn't even mind another
atomic RMW if it could close this gap. What good is performance if you
can't keep it awake?

Codewarp

Sep 20, 2008, 3:54:08 PM
to Scalable Synchronization Algorithms
After sleeping on it, I realized that if pop( ) could return BUSY,
EMPTY or a node pointer, the calling context would always have precise
knowledge to make correct choices. We can detect the BUSY state when
we find head != tail while tail->next == 0, an impossible state unless
the queue is BUSY. When a caller to pop( ) finds the queue BUSY, it can
sleep(0) and retry, or just spin, if it so desires. BUSY states are
expected to persist for something comparable to a time-slice interval
(6 microsecs is what I have observed), so they must be extremely rare,
and their probability will affect the extent of scalability across many
simultaneous producers. Returning EMPTY=0, BUSY=1 is what I use in my
code now. Returning (head != tail) is all that is necessary when
otherwise returning 0.

Dmitriy V'jukov

Sep 23, 2008, 7:11:51 AM
to Scalable Synchronization Algorithms
Sorry for delays. Currently I'm quite busy with other activities. I
hope I will answer in the near future.

Dmitriy V'jukov

Codewarp

Sep 23, 2008, 2:44:30 PM
to Scalable Synchronization Algorithms
Ok, so let me take the opportunity to reissue the V'jukov MPSC queue
algorithm as a relatively complete template class, with many comments
and the additional return values from push( ) and pop( ). I hope the
auto-formatting of this discussion board doesn't chop all the lines in
half; we'll see...

/*/
Advantages:
    Producers are wait-free. They execute only XCHG, which is faster than CAS.
    One XCHG is the maximum one can get with multi-producer non-distributed queues.
    On the fast path the consumer also doesn't execute any atomic RMW. It
    executes XCHG rarely.
    No need for node order reversal, so the pop operation is always O(1).
    Intrusive. No need for additional internal nodes.
    ABA-free.
    No need for PDR, nor for GC.

Disadvantage:
    If a producer is blocked at (*), then the consumer is blocked too.
    Fortunately the 'window of inconsistency' is extremely small - the
    producer must be blocked exactly at (*). Actually it's a disadvantage
    only as compared with a totally lock-free algorithm. It's still much
    better than a lock-based algorithm.

    This disadvantage is handled as a BUSY state, indicated when pop( )
    returns 1. BUSY is not really a normal "blocking" state; rather it
    means the next node is delayed a few microsecs for a time-slice cycle.

    The push( ) function returns 0 if it encounters an empty queue, or 1
    if nonempty. This knowledge is useful to callers who need to set a
    queue-non-empty signal.
/*/

// T may be any type with a public T *next data member and a default constructor
template<class T>
class FastQueue : public SharedObjectEx {
    T *head, *tail, *stub;
public:
    FastQueue ( );
    virtual ~FastQueue ( );
    Integer push (T* node);  // put another node into the queue
    T* pop ( );              // return 0=empty, 1=busy, or a node pointer
    T* popwait ( );          // spin while busy, return 0=empty, or a node pointer
};

template<class T>
FastQueue<T>::FastQueue ( ) { head = tail = stub = new T; stub->next = 0; }

template<class T>
FastQueue<T>::~FastQueue ( ) { stub->Release ( ); }

// push node onto the queue, return 0 if the queue was empty, 1 if non-empty.
template<class T>
Integer FastQueue<T>::push (T* node) {
    node->next = 0;
    T* prev = (T*)safeswap ((void**)&head, node);  // atomic (prev = head, head = node)
    // (*) another push( ) here would succeed at this point
    // (*) a pop( ) here would find the list without this or later nodes
    prev->next = node;       // link prev to node, now the pop( ) can find it
    return prev != stub;     // return 0 on a push into an empty queue, 1 if nonempty
}

template<class T>
T* FastQueue<T>::popwait ( ) {
    T* node;
    while (1 == (Integer)(node = pop ( )))
        Sleep (0);           // wait for nonbusy result
    return node;
}

template<class T>
T* FastQueue<T>::pop ( ) {
    T *tail1 = tail, *tail2 = tail->next;  // tail1 is last node, tail2 is 2nd to last
    if (tail1 == stub) {                   // if last node is the stub
        if (0 == tail2)
            return (T*)(head != tail);     // if no 2nd to last, then list empty, return 0/1
        tail1 = tail = tail2;              // new tail is the new last, lose the stub here
        tail2 = tail1->next;               // new next is the new 2nd to last
    }
    if (tail2) {
        tail = tail2;                      // if 2nd to last exists then return the tail
        // prefetch the next tail and return the current tail
        return tail1;                      // this is the usual return point from this function
    }
    if (tail1 == head) {                   // if consistent, and queue has one node,
        push (stub);                       // then push the stub to force two nodes.
        if (tail2 = tail1->next) {         // set next = 2nd to last, is it now nonzero?
            tail = tail2;                  // if so, then pull off the last node and return it
            return tail1;
        }
    }
    return (T*)(head != tail);             // otherwise the list is still empty or busy (0 or 1)
}

Codewarp

Sep 23, 2008, 2:48:59 PM
to Scalable Synchronization Algorithms
What a mess! Isn't there any way to post commented code in this
discussion?? It appears that 72 characters is all you can fit onto a
line.

Dmitriy V'jukov

Sep 26, 2008, 8:30:46 AM
to Scalable Synchronization Algorithms
On Sep 18, 8:01 am, Codewarp <ch...@megabasic.com> wrote:
> I have been studing your mpscq method and it appears to be the best
> alg for this operation that I have ever encountered--bravo.

Thank you.

> In
> particular, it would seem to have the best performance on longer
> chains, degrading only on the degenerate case of single node queues.

At least it's not worse than the 'stack with reversing'.


Dmitriy V'jukov

Dmitriy V'jukov

Sep 26, 2008, 8:48:06 AM
to Scalable Synchronization Algorithms
On Sep 20, 1:18 am, Codewarp <ch...@megabasic.com> wrote:

> Performance and scalability is the goal, but as you know, lock-free/
> wait-free can provide the means.

Lock-free/wait-free properties can be a *side-effect* of performance,
but not vice versa.
If you have a fast queue, you can't make it faster by "applying"
lock-freedom. And in some situations lock-based solutions are faster
than lock-free ones.


> It was precisely because of
> malloc( ) and its infatuation with critical-sections that lead to
> developing my own multi-threaded allocator.  It was truely
> dumbfounding to discover how slow and unacceptable malloc( ) is in the
> multi-threaded libraries.  Mine uses a per-thread allocator, but you
> can release undifferentiated blocks from other threads into the
> current allocator as they arise.  Local allocation/free is lock-free,
> wait-free, atomic-free.  Freeing blocks from other threads is lock-
> free and wait-free.

I can guess that your allocator uses a multi-producer/multi-consumer
queue to pass 'foreign' memory blocks back, right?

If so, you can consider using my 'memory-passing queue', it was
designed especially for that purpose:
http://groups.google.com/group/lock-free/browse_frm/thread/f6e41e5aebdf6dc2
It works like N spsc queues aggregated by 1 mpsc queue, so the mpsc
queue actually transfers spsc queues, not user items. And the spsc
queues transfer the user items.

I have an even dirtier design, which has even lower overheads. It's
3-layered. The first layer is an mpsc queue which transfers spsc queues.
The second layer is a node-based spsc queue which transfers freed memory
blocks. The third layer is an array-based spsc queue established
directly inside the freed memory blocks. They are already freed, so why
not reuse them? Basically freed memory blocks transfer other freed
memory blocks. I really like this design.



> As far as a totally lock-free system, our framework is completely lock-
> free, except for when it has nothing at all to do, in which case it
> blocks waiting for a signal that tasks are now available.  I have yet
> to find any way to eliminate this one wait from the system. All
> threading within this framework uses thread objects and logical events
> managed by the framemork, so termination from elsewhere is not an
> issue.

Well, then this queue is just not intended for your case. It's not a
lock-free queue, it's just a fast queue. And if you want to violently
terminate your threads, then you need a lock-free queue.


Dmitriy V'jukov

Dmitriy V'jukov

Sep 26, 2008, 8:56:51 AM
to Scalable Synchronization Algorithms
Provided correctly designed blocking/signaling logic, the consumer can't
deadlock. If the consumer deadlocks, then it's just a bug in the
blocking/signaling logic.

There are 2 options for the blocking/signaling logic.
1. The consumer waits for the 'next' link to be filled. After writing
the 'next' link, the producer executes a strong enough memory fence (or
just writes the 'next' link with an atomic XCHG), and then the producer
signals the consumer.
2. The consumer waits for the 'head' variable to be changed. The
producer doesn't have to execute another atomic RMW or a strong memory
fence. But if the consumer finds a 'gap' in the queue, it has to wait in
a spin-loop for the 'next' link to be filled.
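Option 2's consumer-side wait might be sketched like this (illustrative only; it assumes the node's next link is an atomic pointer, which is not necessarily how any queue in this thread declares it):

```cpp
#include <atomic>

struct Node {
    std::atomic<Node*> next{nullptr};
};

// Consumer-side sketch of option 2: the consumer has seen 'head' move,
// so a successor node must be coming. If 'next' is still null, we are
// in the tiny window between the producer's XCHG and its store to
// 'next', and we spin until that store becomes visible.
Node* wait_for_next(Node* current) {
    Node* n;
    while ((n = current->next.load(std::memory_order_acquire)) == nullptr)
        ;  // producer is mid-push; the wait is bounded by that window
    return n;
}
```

The spin is bounded by the producer's two-instruction window (plus any preemption in between), which is exactly the BUSY case discussed above.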

Dmitriy V'jukov

Dmitriy V'jukov

Sep 26, 2008, 9:01:21 AM
to Scalable Synchronization Algorithms
On Sep 23, 10:44 pm, Codewarp <ch...@megabasic.com> wrote:

> template<class T>
> class FastQueue : public SharedObjectEx {
>     T *head, *tail, *stub;

/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\

I hope you will not forget to insert cache-line padding between those
variables when you conduct your benchmarks. Otherwise it
will totally destroy scalability.
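Such padding can be sketched like this (assuming a 64-byte cache line, the common x86 size; the struct is illustrative, not the FastQueue template itself):

```cpp
#include <cstddef>

struct Node;  // payload type is illustrative

// Producer-written and consumer-only fields on separate cache lines,
// so the producers' XCHG traffic on 'head' doesn't keep invalidating
// the line holding the consumer's 'tail' cursor (false sharing).
struct PaddedQueueVars {
    alignas(64) Node* head;  // XCHG target, hammered by all producers
    alignas(64) Node* tail;  // consumer-private cursor
    alignas(64) Node* stub;
};

static_assert(sizeof(PaddedQueueVars) >= 3 * 64,
              "each field occupies its own cache line");
```

C++17 code can use `std::hardware_destructive_interference_size` instead of the hard-coded 64, where the implementation provides it.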


Dmitriy V'jukov

Dmitriy V'jukov

Sep 26, 2008, 9:07:54 AM
to Scalable Synchronization Algorithms
On Sep 23, 10:44 pm, Codewarp <ch...@megabasic.com> wrote:

Now about the most interesting question: why I call this queue
blocking.

Consider the following piece of code:

template<class T>
T* FastQueue<T>::popwait ( ) {
    T* node;
    while (1 == (Integer)(node = pop ( )))
        Sleep (0);   // wait for nonbusy result
    return node;
}


Now consider a piece of code from a spin-mutex based queue:

T queue::pop()
{
    while (ATOMIC_XCHG(lock, 1))
        Sleep(0);
    T result = pop_impl();
    STORE_RELEASE(lock, 0);
    return result;
}


See? It's actually the same. The code is just a bit rearranged. That's
all. No more difference.
So my queue behaves exactly like a mutex-based queue, so it's blocking.
Well, it's blocking from the application's point of view. Formally it
just returns 0. But that doesn't make any difference for the
application, which has to wait in a spin-loop anyway.


Dmitriy V'jukov

Codewarp

Sep 27, 2008, 12:29:44 AM
to Scalable Synchronization Algorithms
On Sep 26, 5:48 am, "Dmitriy V'jukov" <dvyu...@gmail.com> wrote:
> On Sep 20, 1:18 am, Codewarp <ch...@megabasic.com> wrote:
>
> > Performance and scalability is the goal, but as you know, lock-free/
> > wait-free can provide the means.
>
> Lock-free/wait-free properties can be *side-effect* of performance.
> But not vise versa.
> If you have fast queue, you can't make it faster by "applying" lock-
> free. And in some situations lock-based solutions are faster than lock-
> free.
>
> > It was precisely because of
> > malloc( ) and its infatuation with critical-sections that lead to
> > developing my own multi-threaded allocator.  It was truely
> > dumbfounding to discover how slow and unacceptable malloc( ) is in the
> > multi-threaded libraries.  Mine uses a per-thread allocator, but you
> > can release undifferentiated blocks from other threads into the
> > current allocator as they arise.  Local allocation/free is lock-free,
> > wait-free, atomic-free.  Freeing blocks from other threads is lock-
> > free and wait-free.

> I can guess that your allocator uses multi-producer/multi-consumer
> queue to pass 'foreign' memory blocks back, right?

No, I use a multi-producer, single-consumer stack. The consumer is
the memory allocation request. It takes the entire stack with an
XCHG, then pops and deletes all its nodes in single-threaded bliss,
before proceeding on to the allocation. An allocation uses a normal
memory read to check whether this free stack is nonempty, so no atomic
RMWs, and it's normally found in the cache.
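That free-stack check might look like this (a sketch under stated assumptions; `reclaim_foreign` and the per-thread free step it comments out are hypothetical names, not the poster's allocator):

```cpp
#include <atomic>

struct Block { Block* next; };

// Foreign frees: other threads push blocks here; the owning thread is
// the single consumer.
std::atomic<Block*> foreign_frees{nullptr};

// Called at the top of an allocation: a plain (normally cache-hit) read
// on the fast path, one XCHG only when there is something to reclaim.
int reclaim_foreign() {
    if (foreign_frees.load(std::memory_order_relaxed) == nullptr)
        return 0;  // usual case: one cached read, no atomic RMW at all
    Block* list = foreign_frees.exchange(nullptr, std::memory_order_acquire);
    int reclaimed = 0;
    for (Block* b = list; b != nullptr; ) {
        Block* next = b->next;
        // a hypothetical free_local(b) would return b to the per-thread heap here
        ++reclaimed;
        b = next;
    }
    return reclaimed;
}
```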

> > As far as a totally lock-free system, our framework is completely lock-
> > free, except for when it has nothing at all to do, in which case it
> > blocks waiting for a signal that tasks are now available.  I have yet
> > to find any way to eliminate this one wait from the system. All
> > threading within this framework uses thread objects and logical events
> > managed by the framemork, so termination from elsewhere is not an
> > issue.
>
> Well, then this queue is just not intended for your case. It's not
> lock-free queue, it's just fast queue. And if you want to violently
> terminate your threads, then you need lock-free queue.

Yes, but this "blocking" state is unlike normal blocking, which can
block indefinitely. This blocking state cannot be extended by any
means, is extremely rare, and side-band tasks can be executed during
BUSY retry states where that is desired. It also appears that only one
call to Sleep(0) is necessary, and on MP systems a hard spin of 30
iterations could be enough to kick it over and avoid the Sleep(0)
altogether. I will be putting together a worst-case scenario with 4
processors putting maximum pressure on the consumer, to measure the
ratio of BUSY states to transactions processed. Past experience leads
me to expect something like 1 BUSY per 1000-2000 non-busy, but actual
testing will really tell.

Dmitriy V'jukov

Sep 29, 2008, 5:06:04 AM
to Scalable Synchronization Algorithms
On Sep 27, 8:29 am, Codewarp <ch...@megabasic.com> wrote:

> > I can guess that your allocator uses multi-producer/multi-consumer
> > queue to pass 'foreign' memory blocks back, right?
>
> No, I use a multi-producer, single consumer stack.  The consumer is
> the memory allocation request.  It takes the entire stack with an
> XCHG, then pops and deletes all its nodes in single-threaded bliss,
> before proceeding on to the allocation.  An allocation uses a normal
> memory read to check for this free stack being nonempty, so no atomic
> RMWs, and it's normally found in the cache.

Oh! I meant a 'multi-producer/*single*-consumer queue'.
My message-passing queue was designed exactly to improve on such a
design.

Dmitriy V'jukov

Dmitriy V'jukov

Sep 29, 2008, 5:08:50 AM
to Scalable Synchronization Algorithms
On Sep 27, 8:29 am, Codewarp <ch...@megabasic.com> wrote:

> > Well, then this queue is just not intended for your case. It's not
> > lock-free queue, it's just fast queue. And if you want to violently
> > terminate your threads, then you need lock-free queue.
>
> Yes, but this "blocking" state is unlike normal blocking, which can
> block indefinitely.  This blocking state cannot be extended by any
> means, is extremely rare, and side-band tasks can be executed during
> BUSY retry states where that is desired.  It also appears the only one
> call to Sleep(0) is necessary, and in MP systems a hard spin for a 30
> iterations could be enough to kick it over, and avoid the Sleep(0)
> altogether.  I will be putting together a worst-case scenerio with 4
> processors putting maximum pressure on the consumer to measure the
> ratio of BUSY states to transactions processed.  Past experience leads
> me to expect something like 1 BUSY per 1000-2000 non-busy, but actual
> testing will really tell.


I can apply all those arguments to mutexes.
If the mutex-protected region is small, then blocking will be extremely
rare too.
Mutex-based blocking can't be extended by any means either.
Side-band tasks can be executed too, when you get a failure from the
try_lock function.
Etc.


Dmitriy V'jukov

Chris Cochran

Dec 26, 2008, 9:04:20 PM
to lock...@googlegroups.com
Dmitriy,
 
Sorry for my disappearing act.  Among other things, before I could evaluate your algorithm properly, I needed to restructure my reversing-stack methods and usage, for easy switching between that method and yours.  I have done that and achieved correctness using the new method, with no lost wakeups or other obvious regressions.  When I get some initial performance results, I will post them on your discussion board.
 
Although your mpscq method demands that applications make no TerminateThread( ) or SuspendThread( ) calls between threads, that is an acceptable limitation for my framework and most situations.  The reversing-stack method can lose performance because long reversals tend to displace the L1 working set.  Your method has no such loss.
 
Chris Cochran
 

Dmitriy V'jukov

Jan 16, 2009, 4:29:50 AM
to Scalable Synchronization Algorithms
On Dec 27, 2008, 05:04, "Chris Cochran" <ch...@megabasic.com> wrote:
> Dmitriy,
>
> Sorry for my disappearing act.  Among other things, before I could evaluate
> your algorithm properly, I needed to restructure my reversing-stack methods
> and usage, for easy switching between that method and your's.  I have done
> that and achieved correctness using the new method, with no lost wakeups or
> other obvious digressions.  When I get some initial performance results, I
> will post them on your discussion board.


I will appreciate it. It's always interesting to see some real
performance numbers for different approaches.


> Although your mpscq method demands that applications make no
> TerminateThread( ) or SuspendThread( ) calls between threads, that is an
> acceptable limitation for my framework and most situations.  The reversing
> stack method can lose performance because long reversals tend to displace
> the L1 working set.  Your method has no such loss.

Also it seems that XCHG has somewhat better performance than CAS on
x86, because XCHG instantly fetches the cache line in Modified state,
while with CAS one usually has to fetch the cache line in Shared state
first and only then in Modified state. Although, the performance
difference I observed was not as big as I was expecting.

--
Dmitriy V'jukov