another non-blocking multi-producer/single-consumer fifo...

Chris Thomasson

unread,

Apr 16, 2008, 10:38:53 PM4/16/08

to

This experimental FIFO has lock-free producers, and atomic/membar/wait-free
consumers. I can't really notice any race-conditions right off the bat. I
need to study some more to see if the producers can suffer from ABA. As of
now, I don't think that it can because there is only a single consumer. I
also think that there won't be a problem with destroying nodes right after
they are consumed because there is only a single consumer and there is
always a dummy node which acts as an offset. Anyway, here is the current
algorithm:
________________________________________________________________
struct node {
node* next;
void* state;
};

struct mpscq {
node* head;
node* tail;
};

void
mpscq_create(
mpscq* const _this,
node* const dummy
) {
dummy->next = NULL;
_this->head = _this->tail = dummy;
}

node*
mpscq_destroy(
mpscq* const _this
) {
return _this->head;
}

void
mpscq_push(
mpscq* _this,
node* const n
) {
node* next = NULL;
node* tail = _this->tail;
n->next = NULL;
#StoreStore;
/* CAS automatically updates comprand on failure */
while (! CAS(&tail->next, &next, n)) {
node* const ctail = _this->tail;
tail = (tail != ctail) ? ctail : next;
next = NULL;
}
#StoreStore;
CAS(&_this->tail, &tail, n);
}

node*
mpscq_pop(
mpscq* _this
) {
node* const head = _this->head;
node const* const next = head->next;
if (next) {
head->state = next->state;
_this->head = next;
return head;
}
return NULL;
}
________________________________________________________________

See any potential problems?

Thanks.

--
Chris M. Thomasson
http://appcore.home.comcast.net

Dmitriy V'jukov

unread,

Apr 17, 2008, 2:14:43 PM4/17/08

to

On Apr 17, 6:38 am, "Chris Thomasson" <cris...@comcast.net> wrote:

> See any potential problems?

> I
> also think that there won't be a problem with destroying nodes right after
> they are consumed because there is only a single consumer and there is
> always a dummy node which acts as an offset

Assume following situation.

Queue size is 1 (only fake node).
Producer loads tail pointer. Producer blocks.
Other producers push several nodes, and consumer pops all those nodes.
At this point tail pointer remembered by producer points to nowhere.
Producer wake-ups and accesses tail->next in CAS. Bang!

Also I think that this CAS:
CAS(&_this->tail, &tail, n);
is susceptible to ABA, because both nodes can be already consumed and
reused.

If consumer doesn't set next pointer of nodes to NULL before quiescent
state, then this CAS is *not* susceptible to ABA:

while (! CAS(&tail->next, &next, n)) {

Imho.

So, I think, you still have to use PDR with this algo.

Dmitriy V'jukov

Chris Thomasson

unread,

Apr 17, 2008, 3:02:57 PM4/17/08

to

"Dmitriy V'jukov" <dvy...@gmail.com> wrote in message
news:bbfcdb81-522d-4ddf...@l42g2000hsc.googlegroups.com...

> On Apr 17, 6:38 am, "Chris Thomasson" <cris...@comcast.net> wrote:
>
>> See any potential problems?
>
>> I
>> also think that there won't be a problem with destroying nodes right
>> after
>> they are consumed because there is only a single consumer and there is
>> always a dummy node which acts as an offset
>
>
> Assume following situation.
>
> Queue size is 1 (only fake node).
> Producer loads tail pointer. Producer blocks.
> Other producers push several nodes, and consumer pops all those nodes.
> At this point tail pointer remembered by producer points to nowhere.
> Producer wake-ups and accesses tail->next in CAS. Bang!

Yup, it needs PDR.

> Also I think that this CAS:
> CAS(&_this->tail, &tail, n);
> is susceptible to ABA, because both nodes can be already consumed and
> reused.
>
>
> If consumer doesn't set next pointer of nodes to NULL before quiescent
> state, then this CAS is *not* susceptible to ABA:
> while (! CAS(&tail->next, &next, n)) {
>
> Imho.
>
>
> So, I think, you still have to use PDR with this algo.

Indeed.

Chris Thomasson

unread,

Apr 17, 2008, 3:14:57 PM4/17/08

to

"Chris Thomasson" <cri...@comcast.net> wrote in message
news:2YmdnYqLOL4bA5rV...@comcast.com...

> "Dmitriy V'jukov" <dvy...@gmail.com> wrote in message
> news:bbfcdb81-522d-4ddf...@l42g2000hsc.googlegroups.com...
>> On Apr 17, 6:38 am, "Chris Thomasson" <cris...@comcast.net> wrote:
>>
>>> See any potential problems?
>>
>>> I
>>> also think that there won't be a problem with destroying nodes right
>>> after
>>> they are consumed because there is only a single consumer and there is
>>> always a dummy node which acts as an offset
>>
>>
>> Assume following situation.
>>
>> Queue size is 1 (only fake node).
>> Producer loads tail pointer. Producer blocks.
>> Other producers push several nodes, and consumer pops all those nodes.
>> At this point tail pointer remembered by producer points to nowhere.
>> Producer wake-ups and accesses tail->next in CAS. Bang!
>
> Yup, it needs PDR.

[...]

The algorithm works fine with PDR, and only the producers would need to
acquire proxy regions, which makes then too darn expensive; oh well. AFAICT,
the one of the most efficient non-distributed multi-producer/multi-consumer
LIFO/FIFO hybrid would have to be:

<code-sketch of ABA-free non-blocking lifo w/ wait-free consumer>
__________________________________________________________________
struct node {
node* nx;
};

struct nblifo {
node* hd; /* initialize to NULL */
};

void
nblifo_produce(
nblifo* const _this,
node* const n
) {
n->nx = _this->hd;
while (! CAS(&_this->hd, &n->nx, n));
}

node*
nblifo_swap(
nblifo* const _this,
node* const n
) {
return SWAP(&_this->hd, n);
}

#define nblifo_consume(mp_this) nblifo_swap((mp_this), NULL)
__________________________________________________________________

<code-sketch of ABA-free non-blocking fifo w/ wait-free consumer>
__________________________________________________________________
typedef nblifo nbfifo;

node*
nbfifo_sys_transform(
node* _this
) {
if (_this) {
node* fifo = NULL;
while (_this) {
node* const n = _this;
_this = n->nx;
n->nx = fifo;
fifo = n;
}
return fifo;
}
return NULL;
}

#define nbfifo_produce nblifo_produce

#define nbfifo_swap(mp_this, mp_n) \
nbfifo_sys_transform(nblifo_swap((mp_this), (mp_n)))

#define nbfifo_consume(mp_this) nbfifo_swap((mp_this), NULL)
__________________________________________________________________

That setup does not need PDR, and is totally immune to ABA. Plus it grabs
nodes in batches using wait-free SWAP. Can you think of a mpmc lifo/fifo
thing that would destroy that? Without using distribution techniques, I
can't think really think of anything off-hand that would beat it. The one
thing that bugs be about the above technique is "reversing" the LIFO to
render a FIFO order. vZOOM provides this exact same thing, and I have not
received any complaints, but it still bugs be a bit. I worry about large
batches of items.

Dmitriy V'jukov

unread,

Apr 18, 2008, 2:58:23 AM4/18/08

to

On 17 апр, 23:14, "Chris Thomasson" <cris...@comcast.net> wrote:

> void
> nblifo_produce(
> nblifo* const _this,
> node* const n
> ) {
> n->nx = _this->hd;
> while (! CAS(&_this->hd, &n->nx, n));
> }

API can be extended to support production of node batches:

void
nblifo_produce_batch(
nblifo* const _this,
node* const head,
node* const tail
) {
tail->nx = _this->hd;
while (! CAS(&_this->hd, &tail->nx, head));
}

Btw, note that in mpmc setup consumption of nodes is not lock-free,
it's blocking. Assume that one consumer grab batch of nodes and then
blocked. Those nodes unavailable for other consumers. So blocking of
consumer can stop system-wide forward progress.
In mpsc setup consumption is wait-free.

> The one thing that bugs be about the above technique is "reversing" the LIFO to render a FIFO order.

My benchmark of spsc queues have showed that queue which reverse order
of nodes scales badly provided large queue length:
http://groups.google.ru/group/comp.programming.threads/msg/ad6a5790cfa90718
Imo, reverse operation accounts for bad scaling.

Dmitriy V'jukov

unread,

Apr 18, 2008, 10:45:40 AM4/18/08

to

On Apr 17, 11:14 pm, "Chris Thomasson" <cris...@comcast.net> wrote:

> That setup does not need PDR, and is totally immune to ABA. Plus it grabs
> nodes in batches using wait-free SWAP. Can you think of a mpmc lifo/fifo
> thing that would destroy that?

It looks like challenge!
Ok, let's destroy that silly stuff :)

struct mpscq_node_t
{
mpscq_node_t* volatile next;
void* state;
};

struct mpscq_t
{
mpscq_node_t* volatile head;
mpscq_node_t* tail;
};

void mpscq_create(mpscq_t* self, mpscq_node_t* stub)
{
stub->next = 0;
self->head = stub;
self->tail = stub;
}

void mpscq_push(mpscq_t* self, mpscq_node_t* n)
{
n->next = 0;
mpscq_node_t* prev = XCHG(&self->head, n); // serialization-point
wrt producers
prev->next = n; // serialization-point wrt consumer
}

mpscq_node_t* mpscq_pop(mpscq_t* self)
{
mpscq_node_t* tail = self->tail;
mpscq_node_t* next = tail->next; // serialization-point wrt
producers
if (next)
{
self->tail = next;
tail->state = next->state;
return tail;
}
return 0;
}

-------------------------------------------------------------
Here is intrusive version:
-------------------------------------------------------------

struct mpscq_node_t
{
mpscq_node_t* volatile next;
};

struct mpscq_t
{
mpscq_node_t* volatile head;
mpscq_node_t* tail;
mpscq_node_t stub;
};

#define MPSCQ_STATIC_INIT(self) {&self.stub, &self.stub, {0}}

void mpscq_create(mpscq_t* self)
{
self->head = &self->stub;
self->tail = &self->stub;
self->stub.next = 0;
}

void mpscq_push(mpscq_t* self, mpscq_node_t* n)
{
n->next = 0;
mpscq_node_t* prev = XCHG(&self->head, n);
//(*)
prev->next = n;
}

mpscq_node_t* mpscq_pop(mpscq_t* self)
{
mpscq_node_t* tail = self->tail;
mpscq_node_t* next = tail->next;
if (tail == &self->stub)
{
if (0 == next)
return 0;
self->tail = next;
tail = next;
next = next->next;
}
if (next)
{
self->tail = next;
return tail;
}
mpscq_node_t* head = self->head;
if (tail != head)
return 0;
mpscq_push(self, &self->stub);
next = tail->next;
if (next)
{
self->tail = next;
return tail;
}
return 0;
}

Disadvantages:
- Push function is blocking wrt consumer. I.e. if producer blocked in
(*), then consumer is blocked too. Fortunately 'window of
inconsistency' is extremely small - producer must be blocked exactly
in (*). Actually it's disadvantage only as compared with totally lock-
free algorithm. It's still much better lock-based algorithm.

Advantages:
+ Wait-free and fast producers. One XCHG is maximum what one can get
with multi-producer non-distributed queue.
+ Extremely fast consumer. On fast-path it's atomic-free, XCHG
executed per node batch, in order to grab 'last item'.
+ No need for node order reversion. So pop operation is always "O(1)".
+ Intrusive. No need for additional internal nodes.
+ ABA-free.
+ No need for PDR!

WOW!

I am ready to suffer the disadvantage provided all advantages. No need
for PDR means that one can use this algorithm out-of-the-box. No need
for thread registration/deregistration, periodic activity, deferred
garbage etc.

Dmitriy V'jukov

unread,

Apr 18, 2008, 1:15:36 PM4/18/08

to

On Apr 18, 6:45 pm, "Dmitriy V'jukov" <dvyu...@gmail.com> wrote:

> -------------------------------------------------------------
> Here is intrusive version:
> -------------------------------------------------------------

Here is another version. It doesn't use stub node, but uses CAS. I
don't know which is better. They really very similar.

struct mpscq_node_t
{
mpscq_node_t* volatile next;
};

struct mpscq_t
{
mpscq_node_t* volatile head;

mpscq_node_t tail;
};

#define MPSCQ_STATIC_INIT(self) {&self.tail, 0}

void mpscq_create(mpscq_t* self)
{
self->head = &self->tail;
self->tail.next = 0;
}

void mpscq_push(mpscq_t* self, mpscq_node_t* n)
{
n->next = 0;
mpscq_node_t* prev = XCHG(&self->head, n);

prev->next = n;
}

mpscq_node_t* mpscq_pop(mpscq_t* self)
{
mpscq_node_t* tail = self->tail.next;
if (0 == tail)
return 0;

mpscq_node_t* next = tail->next;

if (next)
{
self->tail.next = next;

return tail;
}
mpscq_node_t* head = self->head;
if (tail != head)
return 0;

self->tail.next = 0;
if (CAS(&self->head, head, &self->tail))
{
return tail;

}
next = tail->next;
if (next)
{

self->tail.next = next;
return tail;
}
return 0;
}

Dmitriy V'jukov

Chris Thomasson

unread,

Apr 18, 2008, 2:46:07 PM4/18/08

to

"Dmitriy V'jukov" <dvy...@gmail.com> wrote in message

news:7fc4709a-138b-4bfb...@m71g2000hse.googlegroups.com...

On 17 апр, 23:14, "Chris Thomasson" <cris...@comcast.net> wrote:

> > void
> > nblifo_produce(
> > nblifo* const _this,
> > node* const n
> > ) {
> > n->nx = _this->hd;
> > while (! CAS(&_this->hd, &n->nx, n));
> > }

> API can be extended to support production of node batches:

> void
> nblifo_produce_batch(
> nblifo* const _this,
> node* const head,
> node* const tail
> ) {
> tail->nx = _this->hd;
> while (! CAS(&_this->hd, &tail->nx, head));
> }

> Btw, note that in mpmc setup consumption of nodes is not lock-free,
> it's blocking. Assume that one consumer grab batch of nodes and then
> blocked.

The usage pattern would be consumer grabs nodes, process nodes, and tries to
grab more.

> Those nodes unavailable for other consumers.

The batching does reduce even work distribution.

> So blocking of consumer can stop system-wide forward progress.

Perhaps I am misunderstanding you... If consumer A grabs nodes, and consumer
B is waiting on the eventcount. How does the blocking of consumer A prevent
consumer B from grabbing any subsequently "produced" nodes?

> In mpsc setup consumption is wait-free.

> The one thing that bugs be about the above technique is "reversing" the
> LIFO to render a FIFO order.

> My benchmark of spsc queues have showed that queue which reverse order
> of nodes scales badly provided large queue length:
> http://groups.google.ru/group/comp.programming.threads/msg/ad6a5790cfa90718
> Imo, reverse operation accounts for bad scaling.

That's what I was thinking. It's kind of like processing the entire node
batch twice... One time to convert it to FIFO, and the other would be the
application finally traversing it.

Chris Thomasson

unread,

Apr 18, 2008, 3:02:32 PM4/18/08

to

"Dmitriy V'jukov" <dvy...@gmail.com> wrote in message

news:1009a3c6-b7c3-46dc...@w8g2000prd.googlegroups.com...

> On Apr 17, 11:14 pm, "Chris Thomasson" <cris...@comcast.net> wrote:
>
>> That setup does not need PDR, and is totally immune to ABA. Plus it grabs
>> nodes in batches using wait-free SWAP. Can you think of a mpmc lifo/fifo
>> thing that would destroy that?
>
>
> It looks like challenge!
> Ok, let's destroy that silly stuff :)

:^D

[...]

>
> Disadvantages:
> - Push function is blocking wrt consumer. I.e. if producer blocked in
> (*), then consumer is blocked too. Fortunately 'window of
> inconsistency' is extremely small - producer must be blocked exactly
> in (*). Actually it's disadvantage only as compared with totally lock-
> free algorithm. It's still much better lock-based algorithm.

Humm. I must be totally misunderstanding you here... If the producer is
blocked in (*) that means that it has not completed its push operation as a
whole. This would not block consumers. It only means that an item has not
been produced because the producer has not completed its push operation
yet... A consumer cannot grab an item if the producer is has not completed
it's job. See, when you say the consumer is blocked, I immediately looked
for a spin-loop in the pop function.

Perhaps you are writing about the fact that producers need to execute
multiple steps in order to completely render an item visible to the
consumer. If producer gets pre-empted in the middle of one of those steps,
then according to a consumer, nothing has been produced, even though an item
is "logically" there. I personally would not say that's blocking. Think
about it, any consumer always waits for a producer to push something, no
matter what algorithm you use...

Could you clarify a bit more? Thanks Dmitriy.

> Advantages:
> + Wait-free and fast producers. One XCHG is maximum what one can get
> with multi-producer non-distributed queue.
> + Extremely fast consumer. On fast-path it's atomic-free, XCHG
> executed per node batch, in order to grab 'last item'.
> + No need for node order reversion. So pop operation is always "O(1)".
> + Intrusive. No need for additional internal nodes.
> + ABA-free.
> + No need for PDR!
>
> WOW!

I need to study the algorithm, but my initial reaction, was... Wow!

:^)

> I am ready to suffer the disadvantage provided all advantages. No need
> for PDR means that one can use this algorithm out-of-the-box. No need
> for thread registration/deregistration, periodic activity, deferred
> garbage etc.

Yeah. Using PDR for things like stack/queue communication channels is not
good.

Chris Thomasson

unread,

Apr 18, 2008, 3:04:59 PM4/18/08

to

"Dmitriy V'jukov" <dvy...@gmail.com> wrote in message

news:98458be4-cc80-4951...@b9g2000prh.googlegroups.com...

> On Apr 18, 6:45 pm, "Dmitriy V'jukov" <dvyu...@gmail.com> wrote:
>
>> -------------------------------------------------------------
>> Here is intrusive version:
>> -------------------------------------------------------------
>
>
> Here is another version. It doesn't use stub node, but uses CAS. I
> don't know which is better. They really very similar.

[...]

I believe that the first one could end up beating this one. Even though the
CAS operation in this algorithm is wait-free in that there is no looping,
well, CAS is interlocked RMW which can be expensive when compared to naked
operations...

Chris Thomasson

unread,

Apr 19, 2008, 8:16:18 PM4/19/08

to

"Dmitriy V'jukov" <dvy...@gmail.com> wrote in message

news:98458be4-cc80-4951...@b9g2000prh.googlegroups.com...

> On Apr 18, 6:45 pm, "Dmitriy V'jukov" <dvyu...@gmail.com> wrote:
>
>> -------------------------------------------------------------
>> Here is intrusive version:
>> -------------------------------------------------------------
>
>
> Here is another version. It doesn't use stub node, but uses CAS. I
> don't know which is better. They really very similar.

[...]

What do you think of the following queue pseudo-code:

http://groups.google.com/group/comp.programming.threads/msg/1dbf2562b7b9796d

?

Chris Thomasson

unread,

Apr 19, 2008, 8:52:23 PM4/19/08

to

"Chris Thomasson" <cri...@comcast.net> wrote in message

news:UfqdnWfW7_92F5fV...@comcast.com...

Whoops! Never mind. I forgot that was only single-producer/consumer.

Dmitriy V'jukov

unread,

Apr 20, 2008, 4:05:25 AM4/20/08

to

On 20 апр, 04:16, "Chris Thomasson" <cris...@comcast.net> wrote:

> > Here is another version. It doesn't use stub node, but uses CAS. I
> > don't know which is better. They really very similar.
>
> [...]
>
> What do you think of the following queue pseudo-code:
>

> http://groups.google.com/group/comp.programming.threads/msg/1dbf2562b...
>
> ?

It seems that the code is a bit incomplete... I don't see where 'next'
link modified in push()... Why new node is allocated in pop()...

Dmitriy V'jukov

unread,

Apr 23, 2008, 3:50:32 PM4/23/08

to

On Apr 18, 10:46 pm, "Chris Thomasson" <cris...@comcast.net> wrote:

> > So blocking of consumer can stop system-wide forward progress.
>
> Perhaps I am misunderstanding you... If consumer A grabs nodes, and consumer
> B is waiting on the eventcount. How does the blocking of consumer A prevent
> consumer B from grabbing any subsequently "produced" nodes?

Batching of nodes is implementation detail. Usually batching as-is is
not needed for end user. But batching can produce some unpleasant
effect for end user.

Consider following example:
4 processor system. Queue is empty.
Producer produces 4 work items.
Consumer 1 grabs all 4 items. Consumer 1 is blocked by OS.
Other consumers are unable to get work items. System stands idle.

Actually consumer 1 can be running. But other consumers are still
unable to get work items. So only 1 processor will be loaded. Others
are idle.

Further, consumer can't control batch size, he can grab 1 item, 10
items, 100 items or 1000 items. He just grabs all items which are
situated in queue now.

In single consumer setup the negative effect is not showed.

Dmitriy V'jukov

unread,

Apr 23, 2008, 4:00:44 PM4/23/08

to

On Apr 18, 11:02 pm, "Chris Thomasson" <cris...@comcast.net> wrote:
>
> > Disadvantages:
> > - Push function is blocking wrt consumer. I.e. if producer blocked in
> > (*), then consumer is blocked too. Fortunately 'window of
> > inconsistency' is extremely small - producer must be blocked exactly
> > in (*). Actually it's disadvantage only as compared with totally lock-
> > free algorithm. It's still much better lock-based algorithm.
>
> Humm. I must be totally misunderstanding you here... If the producer is
> blocked in (*) that means that it has not completed its push operation as a
> whole. This would not block consumers. It only means that an item has not
> been produced because the producer has not completed its push operation
> yet... A consumer cannot grab an item if the producer is has not completed
> it's job. See, when you say the consumer is blocked, I immediately looked
> for a spin-loop in the pop function.
>
> Perhaps you are writing about the fact that producers need to execute
> multiple steps in order to completely render an item visible to the
> consumer. If producer gets pre-empted in the middle of one of those steps,
> then according to a consumer, nothing has been produced, even though an item
> is "logically" there. I personally would not say that's blocking. Think
> about it, any consumer always waits for a producer to push something, no
> matter what algorithm you use...
>
> Could you clarify a bit more? Thanks Dmitriy.

If the producer is blocked in (*) that means that it has not completed

its push operation as a whole. Right. Then 10 producers can fully
complete push() operation. But consumer still will be blocked waiting
for the first producer to complete. Because of the fifo order
requirement.
I care not about item produced by blocked producer. I care about items
subsequently produced by other producers.

Dmitriy V'jukov

unread,

Apr 23, 2008, 4:05:51 PM4/23/08

to

On Apr 18, 11:02 pm, "Chris Thomasson" <cris...@comcast.net> wrote:

> > Advantages:
> > + Wait-free and fast producers. One XCHG is maximum what one can get
> > with multi-producer non-distributed queue.
> > + Extremely fast consumer. On fast-path it's atomic-free, XCHG
> > executed per node batch, in order to grab 'last item'.
> > + No need for node order reversion. So pop operation is always "O(1)".
> > + Intrusive. No need for additional internal nodes.
> > + ABA-free.
> > + No need for PDR!
>
> > WOW!
>
> I need to study the algorithm, but my initial reaction, was... Wow!

I've made some benchmarks. This algorithm kicks a$$ in intrusive and
non-intrusive versions.
But unfortunately it doesn't give answer to the problem of creation of
ideal non-intrusive spsc queue...

Dmitriy V'jukov