Lock-free bounded fifo-queue on top of vector


Dmitriy Vyukov

Feb 8, 2007, 12:59:10 PM
To the best of my knowledge, all of today's lock-free structures are built
on top of dynamically linked structures, with some kind of nodes and
links between the nodes.

I have implemented a lock-free bounded FIFO queue on top of a vector (no
nodes and no dynamic links) (source code below).
My thoughts on this:
1. No nodes -> no dynamic memory allocation (not even from a lock-free
freelist) -> fewer CASes -> less contention -> faster
2. No nodes -> no problems with node reuse -> no ABA
3. No nodes -> less memory consumption
4. In the latest Doherty/Herlihy/Luchangco/Moir FIFO queue implementation
(http://www.cs.brown.edu/~mph/DohertyHLM04/ft_gateway.cfm.pdf) I count
about 5 CASes in the enqueue function and 5 CASes in the dequeue function.
My implementation contains 2 CASes in enqueue and 2 CASes in dequeue.
And one of the CASes is not contended (its result is ignored). I think
my implementation should be fast enough.

Questions to discuss:
1. Are my thoughts right?
2. Are lock-free structures on top of a vector promising? (if they can
be built at all :) )
3. Is the implementation correct?
4. Is the implementation fast? (I cannot run it on a multiprocessor machine)
5. Are there any drawbacks in comparison with today's "linked"
structures?
6. Any other thoughts on this?


Some comments on the code: queue size = 32768; a stored value must not be
equal to 0; the 2 least significant bits of the value are stolen for an
'epoch number'.
There is some relatively small possibility that a new value is enqueued in
the wrong place in the buffer, and some relatively small possibility that a
value is dequeued from the wrong place in the buffer. But this does not
lead to value duplication or value loss; at most it may lead to wrong
order of value consumption.
The implementation may be slightly crude - I coded it today...
Source code:

// Requires <windows.h> for the Interlocked intrinsics and <cstring> for
// memset; the (unsigned int) casts assume 32-bit pointers.
class bounded_queue
{
    static const unsigned int buf_size     = 0x00008000; // 32768 slots
    static const unsigned int pos_mask     = 0x00007FFF; // slot index bits of a position
    static const unsigned int epoch_mask   = 0x00018000; // epoch bits of a position
    static const unsigned int epoch_mask2  = 0x00000003; // epoch bits stored in a slot
    static const unsigned int epoch_offset = 15;

    void* volatile* buf;
    volatile unsigned int push_pos;
    volatile unsigned int pop_pos;

    bounded_queue(const bounded_queue&);
    bounded_queue& operator = (const bounded_queue&);

public:
    bounded_queue()
        : buf(new void*[buf_size])
        , push_pos()
        , pop_pos()
    {
        memset((void*)buf, 0, buf_size * sizeof(void*));
    }

    ~bounded_queue()
    {
        delete[] buf;
    }

    bool enqueue(void* value)
    {
        while (true)
        {
            unsigned int const raw_pos = push_pos;
            if (raw_pos == pop_pos + buf_size)
                return false; // full
            unsigned int const pos   = raw_pos & pos_mask;
            unsigned int const epoch = (raw_pos & epoch_mask) >> epoch_offset;
            // Claim the slot: an empty slot holds only the current epoch number.
            bool const success = (void*)epoch ==
                InterlockedCompareExchangePointer(
                    (void* volatile*)&buf[pos],
                    (void*)((unsigned int)value | epoch),
                    (void*)epoch);
            // Advance the shared position; this CAS's result is deliberately ignored.
            InterlockedCompareExchange((volatile long*)&push_pos, raw_pos + 1, raw_pos);
            if (success)
                return true;
        }
    }

    bool dequeue(void*& value)
    {
        while (true)
        {
            unsigned int const raw_pos = pop_pos;
            if (raw_pos == push_pos)
                return false; // empty
            unsigned int const pos   = raw_pos & pos_mask;
            unsigned int const epoch = (raw_pos & epoch_mask) >> epoch_offset;
            value = buf[pos];
            // Retry if the slot holds a stale epoch or no value yet.
            if (((unsigned int)value & epoch_mask2) != epoch ||
                ((unsigned int)value & ~epoch_mask2) == 0)
                continue;
            // Release the slot: leave the next epoch number in its place.
            bool const success = value == InterlockedCompareExchangePointer(
                (void* volatile*)&buf[pos],
                (void*)((epoch + 1) & epoch_mask2),
                value);
            InterlockedCompareExchange((volatile long*)&pop_pos, raw_pos + 1, raw_pos);
            if (success)
            {
                // Strip the stolen epoch bits so the caller gets the original value.
                value = (void*)((unsigned int)value & ~epoch_mask2);
                return true;
            }
        }
    }
};


sincerely yours, Dmitriy Vyukov
p.s. sorry for bad english :)

Chris Thomasson

Feb 8, 2007, 3:13:08 PM
"Dmitriy Vyukov" <dvy...@gmail.com> wrote in message
news:1170957550.7...@v45g2000cwv.googlegroups.com...

> To the best of my knowledge, all todays lock-free structures are build
> on top of dynamic linked structures, with some kind of nodes, and
> links between nodes.

> I realize lock-free bounded fifo-queue on top of vector (no nodes and
> dynamic links) (source code below).

I have not examined your code in any detail yet, but you're getting your
nodes from an array and replacing the 'pointer-as-a-link' logic with
'index-as-a-link'. You can cram a lot more stuff into a CAS when you use
the offset trick:

http://groups.google.com/group/comp.programming.threads/browse_frm/thread/133f764f1bc0c9f4

http://groups.google.com/group/comp.arch/msg/a3ebfe80363a4399

I make frequent use of this technique on SPARC64 because of its lack of
DWCAS.

> My thoughts on this:
> 1. No nodes -> no dynamic memory allocation (even from lock-free
> freelist) -> less CAS'es -> less contention -> faster
> 2. No nodes -> no problems with node reuse -> no ABA

Well, the ABA problem can certainly still exist. The frequent reuse of a
particular pointer value, or index offset for that matter, is the main cause
of ABA in the first place. So it doesn't matter where you get your nodes
from (e.g., alloc one at a time, or a slab of nodes at a time via an array);
it's the 'tagging' method you use to identify them individually (e.g., pointer
to node, or index offset to node in an array) that can cause CAS to hit ABA
conditions. The stuff that bites you is the frequent reuse of a couple of
'tag' values over and over again...

[...]

> 4. In the last Doherty/Herlihy/Luchangco/Moir fifo-queue realization
> (http://www.cs.brown.edu/~mph/DohertyHLM04/ft_gateway.cfm.pdf) I count
> about 5 CAS'es in enqueue function and 5 CAS'es in dequeue function.
> My realization contains 2 CAS'es in enqueue, and 2 CAS'es in dequeue
> function. And one of the CAS is not competing (result of CAS is
> ignored). I think my realizations must be fast enough.

2 CASes for enqueue is just as expensive as an enqueue function protected by
a fast-pathed mutex; ditto for the dequeue. So, if you handle contention
issues properly, the contention for the queue should not be that bad... So
you're hitting the fast path on the mutex most of the time... Well, why use
the lock-free queue then? It's good to follow a single rule wrt lock-free
anything. Under the 'no-contention case': a lock-free algorithm is usually
not a good idea when its overhead is equal to or greater than that of a
fast-pathed mutex implementation. That means that if a lock-free
implementation has 2 or more atomic operations and/or memory barriers for an
uncontended call to a function X, then it is just as expensive as function X
protected by a mutex.

http://groups.google.com/group/comp.lang.java.programmer/msg/9af5f99388172527

http://groups.google.com/group/comp.lang.java.programmer/msg/e09c78f88d2b1d65

!


Chris Thomasson

Feb 8, 2007, 3:18:28 PM
[...]

> The stuff that bites you is the frequent reuse of a couple 'tag' values
> over and over again...

The stuff that bites you is the frequent reuse of a couple of different

gottlo...@gmail.com

Feb 10, 2007, 12:26:05 AM
On Feb 8, 3:13 pm, "Chris Thomasson" <cris...@comcast.net> wrote:
> "Dmitriy Vyukov" <dvyu...@gmail.com> wrote in message

>
> news:1170957550.7...@v45g2000cwv.googlegroups.com...
>
> > To the best of my knowledge, all of today's lock-free structures are built
> > on top of dynamically linked structures, with some kind of nodes and
> > links between the nodes.
> > I have implemented a lock-free bounded FIFO queue on top of a vector (no
> > nodes and no dynamic links) (source code below).
>
> I have not examined your code in any detail yet, but you're getting your
> nodes from an array and replacing the 'pointer-as-a-link' logic with
> 'index-as-a-link'.

I don't really see it that way. It is just a simple circular buffer
queue. He's not using the array as a place to get nodes that are then
linked together in some order - the array IS the order of the items.

This has been done lots of times in the past. I think you can get rid
of the PointerCAS - you just need to atomic-increment the push_pos and
pop_pos, which, in effect, 'reserves' that array entry for you. In
the typical case (one producer, one consumer) you use events to wake
the consumer when a new item is added. The producer runs through the
buffer just making sure not to pass the consumer, going to sleep on
the event when it reaches the consumer (pop_pos == push_pos).

etc., or something like that. I've used a few of these, written a
few. Basically it is how the Windows Message Queue works, etc.
I think the main downside is that it is not generic - you need to know
the size of your queue beforehand.

Or maybe I'm missing something - I didn't read the code that closely
either. :-(

Tony

Dmitriy Vyukov

Feb 10, 2007, 2:57:30 AM
On Feb 10, 08:26, gottlobfr...@gmail.com wrote:

First of all, thanks for the reply.


> I don't really see it that way. It is just a simple circular buffer
> queue. He's not using the array as a place to get nodes that are then
> linked together in some order - the array IS the order of the items.

Mostly I agree.
Such an organization of the structure makes possible some properties that
cannot be achieved with a "linked nodes" structure - for example, the
wait-free property, where linked nodes allow only a wait-free SPSC queue.


> This has been done lots of times in the past. I think you can get rid
> of the PointerCAS - you just need to atomic-increment the push_pos and
> pop_pos, which, in effect, 'reserves' that array entry for you.

That would not work in a multiple-producer or multiple-consumer scenario,
because after reserving a position the producer must also write the value,
and the producer can stall between these steps... I have thought about that
a lot :)


> In the typical case (one producer, one consumer) you use events to wake
> the consumer when a new item is added. The producer runs through the
> buffer just making sure not to pass the consumer, going to sleep on
> the event when it reaches the consumer (pop_pos == push_pos).

For the SPSC scenario this would work, and it is even a wait-free
solution.
And you don't even need an atomic increment - you can use just a memory
write barrier for the producer and a memory read barrier for the consumer,
because there is no contention.


> etc. or something like that. I've used a few of these, written a
> few. Basically it is how the Windows Message Queue works, etc.
> I think the main downside is that it is not generic - you need to know
> the size of your queue before hand.

I call this not a downside, I call this... a feature :)
In some cases it is an advantage. Some queues are intentionally bounded :)


Dmitriy Vyukov

Feb 10, 2007, 3:30:39 AM
On Feb 8, 23:13, "Chris Thomasson" <cris...@comcast.net> wrote:

First of all, thanks for the reply.
I answered you yesterday, but it looks as if the answer was lost...


> I have not examined your code yet in any detail but your getting your nodes
> from an array and are replacing the 'pointer-as-a-link-logic' with
> 'index-as-a-link'. You can cram a lot more stuff in CAS when you use the
> offset trick:
>

> http://groups.google.com/group/comp.programming.threads/browse_frm/th...
> http://groups.google.com/group/comp.arch/msg/a3ebfe80363a4399

I think this is different. In all those solutions you use some kind of
nodes and links (through a pointer or an index). I don't have links at
all, and I don't use indexes. The natural order of elements in the buffer
is the logical order too.
I don't operate with indexes and I don't store indexes - I just
atomically increment the position in the buffer.
It is different, and it has different properties compared with
"linked nodes" solutions.

> > My thoughts on this:
> > 1. No nodes -> no dynamic memory allocation (even from lock-free
> > freelist) -> less CAS'es -> less contention -> faster
> > 2. No nodes -> no problems with node reuse -> no ABA
>
> Well, the ABA problem can certainly still exist. The frequent reuse of a
> particular pointer value, or index offset for that matter, is the main cause
> of ABA in the first place. So, it doesn't matter where you get your nodes
> from (e.g., alloc one at a time, or a slab of nodes at a time via. array),
> its the 'tagging' method you use to id them individually (e.g., pointer to
> node, or index offset to node in array) that can cause CAS to hit ABA
> conditions. The stuff that bites you is the frequent reuse of a couple 'tag'
> values over and over again...

I don't have the ABA problem in that sense.
ABA can occur in the following situation: while thread 1 sleeps in an
enqueue/dequeue operation, other threads must process (enqueue and dequeue)
buffer_size * 2^number_of_stolen_bits values.
In my example: 32k * 4 = 128k values.
If the request arrival rate is 1000 values/sec, that takes 128 seconds.
ABA can't occur just in "the next cycle" the way it can when a node is
freed and immediately reused.
And we can enlarge the buffer size to, for example, 128k elements (which
takes 512k of memory), and we can steal more bits, for example 3 bits
(in your proxy collector you steal 7 bits :) but I understand that is a
different situation):
128k * 8 = 1M values.
1M values must be processed while some thread sleeps in the enqueue/
dequeue operation.
And if ABA does occur, no value loss or duplication occurs - just wrong
order of consumption.
I think these are attractive properties for using such a structure in
real life (at least in some situations).

> 2 CAS for enqueue is just as expensive as a enqueue function protected by a
> fast-pathed mutex; ditto for the dequeue. So, if you handle contention
> issues properly, the contention for the queue should not be that bad... So,
> your hitting the fast-path on the mutex most of the time... Well, why use
> the lock-free queue then? Is good to follow a single rule wrt lock-free
> anything. Under the 'no-contention-case': A lock-free algorithm is usually
> not a good idea when its overhead is equal-to-or-greater-than that of a
> fast-pathed mutex implementation. That means that if a lock-free
> implementation has 2 or more atomic operations and/or memory barriers for a
> uncontended call to a function X, then it just as expensive as function X
> protected by a mutex.

Yes, I understand that point :)
But we still prevent blocking, deadlock, context switching,
priority inversion etc. Shouldn't we?

btw, is this the reason I don't see an MPMC queue in AppCore? :)

But I have no more CASes than the mutex solution - why is the mutex better?
Why is lock-free not a good solution in that case?
And why do people implement lock-free queues that have 5 CASes per
operation?


Chris Thomasson

Feb 10, 2007, 8:17:43 PM
"Dmitriy Vyukov" <dvy...@gmail.com> wrote in message
news:1171096239.0...@a75g2000cwd.googlegroups.com...

> On Feb 8, 23:13, "Chris Thomasson" <cris...@comcast.net> wrote:
>
> First of all, thanks for the reply.
> I answered you yesterday, but it looks as if the answer was lost...

I got it as an e-mail. Should I post it?

>> I have not examined your code yet in any detail [...]

[...]


> I don't operate with indexes and I don't store indexes - I just
> atomically increment the position in the buffer.

Well, the position in the buffer is an index; think about it. The position
is the index into the array which makes up the buffer. You're grabbing your
links in an ordered fashion. A simple atomic increment will do here. No CAS.


>> > My thoughts on this:
>> > 1. No nodes -> no dynamic memory allocation (even from lock-free
>> > freelist) -> less CAS'es -> less contention -> faster
>> > 2. No nodes -> no problems with node reuse -> no ABA
>>
>> Well, the ABA problem can certainly still exist. The frequent reuse of a
>> particular pointer value, or index offset for that matter, is the main
>> cause of ABA in the first place. So it doesn't matter where you get your
>> nodes from (e.g., alloc one at a time, or a slab of nodes at a time via
>> an array); it's the 'tagging' method you use to identify them individually
>> (e.g., pointer to node, or index offset to node in an array) that can
>> cause CAS to hit ABA conditions. The stuff that bites you is the frequent
>> reuse of a couple of 'tag' values over and over again...
>
> I don't have the ABA problem in that sense.
> ABA can occur in the following situation: while thread 1 sleeps in an
> enqueue/dequeue operation, other threads must process (enqueue and dequeue)
> buffer_size*2^number_of_stolen_bits values.

[...]

That is exactly the same as the problem I described. It's the reuse of the
same index/position. Your ABA counter does wrap, in a sense, when you reach
the end of the buffer.

>> 2 CASes for enqueue is just as expensive as an enqueue function protected
>> by a fast-pathed mutex; ditto for the dequeue.

[...]


> Yes, I understand that point :)
> But we still prevent blocking, deadlocking, context switching,
> priority inversion etc. Shouldn't we?

Everything has its tradeoffs. If I can "get away with" using a mutex-based
queue over a lock-free queue, well, I would go for the mutex-based one.

Please keep in mind that not only can a mutex be fast-pathed, it can be
adaptive as well. Therefore, it is possible to create a clever locking system
that rarely hits contention. Think of the Solaris memory allocator.
Personally, I made use of a lock-based algorithm for the writer side of a
lock-free reader pattern... Lock-free for the writer side can be ridiculous
sometimes... Think of the algorithm that uses 5 CASes!!! What the hell was
the author of that paper thinking!

> btw, is this the reason I don't see an MPMC queue in AppCore? :)

Yes.


> But I have no more CASes than the mutex solution - why is the mutex better?

Simpler, and it has standard POSIX to back it up.


> Why is lock-free not a good solution in that case?

Same overheads as locking in the contended case.

> And why do people implement lock-free queues that have 5 CASes per
> operation?

They have a severe lack of knowledge wrt lock-free algorithms and the
overhead of atomic operations and memory barriers in general!

:O


Chris Thomasson

Feb 10, 2007, 8:26:50 PM

"Dmitriy Vyukov" <dvy...@gmail.com> wrote in message
news:1171094250.0...@v33g2000cwv.googlegroups.com...

> On Feb 10, 08:26, gottlobfr...@gmail.com wrote:
>
> First of all, thanks for the reply.
[...]

>
>> etc. or something like that. I've used a few of these, written a
>> few. Basically it is how the Windows Message Queue works, etc.
>> I think the main downside is that it is not generic - you need to know
>> the size of your queue beforehand.
>
> I call this not a downside, I call this... a feature :)
> In some cases it is an advantage. Some queues are intentionally bounded :)

You can create a bound, or a number of conditions, on any lock-free or
lock-based queue implementation by using an eventcount. The eventcount works
well with lock-free because it's heavily fast-pathed on the signal side.
Couple this with Joe Seigh's usage pattern, and putting a bound on a queue
is a very trivial task indeed. Plus, absolutely no logic for the bound needs
to be hard-coded into the queue implementation... Just create an unbounded
queue, then wrap it with the proper eventcount logic to add a bound.


Chris Thomasson

Feb 11, 2007, 11:15:38 PM
>> Why is lock-free not a good solution in that case?
>
> Same overheads as locking in the contended case.

in the NON!!!-contended case.

sorry for any confusion(s)!!!

:O


gottlo...@gmail.com

Feb 12, 2007, 1:09:00 PM
On Feb 10, 8:17 pm, "Chris Thomasson" <cris...@comcast.net> wrote:

> A simple atomic increment will do here. No CAS.

There's a difference between atomic increment and CAS? Isn't atomic
increment usually implemented with CAS? What am I missing here?

Tony

Chris Thomasson

Feb 12, 2007, 5:50:27 PM
<gottlo...@gmail.com> wrote in message
news:1171303740....@j27g2000cwj.googlegroups.com...

> On Feb 10, 8:17 pm, "Chris Thomasson" <cris...@comcast.net> wrote:
>
>> A simple atomic increment will do here. No CAS.
>
> There's a difference between atomic increment and CAS? Isn't atomic
> increment usually implemented with CAS?

Sometimes.

> What am I missing here?

Well, certain architectures have an instruction that "specializes" in
increments only. On x86 you can use the INC or XADD instructions instead of
CMPXCHG. The "advantage" of using a dedicated instruction is that it
usually allows you to get rid of the 'loop' that comes along with CAS:


inc using cas:

    do {
        local = shared;
    } while (! CAS(&shared, local, local + 1));

inc using inc; ;^)

    INC(&shared);

inc using xadd:

    XADD(&shared, 1);

