
ABP stealing deque algorithm


Dmitriy Vyukov

Aug 29, 2007, 3:15:02 PM
Here is the code of a [slightly modified] ABP work-stealing deque:
http://research.sun.com/scalable/pubs/DynamicWorkstealing.pdf
(in the PDF the code is in images, so I can't post the code itself here)

The ABP work-stealing deque is a lock-free deque that supports
PushBottom() and PopBottom() operations, which can be executed by *only*
one thread, and a PopTop() operation, which can be executed by any
number of threads.

A question about the PopBottom() operation, which is executed by only one
thread.
As I understand it, a #StoreLoad memory barrier (mfence in x86 terms) must be
executed between lines 55 and 56. Am I right?
There is a critical store-load sequence that provides synchronization between
the thread executing PopBottom() and the threads executing PopTop().
The authors say that the algorithm uses only cheap stores and loads and
avoids the costly CAS (until a potential conflict). So if a #StoreLoad is
required between those cheap stores and loads, then it's no cheaper than a
CAS.

And the second question: is there a way to eliminate the #StoreLoad in
PopBottom() when the deque size is large enough?
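
To make the store-load pair concrete, here is roughly what I mean in
C++-style atomics (a sketch only, not the paper's code; it ignores the ABA
tag and uses a bare top index):

#include <atomic>

std::atomic<long> bottom;  // written only by the owner thread
std::atomic<long> top;     // advanced by thieves via CAS in PopTop()

// Sketch of the start of PopBottom() (owner thread only):
long pop_bottom_prefix()
{
    long b = bottom.load(std::memory_order_relaxed) - 1;
    bottom.store(b, std::memory_order_relaxed);           // ~ paper line 55: store to bottom
    std::atomic_thread_fence(std::memory_order_seq_cst);  // the #StoreLoad (mfence) in question
    long t = top.load(std::memory_order_relaxed);         // ~ paper line 56: load of top
    return b - t;                                          // crude size estimate
}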

Dmitriy V'jukov

Chris Thomasson

Sep 4, 2007, 12:42:52 AM
"Dmitriy Vyukov" <dvy...@gmail.com> wrote in message
news:1188414902.4...@l22g2000prc.googlegroups.com...

> Here is the code of a [slightly modified] ABP work-stealing deque:
> http://research.sun.com/scalable/pubs/DynamicWorkstealing.pdf
> (in the PDF the code is in images, so I can't post the code itself here)
[...]

> A question about the PopBottom() operation, which is executed by only one
> thread.
> As I understand it, a #StoreLoad memory barrier (mfence in x86 terms) must
> be executed between lines 55 and 56. Am I right?
[...]

I am reasonably sure that you are correct.


> So if a #StoreLoad is required between those cheap stores and loads,
> then it's no cheaper than a CAS.

[...]

Indeed. I think I am getting a handle on the algorithm. I should really
implement this thing...

Ivan Wang

Sep 4, 2007, 10:34:18 PM
> So if a #StoreLoad is required between those cheap stores and loads,
> then it's no cheaper than a CAS.

I don't think so. As far as I know, a CAS instruction on x86 may need to
lock the bus (via the "lock" prefix) if the local cache does not have
exclusive ownership of the target address. Compared with a memory barrier
on the local processor, bus locking is more expensive.


Thanks
Ivan

Dmitriy Vyukov

Sep 5, 2007, 4:38:15 AM

When I measure those things I get:
lock cmpxchg - 104 cycles
mfence - 100 cycles
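
Something like this single-threaded loop (a rough sketch using GCC/x86
intrinsics; the target stays hot in the local cache):

#include <stdio.h>
#include <x86intrin.h>  // __rdtsc(), _mm_mfence()

int main()
{
    enum { ITERS = 1000000 };
    long target = 0;

    unsigned long long t0 = __rdtsc();
    for (int i = 0; i < ITERS; ++i)
        _mm_mfence();                                  // store-load fence
    unsigned long long t1 = __rdtsc();
    for (int i = 0; i < ITERS; ++i)
        __sync_val_compare_and_swap(&target, 0L, 0L);  // lock cmpxchg
    unsigned long long t2 = __rdtsc();

    printf("mfence:       %.1f cycles\n", (double)(t1 - t0) / ITERS);
    printf("lock cmpxchg: %.1f cycles\n", (double)(t2 - t1) / ITERS);
    return 0;
}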

But maybe I am actually measuring only 'cache locking' and not 'bus
locking'. Thank you for the advice.

Dmitriy V'jukov

Ivan Wang

Sep 5, 2007, 6:48:06 AM
On Sep 5, 4:38 pm, Dmitriy Vyukov <dvyu...@gmail.com> wrote:
> On Sep 5, 06:34, Ivan Wang <ivanwan...@gmail.com> wrote:
>
> > > So if a #StoreLoad is required between those cheap stores and loads,
> > > then it's no cheaper than a CAS.
>
> > I don't think so. As far as I know, a CAS instruction on x86 may need to
> > lock the bus (via the "lock" prefix) if the local cache does not have
> > exclusive ownership of the target address. Compared with a memory barrier
> > on the local processor, bus locking is more expensive.
>
> When I measure those things I get:
> lock cmpxchg - 104 cycles
> mfence - 100 cycles
I think the overhead of the lock prefix and of mfence cannot be measured just
by how many cycles they take.
For mfence, the overhead is that it disables all memory-reordering
optimizations around the instruction.
For the lock prefix, the overhead is that it freezes the memory accesses of
the other processors.

The former hurts local performance and the latter hurts overall performance
and scalability.
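
One way to see the scalability side is to compare a lock-prefixed
read-modify-write on a shared cache line with the same operation on separate
lines (a rough sketch with modern C++ threads, not a precise benchmark):

#include <atomic>
#include <chrono>
#include <cstdio>
#include <thread>

static const long ITERS = 10000000;

// Time two threads doing lock-prefixed increments (lock xadd on x86),
// either on the same atomic or on two cache-line-separated atomics.
static double ns_per_op(std::atomic<long>* a, std::atomic<long>* b)
{
    auto start = std::chrono::steady_clock::now();
    std::thread t1([a] { for (long i = 0; i < ITERS; ++i) a->fetch_add(1); });
    std::thread t2([b] { for (long i = 0; i < ITERS; ++i) b->fetch_add(1); });
    t1.join();
    t2.join();
    auto end = std::chrono::steady_clock::now();
    long long ns = std::chrono::duration_cast<std::chrono::nanoseconds>(end - start).count();
    return (double)ns / ITERS;
}

int main()
{
    struct Padded { std::atomic<long> v{0}; char pad[64]; };  // keep the counters on separate lines
    static Padded x, y;
    std::printf("same cache line:      %.1f ns/op\n", ns_per_op(&x.v, &x.v));
    std::printf("separate cache lines: %.1f ns/op\n", ns_per_op(&x.v, &y.v));
    return 0;
}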

Thanks
Ivan

Dmitriy Vyukov

Sep 5, 2007, 8:22:32 AM
On Sep 5, 2:48 pm, Ivan Wang <ivanwan...@gmail.com> wrote:

> > When I measure those things I get:
> > lock cmpxchg - 104 cycles
> > mfence - 100 cycles
>
> I think the overhead of the lock prefix and of mfence cannot be measured
> just by how many cycles they take.
> For mfence, the overhead is that it disables all memory-reordering
> optimizations around the instruction.
> For the lock prefix, the overhead is that it freezes the memory accesses of
> the other processors.
>
> The former hurts local performance and the latter hurts overall performance
> and scalability.


Hmmm. Interesting.
Can someone clarify when /cache locking/ occurs?
Is it when the target cache line is in the *current* core's cache? Or when
the target cache line is in *any* core's cache?


Dmitriy V'jukov

Chris Thomasson

Sep 5, 2007, 6:17:29 PM
"Ivan Wang" <ivanw...@gmail.com> wrote in message
news:1188959658.5...@50g2000hsm.googlegroups.com...

http://groups.google.com/group/comp.programming.threads/msg/aeb7e857ef440520

Chris Thomasson

Sep 16, 2007, 9:24:06 AM
"Dmitriy Vyukov" <dvy...@gmail.com> wrote in message
news:1188414902.4...@l22g2000prc.googlegroups.com...
> Here is the code of a [slightly modified] ABP work-stealing deque:
> http://research.sun.com/scalable/pubs/DynamicWorkstealing.pdf
> (in the PDF the code is in images, so I can't post the code itself here)
[...]

> The authors say that the algorithm uses only cheap stores and loads and
> avoids the costly CAS (until a potential conflict).
[...]

Humm. I have not had the time to study their algorithm to my satisfaction;
however, I kind of think there might be a potential race condition wrt
detecting when a CAS must be issued in the 'PopBottom' function. Not sure
yet...

Chris Thomasson

Sep 16, 2007, 9:46:50 AM
"Chris Thomasson" <cri...@comcast.net> wrote in message
news:4dOdnUy8SoAEs3Db...@comcast.com...

Have you read (section 4.1.2/paragraph 2) yet? IMVHO, it seems they are
making some "troublesome" assumptions about atomicity... It seems like they
assume that a single statement in their pseudo-code will be executed
atomically. Then they say no statement accesses more than one shared
variable, _except_ auxiliary nodes. This means that such statements will be
made up of multiple instructions. The weird thing is that they say something
like (paraphrase):

____________
We can consider statements that are made up of multiple instructions (e.g.,
use auxiliary nodes) as a single atomic action because auxiliary nodes are
not used in a "real" implementation.
____________


This makes their proofs a bit confusing... They also seem to think that
memory visibility will somehow handle itself because each statement is
atomic... Humm...


Any thoughts on this?

Chris Thomasson

Sep 16, 2007, 2:44:57 PM
"Chris Thomasson" <cri...@comcast.net> wrote in message
news:7_GdnZBAdf6QrnDb...@comcast.com...
[...]

> Have you read (section 4.1.2/paragraph 2) yet? IMVHO, it seems they are
> making some "troublesome" assumptions about atomicity... It seems like
> they assume that a single statement in their pseudo-code will be executed
> atomically.
[...]

They also seem to assume that the effects of a single atomic statement will
be visible before any subsequent ones...

Chris Thomasson

Sep 16, 2007, 2:55:09 PM
"Chris Thomasson" <cri...@comcast.net> wrote in message
news:4dOdnUy8SoAEs3Db...@comcast.com...

They require 64-bit CAS and they store ABA counters.

Chris Thomasson

Sep 16, 2007, 9:35:34 PM
"Chris Thomasson" <cri...@comcast.net> wrote in message
news:eIadnReKXprN5nDb...@comcast.com...

Humm... I believe that they are going to need DWCAS, because on a 64-bit
system a pointer will take up all of the room. I am not sure how they are
going to pack a 64-bit pointer, a cell index, and an ABA counter into a
single 64-bit word.
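
If the cell is referred to by array index rather than by raw pointer (as in
the original array-based version, if I remember right), the packing does fit
into one 64-bit word; something along these lines (my own field layout, just
to illustrate; not the paper's exact one):

#include <atomic>
#include <stdint.h>

// Pack the top index and an ABA tag into a single 64-bit word so that a
// plain 64-bit CAS is enough (the field widths are arbitrary here).
struct Age {
    uint64_t word;

    uint32_t top() const { return (uint32_t)(word & 0xffffffffu); }
    uint32_t tag() const { return (uint32_t)(word >> 32); }

    static uint64_t make(uint32_t top, uint32_t tag) {
        return ((uint64_t)tag << 32) | top;
    }
};

std::atomic<uint64_t> age;  // holds a packed Age::word

// Thief side (PopTop-style): try to advance top by one. The tag is left
// alone here; the owner bumps it when it resets the deque.
bool try_advance_top(uint64_t observed)
{
    Age old_age = { observed };
    uint64_t new_word = Age::make(old_age.top() + 1, old_age.tag());
    return age.compare_exchange_strong(observed, new_word);
}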

Chris Thomasson

Sep 16, 2007, 10:57:15 PM
"Dmitriy Vyukov" <dvy...@gmail.com> wrote in message
news:1188414902.4...@l22g2000prc.googlegroups.com...
> Here is the code of a [slightly modified] ABP work-stealing deque:
> http://research.sun.com/scalable/pubs/DynamicWorkstealing.pdf
> (in the PDF the code is in images, so I can't post the code itself here)
>
> The ABP work-stealing deque is a lock-free deque that supports
> PushBottom() and PopBottom() operations, which can be executed by *only*
> one thread, and a PopTop() operation, which can be executed by any
> number of threads.

[...]

I think I understand their algorithm now. IMO, it's easier to follow this if
you consider (page 5/figure 4/line 15) as the start of a linearization point
for (line 18) which closes with (line 34). Think of (lines 1-15) as a single
atomic operation because it's only executed by a single thread. The last
instruction (line 15) performs the action which makes the state change
visible to (lines 16-42). (Lines 16-42) can be thought of as a basic
CAS-loop without the 'loop' part; analogous to a "try_lock" function on a
mutex. 'ABORT' is basically equal to the DWCAS failing, and EMPTY is
self-explanatory. I believe I understand their algorithm a lot better now.

As for memory barriers, the 'PushBottom' function should probably have a
#StoreStore, along with a possible #LoadStore constraint, depending on the
data you're producing into the doubly linked list. The PopTry should have an
acquire-style membar after the entire operation. Basic outline:

void push() {
    membar /* | #LoadStore; */ #StoreStore;
    [push impl...]
}

void* pop() {
    void* const state = [pop impl...];
    membar /* | #StoreLoad; */ #StoreStore;
    return state;
}


Any thoughts on this?


Chris Thomasson

Sep 16, 2007, 11:10:09 PM
"Chris Thomasson" <cri...@comcast.net> wrote in message
news:d42dnWvBXM3QcXDb...@comcast.com...

> "Dmitriy Vyukov" <dvy...@gmail.com> wrote in message
> news:1188414902.4...@l22g2000prc.googlegroups.com...
>> Here is the code of a [slightly modified] ABP work-stealing deque:
>> http://research.sun.com/scalable/pubs/DynamicWorkstealing.pdf
>> (in the PDF the code is in images, so I can't post the code itself here)
>>
>> The ABP work-stealing deque is a lock-free deque that supports
>> PushBottom() and PopBottom() operations, which can be executed by *only*
>> one thread, and a PopTop() operation, which can be executed by any
>> number of threads.
>
> [...]
>
> I think I understand their algorithm now. IMO, it's easier to follow this
> if you consider (page 5/figure 4/line 15) as the start of a linearization
> point for (line 18) which closes with (line 34). Think of (lines 1-15) as a
> single atomic operation because it's only executed by a single thread. The
> last instruction (line 15) performs the action which makes the state change
> visible to (lines 16-42). (Lines 16-42) can be thought of as a basic
> CAS-loop without the 'loop' part; analogous to a "try_lock" function on a
> mutex. 'ABORT' is basically equal to the DWCAS failing, and EMPTY is
> self-explanatory. I believe I understand their algorithm a lot better now.
>
> As for memory barriers, the 'PushBottom' function should probably have a
> #StoreStore, along with a possible #LoadStore constraint, depending on the
> data you're producing into the doubly linked list. The PopTry should have
> an acquire-style membar after the entire operation. Basic outline:
[...]
> Any thoughts on this?

Those memory barriers were the macro layer required for the 'ThreadData'
structure. The micro-layer memory barriers for, say, the 'PopBottom' function
would require a #LoadLoad between (page 6/figure 5/line 43) and (line 57).

I notice that the 'PopBottom' function acts like the CAS logic in 'PopTop'
when it notices the node it needs to remove is the last one in the
container. I was wondering what would happen if a thread A executed (lines
43-55) can be considered atomic because only a single thread performs it and
the last instruction (line 55) makes the state change. You are correct,
Dmitriy, in that a #StoreLoad | #StoreStore barrier is needed right after
(line 55). (Line 57) is the start of a linearization point which closes with
(line 68) if the "slow-path" is hit.
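
To pin that down, the tail of 'PopBottom' might look something like this (a
sketch from memory, reusing the 'bottom' variable, the packed 'age' word and
the 'Age' helper from the fragments earlier in the thread; not the paper's
exact code):

// Assumes: std::atomic<long> bottom, std::atomic<uint64_t> age, struct Age,
// and 'tasks', the deque's array of task pointers.
void* pop_bottom_sketch(void** tasks)
{
    long b = bottom.load(std::memory_order_relaxed);
    if (b == 0)
        return 0;                                            // EMPTY
    b = b - 1;
    bottom.store(b, std::memory_order_relaxed);              // ~ line 55
    std::atomic_thread_fence(std::memory_order_seq_cst);     // #StoreLoad | #StoreStore
    void* task = tasks[b];
    uint64_t old_word = age.load(std::memory_order_relaxed); // ~ line 57
    Age old_age = { old_word };
    long t = (long)old_age.top();
    if (b > t)
        return task;                                          // fast path: no thief can race us
    // Slow path: we may be taking the last item, so race the thieves with a
    // CAS on the age word (~ line 68); either way the owner resets the deque
    // and bumps the tag.
    bottom.store(0, std::memory_order_relaxed);
    uint64_t new_word = Age::make(0, old_age.tag() + 1);
    if (b == t && age.compare_exchange_strong(old_word, new_word))
        return task;                                          // we won the last item
    age.store(new_word, std::memory_order_relaxed);           // a thief won, or the deque was empty
    return 0;                                                 // EMPTY
}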

Chris Thomasson

Sep 16, 2007, 11:13:53 PM
"Chris Thomasson" <cri...@comcast.net> wrote in message
news:RYSdneDGQPrJcnDb...@comcast.com...

Never mind the 'I was wondering what would happen if a thread A executed'
text in that sentence.

> (lines 43-55) can be considered atomic because only a single thread
> performs it and the last instruction (line 55) makes the state change.

[...]

fixed:
_______________

I notice that the 'PopBottom' function acts like the CAS logic in 'PopTop'
when it notices the node it needs to remove is the last one in the
container. (Lines 43-55) can be considered atomic because only a single
thread performs it and the last instruction (line 55) makes the state change.
_______________