ABP work stealing deque - is lock-free deque, that supports
PushBottom(), PopBottom() operations, which can be executed by *only*
one thread, and PopTop() operation, which can be executed by any
number of threads.
Question about PopBottom() operation which executed by only one
thread.
As I understand between lines 55 and 56 #StoreLoad memory barrier (or
mfence in terms of x86) must be executed. Am I right?
There are critical store-load sequence that provide synchronization
between thread executing PopBottom() and threads executing PopTop().
Authors says that algorithm uses only cheap stores and loads and
avoids using costly CAS (until a potential conflict). So if #StoreLoad
is required between those cheap stores and loads, than it's no cheaper
than CAS.
And the second question. Is there way to eliminate #StoreLoad in
PopBottom() when deque size is large enough?
Dmitriy V'jukov
I reasonably sure that you are correct.
> So if #StoreLoad
> is required between those cheap stores and loads, than it's no cheaper
> than CAS.
[...]
Indeed. I think I am getting a handle on the algorithm. I should really
implement this thing...
I don't think so. So far as I know, A CAS instruction on x86 may need
to lock the bus (same as "lock prefix") if the local cache did not has
exclusive ownership for the target address. Compared with a memory
barrier of a local processor, a bus locking is more expensive.
Thanks
Ivan
When I measure those things I get:
lock cmpxchg - 104 cycles
mfence - 100 cycles
But actually maybe I measure only 'cache locking' and not 'bus
locking'. Thank you for advice.
Dmitriy V'jukov
The former one hurt the local performance and the later one hurt the
overall performance and scalability.
Thanks
Ivan
> > When I measure those things I get:
> > lock cmpxchg - 104 cycles
> > mfence - 100 cycles
>
> I think the overhead of lock prefix and mfence can not be measured by
> how many cycles they spend.
> For mfence, the overhead is it disables all the memory reorder
> optimization near this instruction.
> For lock prefix, the overhead is it freezes all the memory accessing
> of other processors.
>
> The former one hurt the local performance and the later one hurt the
> overall performance and scalability.
Hmmm. Interesting.
Can someone make clear when /cache locking/ occurs?
When target cache line in *current* core's cache? Or when target cache
line in *any* core's cache?
Dmitriy V'jukov
http://groups.google.com/group/comp.programming.threads/msg/aeb7e857ef440520
Humm. I have not had the time to study their algorithm to my satisfaction,
however I kind of think there might be a potential race-condition wrt
detecting when a CAS must be issued in the 'PopBottom' function. Not sure
yet...
Have you read (section 4.1.2/paragraph 2) yet? IMVHO, it seems they are
making some "troublesome" assumptions about atomicity... It seems like they
assume that a single statement in their pseudo-code will be executed
atomically. Then they say no statement accesses more than one shared
variable, _except_ auxiliary nodes. This means that such statements will be
made up of multiple instructions. The weird thing is that they say something
like (paraphrase):
____________
We can consider statements that are made up of multiple instructions (e.g.,
use auxiliary nodes) as a single atomic action because auxiliary nodes are
not used in a "real" implementation.
____________
This makes their proofs a bit confusing... They also seem to think that
memory visibility will somehow handle itself because each statement is
atomic... Humm...
Any thoughts on this?
They also seem to assume that the effects of a atomic single statement will
be visible before any subsequent ones...
They require 64-bit CAS and they store ABA counters.
Humm... I believe that they are going to need DWCAS because on a 64-bit
system a pointer will take up all of the room. I am not sure how they are
going to pack a 64-bit pointer, cell-index and aba counter in a single
64-bit word.
[...]
I think I understand their algorithm now. IMO, it's easier to follow this if
you consider (page 5/figure 4/line 15) as the start of a linizeraziation
point for (line 18) which closes with (line 34). Think of (lines 1-15) as a
single atomic operation because it's only executed by a single thread. The
last instruction (line 15) performs the action which makes the state change
visible to (lines 16-42). (Lines 16-42) can be thought of as a basic
CAS-loop without the 'loop' part; analogous to a "try_lock" function on a
mutex. 'ABORT' is basically equal to the DWCAS failing, and EMPTY is
self-explanatory. believe I understand their algorithm a lot better now.
As for memory barriers, 'PushBottom' function should probably have a
#StoreStore, along with a possible #LoadStore constraint, depending on the
data your producing into the doubly linked-list. The PopTry should have an
acquire-style membar after the entire operation. Basic outline:
void push() {
membar /* | #LoadStore; */ #StoreStore;
[push impl...]
}
void* pop() {
void* const state = [pop impl...];
membar /* | #StoreLoad; */ #StoreStore;
return state;
}
Any thoughts on this?
Those memory barriers were the macro layer required for the 'ThreadData'
structure. The micro layer memory barriers for say the 'PopBottom' function
would require a #LoadLoad in between (page 6/figure 5/line 43) and (line
57).
I notice that the 'PopBottom' function acts like the CAS logic in 'PopTop'
when it notices the node it needs to remove is the last one in the
container. I was wondering what would happen if a thread A executed (lines
43-55) can be considered atomic because only a single thread performs it and
the last instruction (line 55) makes the state change. You are correct
Dmitriy in that a #StoreLoad | #StoreStore barrier is needed right after
(line 55). (Line 57) is the start of a linearization point which closes with
(line 68) if the "slow-path" is hit.
Never mind the 'I was wondering what would happen if a thread A executed'
text in that sentence.
> (lines 43-55) can be considered atomic because only a single thread
> performs it and the last instruction (line 55) makes the state change.
[...]
fixed:
_______________
I notice that the 'PopBottom' function acts like the CAS logic in 'PopTop'
when it notices the node it needs to remove is the last one in the
container. (Lines 43-55) can be considered atomic because only a single