On Oct 21, 8:47 pm, Alexander Shuvaev <
alex.shuv...@gmail.com> wrote:
> Hi ;)
> I was sure that you knew the answer to this question. I've analysed it
> but still have doubt about the heart of the problem.
> As I understand, your last example would be compiled for Itanium
> platform to something like this:
>
> // thread 1
> st.rel [x] = 1
> ld.acq r1 = [y]
>
> // thread 2
> st.rel [y] = 1
> ld.acq r2 = [x]
>
> Am I right? In other words all loads with memory_order_acq_rel have
> acquire semantics and all stores have release semantics respectively.
> And atomic RMW operations act as load has acq and store has rel
> semantics at the same time, something like operations with lock prefix
> on x86. But if I am right, it's interesting how it would be
> implemented for Itanium.
First of all, sorry for that nonsense: only RMW operations can be
acq_rel. So the example should be:
// thread 1
x.exchange(1, memory_order_acq_rel);
R1 = y.exchange(0, memory_order_acq_rel);
// thread 2
y.exchange(1, memory_order_acq_rel);
R2 = x.exchange(0, memory_order_acq_rel);
For this code I think (though I am not quite sure right now) the result
R1==R2==0 is impossible because of the RMW operations. RMW operations
on a variable are totally ordered, and each RMW reads the value written
immediately before it in that order.
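Here is a small harness to play with the exchange example, as a rough
sketch only. The function name count_both_zero_exchange and the rounds
parameter are made up for illustration. It just runs the two threads
repeatedly and counts the rounds that end with R1==R2==0. A run that
never sees the outcome proves nothing by itself, of course; the argument
rests on the memory model, not on the test.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// Illustrative helper (not from any library): runs the two-thread
// exchange example `rounds` times and returns how many rounds ended
// with R1 == 0 and R2 == 0.
int count_both_zero_exchange(int rounds) {
    int both_zero = 0;
    for (int i = 0; i < rounds; ++i) {
        std::atomic<int> x{0}, y{0};
        int r1 = -1, r2 = -1;

        std::thread t1([&] {
            x.exchange(1, std::memory_order_acq_rel);
            r1 = y.exchange(0, std::memory_order_acq_rel);
        });
        std::thread t2([&] {
            y.exchange(1, std::memory_order_acq_rel);
            r2 = x.exchange(0, std::memory_order_acq_rel);
        });
        t1.join();
        t2.join();

        // r1 and r2 are safe to read here: join() synchronizes with
        // the completion of each thread.
        if (r1 == 0 && r2 == 0) ++both_zero;
    }
    return both_zero;
}
```

If the reasoning above is right, count_both_zero_exchange(n) returns 0
for any n.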
Ok, here is an example I am sure about:
// thread 1
x.store(1, memory_order_relaxed);
atomic_thread_fence(memory_order_seq_cst);
R1 = y.load(memory_order_relaxed);
// thread 2
y.store(1, memory_order_relaxed);
atomic_thread_fence(memory_order_seq_cst);
R2 = x.load(memory_order_relaxed);
If you replace the seq_cst fences with acq_rel fences, then the
result R1==R2==0 becomes possible.
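The fence example can be tried out with a sketch like the following. The
helper name count_both_zero_fence and its parameters are hypothetical;
the fence order is passed in so that the same code can be run with
memory_order_seq_cst and with memory_order_acq_rel. With seq_cst the
count must be 0. With acq_rel a nonzero count is allowed but not
guaranteed to show up, especially on x86 or when the threads barely
overlap.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// Illustrative helper: runs the store / fence / load example `rounds`
// times with the given fence order and returns how many rounds ended
// with R1 == 0 and R2 == 0.
int count_both_zero_fence(int rounds, std::memory_order fence_order) {
    int both_zero = 0;
    for (int i = 0; i < rounds; ++i) {
        std::atomic<int> x{0}, y{0};
        int r1 = -1, r2 = -1;

        std::thread t1([&] {
            x.store(1, std::memory_order_relaxed);
            std::atomic_thread_fence(fence_order);
            r1 = y.load(std::memory_order_relaxed);
        });
        std::thread t2([&] {
            y.store(1, std::memory_order_relaxed);
            std::atomic_thread_fence(fence_order);
            r2 = x.load(std::memory_order_relaxed);
        });
        t1.join();
        t2.join();

        if (r1 == 0 && r2 == 0) ++both_zero;
    }
    return both_zero;
}
```

Calling count_both_zero_fence(n, std::memory_order_seq_cst) should
always return 0; the same call with std::memory_order_acq_rel may return
something else.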
> And atomic RMW operations act as load has acq and store has rel
> semantics at the same time, something like operations with lock prefix
> on x86.
No, an x86 RMW with the LOCK prefix is more powerful than acq_rel; it's
actually a seq_cst RMW. There is no way to express a weaker, acq_rel-only
RMW on x86. I think x86 is just not a good arch for reasoning about
relaxed atomic operations, because it's too strong. There is actually
only one kind of reordering possible: a store can sink below a later
load.
A better arch to reason about relaxed atomics is SPARC RMO.
On SPARC RMO the implementation is as follows (not quite sure, but
something like that).
// store-release
membar #LoadStore | #StoreStore
store
// load-acquire
load
membar #LoadStore | #LoadLoad
// acq_rel RMW
membar #LoadStore | #StoreStore
RMW
membar #LoadStore | #LoadLoad
You may notice that there is no #StoreLoad involved. The #StoreLoad
membar is the most expensive one, and it is the one that provides
sequential consistency.
And if you do a critical store-load sequence (as in Dekker's mutual
exclusion algorithm) without a #StoreLoad membar, it won't work: the
load may hoist above the store.
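For contrast with the store-load case, the pattern that store-release
and load-acquire do handle on their own is message passing: plain data
is written before the release store and read after the acquire load. A
minimal sketch, with the helper name pass_message made up for
illustration:

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// Illustrative helper: hands `value` from a producer thread to a
// consumer thread through a plain int guarded by a release/acquire
// flag, and returns what the consumer saw.
int pass_message(int value) {
    int payload = 0;                 // plain, non-atomic data
    std::atomic<bool> ready{false};
    int received = -1;

    std::thread consumer([&] {
        // load-acquire: later accesses cannot move above this load
        while (!ready.load(std::memory_order_acquire)) {
            std::this_thread::yield();
        }
        received = payload;
    });
    std::thread producer([&] {
        payload = value;
        // store-release: earlier accesses cannot move below this store
        ready.store(true, std::memory_order_release);
    });

    producer.join();
    consumer.join();
    return received;
}
```

Here pass_message(42) returns 42, and no store-load ordering is needed
anywhere, since each thread only needs its own accesses kept on the
right side of the flag operation.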
On x86, MFENCE is basically equal to a #StoreLoad membar, because that
is the only kind of reordering it has to prevent; all other kinds of
reordering are simply impossible. So all fences except seq_cst are
no-ops on x86, and the seq_cst fence maps to MFENCE.
Ah, the first thing I should have mentioned regarding acq_rel vs seq_cst
is that seq_cst operations participate in a single total global order of
all seq_cst operations, while acq_rel operations do not.
Consider the following example:
Initially X==Y==0
// thread 1
X.exchange(1, memory_order_acq_rel);
// thread 2
Y.exchange(1, memory_order_acq_rel);
// thread 3
R1 = X.load(memory_order_acquire);
R2 = Y.load(memory_order_acquire);
// thread 4
R3 = Y.load(memory_order_acquire);
R4 = X.load(memory_order_acquire);
Here, the output R1==1, R2==0, R3==1, R4==0 is possible (which basically
means that threads 3 and 4 see the two stores in different orders).
However, if you replace acq_rel and acquire with seq_cst, this output
becomes impossible, because seq_cst operations form a total global
order, so other threads can't see the stores in different orders.
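A sketch of the seq_cst variant of this four-thread example, with one
writer for X and one for Y. The helper name count_iriw_disagreements is
hypothetical. It counts the rounds in which the two readers disagree
about the order of the two stores, which must never happen with seq_cst.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// Illustrative helper: runs the four-thread example with seq_cst
// everywhere and returns how many rounds produced
// R1 == 1, R2 == 0, R3 == 1, R4 == 0.
int count_iriw_disagreements(int rounds) {
    int disagreements = 0;
    for (int i = 0; i < rounds; ++i) {
        std::atomic<int> x{0}, y{0};
        int r1 = -1, r2 = -1, r3 = -1, r4 = -1;

        std::thread t1([&] { x.exchange(1, std::memory_order_seq_cst); });
        std::thread t2([&] { y.exchange(1, std::memory_order_seq_cst); });
        std::thread t3([&] {
            r1 = x.load(std::memory_order_seq_cst);
            r2 = y.load(std::memory_order_seq_cst);
        });
        std::thread t4([&] {
            r3 = y.load(std::memory_order_seq_cst);
            r4 = x.load(std::memory_order_seq_cst);
        });
        t1.join();
        t2.join();
        t3.join();
        t4.join();

        // Thread 3 saw X before Y, thread 4 saw Y before X.
        if (r1 == 1 && r2 == 0 && r3 == 1 && r4 == 0) ++disagreements;
    }
    return disagreements;
}
```

With seq_cst the function should return 0 for any number of rounds.
Switching the orders back to acq_rel and acquire makes a nonzero result
legal, though on hardware such as x86 it will not actually be observed.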
--
Dmitriy V'jukov