memory_order_acq_rel and memory_order_seq_cst


Alexander Shuvaev

Oct 21, 2009, 6:59:34 AM
to Scalable Synchronization Algorithms
Would anybody clarify the difference between the memory_order_acq_rel
and memory_order_seq_cst memory orders in the future C++ standard? Can
anybody give some examples where this difference may be observed?

Dmitriy Vyukov

Oct 21, 2009, 9:56:16 AM
to Scalable Synchronization Algorithms
Hi ;)

seq_cst provides sequential consistency, while acq_rel does not. More
precisely, acquire/release/acq_rel have meaning only when one thread
loads a value stored by another thread.
Here is the main pattern for acquire/release based synchronization:

int data; // = 0
atomic<int> flag; // = 0

// thread 1
data = produce_data(); // produce data
flag.store(1, memory_order_release); // notify consumer

// thread 2
if (flag.load(memory_order_acquire))
    consume_data(data);
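The pattern can be fleshed out into a complete program; here is a sketch, where the spin loop, the payload value 42, and the run_once harness are illustrative additions:

```cpp
#include <atomic>
#include <cassert>
#include <thread>

int data = 0;                 // plain, non-atomic payload
std::atomic<int> flag{0};

void producer() {
    data = 42;                                 // produce data (42 is an arbitrary example value)
    flag.store(1, std::memory_order_release);  // notify the consumer
}

int consumer() {
    while (flag.load(std::memory_order_acquire) != 1) {
        // spin until the flag is set; the acquire load that finally
        // reads 1 synchronizes with the release store, which makes the
        // plain write to `data` visible to this thread
    }
    return data;
}

int run_once() {
    data = 0;
    flag.store(0);
    std::thread t1(producer);
    int result = 0;
    std::thread t2([&result] { result = consumer(); });
    t1.join();
    t2.join();
    return result;
}
```

Without the release/acquire pair (i.e. with relaxed ordering), the consumer could see flag == 1 and still read a stale value of data.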

And acq_rel just allows an atomic RMW (read-modify-write) operation to
act as both "consumer" and "producer", i.e. to both acquire
visibility over some data and to transfer visibility over some data.
For example:

void release_reference(object* o)
{
    if (o->ref_count.fetch_sub(1, memory_order_acq_rel) == 1)
        delete o;
}

In the above example all but the last fetch_sub must synchronize with
the last fetch_sub. So fetch_sub must be both acquire and release.
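A self-contained sketch of this idiom (the `object` struct and the bool return value are illustrative additions, not part of the original snippet):

```cpp
#include <atomic>
#include <cassert>

// Hypothetical reference-counted object, for illustration only.
struct object {
    std::atomic<int> ref_count{1};
};

bool release_reference(object* o) {
    // The release half of acq_rel publishes this thread's last accesses
    // to the object; the acquire half lets the thread that drops the
    // count to zero see all accesses published by the earlier releases,
    // before it deletes the object.
    if (o->ref_count.fetch_sub(1, std::memory_order_acq_rel) == 1) {
        delete o;
        return true;  // this call performed the final release
    }
    return false;
}
```

With a plain release-only fetch_sub, the deleting thread would not be guaranteed to see the other owners' writes to the object before running the destructor.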

So, where does the difference between acq_rel and seq_cst make sense?
Consider the following example:

atomic<int> x; // = 0
atomic<int> y; // = 0

// thread 1
x.store(1, memory_order_acq_rel);
R1 = y.load(memory_order_acq_rel);

// thread 2
y.store(1, memory_order_acq_rel);
R2 = x.load(memory_order_acq_rel);

In the above example the outcome R1 == R2 == 0 is possible. However, if
you replace acq_rel with seq_cst, that result becomes impossible,
because there is no interleaving of the threads under which both
threads miss the other's update.
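The seq_cst variant can be exercised as a litmus test; this is a sketch where the loop harness and iteration count are illustrative additions (a finite number of runs can only fail to find the forbidden outcome, never prove its absence):

```cpp
#include <atomic>
#include <cassert>
#include <thread>

std::atomic<int> x{0}, y{0};

// One run of the two-thread experiment, with seq_cst everywhere.
// Returns true if both threads read 0.
bool both_read_zero() {
    x.store(0);
    y.store(0);
    int r1 = -1, r2 = -1;
    std::thread t1([&] {
        x.store(1, std::memory_order_seq_cst);
        r1 = y.load(std::memory_order_seq_cst);
    });
    std::thread t2([&] {
        y.store(1, std::memory_order_seq_cst);
        r2 = x.load(std::memory_order_seq_cst);
    });
    t1.join();
    t2.join();
    return r1 == 0 && r2 == 0;
}

// seq_cst places all four operations into one total order, so at least
// one thread must observe the other's store.
bool outcome_never_seen(int iterations) {
    for (int i = 0; i < iterations; ++i)
        if (both_read_zero())
            return false;
    return true;
}
```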

Hope that Relacy will be able to confirm that.

--
Dmitriy V'jukov

Alexander Shuvaev

Oct 21, 2009, 12:47:30 PM
to Scalable Synchronization Algorithms
Hi ;)
I was sure that you would know the answer to this question. I've
analysed it but still have doubts about the heart of the problem.
As I understand it, your last example would be compiled for the Itanium
platform to something like this:

// thread 1
st.rel [x] = 1
ld.acq r1 = [y]

// thread 2
st.rel [y] = 1
ld.acq r2 = [x]

Am I right? In other words, all loads with memory_order_acq_rel have
acquire semantics and all stores have release semantics, respectively.
And an atomic RMW operation acts as if its load has acquire and its
store has release semantics at the same time, something like operations
with the LOCK prefix on x86. But if I am right, it's interesting how
this would be implemented for Itanium.

Dmitriy Vyukov

Oct 22, 2009, 6:29:32 AM
to Scalable Synchronization Algorithms
On Oct 21, 8:47 pm, Alexander Shuvaev <alex.shuv...@gmail.com> wrote:

> Hi ;)
> I was sure that you would know the answer to this question. I've
> analysed it but still have doubts about the heart of the problem.
> As I understand it, your last example would be compiled for the Itanium
> platform to something like this:
>
> // thread 1
> st.rel [x] = 1
> ld.acq r1 = [y]
>
> // thread 2
> st.rel [y] = 1
> ld.acq r2 = [x]
>
> Am I right? In other words, all loads with memory_order_acq_rel have
> acquire semantics and all stores have release semantics, respectively.
> And an atomic RMW operation acts as if its load has acquire and its
> store has release semantics at the same time, something like operations
> with the LOCK prefix on x86. But if I am right, it's interesting how
> this would be implemented for Itanium.


First of all, sorry for that nonsense: only RMW operations can be
acq_rel. So the example must be:

// thread 1
x.exchange(1, memory_order_acq_rel);
R1 = y.exchange(0, memory_order_acq_rel);

// thread 2
y.exchange(1, memory_order_acq_rel);
R2 = x.exchange(0, memory_order_acq_rel);

And for this code, I think (though I'm not quite sure right now) the
outcome R1==R2==0 is impossible because of the RMW operations. RMW
operations establish a total order over the operations on each variable.

Ok, here is an example I am sure about:

// thread 1
x.store(1, memory_order_relaxed);
atomic_thread_fence(memory_order_seq_cst);
R1 = y.load(memory_order_relaxed);

// thread 2
y.store(1, memory_order_relaxed);
atomic_thread_fence(memory_order_seq_cst);
R2 = x.load(memory_order_relaxed);

If you replace the seq_cst fences with acq_rel fences, then the
outcome R1==R2==0 becomes possible.
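The seq_cst-fence version can likewise be run as a litmus test; a sketch (the loop harness is an illustrative addition, and a finite run can only fail to find the forbidden outcome, not prove the guarantee):

```cpp
#include <atomic>
#include <cassert>
#include <thread>

std::atomic<int> x{0}, y{0};

// One run: relaxed stores and loads separated by seq_cst fences.
// Returns true if both threads read 0.
bool both_relaxed_loads_zero() {
    x.store(0);
    y.store(0);
    int r1 = -1, r2 = -1;
    std::thread t1([&] {
        x.store(1, std::memory_order_relaxed);
        std::atomic_thread_fence(std::memory_order_seq_cst);
        r1 = y.load(std::memory_order_relaxed);
    });
    std::thread t2([&] {
        y.store(1, std::memory_order_relaxed);
        std::atomic_thread_fence(std::memory_order_seq_cst);
        r2 = x.load(std::memory_order_relaxed);
    });
    t1.join();
    t2.join();
    return r1 == 0 && r2 == 0;
}

// The seq_cst fences are totally ordered with respect to each other,
// which rules out both loads missing the other thread's store.
bool fence_outcome_never_seen(int iterations) {
    for (int i = 0; i < iterations; ++i)
        if (both_relaxed_loads_zero())
            return false;
    return true;
}
```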

> And an atomic RMW operation acts as if its load has acquire and its
> store has release semantics at the same time, something like operations
> with the LOCK prefix on x86.

No, an x86 RMW with the LOCK prefix is more powerful than acq_rel; it's
actually a seq_cst RMW. There is no way to express an acq_rel RMW on x86.
I think x86 is just not a good architecture for reasoning about relaxed
atomic operations, because it's too strong. There is actually only one
kind of reordering possible: a store can sink below a load.
A better architecture for reasoning about relaxed atomics is SPARC RMO.
On SPARC RMO the implementation is as follows (not quite sure, but
something like this):

// store-release
membar #LoadStore | #StoreStore
store

// load-acquire
load
membar #LoadStore | #LoadLoad

// acq_rel RMW
membar #LoadStore | #StoreStore
RMW
membar #LoadStore | #LoadLoad

You may notice that there is no #StoreLoad involved. The #StoreLoad
membar is the most expensive one, and it is what provides sequential
consistency.

And if you execute a critical store-load sequence (like in Dekker's
mutual exclusion algorithm) without a #StoreLoad membar, it won't work:
the load may hoist above the store.
On x86, MFENCE is basically equal to a #StoreLoad membar, because that
is the only kind of reordering it prevents; all other kinds of
reordering are simply impossible. So all fences except seq_cst are
no-ops on x86, and the seq_cst fence maps to MFENCE.

Ah, the first thing I should have mentioned regarding acq_rel vs
seq_cst is that seq_cst operations participate in a total global order
of all seq_cst operations, while acq_rel operations do not.
Consider the following example:

Initially X==Y==0

// thread 1
X.exchange(1, memory_order_acq_rel);

// thread 2
Y.exchange(1, memory_order_acq_rel);

// thread 3
R1 = X.load(memory_order_acquire);
R2 = Y.load(memory_order_acquire);

// thread 4
R3 = Y.load(memory_order_acquire);
R4 = X.load(memory_order_acquire);

Here, the outcome R1==1, R2==0, R3==1, R4==0 is possible (which
basically means that threads 3 and 4 see the stores in different
orders). However, if you replace acq_rel and acquire with seq_cst, this
outcome becomes impossible, because seq_cst operations form a total
global order, so different threads can't see the stores in different
orders.
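The seq_cst variant of this four-thread (independent reads of independent writes) experiment can be sketched as a runnable check; the loop harness is an illustrative addition, and as before a finite run can only fail to find the forbidden outcome:

```cpp
#include <atomic>
#include <cassert>
#include <thread>

std::atomic<int> X{0}, Y{0};

// One run with all operations upgraded to seq_cst. Returns true if the
// two observer threads saw the stores to X and Y in opposite orders.
bool opposite_orders_seen() {
    X.store(0);
    Y.store(0);
    int r1 = 0, r2 = 0, r3 = 0, r4 = 0;
    std::thread t1([&] { X.exchange(1, std::memory_order_seq_cst); });
    std::thread t2([&] { Y.exchange(1, std::memory_order_seq_cst); });
    std::thread t3([&] {
        r1 = X.load(std::memory_order_seq_cst);
        r2 = Y.load(std::memory_order_seq_cst);
    });
    std::thread t4([&] {
        r3 = Y.load(std::memory_order_seq_cst);
        r4 = X.load(std::memory_order_seq_cst);
    });
    t1.join(); t2.join(); t3.join(); t4.join();
    return r1 == 1 && r2 == 0 && r3 == 1 && r4 == 0;
}

// Under the single total order of seq_cst operations, the two observers
// must agree on which store came first.
bool observers_always_agree(int iterations) {
    for (int i = 0; i < iterations; ++i)
        if (opposite_orders_seen())
            return false;
    return true;
}
```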


--
Dmitriy V'jukov

Alexander Shuvaev

Oct 23, 2009, 11:55:14 AM
to Scalable Synchronization Algorithms
Hi! It's strange, I always thought that acquire and release semantics
have a meaning orthogonal to read/write memory fences.
For example, the following two operations may be reordered:

y.load(memory_order_relaxed);
x.load(memory_order_acquire);

So, as I was thinking, acquire semantics is not equivalent to membar
#LoadStore | #LoadLoad. Now I see a self-contradiction in the future
standard, or it's in my mind only :). Neither does the acquire/release
semantics definition in the standard coincide with the Intel and
Microsoft definitions, nor does the definition of memory_order_acq_rel
agree with the acquire/release semantics definition.

From the Intel docs:
Acquire semantics imply that the instruction is made visible prior to
all subsequent orderable instructions.
Release semantics imply that the instruction is made visible after all
prior orderable instructions.

From the standard draft:

One additional fence that "falls apart" from the notation is the
acquire+release fence that is expressed as atomic_memory_fence
( __acq_rel ) and provides the equivalent of #LoadLoad | #LoadStore |
#StoreStore in Sun notation or lwsync on IBM platforms.

That much is clear to me.

Dmitriy Vyukov

Oct 30, 2009, 1:22:38 PM
to Scalable Synchronization Algorithms
On Oct 23, 8:55 am, Alexander Shuvaev <alex.shuv...@gmail.com> wrote:

> Hi! It's strange, I always thought that acquire and release semantics
> have a meaning orthogonal to read/write memory fences.

Yes, sort of.

> For example, next two operations may be reordered:
>
> y.load(memory_order_relaxed);
> x.load(memory_order_acquire);

Indeed.

> So, as I was thinking, acquire semantics is not equivalent to membar
> #LoadStore | #LoadLoad.

Yes, they are not equivalent. But they do not have to be.
However, what does matter is that #LoadStore | #LoadLoad is a superset
of acquire (i.e. stronger).

> Now I see a self-contradiction in the future
> standard, or it's in my mind only :). Neither does the acquire/release
> semantics definition in the standard coincide with the Intel and
> Microsoft definitions, nor does the definition of memory_order_acq_rel
> agree with the acquire/release semantics definition.
>
> From the Intel docs:
> Acquire semantics imply that the instruction is made visible prior to
> all subsequent orderable instructions.
> Release semantics imply that the instruction is made visible after all
> prior orderable instructions.
>
> From the standard draft:
>
> One additional fence that "falls apart" from the notation is the
> acquire+release fence that is expressed as atomic_memory_fence
> ( __acq_rel ) and provides the equivalent of #LoadLoad | #LoadStore |
> #StoreStore in Sun notation or lwsync on IBM platforms

I do not quite get what exactly you mean by contradiction, but I
probably see the root of the misunderstanding.
The hardware memory models and the C++ abstract memory model are all
quite different. There are no *exact* mappings between them, but none
are required either. What matters here is the following.
The C++ abstract memory model gives you a portable memory model against
which you can code portable algorithms. Then, when you compile your
portable program for some particular hardware, some mapping between the
C++ memory model and the hardware memory model takes place. Basically,
for each C++ primitive, for example the acquire fence, there is a
corresponding hardware counterpart; for SPARC it is membar
#LoadLoad|#LoadStore. There is only one requirement for the hardware
counterpart: it MUST NOT be WEAKER (i.e. it must be equal or stronger).
Now, if you look at the load-acquire implementations for Itanium and
SPARC (or whatever), for Itanium it is ld.acq, and for SPARC it is a
load followed by membar #LoadLoad|#LoadStore. These implementations
are, in a sense, orthogonal. And they are actually both stronger than
the C++ load-acquire, and stronger in different ways. However, what
matters is that they both IMPLEMENT the abstract C++ load-acquire
primitive. That's it. There may be no other relations between them.


--
Dmitriy V'jukov

Alexander Shuvaev

Nov 6, 2009, 11:24:43 AM
to Scalable Synchronization Algorithms
Thank you for your reply, Dmitriy.

> I do not quite get what exactly you mean by contradiction, but I
> probably see the root of the misunderstanding.
> The hardware memory models and the C++ abstract memory model are all
> quite different. There are no *exact* mappings between them, but none
> are required either. What matters here is the following.
> The C++ abstract memory model gives you a portable memory model against
> which you can code portable algorithms. Then, when you compile your
> portable program for some particular hardware, some mapping between the
> C++ memory model and the hardware memory model takes place. Basically,
> for each C++ primitive, for example the acquire fence, there is a
> corresponding hardware counterpart; for SPARC it is membar
> #LoadLoad|#LoadStore. There is only one requirement for the hardware
> counterpart: it MUST NOT be WEAKER (i.e. it must be equal or stronger).
> Now, if you look at the load-acquire implementations for Itanium and
> SPARC (or whatever), for Itanium it is ld.acq, and for SPARC it is a
> load followed by membar #LoadLoad|#LoadStore. These implementations
> are, in a sense, orthogonal. And they are actually both stronger than
> the C++ load-acquire, and stronger in different ways. However, what
> matters is that they both IMPLEMENT the abstract C++ load-acquire
> primitive. That's it. There may be no other relations between them.
>
> --
> Dmitriy V'jukov



OK, it's clear to me now. The heart of the problem was my ignorance of
the acquire/release fence definitions given in the current draft. These
synchronization primitives can't be replaced by RMW operations with the
corresponding semantics. Here are some interesting notes about
acquire/release fence implementation by Peter Dimov:

In your first example,

// thread 1
x.exchange(1, memory_order_acq_rel);
R1 = y.exchange(0, memory_order_acq_rel);

// thread 2
y.exchange(1, memory_order_acq_rel);
R2 = x.exchange(0, memory_order_acq_rel);

the outcome R1==R2==0 is impossible. I'm sure about it; it is enough
to keep in mind the definition of acquire/release operations (1.10/6
and 1.10/7) and the fact that "all modifications to a particular
atomic object occur in some particular total order" (1.10/5).

You wrote that the Itanium implementation is actually stronger than the
C++ load-acquire. As I understand it, Itanium acquire (for load) and
release (for store) operations give us the same guarantees as the
corresponding operations in the C++ standard, except for the fact that
Itanium acquire/release operations must be seen in the same total order
by all observers.




Alexander Shuvaev

Nov 9, 2009, 11:24:39 AM
to Scalable Synchronization Algorithms
Dmitriy, I have a question about your Relacy Race Detector. Since
you've implemented the C++0x atomic operations in Relacy, your
implementation must be as weak as the future standard allows (it must
permit the same memory reorderings). Obviously, it must not be
stronger, otherwise some tests would never fail even when the program
being tested is known to be error-prone.
Am I right that for the atomic operations in Relacy, the
implementation's guarantees (in terms of what can be verified) are
equal to those given in the standard?

Dmitriy Vyukov

Nov 9, 2009, 11:34:08 AM
to Scalable Synchronization Algorithms, rel...@googlegroups.com
Yes, you're right. Relacy does not rely on the underlying hardware
memory model during testing in any way; it models the C++0x memory
model as closely as possible. So even on x86 hardware you are able to
test such things as relaxed stores, relaxed RMWs, failing
compare_exchange_weak(), absence of a global order, etc.
Relacy provides "the worst C++0x implementation", in a sense.

--
Dmitriy V'jukov

Dmitriy Vyukov

Nov 9, 2009, 11:52:41 AM
to Scalable Synchronization Algorithms
Well, frankly, I do not know what exactly the differences are, but I
believe there must be some. You've named one: total store order.
Probably (my guess) a 'release sequence' on Itanium includes st.rel
operations as well.


--
Dmitriy V'jukov