
Subtle difference between C++0x MM and other MMs


Dmitriy V'jukov
Aug 24, 2008, 12:46:30 PM

Consider the following Peterson's algorithm implementation from Wikipedia:
http://en.wikipedia.org/wiki/Peterson%27s_algorithm

flag[0] = 0
flag[1] = 0
turn = 0

P0: flag[0] = 1
turn = 1
memory_barrier()
while( flag[1] && turn == 1 );
// do nothing
// critical section
...
// end of critical section
flag[0] = 0

P1: flag[1] = 1
turn = 0
memory_barrier()
while( flag[0] && turn == 0 );
// do nothing
// critical section
...
// end of critical section
flag[1] = 0

We can implement this in Java using volatile variables; the needed
memory_barrier() will be emitted automatically by the compiler.
We can implement this in C# using volatile variables, with
Thread.MemoryBarrier() as memory_barrier().
We can implement this in the x86 MM using plain loads and stores, with
the mfence instruction as memory_barrier().
We can implement this in C++0x using std::atomic<>, issuing loads
with memory_order_acquire, stores with memory_order_release, and
atomic_thread_fence(memory_order_seq_cst) as memory_barrier(). This is
the most straightforward translation of the Java/C#/x86 implementations.
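For concreteness, here is a sketch of this translation (P0 only; P1 is
symmetric), assuming flag and turn are std::atomic<int>:

// Straightforward translation sketch (P0 only; P1 is symmetric).
// Assumes: std::atomic<int> flag[2], turn; all initially 0.
flag[0].store( 1, std::memory_order_release );
turn.store( 1, std::memory_order_release );
std::atomic_thread_fence( std::memory_order_seq_cst );   // memory_barrier()
while( flag[1].load( std::memory_order_acquire )
       && turn.load( std::memory_order_acquire ) == 1 );
    // do nothing
// critical section
...
// end of critical section
flag[0].store( 0, std::memory_order_release );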

The only problem is that the C++0x implementation will not work.
To me, personally, this is quite counter-intuitive. And the following
question arises: what is the simplest way to translate an existing
Java/C#/x86 algorithm implementation to C++0x? It seems that it's not
so easy...

Dmitriy V'jukov

Peter Dimov
Aug 24, 2008, 1:52:58 PM

And the C++ equivalent is to use seq_cst loads and stores, which are
equivalent to Java volatiles.

> We can implement this in C# using volatile variables, and
> Thread.MemoryBarrier() as memory_barrier().
> We can implement this in x86 MM using plain loads and stores, and
> mfence instruction as memory_barrier().
> We can implement this in C++0x using std::atomic<> and issuing loads
> with memory_order_acquire, stores with memory_order_release, and
> atomic_thread_fence(memory_order_seq_cst) as memory_barrier(). This is
> the most straightforward translation of Java/C#/x86 implementations.
>
> The only problem is that C++0x implementation will not work.

Why will it not work?

Dmitriy V'jukov
Aug 24, 2008, 2:44:31 PM


Yes, it's possible to implement any algorithm that relies on a
sequentially consistent memory model in C++0x using seq_cst atomic
operations. But seq_cst atomic operations, especially stores, can be
quite expensive. So there is a general desire to use weaker operations,
like store-release and load-acquire. And in Java/C#/x86 it's possible
to implement Peterson's algorithm using weak operations plus one strong
fence. In C++0x it is not.


> > We can implement this in C# using volatile variables, and
> > Thread.MemoryBarrier() as memory_barrier().
> > We can implement this in x86 MM using plain loads and stores, and
> > mfence instruction as memory_barrier().
> > We can implement this in C++0x using std::atomic<> and issuing loads
> > with memory_order_acquire, stores with memory_order_release, and
> > atomic_thread_fence(memory_order_seq_cst) as memory_barrier(). This is
> > the most straightforward translation of Java/C#/x86 implementations.
>
> > The only problem is that C++0x implementation will not work.
>
> Why will it not work?


I don't mean every C++0x implementation of Peterson's algorithm, only
the particular implementation which uses store-release, load-acquire,
and one seq_cst fence.


Dmitriy V'jukov

Peter Dimov
Aug 24, 2008, 4:53:08 PM

On Aug 24, 9:44 pm, "Dmitriy V'jukov" <dvyu...@gmail.com> wrote:
> And in Java/C#/x86 it's possible
> to implement Peterson's algorithm using weak operations + 1 strong
> fence. In C++0x - NOT.

How would you implement Peterson's algorithm in Java using weak
operations and a fence? Java doesn't have weak operations or fences.
Its volatile loads and stores are equivalent to C++MM's seq_cst loads
and stores. Both promise sequential consistency (no more and no less).

> I mean not every C++0x implementation of Peterson's algorithm, but
> particular implementation which uses store-release, load-acquire + 1
> seq_cst fence.

Why do you think that this implementation doesn't work?

Peter Dimov
Aug 24, 2008, 6:47:06 PM

I think I see your point. Getting back to

P0: flag[0] = 1
turn = 1
memory_barrier()
while( flag[1] && turn == 1 );
// do nothing
// critical section
...
// end of critical section
flag[0] = 0

P1: flag[1] = 1
turn = 0
memory_barrier()
while( flag[0] && turn == 0 );
// do nothing
// critical section
...
// end of critical section
flag[1] = 0

It's easy to show that P0 and P1 can't block each other forever;
eventually they will agree on the value of 'turn' and one of them will
proceed.

The case where P0 sees flag[1] == 0 and P1 sees flag[0] == 0 is a
classic SC violation example and every reasonable definition of
memory_barrier rules it out.

The interesting case you must have had in mind is the sequence

P1:flag[1] = 1
P1:turn = 0
P0:flag[0] = 1
P0:turn = 1
P0:memory_barrier

Can P0 now see flag[1] == 0? (P1 will later see turn == 1 and enter
the critical section.)

I wonder whether the formal CLR memory model (or even the current
formal x86 memory model) disallows this. (XCHG for turn instead of a
fence should work.)

I think that the C++MM does, if the condition is while( turn == 1 &&
flag[1] ). P0 seeing its own turn=1 doesn't break the release sequence
started by P1:turn=0 because turn=1 is executed by the same thread
(first bullet in 1.10/6). So P1:turn=0 synchronizes-with the read from
'turn' in P0 and ensures that P1:flag[1]=1 is seen.

Dmitriy V'jukov
Aug 25, 2008, 2:27:54 AM

Exactly! And this behavior is very counter-intuitive to me!

> I wonder whether the formal CLR memory model (or even the current
> formal x86 memory model) disallows this. (XCHG for turn instead of a
> fence should work.)

As for the x86 MM, I think that yes, such behavior is disallowed. The
x86 MM defines the possible reorderings within one thread. The intended
reading is then that you must try to construct an interleaving of the
threads (taking the per-thread reorderings into account). If it's
possible to construct such an interleaving, the behavior is allowed; if
it's impossible, it is disallowed.

For Peterson's algorithm it's impossible to construct an interleaving
which breaks the algorithm.

As for the CLR, it's very informal. But I think the intended reading is
the same as for x86... just because Microsoft targets mainly x86 :)

But for C++0x, the fact that it's impossible to construct such an
interleaving doesn't matter...


> I think that the C++MM does, if the condition is while( turn == 1 &&
> flag[1] ). P0 seeing its own turn=1 doesn't break the release sequence
> started by P1:turn=0 because turn=1 is executed by the same thread
> (first bullet in 1.10/6). So P1:turn=0 synchronizes-with the read from
> 'turn' in P0 and ensures that P1:flag[1]=1 is seen.


"by the same thread" which execute release, not "by the same thread"
which execute acquire.
So this won't work too.


Dmitriy V'jukov

Dmitriy V'jukov
Aug 25, 2008, 2:37:45 AM

On 25 Aug, 00:53, Peter Dimov <pdi...@gmail.com> wrote:

> How would you implement Peterson's algorithm in Java using weak
> operations and a fence? Java doesn't have weak operations or fences.
> Its volatile loads and stores are equivalent to C++MM's seq_cst loads
> and stores. Both promise sequential consistency (no more and no less).


Java volatiles promise more. A volatile store is a release, and a
volatile load is an acquire. For x86 this means that plain stores and
loads will be used. Well, yes, sometimes a heavy membar will be emitted.
A C++0x seq_cst atomic store on x86 will be a locked instruction.
So, in my opinion, translating Java volatiles to C++0x's seq_cst
atomics is not fair.
IMVHO a fairer way is to translate a volatile store to a store-release,
a volatile load to a load-acquire, and manually emit a seq_cst fence.
At least this is what initially comes to mind.

Dmitriy V'jukov

Peter Dimov
Aug 25, 2008, 3:02:32 AM

On Aug 25, 9:27 am, "Dmitriy V'jukov" <dvyu...@gmail.com> wrote:

> "by the same thread" which execute release, not "by the same thread"
> which execute acquire.
> So this won't work too.

Yes, you are right.

Peter Dimov
Aug 25, 2008, 3:12:38 AM

On Aug 25, 9:37 am, "Dmitriy V'jukov" <dvyu...@gmail.com> wrote:

> On 25 Aug, 00:53, Peter Dimov <pdi...@gmail.com> wrote:
>
> > How would you implement Peterson's algorithm in Java using weak
> > operations and a fence? Java doesn't have weak operations or fences.
> > Its volatile loads and stores are equivalent to C++MM's seq_cst loads
> > and stores. Both promise sequential consistency (no more and no less).
>
> Java volatiles promise more. volatile store is release, and volatile
> load is acquire.

Exactly the same as seq_cst.

> for x86 this means that plain stores and loads will
> be used. Well, yes, sometimes heavy membar will be emitted.
> C++0x's seq_cst atomic store on x86 will be locked instruction.

Java VMs will also (be changed to) emit XCHG for volatile stores,
because plain stores do not guarantee SC, no matter how many MFENCEs
one uses. x86 is now officially not TSO.

T0: x = 1
T1: y = 1
T2: r1 = x; r2 = y; // 1 0 allowed
T3: r3 = y; r4 = x; // 1 0 allowed
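In C++0x notation the same test reads roughly as below; with seq_cst on
every access the outcome r1 == 1, r2 == 0, r3 == 1, r4 == 0 is
forbidden, while with release stores and acquire loads it is still
allowed:

// IRIW litmus test sketch; x and y are assumed to start at 0.
std::atomic<int> x( 0 ), y( 0 );

// T0:
x.store( 1, std::memory_order_seq_cst );
// T1:
y.store( 1, std::memory_order_seq_cst );
// T2:
int r1 = x.load( std::memory_order_seq_cst );
int r2 = y.load( std::memory_order_seq_cst );
// T3:
int r3 = y.load( std::memory_order_seq_cst );
int r4 = x.load( std::memory_order_seq_cst );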

Dmitriy V'jukov
Aug 25, 2008, 3:27:00 AM


My example is not about total order, it's about ordering just between
2 threads.

Dmitriy V'jukov

Peter Dimov
Aug 25, 2008, 3:58:22 AM

On Aug 25, 10:27 am, "Dmitriy V'jukov" <dvyu...@gmail.com> wrote:

> My example is not about total order, it's about ordering just between
> 2 threads.

How does this change the fact that Java volatiles and C++ seq_cst
operations are effectively defined in the same way?

Dmitriy V'jukov
Aug 25, 2008, 6:15:04 AM


http://groups.google.com/group/comp.programming.threads/msg/ec91885b65521880


But note that you didn't get it instantly either:
http://groups.google.com/group/comp.programming.threads/msg/34ed88e972296af9?hl=en
And you are one of the developers of the current MM; just think about
others :)


If and when I find more convincing arguments, I will post them...


Dmitriy V'jukov

Peter Dimov
Aug 25, 2008, 7:37:37 AM

On Aug 25, 1:15 pm, "Dmitriy V'jukov" <dvyu...@gmail.com> wrote:
> On Aug 25, 11:58 am, Peter Dimov <pdi...@gmail.com> wrote:
>
> > On Aug 25, 10:27 am, "Dmitriy V'jukov" <dvyu...@gmail.com> wrote:
>
> > > My example is not about total order, it's about ordering just between
> > > 2 threads.
>
> > How does this change the fact that Java volatiles and C++ seq_cst
> > operations are effectively defined in the same way?
>
> http://groups.google.com/group/comp.programming.threads/msg/ec91885b6...
>
> But note that you don't get it instantly too:http://groups.google.com/group/comp.programming.threads/msg/34ed88e97...

> And you are one of the developers of current MM, just think about
> others :)
>
> When and if I will find more convincing arguments, I will post them...

I agree with you that the translation from x86 to the C++MM doesn't
work, and that this is a potential problem. I don't see at the moment
how the definition of a seq_cst fence can be modified to match the x86
behavior though.

But I still don't see why you think that there is any difference
between Java volatiles and C++ seq_cst operations. They are specified
in (more or less) the exact same way. The Peterson example's
formulation in Java and in C++ seq_cst is equivalent.

Boehm, Hans
Aug 25, 2008, 3:39:26 PM

> formulation in Java and in C++ seq_cst is equivalent.
>
Thanks for pointing out an interesting example.

I'm with Peter on this one. There are some differences between Java
volatiles and C++ seq_cst atomics (e.g. C++ seq_cst atomics can be non-
scalars and increments on integral types are atomic). But none of
those matter here.

I also believe that on X86, if you write the original example with
release stores following the critical section, omit the fence, and
rely on seq_cst atomics everywhere else, the compiler should fairly
easily be able to generate code at least as good as the fence-based
code. And I find it much easier to reason about that version of the
code. Unfortunately, I think this does not currently apply across all
architectures.
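
Concretely, a rough sketch of that version (P0 only; P1 is symmetric),
assuming flag and turn are std::atomic<int> as before:

// Sketch: seq_cst atomics for the protocol, no explicit fence, and a
// release store following the critical section, as described above.
// Assumes: std::atomic<int> flag[2], turn; all initially 0.
flag[0].store( 1, std::memory_order_seq_cst );
turn.store( 1, std::memory_order_seq_cst );
while( flag[1].load( std::memory_order_seq_cst )
       && turn.load( std::memory_order_seq_cst ) == 1 );
    // do nothing
// critical section
...
// end of critical section
flag[0].store( 0, std::memory_order_release );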

We do realize that this version of a "fence" is weaker than some
hardware fences. To some extent, it has to be, since its goal is to
be efficiently implementable across architectures. I'm not sure
whether there is a possible definition that would both satisfy that
constraint, and make this particular version of Peterson's algorithm
work.

Clarification, of sorts, on X86 implementation of sequentially
consistent stores:
I believe the currently published specifications promise correctness
only if the stores are implemented with an atomic RMW operation like
xchg. AFAIK, it is currently still unclear whether an ordinary store
with a trailing fence is also sufficient. And there seems to be
little reason not to compile it as an xchg.

Hans

Peter Dimov
Aug 25, 2008, 4:25:45 PM

On Aug 25, 1:47 am, Peter Dimov <pdi...@gmail.com> wrote:
> The interesting case you must have had in mind is the sequence
>
> P1:flag[1] = 1
> P1:turn = 0
> P0:flag[0] = 1
> P0:turn = 1
> P0:memory_barrier
>
> Can P0 now see flag[1] == 0? (P1 will later see turn == 1 and enter
> the critical section.)

You may be interested to know that this is (likely to be) disallowed
on PowerPC as well. It'd definitely be nice to somehow disallow it in
the C++MM as well, if we figure out how to do that.

Otherwise, we'll have to live with something like

P0: flag[0].store( 1, relaxed );
    turn.exchange( 1, acq_rel );
    while( flag[1].load( acquire ) && turn.load( relaxed ) == 1 );
        // do nothing
    // critical section
    ...
    // end of critical section
    flag[0].store( 0, release );

(I hope I got this right). Of course using a RMW somewhat defeats the
point of the algorithm, but we're using it as an illustration.

Dmitriy V'jukov
Aug 26, 2008, 7:16:31 AM

On Aug 25, 11:39 pm, "Boehm, Hans" <hans.bo...@hp.com> wrote:

> We do realize that this version of a "fence" is weaker than some
> hardware fences.  To some extent, it has to be, since it's goal is to
> be efficiently implementable across architectures.  I'm not sure
> whether there is a possible definition that would both satisfy that
> constraint, and make this particular version of Peterson's algorithm
> work.


I can describe how I currently implement the seq_cst fence in Relacy
Race Detector in Java/CLR mode:
http://groups.google.com/group/relacy

I will try to formalize the rules.
Preconditions:
1. There is store A with memory_order_release (or stronger) on atomic
object M.
2. There is store B with memory_order_relaxed (or stronger) on atomic
object M.
3. A precedes B in modification order of M, A and B are adjacent in
modification order of M.
4. There is seq_cst fence C.
5. B is sequenced-before C.

Postcondition:
A synchronizes-with C.

My reasoning is the following. Preconditions 2-5 are effectively
equivalent to:
2*. There is a load B with memory_order_acquire on atomic object M.
3*. B loads the value stored by A.

Informally this can be rephrased as:
a seq_cst fence renders all sequenced-before stores as loads-acquire.


Consider how this applies to Peterson's algorithm:

[01] flag[0] = 0
[02] flag[1] = 0
[03] turn = 0

P0:
[04] flag[0] = 1
[05] turn = 1
[06] memory_barrier()
[07] while( flag[1]
[08] && turn == 1 );
[09] // do nothing
[10] // critical section
[11] ...
[12] // end of critical section
[13] flag[0] = 0

P1:
[14] flag[1] = 1
[15] turn = 0
[16] memory_barrier()
[17] while( flag[0]
[18] && turn == 0 );
[19] // do nothing
[20] // critical section
[21] ...
[22] // end of critical section
[23] flag[1] = 0


The store to variable 'turn' on line 15 is store A.
The store to variable 'turn' on line 05 is store B.
A precedes B in the modification order of 'turn', and A and B are
adjacent in that modification order.
Store B is sequenced before the seq_cst fence on line 06 (fence C).
So store A synchronizes-with fence C.
So the load of 'flag[1]' on line 07 cannot return 0 any more.


Dmitriy V'jukov

Dmitriy V'jukov
Aug 26, 2008, 7:18:45 AM

On Aug 25, 3:37 pm, Peter Dimov <pdi...@gmail.com> wrote:

> I agree with you that the translation from x86 to the C++MM doesn't
> work, and that this is a potential problem. I don't see at the moment
> how the definition of a seq_cst fence can be modified to match the x86
> behavior though.


This is how I have currently implemented it in Relacy Race Detector in
Java/CLR mode:
http://groups.google.com/group/comp.programming.threads/msg/07c810b38be80bbb


Dmitriy V'jukov

Dmitriy V'jukov
Aug 26, 2008, 7:20:41 AM

On Aug 26, 12:25 am, Peter Dimov <pdi...@gmail.com> wrote:
> On Aug 25, 1:47 am, Peter Dimov <pdi...@gmail.com> wrote:
>
> > The interesting case you must have had in mind is the sequence
>
> > P1:flag[1] = 1
> > P1:turn = 0
> > P0:flag[0] = 1
> > P0:turn = 1
> > P0:memory_barrier
>
> > Can P0 now see flag[1] == 0? (P1 will later see turn == 1 and enter
> > the critical section.)
>
> You may be interested to know that this is (likely to be) disallowed
> on PowerPC as well. It'd definitely be nice to somehow disallow it in
> the C++MM as well, if we figure out how to do that.


I think that this is disallowed on Sparc too.


Dmitriy V'jukov

Peter Dimov
Aug 26, 2008, 8:37:58 PM

Yes, I believe it is. FWIW, the following:

P0: flag[0].store( 1, relaxed );
    fence( seq_cst ); // #1
    turn.store( 1, relaxed );
    fence( seq_cst );
    while( flag[1].load( acquire ) && turn.load( relaxed ) == 1 );
        // do nothing
    // critical section
    ...
    // end of critical section
    flag[0].store( 0, release );

seems to work under the C++MM. The model in its current form doesn't
let us get away with just a fence(release) at #1.

Dmitriy V'jukov
Aug 27, 2008, 10:11:12 AM

On Aug 27, 4:37 am, Peter Dimov <pdi...@gmail.com> wrote:

> > > You may be interested to know that this is (likely to be) disallowed
> > > on PowerPC as well. It'd definitely be nice to somehow disallow it in
> > > the C++MM as well, if we figure out how to do that.
>
> > I think that this is disallowed on Sparc too.
>
> Yes, I believe it is. FWIW, the following:
>
> P0: flag[0].store( 1, relaxed );
>     fence( seq_cst ); // #1
>     turn.store( 1, relaxed );
>     fence( seq_cst );
>     while( flag[1].load( acquire ) && turn.load( relaxed ) == 1 );
>             // do nothing
>     // critical section
>     ...
>     // end of critical section
>     flag[0].store( 0, release );
>
> seems to work under the C++MM. The model in its current form doesn't
> let us get away with just a fence(release) at #1.


It looks like it works.
But it's easy to construct a heavyweight correct algorithm.
The question that bothers me is: how lightweight an algorithm (that is
at the same time formally correct) can I construct with C++0x?


Dmitriy V'jukov

Dmitriy V'jukov
Aug 28, 2008, 9:34:00 AM


Or put it this way.
The 'synchronizes-with' definition contains something like '... acquire
operation which loads the value stored by X ...'. The reasoning behind
this formulation is (as I understand it):
1. 'loads the value stored by X' means that the operation happens after
X. It's obvious, it's a causal relation.
2. 'acquire operation' means that subsequent operations can't hoist
above this operation.
Now consider 'store operation which follows X in the modification order
of M, followed by a full fence'. Here:
1. 'operation which follows X in the modification order of M' means
that the operation happens after X. It's a kind of causal relation too.
2. 'followed by a full fence' means that subsequent operations can't
hoist above this fence.
So effectively those two definitions are equal wrt the
'synchronizes-with' relation.

Anthony, Peter, Hans, what do you think?

Dmitriy V'jukov

Anthony Williams
Sep 4, 2008, 3:47:25 AM

"Dmitriy V'jukov" <dvy...@gmail.com> writes:

> On Aug 26, 3:16 pm, "Dmitriy V'jukov" <dvyu...@gmail.com> wrote:
>> I will try to formalize the rules.
>> Preconditions:
>> 1. There is store A with memory_order_release (or stronger) on atomic
>> object M.
>> 2. There is store B with memory_order_relaxed (or stronger) on atomic
>> object M.
>> 3. A precedes B in modification order of M, A and B are adjacent in
>> modification order of M.
>> 4. There is seq_cst fence C.
>> 5. B is sequenced-before C.
>>
>> Postcondition:
>> A synchronizes-with C.
>>
>> My reasoning is following. Preconditions 2-5 effectively equivalent
>> to:
>> 2*. There is load B with memory_order_acquire on atomic object M.
>> 3*. B loads value stored by A.

If the store B in step 2 was a RMW operation, then yes. If it's just a
plain store then no. Peter has proposed that the definition of release
sequence now includes relaxed RMW operations too.

> Or put this way.
> 'synchronizes-with' definition contains something like '... acquire
> operation which loads value stored by X ...'. Reasoning behind this
> formulation is (as I understand it):
> 1. 'loads value stored by X' means that operation happens after X.
> It's obvious, it's a causal relation.
> 2. 'acquire operation' means that subsequent operations can't hoist
> above this operation.
> Now consider 'store operation which follows X in modification order of
> M, followed by full fence'. Here:
> 1. 'operation which follows X in modification order of M' means that
> operation happens after X. It's a kind of causal relation too.

Yes, it's later in the modification order of M, but that doesn't mean
that other operations that happen-before the first store (A) also
happen-before the second store (B). I can imagine an architecture
where the stores are just issued to the memory, and one clobbers the
value written by the other, with no synchronization between the
processors at all.

> 2. 'followed by full fence' means that subsequent operations can't
> hoist above this fence.
> So effectively those two definitions equal wrt 'synchronizes-with'
> relation.

I disagree.

Anthony
--
Anthony Williams | Just Software Solutions Ltd
Custom Software Development | http://www.justsoftwaresolutions.co.uk
Registered in England, Company Number 5478976.
Registered Office: 15 Carrallack Mews, St Just, Cornwall, TR19 7UL

Dmitriy V'jukov
Sep 8, 2008, 12:15:55 PM

On Sep 4, 11:47 am, Anthony Williams <anthony....@gmail.com> wrote:

> > Or put this way.
> > 'synchronizes-with' definition contains something like '... acquire
> > operation which loads value stored by X ...'. Reasoning behind this
> > formulation is (as I understand it):
> > 1. 'loads value stored by X' means that operation happens after X.
> > It's obvious, it's a causal relation.
> > 2. 'acquire operation' means that subsequent operations can't hoist
> > above this operation.
> > Now consider 'store operation which follows X in modification order of
> > M, followed by full fence'. Here:
> > 1. 'operation which follows X in modification order of M' means that
> > operation happens after X. It's a kind of causal relation too.
>
> Yes it's later in the modification order of M, but that doesn't mean
> that other operations that happen-before the first store (A) also
> happen-before the second store (B).


Other operations can happen after store (B); that's perfectly OK. They
must not happen after the seq_cst fence!

> I can imagine an architecture
> where the stores are just issued to the memory, and one clobbers the
> value written by the second, with no synchronization between the
> processors at all.
>
> > 2. 'followed by full fence' means that subsequent operations can't
> > hoist above this fence.
> > So effectively those two definitions equal wrt 'synchronizes-with'
> > relation.
>
> I disagree.


Can you then point to the error in my reasoning?


Dmitriy V'jukov
--
Relacy Race Detector: Make your synchronization correct!
http://groups.google.ru/group/relacy/web

Anthony Williams
Sep 8, 2008, 5:49:49 PM

"Dmitriy V'jukov" <dvy...@gmail.com> writes:

> On Sep 4, 11:47 am, Anthony Williams <anthony....@gmail.com> wrote:
>
>> > Or put this way.
>> > 'synchronizes-with' definition contains something like '... acquire
>> > operation which loads value stored by X ...'. Reasoning behind this
>> > formulation is (as I understand it):
>> > 1. 'loads value stored by X' means that operation happens after X.
>> > It's obvious, it's a causal relation.
>> > 2. 'acquire operation' means that subsequent operations can't hoist
>> > above this operation.
>> > Now consider 'store operation which follows X in modification order of
>> > M, followed by full fence'. Here:
>> > 1. 'operation which follows X in modification order of M' means that
>> > operation happens after X. It's a kind of causal relation too.
>>
>> Yes it's later in the modification order of M, but that doesn't mean
>> that other operations that happen-before the first store (A) also
>> happen-before the second store (B).
>
>
> Other operations can happen after store (B). It's perfectly Ok. They
> must not happen after seq_cst fence!

If they are stores to other memory locations, that's not guaranteed.

std::atomic_int x,m;

void thread1()
{
    x.store(1,std::memory_order_relaxed);
    m.store(1,std::memory_order_release); // A
}

void thread2()
{
    m.store(2,std::memory_order_relaxed); // B
    std::atomic_thread_fence(std::memory_order_seq_cst); // C
    int z=x.load(std::memory_order_relaxed);
}

First off, thread2() does no loads, so there is no way it can tell
whether A or B comes first in the modification order of m.

Since we don't know which comes first, we don't know whether z will
have the value 1 or 0.

Even if we read back the value of m and found that it had the value
"2", we wouldn't have any extra information, since the store A could
just not have become visible to thread2 yet.

Stores clobber release sequences. If you need the ordering
information, use a RMW operation such as exchange().
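
For example, a minimal variation of the example above, changing only
thread2 (a sketch; acq_rel is chosen here, and it is the acquire half
that matters for this guarantee):

void thread2()
{
    // B is now an RMW. If it reads the value written by A (prev==1),
    // it is an acquire operation reading a release store, so A
    // synchronizes-with it and the following load must see x==1.
    int prev=m.exchange(2,std::memory_order_acq_rel); // B (RMW)
    int z=x.load(std::memory_order_relaxed);
    // prev==1 implies z==1; the seq_cst fence is not needed for this.
}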

Dmitriy V'jukov
Sep 9, 2008, 5:40:13 AM

On 9 Sep, 01:49, Anthony Williams <anthony....@gmail.com> wrote:

> >> Yes it's later in the modification order of M, but that doesn't mean
> >> that other operations that happen-before the first store (A) also
> >> happen-before the second store (B).
>
> > Other operations can happen after store (B). It's perfectly Ok. They
> > must not happen after seq_cst fence!

> [...]

> First off, thread2() does no loads, so there is no way it can tell
> whether A or B comes first in the modification order of m.
>
> Since we don't know which comes first, we don't know whether z will
> have the value 1 or 0.

It's irrelevant to the question. See the original Peterson's algorithm
example. There are situations where you don't need to know which store
comes first.

Dmitriy V'jukov

Anthony Williams
Sep 9, 2008, 6:22:40 AM

"Dmitriy V'jukov" <dvy...@gmail.com> writes:

True, there are circumstances where it doesn't matter which came
first, but in order to get synchronization, the processor must
know.

You only get a release sequence with a release store and subsequent
RMW operations. A subsequent store destroys the original sequence
(though it may start its own). If the processor doesn't load the old
value it doesn't have to do any synchronization, and can just throw
the value at the memory hardware and leave the memory unit to pick an
arbitrary ordering of the stores.

There is no happens-before relationship between the store to x and the
load from x in this example. This use of a fence is not equivalent to
the use of load-acquire.

I'm not sure that even x86 CPUs guarantee the ordering in this case:
it isn't the normal transitive visibility because thread 2 doesn't
load the value of m.
