
questions about memory_order_seq_cst fence


Masakuni Oishi
Jun 13, 2011, 12:40:30 PM
Hi,

I have some questions about C++0x memory model.

/*** code 1 ***/
// Initially
atomic<int> x(0), y(0);

// Thread 1:
y.store(1, memory_order_release);
atomic_thread_fence(memory_order_seq_cst);
r1 = x.load(memory_order_acquire);

// Thread 2:
x.store(1, memory_order_release);

// Thread 3:
r2 = x.load(memory_order_acquire);
atomic_thread_fence(memory_order_seq_cst);
r3 = y.load(memory_order_acquire);
/***************/

In the above code, is r1 == 0 && r2 == 1 && r3 == 0 possible?
I think it should be prohibited, but I couldn't confirm that
from the C++0x FDIS.

So I changed the memory_order of Thread 2 to seq_cst, like this:

/*** code 2 ***/
// Initially
atomic<int> x(0), y(0);

// Thread 1:
y.store(1, memory_order_release);
atomic_thread_fence(memory_order_seq_cst); // (1)
r1 = x.load(memory_order_acquire);

// Thread 2:
x.store(1, memory_order_seq_cst); // (3)

// Thread 3:
r2 = x.load(memory_order_acquire);
atomic_thread_fence(memory_order_seq_cst); // (2)
r3 = y.load(memory_order_acquire);
/***************/

In code 2, r1 == 0 && r2 == 1 && r3 == 0 is not allowed.
The proof is as follows:
When (1) precedes (2) in S,
r3 == 0 is not allowed (FDIS 29.3_6).
When (2) precedes (1) in S,
if r2 == 1 then (3) happens before (2), so (3) precedes (2) in S,
therefore, (3) precedes (1) in S.
Then, r1 == 0 is not allowed (FDIS 29.3_4).

So, my questions are:
For code 1, is r1 == 0 && r2 == 1 && r3 == 0 allowed in the C++0x
FDIS?
If it is, is there any real architecture which allows such a result?

-- masakuni

Edek
Jun 13, 2011, 4:39:50 PM
On 06/13/2011 06:40 PM, Masakuni Oishi wrote:
> Hi,
>
> I have some questions about C++0x memory model.
>
> /*** code 1 ***/
>

[snip a lot]

I'm tired and I haven't followed the C++11 discussion,
but are the fences relevant to the atomic operations
themselves? I mean, aren't they for enforcing ordering of regular
variables around atomics?

Edek

Joshua Maurice
Jun 13, 2011, 7:00:02 PM
On Jun 13, 9:40 am, Masakuni Oishi <yam...@bsdhouse.org> wrote:
> Hi,
>
> I have some questions about C++0x memory model.
>
> /*** code 1 ***/
> // Initially
> atomic<int> x(0), y(0);
>
> // Thread 1:
> y.store(1, memory_order_release);
> atomic_thread_fence(memory_order_seq_cst);
> r1 = x.load(memory_order_acquire);
>
> // Thread 2:
> x.store(1, memory_order_release);
>
> // Thread 3:
> r2 = x.load(memory_order_acquire);
> atomic_thread_fence(memory_order_seq_cst);
> r3 = y.load(memory_order_acquire);
> /***************/

I don't claim to be correct or knowledgeable, but let me take a crack
at this. Please double check my reasoning.

/*** code 1 ***/
// Initially
atomic<int> x(0), y(0);

// Thread 1:
y.store(1, memory_order_release);

atomic_thread_fence(memory_order_seq_cst); //fence 1
r1 = x.load(memory_order_acquire);

// Thread 2:
x.store(1, memory_order_release);

// Thread 3:
r2 = x.load(memory_order_acquire);

atomic_thread_fence(memory_order_seq_cst); //fence 2
r3 = y.load(memory_order_acquire);

So,


>In the above code, is r1 == 0 && r2 == 1 && r3 == 0 possible?

Let us see.

Let's try a proof by contradiction. Let's assume r1 == 0 && r2 == 1 &&
r3 == 0 as end conditions is possible.

Case 1 - fence 1 happens-before fence 2.
Thus "y.store(1, memory_order_release);" happens-before "r3 =
y.load(memory_order_acquire);". But then r3 == 1, which contradicts
our assumption, so false.

Case 2 - fence 2 happens-before fence 1.
Thus "r2 = x.load(memory_order_acquire);" happens-before "r1 =
x.load(memory_order_acquire);".

To get r1 == 0 and r2 == 1, the modification order of x needs to be:
r1 = x.load(memory_order_acquire);
x.store(1, memory_order_release);
r2 = x.load(memory_order_acquire);

The required modification order of x is inconsistent with the happens-
before relationship, which is disallowed. Thus false.

End cases.

We got false on both cases, thus we can complete our proof by
contradiction, and thus one of our assumptions must be false. Thus our
only assumption, r1 == 0 && r2 == 1 && r3 == 0 as possible end
conditions, must be false.

Please, someone double check my reasoning. I hate (yet love) thinking
about this stuff. So easy to make a little mistake.

----

Also, note the following.

Note that thread 2 has only a single memory action. It is an atomic
write of memory_order_release. There are no other memory actions in
that thread, so the write might as well be memory_order_relaxed. (At
least, maybe. There's something nagging me about release sequences,
but I don't think that's relevant in this example.)

Now, we have two reads of x. As there are no writes to x besides
memory_order_relaxed writes, the memory_order_acquire reads of x are
equivalent to memory_order_relaxed.

The rule I use for acquire and release semantics is: "If the
acquire sees the release, then all reads after the acquire must see
all writes before the release." There are two operations on y, one
read acquire, one write release. There are no reads after the read
acquire, and there are no writes before the write release, so both of
these memory_order types are equivalent to memory_order_relaxed. (With
the same 'maybe' qualification as above - release sequences
notwithstanding.)

Thus, your acquires and releases are red herrings. They don't do
anything to change the nature of your little program.

Joshua Maurice
Jun 14, 2011, 2:31:34 AM

Actually, I'm wrong. Well, I had the right conclusion I think, but
entirely wrong reasoning. (I still think that I was right that all of
loads and stores are equivalent to memory_order_relaxed, but the rest
of my reasoning is wrong.) Allow me to try to correct myself. (For
some weird reason, I wrote assuming that reads are part of the
modification order. Silly me - that's wrong. Brain fart.)

/*** code 1 ***/
// Initially
atomic<int> x(0), y(0);

// Thread 1:
y.store(1, memory_order_release);
atomic_thread_fence(memory_order_seq_cst); //fence 1
r1 = x.load(memory_order_acquire);

// Thread 2:
x.store(1, memory_order_release);

// Thread 3:
r2 = x.load(memory_order_acquire);
atomic_thread_fence(memory_order_seq_cst); //fence 2
r3 = y.load(memory_order_acquire);

/* **** end code */


The modification order for x is:
atomic<int> x(0);
x.store(1, memory_order_release);


Case 1- fence 1 comes before fence 2.


Thus
y.store(1, memory_order_release);
happens-before
r3 = y.load(memory_order_acquire);

Thus r3 == 1.


Case 2- fence 2 happens-before fence 1.
Thus
r2 = x.load(memory_order_relaxed);
happens-before
r1 = x.load(memory_order_relaxed);

Consider C++0x, n3242, 1.10 / 16, aka read-read coherence. In my
hopefully correct translation, it says:
[]
For atomic reads A and B on M, if A happens-before B, then A sees the
same write as B, or the write which A sees must come before the write
which B sees in the modification order of M.
[/]

Let's suppose, for a moment, that r1 == 0 and r2 == 1.
We know that the atomic read r2 happens-before the atomic read r1.
We know that the read r2 sees the write x=1.
We know that the read r1 sees the write x=0.
We can thus conclude from 1.10 / 16 that the write x=1 is the write
x=0 (nope), or the write x=1 comes before x=0 in the modification
order of x (also nope).
Thus we know that r1 == 0 and r2 == 1 is impossible ala proof by
contradiction.

End cases


Thus we know
(r3 == 1) OR NOT (r1 == 0 AND r2 == 1)
NOT (r3 == 0) OR NOT (r1 == 0 AND r2 == 1)
NOT (r3 == 0 AND r1 == 0 AND r2 == 1)
NOT (r1 == 0 AND r2 == 1 AND r3 == 0)

I am so feeling totally not confident about all of this. I post this
merely as my next best guess. Of course I feel better about it, or I
wouldn't be posting it, but damn this is hard to reason about. I have
a lot to learn.

Masakuni Oishi
Jun 14, 2011, 11:27:29 AM
Edek <edek.pienkow...@gmail.com> wrote:
> I'm tired and I haven't followed the C++11 discussion,
> but are the fences relevant to the atomic operations
> themselves? I mean, aren't they for enforcing regular
> variables around atomics?

The fences enforce ordering for not only regular variables
but also atomic variables.
For example:

// Initially
atomic<int> x(0), y(0);

// Thread 1:
y.store(1, memory_order_relaxed);
x.store(1, memory_order_release);

// Thread 2:
r1 = x.load(memory_order_acquire);
r2 = y.load(memory_order_relaxed);


In this code, if r1 is 1, then r2 must be 1, not 0.
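A minimal compilable version of the same example, in case anyone wants to
experiment with it (the std::thread harness and the function names here are
just for illustration, not part of the point itself):

#include <atomic>
#include <cassert>
#include <thread>

std::atomic<int> x(0), y(0);

void producer()
{
    y.store(1, std::memory_order_relaxed);
    x.store(1, std::memory_order_release);
}

void consumer()
{
    int r1 = x.load(std::memory_order_acquire);
    int r2 = y.load(std::memory_order_relaxed);
    // If the acquire load saw the release store, the earlier relaxed
    // store to y must also be visible: r1 == 1 implies r2 == 1.
    if (r1 == 1)
        assert(r2 == 1);
}

int main()
{
    std::thread t1(producer), t2(consumer);
    t1.join();
    t2.join();
}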

-- masakuni

Masakuni Oishi
Jun 14, 2011, 11:32:56 AM
Joshua Maurice <joshuamaur...@gmail.com> wrote:
[snip]

> Case 1- fence 1 comes before fence 2.
> Thus
>     y.store(1, memory_order_release);
> happens-before
>     r3 = y.load(memory_order_acquire);
> Thus r3 == 1.
>
> Case 2- fence 2 happens-before fence 1.
> Thus
>     r2 = x.load(memory_order_relaxed);
> happens-before
>     r1 = x.load(memory_order_relaxed);

The total order S described in FDIS 29.3 is not equivalent to
happens-before.
"If fence 2 happens before fence 1, then fence 2 precedes fence 1 in S"
is always satisfied. But the converse -- "If fence 2 precedes fence 1
in S, then fence 2 happens before fence 1" -- is not always satisfied.

Since S is a total order, either "fence 2 precedes fence 1 in S" or
"fence 1 precedes fence 2 in S" must be true.
But happens-before relation is not a total order.

-- masakuni

Anthony Williams
Jun 14, 2011, 6:05:57 PM
Masakuni Oishi <yam...@bsdhouse.org> writes:

> /*** code 1 ***/
> // Initially
> atomic<int> x(0), y(0);
>
> // Thread 1:
> y.store(1, memory_order_release);
> atomic_thread_fence(memory_order_seq_cst);
> r1 = x.load(memory_order_acquire);
>
> // Thread 2:
> x.store(1, memory_order_release);
>
> // Thread 3:
> r2 = x.load(memory_order_acquire);
> atomic_thread_fence(memory_order_seq_cst);
> r3 = y.load(memory_order_acquire);
> /***************/
>
> In the above code, is r1 == 0 && r2 == 1 && r3 == 0 possible?
> I think it should be prohibited, but I couldn't make sure that
> from the C++0x FDIS.

Premise 1: r3 == 0.

Since the only store to y is the store of 1 by thread 1, r3 == 0 implies
that the load from y on thread 3 is reading the initial value of y.

29.3p6 of N3290 says:

"For atomic operations A and B on an atomic object M, where A modifies M
and B takes its value, if there are memory_order_seq_cst fences X and Y
such that A is sequenced before X, Y is sequenced before B, and X
precedes Y in S, then B observes either the effects of A or a later
modification of M in its modification order."

If the fence in thread 1 precedes the fence in thread 3 in S, then the
read at r3 must therefore see the value written by the store in thread
1. Consequently, the fence in thread 3 must precede the fence in thread
1 in S.

Premise 2: r2 == 1.

The load of x in thread 3 reading 1 implies it has read the value of the
store to x from thread 2.

We already know that the fence in thread 3 precedes the fence in thread
1 in S.

29.3p7 of N3290 says:

"For atomic operations A and B on an atomic object M, if there are
memory_order_seq_cst fences X and Y such that A is sequenced before X, Y
is sequenced before B, and X precedes Y in S, then B occurs later than A


in the modification order of M."

In our case, the fence in thread 3 occurs before the fence in thread 1
in S. The fence in thread 3 is thus X, the fence in thread 1 Y, the read
from x in thread 3 is A and the read from x in thread 1 is B.

Thus the read of x in thread 1 must occur later in the modification
order of x than the read in thread 3. Since the read in thread 3 is of
the last store to x (1), the read in thread 1 must also read that value,
and r1 == 1.

Therefore r1==0, r2 == 1 and r3 == 0 is not possible.

> So, my questions are:
> For code 1, is r1 == 0 && r2 == 1 && r3 == 0 allowed in the C++0x
> FDIS?

No.

Anthony
--
Author of C++ Concurrency in Action http://www.stdthread.co.uk/book/
just::thread C++0x thread library http://www.stdthread.co.uk
Just Software Solutions Ltd http://www.justsoftwaresolutions.co.uk
15 Carrallack Mews, St Just, Cornwall, TR19 7UL, UK. Company No. 5478976

Joshua Maurice
Jun 14, 2011, 7:44:37 PM

I agree with your proof up to this point.

> We already know that the fence in thread 3 precedes the fence in thread
> 1 in S.
>
> 29.3p7 of N3290 says:
>
> "For atomic operations A and B on an atomic object M, if there are
> memory_order_seq_cst fences X and Y such that A is sequenced before X, Y
> is sequenced before B, and X precedes Y in S, then B occurs later than A
> in the modification order of M."
>
> In our case, the fence in thread 3 occurs before the fence in thread 1
> in S. The fence in thread 3 is thus X, the fence in thread 1 Y, the read
> from x in thread 3 is A and the read from x in thread 1 is B.
>
> Thus the read of x in thread 1 must occur later in the modification
> order of x than the read in thread 3. Since the read in thread 3 is of
> the last store to x (1), the read in thread 1 must also read that value,
> and r1 == 1.

I disagree with this. I think you made the same mistake which I made
in my first post of this thread. Only /modifications/ appear in a
modification order. Reads do not appear in the modification order. At
least, this is the simple reading of 1.10 / 7, and its name is
"modification order", not "access order". Also note that the
definition of a "release sequence" in 1.10 / 8 does not make sense if
reads may appear in modification orders. Also note that the definition
of "visible sequence of side effects" in 1.10 / 14 does not make sense
if reads may appear in modification orders. (Note that there's some
ambiguity w.r.t. volatile atomic objects, as volatile reads do count
as side effects, but non-volatile reads AFAIK are not side effects.)

As I'm beginning to understand this, there is a typo in 29.3 / 7. The
exact text is:
[]
For atomic operations A and B on an atomic object M, if there are
memory_order_seq_cst fences X and Y such that A is sequenced before X,
Y is sequenced before B, and X precedes Y in S, then B occurs later
than A in the modification order of M.
[/]
I think it's meant to read:
[]
For atomic **write (and atomic read-modify-write operations)** A and B
on an atomic object M, if there are memory_order_seq_cst fences X and
Y such that A is sequenced before X, Y is sequenced before B, and X
precedes Y in S, then B occurs later than A in the modification order
of M.
[/]

When you said "the read from x in thread 3 is A and the read from x in
thread 1 is B.", that is where you went wrong. You cannot apply
"29.3 / 7" to atomic reads A and B on an atomic object M, only atomic
writes (and atomic read-modify-write operations).

I'm still dealing with the implication that A precedes B in the total
order S of seq_cst operations does not imply that A happens-before B
nor A inter-thread happens before B nor A synchronizes-with B - and
that's irritating and slightly surprising. At least, I haven't yet
found a rule which states that, and I've looked decently hard, and
I've even found a note saying as much in 29.3 / 8.

I'm still looking at the standard and the code of the OP, if only for
my own sake. I'm leaning towards Masakuni Oishi's interpretation at
the moment. Have I mentioned non-seq_cst code is a bitch to reason
about?

As always, please correct me when I'm wrong.

Anthony Williams
Jun 15, 2011, 3:34:36 AM
Joshua Maurice <joshua...@gmail.com> writes:

I used "modification order" here because that is what is used in
29.3p7. I realise that this has caused confusion because reads are
indeed not included in the modification order.

> As I'm beginning to understand this, there is a typo in 29.3 / 7.

Agreed.

> I think it's meant to read:
> []
> For atomic **write (and atomic read-modify-write operations)** A and B
> on an atomic object M, if there are memory_order_seq_cst fences X and
> Y such that A is sequenced before X, Y is sequenced before B, and X
> precedes Y in S, then B occurs later than A in the modification order
> of M.
> [/]

When we drafted these words, I think we meant "operation". It is the
reference to "modification order" that is the typo. What is meant is
that if A is a write and B is a read then B must read the value written
by A, or something later in the modification order; if A is a write and
B is a write then B must be later than A in the MO; if A is a read and B
is a write then A must read a value prior to B in the MO, and if A is a
read and B is a read then B must read the same value as A or a value
later in the MO.
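
For concreteness, here is a sketch of the "A is a write and B is a read"
case under that reading (this case is essentially 29.3p6; the names are
illustrative only):

#include <atomic>

std::atomic<int> m(0);
int r;

void thread1()
{
    m.store(1, std::memory_order_relaxed);                // A (a write to M)
    std::atomic_thread_fence(std::memory_order_seq_cst);  // X
}

void thread2()
{
    std::atomic_thread_fence(std::memory_order_seq_cst);  // Y
    r = m.load(std::memory_order_relaxed);                // B (a read of M)
    // If X precedes Y in S, B must read the value written by A or
    // something later in the modification order of m, so r == 1 here.
}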

It is based on this understanding that I applied my reasoning. It is
of course possible that I have misremembered the discussions of the
concurrency working group.

After rereading this section of the standard several times, I'm no
longer sure. I'll have to check my notes.

> I'm still dealing with the implication that A precedes B in the total
> order S of seq_cst operations does not imply that A happens-before B
> nor A inter-thread happens before B nor A synchronizes-with B - and
> that's irritating and slightly surprising. At least, I haven't yet
> found a rule which states that, and I've looked decently hard, and
> I've even found a note saying as much in 29.3 / 8.

You're right there. Note that 29.3p4-7 explicitly apply to **fences**,
so plain memory_order_seq_cst operations do not impose any ordering on
other operations beyond what they would if they were
acquire/release/acq_rel as appropriate.

Masakuni Oishi
Jun 15, 2011, 4:53:37 AM

Me, too.

I agree.

> I'm still dealing with the implication that A precedes B in the total
> order S of seq_cst operations does not imply that A happens-before B
> nor A inter-thread happens before B nor A synchronizes-with B - and
> that's irritating and slightly surprising. At least, I haven't yet
> found a rule which states that, and I've looked decently hard, and
> I've even found a note saying as much in 29.3 / 8.

Even though the total order S does not imply a happens-before
relationship, S gives the visibility constraints that are
mentioned in n3290, 29.3p3-p7.

But, I think it is not enough.
In particular, the following constraint should be added
to n3290, 29.3:
/*****/
// Thread 1:
r1 = x.load(memory_order_relaxed);
atomic_thread_fence(memory_order_seq_cst);

// Thread 2:
atomic_thread_fence(memory_order_seq_cst);
r2 = x.load(memory_order_relaxed);
/*****/
in this code, if the fence in thread 1 precedes the fence
in thread 2 in S, and the value of r1 comes from a
modification A on x, then the load operation in thread 2
should not take the value of any modification that precedes A
in the modification order of x.

If there was a constraint like this, then we could prove that
r1 == 0 && r2 == 1 && r3 == 0 is not allowed in the first code.

-- masakuni

Chris M. Thomasson
Jun 15, 2011, 5:06:08 AM
"Anthony Williams" <antho...@gmail.com> wrote in message
news:87tybsm...@justsoftwaresolutions.co.uk...
[...]

> Masakuni Oishi <yam...@bsdhouse.org> writes:
[...]

FWIW, I think I have an "example" of why `memory_order_seq_cst' can be
important...:

http://groups.google.com/group/comp.programming.threads/browse_frm/thread/6cf4655c4e0b7c95

http://groups.google.com/group/comp.programming.threads/msg/50ced5527a964ca4

Any thoughts?


Masakuni Oishi
Jun 15, 2011, 5:11:46 AM
On Jun15, Anthony Williams <anthony....@gmail.com> wrote:
> if A is a write and B is a read then B must read the value written
> by A, or something later in the modification order; if A is a write and
> B is a write then B must be later than A in the MO; if A is a read and B
> is a write then A must read a value prior to B in the MO, and if A is a
> read and B is a read then B must read the same value as A or a value
> later in the MO.

I completely agree with those sentences themselves.
But I'm not sure 29.3p7 means that.

-- masakuni

Joshua Maurice
Jun 15, 2011, 5:57:08 AM
On Jun 15, 1:53 am, Masakuni Oishi <yam...@bsdhouse.org> wrote:
> On Jun 14, Joshua Maurice <joshuamaur...@gmail.com> wrote:
>
> > I'm still dealing with the implication that A precedes B in the total
> > order S of seq_cst operations does not imply that A happens-before B
> > nor A inter-thread happens before B nor A synchronizes-with B - and
> > that's irritating and slightly surprising. At least, I haven't yet
> > found a rule which states that, and I've looked decently hard, and
> > I've even found a note saying as much in 29.3 / 8.
>
> Even though the total order S does not imply happens-before
> relationship, S gives constraints of visibility which is
> mentioned in n3290, 29.3p3-p7.

Indeed. I've been reviewing this today, after your good correction.
Thank you again.

> But, I think it is not enough.
> In particular, the following constraint should be added
> to n3290, 29.3:
> /*****/
> // Thread 1:
> r1 = x.load(memory_order_relaxed);
> atomic_thread_fence(memory_order_seq_cst);
>
> // Thread 2:
> atomic_thread_fence(memory_order_seq_cst);
> r2 = x.load(memory_order_relaxed);
> /*****/
> in this code, if the fence in thread 1 precedes the fence
> in thread 2 in S, and the value of r1 comes from a
> modification A on x, then the load operation in thread 2
> should not take the value of the modification which precedes A
> in the modification order of x.
>
> If there was a constraint like this, then we could prove that
> r1 == 0 && r2 == 1 && r3 == 0 is not allowed in the first code.

I think so as well. I'm just wondering if this is an oversight, or a
purposeful decision because requiring this would negatively impact
real hardware. I don't even begin to claim to know enough to comment
on that.

Edek
Jun 15, 2011, 6:21:01 PM
On 06/14/2011 05:27 PM, Masakuni Oishi wrote:
> Edek<edek.pienkow...@gmail.com> wrote:
>> I'm tired and I haven't followed the C++11 discussion,
>> but are the fences relevant to the atomic operations
>> themselves? I mean, aren't they for enforcing regular
>> variables around atomics?
>
> The fences enforce ordering for not only regular variables
> but also atomic variables.
> For example:
>
> // Initially
> atomic<int> x(0), y(0);
>


> // Thread 1:
> y.store(1, memory_order_relaxed);   <--- this needn't be atomic, it just happens to be
> x.store(1, memory_order_release);   <--- this is like releasing a lock, so the y store must be done earlier
>
> // Thread 2:
> r1 = x.load(memory_order_acquire);  <--- this is like acquiring a lock, so the y load must be done later
> r2 = y.load(memory_order_relaxed);  <--- might not be atomic


>
>
> In this code, if r1 is 1, then r2 must be 1, not 0.
>
> -- masakuni

Yes, that is what I actually meant (sorry, I'm a plain novice in this
weird atomic stuff). See my above small comments.

But what I also meant is the following, simplifying things

> // Thread 2:
> x.store(1, memory_order_release);

This seems not to enforce anything, if there is no code above it
in thread 2. Does it enforce anything (apart from atomicity
of write)?

(Quite antique these days) quote, from n2427:

    memory_order_release
    Performs a release operation on the affected memory locations, thus
    making regular memory writes visible to other threads through the
    atomic variable to which it is applied.

This, however, is a sequence point (it interacts with thread 1's and
thread 3's fences):

> // Thread 2:
> x.store(1, memory_order_seq_cst); // (3)

Quote:
...in addition, has sequentially-consistent operation ordering.

So, if I understand this mail thread correctly, the question is whether,
in the code 1 case from your question, the change to x made by thread 2
- which, to my understanding, is not really ordered in any way in code 1 -
is visible to threads 1 and 3 in an order enforced by those
threads' (1 & 3) fences, or not necessarily (and y is just there to show
the constraints, if we can look at it from that aspect for a moment).

Is everything above right, or is it more subtle?

Edek


PS.
As a side note, I compiled all ops on x86_64, and atomic<int>::load
is mfence/load/mfence even for relaxed. I'd need some time to analyse
the thing from perspective of other ops implementations in all
orderings, and this leads me to the not so subtle question:

I understand people _should definitely_ use the default.
I assume the goal, or the official word for it is 'rationale', was
to let people 'who understand' write code that would be both portable
and more efficient than the default. I am somehow beginning to doubt
both my capabilities and whether this will be possible at all - I mean
the efficiency, not the portability (the latter can simply be done right).
Though maybe the bottom line is that a precise implementation
will have at least a couple of fences fewer than the default, at the
price of - no one argues about that, I think - complexity.

Joshua Maurice
Jun 15, 2011, 6:48:07 PM

I'm pretty sure no. A store memory_order_release vs
memory_order_relaxed differs only in that it can provide additional
guarantees w.r.t. other actions in the same thread as the store.

Technically, that isn't quite true. The call which creates and starts
a thread synchronizes-with the first memory action in the thread
(30.3.1.2 thread constructors / 5), and the last memory action in a
thread synchronizes-with any join calls on that thread from other
threads (30.3.1.5 thread members / 5). Thus, the release could lead to
additional guarantees because of this synchronizes-with relation with
the start and end of the thread. However, there's no such relevant
operations in this example, so in this example that single memory
operation in thread 2 might as well be memory_order_relaxed AFAIK.

Moreover, I think you can argue similarly for the rest of the
memory_order_acquires and memory_order_releases. They might as well be
memory_order_relaxed. (The fences do matter if they're seq_cst or
something else, of course.)
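
To illustrate that point with a small sketch (the names here are made up):
a release store only orders operations that come before it in the same
thread, so if there is nothing before it, it gives no more than a relaxed
store would.

#include <atomic>

std::atomic<int> data(0);
std::atomic<int> flag(0);

void publisher_with_payload()
{
    data.store(42, std::memory_order_relaxed); // a write before the release
    flag.store(1, std::memory_order_release);  // the release orders the write above it
}

void publisher_with_nothing_before()
{
    // No prior memory operations in this thread, so this release store
    // guarantees nothing beyond a relaxed store -- which is the situation
    // of thread 2 in code 1.
    flag.store(1, std::memory_order_release);
}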

> > // Thread 2:
> > x.store(1, memory_order_seq_cst);           // (3)
>
>     Quote:
>        ...in addition, has sequentially-consistent operation ordering.
>
> So, if I understand this mail thread correctly, the question is whether
> in the code 1 case from your question the change to x made by thread 2,
> but - to my understanding - not really ordered in any way in code 1,
> is visible by threads 1 and 3 in a specified order enforced by these
> threads' (1 & 3) fences, or not necessarily (and y is just to show the
> constraints - if we can look at it for a moment from this aspect).
>
> Is everything above right, or is it more subtle?

That's about it. The question is:

For seq_cst fences X and Y, for atomic reads A and B on atomic object M,
where
    A sees the value from an atomic write C, and
    B sees the value from an atomic write D (which may be the same as C),
if
    A is sequenced-before X, and
    X precedes Y in the total order S on seq_cst operations, and
    Y is sequenced-before B,
then
    can D precede C in the modification order of M?

I'm going to bring this to comp.lang.c++, and maybe comp.lang.c++
moderated and/or comp.std.c++.

> PS.
> As a side note, I compiled all ops on x86_64, and atomic<int>::load
> is mfence/load/mfence even for relaxed.

This is very wrong. At least, very counter to intent. What
implementation of C++0x threading are you using? Maybe it's a first
attempt implementation, and they're going to "fix" it later?

> I'd need some time to analyse
> the thing from perspective of other ops implementations in all
> orderings, and this leads me to the not so subtle question:
>
> I understand people _should definitely_ use the default.
> I assume the goal, or the official word for it is 'rationale', was
> to let people 'who understand' write code that would be both portable
> and more efficient than the default. I am somehow beginning to doubt
> both my capabilities and whether this will be at all possible - I mean
> the efficiency, not portability (the latter can be simply done right).
> Though maybe the bottom line is that at least the precise implementation
> will have at least a couple of fences less than the default, at the
> price of - I think no one argues about that - complexity.

This sounds very wrong.

Joshua Maurice
Jun 15, 2011, 6:58:04 PM
On Jun 15, 12:34 am, Anthony Williams <anthony....@gmail.com> wrote:

Somehow, I missed this post earlier. I just read this now. Thank you
for this cool discussion. If you could figure this out, it would be
most welcome. I'm about to post something to comp.lang.c++ now, and
I'll include a reference to this possible interpretation and this
thread.

Anthony Williams
Jun 16, 2011, 3:26:16 AM
Anthony Williams <antho...@gmail.com> writes:

> Joshua Maurice <joshua...@gmail.com> writes:
>
>>> 29.3p7 of N3290 says:
>>>
>>> "For atomic operations A and B on an atomic object M, if there are
>>> memory_order_seq_cst fences X and Y such that A is sequenced before X, Y
>>> is sequenced before B, and X precedes Y in S, then B occurs later than A
>>> in the modification order of M."
>>>

>> I think it's meant to read:
>> []
>> For atomic **write (and atomic read-modify-write operations)** A and B
>> on an atomic object M, if there are memory_order_seq_cst fences X and
>> Y such that A is sequenced before X, Y is sequenced before B, and X
>> precedes Y in S, then B occurs later than A in the modification order
>> of M.
>> [/]
>
> When we drafted this words, I think we meant "operation". It is the
> reference to "modification order" that is the typo. What is meant is
> that if A is a write and B is a read then B must read the value written
> by A, or something later in the modification order; if A is a write and
> B is a write then B must be later than A in the MO; if A is a read and B
> is a write then A must read a value prior to B in the MO, and if A is a
> read and B is a read then B must read the same value as A or a value
> later in the MO.

> After rereading this section of the standard several times, I'm no
> longer sure. I'll have to check my notes.

I've checked my notes. My recollection was wrong.

We *discussed* requiring seq_cst fences to order loads like this, but
decided to explicitly limit the orderings to writes. The approved
wording said "where A and B modify M", but this somehow got lost.

In particular, it was felt that requiring this would add complexity to
the IA-64 implementation of fences, and potentially require that relaxed
operations on that architecture would also have to be more expensive,
thus requiring people to pay for features they didn't need.

Anthony Williams
Jun 16, 2011, 3:35:59 AM
Edek <edek.pi...@gmail.com> writes:

> As a side note, I compiled all ops on x86_64, and atomic<int>::load
> is mfence/load/mfence even for relaxed. I'd need some time to analyse
> the thing from perspective of other ops implementations in all
> orderings, and this leads me to the not so subtle question:

Ouch. That is a seriously over-cautious implementation. Atomics should
not need any fences on x86_64 unless you actually write
atomic_thread_fence(memory_order_seq_cst)

See my blog entry for the required operations:

http://www.justsoftwaresolutions.co.uk/threading/intel-memory-ordering-and-c++-memory-model.html

The only one that is not a plain MOV is a seq_cst store, which is an
XCHG.
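
For anyone who wants to check their own compiler, a tiny translation unit
like the one below can be compiled with -S and inspected; the expected
instructions in the comments follow the mapping described above and are of
course not guaranteed by the standard:

#include <atomic>

std::atomic<int> x(0);

int load_relaxed()   { return x.load(std::memory_order_relaxed); } // expected: plain MOV
int load_seq_cst()   { return x.load(std::memory_order_seq_cst); } // expected: plain MOV
void store_release() { x.store(1, std::memory_order_release); }    // expected: plain MOV
void store_seq_cst() { x.store(1, std::memory_order_seq_cst); }    // expected: XCHG (or MOV+MFENCE)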

Alexander Terekhov
Jun 16, 2011, 6:03:23 AM

Edek wrote:
[...]

> As a side note, I compiled all ops on x86_64, and atomic<int>::load
> is mfence/load/mfence even for relaxed. I'd need some time to analyse
> the thing from perspective of other ops implementations in all
> orderings, and this leads me to the not so subtle question:
>
> I understand people _should definitely_ use the default.
> I assume the goal, or the official word for it is 'rationale', was
> to let people 'who understand' write code that would be both portable
> and more efficient than the default. I am somehow beginning to doubt
> both my capabilities and whether this will be at all possible - I mean
> the efficiency, not portability (the latter can be simply done right).
> Though maybe the bottom line is that at least the precise implementation
> will have at least a couple of fences less than the default, at the
> price of - I think no one argues about that - complexity.

Try

http://www.decadent.org.uk/pipermail/cpp-threads/2008-December/001933.html
(Brief tentative example x86 Implementation for C/C++ Memory Model)

and

http://portal.acm.org/citation.cfm?id=1926394
(Mathematizing C++ concurrency)

regards,
alexander.

Alexander Terekhov
Jun 16, 2011, 6:05:53 AM

Anthony Williams wrote:
>
> Edek <edek.pi...@gmail.com> writes:
>
> > As a side note, I compiled all ops on x86_64, and atomic<int>::load
> > is mfence/load/mfence even for relaxed. I'd need some time to analyse
> > the thing from perspective of other ops implementations in all
> > orderings, and this leads me to the not so subtle question:
>
> Ouch. That is a seriously over-cautious implementation. Atomics should
> not need any fences on x86_64 unless you actually write
> atomic_thread_fence(memory_order_seq_cst)
>
> See my blog entry for the required operations:
>
> http://www.justsoftwaresolutions.co.uk/threading/intel-memory-ordering-and-c++-memory-model.html
>
> The only one that is not a plain MOV is a seq_cst store, which is an
> XCHG.

I believe that a seq_cst load is not a plain move either...

regards,
alexander.

Anthony Williams
Jun 16, 2011, 6:19:00 AM
Alexander Terekhov <tere...@web.de> writes:

Why do you think it shouldn't be, and what do you think it should be instead?

The mapping in my blog post was confirmed by Intel processor engineers.

Alexander Terekhov
Jun 16, 2011, 7:57:14 AM

Anthony Williams wrote:
>
> Alexander Terekhov <tere...@web.de> writes:
>
> > Anthony Williams wrote:
> >>
> >> Edek <edek.pi...@gmail.com> writes:
> >>
> >> > As a side note, I compiled all ops on x86_64, and atomic<int>::load
> >> > is mfence/load/mfence even for relaxed. I'd need some time to analyse
> >> > the thing from perspective of other ops implementations in all
> >> > orderings, and this leads me to the not so subtle question:
> >>
> >> Ouch. That is a seriously over-cautious implementation. Atomics should
> >> not need any fences on x86_64 unless you actually write
> >> atomic_thread_fence(memory_order_seq_cst)
> >>
> >> See my blog entry for the required operations:
> >>
> >> http://www.justsoftwaresolutions.co.uk/threading/intel-memory-ordering-and-c++-memory-model.html
> >>
> >> The only one that is not a plain MOV is a seq_cst store, which is an
> >> XCHG.
> >
> > I believe that a seq_cst load is not a plain move either...
>
> Why do think it shouldn't be,

Akin to what Jeffrey Yasskin said on your blog:

http://www.justsoftwaresolutions.co.uk/threading/intel-memory-ordering-and-c++-memory-model.html

"Do you happen to have a link to a proof that a plain mov is all that's
needed for a sequentially-consistent load? I see from the memory model
documents that the xchg'es all happen in a total order, and that movs
won't be reordered across xchg, but I'm having trouble getting from
there to sequential consistency.

by Jeffrey Yasskin at 07:23:51 on Wednesday, 19 May 2010"

Do *you* happen to have a link to a proof?

> and what do you think it should be instead?

http://www.decadent.org.uk/pipermail/cpp-threads/2008-December/001933.html

"Load Seq_Cst: LOCK XADD(0) // alternative: MFENCE,MOV (from memory)"

http://portal.acm.org/citation.cfm?id=1926394

"Sequentially consistent atomics

The proposal above includes two implementations of sequentially
consistent atomic reads and writes; one with the x86 locked
instructions, and the other with fence instructions on both the reads
and writes. However, we can prove that it suffices either to place an
mfence before every sc read, or after every sc write, but that it is not
necessary to do both. In practice, placing the fence after the sc writes
is expected to yield higher performance."

>
> The mapping in my blog post was confirmed by Intel processor engineers.

Without a proof?

regards,
alexander.

Masakuni Oishi
Jun 16, 2011, 8:05:35 AM
Anthony Williams <anthony....@gmail.com> wrote:
> Anthony Williams <anthony....@gmail.com> writes:

Interesting.
If so, for the code 1 in my first post, is the outcome
r1 == 0 && r2 == 1 && r3 == 0 possible on IA-64?

In fact, I encountered this problem when I was implementing
a variant of a work-stealing deque (*).
And I'm dissatisfied with the result that code 2 works well
but code 1 doesn't.

(*) http://portal.acm.org/citation.cfm?doid=277651.277678

-- masakuni

Masakuni Oishi
Jun 16, 2011, 8:03:33 AM
On Jun 16, Joshua Maurice <joshuamaur...@gmail.com> wrote:
> On Jun 15, 3:21 pm, Edek <edek.pienkow...@gmail.com> wrote:
> > So, if I understand this mail thread correctly, the question is whether
> > in the code 1 case from your question the change to x made by thread 2,
> > but - to my understanding - not really ordered in any way in code 1,
> > is visible by threads 1 and 3 in a specified order enforced by these
> > threads' (1 & 3) fences, or not necessarily (and y is just to show the
> > constraints - if we can look at it for a moment from this aspect).
>
> > Is everything above right, or is it more subtle?
>
> That's about it. The question is:
>
> For seq_cst fences X and Y, for atomic reads A and B on atomic object
> M,
> where
>     A sees the value from an atomic write C, and
>     B sees the value from an atomic write D (which may be the same as
> C),
> if
>     A is sequenced-before X, and
>     X precedes Y in the total order S on seq_cst operations, and
>     Y is sequenced-before B,
> then
>     Can D precede C in the modification order of M?

Yes.
In this topic, my concern is the effect of the seq_cst fence.
So please don't worry about the difference between release/acquire and
relaxed.

-- masakuni

Anthony Williams
Jun 16, 2011, 8:11:59 AM
Masakuni Oishi <yam...@bsdhouse.org> writes:

> If so, for the code 1 in my first post, is the outcome
> r1 == 0 && r2 == 1 && r3 == 0 possible on IA-64?

I believe so.

Edek
Jun 16, 2011, 8:39:59 AM
On 06/16/2011 09:35 AM, Anthony Williams wrote:
> Edek<edek.pi...@gmail.com> writes:
>
>> As a side note, I compiled all ops on x86_64, and atomic<int>::load
>> is mfence/load/mfence even for relaxed. I'd need some time to analyse
>> the thing from perspective of other ops implementations in all
>> orderings, and this leads me to the not so subtle question:
>
> Ouch. That is a seriously over-cautious implementation. Atomics should
> not need any fences on x86_64 unless you actually write
> atomic_thread_fence(memory_order_seq_cst)
>
> See my blog entry for the required operations:
>
> http://www.justsoftwaresolutions.co.uk/threading/intel-memory-ordering-and-c++-memory-model.html
>
> The only one that is not a plain MOV is a seq_cst store, which is an
> XCHG.
>
> Anthony

From Joshua Maurice:
>> PS.
>> As a side note, I compiled all ops on x86_64, and atomic<int>::load
>> is mfence/load/mfence even for relaxed.
>
> This is very wrong. At least, very counter to intent. What
> implementation of C++0x threading are you using? Maybe it's a first
> attempt implementation, and they're going to "fix" it later?

Answering this:

it is certainly possible that this is a 'quick-and-dirty'
implementation. I wouldn't care now, given the status is experimental,
if it gives worse performance [1]. I am not fit to argue,
but maybe there are technical details why currently in this compiler
it is necessary [3].

I do care whether it is correct, and correct me if I'm wrong, but there is
nothing wrong with stricter ordering than asked for [2], or with looser
language lawyering done by the compiler at the moment.

Edek


[1] Performance: In all cases? In some cases, and not in others?

[2] Unless someone tries to say 'I had a proof: my algorithm did not
fail on compilers X and Y on platforms A,B,C. My proof was wrong,
because the compiler implemented a more strict ordering, and this
caused me to think my algorithm was right when it was not'

[3] what comes to my mind: missing alignment requirements, or something
of this kind, in the current state of the compiler.

Anthony Williams
Jun 16, 2011, 8:51:35 AM
Alexander Terekhov <tere...@web.de> writes:

LOCKed instructions have a total order (7.2.3.8 in my copy of the Intel
64 and IA-32 SDM Vol 3A), so if you use XCHG for stores (as per my blog
post) then all processors/threads must agree on the order of the writes.

This is the key to SC --- all SC stores must have a single total order
which is agreed across all threads.

Why do you think this is not enough?

Anthony Williams
Jun 16, 2011, 8:55:09 AM
Edek <edek.pi...@gmail.com> writes:

> I do care whether it is correct, and correct me if I'm wrong, there is
> nothing wrong with stricter ordering than asked for [2], and looser
> language lawyering done by the compiler at the moment.

Sticking additional fences in will not make the implementation
incorrect, just slower.

Alexander Terekhov
Jun 16, 2011, 10:53:04 AM

Anthony Williams wrote:
[...]

> LOCKed instructions have a total order (7.2.3.8 in my copy of the Intel
> 64 and IA-32 SDM Vol 3A), so if you use XCHG for stores (as per my blog
> post) then all processors/threads must agree on the order of the writes.
>
> This is the key to SC --- all SC stores must have a single total order
> which is agreed across all threads.
>
> Why do you think this is not enough?

In

x.store(44, memory_order_relaxed); // or _release
r1 = y.load(memory_order_seq_cst);

there should be a store-load reordering constraint, just like in

x.store(44, memory_order_seq_cst);
r1 = y.load(memory_order_relaxed); // or _acquire/_consume

Implementation 1:

Load Seq_Cst: LOCK XADD(0)
Store Seq_Cst: LOCK XCHG

Implementation 2:

Load Seq_Cst: LOCK XADD(0)
Store Seq_Cst: MOV, MFENCE

Implementation 3:

Load Seq_Cst: MFENCE, MOV
Store Seq_Cst: LOCK XCHG

Incorrect implementation

Load Seq_Cst: MFENCE, MOV
Store Seq_Cst: MOV, MFENCE

is busted because IRIW won't work.
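
For reference, IRIW (independent reads of independent writes) is the
litmus test sketched below, with illustrative names; the outcome in the
assert is the one that would mean the two readers disagree on the order
of the two independent stores:

#include <atomic>
#include <cassert>
#include <thread>

std::atomic<int> x(0), y(0);
int r1, r2, r3, r4;

void writer1() { x.store(1, std::memory_order_seq_cst); }
void writer2() { y.store(1, std::memory_order_seq_cst); }

void reader1()
{
    r1 = x.load(std::memory_order_seq_cst);
    r2 = y.load(std::memory_order_seq_cst);
}

void reader2()
{
    r3 = y.load(std::memory_order_seq_cst);
    r4 = x.load(std::memory_order_seq_cst);
}

int main()
{
    std::thread a(writer1), b(writer2), c(reader1), d(reader2);
    a.join(); b.join(); c.join(); d.join();
    // r1 == 1 && r2 == 0 means reader 1 saw x's store before y's store;
    // r3 == 1 && r4 == 0 means reader 2 saw them in the opposite order.
    // With seq_cst everywhere that combination is forbidden (a single run
    // proves nothing, of course; the assert just documents the outcome).
    assert(!(r1 == 1 && r2 == 0 && r3 == 1 && r4 == 0));
}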

regards,
alexander.

Anthony Williams
Jun 16, 2011, 11:17:20 AM
Alexander Terekhov <tere...@web.de> writes:

> Anthony Williams wrote:
> [...]
>> LOCKed instructions have a total order (7.2.3.8 in my copy of the Intel
>> 64 and IA-32 SDM Vol 3A), so if you use XCHG for stores (as per my blog
>> post) then all processors/threads must agree on the order of the writes.
>>
>> This is the key to SC --- all SC stores must have a single total order
>> which is agreed across all threads.
>>
>> Why do you think this is not enough?
>

> x.store(44, memory_order_relaxed); // or _release
> r1 = y.load(memory_order_seq_cst);
>
> there should be a store-load reordering constraint, just like in
>
> x.store(44, memory_order_seq_cst);
> r1 = y.load(memory_order_relaxed); // or _acquire/_consume

Why? Can you give an example that would be broken by the lack of such a
constraint, and a reference to the FDIS that guarantees the presence of
that constraint?

Alexander Terekhov
Jun 16, 2011, 11:25:58 AM
Addition re "Incorrect implementation" below:

That is my thinking... but according to

http://portal.acm.org/citation.cfm?id=1926394

the "Incorrect implementation" above is actually correct (probably
because according to them, MFENCE has the same semantics as PPC's
hwsync).

(I'm personally still in doubt).

regards,
alexander.

Alexander Terekhov
Jun 16, 2011, 11:37:38 AM

Anthony Williams wrote:
>
> Alexander Terekhov <tere...@web.de> writes:
>
> > Anthony Williams wrote:
> > [...]
> >> LOCKed instructions have a total order (7.2.3.8 in my copy of the Intel
> >> 64 and IA-32 SDM Vol 3A), so if you use XCHG for stores (as per my blog
> >> post) then all processors/threads must agree on the order of the writes.
> >>
> >> This is the key to SC --- all SC stores must have a single total order
> >> which is agreed across all threads.
> >>
> >> Why do you think this is not enough?
> >
> > x.store(44, memory_order_relaxed); // or _release
> > r1 = y.load(memory_order_seq_cst);
> >
> > there should be a store-load reordering constraint, just like in
> >
> > x.store(44, memory_order_seq_cst);
> > r1 = y.load(memory_order_relaxed); // or _acquire/_consume
>
> Why?

Because the above is the same as

x.store(44, memory_order_relaxed); // or _release
atomic_thread_fence(memory_order_seq_cst);
r1 = y.load(memory_order_relaxed); // or _acquire/_consume

> Can you give an example that would be broken by the lack of such a
> constraint,

Any code written with the expectation that

x.store(44, memory_order_relaxed); // or _release
r1 = y.load(memory_order_seq_cst);

and/or

x.store(44, memory_order_seq_cst);
r1 = y.load(memory_order_relaxed); // or _acquire/_consume

is the same as

x.store(44, memory_order_relaxed); // or _release
atomic_thread_fence(memory_order_seq_cst);
r1 = y.load(memory_order_relaxed); // or _acquire/_consume

> and a reference to the FDIS that guarantees the presence of
> that constraint?

Do you have a reference to FDIS? (*&%*(&^$%&*(^_*(&+)(&*)&^978

http://groups.google.com/groups?threadm=4DF1EC53...@web.de

regards,
alexander.

Chris M. Thomasson
Jun 16, 2011, 11:47:08 AM
"Alexander Terekhov" <tere...@web.de> wrote in message
news:4DFA2342...@web.de...

I cannot think of exactly why the code above would not behave the same as
the code below:


> is the same as
>
> x.store(44, memory_order_relaxed); // or _release
> atomic_thread_fence(memory_order_seq_cst);
> r1 = y.load(memory_order_relaxed); // or _acquire/_consume

Humm...

[...]


Chris M. Thomasson
Jun 16, 2011, 11:51:49 AM
"Anthony Williams" <antho...@gmail.com> wrote in message
news:87oc1xa...@justsoftwaresolutions.co.uk...

> Alexander Terekhov <tere...@web.de> writes:
>
>> Anthony Williams wrote:
>> [...]
>>> LOCKed instructions have a total order (7.2.3.8 in my copy of the Intel
>>> 64 and IA-32 SDM Vol 3A), so if you use XCHG for stores (as per my blog
>>> post) then all processors/threads must agree on the order of the writes.
>>>
>>> This is the key to SC --- all SC stores must have a single total order
>>> which is agreed across all threads.
>>>
>>> Why do you think this is not enough?
>>
>> x.store(44, memory_order_relaxed); // or _release
>> r1 = y.load(memory_order_seq_cst);
>>
>> there should be a store-load reordering constraint, just like in
>>
>> x.store(44, memory_order_seq_cst);
>> r1 = y.load(memory_order_relaxed); // or _acquire/_consume
>
> Why? Can you give an example that would be broken by the lack of such a
> constraint, and a reference to the FDIS that guarantees the presence of
> that constraint?

I think I can see this breaking an SMR implementation in which the author
thinks that the seq_cst on the load will prevent it from being hoisted up
above the preceding store. I am having some trouble trying to understanding
why:

x.store(44, memory_order_relaxed); // or _release
r1 = y.load(memory_order_seq_cst);

would not ensure that the store will be rendered visible before the load is
performed... What am I missing here?
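
For context, the SMR (hazard pointer) acquire step being alluded to looks
roughly like the sketch below (the names are made up; this is not any
particular implementation). Its correctness hinges on exactly this
store-load ordering, and if, as discussed in this thread, a seq_cst load
does not order the preceding relaxed store, then a fence has to provide
the ordering instead:

#include <atomic>

struct Node;
std::atomic<Node*> g_shared(nullptr);  // pointer being protected
std::atomic<Node*> g_hazard(nullptr);  // this thread's hazard pointer slot

Node* acquire_hazard()
{
    Node* p;
    do {
        p = g_shared.load(std::memory_order_relaxed);
        g_hazard.store(p, std::memory_order_relaxed);
        // The store to g_hazard must be visible before the re-load of
        // g_shared below; the seq_cst fence is what provides that
        // store-load ordering here.
        std::atomic_thread_fence(std::memory_order_seq_cst);
    } while (p != g_shared.load(std::memory_order_acquire));
    return p;
}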

Anthony Williams
Jun 16, 2011, 11:54:54 AM
Alexander Terekhov <tere...@web.de> writes:

> Anthony Williams wrote:
>>
>> Alexander Terekhov <tere...@web.de> writes:
>>
>> > Anthony Williams wrote:
>> > [...]
>> >> LOCKed instructions have a total order (7.2.3.8 in my copy of the Intel
>> >> 64 and IA-32 SDM Vol 3A), so if you use XCHG for stores (as per my blog
>> >> post) then all processors/threads must agree on the order of the writes.
>> >>
>> >> This is the key to SC --- all SC stores must have a single total order
>> >> which is agreed across all threads.
>> >>
>> >> Why do you think this is not enough?
>> >
>> > x.store(44, memory_order_relaxed); // or _release
>> > r1 = y.load(memory_order_seq_cst);
>> >
>> > there should be a store-load reordering constraint, just like in
>> >
>> > x.store(44, memory_order_seq_cst);
>> > r1 = y.load(memory_order_relaxed); // or _acquire/_consume
>>
>> Why?
>
> Because the above is the same as
>
> x.store(44, memory_order_relaxed); // or _release
> atomic_thread_fence(memory_order_seq_cst);
> r1 = y.load(memory_order_relaxed); // or _acquire/_consume

No, it isn't. SC fences have very specific rules. This thread started by
discussing these rules, and how they don't necessarily do what you might
expect. Also, an SC load following a relaxed store does not impose any
ordering constraints on that store, and neither does an SC store
preceding a relaxed load impose any constraints on the load. All 3 bits
of code above are thus different.

For reference, 29.3p8 says:

[ Note: memory_order_seq_cst ensures sequential consistency only for a
program that is free of data races and uses exclusively
memory_order_seq_cst operations. Any use of weaker ordering will
invalidate this guarantee unless extreme care is used. In particular,
memory_order_seq_cst fences ensure a total order only for the fences
themselves. Fences cannot, in general, be used to restore sequential
consistency for atomic operations with weaker ordering specifications. —
end note ]

i.e. you cannot replace an SC operation with a relaxed op and a fence and
expect it to behave the same.

>> Can you give an example that would be broken by the lack of such a
>> constraint,
>
> Any code written with the expectation that
>
> x.store(44, memory_order_relaxed); // or _release
> r1 = y.load(memory_order_seq_cst);
>
> is the same as
>
> x.store(44, memory_order_relaxed); // or _release
> atomic_thread_fence(memory_order_seq_cst);
> r1 = y.load(memory_order_relaxed); // or _acquire/_consume

Can you give a specific example of such code?

>> and a reference to the FDIS that guarantees the presence of
>> that constraint?
>
> Do you have a reference to FDIS? (*&%*(&^$%&*(^_*(&+)(&*)&^978
>
> http://groups.google.com/groups?threadm=4DF1EC53...@web.de

:-)

If you can cite N3242, that will do in this case. The only changes to
the concurrency section were those from N3278, and I don't think they
materially affect this discussion.

http://www.open-std.org/JTC1/SC22/WG21/docs/papers/2011/n3242.pdf
http://www.open-std.org/JTC1/SC22/WG21/docs/papers/2011/n3278.htm

Anthony Williams
Jun 16, 2011, 11:56:27 AM
Alexander Terekhov <tere...@web.de> writes:

>> Incorrect implementation
>>
>> Load Seq_Cst: MFENCE, MOV
>> Store Seq_Cst: MOV, MFENCE
>>
>> is busted because IRIW won't work.
>
> That is my thinking... but according to
>
> http://portal.acm.org/citation.cfm?id=1926394
>
> the "Incorrect implementation" above is actually correct (probably
> because according to them, MFENCE has the same semantics as PPC's
> hwsync).

That paper also says:

"Sequentially consistent atomics  The proposal above includes
two implementations of sequentially consistent atomic reads and
writes; one with the x86 locked instructions, and the other with
fence instructions on both the reads and writes. However, we can
prove that it suffices either to place an mfence before every sc read,
or after every sc write, but that it is not necessary to do both. In
practice, placing the fence after the sc writes is expected to yield
higher performance."

i.e. they think that

Load SC: MOV
Store SC: MOV, MFENCE

is a valid implementation.

Chris M. Thomasson
Jun 16, 2011, 12:01:34 PM
"Anthony Williams" <antho...@gmail.com> wrote in message
news:87k4cla...@justsoftwaresolutions.co.uk...

> Alexander Terekhov <tere...@web.de> writes:
>
>> Anthony Williams wrote:
[....]

>>> Can you give an example that would be broken by the lack of such a
>>> constraint,
>>
>> Any code written with the expectation that
>>
>> x.store(44, memory_order_relaxed); // or _release
>> r1 = y.load(memory_order_seq_cst);
>>
>> is the same as
>>
>> x.store(44, memory_order_relaxed); // or _release
>> atomic_thread_fence(memory_order_seq_cst);
>> r1 = y.load(memory_order_relaxed); // or _acquire/_consume
>
> Can you give a specific example of such code?

Perhaps in an SMR implementation. The author would be screwed if he/she
thinks that the seq_cst load to `y' means that it will not be reordered with
the preceding store to `x'.


Anthony Williams
Jun 16, 2011, 12:07:28 PM

If you rely on things that aren't true, then you end up in a mess ;-)

I'd better expand my question:

Can you give an example of such code, along with reasoning based on the
C++0x memory model to show that it is not broken?

Anthony Williams
Jun 16, 2011, 12:23:56 PM
"Chris M. Thomasson" <cri...@charter.net> writes:

Where is the guarantee that the store *is* rendered visible? Unless
there is a specific guarantee given in the C++0x standard, then there is
no such guarantee. An SC load does not add any additional ordering
constraints wrt non-SC operations over and above a load-acquire.
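
A concrete litmus for that last point is the classic store-buffering
example with relaxed stores and seq_cst loads (a sketch, with names chosen
to mirror code 1): under this reading, r1 == 0 && r2 == 0 is allowed,
because the seq_cst loads order nothing with respect to the preceding
relaxed stores.

#include <atomic>

std::atomic<int> x(0), y(0);
int r1, r2;

void thread1()
{
    x.store(1, std::memory_order_relaxed);
    r1 = y.load(std::memory_order_seq_cst);
}

void thread2()
{
    y.store(1, std::memory_order_relaxed);
    r2 = x.load(std::memory_order_seq_cst);
}

// r1 == 0 && r2 == 0 is allowed: the two seq_cst loads are ordered with
// respect to each other in the total order S, but that constrains nothing
// about the relaxed stores that precede them.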

Alexander Terekhov
Jun 16, 2011, 12:32:45 PM

Anthony Williams wrote:
[...]

> For reference, 29.3p8 says:
>
> [ Note: memory_order_seq_cst ensures sequential consistency only for a
> program that is free of data races and uses exclusively
> memory_order_seq_cst operations. Any use of weaker ordering will
> invalidate this guarantee unless extreme care is used. In particular,

This is appalling.

regards,
alexander.

Alexander Terekhov
Jun 16, 2011, 12:49:19 PM

Anthony Williams wrote:
[...]

> no such guarantee. An SC load does not add any additional ordering
> constraints wrt non-SC operations over and above a load-acquire.

Then call it
memory_order_seq_cst_as_long_as_you_use_seq_cst_atomics_only and get rid
of it being default.

regards,
alexander.

Chris M. Thomasson
Jun 16, 2011, 4:59:03 PM
"Anthony Williams" <antho...@gmail.com> wrote in message
news:87boxxa...@justsoftwaresolutions.co.uk...

> "Chris M. Thomasson" <cri...@charter.net> writes:
>
>> "Anthony Williams" <antho...@gmail.com> wrote in message
>> news:87k4cla...@justsoftwaresolutions.co.uk...
>>> Alexander Terekhov <tere...@web.de> writes:
>>>
>>>> Anthony Williams wrote:
>> [....]
>>>>> Can you give an example that would be broken by the lack of such a
>>>>> constraint,
>>>>
>>>> Any code written with the expectation that
>>>>
>>>> x.store(44, memory_order_relaxed); // or _release
>>>> r1 = y.load(memory_order_seq_cst);
>>>>
>>>> is the same as
>>>>
>>>> x.store(44, memory_order_relaxed); // or _release
>>>> atomic_thread_fence(memory_order_seq_cst);
>>>> r1 = y.load(memory_order_relaxed); // or _acquire/_consume
>>>
>>> Can you give a specific example of such code?
>>
>> Perhaps in an SMR implementation. The author would be screwed if he/she
>> thinks that the seq_cst load to `y' means that it will not be reordered
>> with
>> the preceding store to `x'.
>
> If you rely on things that aren't true, then you end up in a mess ;-)

Touché! :^)


> I'd better expand my question:
>
> Can you give an example of such code, along with reasoning based on the
> C++0x memory model to show that it is not broken?

No. Quite frankly, it makes me want to use `std::memory_order_relaxed'
everywhere; ordering is provided by `std::atomic_thread_fence(...)'.


This better fuc%ing work!
_________________________________________________________
static std::atomic<int> g_var_1(0);
static std::atomic<int> g_var_2(0);
static std::atomic<int> g_sync(0);


void only_single_producer_in_process_lifetime()
{
g_var_1.store(1, std::memory_order_relaxed);
g_var_2.store(3, std::memory_order_relaxed);
std::atomic_thread_fence(std::memory_order_seq_cst);
g_sync.store(1, std::memory_order_relaxed);
}


void many_consumers_in_process_lifetime()
{
while (! g_sync.load(std::memory_order_relaxed)) BACKOFF();
std::atomic_thread_fence(std::memory_order_seq_cst);
int var_1 = g_var_1.load(std::memory_order_relaxed);
int var_2 = g_var_2.load(std::memory_order_relaxed);
ASSERT((var_1 + var_2) == 4);
}
_________________________________________________________


:^O


Joshua Maurice

unread,
Jun 16, 2011, 5:32:19 PM6/16/11
to
On Jun 16, 1:59 pm, "Chris M. Thomasson" <cris...@charter.net> wrote:
> "Anthony Williams" <anthony....@gmail.com> wrote in message
>
> news:87boxxa...@justsoftwaresolutions.co.uk...
>
> > "Chris M. Thomasson" <cris...@charter.net> writes:
>
> >> "Anthony Williams" <anthony....@gmail.com> wrote in message
> >>news:87k4cla...@justsoftwaresolutions.co.uk...

AFAIK, the fences don't need to be seq_cst. This all works out with
just acquire and release fences.

29.8/2- For release fence A, acquire fence B, atomic write X on M,
atomic read Y on M:
If (A <sequence X) and (Y <sequence B) and (Y sees release-
seq(X)), then (A <sync B).

The producer's fence is sequenced before g_sync.store, and
g_sync.load is sequenced before the consumer's fence, and
(eventually) g_sync.load sees a value from a write in the release
sequence of g_sync.store.
Therefore the producer's fence synchronizes-with the consumer's fence
(29.8/2).
Therefore with repeated application of 1.10/11 and 1.10/12, we can
conclude that
g_var_1.store happens-before g_var_1.load, and
g_var_2.store happens-before g_var_2.load.
Therefore with 1.10/18 we can conclude that
g_var_1.load sees g_var_1.store or a later write in the
modification order, and
g_var_2.load sees g_var_2.store or a later write in the
modification order.
There are no writes later than g_var_1.store in its modification
order, and same for g_var_2, and thus g_var_1.load sees g_var_1.store,
and g_var_2.load sees g_var_2.store. And thus the assert will not
fire.

Btw, the rule 29.8/2 is what I was taught as the meaning of "read" and
"write" barriers according to whatever Linux kernel documentation is
floating around on google. That still works, but of course there's a
lot more rules than can be applied on top of that now.

----

I was thinking we might be able to do something with:

For atomic operations A, B on atomic object M. For seq_cst fences X
and Y.
29.3/6- For read B, write A:
If (A <sequence X) and (X <S Y) and (Y <sequence B), then (B sees
mod-M[A, ...)).

But I haven't thought hard enough about it. We'd need to prove that
the producer's fence precedes the consumer's fence in the total order
S, and I'm not sure we could do that in this case.

----

This is another reason that I'm feeling a little miffed about seq_cst.
Sometimes seq_cst fences don't need seq_cst reads and writes to be
more than acq_rel fences. Sometimes they do need seq_cst reads and
writes to be more than acq_rel fences.

Here's what I've been able to parse on the subject of seq_cst fences.

For atomic operations A, B on atomic object M. For seq_cst fences X
and Y.
29.3/4- For read B, seq_cst write A:
If (A <S X) and (X <sequence B), then (B sees mod-M[A, ...)).
29.3/5- For seq_cst read B, write A:
If (A <sequence X) and (X <S B), then (B sees mod-M[A, ...)).
29.3/6- For read B, write A:
If (A <sequence X) and (X <S Y) and (Y <sequence B), then (B sees
mod-M[A, ...)).
29.3/7- For writes A and B:
If (A <sequence X) and (X <S Y) and (Y <sequence B), then (A <mod-
M B).

The first concerns a relaxed (or better) read and a seq_cst write.
The second concerns a seq_cst read and a relaxed (or better) write.
The third concerns a relaxed (or better) read and a relaxed (or
better) write.
The fourth concerns two relaxed (or better) writes.

seq_cst fences seem to give no guarantees for two non-seq_cst reads.
At least, seq_cst fences give no additional guarantees w.r.t. two non-
seq_cst reads above and beyond the guarantees of acq_rel fences. At
least, that's what I'm reading now. Considering I only really started
looking at this the other day, take this with great caution, please.

Chris M. Thomasson

unread,
Jun 16, 2011, 6:42:40 PM6/16/11
to
"Alexander Terekhov" <tere...@web.de> wrote in message
news:4DFA302D...@web.de...

I really do like the SPARC `MEMBAR' instruction!


Anthony Williams

unread,
Jun 17, 2011, 3:03:51 AM6/17/11
to
Joshua Maurice <joshua...@gmail.com> writes:

> On Jun 16, 1:59 pm, "Chris M. Thomasson" <cris...@charter.net> wrote:
>> "Anthony Williams" <anthony....@gmail.com> wrote in message
>>
>> news:87boxxa...@justsoftwaresolutions.co.uk...

>> > Can you give an example of such code, along with reasoning based on the
>> > C++0x memory model to show that it is not broken?
>>
>> No. Quite frankly, it makes me want to use `std::memory_order_relaxed'
>> everywhere; ordering is provided by `std::atomic_thread_fence(...)'.
>>
>> This better fuc%ing work!
>> _________________________________________________________
>> static std::atomic<int> g_var_1(0);
>> static std::atomic<int> g_var_2(0);
>> static std::atomic<int> g_sync(0);
>>
>> void only_single_producer_in_process_lifetime()
>> {
>>     g_var_1.store(1, std::memory_order_relaxed);
>>     g_var_2.store(3, std::memory_order_relaxed);
>>     std::atomic_thread_fence(std::memory_order_seq_cst);
>>     g_sync.store(1, std::memory_order_relaxed);
>>
>> }
>>
>> void many_consumers_in_process_lifetime()
>> {
>>     while (! g_sync.load(std::memory_order_relaxed)) BACKOFF();
>>     std::atomic_thread_fence(std::memory_order_seq_cst);
>>     int var_1 = g_var_1.load(std::memory_order_relaxed);
>>     int var_2 = g_var_2.load(std::memory_order_relaxed);
>>     ASSERT((var_1 + var_2) == 4);
>> }
>
> AFAIK, the fences don't need to be seq_cst. This all works out with
> just acquire and release fences.

Agreed.

> This is another reason that I'm feeling a little miffed about seq_cst.
> Sometimes seq_cst fences don't need seq_cst reads and writes to be
> more than acq_rel fences. Sometimes they do need seq_cst reads and
> writes to be more than acq_rel fences.

True.

> seq_cst fences seem to give no guarantees for two non-seq_cst reads.
> At least, seq_cst fences gives no additional guarantees w.r.t. two non-
> seq_cst reads above and beyond the guarantees of acq_rel fences. At
> least, that's what I'm reading now. Considering I only really started
> looking at this the other day, take with great warning please.

You are right, and this is precisely the issue that was the start of
this thread. The concurrency group discussed it. Some of us thought that
SC fences should order relaxed reads, but the final result was that they
don't, as it was too expensive on some architectures (such as IA-64).

Alexander Terekhov

unread,
Jun 17, 2011, 10:19:03 AM6/17/11
to

Anthony Williams wrote:
[...]

> SC fences should order relaxed reads, but the final result was that they
> don't, as it was too expensive on some architectures (such as IA-64).

Care to elaborate? What is the problem regarding IA-64? Refs:

http://www.decadent.org.uk/pipermail/cpp-threads/2008-December/001932.html
http://www.decadent.org.uk/pipermail/cpp-threads/2008-December/001942.html
http://www.decadent.org.uk/pipermail/cpp-threads/2008-December/001954.html

regards,
alexander.

Anthony Williams

unread,
Jun 17, 2011, 10:40:30 AM6/17/11
to
Alexander Terekhov <tere...@web.de> writes:

> Anthony Williams wrote:
> [...]
>> SC fences should order relaxed reads, but the final result was that they
>> don't, as it was too expensive on some architectures (such as IA-64).
>
> Care to elaborate? What is the problem regarding IA-64?

I don't remember, and my notes don't say. I think there was some
discussion that relaxed loads would have to be more than just a plain LD
instruction if SC fences were to order them, but I really don't remember
the details, and wasn't privy to the entire discussion anyway.

Chris M. Thomasson

unread,
Jun 17, 2011, 6:24:04 PM6/17/11
to
"Anthony Williams" <antho...@gmail.com> wrote in message
news:871uyt3...@justsoftwaresolutions.co.uk...

> Joshua Maurice <joshua...@gmail.com> writes:
>
>> On Jun 16, 1:59 pm, "Chris M. Thomasson" <cris...@charter.net> wrote:
>>> "Anthony Williams" <anthony....@gmail.com> wrote in message
>>>
>>> news:87boxxa...@justsoftwaresolutions.co.uk...
>>> > Can you give an example of such code, along with reasoning based on
>>> > the
>>> > C++0x memory model to show that it is not broken?
>>>
>>> No. Quite frankly, it makes me want to use `std::memory_order_relaxed'
>>> everywhere; ordering is provided by `std::atomic_thread_fence(...)'.
>>>
>>> This better fuc%ing work!
>>> _________________________________________________________

[...]

>> AFAIK, the fences don't need to be seq_cst. This all works out with
>> just acquire and release fences.
>
> Agreed.

Here is an example of an algorithm that requires a `StoreLoad' membar (or
something else that guarantees such ordering):

http://groups.google.com/group/comp.programming.threads/msg/50ced5527a964ca4


I need to ensure that the atomic stores to `waitset::m_waitbit' in
`waitset::wait_begin()' and `waitset::wait_try_begin()' are sequentially
consistent with the loads that are contained within `waitset::signal()'
and `waitset::broadcast()'.


Basically:
__________________________________________________
- I cannot let anything in the "user algorithm" sink below the seq_cst
barrier in `waitset::signal()' and `waitset::broadcast()'.


- I cannot let anything in the "waitset algorithm" rise above the seq_cst
barrier in `waitset::signal()' and `waitset::broadcast()'.


- I cannot let anything in the "user algorithm" rise above the seq_cst
barrier in `waitset::wait_begin()' and `waitset::wait_try_begin()'.
__________________________________________________


I don't think that an acq_rel barrier is enough for this, because it
does not seem to be required to generate a `StoreLoad' barrier on a
SPARC...
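To illustrate the shape of it, here is a minimal Dekker-style sketch
(NOT the actual waitset code; the names are made up) of the
store-then-load exchange that needs the `StoreLoad' ordering:
_________________________________________________________
#include <atomic>

static std::atomic<int> g_waitbit(0); // "somebody may be waiting"
static std::atomic<int> g_state(0);   // the user-visible predicate

void wait_begin()  // waiter side
{
    g_waitbit.store(1, std::memory_order_relaxed);
    // StoreLoad: the wait bit must be visible before the predicate is
    // re-checked, otherwise a signal can be missed and the waiter can
    // block forever.
    std::atomic_thread_fence(std::memory_order_seq_cst);
    if (g_state.load(std::memory_order_relaxed))
    {
        // predicate already true; skip blocking
    }
}

void signal()  // signaller side
{
    g_state.store(1, std::memory_order_relaxed);
    // StoreLoad: the predicate must be visible before the wait bit is
    // checked, otherwise a needed wakeup can be skipped.
    std::atomic_thread_fence(std::memory_order_seq_cst);
    if (g_waitbit.load(std::memory_order_relaxed))
    {
        // wake the waiter
    }
}
_________________________________________________________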


Humm... Also, I don't think I need the seq_cst barrier in
`waitset::wait_commit()'...


FWIW, here is Relacy source code:

http://pastebin.com/q9R0sqT7

[...]


Joshua Maurice

unread,
Jun 18, 2011, 6:49:58 AM6/18/11
to
On Jun 17, 3:24 pm, "Chris M. Thomasson" <cris...@charter.net> wrote:
>
> FWIW, here is Relacy source code:
>
[...]

You know, I really ought to look at this tool. All of the descriptions
of it I'm seeing talk about "interleavings", which could explain why
it missed the allowed execution which breaks the OP's code. That
allowed execution is not a sequentially consistent one. It cannot be
expressed as a simple interleaving of instructions.

It took me the better part of 2 days of all of my spare time, but I
finally managed to do it. I wrote my own little thread tool, based on
the C++0x rules as best I can understand them. (Thanks again Anthony!)
I had to throw away a day's work as I was taking an approach that
would result in too big of a space to be searched, but I managed to
make the search space much smaller / the search algorithm much
better.

When given the OP's code, it finds the allowed execution in a second
or so, printing the allowed execution.

For funzies, when I ran it against the two code pieces at:
http://www.justsoftwaresolutions.co.uk/threading/petersons_lock_with_C++0x_atomics.html
It found the allowed execution of Bartosz's implementation that allows
both threads to get the lock in about a second, and it informs me in
about a second that in Dmitriy's implementation there is no allowed
execution which has both threads getting the lock. It's really
fascinating watching how you can change any one little piece, like
making an exchange just a write, and seeing the algorithm fail and an
allowed execution printed to the screen.

I want to thank you all for this thread again, and bearing with me. I
feel as though I have a much better handle on all things C++0x
threading related.

Alexander Terekhov

unread,
Jun 18, 2011, 8:06:45 AM6/18/11
to

"Chris M. Thomasson" wrote:
[...]

> I don't think that acq_rel barrier is enough for this because it does not
> seem as if it's required to generate a `StoreLoad' barrier on a SPARC...

I agree that acq_rel fence does not provide `StoreLoad' on SPARC.

http://www.decadent.org.uk/pipermail/cpp-threads/2008-December/001949.html

regards,
alexander.

Alexander Terekhov

unread,
Jun 18, 2011, 8:28:26 AM6/18/11
to

Anthony Williams wrote:
>
> Alexander Terekhov <tere...@web.de> writes:
>
> >> Incorrect implementation
> >>
> >> Load Seq_Cst: MFENCE, MOV
> >> Store Seq_Cst: MOV, MFENCE
> >>
> >> is busted because IRIW won't work.
> >
> > That is my thinking... but according to
> >
> > http://portal.acm.org/citation.cfm?id=1926394
> >
> > the "Incorrect implementation" above is actually correct (probably
> > because according to them, MFENCE has the same semantics as PPC's
> > hwsync).
>
> That paper also says:
>
> "Sequentially consistent atomics The proposal above includes
> two implementations of sequentially consistent atomic reads and
> writes; one with the x86 locked instructions, and the other with
> fence instructions on both the reads and writes. However, we can
> prove that it suffices either to place an mfence before every sc read,
> or after every sc write, but that it is not necessary to do both. In
> practice, placing the fence after the sc writes is expected to yield
> higher performance."
>
> i.e. they think that
>
> Load SC: MOV
> Store SC: MOV, MFENCE
>
> is a valid implementation.

They are talking about X86-TSO (not actual X86), which is equivalent to
SPARC-TSO (with #StoreLoad spelled as MFENCE), see

http://www.decadent.org.uk/pipermail/cpp-threads/2008-December/001947.html

(obviously, it was meant that two consecutive #StoreLoad/MFENCE
barriers, like in "store x, load y" both seq_cst, ought to be reduced to
just one by the compiler.)
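For example (assuming the "Store SC: MOV, MFENCE" and "Load SC: MFENCE,
MOV" mappings quoted above; purely an illustration):

#include <atomic>

std::atomic<int> x(0), y(0);
int r1;

void store_then_load_sc()
{
    x.store(1, std::memory_order_seq_cst);   // MOV [x],1 ; MFENCE
    r1 = y.load(std::memory_order_seq_cst);  // MFENCE ; MOV r1,[y]
    // the two adjacent MFENCEs between the store and the load ought
    // to be reduced to a single one by the compiler
}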

regards,
alexander.

Alexander Terekhov

unread,
Jun 18, 2011, 9:42:17 AM6/18/11
to

Anthony Williams wrote:
>
> Masakuni Oishi <yam...@bsdhouse.org> writes:
>
> > If so, for the code 1 in my first post, is the outcome
> > r1 == 0 && r2 == 1 && r3 == 0 possible on IA-64?
>
> I believe so.

Perhaps this is relevant:

http://download.intel.com/design/itanium/downloads/25142901.pdf

See 3.3.7.1 Total ordering of WB Releases.

(remote write atomicity)

regards,
alexander.

Chris M. Thomasson

unread,
Jun 18, 2011, 5:40:35 PM6/18/11
to
"Joshua Maurice" <joshua...@gmail.com> wrote in message
news:2081fd67-f37d-42fa...@r27g2000prr.googlegroups.com...

On Jun 17, 3:24 pm, "Chris M. Thomasson" <cris...@charter.net> wrote:
>
> FWIW, here is Relacy source code:
>
[...]

> You know, I really ought to look at this tool. All of the descriptions
> of it I'm seeing talk about "interleavings", which could explain why
> it missed the allowed execution which breaks the OP's code. That
> allowed execution is not a sequentially consistent one. It cannot be
> expressed as a simple interleaving of instructions.

You can get the source code for Relacy here:

http://www.1024cores.net/home/relacy-race-detector


> It took me the better part of 2 days of all of my spare time, but I
> finally managed to do it. I wrote my own little thread tool, based on
> the C++0x rules as best I can understand them. (Thanks again Anthony!)
> I had to throw away a day's work as I was taking an approach that
> would result in too big of a space to be searched, but I managed to
> make the search space much smaller / the search algorithm much
> better.

Care to share some code!? ;^)

Did you use fibers/user threads to simulate the environment?


> When given the OP's code, it finds the allowed execution in a second
> or so, printing the allowed execution.

Nice.


> For funzies, when I ran it against the two code pieces at:
>
> http://www.justsoftwaresolutions.co.uk/threading/petersons_lock_with_C++0x_atomics.html
> It found the allowed execution of Bartosz's implementation that allows
> both threads to get the lock in about a second, and it informs me in
> about a second that in Dmitriy's implementation there is no allowed
> execution which has both threads getting the lock. It's really
> fascinating watching how you can change any one little piece, like
> making an exchange just a write, and seeing the algorithm fail and an
> allowed execution printed to the screen.

Very nice!

Joshua Maurice

unread,
Jun 18, 2011, 6:46:28 PM6/18/11
to
On Jun 18, 2:40 pm, "Chris M. Thomasson" <cris...@charter.net> wrote:
> "Joshua Maurice" <joshuamaur...@gmail.com> wrote in message

>
> news:2081fd67-f37d-42fa...@r27g2000prr.googlegroups.com...
> On Jun 17, 3:24 pm, "Chris M. Thomasson" <cris...@charter.net> wrote:
>
> > FWIW, here is Relacy source code:
>
> [...]
>
> > You know, I really ought to look at this tool. All of the descriptions
> > of it I'm seeing talk about "interleavings", which could explain why
> > it missed the allowed execution which breaks the OP's code. That
> > allowed execution is not a sequentially consistent one. It cannot be
> > expressed as a simple interleaving of instructions.
>
> You can get the source code for Relacy here:
>
> http://www.1024cores.net/home/relacy-race-detector
>
> > It took me the better part of 2 days of all of my spare time, but I
> > finally managed to do it. I wrote my own little thread tool, based on
> > the C++0x rules as best I can understand them. (Thanks again Anthony!)
> > I had to throw away a day's work as I was taking an approach that
> > would result in too big of a space to be searched, but I managed to
> > make the search space much smaller / the search algorithm much
> > better.
>
> Care to share some code!?  ;^)
>
> Did you use fibers/user threads to simulate the environment?
>
> > When given the OP's code, it finds the allowed execution in a second
> > or so, printing the allowed execution.
>
> Nice.
>
> > For funzies, when I ran it against the two code pieces at:
>
> >http://www.justsoftwaresolutions.co.uk/threading/petersons_lock_with_...

> > It found the allowed execution of Bartosz's implementation that allows
> > both threads to get the lock in about a second, and it informs me in
> > about a second that in Dmitriy's implementation there is no allowed
> > execution which has both threads getting the lock. It's really
> > fascinating watching how you can change any one little piece, like
> > making an exchange just a write, and seeing the algorithm fail and an
> > allowed execution printed to the screen.
>
> Very nice!

Let me clean it up a little. It's in a really rough shape right now.

How did I do it though? I didn't simulate at all. I fed my program the
program fragment in question. To specify the program fragment in
question, I need to specify to it:
- All the threads (of the thing to be tested),
- All the atomic variables,
- All the memory operations, including:
* isfence or acts on a variable
* read only, write only, or read-modify-write, (meaningless for
fence),
* consume, acquire, release, seq_cst
* which thread does the memory action
* (optional: enhancing at the moment: for the write, what value does
it write)
* I lumped the start of thread, end of thread, as memory operations
* I lumped the create-and-start thread call, and join thread call, as
memory operations
- information on sequenced-before for the specified memory operations
- TODO dependency-ordered-before information

Or most of that.

From that, I transitively close the sequenced-before.
I iterate over all possible sees pairs - a "sees pair" is a pair of a
read and a write (possibly read-modify-writes), where the read loads
the value from the write.
I iterate over all possible modification orders for all atomic
variables.
I construct the release sequences, given the above information.
I construct the synchronizes with partial order, given the above
information.
I construct the inter-thread happens before partial order, given the
above information.
I construct the happens before partial order, given the above
information.
I iterate over all possible total order S values.

I test it against all of the relevant rules of the C++0x which I can
find. If that particular set passes, then I save the set, and return
satisfiable. If that particular set fails, then I iterate to the next
possible set, construct all which needs to be constructed, and test
that. Repeat as necessary until I've exhausted all possible sets, and
if that happens then return not satisfiable.

The key was realizing this particular order for iteration and
construction. My first attempt was to just iterate over all possible
everythings, which was way too big of a space to examine one by one on
available computers. With this order, I can construct several of the
pieces of information, such as the happens before partial order,
instead of iterating over all possible values.

Also, with this order, I can put some of the tests earlier, so I can
prune the space early, greatly reducing the actual work the program
needs to do. For example, I can test 29.3 p12 right after I construct
the release sequences, instead of waiting until the end, thereby
avoiding wasted time on candidates that would fail a test no matter what.
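Schematically, the search loop looks something like this (a heavily
simplified, hypothetical skeleton of the enumerate-then-prune structure,
not the real code; each candidate layer is reduced to a plain integer):

#include <functional>

struct Candidate
{
    int seesAssignment;    // stand-in for a choice of sees pairs
    int modificationOrder; // stand-in for a choice of modification orders
    int totalOrderS;       // stand-in for a choice of the total order S
};

bool satisfiable(int nSees, int nModOrders, int nTotalOrders,
                 const std::function<bool(const Candidate&)>& earlyCheck,
                 const std::function<bool(const Candidate&)>& fullCheck)
{
    for (int s = 0; s < nSees; ++s)
    {
        for (int m = 0; m < nModOrders; ++m)
        {
            Candidate c = { s, m, -1 };
            if (! earlyCheck(c))   // e.g. 29.3 p12, right after the
                continue;          // release sequences are constructed
            for (int t = 0; t < nTotalOrders; ++t)
            {
                c.totalOrderS = t;
                if (fullCheck(c))  // the rest of the C++0x rules
                    return true;   // found an allowed execution
            }
        }
    }
    return false;  // no allowed execution under the given constraints
}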

As I said, let me clean it up some, and I'll post the code. Of course,
as I just wrote it literally yesterday, I'm sure it's full of bugs. I
find one every now and then.

Currently, it doesn't even support branching, looping, or anything
that's not straight-line code. I was thinking about whether and how I
could extend my idea to cover that. That's a next step. I'll do that
later,
presumably after I post what I have now.

Chris M. Thomasson

unread,
Jun 18, 2011, 11:27:38 PM6/18/11
to
"Joshua Maurice" <joshua...@gmail.com> wrote in message
news:e69af07a-dd09-4844...@j13g2000pro.googlegroups.com...

On Jun 18, 2:40 pm, "Chris M. Thomasson" <cris...@charter.net> wrote:
> > "Joshua Maurice" <joshuamaur...@gmail.com> wrote in message
[...]

> > > It took me the better part of 2 days of all of my spare time, but I
> > > finally managed to do it. I wrote my own little thread tool, based on
> > > the C++0x rules as best I can understand them.

[...]

> > Care to share some code!? ;^)

[...]

> > Very nice!

> Let me clean it up a little. It's in a really rough shape right now.

> How did I do it though? I didn't simulate at all. I fed my program the
> program fragment in question. To specify the program fragment in
> question, I need to specify to it:

[...]

Clever.


Joshua Maurice

unread,
Jun 19, 2011, 12:31:05 AM6/19/11
to

https://www.rapidshare.com/files/1030332591/cpptester.zip

Well, there it is, sans makefiles or anything else. It uses just
standard C++, so it shouldn't be bad at all to get it working. Sorry
that it's light on instructions, and light on comments. I think that
the code style is fairly good (but I'm sure I could spend a week
refactoring to make it look nicer).

Tested and built only with msvc, so sorry if there are any
cross-platform issues, but again it's standard C++ only, so I doubt it.

Oh, and I know there's some memory leaks in some of the functions in
jjracedetector2.cpp - I don't care enough atm to fix them for a
program that is designed to die as soon as it finishes this one job
(though it would be easy to fix).

As before, I mentioned that I just wrote this literally yesterday, and
have done little to no testing, so no correctness guarantees, nor any
guarantees that it won't format your hard drive (the proverbial result
of undefined behavior).

Joshua Maurice

unread,
Jun 19, 2011, 12:37:11 AM6/19/11
to

Oh - just to be clear in case you miss it in the random comment in
there, it doesn't support anything to do with dependency ordered
before and memory_order_consume (yet). The rest is looking good, I
think.

Anthony Williams

unread,
Jun 20, 2011, 4:27:09 AM6/20/11
to
Alexander Terekhov <tere...@web.de> writes:

Maybe I'm wrong; I am not an expert on IA-64 semantics.

Anthony Williams

unread,
Jun 20, 2011, 4:54:42 AM6/20/11
to
Alexander Terekhov <tere...@web.de> writes:

> Anthony Williams wrote:
>>
>> Alexander Terekhov <tere...@web.de> writes:
>>
>> >> Incorrect implementation
>> >>
>> >> Load Seq_Cst: MFENCE, MOV
>> >> Store Seq_Cst: MOV, MFENCE
>> >>
>> >> is busted because IRIW won't work.
>> >
>> > That is my thinking... but according to
>> >
>> > http://portal.acm.org/citation.cfm?id=1926394
>> >
>> > the "Incorrect implementation" above is actually correct (probably
>> > because according to them, MFENCE has the same semantics as PPC's
>> > hwsync).
>>
>> That paper also says:
>>
>> "Sequentially consistent atomics The proposal above includes
>> two implementations of sequentially consistent atomic reads and
>> writes; one with the x86 locked instructions, and the other with
>> fence instructions on both the reads and writes. However, we can

>> prove that it suffices either to place an mfence before every sc read,


>> or after every sc write, but that it is not necessary to do both. In
>> practice, placing the fence after the sc writes is expected to yield
>> higher performance."
>>
>> i.e. they think that
>>
>> Load SC: MOV
>> Store SC: MOV, MFENCE
>>
>> is a valid implementation.
>
> They are talking about X86-TSO (not actual X86), which is equivalent to
> SPARC-TSO (with #StoreLoad spelled as MFENCE),

Their X86-TSO paper says that it tries to model actual X86 cpus, and
tries to be consistent with the documented semantics, but favouring
actual CPU behaviour over the specs.

Anyway, it was you that cited this paper; if you don't think it is
relevant then drop it, and focus purely on the published Intel/AMD specs.

> see
>
> http://www.decadent.org.uk/pipermail/cpp-threads/2008-December/001947.html

That is a SPARC implementation; I see no relevance here other than your
claim that x86-TSO is equivalent to SPARC-TSO (the x86-TSO paper
describes it as "broadly similar").

> (obviously, it was meant that two consecutive #StoreLoad/MFENCE
> barriers, like in "store x, load y" both seq_cst, ought to be reduced to
> just one by the compiler.)

Yes, that part is "obvious".

Alexander Terekhov

unread,
Jun 20, 2011, 9:30:25 AM6/20/11
to

Anthony Williams wrote:
>
> Alexander Terekhov <tere...@web.de> writes:
>
> > Anthony Williams wrote:
> >>
> >> Masakuni Oishi <yam...@bsdhouse.org> writes:
> >>
> >> > If so, for the code 1 in my first post, is the outcome
> >> > r1 == 0 && r2 == 1 && r3 == 0 possible on IA-64?
> >>
> >> I believe so.
> >
> > Perhaps this is relevant:
> >
> > http://download.intel.com/design/itanium/downloads/25142901.pdf
> >
> > See 3.3.7.1 Total ordering of WB Releases.
> >
> > (remote write atomicity)
>
> Maybe I'm wrong; I am not an expert on IA-64 semantics.

Suppose that I can program in terms of C++MM's acquire/release. The
platform is x86.

How am I supposed to ensure remote write atomicity ala IA-64 releases
for a particular store (relevant loads) using C++MM and NOT pessimizing
everything to seq_cst?

The next question is regarding store-load fencing for particular
release-store and subsequent acquire-load.

If both can not be done easily under C++MM then it is pretty useless to
me.

regards,
alexander.

Alexander Terekhov

unread,
Jun 20, 2011, 9:39:05 AM6/20/11
to

Anthony Williams wrote:
[...]

> >> "Sequentially consistent atomics The proposal above includes
> >> two implementations of sequentially consistent atomic reads and
> >> writes; one with the x86 locked instructions, and the other with
> >> fence instructions on both the reads and writes. However, we can
> >> prove that it suffices either to place an mfence before every sc read,
> >> or after every sc write, but that it is not necessary to do both. In
> >> practice, placing the fence after the sc writes is expected to yield
> >> higher performance."
> >>
> >> i.e. they think that
> >>
> >> Load SC: MOV
> >> Store SC: MOV, MFENCE
> >>
> >> is a valid implementation.
> >
> > They are talking about X86-TSO (not actual X86), which is equivalent to
> > SPARC-TSO (with #StoreLoad spelled as MFENCE),
>
> Their X86-TSO paper says that it tries to model actual X86 cpus, and
> tries to be consistent with the documented semantics, but favouring
> actual CPU behaviour over the specs.
>
> Anyway, it was you that cited this paper; if you don't think it is
> relevant then drop it, and focus purely on the published Intel/AMD specs.

It was meant to show that mfence/load/mfence has more fencing than
needed.

>
> > see
> >
> > http://www.decadent.org.uk/pipermail/cpp-threads/2008-December/001947.html
>
> That is a SPARC implementation; I see no relevance here other than your

The relevance here is that for TSO, the implementation with Load: MOV,
Store: MOV, MFENCE is SC.

> claim that x86-TSO is equivalent to SPARC-TSO (the x86-TSO paper
> describes it as "broadly similar").
>
> > (obviously, it was meant that two consecutive #StoreLoad/MFENCE
> > barriers, like in "store x, load y" both seq_cst, ought to be reduced to
> > just one by the compiler.)
>
> Yes, that part is "obvious".

regards,
alexander.

Anthony Williams

unread,
Jun 20, 2011, 10:00:39 AM6/20/11
to
Alexander Terekhov <tere...@web.de> writes:

> Anthony Williams wrote:
>>
>> Alexander Terekhov <tere...@web.de> writes:
>>
>> > Anthony Williams wrote:
>> >>
>> >> Masakuni Oishi <yam...@bsdhouse.org> writes:
>> >>
>> >> > If so, for the code 1 in my first post, is the outcome
>> >> > r1 == 0 && r2 == 1 && r3 == 0 possible on IA-64?
>> >>
>> >> I believe so.
>> >
>> > Perhaps this is relevant:
>> >
>> > http://download.intel.com/design/itanium/downloads/25142901.pdf
>> >
>> > See 3.3.7.1 Total ordering of WB Releases.
>> >
>> > (remote write atomicity)
>>
>> Maybe I'm wrong; I am not an expert on IA-64 semantics.
>
> Suppose that I can program in terms of C++MM's acquire/release. The
> platform is x86.
>
> How am I supposed to ensure remote write atomicity ala IA-64 releases
> for a particular store (relevant loads) using C++MM and NOT pessimizing
> everything to seq_cst?

In the general case, you can't. Acquire/release is essentially pairwise
synchronization in the C++0x MM; if you need a total order then you must
use seq_cst.

However, in particular cases then you can order particular subsets. If
you have an example of code that you would like to work then I can see
if I can think of a way to make it work without using seq_cst everywhere.
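The classic IRIW (independent reads of independent writes) example is
the canonical case (just an illustrative sketch, with each function run
on its own thread):

#include <atomic>

std::atomic<int> x(0), y(0);
int r1, r2, r3, r4;

void writer_x()  { x.store(1, std::memory_order_seq_cst); }
void writer_y()  { y.store(1, std::memory_order_seq_cst); }

void reader_xy()
{
    r1 = x.load(std::memory_order_seq_cst);
    r2 = y.load(std::memory_order_seq_cst);
}

void reader_yx()
{
    r3 = y.load(std::memory_order_seq_cst);
    r4 = x.load(std::memory_order_seq_cst);
}

With everything memory_order_seq_cst, the outcome r1==1 && r2==0 &&
r3==1 && r4==0 is forbidden, since the two readers would have to
disagree about the order of the two writes in S. With acquire loads
and release stores it is allowed, which is exactly the lack of remote
write atomicity.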

> The next question is regarding store-load fencing for particular
> release-store and subsequent acquire-load.

One issue here is that the terminology and effects of SPARC membar
constraints do not map cleanly onto C++0x ordering constraints. SPARC
membar constraints relate instruction order to global visibility
order. C++0x ordering constraints are transitive pairwise constraints
between processors, apart from seq_cst which is a global ordering on
seq_cst operations.

Consequently there is not always a simple 1-1 translation between the
two, but you should be able to achieve the same effects. If there are
cases where you can't, then I am interested to understand the issues,
and see how the C++ MM could be extended to handle the new cases.

Anthony Williams

unread,
Jun 20, 2011, 10:41:24 AM6/20/11
to
Alexander Terekhov <tere...@web.de> writes:

> Anthony Williams wrote:
>>
>> Masakuni Oishi <yam...@bsdhouse.org> writes:
>>
>> > If so, for the code 1 in my first post, is the outcome
>> > r1 == 0 && r2 == 1 && r3 == 0 possible on IA-64?
>>
>> I believe so.
>
> Perhaps this is relevant:
>
> http://download.intel.com/design/itanium/downloads/25142901.pdf
>
> See 3.3.7.1 Total ordering of WB Releases.

The example is this:

> /*** code 1 ***/
> // Initially
> atomic<int> x(0), y(0);
>
> // Thread 1:
> y.store(1, memory_order_release);
> atomic_thread_fence(memory_order_seq_cst);
> r1 = x.load(memory_order_acquire);
>
> // Thread 2:
> x.store(1, memory_order_release);
>
> // Thread 3:
> r2 = x.load(memory_order_acquire);
> atomic_thread_fence(memory_order_seq_cst);
> r3 = y.load(memory_order_acquire);
> /***************/
>
> In the above code, is r1 == 0 && r2 == 1 && r3 == 0 possible?

Under the C++0x memory model, the acquire and release operations here
serve no purpose --- there are no stores prior to the releases, so there
is no imposed ordering, and they might as well all be
memory_order_relaxed.

The above code is thus equivalent to the code below, as far as the C++0x
memory model goes, and the compiler would be conforming if it compiled
it as such:

/***************/
// Initially
atomic<int> x(0), y(0);

// Thread 1:
y.store(1, memory_order_relaxed);
atomic_thread_fence(memory_order_seq_cst);
r1 = x.load(memory_order_relaxed);

// Thread 2:
x.store(1, memory_order_relaxed);

// Thread 3:
r2 = x.load(memory_order_relaxed);
atomic_thread_fence(memory_order_seq_cst);
r3 = y.load(memory_order_relaxed);
/***************/

I am not an expert on IA-64, but I would expect a relaxed write on IA-64
to be plain ST --- an unordered write --- and a relaxed load to be a
plain LD --- an unordered read.

Given that, the total ordering of WB releases is not relevant here, and
I suspect that r1 == 0 && r2 == 1 && r3 == 0 is possible.

Chris M. Thomasson

unread,
Jun 20, 2011, 6:45:29 PM6/20/11
to
"Joshua Maurice" <joshua...@gmail.com> wrote in message
news:cdb347b0-b3d8-41e6...@s16g2000prf.googlegroups.com...

[...]

> Oh - just to be clear in case you miss it in the random comment in
> there, it doesn't support anything to do with dependency ordered
> before and memory_order_consume (yet). The rest is looking good, I
> think.

Looking pretty good! FWIW, this version of Peterson's algorithm is reported
to work under your testing framework:
_________________________________________________________
void ChrisImpl()
{
cout << "Testing ChrisImpl..." << endl;

AtomicObj flag0 = jjNewAtomicObj("flag0");
AtomicObj flag1 = jjNewAtomicObj("flag1");
AtomicObj turn = jjNewAtomicObj("turn");

ThreadId tfirst = jjNewThread("tfirst");
ThreadId t0 = jjNewThread("t0");
ThreadId t1 = jjNewThread("t1");

jjStartOfThread("tfirst_start", tfirst);
jjWrite("tfirst_flag0_write_0", jjRelaxed, flag0, 0);
jjWrite("tfirst_flag1_write_0", jjRelaxed, flag1, 0);
jjWrite("tfirst_flag1_write_0", jjRelaxed, turn, 0);
jjStartThreadCall("tfirst_starting_t0", t0);
jjStartThreadCall("tfirst_starting_t1", t1);

jjStartOfThread("t0_start", t0);
jjWrite("t0_flag0_write_1", jjRelaxed, flag0, 1);
jjFence("t0_release_fence", jjRelease);
jjRMW("t0_turn_exchange_1", jjRelaxed, turn, 1);
jjFence("t0_acquire_fence", jjAcquire);
MemoryOp t0_flag1_read = jjRead("t0_flag1_read", jjRelaxed, flag1);
MemoryOp t0_turn_read = jjRead("t0_turn_read", jjRelaxed, turn);

jjStartOfThread("t1_start", t1);
jjWrite("t1_flag1_write_1", jjRelaxed, flag1, 1);
jjFence("t1_release_fence", jjRelease);
jjRMW("t1_turn_exchange_1", jjRelaxed, turn, 0);
jjFence("t1_acquire_fence", jjAcquire);
MemoryOp t1_flag0_read = jjRead("t1_flag0_read", jjRelaxed, flag0);
MemoryOp t1_turn_read = jjRead("t1_turn_read", jjRelaxed, turn);

SatisfiabilityTester& st = jjGetGlobalTester();

//assert thread 0 acquires the lock
st.addGenericUserConstraintAtSees(
new OrConstraint(
new ConstrainSeesValue(t0_flag1_read, 0),
new ConstrainSeesValue(t0_turn_read, 0)
));

//assert thread 1 acquires the lock
st.addGenericUserConstraintAtSees(
new OrConstraint(
new ConstrainSeesValue(t1_flag0_read, 0),
new ConstrainSeesValue(t1_turn_read, 1)
));

bool const satisfiable = st.isSatisfiable(); //expected possible
if (satisfiable)
{ cout << "ChrisImpl: Condition Satisfiable" << endl;
st.printState();
cout << "ChrisImpl: Condition Satisfiable" << endl;
}else
cout << "ChrisImpl: Condition Not satisfiable" << endl;
}
_________________________________________________________

AFAICT, it is "identical" to Dmitriy's impl except I use explicit
`atomic_thread_fence()' instructions, and relaxed for every atomic
operation...


Joshua Maurice

unread,
Jun 20, 2011, 11:05:40 PM6/20/11
to
On Jun 20, 3:45 pm, "Chris M. Thomasson" <cris...@charter.net> wrote:
> "Joshua Maurice" <joshuamaur...@gmail.com> wrote in message

I just added some basic support for branching and such. Basically
label statements, and unconditional branches, and branch if the
previous read is equal to a specified int literal. It's working out
pretty well. Two caveats.

One, I put in a user-configurable number which specifies how much
branching complexity should be considered. That is, how many times it
will allow an execution to execute the same branch before ignoring any
more (taken or not taken, roughly). So, if you set it
too low, it'll miss allowed executions. Set it too high on a complex
enough piece of code, and it'll run for the age of the universe.

Two. Currently, I tried to test the correct solutions with the
criteria that "not both fail to acquire the lock", but unfortunately
my thing reports that it's possible for both to acquire the lock -
incorrectly so. The printed execution is one where one of the threads
continually loads old values, which is disallowed by 29.3 / 13. I
offer no guarantees that I'll ever make it work, though I'm thinking
about how to now. Maybe something like if it sees a thread with a
sequence of reads without writes to some object at the end of the
execution, and it sees all the reads loading some stale value, and the
number of reads is beyond some configurable amount, then it ignores
that execution...

The old interface without branching is still there. The new one is
build on top of the old one to support branching. The new one with
branching is correct only if you set the branching factor high enough,
whereas the old one has no configurable parameters and is simply
correct (modulo bugs).

I'll probably get around to uploading what I have sometime soon. This
has just been a fun side project for me. It was fun. Not sure if I'll
do too much more; I suspect there's not enough value for me relative
to the work needed to get anything else working. It's expressive
enough to cover all
interesting basic algorithms in C++0x (maybe, I think?), and I'll use
sequentially consistent reasoning for the rest.

Masakuni Oishi

unread,
Jun 21, 2011, 2:24:31 AM6/21/11
to
Anthony Williams <anthony....@gmail.com> wrote:

Hmm... Interesting.

I'm not an expert on IA-64 either, but I think
the code would be compiled like this:

// Thread 1
US1: st [y] = 1
MF1: mf
UL1: ld r1 = [x]

// Thread 2
US2: st [x] = 1

// Thread 3
UL2: ld r2 = [x]
MF2: mf
UL3: ld r3 = [y]

If the memory model of IA-64 allows ordering such as:
RV_3(US2) -> F(MF2) -> F(MF1) -> RV_1(US2)
then the outcome r1 == 0 && r2 == 1 && r3 == 0 is possible,
I think.

Intel's specification does not seem to prohibit such an ordering,
but I'm not sure whether it can actually occur.

-- masakuni

Alexander Terekhov

unread,
Jun 21, 2011, 8:54:00 AM6/21/11
to

Anthony Williams wrote:
[...]

> > Suppose that I can program in terms of C++MM's acquire/release. The
> > platform is x86.
> >
> > How am I supposed to ensure remote write atomicity ala IA-64 releases
> > for a particular store (relevant loads) using C++MM and NOT pessimizing
> > everything to seq_cst?
>
> In the general case, you can't. Acquire/release is essentially pairwise
> synchronization in the C++0x MM; if you need a total order then you must
> use seq_cst.

I have no problem using seq_cst for either the particular store or the
relevant loads, but I don't want to pessimize everything to seq_cst.

>
> However, in particular cases then you can order particular subsets.

Please elaborate.

> If you have an example of code that you would like to work then I can
> see if I can think of a way to make it work without using seq_cst
> everywhere.
>
> > The next question is regarding store-load fencing for particular
> > release-store and subsequent acquire-load.
>
> One issue here is that the terminology and effects of SPARC membar
> constraints do not map cleanly onto C++0x ordering constraints. SPARC

Do you have an example for a C++MM's seq_cst fence actually not
providing a store-load barrier in compiled code on some platform?

regards,
alexander.

Alexander Terekhov

unread,
Jun 21, 2011, 9:59:44 AM6/21/11
to

Anthony Williams wrote:

[... remote write atomicity ...]

> However, in particular cases then you can order particular subsets. If
> you have an example of code that you would like to work then I can see
> if I can think of a way to make it work without using seq_cst everywhere.

See examples in

http://portal.acm.org/citation.cfm?id=1375591
("4.2 Unintended Consequences of Relaxing Write-Atomicity")

"4.2.1 Write-to-Read Causality (WRC) Must be Upheld"

"4.2.2 Read-to-Write Causality (RWC)"

"4.2.3 Subtle Interplay Between Coherence and Causality"

regards,
alexander.

Alexander Terekhov

unread,
Jun 21, 2011, 12:37:18 PM6/21/11
to

Alexander Terekhov wrote:
>
> Anthony Williams wrote:
> [...]
> > > Suppose that I can program in terms of C++MM's acquire/release. The
> > > platform is x86.
> > >
> > > How am I supposed to ensure remote write atomicity ala IA-64 releases
> > > for a particular store (relevant loads) using C++MM and NOT pessimizing
> > > everything to seq_cst?
> >
> > In the general case, you can't. Acquire/release is essentially pairwise
> > synchronization in the C++0x MM; if you need a total order then you must
> > use seq_cst.
>
> I have no problem using seq_cst for either particular store or relevant
> loads ...

I take that back. I'd rather get rid of seq_cst altogether.

atomic<>'s loads and stores ought to support the following 'modes':

Whether load/store is competing (default) or not. Competing load
means that there might be concurrent store. Competing store means
that there might be concurrent load or store. Non-competing
load/store can be performed non-atomically.

Whether competing load/store needs remote write atomicity (default
is no remote write atomicity). All competing loads/stores must
specify the same remote write atomicity mode.

Whether load/store has specified reordering constraint (default
is no constraint specified) in terms of the following reordering
modes:

Whether preceding loads (in program order) can be reordered
across it (can by default).

Whether preceding stores (in program order) can be reordered
across it (can by default).

Whether subsequent loads (in program order) can be reordered
across it (can by default). For load, the set of constrained
subsequent loads can be limited to only dependant loads (aka
'consume' mode).

Whether subsequent stores (in program order) can be reordered
across it (can by default). For load, there is an implicit
reordering constraint regarding dependent stores (no need to
specify it).

atomic<>'s fence operation can be used to specify reordering
constraint using the same modes.
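As a strawman, spelled out as code (purely hypothetical names; nothing
like this exists in C++0x, it only restates the modes above):

// illustrative only
enum class contention { noncompeting, competing };
enum class write_atomicity { plain, remote };

struct reorder_constraint
{
    bool block_preceding_loads;   // preceding loads may not cross it
    bool block_preceding_stores;  // preceding stores may not cross it
    bool block_subsequent_loads;  // subsequent loads may not cross it
                                  // (for a load, optionally only the
                                  // dependent ones, i.e. 'consume')
    bool block_subsequent_stores; // subsequent stores may not cross it
};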

regards,
alexander.

Alexander Terekhov

unread,
Jun 22, 2011, 7:24:09 AM6/22/11
to

Anthony Williams wrote:
[...]

> >> > http://portal.acm.org/citation.cfm?id=1926394
> >> >
> >> > the "Incorrect implementation" above is actually correct (probably
> >> > because according to them, MFENCE has the same semantics as PPC's
> >> > hwsync).
> >>
> >> That paper also says:
> >>
> >> "Sequentially consistent atomics The proposal above includes
> >> two implementations of sequentially consistent atomic reads and
> >> writes; one with the x86 locked instructions, and the other with
> >> fence instructions on both the reads and writes. However, we can
> >> prove that it suffices either to place an mfence before every sc read,
> >> or after every sc write, but that it is not necessary to do both. In
> >> practice, placing the fence after the sc writes is expected to yield
> >> higher performance."
> >>
> >> i.e. they think that
> >>
> >> Load SC: MOV
> >> Store SC: MOV, MFENCE
> >>
> >> is a valid implementation.
> >
> > They are talking about X86-TSO (not actual X86), which is equivalent to
> > SPARC-TSO (with #StoreLoad spelled as MFENCE),
>
> Their X86-TSO paper says that it tries to model actual X86 cpus, and
> tries to be consistent with the documented semantics, but favouring
> actual CPU behaviour over the specs.

I've just checked the recent Intel specs... and according to

http://www.intel.com/Assets/PDF/manual/253668.pdf
(May 2011)

recent X86 hardware (P6 and More Recent Processor Families) *is* a step
back to TSO, see

"8.2.2 Memory Ordering in P6 and More Recent Processor Families

. . .

Any two stores are seen in a consistent order by processors other than
those performing the stores"

and

"8.2.3.7 Stores Are Seen in a Consistent Order by Other Processors"

(IRIW without any fencing or XCHG)

More googling...

http://blogs.oracle.com/dave/entry/x86_platform_memory_model_clarifications

David Dice reported it back in 2009:

"Intel has taken steps toward addressing some of the concerns about
potential platform memory model relaxations that I identified in a
previous blog entry. Specifically see section 7.2.3.7 of their latest
Software Developer's Manual which now guarantees global store ordering."

2009's 7.2.3.7 is current 8.2.3.7

regards,
alexander.

Anthony Williams

unread,
Jun 22, 2011, 12:01:56 PM6/22/11
to
Alexander Terekhov <tere...@web.de> writes:

> Anthony Williams wrote:
>
> [... remote write atomicity ...]
>
>> However, in particular cases then you can order particular subsets. If
>> you have an example of code that you would like to work then I can see
>> if I can think of a way to make it work without using seq_cst everywhere.
>
> See examples in
>
> http://portal.acm.org/citation.cfm?id=1375591
> ("4.2 Unintended Consequences of Relaxing Write-Atomicity")
>
> "4.2.1 Write-to-Read Causality (WRC) Must be Upheld"

Figure 5:

std::atomic<int> X(0),Y(0);

T1
X.store(1,memory_order_relaxed);

T2
r1=X.load(memory_order_relaxed);
atomic_thread_fence(memory_order_release);
Y.store(1,memory_order_relaxed);

T3
r2=Y.load(memory_order_relaxed)
atomic_thread_fence(memory_order_acquire);
r3=X.load(memory_order_relaxed);

The required ordering comes from 29.8p2 and 1.10p16, since T2 stores to
Y and T3 loads from Y: if r1==1 and r2==1 then r3==1.

> "4.2.2 Read-to-Write Causality (RWC)"

Figure 6:

std::atomic<int> X(0),Y(0);

T1
X.store(1,T1MO);

T2
r1=X.load(T2MO1);
atomic_thread_fence(T2MO2);
r2=Y.load(T2MO3);

T3
Y.store(1,T3MO1);
atomic_thread_fence(T3MO2);
r3=X.load(T3MO3);

In this case, there are no orderings such that acquire and release can
have any impact. Therefore the choice of ordering for each operation is
either memory_order_seq_cst or memory_order_relaxed.

I presume we are trying to find a set of ordering constraints such that
r1==1, r2==0, r3==0 is not possible.

If T1MO is memory_order_relaxed, then both T2MO1 and T3MO3 must be
memory_order_seq_cst, and either of T2MO2 and T2MO3 must be
memory_order_seq_cst, and either of T3MO1 and T3MO2 must be
memory_order_seq_cst.

With any of these combinations, if r2==0 then the read of r1 must occur
before the read of r3 in S, and r1==1, r3==0 is thus impossible by
29.3p3 (otherwise it is not consistent with the modification order of
X).

If T1MO is memory_order_seq_cst, then it is enough to make T2MO1, T2MO2
and T3MO2 memory_order_seq_cst; the rest can be memory_order_relaxed.

If r1==1 then the load of r1 is after the store to X in S. If r2==0 then
the fence in T2 must be before the fence from T3 in S (by 29.3p5, if the
fence in T3 is prior to the fence in T2 then r2==1). If T2 precedes T3
in S, and the store to X precedes the load of r1 in S, then the store to
X precedes T3 in S. By 29.3p5 we must have r3==1.

> "4.2.3 Subtle Interplay Between Coherence and Causality"

std::atomic<int> X(0),Y(0);

T1
X.store(1,T1MO);

T2
r1=X.load(T2MO1);
atomic_thread_fence(T2MO2);
r2=Y.load(T2MO3);

T3
Y.store(1,T3MO1);
atomic_thread_fence(T3MO2);
X.store(2,T3MO3);

T4
r3=X.load(T4MO1);
atomic_thread_fence(T4MO2);
r4=X.load(T4MO3);

In this example, the "violating result" is given as r1==1, r2==0,
r3==2,
r4==1.

r3==2, r4==1 requires that the store X=2 is prior to the store X=1 in
the modification order of X. There are no stores on thread 4, and no
ordering constraints can change this requirement, so T4MO1, T4MO2 and
T4MO3 can all be memory_order_relaxed.

The requirement that r1==1 requires that the load of r1 is seeing the
result of the store X=1, which is after the store X=2 from T3 in the
MO
of X. If T3MO3 and T2MO1, T3MO1 and T2MO3 are memory_order_seq_cst
then
everything works, even if everything else is memory_order_relaxed,
since
the load of r1 must be later in SC than the store X=2, in order to see
the later stored value, which means the store of Y=1 occurs before the
load of r2, and r2 must be 1.

Likewise, if T3MO3 and T2MO1, T3MO2 and T2MO2 are
memory_order_seq_cst,
then the same rules apply to mean the fence in T3 is before the fence
in
T2 in S, and by 29.3p5 r2==1.

It is possible that there are some alternatives with fewer
memory_order_seq_cst operations, but I don't see them right now.

Message has been deleted

Masakuni Oishi

unread,
Jun 22, 2011, 2:19:33 PM6/22/11
to

Here is another example.

/***************/
// Initially
atomic<int> x(0), y(0);

// Thread 1:
r3 = y.load(memory_order_relaxed);
atomic_thread_fence(memory_order_seq_cst);
r1 = x.load(memory_order_relaxed);

// Thread 2:
x.store(1, memory_order_relaxed);

// Thread 3:
r2 = x.load(memory_order_relaxed);
atomic_thread_fence(memory_order_seq_cst);
y.store(1, memory_order_relaxed);
/***************/

In this code, the outcome r1 == 0 && r2 == 1 && r3 == 1
is NOT allowed under the C++0x memory model.
(29.8p2 and 1.10p16)

But, if we compile the code on IA-64,

// Thread 1
UL3:   ld   r3 = [y]
MF1:   mf
UL1:   ld   r1 = [x]

// Thread 2
US2:   st   [x] = 1

// Thread 3
UL2:   ld   r2 = [x]
MF2:   mf
US1:   st   [y] = 1

it seems that the outcome r1 == 0 && r2 == 1 && r3 == 1
is possible.
Isn't it?

-- masakuni

Masakuni Oishi

unread,
Jun 22, 2011, 2:57:25 PM6/22/11
to

Furthermore,

/***************/
// Initially
atomic<int> x(0), y(0);

// Thread 1:
x.store(1, memory_order_relaxed);

// Thread 2:
r1 = x.load(memory_order_relaxed);
y.store(1, memory_order_release);

// Thread 3:
r2 = y.load(memory_order_acquire);
r3 = x.load(memory_order_relaxed);
/***************/

this code does not allow the outcome r1 == 1 && r2 == 1 && r3 == 0
under the C++0x memory model.

But, the compiled code on IA-64:

// Thread 1
US:    st     [x] = 1

// Thread 2
UL1:   ld     r1 = [x]
SR:    st.rel [y] = 1

// Thread 3
LA:    ld.acq r2 = [y]
UL2:   ld     r3 = [x]

seems to allow the outcome r1 == 1 && r2 == 1 && r3 == 0.

Therefore, we may not compile a relaxed store to a plain ST.

-- masakuni

Anthony Williams

unread,
Jun 24, 2011, 4:42:38 AM6/24/11
to

OK

> But, if we compile the code on IA-64,
>
> // Thread 1
> UL3:   ld   r3 = [y]
> MF1:   mf
> UL1:   ld   r1 = [x]
>
> // Thread 2
> US2:   st   [x] = 1
>
> // Thread 3
> UL2:   ld   r2 = [x]
> MF2:   mf
> US1:   st   [y] = 1
>
> it seems that the outcome r1 == 0 && r2 == 1 && r3 == 1
> is possible.
> Isn't it?

I'm not sure. As far as I can make out from the Itanium memory
ordering docs, if r2==1 and r3==1 then:

RV3(US2)->R(UL2)->F(MF2)->RV1(US1)->R(UL3)->F(MF1)->R(UL1)

That on its own doesn't tell us what r1 is, since RV1(US2) is not in
that chain.

Unless there is something that I've missed, there is no constraint
that requires RV1(US2)->R(UL1), so you are right: r1==0 is permitted.
Is there anyone who is an expert on IA-64 who can confirm this, or
point out what is missing?

Alexander Terekhov

unread,
Jun 27, 2011, 7:24:48 AM6/27/11
to

Anthony Williams wrote:
[...]

That's because the visibility of the 2nd thread's store with respect to
the 1st thread can be delayed until after the 3rd thread's store is made
visible with respect to the 1st thread, so the 1st thread can load a
'stale' value of x into r1 (due to the lack of remote write atomicity).
Itanium's st.rel aside, on PowerPC the heaviest fence (hwsync) would
prevent that outcome (the hwsync in the 1st thread precludes it from
observing a 'stale' value of x), but Itanium's mf is not that strong.

regards,
alexander.

Alexander Terekhov

unread,
Jun 27, 2011, 8:05:07 AM6/27/11
to

Masakuni Oishi wrote:
[...]

> /***************/
> // Initially
> atomic<int> x(0), y(0);
>
> // Thread 1:
> x.store(1, memory_order_relaxed);
>
> // Thread 2:
> r1 = x.load(memory_order_relaxed);
> y.store(1, memory_order_release);
>
> // Thread 3:
> r2 = y.load(memory_order_acquire);
> r3 = x.load(memory_order_relaxed);
> /***************/
>
> this code does not allow the outcome
> r1 == 1 && r2 == 1 && r3 == 0
> under the C++0x memory model.

Well,

http://www.rdrop.com/users/paulmck/scalability/paper/N2745r.2011.03.04a.html

"Example 1: Load/Load ...

Thread 0:

r1 = x.load(memory_order_relaxed);
y.store(1, memory_order_release);

Thread 1:


r2 = y.load(memory_order_acquire);
r3 = x.load(memory_order_relaxed);

Thread 2 (required for assertion):

x.store(1, memory_order_relaxed);"

says that "the C++ memory model does not actually require the loads to
be ordered in this case, since thread 0's relaxed store is not
guaranteed to be seen in any particular order by thread 0's and 1's
relaxed loads. " ("thread 0's relaxed store" is meant to say "thread 2's
relaxed store" of course)

regards,
alexander.

Joshua Maurice

unread,
Jun 29, 2011, 7:14:13 PM6/29/11
to
On Jun 27, 5:05 am, Alexander Terekhov <terek...@web.de> wrote:
> Masakuni Oishi wrote:
>
> [...]
>
> > /***************/
> > // Initially
> > atomic<int> x(0), y(0);
>
> > // Thread 1:
> > x.store(1, memory_order_relaxed);
>
> > // Thread 2:
> > r1 = x.load(memory_order_relaxed);
> > y.store(1, memory_order_release);
>
> > // Thread 3:
> > r2 = y.load(memory_order_acquire);
> > r3 = x.load(memory_order_relaxed);
> > /***************/
>
> > this code does not allow the outcome
> > r1 == 1 && r2 == 1 && r3 == 0
> > under the C++0x memory model.
>
> Well,
>
> http://www.rdrop.com/users/paulmck/scalability/paper/N2745r.2011.03.0...

>
> "Example 1: Load/Load ...
>
> Thread 0:      
>
>   r1 = x.load(memory_order_relaxed);
>   y.store(1, memory_order_release);
>
> Thread 1:      
>
>   r2 = y.load(memory_order_acquire);
>   r3 = x.load(memory_order_relaxed);
>
> Thread 2 (required for assertion):      
>
>   x.store(1, memory_order_relaxed);"
>
> says that "the C++ memory model does not actually require the loads to
> be ordered in this case, since thread 0's relaxed store is not
> guaranteed to be seen in any particular order by thread 0's and 1's
> relaxed loads. " ("thread 0's relaxed store" is meant to say "thread 2's
> relaxed store" of course)

Google groups is acting up, so sorry if someone's already gotten to
this.

I think you're saying that "r1 == 1 && r2 == 1 && r3 == 0" is the
result of an allowed execution according to the C++0x rules. I believe
this is incorrect. I claim that there is no such allowed execution. My
program says as much (uploaded else-thread), but I don't trust it at
the moment, so I did the reasoning myself. My reasoning is as
follows:


/***************/
// Initially
atomic<int> x(0);
atomic<int> y(0);
start thread 1;
start thread 2;
start thread 3;

// Thread 1:
x.store(1, memory_order_relaxed);

// Thread 2:
r1 = x.load(memory_order_relaxed);
y.store(1, memory_order_release);

// Thread 3:
r2 = y.load(memory_order_acquire);
r3 = x.load(memory_order_relaxed);
/***************/

//end conditions: r1 == 1 && r2 == 1 && r3 == 0

Let's see if there's an allowed execution that fits these constraints.
We'll do so by assuming the constraints. If we derive a contradiction,
then there is no allowed execution under these constraints. If we
apply all of the possible rules and do not get a contradiction, then
we've found an allowed execution.

* The constraints we're assuming are:
"r1 = x.load(memory_order_relaxed);" sees value 1
"r2 = y.load(memory_order_acquire);" sees value 1
"r3 = x.load(memory_order_relaxed);" sees value 0
* The only way to satisfy these constraints is:
"r1 = x.load(memory_order_relaxed);"
sees "x.store(1, memory_order_relaxed);"
"r2 = y.load(memory_order_acquire);"
sees "y.store(1, memory_order_release);"
"r3 = x.load(memory_order_relaxed);"
sees "atomic<int> x(0);"

* Remember:
"r2 = y.load(memory_order_acquire);"
sees "y.store(1, memory_order_release);"
* 29.3 p2 gives:
"y.store(1, memory_order_release);"
syncs-with "r2 = y.load(memory_order_acquire);"
* Repeated applications of 1.10 p11 give:
"r1 = x.load(memory_order_relaxed);"
inter-thread-hb "r3 = x.load(memory_order_relaxed);"
* 1.10 p12 gives:
"r1 = x.load(memory_order_relaxed);"
happens-before "r3 = x.load(memory_order_relaxed);"
* 1.10 p16 gives:
For T, U so that
"r1 = x.load(memory_order_relaxed);" sees T, and
"r3 = x.load(memory_order_relaxed);" sees U,
then T is U, or T precedes U in the modification order of X
aka: T <=mod-order-X U
* Thus
"x.store(1, memory_order_relaxed);"
<=mod-order-X "atomic<int> x(0);"

* 30.3.1.2 p5 gives:
"start thread 1;"
syncs-with the start of thread 1's execution, which is
sequenced before "x.store(1, memory_order_relaxed);"
* Repeated applications of 1.10 p11 give:
"atomic<int> x(0);"
inter-thread-hb "x.store(1, memory_order_relaxed);"
* 1.10 p12 gives:
"atomic<int> x(0);"
happens-before "x.store(1, memory_order_relaxed);"
* 1.10 p15 gives:
"atomic<int> x(0);"
precedes "x.store(1, memory_order_relaxed);"
in the modification order of X
aka: "atomic<int> x(0);"
<mod-order-X "x.store(1, memory_order_relaxed);"

* So, we've derived the following:
"x.store(1, memory_order_relaxed);"
<=mod-order-X "atomic<int> x(0);"
"atomic<int> x(0);"
<mod-order-X "x.store(1, memory_order_relaxed);"
* Thus contradiction.
* Thus there is no allowed execution which satisfies the assumed
constraints.
* Aka: There is no allowed execution (in C++0x) which can produce:
r1 == 1 && r2 == 1 && r3 == 0.
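
For anyone who wants to poke at this empirically, here is a minimal
runnable rendering of the litmus test (just a sketch using std::thread;
observing the forbidden outcome even once would indicate a
non-conforming implementation, while never observing it proves
nothing):

/*** sketch ***/
#include <atomic>
#include <cstdio>
#include <thread>

int main() {
    for (int i = 0; i < 100000; ++i) {
        std::atomic<int> x(0), y(0);
        int r1 = 0, r2 = 0, r3 = 0;

        std::thread t1([&] { x.store(1, std::memory_order_relaxed); });
        std::thread t2([&] {
            r1 = x.load(std::memory_order_relaxed);
            y.store(1, std::memory_order_release);
        });
        std::thread t3([&] {
            r2 = y.load(std::memory_order_acquire);
            r3 = x.load(std::memory_order_relaxed);
        });

        t1.join();
        t2.join();
        t3.join();

        if (r1 == 1 && r2 == 1 && r3 == 0)
            std::printf("forbidden outcome at iteration %d\n", i);
    }
    return 0;
}
/***************/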

Alexander Terekhov

Jun 30, 2011, 8:26:19 AM

Anthony Williams wrote:
>
> Alexander Terekhov <tere...@web.de> writes:
>
> > Anthony Williams wrote:
> > [...]
> >> SC fences should order relaxed reads, but the final result was that they
> >> don't, as it was too expensive on some architectures (such as IA-64).
> >
> > Care to elaborate? What is the problem regarding IA-64?
>
> I don't remember, and my notes don't say. I think there was some
> discussion that relaxed loads would have to be more than just a plain LD
> instruction if SC fences were to order them, but I really don't remember
> the details, and wasn't privy to the entire discussion anyway.

(from the ongoing email discussion regarding
http://www.cl.cam.ac.uk/~pes20/cpp/cpp0xmappings.html )

<Forward Quoted>

Hans Boehm wrote:
[...]
> > [load relaxed]
> >
> >> The problem with a plain ld is that it can be reordered with other
> >> loads from the same location, violating cache coherence.
> >
> > Do you mean that
> >
> > (initially: x = 0)
> >
> > T1: st x = 1;
> > T2: r1 = ld x; r2 = ld x;
> >
> > r1 == 1, r2 == 0 is allowed due to unordered loads?
> >
> > Well, that's really interesting...
>
> According to the spec, yes. Whether any implementations allow that,
> I do not know.
[...]
> Memory_order_relaxed indeed appears to be entirely useless on Itanium.
[...]
> Abbreviating the example, everything is relaxed, except fences are
> seq_cst:
>
> Thread 1:     Thread 2:   Thread 3:
> r3 = y (1)    x = 1       r2 = x (1)
> fence1                    fence2
> r1 = x (0)                y = 1
>
> We conclude by 29.8p2 that fence2 happens-before fence1.
>
> Since the initializing write of zero to x precedes the explicit
> store to x in modification order, and the thread 3 read of x
> sees the later one, so must the thread 1 read of x, so this is
> indeed disallowed.
>
> For the Itanium spec, I'm referring to
> http://download.intel.com/design/itanium/downloads/25142901.pdf ,
> "A Formal Specification of Intel Itanium Processor Family
> Memory Ordering", Oct. 2002.
>
> In the Itanium translation with st, this indeed does not appear
> to be enforced. The store to x can occur and become locally
> visible before anything else (except the initializing write),
> and remotely visible on thread 3 shortly thereafter, without
> imposing any constraints on remote visibility at thread 1. And,
> in spite of my earlier claim, I think the same issue arises if
> we use acquire/release operations on y instead of the fences.
> Thus my natural inclination to blame all of our problems on
> fences isn't even correct here :-)
>
> I really need write atomicity in order to get the kinds of
> transitivity guarantees I need, and only st.rel provides those.
>
> This seems to have no impact on RMW operations, since those
> become globally visible in a single order, even if they are only
> acquire operations.

regards,
alexander.

Anthony Williams

Jul 1, 2011, 5:25:36 AM
On Jun 27, 12:24 pm, Alexander Terekhov <terek...@web.de> wrote:

Right, so it appears that on Itanium we cannot use plain LD and ST for
memory_order_relaxed operations --- at least one of them must be
stronger. For now, I'm not sure whether we can get away with making
only one stronger, or whether we need to go for making both loads and
stores stronger.

Anthony

Alexander Terekhov

Jul 1, 2011, 6:48:13 AM

Anthony Williams wrote:
[...]

> Right, so it appears that on Itanium we cannot use plain LD and ST for
> memory_order_relaxed operations --- at least one of them must be
> stronger. For now, I'm not sure whether we can get away with making
> only one stronger, or whether we need to go for making both loads and
> stores stronger.

It turns out that Itanium doesn't respect read-after-read dependence
(two loads from the same location, with no store to that location in
between in program order) for plain ld. Since C++11 requires such
coherence (which makes sense; PowerPC does respect RAR dependence),
ld.acq must be used instead of plain ld.
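
Expressed as a C++11 litmus test (just a sketch), the read-after-read
requirement says that the outcome r1 == 1 && r2 == 0 below is forbidden
even though every operation is relaxed; that is exactly the guarantee a
plain Itanium ld does not provide:

/*** sketch ***/
#include <atomic>
#include <cassert>
#include <thread>

std::atomic<int> x(0);
int r1 = 0, r2 = 0;

int main() {
    std::thread writer([] { x.store(1, std::memory_order_relaxed); });
    std::thread reader([] {
        r1 = x.load(std::memory_order_relaxed); // if this sees 1 ...
        r2 = x.load(std::memory_order_relaxed); // ... this must not see 0
    });
    writer.join();
    reader.join();
    assert(!(r1 == 1 && r2 == 0)); // read-read coherence, FDIS 1.10 p16
    return 0;
}
/***************/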

Plain st cannot be used for memory_order_relaxed stores either, because
in some cases C++11 requires remote write atomicity for
memory_order_relaxed stores (Joshua Maurice posted examples here), so
st.rel (which provides remote write atomicity) must be used instead
of plain st.

On PowerPC, the situation is better (plain LD and ST can be used for
memory_order_relaxed loads/stores) because PowerPC respects RAR and
provides cumulativity for lwsync/hwsync fences, see:

http://www.cl.cam.ac.uk/~pes20/ppc-supplemental/pldi105-sarkar.pdf

Note that under the 'modes' model, remote write atomicity is a separate
mode, and hence plain st can be used for 'very relaxed' (the default)
stores even on Itanium (I think that Itanium's incoherence for plain ld
sucks; ld.acq must be used for 'very relaxed' (the default) loads under
the 'modes' model just like under C++11).

regards,
alexander.

Masakuni Oishi

Jul 2, 2011, 2:15:16 PM
Alexander Terekhov wrote:
> It turns out that Itanium doesn't respect read-after-read dependence
> (two loads from the same location without a store to the same location
> in-between in program order) for plain ld. As long as C++11 requires
> such coherence (makes sense, PowerPC does respect RAR dependence),
> ld.acq must be used instead of plain ld.
>
> Plain st can not be used for memory_order_relaxed stores as well because
> in some cases C++11 requires remote write atomicity for
> memory_order_relaxed stores (Joshua Maurice posted here examples) and
> hence st.rel (it provides remote write atomicity) must be used instead
> of plain st.

I see.

Now, back to my first question.

/***************/
// Initially
atomic<int> x(0), y(0);

// Thread 1:
x.store(1, memory_order_relaxed);

// Thread 2:
r1 = x.load(memory_order_relaxed);
atomic_thread_fence(memory_order_seq_cst);
r2 = y.load(memory_order_relaxed);

// Thread 3:
y.store(1, memory_order_relaxed);
atomic_thread_fence(memory_order_seq_cst);
r3 = x.load(memory_order_relaxed);
/***************/

In this code, is there any hardware that permits
the outcome r1 == 1 && r2 == 0 && r3 == 0?

If not, I want C++MM to prohibit such outcome.

--
masakuni

Alexander Terekhov

Jul 5, 2011, 4:07:20 AM

Masakuni Oishi wrote:
[...]

>
> Now, back to my first question.
>
> /***************/
> // Initially
> atomic<int> x(0), y(0);
>
> // Thread 1:
> x.store(1, memory_order_relaxed);
>
> // Thread 2:
> r1 = x.load(memory_order_relaxed);
> atomic_thread_fence(memory_order_seq_cst);
> r2 = y.load(memory_order_relaxed);
>
> // Thread 3:
> y.store(1, memory_order_relaxed);
> atomic_thread_fence(memory_order_seq_cst);
> r3 = x.load(memory_order_relaxed);
> /***************/
>
> In this code, is there any hardware that permits
> the outcome r1 == 1 && r2 == 0 && r3 == 0?

The Itanium memory model permits the outcome r1 == 1 && r2 == 0 && r3 == 0
if plain st is used for relaxed stores.

>
> If not, I want C++MM to prohibit such outcome.

The final draft C++MM nevertheless prohibits such an outcome, and thus
relaxed stores must use st.rel on Itanium.

regards,
alexander.

Edek

Jul 10, 2011, 1:10:26 PM
On 06/16/2011 12:03 PM, Alexander Terekhov wrote:
>
> Edek wrote:
> [...]
>> As a side note, I compiled all ops on x86_64, and atomic<int>::load
>> is mfence/load/mfence even for relaxed. I'd need some time to analyse
>> the thing from perspective of other ops implementations in all
>> orderings, and this leads me to the not so subtle question:
>>
>> I understand people _should definitely_ use the default.
>> I assume the goal, or the official word for it is 'rationale', was
>> to let people 'who understand' write code that would be both portable
>> and more efficient than the default. I am somehow beginning to doubt
>> both my capabilities and whether this will be at all possible - I mean
>> the efficiency, not portability (the latter can be simply done right).
>> Though maybe the bottom line is that at least the precise implementation
>> will have at least a couple of fences less than the default, at the
>> price of - I think no one argues about that - complexity.
>
> Try
>
> http://www.decadent.org.uk/pipermail/cpp-threads/2008-December/001933.html
> (Brief tentative example x86 Implementation for C/C++ Memory Model)
>
> and
>
> http://portal.acm.org/citation.cfm?id=1926394
> (Mathematizing C++ concurrency)
>
> regards,
> alexander.

I wanted to say something after digesting them, but that seems to be a
future task. So:

thanks. It is a pity I cannot follow everything interesting here. I have
read them, but I need to dig deeper sometime later.

Edek

Edek

Jul 10, 2011, 1:47:43 PM
On 06/16/2011 06:32 PM, Alexander Terekhov wrote:
>
> Anthony Williams wrote:
> [...]
>> For reference, 29.3p8 says:
>>
>> [ Note: memory_order_seq_cst ensures sequential consistency only for a
>> program that is free of data races and uses exclusively
>> memory_order_seq_cst operations. Any use of weaker ordering will
>> invalidate this guarantee unless extreme care is used. In particular,
>
> This is appalling.
>
> regards,
> alexander.

This is beautiful.

The way I see it, seq_cst provides the default behaviour, and all
other orderings are supposed to serve dedicated purposes. The bad
side effect could be that seq_cst affects everything else, including
orderings as weak as relaxed, in the sense that they would be
required to be heavier than otherwise needed.

But I understand from your post (5th July) that this is still the
case on Itanium.

The point I am trying to make is that the above quote is meant
to eliminate this side effect, although it does not succeed in every
case. I do not understand why this is a problem; but then, I do
not fully understand why relaxed ordering is really needed either.
Not that I do not see the purpose of relaxed if the compiler does a
literal line-by-line translation - then a plain store (or load)
can be better than that plus some fence - but I see no reason
why a compiler (or a future CPU) could not eliminate the overhead. [1]
Of course, this is limited to cases where a relaxed atomic cannot be
replaced by non-atomic data.

I think I understand why you say this is appalling, but it looks
like a reasonable API contract to me. It does affect operations with
less strict memory orderings, but without the above clause
things would be worse, not better.

Am I missing anything obvious, especially in the space between
sequentially ordered and "in reasonable time" operations?

Edek


[1] Take:

line 1: x.store(val, relaxed)
line 2: y.store(val, release)

Suppose line 1 needs some fence to satisfy the memory model; perhaps
that fence can be omitted on a particular platform, because line 2
already provides enough ordering to "include" line 1? Bringing
seq_cst into the picture would, if my logic is correct, ruin this.
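
Written out in full, the two lines in [1] are just the usual
release/acquire publication idiom (a sketch; the names data and flag
and the value 42 are only illustrative):

/*** sketch ***/
#include <atomic>
#include <cstdio>
#include <thread>

std::atomic<int> data(0); // plays the role of x from "line 1"
std::atomic<int> flag(0); // plays the role of y from "line 2"

void producer() {
    data.store(42, std::memory_order_relaxed); // "line 1"
    flag.store(1, std::memory_order_release);  // "line 2"
}

void consumer() {
    while (flag.load(std::memory_order_acquire) != 1)
        ; // spin until the flag is published
    // The acquire load synchronizes with the release store on flag, so the
    // relaxed store on "line 1" happens-before this load and must be visible.
    std::printf("%d\n", data.load(std::memory_order_relaxed)); // prints 42
}

int main() {
    std::thread c(consumer);
    std::thread p(producer);
    p.join();
    c.join();
    return 0;
}
/***************/

Here it is the release on "line 2" that carries all the ordering
"line 1" needs; whether a platform can then drop any extra fence for
the relaxed store is purely a code-generation question.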
