C++0x: sequentially consistent atomic operations

Dmitriy V'jukov

unread,

Aug 6, 2008, 9:58:23 AM8/6/08

to

I'm trying to figure out what synchronizes-with edges sequentially
consistent atomic operations must introduce.
My reasoning is based on latest C++0x draft:
http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2008/n2691.pdf
And Peter Dimov's excellent proposal on bidirectional fences:
http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2008/n2633.html

First example:

std::atomic<int> x, y; // both initially 0
void thread1()
{
x.fetch_add(1, std::memory_order_seq_cst);
int r1 = y.load(std::memory_order_relaxed);
}
void thread1()
{
y.fetch_add(1, std::memory_order_seq_cst);
int r2 = x.load(std::memory_order_relaxed);
}

As I understand, (r1 == 0) && (r2 == 0) is possible output, because no
synchronizes-with edges introduced (at least I can't find anything
which says the opposite).
Is this intended? It's quite counter-intuitive. fetch_add operations
are ordered, and they both acquire and release. So why here is no
synchronizes-with edge between them? Then I don't understand what
differentiate seq_cst operations from acq_rel operations, here seq_cst
operations acts exactly like acq_rel operations.

Second example:

std::atomic<int> x, y; // both initially 0
void thread1()
{
x.fetch_add(1, std::memory_order_seq_cst);
int r1 = y.load(std::memory_order_relaxed);
}
void thread1()
{
y.store(1, std::memory_order_relaxed);
std::atomic_thread_fence(std::memory_order_seq_cst);
int r2 = x.load(std::memory_order_relaxed);
}

Here is the same story.

I think that for seq_cst operations there must be something like:
"Sequentially consistent operation or fence which is release
synchronizes-with EVERY subsequent in total order S (of all
sequentially consistent operations and fences) sequentially consistent
operation or fence which is acquire"

Then output (r1 == 0) && (r2 == 0) will be prohibited in both
examples.

Dmitriy V'jukov

Peter Dimov

unread,

Aug 6, 2008, 3:09:28 PM8/6/08

to

On Aug 6, 4:58 pm, "Dmitriy V'jukov" <dvyu...@gmail.com> wrote:
> I'm trying to figure out what synchronizes-with edges sequentially
> consistent atomic operations must introduce.
> My reasoning is based on latest C++0x draft:http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2008/n2691.pdf
> And Peter Dimov's excellent proposal on bidirectional fences:http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2008/n2633.html
>
> First example:
>
> std::atomic<int> x, y; // both initially 0
> void thread1()
> {
> x.fetch_add(1, std::memory_order_seq_cst);
> int r1 = y.load(std::memory_order_relaxed);}
>
> void thread1()
> {
> y.fetch_add(1, std::memory_order_seq_cst);
> int r2 = x.load(std::memory_order_relaxed);
>
> }
>
> As I understand, (r1 == 0) && (r2 == 0) is possible output, because no
> synchronizes-with edges introduced (at least I can't find anything
> which says the opposite).
> Is this intended?

I didn't participate in specifying the seq_cst semantics (Herb Sutter
was, I believe, the primary driving force behind them) but I think
that this is by design. seq_cst works only if you apply it
consistently; it's not designed to integrate with relaxed.

void thread1()
{
x.store( 1, std::memory_order_seq_cst );
int r1 = y.load( std::memory_order_seq_cst );
}

void thread2()
{
y.store( 1, std::memory_order_seq_cst );
int r2 = x.load( std::memory_order_seq_cst );
}

I think that this was done for efficiency reasons.

> Second example:
>
> std::atomic<int> x, y; // both initially 0
> void thread1()
> {
> x.fetch_add(1, std::memory_order_seq_cst);
> int r1 = y.load(std::memory_order_relaxed);}
>
> void thread1()
> {
> y.store(1, std::memory_order_relaxed);
> std::atomic_thread_fence(std::memory_order_seq_cst);
> int r2 = x.load(std::memory_order_relaxed);
> }

seq_cst fences share the general seq_cst property of not playing well
with others:

void thread1()
{
x.store( 1, std::memory_order_relaxed );
std::atomic_thread_fence( std::memory_order_seq_cst );
int r1 = y.load( std::memory_order_relaxed );
}

void thread2()
{
y.store( 1, std::memory_order_relaxed );
std::atomic_thread_fence( std::memory_order_seq_cst );
int r2 = x.load( std::memory_order_relaxed );
}

Dmitriy V'jukov

unread,

Aug 6, 2008, 3:33:07 PM8/6/08

to

On Aug 6, 11:09 pm, Peter Dimov <pdi...@gmail.com> wrote:

> I didn't participate in specifying the seq_cst semantics (Herb Sutter
> was, I believe, the primary driving force behind them) but I think
> that this is by design. seq_cst works only if you apply it
> consistently; it's not designed to integrate with relaxed.
>
> void thread1()
> {
> x.store( 1, std::memory_order_seq_cst );
> int r1 = y.load( std::memory_order_seq_cst );
>
> }
>
> void thread2()
> {
> y.store( 1, std::memory_order_seq_cst );
> int r2 = x.load( std::memory_order_seq_cst );
>
> }

Yes, this will work for sure.
So seq_cst is like 'const' or 'GPL' :)

> I think that this was done for efficiency reasons.

As I understand, if (1) operations are ordered (seq_cst operations are
definitely ordered), and (2) operations are both acquire and release
(seq_cst RMW is acquire and release for sure), then (3) they
synchronize-with automatically. On which architecture it's possible to
satisfy (1) and (2), but not (3)? It's quite counter-intuitive for
me...

> seq_cst fences share the general seq_cst property of not playing well
> with others:

The problem is that sometimes you don't have control over some atomic
operations.
For example we have queue:

class queue
{
void push(v)
{
...
head.exchange(..., std::memory_order_seq_cst);
}
};

We can't change the queue.
And we want to add 'blocking':

class blocking_queue
{
queue q;

void push(v)
{
q.push(v);
// here we need to synchronize-with concurrent pop(),
// and queue::push() already execute strong enough memory fence,
// but we can't take advantage of it, we have to execute one more
// strong memory fence
std::atomic_thread_fence(std::memory_order_seq_cst);
if (load_something())
...
}
};

Dmitriy V'jukov

Peter Dimov

unread,

Aug 6, 2008, 6:53:29 PM8/6/08

to

On Aug 6, 10:33 pm, "Dmitriy V'jukov" <dvyu...@gmail.com> wrote:
> On Aug 6, 11:09 pm, Peter Dimov <pdi...@gmail.com> wrote:

...

> > I think that this was done for efficiency reasons.
>
> As I understand, if (1) operations are ordered (seq_cst operations are
> definitely ordered), and (2) operations are both acquire and release
> (seq_cst RMW is acquire and release for sure), then (3) they
> synchronize-with automatically. On which architecture it's possible to
> satisfy (1) and (2), but not (3)? It's quite counter-intuitive for
> me...

The basic principle is that if A synchronizes-with B, then A is a
write with release semantics and B is a read with acquire semantics. A
read can't synchronize-with something. A seq_cst load is not acq_rel
(release doesn't apply to loads), it only has acquire semantics.
Ordinary operations can still sink below a seq_cst load as a result of
compiler or hardware reorderings, and this doesn't break the
sequential consistency of a data-race-free program. (If my
understanding is correct.)

> The problem is that sometimes you don't have control over some atomic
> operations.

I know, and this is why the fences are defined to not only synchronize
with each other, but with other acquire and release operations as
well. The classic "write sync-with read" model is extended so that a
fence can be substituted for either or both the "source" and the
"target" of the synchronizes-with edge. But it still does not allow
"read sync-with fence" and the other arbitrary combinations; this
would be against the spirit of the current memory model.

(Even the current modification is viewed as risky.)

Dmitriy V'jukov

unread,

Aug 7, 2008, 7:14:15 AM8/7/08

to

On Aug 7, 2:53 am, Peter Dimov <pdi...@gmail.com> wrote:

> > As I understand, if (1) operations are ordered (seq_cst operations are
> > definitely ordered), and (2) operations are both acquire and release
> > (seq_cst RMW is acquire and release for sure), then (3) they
> > synchronize-with automatically. On which architecture it's possible to
> > satisfy (1) and (2), but not (3)? It's quite counter-intuitive for
> > me...
>
> The basic principle is that if A synchronizes-with B, then A is a
> write with release semantics and B is a read with acquire semantics. A
> read can't synchronize-with something. A seq_cst load is not acq_rel
> (release doesn't apply to loads), it only has acquire semantics.
> Ordinary operations can still sink below a seq_cst load as a result of
> compiler or hardware reorderings, and this doesn't break the
> sequential consistency of a data-race-free program. (If my
> understanding is correct.)

Let's for now consider only seq_cst atomic RMW.
What you mean:
1. This is basic principle of how hardware works. If so, then no
questions. I'm just curious what hardware can satisfy (1) and (2), but
not (3).
2. This is basic principle of current C++0x memory model. If so, then
the question is whether current C++0x memory model can be extended to
something like this:

"Sequentially consistent operation or fence which is release
synchronizes-with EVERY subsequent in total order S (of all
sequentially consistent operations and fences) sequentially consistent

operation or fence which is acquire"?
My supposition is that such rules can be efficiently implemented, and
such rules are intuitive and useful. What do you think?

> > The problem is that sometimes you don't have control over some atomic
> > operations.
>
> I know, and this is why the fences are defined to not only synchronize
> with each other, but with other acquire and release operations as
> well. The classic "write sync-with read" model is extended so that a
> fence can be substituted for either or both the "source" and the
> "target" of the synchronizes-with edge. But it still does not allow
> "read sync-with fence" and the other arbitrary combinations; this
> would be against the spirit of the current memory model.

There are 2 kinds of synchronization:
1. Let's call it "queue". One thread have to pass some information to
another. First thread executes release, and second - acquire. In this
situation what you've described (synchronization between store-release
and fence-acquire, and synchronization between fence-release and load-
acquire) works perfectly. I.e. fences and atomic operations can co-
operate in this kind of synchronization.
2. Let's call it "mutex". Two threads have to decide "who is the
first". Examples are mutex acquire with atomic RMW; Peterson's mutex
acquire with seq_cst fences; signaling on eventcount. And in this
situation what you've described doesn't work at all (not counting fact
that two seq_cst fences synchronizes-with). Current model was NOT
extended to handle this kind of synchronization. I.e. fences and
atomic operations don't co-operate at all in this kind of
synchronization. It's quite "dishonestly". And in some situations it
can basically double synchronization overheads, just to "stay
consistent with C++0x rules".

Dmitriy V'jukov

Peter Dimov

unread,

Aug 7, 2008, 10:01:51 AM8/7/08

to

On Aug 7, 2:14 pm, "Dmitriy V'jukov" <dvyu...@gmail.com> wrote:

> 2. This is basic principle of current C++0x memory model. If so, then
> the question is whether current C++0x memory model can be extended to
> something like this:
> "Sequentially consistent operation or fence which is release
> synchronizes-with EVERY subsequent in total order S (of all
> sequentially consistent operations and fences) sequentially consistent
> operation or fence which is acquire"?

Maybe. To make sure that I understand, this would allow fence.seq_cst
in signal and cas.seq_cst in get, right? You'll still need
load.seq_cst in "signal_relaxed" though, even when the publication is
done with store.seq_cst. The implicit #StoreLoad is (conceptually)
only injected between two seq_cst ops, it won't necessarily be present
if the load is not seq_cst.

Dmitriy V'jukov

unread,

Aug 7, 2008, 11:19:45 AM8/7/08

to

No. The ultimate goal is:

int r1 = 0, r2 = 0;

void producer()
{
// publication
head.exchange(..., seq_cst);
// signaling
if (eventcount.load(relaxed))
int r1 = 1;
}

void consumer()
{
// consumption
node = head.load(relaxed);
if (0 == node)
{
// get
eventcount.exchange(..., seq_cst);
// consumption one more time
node = head.load(relaxed);
if (node)
r2 = 1;
}
}

And I want that (r1 == 0) && (r2 == 0) will be prohibited. I.e. or
publication must synchronize-with get, or vice versa.

On x86 this must result in "optimal" implementation:

void producer()
{
// publication
// head.exchange(..., seq_cst);
InterlockedExchange(...);
// signaling
// if (eventcount.load(relaxed))
if (plain_load(eventcount))
int r1 = 1;
}

void consumer()
{
// consumption
// node = head.load(relaxed);
node = plain_load(head);
if (0 == node)
{
// get
// eventcount.exchange(..., seq_cst);
InterlockedExchange(...);
// consumption one more time
// node = head.load(relaxed);
node = plain_load(head);
if (node)
r2 = 1;
}
}

Dmitriy V'jukov

Peter Dimov

unread,

Aug 7, 2008, 3:26:30 PM8/7/08

to

On Aug 7, 6:19 pm, "Dmitriy V'jukov" <dvyu...@gmail.com> wrote:

> No. The ultimate goal is:
>
> int r1 = 0, r2 = 0;
>
> void producer()
> {
> // publication
> head.exchange(..., seq_cst);
> // signaling
> if (eventcount.load(relaxed))
> int r1 = 1;
>
> }
>
> void consumer()
> {
> // consumption
> node = head.load(relaxed);
> if (0 == node)
> {
> // get
> eventcount.exchange(..., seq_cst);
> // consumption one more time
> node = head.load(relaxed);
> if (node)
> r2 = 1;
> }
>
> }
>
> And I want that (r1 == 0) && (r2 == 0) will be prohibited. I.e. or
> publication must synchronize-with get, or vice versa.

Under the C++MM, relaxed operations never synchronize with anything,
and you have no fences, so this won't work (in principle), no matter
what the fence semantics are. Unless I'm missing something.

(At least the first load in consumer needs to have acquire semantics
though. Or you can add an acquire fence in an 'else'.)

> On x86 this must result in "optimal" implementation:

...

On x86, all loads map to a mov instruction, even load.seq_cst (if my
understanding is correct), so if x86 is your target, you can just use
load.seq_cst. The hardware that is closest to the edge of the C++MM is
POWER. I must admit that I don't know for sure how seq_cst is
implemented on POWER though. It's possible that the above will work
there in practice.

Dmitriy V'jukov

unread,

Aug 7, 2008, 5:03:15 PM8/7/08

to

On 7 авг, 23:26, Peter Dimov <pdi...@gmail.com> wrote:

> > And I want that (r1 == 0) && (r2 == 0) will be prohibited. I.e. or
> > publication must synchronize-with get, or vice versa.
>
> Under the C++MM, relaxed operations never synchronize with anything,
> and you have no fences, so this won't work (in principle), no matter
> what the fence semantics are. Unless I'm missing something.

Sh#t! There must be:
node = head.load(acquire);

I want to seq_cst atomic RMW in one thread synchronize-with seq_cst
atomic RMW in another thread. Forget about current C++0x MM for a
moment. It this reasonably?

> > On x86 this must result in "optimal" implementation:
>
> ...
>
> On x86, all loads map to a mov instruction, even load.seq_cst (if my
> understanding is correct), so if x86 is your target, you can just use
> load.seq_cst. The hardware that is closest to the edge of the C++MM is
> POWER. I must admit that I don't know for sure how seq_cst is
> implemented on POWER though. It's possible that the above will work
> there in practice.

No, x86 is not my target. Well, if seq_cst load will be implemented as
plain load on ALL architectures, maybe this changes the situation... I
have to think some more. At least examples in first post will be
working, if I change relaxed loads to seq_cst loads. But is this the
case, that seq_cst loads will implemented as plain loads on ALL
architectures?

Dmitriy V'jukov

Peter Dimov

unread,

Aug 7, 2008, 5:42:11 PM8/7/08

to

On Aug 8, 12:03 am, "Dmitriy V'jukov" <dvyu...@gmail.com> wrote:
> I want to seq_cst atomic RMW in one thread synchronize-with seq_cst
> atomic RMW in another thread.

Two RMW ops targeting different locations?

As I understand it, the design goal of seq_cst atomics was to make it
possible for programmers to write sequentially consistent code without
sacrificing performance (or rather, sacrificing it as little as
possible). Typically, two seq_cst operations to different locations do
not need to synchronize with one another to achieve SC. Imposing this
requirement will decrease the (theoretical) performance of seq_cst-
using programs.

> But is this the case, that seq_cst loads will implemented as plain loads on ALL architectures?

My understanding of the POWER MM is very limited, but I've heard that
a seq_cst load there maps to SYNC; LL/SC loop; SYNC. Which is terribly
expensive. I suspect that for optimal performance one would still need
two or three variants of the same algorithm.

Dmitriy V'jukov

unread,

Aug 8, 2008, 3:26:51 AM8/8/08

to

On 8 авг, 01:42, Peter Dimov <pdi...@gmail.com> wrote:
> On Aug 8, 12:03 am, "Dmitriy V'jukov" <dvyu...@gmail.com> wrote:
>
> > I want to seq_cst atomic RMW in one thread synchronize-with seq_cst
> > atomic RMW in another thread.
>
> Two RMW ops targeting different locations?

Yes.

> As I understand it, the design goal of seq_cst atomics was to make it
> possible for programmers to write sequentially consistent code without
> sacrificing performance (or rather, sacrificing it as little as
> possible). Typically, two seq_cst operations to different locations do
> not need to synchronize with one another to achieve SC. Imposing this
> requirement will decrease the (theoretical) performance of seq_cst-
> using programs.

Consider slides 11 and 12 from Anthony Williams presentation:
http://accu.org/content/conf2008/Williams-future_of_concurrency.pdf

seq_cst stores and RMW have to synchronize with one another, because
they have to from global order. This is the main difference from
acq_rel. So they already synchronize with one another, why not declare
that they also 'synchronize-with' one another?

> > But is this the case, that seq_cst loads will implemented as plain loads on ALL architectures?
>
> My understanding of the POWER MM is very limited, but I've heard that
> a seq_cst load there maps to SYNC; LL/SC loop; SYNC. Which is terribly
> expensive. I suspect that for optimal performance one would still need
> two or three variants of the same algorithm.

Oh! I definitely don't want to insert 2 superfluous 'SYNC; LL/SC loop;
SYNC'!
What about several variants of the algorithm, yeah, maybe sometimes
this will be the case. But how I can express example with eventcount:
http://groups.google.com/group/comp.programming.threads/msg/92692b9b70106e9f
in optimal for POWER way?
I still think that allowing two seq_cst atomic RMW to different
locations to 'synchronize-with' one another is beneficial...

Dmitriy V'jukov

Peter Dimov

unread,

Aug 8, 2008, 7:30:13 AM8/8/08

to

On Aug 8, 10:26 am, "Dmitriy V'jukov" <dvyu...@gmail.com> wrote:

> seq_cst stores and RMW have to synchronize with one another, because
> they have to from global order. This is the main difference from
> acq_rel. So they already synchronize with one another, why not declare
> that they also 'synchronize-with' one another?

I'm not 100% positive, but I suspect that an implementation where
store.seq_cst maps to MOV, and load.seq_cst maps to LOCK XADD, is
conforming. Stated differently, I think that, while the current model
demands this:

// T1
store.seq_cst x, 1
load.seq_cst y

// T2
store.seq_cst y, 1
load.seq_cst x

to "work", or to not load (0, 0), it is possible to write an
implementation which would make one of these two:

// T1
store.release x, 1
load.seq_cst y

// T2
store.release y, 1
load.seq_cst x

//

// T1
store.seq_cst x, 1
load.acquire y

// T2
store.seq_cst y, 1
load.acquire x

to "fail". (I don't know if it's possible in practice to make an
implementation that would "fail" both.)

> What about several variants of the algorithm, yeah, maybe sometimes

> this will be the case. But how I can express example with eventcount:http://groups.google.com/group/comp.programming.threads/msg/92692b9b7...

> in optimal for POWER way?

I suspect that for POWER the best course of action would be to write
the assembly and then backport to fences, replacing SYNC with
fence.seq_cst and LWSYNC with fence.acq_rel (or acquire, or release,
depending on context). :-)

Dmitriy V'jukov

unread,

Aug 11, 2008, 3:15:13 PM8/11/08

to

But this is not an end in itself. If C++MM will require that seq_cst
operations must synchronize-with each other, then former
implementation will be just prohibited. Is it bad? I think that latter
implementation (stores are heavier, and loads are easier) is
preferable anyway. What I miss here?

Dmitriy V'jukov

Anthony Williams

unread,

Aug 13, 2008, 4:09:45 AM8/13/08

to

"Dmitriy V'jukov" <dvy...@gmail.com> writes:

> I'm trying to figure out what synchronizes-with edges sequentially
> consistent atomic operations must introduce.
> My reasoning is based on latest C++0x draft:
> http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2008/n2691.pdf
> And Peter Dimov's excellent proposal on bidirectional fences:
> http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2008/n2633.html
>
> First example:
>
> std::atomic<int> x, y; // both initially 0
> void thread1()
> {
> x.fetch_add(1, std::memory_order_seq_cst);
> int r1 = y.load(std::memory_order_relaxed);
> }
> void thread1()
> {
> y.fetch_add(1, std::memory_order_seq_cst);
> int r2 = x.load(std::memory_order_relaxed);
> }
>
> As I understand, (r1 == 0) && (r2 == 0) is possible output, because no
> synchronizes-with edges introduced (at least I can't find anything
> which says the opposite).
> Is this intended? It's quite counter-intuitive. fetch_add operations
> are ordered, and they both acquire and release. So why here is no
> synchronizes-with edge between them? Then I don't understand what
> differentiate seq_cst operations from acq_rel operations, here seq_cst
> operations acts exactly like acq_rel operations.

This is intended. Operations on different variables cannot synchronize
with each other.

Unless *all* your operations are seq_cst, seq_cst is the same as
acquire for loads, release for stores and acq_rel for RMW.

> Second example:
>
> std::atomic<int> x, y; // both initially 0
> void thread1()
> {
> x.fetch_add(1, std::memory_order_seq_cst);
> int r1 = y.load(std::memory_order_relaxed);
> }
> void thread1()
> {
> y.store(1, std::memory_order_relaxed);
> std::atomic_thread_fence(std::memory_order_seq_cst);
> int r2 = x.load(std::memory_order_relaxed);
> }
>
> Here is the same story.

Yes. As Peter already said, relaxed operations can't participate in
synchronize-with relationships.

Anthony
--
Anthony Williams | Just Software Solutions Ltd
Custom Software Development | http://www.justsoftwaresolutions.co.uk
Registered in England, Company Number 5478976.
Registered Office: 15 Carrallack Mews, St Just, Cornwall, TR19 7UL

Anthony Williams

unread,

Aug 13, 2008, 4:11:31 AM8/13/08

to

Peter Dimov <pdi...@gmail.com> writes:

> On x86, all loads map to a mov instruction, even load.seq_cst

Yes. If store.seq_cst is XCHG, load.seq_cst need only be a plain MOV.

Dmitriy V'jukov

unread,

Aug 15, 2008, 6:55:00 PM8/15/08

to

On 13 авг, 12:09, Anthony Williams <anthony....@gmail.com> wrote:

> > As I understand, (r1 == 0) && (r2 == 0) is possible output, because no
> > synchronizes-with edges introduced (at least I can't find anything
> > which says the opposite).
> > Is this intended? It's quite counter-intuitive. fetch_add operations
> > are ordered, and they both acquire and release. So why here is no
> > synchronizes-with edge between them? Then I don't understand what
> > differentiate seq_cst operations from acq_rel operations, here seq_cst
> > operations acts exactly like acq_rel operations.
>
> This is intended.

Ok.
But then I am just curious - why?
It is impossible to implement efficiently on some architectures?
It will over-complicate definitions?
It is just useless?

> > Here is the same story.
>
> Yes. As Peter already said, relaxed operations can't participate in
> synchronize-with relationships.

NO! Not relaxed operations. One seq_cst operation synchronizes-with
another seq_cst operation. Relaxed operations are under standard
rules.

--
Dmitriy V'jukov

Relacy Race Detector: Make your synchronization correct!
http://groups.google.ru/group/relacy/web

Anthony Williams

unread,

Aug 18, 2008, 8:02:03 AM8/18/08

to

"Dmitriy V'jukov" <dvy...@gmail.com> writes:

> On 13 авг, 12:09, Anthony Williams <anthony....@gmail.com> wrote:
>
>> > As I understand, (r1 == 0) && (r2 == 0) is possible output, because no
>> > synchronizes-with edges introduced (at least I can't find anything
>> > which says the opposite).
>> > Is this intended? It's quite counter-intuitive. fetch_add operations
>> > are ordered, and they both acquire and release. So why here is no
>> > synchronizes-with edge between them? Then I don't understand what
>> > differentiate seq_cst operations from acq_rel operations, here seq_cst
>> > operations acts exactly like acq_rel operations.
>>
>> This is intended.
>
> Ok.
> But then I am just curious - why?

I don't know the full details, but the impression I got from the
atomics working group was that if you want full ordering use seq_cst
everywhere. If you care about efficiency, seq_cst is generally too
expensive anyway, so we don't want seq_cst semantics to affect the
ordering of other operations.

seq_cst operations form a total order. On some architectures this is
very expensive to do (imagine a distributed system). Forcing this
total order to add synchronizes-with edges increases that cost even
more, as the other non-seq-cst operations now have to participate in
the order.

>> > Here is the same story.
>>
>> Yes. As Peter already said, relaxed operations can't participate in
>> synchronize-with relationships.
>
> NO! Not relaxed operations. One seq_cst operation synchronizes-with
> another seq_cst operation. Relaxed operations are under standard
> rules.

Your example was:

std::atomic<int> x, y; // both initially 0
void thread1()
{
x.fetch_add(1, std::memory_order_seq_cst);
int r1 = y.load(std::memory_order_relaxed);
}

void thread2()

{
y.store(1, std::memory_order_relaxed);
std::atomic_thread_fence(std::memory_order_seq_cst);
int r2 = x.load(std::memory_order_relaxed);
}

As I understand it, you want the fence to "synchronize-with" the
fetch_add in some way, so that r1=r2=0 is prohibited.

Fences don't work that way. The fence only participates in
synchronizes-with if the load of y from thread1 is an acquire *and
sees the value stored by thread2*, or the load of x from thread2 is an
acquire *and sees the value stored by thread1*.

For the fence to force the ordering would require that the relaxed
operations were part of the total ordering: not only would the
runtime/hardware have to ensure that all the seq_cst operations
happened in a total order, but that all accesses to all variables that
occurred before or after a seq_cst operation were made visible to
other threads. This would be a major cost, as every seq_cst operation
would then essentially have to synchronize the entire data set of the
application.

Peter Dimov

unread,

Aug 18, 2008, 6:56:36 PM8/18/08

to

On Aug 11, 10:15 pm, "Dmitriy V'jukov" <dvyu...@gmail.com> wrote:

> But this is not an end in itself. If C++MM will require that seq_cst
> operations must synchronize-with each other, then former
> implementation will be just prohibited. Is it bad? I think that latter
> implementation (stores are heavier, and loads are easier) is
> preferable anyway. What I miss here?

I think that this prevents a number of currently-valid optimizations.
Consider this example:

// T1:

x.store( 1, relaxed );
q1.store( 1, seq_cst );
r1 = y.load( relaxed );

// T2:

y.store( 1, relaxed );
q2.store( 1, seq_cst );
r1 = x.load( relaxed );

Currently, the seq_cst stores do not prevent upward reordering of the
loads (on compiler or hardware level) because a seq_cst store is only
a release operation. (You can of course use q.swap to prevent it.)
This reordering doesn't break sequential consistency for race-free
programs that use only seq_cst atomics and ordinary non-atomic ops and
may potentially improve their performance.

Further, assume that q1 is local to T1 and q2 is local to T2. In this
case, the C++MM allows operations on these locations to be optimized
out. If they were required to synchronize with one another, it
wouldn't be possible for them to be ignored, even when the location is
private to the thread.

I'm not actually claiming that these optimizations are important in
practice, just that they may have been the reason for the C++MM to be
specified in the way it is.

Dmitriy V'jukov

unread,

Aug 19, 2008, 6:48:00 PM8/19/08

to

On 19 авг, 02:56, Peter Dimov <pdi...@gmail.com> wrote:
> On Aug 11, 10:15 pm, "Dmitriy V'jukov" <dvyu...@gmail.com> wrote:
>
> > But this is not an end in itself. If C++MM will require that seq_cst
> > operations must synchronize-with each other, then former
> > implementation will be just prohibited. Is it bad? I think that latter
> > implementation (stores are heavier, and loads are easier) is
> > preferable anyway. What I miss here?
>
> I think that this prevents a number of currently-valid optimizations.
> Consider this example:
>
> // T1:
>
> x.store( 1, relaxed );
> q1.store( 1, seq_cst );
> r1 = y.load( relaxed );
>
> // T2:
>
> y.store( 1, relaxed );
> q2.store( 1, seq_cst );
> r1 = x.load( relaxed );
>
> Currently, the seq_cst stores do not prevent upward reordering of the
> loads (on compiler or hardware level) because a seq_cst store is only
> a release operation. (You can of course use q.swap to prevent it.)
> This reordering doesn't break sequential consistency for race-free
> programs that use only seq_cst atomics and ordinary non-atomic ops and
> may potentially improve their performance.

I was talking only about seq_cst RMW operations.
Seq_cst store doesn't have to synchronize-with other seq_cst store.
Well, ok, in "my proposal" seq_csq store can synchronize-with seq_cst
load. But this makes a little sense, because if load will be before
store in total order, then nobody will synchronize-with.

> Further, assume that q1 is local to T1 and q2 is local to T2. In this
> case, the C++MM allows operations on these locations to be optimized
> out. If they were required to synchronize with one another, it
> wouldn't be possible for them to be ignored, even when the location is
> private to the thread.
>
> I'm not actually claiming that these optimizations are important in
> practice, just that they may have been the reason for the C++MM to be
> specified in the way it is.

Yes, such optimizations will be prohibited. And yes, I agree that
"thread local atomics" are NOT the right use case to reason about
memory model...

Dmitriy V'jukov

unread,

Aug 19, 2008, 7:26:32 PM8/19/08

to

On 18 авг, 16:02, Anthony Williams <anthony....@gmail.com> wrote:

> I don't know the full details, but the impression I got from the
> atomics working group was that if you want full ordering use seq_cst
> everywhere. If you care about efficiency, seq_cst is generally too
> expensive anyway, so we don't want seq_cst semantics to affect the
> ordering of other operations.

For now, I don't yet see where "my proposal" will affect performance
of non seq_cst operations.

The problem that it's not always possible to consistently use only
seq_cst operations, or only not seq_cst operations. Current rules hit
composability. Assume you want to use component which uses seq_cst
operations, but you don't want to use seq_cst operations in the rest
of the code. You will have to insert additional atomic operations only
to integrate the component. Or assume that you have to integrate
component which uses seq_cst operations with another component which
doesn't use seq_cst operations. The same problem.
Or assume that one algorithm can be expressed more naturally with
seq_cst operations, and another - with non seq_cst operations. And you
have to use both. Once again, you will have to insert additional
atomic operations only to integrate them.
Or even worse. You are writing generic algorithm (something like Chris
Thomasson's eventcount). And you don't know with what you will have to
synchronize. With seq_cst fence, or with seq_cst rmw operation. Ooops!
Currently you have to issue both - and fence, and rmw operation.
I see clear advantage in better cooperation of seq_seq operations with
the rest of the world.

> seq_cst operations form a total order. On some architectures this is
> very expensive to do (imagine a distributed system). Forcing this
> total order to add synchronizes-with edges increases that cost even
> more, as the other non-seq-cst operations now have to participate in
> the order.

But they have anyway!
Because seq_cst store is release, and seq_cst load is acquire. So all
operations anyway have to complete before seq_cst store, because other
thread can issue load-acquire which will load value stored by seq_cst
store. And similar for seq_cst load.
Maybe I'm missing something, but, according to my current
understanding, what I'm proposing will work anyway, it's just formally
not legal.

> >> > Here is the same story.
>
> >> Yes. As Peter already said, relaxed operations can't participate in
> >> synchronize-with relationships.
>
> > NO! Not relaxed operations. One seq_cst operation synchronizes-with
> > another seq_cst operation. Relaxed operations are under standard
> > rules.
>
> Your example was:
>
> std::atomic<int> x, y; // both initially 0
> void thread1()
> {
> x.fetch_add(1, std::memory_order_seq_cst);
> int r1 = y.load(std::memory_order_relaxed);}
>
> void thread2()
> {
> y.store(1, std::memory_order_relaxed);
> std::atomic_thread_fence(std::memory_order_seq_cst);
> int r2 = x.load(std::memory_order_relaxed);
>
> }
>
> As I understand it, you want the fence to "synchronize-with" the
> fetch_add in some way, so that r1=r2=0 is prohibited.

Yes.

> Fences don't work that way. The fence only participates in
> synchronizes-with if the load of y from thread1 is an acquire *and
> sees the value stored by thread2*, or the load of x from thread2 is an
> acquire *and sees the value stored by thread1*.

I understand that they don't work that way. The question is - why they
don't work that way? And can they work another way?

> For the fence to force the ordering would require that the relaxed
> operations were part of the total ordering: not only would the
> runtime/hardware have to ensure that all the seq_cst operations
> happened in a total order, but that all accesses to all variables that
> occurred before or after a seq_cst operation were made visible to
> other threads. This would be a major cost, as every seq_cst operation
> would then essentially have to synchronize the entire data set of the
> application.

But seq_cst operations have to synchronize the entire data set of the
application anyway!

If standard would not claim that seq_cst store is release and seq_cst
load is acquire, than I would agree. Than seq_cst operations will
cooperate only with other seq_cst operations, and not with release/
acquire operations.
But standard claims that seq_cst store is release and seq_cst load is
acquire, so I can write something like this:
void thread1()
{
non_atomic_data = 1;
atomic_flag.store(1, seq_cst);
}
void thread2()
{
if (1 == atomic_flag.load(acquire))
assert(1 == non_atomic_data);
}

or:
void thread1()
{
non_atomic_data = 1;
atomic_flag.store(1, release);
}
void thread2()
{
if (1 == atomic_flag.load(seq_cst))
assert(1 == non_atomic_data);

Anthony Williams

unread,

Aug 20, 2008, 3:11:28 AM8/20/08

to

"Dmitriy V'jukov" <dvy...@gmail.com> writes:

> On 18 авг, 16:02, Anthony Williams <anthony....@gmail.com> wrote:
>
>> I don't know the full details, but the impression I got from the
>> atomics working group was that if you want full ordering use seq_cst
>> everywhere. If you care about efficiency, seq_cst is generally too
>> expensive anyway, so we don't want seq_cst semantics to affect the
>> ordering of other operations.
>
> For now, I don't yet see where "my proposal" will affect performance
> of non seq_cst operations.

If every seq_cst operation synchronizes-with every other seq_cst
operation, this affects the allowed compiler optimizations. It
effectively makes every seq_cst operation a full fence, even if no
other thread can see that operation:

int global(0);
std::atomic_int global_atomic(0);

void thread1()
{
global=1;
std::atomic_int ai;
ai.store(1,std::memory_order_seq_cst);
global_atomic.store(1,std::memory_order_relaxed);
}

void thread2()
{
if(global_atomic.load(std::memory_order_seq_cst)==1)
{
assert(global==1);
}
}

Under the current memory model, ai is local so it can be completely
eliminated. The stores to global and global_atomic are not ordered, so
the assert in thread2 is actually a data race.

Under your proposal, the store to ai cannot be elimiated since it must
synchronize-with all other seq_cst operations. In particular, it must
synchronize-with the seq_cst load in thread2 of an unrelated
variable. Since the store to global_atomic is after the store to ai,
if the load reads 1, the assert is no longer a data race, and will not
fire either, since the value of global must be 1.

> The problem that it's not always possible to consistently use only
> seq_cst operations, or only not seq_cst operations. Current rules hit
> composability. Assume you want to use component which uses seq_cst
> operations, but you don't want to use seq_cst operations in the rest
> of the code. You will have to insert additional atomic operations only
> to integrate the component. Or assume that you have to integrate
> component which uses seq_cst operations with another component which
> doesn't use seq_cst operations. The same problem.

That depends on how you write your components. Can you give an example?

> Or assume that one algorithm can be expressed more naturally with
> seq_cst operations, and another - with non seq_cst operations. And you
> have to use both. Once again, you will have to insert additional
> atomic operations only to integrate them.

Again, can you give an example?

> Or even worse. You are writing generic algorithm (something like Chris
> Thomasson's eventcount). And you don't know with what you will have to
> synchronize. With seq_cst fence, or with seq_cst rmw operation. Ooops!
> Currently you have to issue both - and fence, and rmw operation.
> I see clear advantage in better cooperation of seq_seq operations with
> the rest of the world.

Can you show the example in code so I can see where you feel that
both the fence and RMW operations are necessary?

>> seq_cst operations form a total order. On some architectures this is
>> very expensive to do (imagine a distributed system). Forcing this
>> total order to add synchronizes-with edges increases that cost even
>> more, as the other non-seq-cst operations now have to participate in
>> the order.
>
> But they have anyway!
> Because seq_cst store is release, and seq_cst load is acquire. So all
> operations anyway have to complete before seq_cst store, because other
> thread can issue load-acquire which will load value stored by seq_cst
> store. And similar for seq_cst load.

Yes, but that's a pairwise ordering: the store-release vs the
load-acquire. If the compiler can see that there is no such ordering,
it can eliminate the cost. Also, it is only synchronization between
two threads, not globally.

> Maybe I'm missing something, but, according to my current
> understanding, what I'm proposing will work anyway, it's just formally
> not legal.

I believe you are wrong. I gave an example above.

>> >> > Here is the same story.
>>
>> >> Yes. As Peter already said, relaxed operations can't participate in
>> >> synchronize-with relationships.
>>
>> > NO! Not relaxed operations. One seq_cst operation synchronizes-with
>> > another seq_cst operation. Relaxed operations are under standard
>> > rules.
>>
>> Your example was:
>>
>> std::atomic<int> x, y; // both initially 0
>> void thread1()
>> {
>> x.fetch_add(1, std::memory_order_seq_cst);
>> int r1 = y.load(std::memory_order_relaxed);}
>>
>> void thread2()
>> {
>> y.store(1, std::memory_order_relaxed);
>> std::atomic_thread_fence(std::memory_order_seq_cst);
>> int r2 = x.load(std::memory_order_relaxed);
>>
>> }
>>
>> As I understand it, you want the fence to "synchronize-with" the
>> fetch_add in some way, so that r1=r2=0 is prohibited.
>
> Yes.
>
>> Fences don't work that way. The fence only participates in
>> synchronizes-with if the load of y from thread1 is an acquire *and
>> sees the value stored by thread2*, or the load of x from thread2 is an
>> acquire *and sees the value stored by thread1*.
>
> I understand that they don't work that way. The question is - why they
> don't work that way? And can they work another way?

I don't see how you could write code that required them to work
another way. You would have no way of knowing whether your operations
had occurred before the fence or after, and thus whether or not it was
safe to read other data. If you introduced a variable to identify
whether or not the synchronization had occurred then you get the
behaviour you require under the current model anyway.

>> For the fence to force the ordering would require that the relaxed
>> operations were part of the total ordering: not only would the
>> runtime/hardware have to ensure that all the seq_cst operations
>> happened in a total order, but that all accesses to all variables that
>> occurred before or after a seq_cst operation were made visible to
>> other threads. This would be a major cost, as every seq_cst operation
>> would then essentially have to synchronize the entire data set of the
>> application.
>
> But seq_cst operations have to synchronize the entire data set of the
> application anyway!

No. They have to synchronize the seq_cst operations themselves, but
other than that they only have to synchronize the data between a
thread that does a release and a thread that does an acquire (even if
those release and acquire are consequences of seq_cst). They do not
force the compiler/hardware to synchronize all data with all threads
that perform seq_cst operations.

> If standard would not claim that seq_cst store is release and seq_cst
> load is acquire, than I would agree. Than seq_cst operations will
> cooperate only with other seq_cst operations, and not with release/
> acquire operations.
> But standard claims that seq_cst store is release and seq_cst load is
> acquire, so I can write something like this:
> void thread1()
> {
> non_atomic_data = 1;
> atomic_flag.store(1, seq_cst);
> }
> void thread2()
> {
> if (1 == atomic_flag.load(acquire))
> assert(1 == non_atomic_data);
> }
>
> or:
> void thread1()
> {
> non_atomic_data = 1;
> atomic_flag.store(1, release);
> }
> void thread2()
> {
> if (1 == atomic_flag.load(seq_cst))
> assert(1 == non_atomic_data);
> }

Yes. That's nice normal pairwise synchronization. Depending on
precisely what operations you use you can still get different results
from different threads if you mix seq_cst with acquire/release.

Dmitriy V'jukov

unread,

Aug 24, 2008, 6:20:35 AM8/24/08

to

On 20 авг, 11:11, Anthony Williams <anthony....@gmail.com> wrote:

> If every seq_cst operation synchronizes-with every other seq_cst
> operation, this affects the allowed compiler optimizations. It
> effectively makes every seq_cst operation a full fence, even if no
> other thread can see that operation:
>
> int global(0);
> std::atomic_int global_atomic(0);
>
> void thread1()
> {
> global=1;
> std::atomic_int ai;
> ai.store(1,std::memory_order_seq_cst);
> global_atomic.store(1,std::memory_order_relaxed);
>
> }
>
> void thread2()
> {
> if(global_atomic.load(std::memory_order_seq_cst)==1)
> {
> assert(global==1);
> }
>
> }
>
> Under the current memory model, ai is local so it can be completely
> eliminated. The stores to global and global_atomic are not ordered, so
> the assert in thread2 is actually a data race.
>
> Under your proposal, the store to ai cannot be elimiated since it must
> synchronize-with all other seq_cst operations. In particular, it must
> synchronize-with the seq_cst load in thread2 of an unrelated
> variable. Since the store to global_atomic is after the store to ai,
> if the load reads 1, the assert is no longer a data race, and will not
> fire either, since the value of global must be 1.

Well, yes. But trying to optimize for "thread local atomics" looks a
bit strange for me. It's like trying to optimize boost::shared_ptr to
the case when where is only one reference...

> > The problem that it's not always possible to consistently use only
> > seq_cst operations, or only not seq_cst operations. Current rules hit
> > composability. Assume you want to use component which uses seq_cst
> > operations, but you don't want to use seq_cst operations in the rest
> > of the code. You will have to insert additional atomic operations only
> > to integrate the component. Or assume that you have to integrate
> > component which uses seq_cst operations with another component which
> > doesn't use seq_cst operations. The same problem.
>
> That depends on how you write your components. Can you give an example?

Consider eventcount implementation:
http://groups.google.com/group/comp.programming.threads/msg/71eae0808af7e0ed
Function eventcount::signal_relaxed() tries to "reuse" heavy memory
fence already executed by message-passing queue's enqueue() function.
But we don't know what kind of "heavy memory fence" was executed. Is
it seq_cst fence? Or seq_cst atomic rmw? If seq_cst atomic rmw, then
on what memory location?

> I don't see how you could write code that required them to work
> another way. You would have no way of knowing whether your operations
> had occurred before the fence or after, and thus whether or not it was
> safe to read other data. If you introduced a variable to identify
> whether or not the synchronization had occurred then you get the
> behaviour you require under the current model anyway.

There a number of algorithms, for which it's enough.
For example, Peterson's mutex:
http://en.wikipedia.org/wiki/Peterson%27s_algorithm
or work-stealing deque:
http://www.bluebytesoftware.com/blog/CommentView,guid,1665653b-b5f3-49b4-8144-cfbc5e8c632b.aspx
of eventcount:
http://groups.google.com/group/comp.programming.threads/msg/71eae0808af7e0ed

Of course, all variables must be std::atomic<>, so formally it's
always safe to read them.

> No. They have to synchronize the seq_cst operations themselves, but
> other than that they only have to synchronize the data between a
> thread that does a release and a thread that does an acquire (even if
> those release and acquire are consequences of seq_cst). They do not
> force the compiler/hardware to synchronize all data with all threads
> that perform seq_cst operations.

But can synchronization between 2 threads be implemented more
efficiently than synchronization between all threads? I think that for
commodity SMP/multicore it's not the case. I.e. it's impossible to
implement synchronization between 2 threads more efficiently than
synchronization between all threads.

My standpoint here is that if standard can give some additional
guarantees, w/o incurring additional overheads, than it's worth doing
to give those guarantees. Because there can be algorithms which can
benefit from those guarantees.

Currently C++0x seq_cst atomic operations are not fully sequentially
consistent. Seq_cst operations in most current memory models are more
sequentially consistent, so to say. Java atomics, C# Interlocked
operations and x86 locked instructions, all provide guarantees that
I'm talking about. Their behavior is intuitive, and C++0x's behavior
is not intuitive.

Another moment. Current C++0x rules make it difficult to port existing
Java, C#, x86 and just abstract 'academia' algorithms to C++0x. Naive
approach to use seq_cst atomic RMW operations instead of Java (C#,
x86, abstract) atomic RMW operations will not always work. Instead of
Java atomic RMW operations one must write something like this:
atomic_var.fetch_add(1);
std::atomic_thread_fence(std::memory_order_seq_cst);
And it's not accurate modeling. I'm not sure whether this inaccuracy
can break some algorithms or not. But, still, in my opinion, C++0x's
seq_cst atomics expose not intuitive behavior.

Dmitriy V'jukov

Peter Dimov

unread,

Aug 24, 2008, 1:30:48 PM8/24/08

to

On Aug 24, 1:20 pm, "Dmitriy V'jukov" <dvyu...@gmail.com> wrote:

> Currently C++0x seq_cst atomic operations are not fully sequentially
> consistent.

They are.

> Another moment. Current C++0x rules make it difficult to port existing
> Java, C#, x86 and just abstract 'academia' algorithms to C++0x. Naive
> approach to use seq_cst atomic RMW operations instead of Java (C#,
> x86, abstract) atomic RMW operations will not always work.

Not correct.

Dmitriy V'jukov

unread,

Aug 24, 2008, 2:55:44 PM8/24/08

to

On 24 авг, 21:30, Peter Dimov <pdi...@gmail.com> wrote:
> On Aug 24, 1:20 pm, "Dmitriy V'jukov" <dvyu...@gmail.com> wrote:
>
> > Currently C++0x seq_cst atomic operations are not fully sequentially
> > consistent.
>
> They are.

They consistent only with other seq_cst operations. And most current
MMs (Java/C#/x86) give stronger guarantees, that seq_cst operations
consistent with other non seq_cst operations too.

> > Another moment. Current C++0x rules make it difficult to port existing
> > Java, C#, x86 and just abstract 'academia' algorithms to C++0x. Naive
> > approach to use seq_cst atomic RMW operations instead of Java (C#,
> > x86, abstract) atomic RMW operations will not always work.
>
> Not correct.

Ok, I rephrase: It's easy to port any existing algorithm to C++0x
(just use seq_cst operations everywhere), but it's difficult to
achieve good performance at the same time. And performance is usually
the only reason why we use atomic operations.

Dmitriy V'jukov

Peter Dimov

unread,

Aug 24, 2008, 4:49:02 PM8/24/08

to

On Aug 24, 9:55 pm, "Dmitriy V'jukov" <dvyu...@gmail.com> wrote:

> On 24 Á×Ç, 21:30, Peter Dimov <pdi...@gmail.com> wrote:
>
> > On Aug 24, 1:20 pm, "Dmitriy V'jukov" <dvyu...@gmail.com> wrote:
>
> > > Currently C++0x seq_cst atomic operations are not fully sequentially
> > > consistent.
>
> > They are.
>
> They consistent only with other seq_cst operations. And most current
> MMs (Java/C#/x86) give stronger guarantees, that seq_cst operations
> consistent with other non seq_cst operations too.

Why do you think that the Java MM gives stronger guarantees than the C+
+MM?

Dmitriy V'jukov

unread,

Aug 25, 2008, 1:54:03 AM8/25/08

to

On 25 авг, 00:49, Peter Dimov <pdi...@gmail.com> wrote:

> > > > Currently C++0x seq_cst atomic operations are not fully sequentially
> > > > consistent.
>
> > > They are.
>
> > They consistent only with other seq_cst operations. And most current
> > MMs (Java/C#/x86) give stronger guarantees, that seq_cst operations
> > consistent with other non seq_cst operations too.
>
> Why do you think that the Java MM gives stronger guarantees than the C+
> +MM?

I'm not 100% sure. But I believe that Java/CLI MMs do give stronger
guarantees.
In JSR-133 Cookbook Doug Lea defines possible reorderings and fences
in terms of Sparc membars:
http://gee.cs.oswego.edu/dl/jmm/cookbook.html
And #StoreLoad emitted between volatile store and volatile load do
give stronger guarantees.
CLI MM is very informal. But as I understand intended reading is that,
if you can't construct sequentially consistent interleaving of threads
for some behavior, then this behavior is prohibited.
I need to read more!

Dmitriy V'jukov

Peter Dimov

unread,

Aug 25, 2008, 3:19:10 AM8/25/08

to

On Aug 25, 8:54 am, "Dmitriy V'jukov" <dvyu...@gmail.com> wrote:

> I'm not 100% sure. But I believe that Java/CLI MMs do give stronger
> guarantees.

Is there a (Java) example that can demonstrate that?

> In JSR-133 Cookbook Doug Lea defines possible reorderings and fences
> in terms of Sparc membars:http://gee.cs.oswego.edu/dl/jmm/cookbook.html
> And #StoreLoad emitted between volatile store and volatile load do
> give stronger guarantees.

Well, it doesn't really matter what one particular Java VM on SPARC
does. :-) The guarantees are given by the Java memory model, not by
the hardware or the cookbook.

(SPARC gives pretty strong guarantees since it's RCsc even in its RMO
mode, and this is mainly of theoretical importance since in practice
it's always TSO.)

Peter Dimov

unread,

Aug 25, 2008, 4:07:17 AM8/25/08

to

On Aug 25, 8:54 am, "Dmitriy V'jukov" <dvyu...@gmail.com> wrote:

> CLI MM is very informal. But as I understand intended reading is that,
> if you can't construct sequentially consistent interleaving of threads
> for some behavior, then this behavior is prohibited.

This would disallow the non-TSO example, and it's allowed on x86. I
remember reading that the first version of the CLR MM disallowed it,
and it was later changed to allow it. In either case, I don't think
that this SC reasoning applies to x86.

Dmitriy V'jukov

unread,

Aug 25, 2008, 6:05:13 AM8/25/08

to

Okay, I don't have arguments any more.
I only pointed out to some moments which in my opinion is not
intuitive or maybe can be improved. If everybody except me think that
everything is okay, than no problem. It's far more likely that I am
just wrong. I will just try to commit to my memory current rules.

Dmitriy V'jukov

Peter Dimov

unread,

Aug 25, 2008, 8:00:31 AM8/25/08

to

On Aug 25, 1:05 pm, "Dmitriy V'jukov" <dvyu...@gmail.com> wrote:

> I only pointed out to some moments which in my opinion is not
> intuitive or maybe can be improved.

Your comments are appreciated. The new seq_cst fence definition is
likely to change from

"Sequentially consistent fences participate in the single total order
S over all memory_order_seq_cst operations (29.1/2). Each sequentially
consistent fence synchronizes with the sequentially consistent fence
that follows it in S."

to

" If a memory_order_seq_cst fence F is sequenced before an atomic
operation A on an object M, A observes either the last
memory_order_seq_cst modification of M preceding F in the total order
S, or a later modification of M in its modification order.

If an atomic modification operation A of an object M is sequenced
before a memory_order_seq_cst fence F, and a memory_order_seq_cst
operation B on M follows F in S, B observes either the effects of A on
M, or a later modification of M in its modification order.

If an atomic modification operation A of an object M is sequenced
before a memory_order_seq_cst fence Fa, a memory_order_seq_cst fence
Fb is sequenced before an atomic operation B on M, and Fb follows Fa
in S, B observes either the effects of A on M, or a later modification
on M in its modification order."

partly in response to your pointing out that seq_cst loads and stores
do not interact well with seq_cst fences.

Sadly, this new definition still doesn't fix the problem with
Peterson's algorithm.

Dmitriy V'jukov

unread,

Aug 26, 2008, 1:23:30 PM8/26/08

to

Great!
As I see this allows following code:

thread1()
{
store-relaxed X, 1
fence-seq-cst
r1 = load-relaxed Y
}

thread2()
{
store-seq-cst Y, 1
r2 = load-seq-cst X
}

in the sense that (r1 == 0) && (r2 == 0) is prohibited.

Informally, relaxed stores and loads can be used 'around' seq-cst
fence. Right?

I think this can be beneficial to (1) architectures where seq-cst
fence is cheaper than seq-cst operation (Sparc), (2) architectures
where store-release and load-acquire is not free (Itanium), (3)
algorithms where we can replace several seq-cst stores with one seq-
cst fence.

Great!

Dmitriy V'jukov

unread,

Sep 20, 2008, 4:14:05 AM9/20/08

to

On 25 авг, 11:19, Peter Dimov <pdi...@gmail.com> wrote:

> (SPARC gives pretty strong guarantees since it's RCsc even in its RMO
> mode, and this is mainly of theoretical importance since in practice
> it's always TSO.)

64-bit Linux runs Sparc in RMO mode!

Dmitriy V'jukov