330 views

Skip to first unread message

Aug 6, 2008, 9:58:23 AM8/6/08

to

I'm trying to figure out what synchronizes-with edges sequentially

consistent atomic operations must introduce.

My reasoning is based on latest C++0x draft:

http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2008/n2691.pdf

And Peter Dimov's excellent proposal on bidirectional fences:

http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2008/n2633.html

consistent atomic operations must introduce.

My reasoning is based on latest C++0x draft:

http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2008/n2691.pdf

And Peter Dimov's excellent proposal on bidirectional fences:

http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2008/n2633.html

First example:

std::atomic<int> x, y; // both initially 0

void thread1()

{

x.fetch_add(1, std::memory_order_seq_cst);

int r1 = y.load(std::memory_order_relaxed);

}

void thread1()

{

y.fetch_add(1, std::memory_order_seq_cst);

int r2 = x.load(std::memory_order_relaxed);

}

As I understand, (r1 == 0) && (r2 == 0) is possible output, because no

synchronizes-with edges introduced (at least I can't find anything

which says the opposite).

Is this intended? It's quite counter-intuitive. fetch_add operations

are ordered, and they both acquire and release. So why here is no

synchronizes-with edge between them? Then I don't understand what

differentiate seq_cst operations from acq_rel operations, here seq_cst

operations acts exactly like acq_rel operations.

Second example:

std::atomic<int> x, y; // both initially 0

void thread1()

{

x.fetch_add(1, std::memory_order_seq_cst);

int r1 = y.load(std::memory_order_relaxed);

}

void thread1()

{

y.store(1, std::memory_order_relaxed);

std::atomic_thread_fence(std::memory_order_seq_cst);

int r2 = x.load(std::memory_order_relaxed);

}

Here is the same story.

I think that for seq_cst operations there must be something like:

"Sequentially consistent operation or fence which is release

synchronizes-with EVERY subsequent in total order S (of all

sequentially consistent operations and fences) sequentially consistent

operation or fence which is acquire"

Then output (r1 == 0) && (r2 == 0) will be prohibited in both

examples.

Dmitriy V'jukov

Aug 6, 2008, 3:09:28 PM8/6/08

to

On Aug 6, 4:58 pm, "Dmitriy V'jukov" <dvyu...@gmail.com> wrote:

> I'm trying to figure out what synchronizes-with edges sequentially

> consistent atomic operations must introduce.

> My reasoning is based on latest C++0x draft:http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2008/n2691.pdf

> And Peter Dimov's excellent proposal on bidirectional fences:http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2008/n2633.html

>

> First example:

>

> std::atomic<int> x, y; // both initially 0

> void thread1()

> {

> x.fetch_add(1, std::memory_order_seq_cst);

> int r1 = y.load(std::memory_order_relaxed);}

>

> void thread1()

> {

> y.fetch_add(1, std::memory_order_seq_cst);

> int r2 = x.load(std::memory_order_relaxed);

>

> }

>

> As I understand, (r1 == 0) && (r2 == 0) is possible output, because no

> synchronizes-with edges introduced (at least I can't find anything

> which says the opposite).

> Is this intended?

> I'm trying to figure out what synchronizes-with edges sequentially

> consistent atomic operations must introduce.

> My reasoning is based on latest C++0x draft:http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2008/n2691.pdf

> And Peter Dimov's excellent proposal on bidirectional fences:http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2008/n2633.html

>

> First example:

>

> std::atomic<int> x, y; // both initially 0

> void thread1()

> {

> x.fetch_add(1, std::memory_order_seq_cst);

> int r1 = y.load(std::memory_order_relaxed);}

>

> void thread1()

> {

> y.fetch_add(1, std::memory_order_seq_cst);

> int r2 = x.load(std::memory_order_relaxed);

>

> }

>

> As I understand, (r1 == 0) && (r2 == 0) is possible output, because no

> synchronizes-with edges introduced (at least I can't find anything

> which says the opposite).

> Is this intended?

I didn't participate in specifying the seq_cst semantics (Herb Sutter

was, I believe, the primary driving force behind them) but I think

that this is by design. seq_cst works only if you apply it

consistently; it's not designed to integrate with relaxed.

void thread1()

{

x.store( 1, std::memory_order_seq_cst );

int r1 = y.load( std::memory_order_seq_cst );

}

void thread2()

{

y.store( 1, std::memory_order_seq_cst );

int r2 = x.load( std::memory_order_seq_cst );

}

I think that this was done for efficiency reasons.

> Second example:

>

> std::atomic<int> x, y; // both initially 0

> void thread1()

> {

> x.fetch_add(1, std::memory_order_seq_cst);

> int r1 = y.load(std::memory_order_relaxed);}

>

> void thread1()

> {

> y.store(1, std::memory_order_relaxed);

> std::atomic_thread_fence(std::memory_order_seq_cst);

> int r2 = x.load(std::memory_order_relaxed);

> }

seq_cst fences share the general seq_cst property of not playing well

with others:

void thread1()

{

x.store( 1, std::memory_order_relaxed );

std::atomic_thread_fence( std::memory_order_seq_cst );

int r1 = y.load( std::memory_order_relaxed );

}

void thread2()

{

y.store( 1, std::memory_order_relaxed );

std::atomic_thread_fence( std::memory_order_seq_cst );

int r2 = x.load( std::memory_order_relaxed );

}

Aug 6, 2008, 3:33:07 PM8/6/08

to

On Aug 6, 11:09 pm, Peter Dimov <pdi...@gmail.com> wrote:

> I didn't participate in specifying the seq_cst semantics (Herb Sutter

> was, I believe, the primary driving force behind them) but I think

> that this is by design. seq_cst works only if you apply it

> consistently; it's not designed to integrate with relaxed.

>

> void thread1()

> {

> x.store( 1, std::memory_order_seq_cst );

> int r1 = y.load( std::memory_order_seq_cst );

>

> }

>

> void thread2()

> {

> y.store( 1, std::memory_order_seq_cst );

> int r2 = x.load( std::memory_order_seq_cst );

>

> }

Yes, this will work for sure.

So seq_cst is like 'const' or 'GPL' :)

> I think that this was done for efficiency reasons.

As I understand, if (1) operations are ordered (seq_cst operations are

definitely ordered), and (2) operations are both acquire and release

(seq_cst RMW is acquire and release for sure), then (3) they

synchronize-with automatically. On which architecture it's possible to

satisfy (1) and (2), but not (3)? It's quite counter-intuitive for

me...

> seq_cst fences share the general seq_cst property of not playing well

> with others:

The problem is that sometimes you don't have control over some atomic

operations.

For example we have queue:

class queue

{

void push(v)

{

...

head.exchange(..., std::memory_order_seq_cst);

}

};

We can't change the queue.

And we want to add 'blocking':

class blocking_queue

{

queue q;

void push(v)

{

q.push(v);

// here we need to synchronize-with concurrent pop(),

// and queue::push() already execute strong enough memory fence,

// but we can't take advantage of it, we have to execute one more

// strong memory fence

std::atomic_thread_fence(std::memory_order_seq_cst);

if (load_something())

...

}

};

Dmitriy V'jukov

Aug 6, 2008, 6:53:29 PM8/6/08

to

On Aug 6, 10:33 pm, "Dmitriy V'jukov" <dvyu...@gmail.com> wrote:

> On Aug 6, 11:09 pm, Peter Dimov <pdi...@gmail.com> wrote:

> On Aug 6, 11:09 pm, Peter Dimov <pdi...@gmail.com> wrote:

...

> > I think that this was done for efficiency reasons.

>

> As I understand, if (1) operations are ordered (seq_cst operations are

> definitely ordered), and (2) operations are both acquire and release

> (seq_cst RMW is acquire and release for sure), then (3) they

> synchronize-with automatically. On which architecture it's possible to

> satisfy (1) and (2), but not (3)? It's quite counter-intuitive for

> me...

The basic principle is that if A synchronizes-with B, then A is a

write with release semantics and B is a read with acquire semantics. A

read can't synchronize-with something. A seq_cst load is not acq_rel

(release doesn't apply to loads), it only has acquire semantics.

Ordinary operations can still sink below a seq_cst load as a result of

compiler or hardware reorderings, and this doesn't break the

sequential consistency of a data-race-free program. (If my

understanding is correct.)

> The problem is that sometimes you don't have control over some atomic

> operations.

I know, and this is why the fences are defined to not only synchronize

with each other, but with other acquire and release operations as

well. The classic "write sync-with read" model is extended so that a

fence can be substituted for either or both the "source" and the

"target" of the synchronizes-with edge. But it still does not allow

"read sync-with fence" and the other arbitrary combinations; this

would be against the spirit of the current memory model.

(Even the current modification is viewed as risky.)

Aug 7, 2008, 7:14:15 AM8/7/08

to

On Aug 7, 2:53 am, Peter Dimov <pdi...@gmail.com> wrote:

> > As I understand, if (1) operations are ordered (seq_cst operations are

> > definitely ordered), and (2) operations are both acquire and release

> > (seq_cst RMW is acquire and release for sure), then (3) they

> > synchronize-with automatically. On which architecture it's possible to

> > satisfy (1) and (2), but not (3)? It's quite counter-intuitive for

> > me...

>

> The basic principle is that if A synchronizes-with B, then A is a

> write with release semantics and B is a read with acquire semantics. A

> read can't synchronize-with something. A seq_cst load is not acq_rel

> (release doesn't apply to loads), it only has acquire semantics.

> Ordinary operations can still sink below a seq_cst load as a result of

> compiler or hardware reorderings, and this doesn't break the

> sequential consistency of a data-race-free program. (If my

> understanding is correct.)

Let's for now consider only seq_cst atomic RMW.

What you mean:

1. This is basic principle of how hardware works. If so, then no

questions. I'm just curious what hardware can satisfy (1) and (2), but

not (3).

2. This is basic principle of current C++0x memory model. If so, then

the question is whether current C++0x memory model can be extended to

something like this:

"Sequentially consistent operation or fence which is release

synchronizes-with EVERY subsequent in total order S (of all

sequentially consistent operations and fences) sequentially consistent

operation or fence which is acquire"?

My supposition is that such rules can be efficiently implemented, and

such rules are intuitive and useful. What do you think?

> > The problem is that sometimes you don't have control over some atomic

> > operations.

>

> I know, and this is why the fences are defined to not only synchronize

> with each other, but with other acquire and release operations as

> well. The classic "write sync-with read" model is extended so that a

> fence can be substituted for either or both the "source" and the

> "target" of the synchronizes-with edge. But it still does not allow

> "read sync-with fence" and the other arbitrary combinations; this

> would be against the spirit of the current memory model.

There are 2 kinds of synchronization:

1. Let's call it "queue". One thread have to pass some information to

another. First thread executes release, and second - acquire. In this

situation what you've described (synchronization between store-release

and fence-acquire, and synchronization between fence-release and load-

acquire) works perfectly. I.e. fences and atomic operations can co-

operate in this kind of synchronization.

2. Let's call it "mutex". Two threads have to decide "who is the

first". Examples are mutex acquire with atomic RMW; Peterson's mutex

acquire with seq_cst fences; signaling on eventcount. And in this

situation what you've described doesn't work at all (not counting fact

that two seq_cst fences synchronizes-with). Current model was NOT

extended to handle this kind of synchronization. I.e. fences and

atomic operations don't co-operate at all in this kind of

synchronization. It's quite "dishonestly". And in some situations it

can basically double synchronization overheads, just to "stay

consistent with C++0x rules".

Dmitriy V'jukov

Aug 7, 2008, 10:01:51 AM8/7/08

to

On Aug 7, 2:14 pm, "Dmitriy V'jukov" <dvyu...@gmail.com> wrote:

> 2. This is basic principle of current C++0x memory model. If so, then

> the question is whether current C++0x memory model can be extended to

> something like this:

> "Sequentially consistent operation or fence which is release

> synchronizes-with EVERY subsequent in total order S (of all

> sequentially consistent operations and fences) sequentially consistent

> operation or fence which is acquire"?

Maybe. To make sure that I understand, this would allow fence.seq_cst

in signal and cas.seq_cst in get, right? You'll still need

load.seq_cst in "signal_relaxed" though, even when the publication is

done with store.seq_cst. The implicit #StoreLoad is (conceptually)

only injected between two seq_cst ops, it won't necessarily be present

if the load is not seq_cst.

Aug 7, 2008, 11:19:45 AM8/7/08

to

No. The ultimate goal is:

int r1 = 0, r2 = 0;

void producer()

{

// publication

head.exchange(..., seq_cst);

// signaling

if (eventcount.load(relaxed))

int r1 = 1;

}

void consumer()

{

// consumption

node = head.load(relaxed);

if (0 == node)

{

// get

eventcount.exchange(..., seq_cst);

// consumption one more time

node = head.load(relaxed);

if (node)

r2 = 1;

}

}

And I want that (r1 == 0) && (r2 == 0) will be prohibited. I.e. or

publication must synchronize-with get, or vice versa.

On x86 this must result in "optimal" implementation:

void producer()

{

// publication

// head.exchange(..., seq_cst);

InterlockedExchange(...);

// signaling

// if (eventcount.load(relaxed))

if (plain_load(eventcount))

int r1 = 1;

}

void consumer()

{

// consumption

// node = head.load(relaxed);

node = plain_load(head);

if (0 == node)

{

// get

// eventcount.exchange(..., seq_cst);

InterlockedExchange(...);

// consumption one more time

// node = head.load(relaxed);

node = plain_load(head);

if (node)

r2 = 1;

}

}

Dmitriy V'jukov

Aug 7, 2008, 3:26:30 PM8/7/08

to

On Aug 7, 6:19 pm, "Dmitriy V'jukov" <dvyu...@gmail.com> wrote:

> No. The ultimate goal is:

>

> int r1 = 0, r2 = 0;

>

> void producer()

> {

> // publication

> head.exchange(..., seq_cst);

> // signaling

> if (eventcount.load(relaxed))

> int r1 = 1;

>

> }

>

> void consumer()

> {

> // consumption

> node = head.load(relaxed);

> if (0 == node)

> {

> // get

> eventcount.exchange(..., seq_cst);

> // consumption one more time

> node = head.load(relaxed);

> if (node)

> r2 = 1;

> }

>

> }

>

> And I want that (r1 == 0) && (r2 == 0) will be prohibited. I.e. or

> publication must synchronize-with get, or vice versa.

Under the C++MM, relaxed operations never synchronize with anything,

and you have no fences, so this won't work (in principle), no matter

what the fence semantics are. Unless I'm missing something.

(At least the first load in consumer needs to have acquire semantics

though. Or you can add an acquire fence in an 'else'.)

> On x86 this must result in "optimal" implementation:

...

On x86, all loads map to a mov instruction, even load.seq_cst (if my

understanding is correct), so if x86 is your target, you can just use

load.seq_cst. The hardware that is closest to the edge of the C++MM is

POWER. I must admit that I don't know for sure how seq_cst is

implemented on POWER though. It's possible that the above will work

there in practice.

Aug 7, 2008, 5:03:15 PM8/7/08

to

On 7 авг, 23:26, Peter Dimov <pdi...@gmail.com> wrote:

> > And I want that (r1 == 0) && (r2 == 0) will be prohibited. I.e. or

> > publication must synchronize-with get, or vice versa.

>

> Under the C++MM, relaxed operations never synchronize with anything,

> and you have no fences, so this won't work (in principle), no matter

> what the fence semantics are. Unless I'm missing something.

Sh#t! There must be:

node = head.load(acquire);

I want to seq_cst atomic RMW in one thread synchronize-with seq_cst

atomic RMW in another thread. Forget about current C++0x MM for a

moment. It this reasonably?

> > On x86 this must result in "optimal" implementation:

>

> ...

>

> On x86, all loads map to a mov instruction, even load.seq_cst (if my

> understanding is correct), so if x86 is your target, you can just use

> load.seq_cst. The hardware that is closest to the edge of the C++MM is

> POWER. I must admit that I don't know for sure how seq_cst is

> implemented on POWER though. It's possible that the above will work

> there in practice.

No, x86 is not my target. Well, if seq_cst load will be implemented as

plain load on ALL architectures, maybe this changes the situation... I

have to think some more. At least examples in first post will be

working, if I change relaxed loads to seq_cst loads. But is this the

case, that seq_cst loads will implemented as plain loads on ALL

architectures?

Dmitriy V'jukov

Aug 7, 2008, 5:42:11 PM8/7/08

to

On Aug 8, 12:03 am, "Dmitriy V'jukov" <dvyu...@gmail.com> wrote:

> I want to seq_cst atomic RMW in one thread synchronize-with seq_cst

> atomic RMW in another thread.

> I want to seq_cst atomic RMW in one thread synchronize-with seq_cst

> atomic RMW in another thread.

Two RMW ops targeting different locations?

As I understand it, the design goal of seq_cst atomics was to make it

possible for programmers to write sequentially consistent code without

sacrificing performance (or rather, sacrificing it as little as

possible). Typically, two seq_cst operations to different locations do

not need to synchronize with one another to achieve SC. Imposing this

requirement will decrease the (theoretical) performance of seq_cst-

using programs.

> But is this the case, that seq_cst loads will implemented as plain loads on ALL architectures?

My understanding of the POWER MM is very limited, but I've heard that

a seq_cst load there maps to SYNC; LL/SC loop; SYNC. Which is terribly

expensive. I suspect that for optimal performance one would still need

two or three variants of the same algorithm.

Aug 8, 2008, 3:26:51 AM8/8/08

to

On 8 авг, 01:42, Peter Dimov <pdi...@gmail.com> wrote:

> On Aug 8, 12:03 am, "Dmitriy V'jukov" <dvyu...@gmail.com> wrote:

>

> > I want to seq_cst atomic RMW in one thread synchronize-with seq_cst

> > atomic RMW in another thread.

>

> Two RMW ops targeting different locations?

> On Aug 8, 12:03 am, "Dmitriy V'jukov" <dvyu...@gmail.com> wrote:

>

> > I want to seq_cst atomic RMW in one thread synchronize-with seq_cst

> > atomic RMW in another thread.

>

> Two RMW ops targeting different locations?

Yes.

> As I understand it, the design goal of seq_cst atomics was to make it

> possible for programmers to write sequentially consistent code without

> sacrificing performance (or rather, sacrificing it as little as

> possible). Typically, two seq_cst operations to different locations do

> not need to synchronize with one another to achieve SC. Imposing this

> requirement will decrease the (theoretical) performance of seq_cst-

> using programs.

Consider slides 11 and 12 from Anthony Williams presentation:

http://accu.org/content/conf2008/Williams-future_of_concurrency.pdf

seq_cst stores and RMW have to synchronize with one another, because

they have to from global order. This is the main difference from

acq_rel. So they already synchronize with one another, why not declare

that they also 'synchronize-with' one another?

> > But is this the case, that seq_cst loads will implemented as plain loads on ALL architectures?

>

> My understanding of the POWER MM is very limited, but I've heard that

> a seq_cst load there maps to SYNC; LL/SC loop; SYNC. Which is terribly

> expensive. I suspect that for optimal performance one would still need

> two or three variants of the same algorithm.

Oh! I definitely don't want to insert 2 superfluous 'SYNC; LL/SC loop;

SYNC'!

What about several variants of the algorithm, yeah, maybe sometimes

this will be the case. But how I can express example with eventcount:

http://groups.google.com/group/comp.programming.threads/msg/92692b9b70106e9f

in optimal for POWER way?

I still think that allowing two seq_cst atomic RMW to different

locations to 'synchronize-with' one another is beneficial...

Dmitriy V'jukov

Aug 8, 2008, 7:30:13 AM8/8/08

to

On Aug 8, 10:26 am, "Dmitriy V'jukov" <dvyu...@gmail.com> wrote:

> seq_cst stores and RMW have to synchronize with one another, because

> they have to from global order. This is the main difference from

> acq_rel. So they already synchronize with one another, why not declare

> that they also 'synchronize-with' one another?

I'm not 100% positive, but I suspect that an implementation where

store.seq_cst maps to MOV, and load.seq_cst maps to LOCK XADD, is

conforming. Stated differently, I think that, while the current model

demands this:

// T1

store.seq_cst x, 1

load.seq_cst y

// T2

store.seq_cst y, 1

load.seq_cst x

to "work", or to not load (0, 0), it is possible to write an

implementation which would make one of these two:

// T1

store.release x, 1

load.seq_cst y

// T2

store.release y, 1

load.seq_cst x

//

// T1

store.seq_cst x, 1

load.acquire y

// T2

store.seq_cst y, 1

load.acquire x

to "fail". (I don't know if it's possible in practice to make an

implementation that would "fail" both.)

> What about several variants of the algorithm, yeah, maybe sometimes

> this will be the case. But how I can express example with eventcount:http://groups.google.com/group/comp.programming.threads/msg/92692b9b7...

> in optimal for POWER way?

I suspect that for POWER the best course of action would be to write

the assembly and then backport to fences, replacing SYNC with

fence.seq_cst and LWSYNC with fence.acq_rel (or acquire, or release,

depending on context). :-)

Aug 11, 2008, 3:15:13 PM8/11/08

to

But this is not an end in itself. If C++MM will require that seq_cst

operations must synchronize-with each other, then former

implementation will be just prohibited. Is it bad? I think that latter

implementation (stores are heavier, and loads are easier) is

preferable anyway. What I miss here?

Dmitriy V'jukov

Aug 13, 2008, 4:09:45 AM8/13/08

to

"Dmitriy V'jukov" <dvy...@gmail.com> writes:

> I'm trying to figure out what synchronizes-with edges sequentially

> consistent atomic operations must introduce.

> My reasoning is based on latest C++0x draft:

> http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2008/n2691.pdf

> And Peter Dimov's excellent proposal on bidirectional fences:

> http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2008/n2633.html

>

> First example:

>

> std::atomic<int> x, y; // both initially 0

> void thread1()

> {

> x.fetch_add(1, std::memory_order_seq_cst);

> int r1 = y.load(std::memory_order_relaxed);

> }

> void thread1()

> {

> y.fetch_add(1, std::memory_order_seq_cst);

> int r2 = x.load(std::memory_order_relaxed);

> }

>

> As I understand, (r1 == 0) && (r2 == 0) is possible output, because no

> synchronizes-with edges introduced (at least I can't find anything

> which says the opposite).

> Is this intended? It's quite counter-intuitive. fetch_add operations

> are ordered, and they both acquire and release. So why here is no

> synchronizes-with edge between them? Then I don't understand what

> differentiate seq_cst operations from acq_rel operations, here seq_cst

> operations acts exactly like acq_rel operations.

This is intended. Operations on different variables cannot synchronize

with each other.

Unless *all* your operations are seq_cst, seq_cst is the same as

acquire for loads, release for stores and acq_rel for RMW.

> Second example:

>

> std::atomic<int> x, y; // both initially 0

> void thread1()

> {

> x.fetch_add(1, std::memory_order_seq_cst);

> int r1 = y.load(std::memory_order_relaxed);

> }

> void thread1()

> {

> y.store(1, std::memory_order_relaxed);

> std::atomic_thread_fence(std::memory_order_seq_cst);

> int r2 = x.load(std::memory_order_relaxed);

> }

>

> Here is the same story.

Yes. As Peter already said, relaxed operations can't participate in

synchronize-with relationships.

Anthony

--

Anthony Williams | Just Software Solutions Ltd

Custom Software Development | http://www.justsoftwaresolutions.co.uk

Registered in England, Company Number 5478976.

Registered Office: 15 Carrallack Mews, St Just, Cornwall, TR19 7UL

Aug 13, 2008, 4:11:31 AM8/13/08

to

Peter Dimov <pdi...@gmail.com> writes:

> On x86, all loads map to a mov instruction, even load.seq_cst

Yes. If store.seq_cst is XCHG, load.seq_cst need only be a plain MOV.

Aug 15, 2008, 6:55:00 PM8/15/08

to

On 13 авг, 12:09, Anthony Williams <anthony....@gmail.com> wrote:

> > As I understand, (r1 == 0) && (r2 == 0) is possible output, because no

> > synchronizes-with edges introduced (at least I can't find anything

> > which says the opposite).

> > Is this intended? It's quite counter-intuitive. fetch_add operations

> > are ordered, and they both acquire and release. So why here is no

> > synchronizes-with edge between them? Then I don't understand what

> > differentiate seq_cst operations from acq_rel operations, here seq_cst

> > operations acts exactly like acq_rel operations.

>

> This is intended.

Ok.

But then I am just curious - why?

It is impossible to implement efficiently on some architectures?

It will over-complicate definitions?

It is just useless?

> > Here is the same story.

>

> Yes. As Peter already said, relaxed operations can't participate in

> synchronize-with relationships.

NO! Not relaxed operations. One seq_cst operation synchronizes-with

another seq_cst operation. Relaxed operations are under standard

rules.

--

Dmitriy V'jukov

Relacy Race Detector: Make your synchronization correct!

http://groups.google.ru/group/relacy/web

Aug 18, 2008, 8:02:03 AM8/18/08

to

"Dmitriy V'jukov" <dvy...@gmail.com> writes:

> On 13 авг, 12:09, Anthony Williams <anthony....@gmail.com> wrote:

>

>> > As I understand, (r1 == 0) && (r2 == 0) is possible output, because no

>> > synchronizes-with edges introduced (at least I can't find anything

>> > which says the opposite).

>> > Is this intended? It's quite counter-intuitive. fetch_add operations

>> > are ordered, and they both acquire and release. So why here is no

>> > synchronizes-with edge between them? Then I don't understand what

>> > differentiate seq_cst operations from acq_rel operations, here seq_cst

>> > operations acts exactly like acq_rel operations.

>>

>> This is intended.

>

> Ok.

> But then I am just curious - why?

I don't know the full details, but the impression I got from the

atomics working group was that if you want full ordering use seq_cst

everywhere. If you care about efficiency, seq_cst is generally too

expensive anyway, so we don't want seq_cst semantics to affect the

ordering of other operations.

seq_cst operations form a total order. On some architectures this is

very expensive to do (imagine a distributed system). Forcing this

total order to add synchronizes-with edges increases that cost even

more, as the other non-seq-cst operations now have to participate in

the order.

>> > Here is the same story.

>>

>> Yes. As Peter already said, relaxed operations can't participate in

>> synchronize-with relationships.

>

> NO! Not relaxed operations. One seq_cst operation synchronizes-with

> another seq_cst operation. Relaxed operations are under standard

> rules.

Your example was:

std::atomic<int> x, y; // both initially 0

void thread1()

{

x.fetch_add(1, std::memory_order_seq_cst);

int r1 = y.load(std::memory_order_relaxed);

}

void thread2()

{

y.store(1, std::memory_order_relaxed);

std::atomic_thread_fence(std::memory_order_seq_cst);

int r2 = x.load(std::memory_order_relaxed);

}

As I understand it, you want the fence to "synchronize-with" the

fetch_add in some way, so that r1=r2=0 is prohibited.

Fences don't work that way. The fence only participates in

synchronizes-with if the load of y from thread1 is an acquire *and

sees the value stored by thread2*, or the load of x from thread2 is an

acquire *and sees the value stored by thread1*.

For the fence to force the ordering would require that the relaxed

operations were part of the total ordering: not only would the

runtime/hardware have to ensure that all the seq_cst operations

happened in a total order, but that all accesses to all variables that

occurred before or after a seq_cst operation were made visible to

other threads. This would be a major cost, as every seq_cst operation

would then essentially have to synchronize the entire data set of the

application.

Aug 18, 2008, 6:56:36 PM8/18/08

to

On Aug 11, 10:15 pm, "Dmitriy V'jukov" <dvyu...@gmail.com> wrote:

> But this is not an end in itself. If C++MM will require that seq_cst

> operations must synchronize-with each other, then former

> implementation will be just prohibited. Is it bad? I think that latter

> implementation (stores are heavier, and loads are easier) is

> preferable anyway. What I miss here?

I think that this prevents a number of currently-valid optimizations.

Consider this example:

// T1:

x.store( 1, relaxed );

q1.store( 1, seq_cst );

r1 = y.load( relaxed );

// T2:

y.store( 1, relaxed );

q2.store( 1, seq_cst );

r1 = x.load( relaxed );

Currently, the seq_cst stores do not prevent upward reordering of the

loads (on compiler or hardware level) because a seq_cst store is only

a release operation. (You can of course use q.swap to prevent it.)

This reordering doesn't break sequential consistency for race-free

programs that use only seq_cst atomics and ordinary non-atomic ops and

may potentially improve their performance.

Further, assume that q1 is local to T1 and q2 is local to T2. In this

case, the C++MM allows operations on these locations to be optimized

out. If they were required to synchronize with one another, it

wouldn't be possible for them to be ignored, even when the location is

private to the thread.

I'm not actually claiming that these optimizations are important in

practice, just that they may have been the reason for the C++MM to be

specified in the way it is.

Aug 19, 2008, 6:48:00 PM8/19/08

to

On 19 авг, 02:56, Peter Dimov <pdi...@gmail.com> wrote:

> On Aug 11, 10:15 pm, "Dmitriy V'jukov" <dvyu...@gmail.com> wrote:

>

> > But this is not an end in itself. If C++MM will require that seq_cst

> > operations must synchronize-with each other, then former

> > implementation will be just prohibited. Is it bad? I think that latter

> > implementation (stores are heavier, and loads are easier) is

> > preferable anyway. What I miss here?

>

> I think that this prevents a number of currently-valid optimizations.

> Consider this example:

>

> // T1:

>

> x.store( 1, relaxed );

> q1.store( 1, seq_cst );

> r1 = y.load( relaxed );

>

> // T2:

>

> y.store( 1, relaxed );

> q2.store( 1, seq_cst );

> r1 = x.load( relaxed );

>

> Currently, the seq_cst stores do not prevent upward reordering of the

> loads (on compiler or hardware level) because a seq_cst store is only

> a release operation. (You can of course use q.swap to prevent it.)

> This reordering doesn't break sequential consistency for race-free

> programs that use only seq_cst atomics and ordinary non-atomic ops and

> may potentially improve their performance.

> On Aug 11, 10:15 pm, "Dmitriy V'jukov" <dvyu...@gmail.com> wrote:

>

> > But this is not an end in itself. If C++MM will require that seq_cst

> > operations must synchronize-with each other, then former

> > implementation will be just prohibited. Is it bad? I think that latter

> > implementation (stores are heavier, and loads are easier) is

> > preferable anyway. What I miss here?

>

> I think that this prevents a number of currently-valid optimizations.

> Consider this example:

>

> // T1:

>

> x.store( 1, relaxed );

> q1.store( 1, seq_cst );

> r1 = y.load( relaxed );

>

> // T2:

>

> y.store( 1, relaxed );

> q2.store( 1, seq_cst );

> r1 = x.load( relaxed );

>

> Currently, the seq_cst stores do not prevent upward reordering of the

> loads (on compiler or hardware level) because a seq_cst store is only

> a release operation. (You can of course use q.swap to prevent it.)

> This reordering doesn't break sequential consistency for race-free

> programs that use only seq_cst atomics and ordinary non-atomic ops and

> may potentially improve their performance.

I was talking only about seq_cst RMW operations.

Seq_cst store doesn't have to synchronize-with other seq_cst store.

Well, ok, in "my proposal" seq_csq store can synchronize-with seq_cst

load. But this makes a little sense, because if load will be before

store in total order, then nobody will synchronize-with.

> Further, assume that q1 is local to T1 and q2 is local to T2. In this

> case, the C++MM allows operations on these locations to be optimized

> out. If they were required to synchronize with one another, it

> wouldn't be possible for them to be ignored, even when the location is

> private to the thread.

>

> I'm not actually claiming that these optimizations are important in

> practice, just that they may have been the reason for the C++MM to be

> specified in the way it is.

Yes, such optimizations will be prohibited. And yes, I agree that

"thread local atomics" are NOT the right use case to reason about

memory model...

Aug 19, 2008, 7:26:32 PM8/19/08

to

On 18 авг, 16:02, Anthony Williams <anthony....@gmail.com> wrote:

> I don't know the full details, but the impression I got from the

> atomics working group was that if you want full ordering use seq_cst

> everywhere. If you care about efficiency, seq_cst is generally too

> expensive anyway, so we don't want seq_cst semantics to affect the

> ordering of other operations.

For now, I don't yet see where "my proposal" will affect performance

of non seq_cst operations.

The problem that it's not always possible to consistently use only

seq_cst operations, or only not seq_cst operations. Current rules hit

composability. Assume you want to use component which uses seq_cst

operations, but you don't want to use seq_cst operations in the rest

of the code. You will have to insert additional atomic operations only

to integrate the component. Or assume that you have to integrate

component which uses seq_cst operations with another component which

doesn't use seq_cst operations. The same problem.

Or assume that one algorithm can be expressed more naturally with

seq_cst operations, and another - with non seq_cst operations. And you

have to use both. Once again, you will have to insert additional

atomic operations only to integrate them.

Or even worse. You are writing generic algorithm (something like Chris

Thomasson's eventcount). And you don't know with what you will have to

synchronize. With seq_cst fence, or with seq_cst rmw operation. Ooops!

Currently you have to issue both - and fence, and rmw operation.

I see clear advantage in better cooperation of seq_seq operations with

the rest of the world.

> seq_cst operations form a total order. On some architectures this is

> very expensive to do (imagine a distributed system). Forcing this

> total order to add synchronizes-with edges increases that cost even

> more, as the other non-seq-cst operations now have to participate in

> the order.

But they have anyway!

Because seq_cst store is release, and seq_cst load is acquire. So all

operations anyway have to complete before seq_cst store, because other

thread can issue load-acquire which will load value stored by seq_cst

store. And similar for seq_cst load.

Maybe I'm missing something, but, according to my current

understanding, what I'm proposing will work anyway, it's just formally

not legal.

> >> > Here is the same story.

>

> >> Yes. As Peter already said, relaxed operations can't participate in

> >> synchronize-with relationships.

>

> > NO! Not relaxed operations. One seq_cst operation synchronizes-with

> > another seq_cst operation. Relaxed operations are under standard

> > rules.

>

> Your example was:

>

> std::atomic<int> x, y; // both initially 0

> void thread1()

> {

> x.fetch_add(1, std::memory_order_seq_cst);

> int r1 = y.load(std::memory_order_relaxed);}

>

> void thread2()

> {

> y.store(1, std::memory_order_relaxed);

> std::atomic_thread_fence(std::memory_order_seq_cst);

> int r2 = x.load(std::memory_order_relaxed);

>

> }

>

> As I understand it, you want the fence to "synchronize-with" the

> fetch_add in some way, so that r1=r2=0 is prohibited.

Yes.

> Fences don't work that way. The fence only participates in

> synchronizes-with if the load of y from thread1 is an acquire *and

> sees the value stored by thread2*, or the load of x from thread2 is an

> acquire *and sees the value stored by thread1*.

I understand that they don't work that way. The question is - why they

don't work that way? And can they work another way?

> For the fence to force the ordering would require that the relaxed

> operations were part of the total ordering: not only would the

> runtime/hardware have to ensure that all the seq_cst operations

> happened in a total order, but that all accesses to all variables that

> occurred before or after a seq_cst operation were made visible to

> other threads. This would be a major cost, as every seq_cst operation

> would then essentially have to synchronize the entire data set of the

> application.

But seq_cst operations have to synchronize the entire data set of the

application anyway!

If standard would not claim that seq_cst store is release and seq_cst

load is acquire, than I would agree. Than seq_cst operations will

cooperate only with other seq_cst operations, and not with release/

acquire operations.

But standard claims that seq_cst store is release and seq_cst load is

acquire, so I can write something like this:

void thread1()

{

non_atomic_data = 1;

atomic_flag.store(1, seq_cst);

}

void thread2()

{

if (1 == atomic_flag.load(acquire))

assert(1 == non_atomic_data);

}

or:

void thread1()

{

non_atomic_data = 1;

atomic_flag.store(1, release);

}

void thread2()

{

if (1 == atomic_flag.load(seq_cst))

assert(1 == non_atomic_data);

Aug 20, 2008, 3:11:28 AM8/20/08

to

"Dmitriy V'jukov" <dvy...@gmail.com> writes:

> On 18 авг, 16:02, Anthony Williams <anthony....@gmail.com> wrote:

>

>> I don't know the full details, but the impression I got from the

>> atomics working group was that if you want full ordering use seq_cst

>> everywhere. If you care about efficiency, seq_cst is generally too

>> expensive anyway, so we don't want seq_cst semantics to affect the

>> ordering of other operations.

>

> For now, I don't yet see where "my proposal" will affect performance

> of non seq_cst operations.

If every seq_cst operation synchronizes-with every other seq_cst

operation, this affects the allowed compiler optimizations. It

effectively makes every seq_cst operation a full fence, even if no

other thread can see that operation:

int global(0);

std::atomic_int global_atomic(0);

void thread1()

{

global=1;

std::atomic_int ai;

ai.store(1,std::memory_order_seq_cst);

global_atomic.store(1,std::memory_order_relaxed);

}

void thread2()

{

if(global_atomic.load(std::memory_order_seq_cst)==1)

{

assert(global==1);

}

}

Under the current memory model, ai is local so it can be completely

eliminated. The stores to global and global_atomic are not ordered, so

the assert in thread2 is actually a data race.

Under your proposal, the store to ai cannot be elimiated since it must

synchronize-with all other seq_cst operations. In particular, it must

synchronize-with the seq_cst load in thread2 of an unrelated

variable. Since the store to global_atomic is after the store to ai,

if the load reads 1, the assert is no longer a data race, and will not

fire either, since the value of global must be 1.

> The problem that it's not always possible to consistently use only

> seq_cst operations, or only not seq_cst operations. Current rules hit

> composability. Assume you want to use component which uses seq_cst

> operations, but you don't want to use seq_cst operations in the rest

> of the code. You will have to insert additional atomic operations only

> to integrate the component. Or assume that you have to integrate

> component which uses seq_cst operations with another component which

> doesn't use seq_cst operations. The same problem.

That depends on how you write your components. Can you give an example?

> Or assume that one algorithm can be expressed more naturally with

> seq_cst operations, and another - with non seq_cst operations. And you

> have to use both. Once again, you will have to insert additional

> atomic operations only to integrate them.

Again, can you give an example?

> Or even worse. You are writing generic algorithm (something like Chris

> Thomasson's eventcount). And you don't know with what you will have to

> synchronize. With seq_cst fence, or with seq_cst rmw operation. Ooops!

> Currently you have to issue both - and fence, and rmw operation.

> I see clear advantage in better cooperation of seq_seq operations with

> the rest of the world.

Can you show the example in code so I can see where you feel that

both the fence and RMW operations are necessary?

>> seq_cst operations form a total order. On some architectures this is

>> very expensive to do (imagine a distributed system). Forcing this

>> total order to add synchronizes-with edges increases that cost even

>> more, as the other non-seq-cst operations now have to participate in

>> the order.

>

> But they have anyway!

> Because seq_cst store is release, and seq_cst load is acquire. So all

> operations anyway have to complete before seq_cst store, because other

> thread can issue load-acquire which will load value stored by seq_cst

> store. And similar for seq_cst load.

Yes, but that's a pairwise ordering: the store-release vs the

load-acquire. If the compiler can see that there is no such ordering,

it can eliminate the cost. Also, it is only synchronization between

two threads, not globally.

> Maybe I'm missing something, but, according to my current

> understanding, what I'm proposing will work anyway, it's just formally

> not legal.

I believe you are wrong. I gave an example above.

>> >> > Here is the same story.

>>

>> >> Yes. As Peter already said, relaxed operations can't participate in

>> >> synchronize-with relationships.

>>

>> > NO! Not relaxed operations. One seq_cst operation synchronizes-with

>> > another seq_cst operation. Relaxed operations are under standard

>> > rules.

>>

>> Your example was:

>>

>> std::atomic<int> x, y; // both initially 0

>> void thread1()

>> {

>> x.fetch_add(1, std::memory_order_seq_cst);

>> int r1 = y.load(std::memory_order_relaxed);}

>>

>> void thread2()

>> {

>> y.store(1, std::memory_order_relaxed);

>> std::atomic_thread_fence(std::memory_order_seq_cst);

>> int r2 = x.load(std::memory_order_relaxed);

>>

>> }

>>

>> As I understand it, you want the fence to "synchronize-with" the

>> fetch_add in some way, so that r1=r2=0 is prohibited.

>

> Yes.

>

>> Fences don't work that way. The fence only participates in

>> synchronizes-with if the load of y from thread1 is an acquire *and

>> sees the value stored by thread2*, or the load of x from thread2 is an

>> acquire *and sees the value stored by thread1*.

>

> I understand that they don't work that way. The question is - why they

> don't work that way? And can they work another way?

I don't see how you could write code that required them to work

another way. You would have no way of knowing whether your operations

had occurred before the fence or after, and thus whether or not it was

safe to read other data. If you introduced a variable to identify

whether or not the synchronization had occurred then you get the

behaviour you require under the current model anyway.

>> For the fence to force the ordering would require that the relaxed

>> operations were part of the total ordering: not only would the

>> runtime/hardware have to ensure that all the seq_cst operations

>> happened in a total order, but that all accesses to all variables that

>> occurred before or after a seq_cst operation were made visible to

>> other threads. This would be a major cost, as every seq_cst operation

>> would then essentially have to synchronize the entire data set of the

>> application.

>

> But seq_cst operations have to synchronize the entire data set of the

> application anyway!

No. They have to synchronize the seq_cst operations themselves, but

other than that they only have to synchronize the data between a

thread that does a release and a thread that does an acquire (even if

those release and acquire are consequences of seq_cst). They do not

force the compiler/hardware to synchronize all data with all threads

that perform seq_cst operations.

> If standard would not claim that seq_cst store is release and seq_cst

> load is acquire, than I would agree. Than seq_cst operations will

> cooperate only with other seq_cst operations, and not with release/

> acquire operations.

> But standard claims that seq_cst store is release and seq_cst load is

> acquire, so I can write something like this:

> void thread1()

> {

> non_atomic_data = 1;

> atomic_flag.store(1, seq_cst);

> }

> void thread2()

> {

> if (1 == atomic_flag.load(acquire))

> assert(1 == non_atomic_data);

> }

>

> or:

> void thread1()

> {

> non_atomic_data = 1;

> atomic_flag.store(1, release);

> }

> void thread2()

> {

> if (1 == atomic_flag.load(seq_cst))

> assert(1 == non_atomic_data);

> }

Yes. That's nice normal pairwise synchronization. Depending on

precisely what operations you use you can still get different results

from different threads if you mix seq_cst with acquire/release.

Aug 24, 2008, 6:20:35 AM8/24/08

to

On 20 авг, 11:11, Anthony Williams <anthony....@gmail.com> wrote:

> If every seq_cst operation synchronizes-with every other seq_cst

> operation, this affects the allowed compiler optimizations. It

> effectively makes every seq_cst operation a full fence, even if no

> other thread can see that operation:

>

> int global(0);

> std::atomic_int global_atomic(0);

>

> void thread1()

> {

> global=1;

> std::atomic_int ai;

> ai.store(1,std::memory_order_seq_cst);

> global_atomic.store(1,std::memory_order_relaxed);

>

> }

>

> void thread2()

> {

> if(global_atomic.load(std::memory_order_seq_cst)==1)

> {

> assert(global==1);

> }

>

> }

>

> Under the current memory model, ai is local so it can be completely

> eliminated. The stores to global and global_atomic are not ordered, so

> the assert in thread2 is actually a data race.

>

> Under your proposal, the store to ai cannot be elimiated since it must

> synchronize-with all other seq_cst operations. In particular, it must

> synchronize-with the seq_cst load in thread2 of an unrelated

> variable. Since the store to global_atomic is after the store to ai,

> if the load reads 1, the assert is no longer a data race, and will not

> fire either, since the value of global must be 1.

Well, yes. But trying to optimize for "thread local atomics" looks a

bit strange for me. It's like trying to optimize boost::shared_ptr to

the case when where is only one reference...

> > The problem that it's not always possible to consistently use only

> > seq_cst operations, or only not seq_cst operations. Current rules hit

> > composability. Assume you want to use component which uses seq_cst

> > operations, but you don't want to use seq_cst operations in the rest

> > of the code. You will have to insert additional atomic operations only

> > to integrate the component. Or assume that you have to integrate

> > component which uses seq_cst operations with another component which

> > doesn't use seq_cst operations. The same problem.

>

> That depends on how you write your components. Can you give an example?

Consider eventcount implementation:

http://groups.google.com/group/comp.programming.threads/msg/71eae0808af7e0ed

Function eventcount::signal_relaxed() tries to "reuse" heavy memory

fence already executed by message-passing queue's enqueue() function.

But we don't know what kind of "heavy memory fence" was executed. Is

it seq_cst fence? Or seq_cst atomic rmw? If seq_cst atomic rmw, then

on what memory location?

> I don't see how you could write code that required them to work

> another way. You would have no way of knowing whether your operations

> had occurred before the fence or after, and thus whether or not it was

> safe to read other data. If you introduced a variable to identify

> whether or not the synchronization had occurred then you get the

> behaviour you require under the current model anyway.

There a number of algorithms, for which it's enough.

For example, Peterson's mutex:

http://en.wikipedia.org/wiki/Peterson%27s_algorithm

or work-stealing deque:

http://www.bluebytesoftware.com/blog/CommentView,guid,1665653b-b5f3-49b4-8144-cfbc5e8c632b.aspx

of eventcount:

http://groups.google.com/group/comp.programming.threads/msg/71eae0808af7e0ed

Of course, all variables must be std::atomic<>, so formally it's

always safe to read them.

> No. They have to synchronize the seq_cst operations themselves, but

> other than that they only have to synchronize the data between a

> thread that does a release and a thread that does an acquire (even if

> those release and acquire are consequences of seq_cst). They do not

> force the compiler/hardware to synchronize all data with all threads

> that perform seq_cst operations.

But can synchronization between 2 threads be implemented more

efficiently than synchronization between all threads? I think that for

commodity SMP/multicore it's not the case. I.e. it's impossible to

implement synchronization between 2 threads more efficiently than

synchronization between all threads.

My standpoint here is that if standard can give some additional

guarantees, w/o incurring additional overheads, than it's worth doing

to give those guarantees. Because there can be algorithms which can

benefit from those guarantees.

Currently C++0x seq_cst atomic operations are not fully sequentially

consistent. Seq_cst operations in most current memory models are more

sequentially consistent, so to say. Java atomics, C# Interlocked

operations and x86 locked instructions, all provide guarantees that

I'm talking about. Their behavior is intuitive, and C++0x's behavior

is not intuitive.

Another moment. Current C++0x rules make it difficult to port existing

Java, C#, x86 and just abstract 'academia' algorithms to C++0x. Naive

approach to use seq_cst atomic RMW operations instead of Java (C#,

x86, abstract) atomic RMW operations will not always work. Instead of

Java atomic RMW operations one must write something like this:

atomic_var.fetch_add(1);

std::atomic_thread_fence(std::memory_order_seq_cst);

And it's not accurate modeling. I'm not sure whether this inaccuracy

can break some algorithms or not. But, still, in my opinion, C++0x's

seq_cst atomics expose not intuitive behavior.

Dmitriy V'jukov

Aug 24, 2008, 1:30:48 PM8/24/08

to

On Aug 24, 1:20 pm, "Dmitriy V'jukov" <dvyu...@gmail.com> wrote:

> Currently C++0x seq_cst atomic operations are not fully sequentially

> consistent.

They are.

> Another moment. Current C++0x rules make it difficult to port existing

> Java, C#, x86 and just abstract 'academia' algorithms to C++0x. Naive

> approach to use seq_cst atomic RMW operations instead of Java (C#,

> x86, abstract) atomic RMW operations will not always work.

Not correct.

Aug 24, 2008, 2:55:44 PM8/24/08

to

On 24 авг, 21:30, Peter Dimov <pdi...@gmail.com> wrote:

> On Aug 24, 1:20 pm, "Dmitriy V'jukov" <dvyu...@gmail.com> wrote:

>

> > Currently C++0x seq_cst atomic operations are not fully sequentially

> > consistent.

>

> They are.

> On Aug 24, 1:20 pm, "Dmitriy V'jukov" <dvyu...@gmail.com> wrote:

>

> > Currently C++0x seq_cst atomic operations are not fully sequentially

> > consistent.

>

> They are.

They consistent only with other seq_cst operations. And most current

MMs (Java/C#/x86) give stronger guarantees, that seq_cst operations

consistent with other non seq_cst operations too.

> > Another moment. Current C++0x rules make it difficult to port existing

> > Java, C#, x86 and just abstract 'academia' algorithms to C++0x. Naive

> > approach to use seq_cst atomic RMW operations instead of Java (C#,

> > x86, abstract) atomic RMW operations will not always work.

>

> Not correct.

Ok, I rephrase: It's easy to port any existing algorithm to C++0x

(just use seq_cst operations everywhere), but it's difficult to

achieve good performance at the same time. And performance is usually

the only reason why we use atomic operations.

Dmitriy V'jukov

Aug 24, 2008, 4:49:02 PM8/24/08

to

On Aug 24, 9:55 pm, "Dmitriy V'jukov" <dvyu...@gmail.com> wrote:

> On 24 Á×Ç, 21:30, Peter Dimov <pdi...@gmail.com> wrote:

>

> > On Aug 24, 1:20 pm, "Dmitriy V'jukov" <dvyu...@gmail.com> wrote:

>

> > > Currently C++0x seq_cst atomic operations are not fully sequentially

> > > consistent.

>

> > They are.

>

> They consistent only with other seq_cst operations. And most current

> MMs (Java/C#/x86) give stronger guarantees, that seq_cst operations

> consistent with other non seq_cst operations too.

> On 24 Á×Ç, 21:30, Peter Dimov <pdi...@gmail.com> wrote:

>

> > On Aug 24, 1:20 pm, "Dmitriy V'jukov" <dvyu...@gmail.com> wrote:

>

> > > Currently C++0x seq_cst atomic operations are not fully sequentially

> > > consistent.

>

> > They are.

>

> They consistent only with other seq_cst operations. And most current

> MMs (Java/C#/x86) give stronger guarantees, that seq_cst operations

> consistent with other non seq_cst operations too.

Why do you think that the Java MM gives stronger guarantees than the C+

+MM?

Aug 25, 2008, 1:54:03 AM8/25/08

to

On 25 авг, 00:49, Peter Dimov <pdi...@gmail.com> wrote:

> > > > Currently C++0x seq_cst atomic operations are not fully sequentially

> > > > consistent.

>

> > > They are.

>

> > They consistent only with other seq_cst operations. And most current

> > MMs (Java/C#/x86) give stronger guarantees, that seq_cst operations

> > consistent with other non seq_cst operations too.

>

> Why do you think that the Java MM gives stronger guarantees than the C+

> +MM?

I'm not 100% sure. But I believe that Java/CLI MMs do give stronger

guarantees.

In JSR-133 Cookbook Doug Lea defines possible reorderings and fences

in terms of Sparc membars:

http://gee.cs.oswego.edu/dl/jmm/cookbook.html

And #StoreLoad emitted between volatile store and volatile load do

give stronger guarantees.

CLI MM is very informal. But as I understand intended reading is that,

if you can't construct sequentially consistent interleaving of threads

for some behavior, then this behavior is prohibited.

I need to read more!

Dmitriy V'jukov

Aug 25, 2008, 3:19:10 AM8/25/08

to

On Aug 25, 8:54 am, "Dmitriy V'jukov" <dvyu...@gmail.com> wrote:

> I'm not 100% sure. But I believe that Java/CLI MMs do give stronger

> guarantees.

Is there a (Java) example that can demonstrate that?

> In JSR-133 Cookbook Doug Lea defines possible reorderings and fences

> in terms of Sparc membars:http://gee.cs.oswego.edu/dl/jmm/cookbook.html

> And #StoreLoad emitted between volatile store and volatile load do

> give stronger guarantees.

Well, it doesn't really matter what one particular Java VM on SPARC

does. :-) The guarantees are given by the Java memory model, not by

the hardware or the cookbook.

(SPARC gives pretty strong guarantees since it's RCsc even in its RMO

mode, and this is mainly of theoretical importance since in practice

it's always TSO.)

Aug 25, 2008, 4:07:17 AM8/25/08

to

On Aug 25, 8:54 am, "Dmitriy V'jukov" <dvyu...@gmail.com> wrote:

> CLI MM is very informal. But as I understand intended reading is that,

> if you can't construct sequentially consistent interleaving of threads

> for some behavior, then this behavior is prohibited.

This would disallow the non-TSO example, and it's allowed on x86. I

remember reading that the first version of the CLR MM disallowed it,

and it was later changed to allow it. In either case, I don't think

that this SC reasoning applies to x86.

Aug 25, 2008, 6:05:13 AM8/25/08

to

Okay, I don't have arguments any more.

I only pointed out to some moments which in my opinion is not

intuitive or maybe can be improved. If everybody except me think that

everything is okay, than no problem. It's far more likely that I am

just wrong. I will just try to commit to my memory current rules.

Dmitriy V'jukov

Aug 25, 2008, 8:00:31 AM8/25/08

to

On Aug 25, 1:05 pm, "Dmitriy V'jukov" <dvyu...@gmail.com> wrote:

> I only pointed out to some moments which in my opinion is not

> intuitive or maybe can be improved.

Your comments are appreciated. The new seq_cst fence definition is

likely to change from

"Sequentially consistent fences participate in the single total order

S over all memory_order_seq_cst operations (29.1/2). Each sequentially

consistent fence synchronizes with the sequentially consistent fence

that follows it in S."

to

" If a memory_order_seq_cst fence F is sequenced before an atomic

operation A on an object M, A observes either the last

memory_order_seq_cst modification of M preceding F in the total order

S, or a later modification of M in its modification order.

If an atomic modification operation A of an object M is sequenced

before a memory_order_seq_cst fence F, and a memory_order_seq_cst

operation B on M follows F in S, B observes either the effects of A on

M, or a later modification of M in its modification order.

If an atomic modification operation A of an object M is sequenced

before a memory_order_seq_cst fence Fa, a memory_order_seq_cst fence

Fb is sequenced before an atomic operation B on M, and Fb follows Fa

in S, B observes either the effects of A on M, or a later modification

on M in its modification order."

partly in response to your pointing out that seq_cst loads and stores

do not interact well with seq_cst fences.

Sadly, this new definition still doesn't fix the problem with

Peterson's algorithm.

Aug 26, 2008, 1:23:30 PM8/26/08

to

Great!

As I see this allows following code:

thread1()

{

store-relaxed X, 1

fence-seq-cst

r1 = load-relaxed Y

}

thread2()

{

store-seq-cst Y, 1

r2 = load-seq-cst X

}

in the sense that (r1 == 0) && (r2 == 0) is prohibited.

Informally, relaxed stores and loads can be used 'around' seq-cst

fence. Right?

I think this can be beneficial to (1) architectures where seq-cst

fence is cheaper than seq-cst operation (Sparc), (2) architectures

where store-release and load-acquire is not free (Itanium), (3)

algorithms where we can replace several seq-cst stores with one seq-

cst fence.

Great!

Dmitriy V'jukov

Sep 20, 2008, 4:14:05 AM9/20/08

to

On 25 авг, 11:19, Peter Dimov <pdi...@gmail.com> wrote:

> (SPARC gives pretty strong guarantees since it's RCsc even in its RMO

> mode, and this is mainly of theoretical importance since in practice

> it's always TSO.)

64-bit Linux runs Sparc in RMO mode!

Dmitriy V'jukov

Reply all

Reply to author

Forward

0 new messages