Memory Barriers, Compiler Optimizations, etc.

Scott Meyers

Feb 1, 2005, 11:38:08 PM

I've encountered documents (including, but not limited to, postings on this
newsgroup) suggesting that acquire barriers and release barriers are not
standalone entities but are instead associated with loads and stores,
respectively. Furthermore, making them standalone seems to change
semantics (see below). Yet APIs for inserting them via languages like C or
C++ seem to deal with barriers alone -- there are no associated reads or
writes. (See, for example,
http://msdn.microsoft.com/library/default.asp?url=/library/en-us/vclang/html/vclrf_readwritebarrier.asp.)
So I have two questions about this. First, are acquire/release properly
part of loads/stores, or does it make sense for them to be standalone? If
the former, how are programmers in languages like C/C++ expected to make
the association between reads/writes and memory barriers?

Next, is it reasonable to assume that compilers will recognize memory
barrier instructions and not perform code motion that is contrary to their
meaning? For example:

x = a;
insertAcquireBarrier(); // or x.acquire = 1 if the barrier
// should not be standalone
y = b;

Assuming that x, y, a, and b are all distinct locations, is it reasonable
to assume that no compiler will move the assignment to y above the barrier,
or is it necessary to declare x and y volatile to prevent such code motion?

Finally, is the following reasoning (prepared for another purpose, but
usable here, I hope) about the semantics of memory barriers correct?

Based on the ISCA '90 paper introducing release consistency, I think of an
acquire barrier as a way of saying "I'm about to enter a critical section,"
and a release barrier as a way of saying "I'm about to leave a critical
section." So consider this situation, where we want to ensure that
memory location 1 is accessed before memory location 2:

Access memory location 1
Announce entry to critical section // acquire barrier
Announce exit from critical section // release barrier
Access memory location 2

We have to prevent stuff from moving out of the critical section, but
there's no reason to keep stuff from moving into it. That is, if x is a
shared variable, we need to access it only within a critical section, but
if y is thread-local, compilers can perform code motion to move access of y
into the critical section without harm (except that the critical section is
now going to take longer to execute).

Neither access above is inside the critical section, so both can be moved:

Announce entry to critical section
Access memory location 1 // moved this down
Access memory location 2 // moved this up
Announce exit from critical section

But within a critical section, instructions can be reordered at will, as
long as they are independent. So let's assume that the two memory
locations are independent. That makes this reordering possible:

Announce entry to critical section
Access memory location 2
Access memory location 1
Announce exit from critical section

And now we're hosed.

On the other hand, if the memory barriers are part of the loads/stores, we
have this:

acquire & access memory location 1
access memory location 2

Because you can't move subsequent accesses up above an acquire (i.e. you
can't move something out of a critical section), you're guaranteed that
location 1 must be accessed before location 2.

Thanks for all clarifications,

Scott

Gianni Mariani

Feb 2, 2005, 2:14:30 AM

Scott Meyers wrote:
...

>
> x = a;
> insertAcquireBarrier(); // or x.acquire = 1 if the barrier
> // should not be standalone
> y = b;
>
> Assuming that x, y, a, and b are all distinct locations, is it reasonable
> to assume that no compiler will move the assignment to y above the barrier,
> or is it necessary to declare x and y volatile to prevent such code motion?

There is no standard; however, GCC (and, I believe, MSVC) provides
non-standard mechanisms to ensure that code is not moved across the acquire
barrier (through the way the barrier function is defined).
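
For reference, a minimal sketch of the kind of non-standard mechanism meant here. The macro name COMPILER_BARRIER and the helpers copy_in_order and check_copy are made up for this illustration; the empty asm statement with a "memory" clobber (GCC) and the _ReadWriteBarrier intrinsic (MSVC) are the existing facilities. Both constrain only the compiler. Neither emits a processor fence, so on its own this says nothing about what other CPUs observe.

```cpp
#include <cassert>

#if defined(_MSC_VER)
#include <intrin.h>
// MSVC intrinsic: the optimizer may not move memory accesses across it.
#define COMPILER_BARRIER() _ReadWriteBarrier()
#else
// GCC/Clang: an empty asm statement that claims to clobber memory, so the
// compiler may not keep values cached in registers or move memory accesses
// across it.
#define COMPILER_BARRIER() __asm__ __volatile__("" ::: "memory")
#endif

int a = 1, b = 2;  // globals, so other code could in principle observe them
int x = 0, y = 0;

// Mirrors the "x = a; barrier; y = b;" example from the question.
void copy_in_order() {
    x = a;
    COMPILER_BARRIER();  // compiler-level only: no fence instruction emitted
    y = b;
}

// Single-threaded sanity check of the helper above.
void check_copy() {
    copy_in_order();
    assert(x == a && y == b);
}
```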

>
> Finally, is the following reasoning (prepared for another purpose, but
> usable here, I hope) about the semantics of memory barriers correct?
>
> Based on the ISCA '90 paper introducing release consistency, I think of an
> acquire barrier as a way of saying "I'm about to enter a critical section,"
> and a release barrier as a way of saying "I'm about to leave a critical
> section." So consider this situation, where we want to ensure that
> memory location 1 is accessed before memory location 2:
>
> Access memory location 1
> Announce entry to critical section // acquire barrier
> Announce exit from critical section // release barrier
> Access memory location 2
>
> We have to prevent stuff from moving out of the critical section, but
> there's no reason to keep stuff from moving into it. That is, if x is a
> shared variable, we need to access it only within a critical section, but
> if y is thread-local, compilers can perform code motion to move access of y
> into the critical section without harm (except that the critical section is
> now going to take longer to execute).

I'd be very surprised to see the compiler violate the sequence of the
memory barrier.

>
> Neither access above is inside the critical section, so both can be moved:
>
> Announce entry to critical section
> Access memory location 1 // moved this down
> Access memory location 2 // moved this up
> Announce exit from critical section
>
> But within a critical section, instructions can be reordered at will, as
> long as they are independent. So let's assume that the two memory
> locations are independent. That makes this reordering possible:
>
> Announce entry to critical section
> Access memory location 2
> Access memory location 1
> Announce exit from critical section
>
> And now we're hosed.

The compiler (or the code) would be broken if it did that.

>
> On the other hand, if the memory barriers are part of the loads/stores, we
> have this:
>
> acquire & access memory location 1
> access memory location 2
>
> Because you can't move subsequent accesses up above an acquire (i.e. you
> can't move something out of a critical section), you're guaranteed that
> location 1 must be accessed before location 2.

All acquire does is guarantee that any load (memory fetch) operations,
possibly many, that were requested before the barrier instruction
are completed before any subsequent memory fetch operations.

Similarly, release guarantees that all memory store operations before
the release barrier instruction are made visible to other threads
(CPUs) before any memory store operations after the release instruction.

It's more like a sequence point.

volatile int v1 = BAD;
volatile bool done = false;

reader:
a: bool is_done = done;
b: acquire();
c: if ( is_done ) play_with( v1 );

writer:
x: v1 = GOOD;
y: release();
z: done = true;

It's more like synchronizing points where the order of memory
modifications must remain consistent with memory load operations.

In this case b: guarantees that the load of v1 (c:) must happen after the
load of done (a:), and y: guarantees that the store to v1 (x:) must happen
before the store to done (z:).

Hence, the reader thread will never see the value of v1==BAD when done
is true.
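
As a rough sketch, the reader/writer example above could be spelled with the atomics that were later standardized in C++11 (an assumption relative to this thread, which predates them). Standalone fences take the place of acquire() and release(); the enum Value and the driver functions run_once and check_rounds are invented for the illustration.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

enum Value { BAD, GOOD };

int v1 = BAD;                   // plain data, published by the writer
std::atomic<bool> done{false};  // flag, accessed with relaxed operations only

void writer() {
    v1 = GOOD;                                            // x:
    std::atomic_thread_fence(std::memory_order_release);  // y:
    done.store(true, std::memory_order_relaxed);          // z:
}

// Returns true if the reader saw a consistent state.
bool reader() {
    bool is_done = done.load(std::memory_order_relaxed);  // a:
    std::atomic_thread_fence(std::memory_order_acquire);  // b:
    if (is_done) return v1 == GOOD;                       // c:
    return true;  // flag not set yet, so v1 is not read at all
}

// Runs one writer and one reader concurrently.
bool run_once() {
    v1 = BAD;
    done.store(false, std::memory_order_relaxed);
    bool ok = false;
    std::thread w(writer);
    std::thread r([&ok] { ok = reader(); });
    w.join();
    r.join();
    return ok;
}

// The reader must never observe done == true together with v1 == BAD.
void check_rounds(int rounds) {
    for (int i = 0; i < rounds; ++i) assert(run_once());
}
```

The fences here are standalone, but they only acquire meaning through the relaxed load and store of done that sit next to them.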

SenderX

Feb 2, 2005, 3:47:54 AM

> So I have two questions about this. First, are acquire/release properly
> part of loads/stores, or does it make sense for them to be standalone? If
> the former, how are programmers in languages like C/C++ expected to make
> the association between reads/writes and memory barriers?

Ok. Well, acquire/release semantics work very well for confining the
visibility of loads and stores to a critical section. They can also be used
to send objects between processors via a producer/consumer relationship. You
should wrap up the load/store and membar in a single function. These
functions will work for this kind of stuff most of the time; you need to
study the specs for your specific compiler...

I will use Alex's notation for the code... ;)

/*
sink-store barrier
extern void* ac_cpu_i686_mb_store_ssb
( void**, void* )
*/
align 16
ac_cpu_i686_mb_store_ssb PROC
mov ecx, [esp + 4]
mov eax, [esp + 8]
sfence
mov [ecx], eax
ret
ac_cpu_i686_mb_store_ssb ENDP

/*
hoist-load barrier with dd "hint"
extern void* ac_cpu_i686_mb_load_ddhlb
( void** )
*/
align 16
ac_cpu_i686_mb_load_ddhlb PROC
mov ecx, [esp + 4]
mov eax, [ecx]
lfence
ret
ac_cpu_i686_mb_load_ddhlb ENDP

/*
classic release (slb+ssb -- see below)
extern void* ac_cpu_i686_mb_store_rel
( void**, void* )
*/
align 16
ac_cpu_i686_mb_store_rel PROC
mov ecx, [esp + 4]
mov eax, [esp + 8]
mfence
mov [ecx], eax
ret
ac_cpu_i686_mb_store_rel ENDP

/*
acquire with data dependency
extern void* ac_cpu_i686_mb_load_ddacq
( void** )
*/
align 16
ac_cpu_i686_mb_load_ddacq PROC
mov ecx, [esp + 4]
mov eax, [ecx]
mfence
ret
ac_cpu_i686_mb_load_ddacq ENDP

/* DCL pseudo-code using fine-grain barriers */

static T *shared = 0;

T *local = ac_cpu_i686_mb_load_ddhlb( &shared );
if ( ! local )
{
    ac_mutex_lock( &static_mutex );
    if ( ! ( local = shared ) )
    { local = ac_cpu_i686_mb_store_ssb( &shared, new T ); }
    ac_mutex_unlock( &static_mutex );
}

/* DCL pseudo-code using coarse barriers */

static T *shared = 0;

T *local = ac_cpu_i686_mb_load_ddacq( &shared );
if ( ! local )
{
    ac_mutex_lock( &static_mutex );
    if ( ! ( local = shared ) )
    { local = ac_cpu_i686_mb_store_rel( &shared, new T ); }
    ac_mutex_unlock( &static_mutex );
}

See how memory barriers can be embedded in the correct place within loads
and stores to create a sort of producer/consumer relationship wrt common
shared data? Also, combining all of this in a single externally assembled
function can cut down on the chances of a rogue compiler reordering your
"critical-sequence" under your nose, and your application crashing seven or
eight months down the line from some mystery race condition...
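
For comparison, a hedged sketch of the same DCL pattern using the atomics later standardized in C++11 instead of external assembly. This is an assumption relative to the thread, which predates them. The payload struct T, the function instance and the check helper are invented for the illustration; the load-acquire plays the role of the load wrapper and the store-release the role of the store wrapper.

```cpp
#include <atomic>
#include <cassert>
#include <mutex>
#include <thread>

struct T { int value = 42; };  // stand-in payload type

std::atomic<T*> shared{nullptr};
std::mutex static_mutex;

T* instance() {
    // First check: the load-acquire pairs with the store-release below, so a
    // non-null pointer implies the pointed-to T is fully constructed.
    T* local = shared.load(std::memory_order_acquire);
    if (!local) {
        std::lock_guard<std::mutex> lock(static_mutex);
        // Second check, under the mutex; the mutex already orders this load.
        local = shared.load(std::memory_order_relaxed);
        if (!local) {
            local = new T;  // intentionally never freed in this sketch
            shared.store(local, std::memory_order_release);
        }
    }
    return local;
}

// Two threads racing through instance() must end up with the same object.
void check_same_instance() {
    T* p1 = nullptr;
    T* p2 = nullptr;
    std::thread t1([&p1] { p1 = instance(); });
    std::thread t2([&p2] { p2 = instance(); });
    t1.join();
    t2.join();
    assert(p1 != nullptr && p1 == p2);
}
```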

;)

Alexander Terekhov

Feb 2, 2005, 7:01:50 AM

Scott Meyers wrote:
>
> I've encountered documents (including, but not limited to, postings on this
> newsgroup) suggesting that acquire barriers and release barriers are not
> standalone entities but are instead associated with loads and stores,
> respectively.

Not quite. E.g. see m_lock_status.store_conditional(new_status, msync) below.

// doesn't provide "POSIX-safety" with respect to destruction
class mutex_for_XBOX_NEXT { // noncopyable

  atomic<int> m_lock_status; // 0: free, 1/-1: locked/contention
  auto_reset_event m_retry_event; // prohibitively slow bin.sema/gate

  // "new" is a C++ keyword, hence the parameter name new_status
  template<typename T>
  int attempt_update(int old, int new_status, T msync) {
    while (!m_lock_status.store_conditional(new_status, msync)) {
      int fresh = m_lock_status.load_reserved(msync::none);
      if (fresh != old)
        return fresh;
    }
    return old;
  }

public:

  // ctor/dtor [w/o lazy event init]

  bool trylock() throw() {
    return !(m_lock_status.load_reserved(msync::none) ||
             attempt_update(0, 1, msync::acq));
  }

  // bool timedlock() omitted for brevity

  void lock() throw() {
    int old = m_lock_status.load_reserved(msync::none);
    // assignments parenthesized: "a || b = c" would parse as "(a || b) = c"
    if (old || (old = attempt_update(0, 1, msync::acq))) {
      do {
        while (old < 0 ||
               (old = attempt_update(1, -1, msync::acq))) {
          m_retry_event.wait();
          old = m_lock_status.load_reserved(msync::none);
          if (!old) break;
        }
      } while (old = attempt_update(0, -1, msync::acq));
    }
  }

  void unlock() throw() {
    if (m_lock_status.load_reserved(msync::none) < 0 ||
        attempt_update(1, 0, msync::rel) < 0) { // or just !SC
      m_lock_status.store(0, msync::rel);
      m_retry_event.set();
    }
  }

};

> Furthermore, making them standalone seems to change
> semantics (see below). Yet APIs for inserting them via languages like C or
> C++ seem to deal with barriers alone -- there are no associated reads or
> writes. (See, for example,
> http://msdn.microsoft.com/library/default.asp?url=/library/en-us/vclang/html/vclrf_readwritebarrier.asp.)

Yeah.

> So I have two questions about this. First, are acquire/release properly
> part of loads/stores, or does it make sense for them to be standalone?

It's part of "operation".

> If
> the former, how are programmers in languages like C/C++ expected to make
> the association between reads/writes and memory barriers?

std::atomic<>

>
> Next, is it reasonable to assume that compilers will recognize memory
> barrier instructions and not perform code motion that is contrary to their
> meaning?

Yep.

> For example:
>
> x = a;
> insertAcquireBarrier(); // or x.acquire = 1 if the barrier
> // should not be standalone
> y = b;
>
> Assuming that x, y, a, and b are all distinct locations, is it reasonable
> to assume that no compiler will move the assignment to y above the barrier,
> or is it necessary to declare x and y volatile to prevent such code motion?

I guess you mean

...
x = a;
y.store(b, msync::rel);

It is necessary to have a compiler capable of understanding atomic<>
and the unidirectional reordering constraint associated with its store(T,
msync::rel_t) member function.
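
A minimal sketch of that idea, written with the std::atomic interface that was eventually standardized in C++11; std::memory_order_release stands in for msync::rel here, which is an assumption about the intended mapping. The functions producer, consumer and demo are invented for the illustration.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

int x = 0;              // ordinary data
std::atomic<int> y{0};  // the release is attached to the store to this object

// b must be non-zero, because the consumer waits for y to become non-zero.
void producer(int a, int b) {
    x = a;                                  // may not sink below the store to y
    y.store(b, std::memory_order_release);  // one-way constraint: earlier
                                            // accesses stay above this store
}

// Spins until y is published, then returns the value of x made visible by it.
int consumer() {
    while (y.load(std::memory_order_acquire) == 0) {
        std::this_thread::yield();
    }
    return x;  // sees the write that preceded the release store
}

void demo() {
    int seen = 0;
    std::thread c([&seen] { seen = consumer(); });
    std::thread p(producer, 7, 1);
    p.join();
    c.join();
    assert(seen == 7);
}
```

The constraint is unidirectional: accesses that follow the release store in program order may still be moved above it.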

[...]

> But within a critical section, instructions can be reordered at will, as
> long as they are independent. So let's assume that the two memory
> locations are independent. That makes this reordering possible:
>
> Announce entry to critical section
> Access memory location 2
> Access memory location 1
> Announce exit from critical section
>
> And now we're hosed.

What do you mean?

>
> On the other hand, if the memory barriers are part of the loads/stores,

No. Acquire is part of "Announce entry to critical section" operation
and release is part of "Announce exit from critical section" thing.
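
To make that concrete, a small sketch of a test-and-set spinlock in later C++11 terms (an assumption relative to this thread; it is a far simpler lock than the mutex above). The acquire is attached to the operation that enters the critical section and the release to the operation that leaves it. SpinLock, add_many and run_two_threads are names made up for the illustration.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

class SpinLock {
    std::atomic<bool> locked_{false};
public:
    void lock() {
        // Acquire is part of the "enter" operation: later accesses cannot
        // move above the exchange that succeeds in taking the lock.
        while (locked_.exchange(true, std::memory_order_acquire)) {
            std::this_thread::yield();
        }
    }
    void unlock() {
        // Release is part of the "exit" operation: earlier accesses cannot
        // move below this store.
        locked_.store(false, std::memory_order_release);
    }
};

SpinLock guard;
long counter = 0;  // plain variable, protected only by guard

void add_many(int n) {
    for (int i = 0; i < n; ++i) {
        guard.lock();
        ++counter;
        guard.unlock();
    }
}

// Two threads increment the counter n times each under the lock.
long run_two_threads(int n) {
    counter = 0;
    std::thread t1(add_many, n);
    std::thread t2(add_many, n);
    t1.join();
    t2.join();
    assert(counter == 2L * n);
    return counter;
}
```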

regards,
alexander.

Joseph Seigh

Feb 2, 2005, 7:43:15 AM

On Tue, 1 Feb 2005 20:38:08 -0800, Scott Meyers <Use...@aristeia.com> wrote:

> I've encountered documents (including, but not limited to, postings on this
> newsgroup) suggesting that acquire barriers and release barriers are not
> standalone entities but are instead associated with loads and stores,
> respectively. Furthermore, making them standalone seems to change
> semantics (see below). Yet APIs for inserting them via languages like C or
> C++ seem to deal with barriers alone -- there are no associated reads or
> writes. (See, for example,
> http://msdn.microsoft.com/library/default.asp?url=/library/en-us/vclang/html/vclrf_readwritebarrier.asp.)
> So I have two questions about this. First, are acquire/release properly
> part of loads/stores, or does it make sense for them to be standalone? If
> the former, how are programmers in languages like C/C++ expected to make
> the association between reads/writes and memory barriers?

Memory barriers aren't directly observable, so you have to define them
in terms of their effect on stuff that is observable, e.g. reads and
writes mainly.

>
> Next, is it reasonable to assume that compilers will recognize memory
> barrier instructions and not perform code motion that is contrary to their
> meaning? For example:
>
> x = a;
> insertAcquireBarrier(); // or x.acquire = 1 if the barrier
> // should not be standalone
> y = b;

s/will/should/

Yes.

>
> Assuming that x, y, a, and b are all distinct locations, is it reasonable
> to assume that no compiler will move the assignment to y above the barrier,
> or is it necessary to declare x and y volatile to prevent such code motion?

Hypothetically, yes. Volatile wouldn't help as it has no meaning for
threads. If the variables are only known to the local scope, i.e. they're
not external and have not had an address taken, then the compiler can move them
wherever it wants since no other thread can see them. It might be nice
to have a new attribute like "shared" rather than volatile to start with
a clean slate. "shared" would actually have to have a real definition
relevant to thread behavior, not whatever the implementation feels like,
which is the case with volatile.

For the thread that executed that critical section, the accesses always
appear to have happened in program logical order. Any reordering by
the compiler and processor is supposed to be transparent to that thread.
As far as what other threads can see, the order of accesses done without
a lock is undefined, and your example has them done while not holding
the lock. Note that there are two sets of accesses done by two
different threads. If only one thread uses a lock, you still
have a problem determining the order of accesses by the other threads
if they don't do the accesses using the same lock.

The problem with discussing what should be happening here is that POSIX
never formally defined semantics for synchronization. You develop a
fairly good idea after doing threaded programming for a while, though
some still seem to be off a bit. I made an attempt at a formal definition
here http://groups.google.com/groups?threadm=3A111C5A.A49B55CA%40genuity.com
which maybe you can take a look at. It might give you a sense of what some
of the issues are. I've redone the memory visibility definition, so what
I have now is substantially different. It also attempts to define other
synchronization constructs. It's unfinished at this point since it takes
a lot of concentration to work on it, formal semantics doesn't seem to
be a high priority with anyone, and I already have a good idea of what
the semantics probably are.

>Thanks for all clarifications,
>
> Scott
>

--
Joe Seigh

Alexander Terekhov

Feb 2, 2005, 7:39:11 AM

SenderX wrote:
[...]

> ac_cpu_i686_mb_store_ssb PROC
> mov ecx, [esp + 4]
> mov eax, [esp + 8]
> sfence
> mov [ecx], eax
> ret
> ac_cpu_i686_mb_store_ssb ENDP

Compiler reordering aside for a moment, ordinary cpu_i686's stores
have release semantics (ssb+slb). sfence (nop.ssb+hsb) is not needed.
And, BTW, ordinary cpu_i686's loads have "full" acquire semantics
(hsb+hlb).

regards,
alexander.

SenderX

Feb 2, 2005, 9:31:54 AM

"Alexander Terekhov" <tere...@web.de> wrote in message
news:4200C9EF...@web.de...

Yeah, I should just comment them out for documentation.

align 16

ac_cpu_i686_mb_store_ssb PROC
mov ecx, [esp + 4]
mov eax, [esp + 8]

; sfence may be needed here on future x86 CPUs

mov [ecx], eax
ret
ac_cpu_i686_mb_store_ssb ENDP

I am still wondering about Intel stuff being able to reorder a write
followed by a read to a different location...

Alexander Terekhov

Feb 2, 2005, 9:30:23 AM

Just to clarify...

Alexander Terekhov wrote:
[...]

> void lock() throw() {
> int old = m_lock_status.load_reserved(msync::none);
> if (old || (old = attempt_update(0, 1, msync::acq))) {
> do {
> while (old < 0 ||
> (old = attempt_update(1, -1, msync::acq))) {

^^^^^^^^^^

Acq above is needed in order to ensure proper ordering with respect
to "semaphore lock" operation on m_retry_event below. Lock status
transition 1 -> -1 must complete before "semaphore lock" operation
will take place (otherwise a "deadlock" can arise).

> m_retry_event.wait();
> old = m_lock_status.load_reserved(msync::none);

^^^^^^^^^^^

Here proper ordering is ensured by semaphore lock operation
(m_retry_event.wait()) which is meant to provide acquire semantics
(m_lock_status.load_reserved(msync::none) must complete after
semaphore lock operation).

> if (!old) break;
> }
> } while (old = attempt_update(0, -1, msync::acq));
> }
> }
>
> void unlock() throw() {
> if (m_lock_status.load_reserved(msync::none) < 0 ||
> attempt_update(1, 0, msync::rel) < 0) { // or just !SC
> m_lock_status.store(0, msync::rel);
> m_retry_event.set();

^^^^^^^^^^^^^^^^^^^

Ordering here is also important. m_retry_event.set() (semaphore
unlock operation) is meant to provide release semantics
(m_lock_status.store(0, msync::rel) must complete before semaphore
unlock operation).

> }
> }
>
> };

regards,
alexander.

Alexander Terekhov

Feb 2, 2005, 9:49:57 AM

SenderX wrote:
[...]

> I am still wondering about Intel stuff being able to reorder a write
> followed by a read to a different location...

That's because (think of op(X) as "store" and op(Y) as "load" to/from
different locations)

op(X).release
op(Y).acquire

doesn't prevent

op(Y).acquire
op(X).release

reordering.

In the case of lock operations (now think IA64), reordering must not
induce deadlocks (lock operation must not "suspend" preceding
releases) but critical regions can "overlap" for better performance.

regards,
alexander.

SenderX

Feb 2, 2005, 6:07:53 PM

> That's because (think of op(X) as "store" and op(Y) as "load" to/from
> different locations)
>
> op(X).release
> op(Y).acquire
>
> doesn't prevent
>
> op(Y).acquire
> op(X).release
>
> reordering.

op(X).release
mf
op(Y).acquire

should prevent op(Y)'s effects on shared memory from becoming visible before
op(X)?

>
> In the case of lock operations (now think IA64), reordering must not
> induce deadlocks (lock operation must not "suspend" preceding
> releases) but critical regions can "overlap" for better performance.

I see.

Alexander Terekhov

Feb 2, 2005, 6:41:50 PM

SenderX wrote:
[...]

> op(X).release
> mf
> op(Y).acquire
>
> should prevent op(Y)'s effects on shared memory from becoming visible before
> op(X)?

Yes. BTW, revised Java volatiles and JSR-166 atomics are required to
do that. ("Java memory model traded performance for simplicity in a
few cases".)
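
A hedged sketch of the store-then-load case in later C++11 terms (an assumption relative to this thread), with a sequentially consistent fence standing in for the "mf". Without the two fences, release/acquire alone would allow both threads to read 0; with them, at least one thread has to see the other's store. The names thread_a, thread_b, run_once and check_rounds are invented for the illustration.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

std::atomic<int> X{0}, Y{0};
int r1 = 0, r2 = 0;  // each is written by one thread, read after join

void thread_a() {
    X.store(1, std::memory_order_release);                // op(X).release
    std::atomic_thread_fence(std::memory_order_seq_cst);  // the "mf"
    r1 = Y.load(std::memory_order_acquire);               // op(Y).acquire
}

void thread_b() {
    Y.store(1, std::memory_order_release);
    std::atomic_thread_fence(std::memory_order_seq_cst);
    r2 = X.load(std::memory_order_acquire);
}

// Returns true if at least one thread saw the other's store.
bool run_once() {
    X.store(0);
    Y.store(0);
    std::thread a(thread_a);
    std::thread b(thread_b);
    a.join();
    b.join();
    return r1 == 1 || r2 == 1;
}

// The outcome r1 == 0 && r2 == 0 must never show up while the fences are there.
void check_rounds(int rounds) {
    for (int i = 0; i < rounds; ++i) assert(run_once());
}
```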

regards,
alexander.

Markus Schaaf

Feb 3, 2005, 10:00:24 AM

"Scott Meyers" <Use...@aristeia.com> wrote:

> I've encountered documents (including, but not limited to, postings on this
> newsgroup) suggesting that acquire barriers and release barriers are not
> standalone entities but are instead associated with loads and stores,
> respectively. Furthermore, making them standalone seems to change
> semantics (see below). Yet APIs for inserting them via languages like C or
> C++ seem to deal with barriers alone -- there are no associated reads or
> writes. (See, for example,
> http://msdn.microsoft.com/library/default.asp?url=/library/en-us/vclang/html/vclrf_readwritebarrier.asp.)

It seems worth noting that acquire and release barriers are just one variety
of memory visibility model. If you are concerned with broader aspects of
memory visibility, say, from the perspective of a general-purpose programming
language, it might be useful to widen the focus.

Also, you may not be aware that Microsoft's _ReadWriteBarrier is just a
flag to the optimizer, similar to volatile. It doesn't insert any kind of
memory barrier at the processor level. There are other intrinsics that do, but
these are processor-specific, like inline assembly language, and surely not
intended for ordinary programming.
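
The same distinction later showed up in standard C++11 (an assumption relative to this thread): std::atomic_signal_fence only restrains the compiler, which makes it roughly comparable to an optimizer-only barrier, while std::atomic_thread_fence also orders the accesses as seen by other processors. A small sketch, with function names invented for the illustration:

```cpp
#include <atomic>
#include <cassert>

int data = 0;
std::atomic<int> flag{0};

// Compiler-only ordering: the compiler may not move the write to data below
// the store to flag, but no fence instruction is issued for the processor.
void publish_compiler_only(int v) {
    data = v;
    std::atomic_signal_fence(std::memory_order_release);
    flag.store(1, std::memory_order_relaxed);
}

// Compiler and processor ordering: on weakly ordered hardware this is where
// a real fence instruction would be emitted.
void publish_with_hardware_fence(int v) {
    data = v;
    std::atomic_thread_fence(std::memory_order_release);
    flag.store(1, std::memory_order_relaxed);
}

// Single-threaded check: both variants perform the same stores.
void demo() {
    publish_compiler_only(1);
    assert(data == 1 && flag.load(std::memory_order_relaxed) == 1);
    publish_with_hardware_fence(2);
    assert(data == 2);
}
```

Only the second variant is suitable for publishing data to another thread; the first is enough only when the observer runs on the same thread, for example in a signal handler.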

Alexander Terekhov

Feb 3, 2005, 10:33:17 AM

Markus Schaaf wrote:
[...]

> > http://msdn.microsoft.com/library/default.asp?url=/library/en-us/vclang/html/vclrf_readwritebarrier.asp.)
>
> It seems worth noting, that acquire and release barriers are just a variety
> of memory visibility models. If you are concerned with broader aspects of
> memory visibility, say, from perspective of a general purpose programming
> language, it might be useful to widen the focus.

Right. Idiotic "global memory" optimization blocker aside for a moment, MS
folks actually meant:

int G;
int i;

atomic<int> ReleaseF(0), WaitF(0);

void f(void *p)
{
G = 1;

WaitF.store(1, msync::ssb);
while (ReleaseF.load(msync::hsb) == 0);

G = 2;
}

int main()
{
_beginthread(f, 0, NULL); // New thread

while (WaitF.load(msync::hlb) == 0)
Sleep(1);

if (G == 1)
puts("G is equal to 1, as expected.");
else
puts("G is NOT equal to 1!");

ReleaseF.store(1, msync::slb);
}
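
For readers without the msync notation at hand, here is a rough rendering of the same handshake with the atomics later standardized in C++11 (an assumption relative to this thread). C++11 has no direct counterpart to the fine-grained ssb/hsb/hlb/slb constraints, so plain release and acquire are used as a conservative substitute, and std::thread replaces _beginthread. The function run_handshake is invented for the illustration.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

int G = 0;
std::atomic<int> ReleaseF{0}, WaitF{0};

void f() {
    G = 1;
    WaitF.store(1, std::memory_order_release);  // publishes G = 1
    while (ReleaseF.load(std::memory_order_acquire) == 0) {
        std::this_thread::yield();
    }
    G = 2;  // runs only after the main thread has finished reading G
}

// Returns the value of G that the calling thread observed. Call it once;
// the flags are not reset.
int run_handshake() {
    std::thread t(f);
    while (WaitF.load(std::memory_order_acquire) == 0) {
        std::this_thread::yield();
    }
    int seen = G;  // expected to be 1
    ReleaseF.store(1, std::memory_order_release);
    t.join();
    assert(seen == 1);
    return seen;
}
```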

regards,
alexander.

Gianni Mariani

Feb 3, 2005, 11:53:12 AM

Alexander Terekhov wrote:
...

> WaitF.store(1, msync::ssb);
> while (ReleaseF.load(msync::hsb) == 0);

where is msync defined ?

On another topic: ssb == surely something broken? Just a pet peeve, I
hate seeing the use of a TLA when sink_store_barrier would have worked
just as well.

Scott Meyers

Feb 3, 2005, 12:01:24 PM

On Wed, 02 Feb 2005 07:43:15 -0500, Joseph Seigh wrote:
> On Tue, 1 Feb 2005 20:38:08 -0800, Scott Meyers <Use...@aristeia.com> wrote:
> > Assuming that x, y, a, and b are all distinct locations, is it reasonable
> > to assume that no compiler will move the assignment to y above the barrier,
> > or is it necessary to declare x and y volatile to prevent such code motion?
>
> Hypothetically, yes. Volatile wouldn't help as it has no meaning for
> threads. If the variables are only known to the local scope, i.e. they're
> not external and have not had an address taken, then the compiler can move them
> wherever it wants since no other thread can see them.

My concern wrt volatile was that treatments of memory issues refer to
"program order" as if it's the same as "source code order," but with
compilers moving stuff around prior to code generation, "source code order"
may be quite different from "program order." At least in C++, if I want to
ensure that the relative order of these reads is preserved,

x = a; // I want x to be read before y
y = b;

declaring x and y volatile will do it. Compilers can still move the reads
around wrt reads and writes of non-volatile data, but to remain compliant
with the C++ standard, x must be read before y in the generated code, i.e.,
in program order.

However, if compilers recognize and respect the semantics of membars, the
need for volatile goes away, because I can just stick a membar between the
reads (which I need anyway), and the problem is solved.

Incidentally, I understand how compiler intrinsics like Microsoft's
_ReadWriteBarrier are recognized by compilers, but from what I've read in
this group, there seems to be the assumption that calling an externally
defined function containing assembler will prevent code motion across
calls to the function, because compilers must pessimistically assume that
calls to the function affect all memory locations. With increasingly
aggressive cross-module inlining technology available, this seems like a
bet that gets worse and worse with time. It's not hard to imagine a build
system that can see that a called function doesn't affect the value of a
global variable and thus move a read or write of that variable across the
call. Is there a reason this can't happen, or are we just lucky that our
tools are, for the time being, both conservative and kind of dumb?

Regarding the other responses to my post, I have to study them before I
respond.

Thanks,

Scott

Joseph Seigh

Feb 3, 2005, 1:47:08 PM

On Thu, 3 Feb 2005 09:01:24 -0800, Scott Meyers <Use...@aristeia.com> wrote:

> On Wed, 02 Feb 2005 07:43:15 -0500, Joseph Seigh wrote:
>> On Tue, 1 Feb 2005 20:38:08 -0800, Scott Meyers <Use...@aristeia.com> wrote:
>> > Assuming that x, y, a, and b are all distinct locations, is it reasonable
>> > to assume that no compiler will move the assignment to y above the barrier,
>> > or is it necessary to declare x and y volatile to prevent such code motion?
>>
>> Hypothetically, yes. Volatile wouldn't help as it has no meaning for
>> threads. If the variables are only known to the local scope, i.e. they're
>> not external and have not had an address taken, then the compiler can move them
>> wherever it wants since no other thread can see them.
>
> My concern wrt volatile was that treatments of memory issues refer to
> "program order" as if it's the same as "source code order," but with
> compilers moving stuff around prior to code generation, "source code order"
> may be quite different from "program order." At least in C++, if I want to
> ensure that the relative order of these reads is preserved,
>
> x = a; // I want x to be read before y
> y = b;
>
> declaring x and y volatile will do it. Compilers can still move the reads
> around wrt reads and writes of non-volatile data, but to remain compliant
> with the C++ standard, x must be read before y in the generated code, i.e.,
> in program order.

I guess. I'm not real familiar with volatile since it's not that useful
in threading. If expressions are sequence points then that should make
every statement a sequence point also.

>
> However, if compilers recognize and respect the semantics of membars, the
> need for volatile goes away, because I can just stick a membar between the
> reads (which I need anyway), and the problem is solved.

AFAIK they don't, so we have to use the ad hoc solutions that we use
now.

>
> Incidentally, I understand how compiler intrinsics like Microsoft's
> _ReadWriteBarrier are recognized by compilers, but from what I've read in
> this group, there seems to be the assumption that calling an externally
> defined function containing assembler will prevent code motion across
> calls to the function, because compilers must pessimistically assume that
> calls to the function affect all memory locations. With increasingly
> aggressive cross-module inlining technology available, this seems like a
> bet that gets worse and worse with time. It's not hard to imagine a build
> system that can see that a called function doesn't affect the value of a
> global variable and thus move a read or write of that variable across the
> call. Is there a reason this can't happen, or are we just lucky that our
> tools are, for the time being, both conservative and kind of dumb?

The latter. We're just lucky for now. There seems to be extreme
antipathy towards threading issues in the C community at least. Try
to ask any thread-specific questions in the C newsgroups and
you get a "C has nothing to do with threads" response. There's less of
that in the C++ newsgroups now since Herb Sutter, Andrei Alexandrescu,
and yourself maybe, have picked up on and started promoting threading.

For example, I never got any authoritative response as to why Linux
assumes int loads and stores are atomic on ia32. Apparently it's
either some undocumented communication somewhere or, more likely,
someone is just assuming that since gcc does atomic load/store of int
for every case they've observed, it must do so for all cases.
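
As an aside, the std::atomic interface later standardized in C++11 (an assumption relative to this thread) lets that assumption be written down instead of inferred from observed compiler output. A minimal sketch, with function names invented for the illustration; with relaxed ordering these operations typically compile to plain loads and stores on ia32, though that is a property of the implementation rather than something the sketch proves.

```cpp
#include <atomic>
#include <cassert>

std::atomic<int> shared_value{0};

// Atomic store with no ordering constraints beyond atomicity itself.
void set_value(int v) { shared_value.store(v, std::memory_order_relaxed); }

// Atomic load with no ordering constraints beyond atomicity itself.
int get_value() { return shared_value.load(std::memory_order_relaxed); }

// Reports whether the implementation handles int atomically without a lock.
bool int_is_lock_free() { return shared_value.is_lock_free(); }

// Single-threaded sanity check of the accessors above.
void demo() {
    set_value(5);
    assert(get_value() == 5);
}
```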

It's sort of the same for separately compiled external functions. You assume
that the compiler has to drop optimization for any variable that has
had its address gotten from or passed to an external routine, or has
the external attribute. It could break at some point and we'll have
to start writing all the synchronization functions in external assembler
programs. That will make memory barriers more expensive than they already are.

It's not just C and C++ you have to worry about. Hardware architects have
even less of a clue about multi-threading than compiler writers. Their
sophistication ends at using a test and set to implement a lock. They
have no notion of how people are actually doing concurrent programming.
With the use of RCU (Read Copy Update) in the Linux kernel, the kernel
developers have adopted the use of load dependent memory barriers to avoid
the more expensive load fence memory barriers. The load dependent memory
barriers aren't part of any architected memory model, so hardware architects
definitely are not aware that they're being used. It's a distinct
possibility that some hardware vendor will break it, much to their
detriment in the marketplace. There's a pseudo-op in Linux for this
so they can put in a real memory barrier if needed. Currently Alpha
processors don't support dependent load memory ordering. There was
a discussion on this on the Linux kernel mailing list back during the
implementation of RCU in Linux, but there's no explicit documentation
that will carry forward.

--
Joe Seigh

Joseph Seigh

Feb 3, 2005, 1:53:33 PM

On Thu, 03 Feb 2005 08:53:12 -0800, Gianni Mariani <gi2n...@mariani.ws> wrote:

> Alexander Terekhov wrote:
> ...
>> WaitF.store(1, msync::ssb);
>> while (ReleaseF.load(msync::hsb) == 0);
>
> where is msync defined ?
>

It isn't. Alexander thinks his mnemonics are obvious
and don't need to be defined. I've never understood
what he's talking about when he resorts to his
mnemonic jargon.

--
Joe Seigh

Alexander Terekhov

Feb 3, 2005, 3:07:53 PM

"ssb" is relaxed "release" (without its "slb" part).

"hsb" is relaxed "acquire" (without its "hlb" part).

Got it now?

regards,
alexander.

SenderX

Feb 3, 2005, 7:58:10 PM

> My concern wrt volatile was that treatments of memory issues refer to
> "program order" as if it's the same as "source code order," but with
> compilers moving stuff around prior to code generation, "source code
> order" may be quite different from "program order." At least in C++, if
> I want to ensure that the relative order of these reads is preserved,
> declaring x and y volatile will do it. Compilers can still move the reads
> around wrt reads and writes of non-volatile data, but to remain compliant
> with the C++ standard, x must be read before y in the generated code,
> i.e., in program order.

I use volatile for source code documentation only. That's about how useful it
really is wrt this kind of stuff.

;(...

> However, if compilers recognize and respect the semantics of membars, the
> need for volatile goes away, because I can just stick a membar between the
> reads (which I need anyway), and the problem is solved.

..."if compilers recognize and respect the semantics of membars"...
^^^^^^^^^^^^^^^^^^^^^

It would be nice to have a compiler that could advertise "We handle calls to
any memory barrier or critical function in a safe and effective manner."
Something simple and magical like this would be sort of a start:

/* full fence barrier */
extern void my_mb_fence( void );

/* any other functions that are critical... */
extern void my_mutex_lock( void );
extern void my_mutex_unlock( void );
[etc...]

Now we use some magical #pragma's to inform the compiler of our own barriers
and critical functions:

/* Indicate to compiler that my_mb_fence is actually
a memory barrier. Now the compiler would have some
critical information. */
#pragma memory_barrier( "my_mb_fence" );

/* Indicate to compiler that my_mutex_lock is actually
the lock portion of a custom mutex. */
#pragma mutex_lock_function( "my_mutex_lock" );

/* Indicate to compiler that my_mutex_unlock is actually
the unlock portion of a custom mutex. */
#pragma mutex_unlock_function( "my_mutex_unlock" );

What do you think about this "simple" strategy???

As for volatile, it should probably be dropped for something like this:

__attribute__( (shared_variable) ) int shared_var;

Humm... Compiler writers REALLY need to get in on this!

> Incidentally, I understand how compiler intrinsics like Microsoft's
> _ReadWriteBarrier are recognized by compilers, but from what I've read in
> this group, there seems to be the assumption that calling an externally
> defined function containing assembler will prevent code motion across
> calls to the function, because compilers must pessimistically assume that
> calls to the function affect all memory locations. With increasingly
> aggressive cross-module inlining technology available, this seems like a
> bet that gets worse and worse with time.

Yup. It is basically all we have for now. ;(...

My AppCore library relies on externally assembled functions to "attempt to
reduce" the number of chances a rogue compiler would have to reorder a
"critical-sequence" of loads, stores, and function calls. After somebody
reads its documentation, and follows the links contained in it to this thread
( and others ), nobody will want to use the damn thing!!!!

:O sh$T#@$

lol

SenderX

Feb 3, 2005, 8:10:02 PM

>> 1. static T *shared = 0;
>>
>>
>> 2. T *local = ac_cpu_i686_mb_load_ddhlb( &shared );
>> 3. if ( ! local )
>> 4. { ac_mutex_lock( &static_mutex );
>> 5. if ( ! ( local = shared ) )
>> 6. { local = ac_cpu_i686_mb_store_ssb( &shared, new T ); }
>> 7. ac_mutex_unlock( &static_mutex );
>> }

> If
> the former, how are programmers in languages like C/C++ expected to make
> the association between reads/writes and memory barriers?

Just to clarify and answer your question directly:

loads are usually associated with "consumer side" memory barriers.
hoist-load-xxx, acquire, etc...

stores are usually associated with "producer side" memory barriers.
sink-store-xxx, release, etc...

SenderX

Feb 3, 2005, 8:20:08 PM

One more "important" thing...

> Just to clarify and answer your question directly:
>
>
> loads are usually associated with "consumer side" memory barriers.
> hoist-load-xxx, acquire, ect...

loads: the consumer-side barrier is usually executed "AFTER" the load.

movl (%eax), %ecx
; consumer-side barrier

>
> stores are usually associated with "producer side" memory barriers.
> sink-store-xxx, release, ect...

stores: the producer-side barrier is usually executed "BEFORE" the store.

; consumer-side barrier
movl %ecx, (%eax)

This setup allows for the producer/consumer relationship to work very well
with weakly-ordered multi-CPU shared memory environments.
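
A minimal sketch of that placement in present-day C++, assuming the C++11 std::atomic facilities (which postdate this thread). The names Object, produce, consume, and run_fence_demo are made up for illustration, and the consumer uses a full acquire fence as a stand-in for the cheaper data-dependency barrier discussed here:

```cpp
#include <atomic>
#include <cassert>
#include <thread>

struct Object { int payload; };

// Shared pointer through which the producer publishes the object.
std::atomic<Object*> shared{nullptr};

// Producer: initialize, then barrier, then store. The barrier comes BEFORE the store.
void produce(Object* obj) {
    obj->payload = 42;
    std::atomic_thread_fence(std::memory_order_release);  // producer-side barrier
    shared.store(obj, std::memory_order_relaxed);         // publish the pointer
}

// Consumer: load, then barrier, then use. The barrier comes AFTER the load.
int consume() {
    Object* local = nullptr;
    while ((local = shared.load(std::memory_order_relaxed)) == nullptr) {
        std::this_thread::yield();                        // nothing published yet
    }
    std::atomic_thread_fence(std::memory_order_acquire);  // consumer-side barrier
    return local->payload;
}

// Runs one producer and one consumer and returns what the consumer saw.
int run_fence_demo() {
    Object obj{0};
    shared.store(nullptr, std::memory_order_relaxed);
    int seen = 0;
    std::thread consumer([&seen] { seen = consume(); });
    std::thread producer([&obj] { produce(&obj); });
    producer.join();
    consumer.join();
    assert(seen == 42);
    return seen;
}
```

Moving either fence to the other side of its load or store removes the guarantee that the consumer sees the initialized payload.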

Marcin 'Qrczak' Kowalczyk

Feb 3, 2005, 8:39:29 PM

"SenderX" <x...@xxx.com> writes:

> loads: the consumer-side barrier is usually executed "AFTER" the load.
>
> movl %(eax), %ecx
> ; consumer-side barrier

> stores: the producer-side barrier is usually executed "BEFORE" the store.

>
> ; consumer-side barrier
> movl %ecx, %(eax)

Are you sure that it's not the opposite?

--
__("< Marcin Kowalczyk
\__/ qrc...@knm.org.pl
^^ http://qrnik.knm.org.pl/~qrczak/

Joseph Seigh

Feb 3, 2005, 9:03:08 PM

On Thu, 3 Feb 2005 17:10:02 -0800, SenderX <x...@xxx.com> wrote:

>

>
>> If
>> the former, how are programmers in languages like C/C++ expected to make
>> the association between reads/writes and memory barriers?
>
>
> Just to clarify and answer your question directly:
>
>
> loads are usually associated with "consumer side" memory barriers.
> hoist-load-xxx, acquire, ect...
>
> stores are usually associated with "producer side" memory barriers.
> sink-store-xxx, release, ect...
>
>
>
>

I'm leaning toward the way
http://www.hpl.hp.com/research/linux/qprof/README_atomic_ops.txt
does it where the "memory barrier" is a qualifier to the operation
it's tacked on to. Which means a load_acquire would have a
different membar than a store_acquire. For sparc, load_acquire
would have "#LoadLoad | #LoadStore" and store_acquire would have
"#StoreLoad | #StoreStore". For ia32 it would be MFENCE for both.

--
Joe Seigh
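
The "barrier as a qualifier on the operation" style described above can be sketched in present-day C++, assuming the C++11 std::atomic facilities (which postdate this thread). The wrapper names load_acquire, store_release, and round_trip are hypothetical and are not taken from the atomic_ops package. Note that C++ does not permit a store with acquire ordering, so this sketch pairs an acquire load with a release store instead of a store_acquire:

```cpp
#include <atomic>
#include <cassert>

// The ordering constraint is part of the operation, not a separate statement.
template <typename T>
T load_acquire(const std::atomic<T>& cell) {
    return cell.load(std::memory_order_acquire);
}

template <typename T>
void store_release(std::atomic<T>& cell, T value) {
    cell.store(value, std::memory_order_release);
}

// Single-threaded round trip, just to show the call shapes.
int round_trip(int value) {
    std::atomic<int> cell{0};
    store_release(cell, value);
    int result = load_acquire(cell);
    assert(result == value);
    return result;
}
```

Keeping the ordering choice inside two small functions also gives one place to change if a platform turns out to need something stronger.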

SenderX

Feb 3, 2005, 9:28:33 PM

>> loads: the consumer-side barrier is usually executed "AFTER" the load.
>>
>> movl %(eax), %ecx
>> ; consumer-side barrier
>
>> stores: the producer-side barrier is usually executed "BEFORE" the store.
>>
>> ; consumer-side barrier

^^^^^^^^^^^^^^^^^^

That's supposed to be producer-side.

>> movl %ecx, %(eax)
>

> Are you sure that it's not the opposite?

Yes, I'm sure. Here is why:

/* example of correct code */

/* processor A - executes producer-barrier "before" the store to shared */
object_t *local = create_object( . );
ac_cpu_mb_producer();
shared = local;

/* processor B - executes consumer-barrier "after" the load from shared */
object_t *local = shared;
ac_cpu_mb_consumer_dep();
if ( local ) { use_object( local ); }

If you reverse that logic, you break the hell out of it:

/* example of busted code */

/* processor A */
object_t *local = create_object( . );
shared = local;
ac_cpu_mb_producer(); /* no, no */

/* processor B */
ac_cpu_mb_consumer_dep(); /* no, no */
object_t *local = shared;
if ( local ) { use_object( local ); }

Understand now?

;)

Scott Meyers

Feb 3, 2005, 10:37:28 PM

On Thu, 03 Feb 2005 13:47:08 -0500, Joseph Seigh wrote:
> I guess. I'm not real familiar with volatile since it's not that useful
> in threading. If expressions are sequence points then that should make
> every statement a sequence point also.

There is a sequence point at the end of each statement, but that doesn't
help any for purposes of ordering reads and writes, because sequence points
constrain only observable behavior, and reads and writes to nonvolatile data
are not considered observable by the C++ standard. This is actually a
feature. Without it, you can't do things like hoist loop-invariant
computations out of loops, do common subexpression elimination, etc. I
mean, really, as a general rule, you do NOT want to execute the code the
programmer actually wrote, because it's fairly awful, from a performance
point of view. In general :-)

From my perspective, there are two problems for C/C++ programmers:
  - How do we keep compilers from ordering our reads/writes in a way
    contrary to what we want to be present in the generated code (i.e.,
    "program order")?
  - How do we make sure that the reads/writes that take place during
    execution are made visible to other threads in the order in which they
    take place in our thread?
If compilers recognize memory barriers, memory barriers solve both
problems. If compilers do not recognize them, we may need more than memory
barriers, e.g., volatile.
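
A sketch of how present-day C++ separates these two problems, assuming the C++11 <atomic> header (which postdates this thread). std::atomic_signal_fence constrains only the compiler, while std::atomic_thread_fence constrains the compiler and, where the hardware needs it, emits a fence instruction. The variable and function names are made up for illustration:

```cpp
#include <atomic>
#include <cassert>

std::atomic<int> a{0};
std::atomic<int> b{0};

// Problem 1 only: the compiler may not reorder the two stores across the
// fence, but no hardware fence instruction is emitted.
void compiler_only_ordering() {
    a.store(1, std::memory_order_relaxed);
    std::atomic_signal_fence(std::memory_order_seq_cst);
    b.store(1, std::memory_order_relaxed);
}

// Problems 1 and 2: the compiler is constrained and the hardware is told
// to order the stores as seen by other threads.
void compiler_and_hardware_ordering() {
    a.store(2, std::memory_order_relaxed);
    std::atomic_thread_fence(std::memory_order_seq_cst);
    b.store(2, std::memory_order_relaxed);
}

// Calls both variants; returns a + b after the second one.
int run_fence_kinds_demo() {
    compiler_only_ordering();
    assert(a.load() == 1 && b.load() == 1);
    compiler_and_hardware_ordering();
    return a.load() + b.load();
}
```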

> > However, if compilers recognize and respect the semantics of membars, the
> > need for volatile goes away, because I can just stick a membar between the
> > reads (which I need anyway), and the problem is solved.
>
> AFAIK they don't, so we have to use the ad hoc solutions that we use
> now.

Oh, goody.

> It sort of the same for separately compiled external functions. You
> assume that the compiler has to drop optimization for any variable that
> has had its address gotten from or passed to an external routine, or has
> the external attribute. It could break at some point and we'll have to
> start writing all the synchronization functions in external assembler
> programs. That will make memory barriers more expensive than they
> already are.

It seems clear to me that we need a way to communicate with compilers so
that they know the constraints we want to impose. They can either
recognize the significance of certain constructs (e.g., my understanding is
that compilers compliant with Posix threads must not optimize across
certain library calls) or we can add pragmas or something. Then again, I
could be completely off base here. I'm new to this stuff.

Scott

Scott Meyers

Feb 3, 2005, 10:45:10 PM

On Thu, 3 Feb 2005 16:58:10 -0800, SenderX wrote:
> I use volatile for source code documentation only. That about how usefull it
> really is wrt this kind of stuff.

So if you want to write something like this:

int data;
bool dataIsReady;

...

data = 22; // set data to communicate to other threads
dataIsReady = true; // let other threads know that they can read data

How do you ensure that the compiler doesn't invert the order of those
assignments?
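
For comparison, one way this is expressed in present-day C++, assuming the C++11 std::atomic facilities (which postdate this thread): the flag becomes an atomic, the writer stores it with release ordering, and the reader loads it with acquire ordering. The Channel struct and the function names are made up for illustration; only data and dataIsReady come from the example above:

```cpp
#include <atomic>
#include <cassert>
#include <thread>

struct Channel {
    int data = 0;                          // ordinary, non-atomic payload
    std::atomic<bool> dataIsReady{false};  // flag that publishes the payload
};

// Writer: the release store keeps the write to data from moving below it.
void publish(Channel& c, int value) {
    c.data = value;
    c.dataIsReady.store(true, std::memory_order_release);
}

// Reader: the acquire load keeps the read of data from moving above it.
int await(Channel& c) {
    while (!c.dataIsReady.load(std::memory_order_acquire)) {
        std::this_thread::yield();
    }
    return c.data;
}

// Runs one writer and one reader and returns what the reader saw.
int run_flag_demo() {
    Channel c;
    int seen = 0;
    std::thread reader([&] { seen = await(c); });
    std::thread writer([&] { publish(c, 22); });
    writer.join();
    reader.join();
    assert(seen == 22);
    return seen;
}
```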

> What do think about this "simple" strategy???

It seems reasonable to me, and it's consistent with another posting I just
made. In theory, you could get arbitrarily fine-grained with the
information you pass to the compilers this way. Unfortunately, pragmas are
defined to be inherently platform-specific, and my guess is that it'd be
very difficult to get the C or C++ committees to standardize pragmas. A
similar idea is something like Microsoft's attributes, which, oddly enough,
might be easier to push, because it's introducing something brand new rather
than changing the semantics of something that currently exists.

Scott

Scott Meyers

Feb 3, 2005, 10:53:37 PM

On Thu, 03 Feb 2005 21:07:53 +0100, Alexander Terekhov wrote:
> > It isn't. Alexander thinks his mnemonics are obvious
> > and don't need to be defined. I've never understood
> > what he's talking about when he resorts to his
> > mnemonic jargon.
>
> "ssb" is relaxed "release" (without its "slb" part).
>
> "hsb" is relaxed "acquire" (without its "hlb" part).

Can you put up a web page or something that summarizes all the stuff you
refer to frequently? Tracking it down is hard for those of us who tuned in
late. For example, I have this recollection that at one point I managed to
Google down a posting in which you summarized most/all of your notations
and what they mean, but I didn't bookmark it. I know that you are not big
on the handholding, but a summary crib sheet would be really helpful.

Thanks,

Scott

Scott Meyers

Feb 3, 2005, 11:13:29 PM

On Tue, 01 Feb 2005 23:14:30 -0800, Gianni Mariani wrote:
> All aquire does is to guarentee that any load (memory fetch) operations,
> possibly many, that have been requested before the barrier instruction
> are completed before any subsequent memory fetch operations.

I post this with great trepidation, because I've gotten this backwards
several times before, but my understanding is that an acquire guarantees
that subsequent memory operations will not take place before any operations
preceding the acquire, i.e., that memory references "after" the barrier (in
program order) won't migrate up to "before" the barrier. However, it's a
unidirectional barrier, so memory operations preceding the barrier may
migrate down to after it. Conceptually, we can move memory operations into
the critical section, but we can't move operations inside the critical
section to above the acquire (i.e., out of the critical section).
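
The critical-section reading of acquire and release can be sketched as a toy spinlock in present-day C++, assuming the C++11 std::atomic facilities (which postdate this thread). SpinLock and run_spinlock_demo are illustrative names, not an existing library:

```cpp
#include <atomic>
#include <cassert>
#include <thread>

class SpinLock {
    std::atomic<bool> locked{false};
public:
    // Entering the critical section: acquire. Accesses that follow the lock
    // in program order cannot move above it.
    void lock() {
        while (locked.exchange(true, std::memory_order_acquire)) {
            std::this_thread::yield();
        }
    }
    // Leaving the critical section: release. Accesses that precede the
    // unlock in program order cannot move below it.
    void unlock() {
        locked.store(false, std::memory_order_release);
    }
};

// Two threads increment a plain counter under the lock.
long run_spinlock_demo() {
    SpinLock lock;
    long counter = 0;
    auto work = [&] {
        for (int i = 0; i < 10000; ++i) {
            lock.lock();
            ++counter;  // protected by the lock
            lock.unlock();
        }
    };
    std::thread t1(work);
    std::thread t2(work);
    t1.join();
    t2.join();
    assert(counter == 20000);
    return counter;
}
```

Both constraints are one-directional: code outside the critical section may drift into it, but code inside may not leak out past either end.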

Did I get it wrong again, did I misread what you wrote, or is there a
misstatement above?

> volatile int v1 = BAD;
> volatile bool done = false;
>
> reader:
> a: bool is_done = done;
> b: aquire();
> c: if ( is_done ) play_with( v1 );

Yes, but consider:

reader:
int x = 22;
a: bool is_done = done;
b: aquire();
c: if ( is_done ) play_with( v1 );

The assignment to x can be moved down to between b and c, right? Also, the
acquire is really meant to be associated with the store to is_done, right?

Scott

Scott Meyers

Feb 3, 2005, 11:23:43 PM

On Wed, 02 Feb 2005 13:01:50 +0100, Alexander Terekhov wrote:
> > the former, how are programmers in languages like C/C++ expected to make
> > the association between reads/writes and memory barriers?
>
> std::atomic<>

I'm not sure how to interpret this. There is no atomic in std, and in
another post, you described std::atomic as "a dream." So I don't know
whether atomic is a template that actually exists and is in widespread use
or is something that only you use or is something that doesn't actually
exist in executable form but is rather something you'd like to see exist.
My question was not rhetorical. From other postings in this thread, I get
the impression that what C/C++ programmers actually do is call
compiler-specific intrinsics or externally compiled functions (possibly
implemented in assembler) that bundle reads/writes and membars together,
then trust that their build systems (i.e., compilers/linkers/runtime
systems) won't unbundle and reorder things such that the semantics of the
system are changed. Do you disagree with this?

> > Next, is it reasonable to assume that compilers will recognize memory
> > barrier instructions and not perform code motion that is contrary to their
> > meaning?
>
> Yep.

Others have said no. Can you please explain why you believe that C/C++
compilers will not perform code motion across memory barriers, especially
barriers that are not intrinsics but are instead implemented as externally
compiled functions containing, say, arbitrary assembler?

> I guess you mean
>
> ...
> x = a;
> y.store(b, msync::rel);
>
> It is necessary to have a compiler capable to understand atomic<>
> and unidirectional reordering constraint associated with its store(T,
> msync::rel_t) member function.

Do such compilers exist? Are they in widespread use?

Scott

SenderX

Feb 4, 2005, 12:31:27 AM

"Scott Meyers" <Use...@aristeia.com> wrote in message
news:MPG.1c6c91928...@news.hevanet.com...

These are some of them; he might have more now!

;)

msync::none     // nothing (e.g. for refcount<T, basic>::increment)
msync::fence    // classic fence (acq+rel -- see below)
msync::acq      // classic acquire (hlb+hsb -- see below)
msync::ddacq    // acquire via data dependency
msync::hlb      // hoist-load barrier -- acquire not affecting stores
msync::ddhlb    // ...
msync::hsb      // hoist-store barrier -- acquire not affecting loads
msync::ddhsb    // ...
msync::rel      // classic release (slb+ssb -- see below)
msync::slb      // sink-load barrier -- release not affecting stores
msync::ssb      // sink-store barrier -- release not affecting loads
msync::slfence  // store-load fence (ssb+hlb -- see above)
msync::sfence   // store-fence (ssb+hsb -- see above)
msync::lfence   // load-fence (slb+hlb -- see above)

Gianni Mariani

Feb 4, 2005, 1:24:38 AM

Scott Meyers wrote:
> On Tue, 01 Feb 2005 23:14:30 -0800, Gianni Mariani wrote:
>
>>All aquire does is to guarentee that any load (memory fetch) operations,
>>possibly many, that have been requested before the barrier instruction
>>are completed before any subsequent memory fetch operations.
>
>
> I post this with great trepedation, because I've gotten this backwards
> several times before, but my understanding is that an acquire guarantees
> that subsequent memory operations will not take place before any operations
> preceding the acquire, i.e., that memory references "after" the barrier (in
> program order) won't migrate up to "before" the barrier. However, it's a
> unidirectional barrier, so memory operations preceding the barrier may
> migrate down to after it. Conceptually, we can move memory operations into
> the critical section, but we can't move opertions inside the critical
> section to above the acquire (i.e., out of the critical section).

I don't have any trepidation; I suspect that if I am wrong, I'll get
told sooner or later. :-)

It depends on the architecture.

I came across another list of names for barriers.

LoadLoad
LoadStore
StoreLoad
StoreStore
ref: http://gee.cs.oswego.edu/dl/jmm/cookbook.html

>
> Did I get it wrong again, did I misread what you wrote, or is there a
> misstatement above?

Given the prolific nature of the kinds of memory barriers, you may be
right or we may both be right. Documentation seems scant, in some
respects because we have a vast number of "theoretical" machines
to deal with, both hard and virtual.

>
>
>>volatile int v1 = BAD;
>>volatile bool done = false;
>>
>>reader:
>>a: bool is_done = done;
>>b: aquire();
>>c: if ( is_done ) play_with( v1 );
>
>
> Yes, but consider:
>
> reader:
> int x = 22;
> a: bool is_done = done;
> b: aquire();
> c: if ( is_done ) play_with( v1 );
>
> The assignment to x can be moved down to between b and c, right?

I suspect there could be a CPU that would do this, yes.

> Also, the acquire is really meant to be associated with the store to
> is_done, right?

Yes, it is meant to be associated with load(done)/load(v1), i.e., is_done
is just a temporary that is only available to a single thread (on its
stack), so no barriers are required to read from is_done since no other
thread can change it.

read(done) -> is_done
aquire
check(is_done) then read(v1)

Hence, if done is read, and it is false, v1 is not used. Worst case, v1
is not read but v1 is also not BAD.

Combine this with:

x: v1 = GOOD;
y: release();
z: done = true;

store(v1)
release (make the new v1 visible)
store(done)

There is no way that is_done can be true without v1 being GOOD.

Alexander Terekhov

Feb 4, 2005, 4:22:28 AM

Gianni Mariani wrote:
[...]

> I came across another list of names for barriers.
>
> LoadLoad
> LoadStore
> StoreLoad
> StoreStore
> ref: http://gee.cs.oswego.edu/dl/jmm/cookbook.html

That cookbook isn't entirely accurate to begin with. SPARC barriers
are bidirectional fences. The thing is that you rarely need more than
unidirectional constraint(s) associated with this or that operation.

regards,
alexander.

Alexander Terekhov

Feb 4, 2005, 6:32:35 AM

Scott Meyers wrote:
>
> On Wed, 02 Feb 2005 13:01:50 +0100, Alexander Terekhov wrote:
> > > the former, how are programmers in languages like C/C++ expected to make
> > > the association between reads/writes and memory barriers?
> >
> > std::atomic<>
>
> I'm not sure how to interpret this. There is no atomic in std, and in
> another post, you described std::atomic as "a dream." So I don't know
> whether atomic is a template that actually exists and is in widespread use
> or is something that only you use or is something that doesn't actually
> exists in executable form but is rather something you'd like to see exist.

The latter.

[...]

>
> > > Next, is it reasonable to assume that compilers will recognize memory
> > > barrier instructions and not perform code motion that is contrary to their
> > > meaning?
> >
> > Yep.
>
> Others have said no. Can you please explain why you believe that C/C++
> compilers will not perform code motion across memory barriers, especially
> barriers that are not intrinsics but are instead implemented as externally
> compiled functions containing, say, arbitrary assembler?

I meant hypothetical future implementations with atomic<>.

>
> > I guess you mean
> >
> > ...
> > x = a;
> > y.store(b, msync::rel);
> >
> > It is necessary to have a compiler capable to understand atomic<>
> > and unidirectional reordering constraint associated with its store(T,
> > msync::rel_t) member function.
>
> Do such compilers exist? Are they in widespread use?

Not yet.

regards,
alexander.

Alexander Terekhov

Feb 4, 2005, 6:39:12 AM

Scott Meyers wrote:
>
> On Thu, 03 Feb 2005 21:07:53 +0100, Alexander Terekhov wrote:
> > > It isn't. Alexander thinks his mnemonics are obvious
> > > and don't need to be defined. I've never understood
> > > what he's talking about when he resorts to his
> > > mnemonic jargon.
> >
> > "ssb" is relaxed "release" (without its "slb" part).
> >
> > "hsb" is relaxed "acquire" (without its "hlb" part).
>
> Can you put up a web page or something that summarizes all the stuff you
> refer to frequently?

http://www.google.de/groups?threadm=414E9E40.A66D4F48%40web.de
(std::msync)

regards,
alexander.

Joseph Seigh

Feb 4, 2005, 7:52:44 AM

On Thu, 3 Feb 2005 19:37:28 -0800, Scott Meyers <Use...@aristeia.com> wrote:

> From my perspective, there are two problems for C/C++ programmers:
> - How do we keep compilers from ordering our reads/write in a way
> contrary to what we want to be present in the generated code (i.e.,
> "program order")?
> - How do we make sure that the reads/writes that take place during
> execution are made visible to other threads in the order in which they
> take place in our thread?
> If compilers recognize memory barriers, memory barriers solve both
> problems. If compilers do not recognize them, we may need more than memory
> barriers, e.g., volatile.

Volatile was defined before they knew what they were doing. It is fairly
useless for the purposes of threaded programming. Even Java, which did
have precise semantics for volatile w.r.t. threading, needed two tries to
give it useful semantics and, from what I've heard, still doesn't have
it right.

[...]

>
> It seems clear to me that we need a way to communicate with compilers so
> that they know the constraints we want to impose. They can either
> recognize the signifcance of certain constructs (e.g., my understanding is
> that compilers compliant with Posix threads must not optimize across
> certain library calls) or we can add pragmas or something. Then again, I
> could be completely off base here. I'm new to this stuff.
>

Posix compliance of C compilers is after the fact. C doesn't recognise threads.
Adding pragmas would be the better route for compilers since neither Posix nor
Microsoft have a formal definition of thread semantics. Kind of difficult to
implement something if you don't know what it is. Having pragmas would shift
the burden to the threading and synchronization library implementers. Plus
it gets the compiler writers out of having to implement more than one set of
synchronization semantics which they would have to do since they're already
in widespread use. I would hate to be the compiler writer that had to tell
Linus that gcc won't support the Linux kernel anymore. Linux kernel threads
don't use Posix pthreads api for synchronization.

These ad hoc solutions aren't as bad as you think. You just need to have
an api that has well defined semantics so if you do have a problem at some
point you can identify what and where the problem is. So for example, you
don't want to use volatile directly even if you did think it had some useful
behavior because if it did break you'd have no single point at which to fix
things. You'd have to look at every single occurrence of volatile, try to
figure out how the programmer was trying to use it, and try to come up with
an alternate fix.

--
Joe Seigh

SenderX

Feb 5, 2005, 1:08:12 AM

>> I use volatile for source code documentation only. That about how usefull
>> it
>> really is wrt this kind of stuff.
>
> So if you want to write something like this:
>
> int data;
> bool dataIsReady;
>
> ...
>
> data = 22; // set data to communicate to other threads
> dataIsReady = true; // let other threads know that they can read data
>
> How do you ensure that the compiler doesn't invert the order of those
> assignments?

You could use volatile and cross your fingers, or put it "all" in a mutex...
But, we want a speedy lock-free solution, like your example shows. You could
identify this code as a "critical-sequence" of operations and, IMHO, do it
in assembly. I am starting to find that assembly language can be a lot more
useful than C wrt lock-free programming in general; It eases my mind... lol.
I have become more and more paranoid about this subject because the i686
version of my AppCore library is done. All that I have left to do is finish
the documentation. People may start to use it because of the high performance
it provides, and I want their experience to be a good one. So, I'll quickly
scribble down an ad-hoc lock-free solution for your question in C and i686:

/* C-style compile-time assertion for 32-bit cpu: the array size is
   negative, so compilation fails, unless int is 4 bytes */
struct compile_assert_32bit_int_must_be_a_word
{ int test[( sizeof( int ) == 4 ) ? 1 : -1]; };

/* MUST be two "adjacent" words!!! */
typedef struct __attribute__( (packed) ) cs_
{
int data;
int dataIsReady;

} cs_t;

/* "safely" produces data. Safe is good! */
extern void i686_cs_produce_data( cs_t*, int );
.globl i686_cs_produce_data
i686_cs_produce_data:
movl 4(%esp), %eax
movl 8(%esp), %ecx
movl %ecx, (%eax)
# sfence may be needed right here on future x86
movl $1, 4(%eax)
ret

/* abstract this cpu specific code into common api. */
#define cs_produce_data i686_cs_produce_data /* ;) */

/* now, a thread-safe version of your example in C */
static cs_t my_data = { 0, 0 };

cs_produce_data( &my_data, 22 );
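
The two layout assumptions in this code (a 4-byte int and two adjacent words) can be checked at compile time in present-day C++ with static_assert, assuming C++11 or later (which postdates this thread). The sketch below reuses the cs_t field names; cs_layout_ok is a made-up helper that only exercises the struct:

```cpp
#include <cassert>
#include <cstddef>  // offsetof

struct cs_t {
    int data;
    int dataIsReady;
};

// Compilation stops with the given message if any assumption does not hold.
static_assert(sizeof(int) == 4, "int must be a 32-bit word");
static_assert(offsetof(cs_t, dataIsReady) == sizeof(int),
              "data and dataIsReady must be adjacent words");
static_assert(sizeof(cs_t) == 2 * sizeof(int), "cs_t must have no padding");

// Trivial runtime use of the struct; the real checks happen at compile time.
bool cs_layout_ok() {
    cs_t value = { 22, 1 };
    assert(value.data == 22 && value.dataIsReady == 1);
    return sizeof(value) == 8;
}
```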

> How do you ensure that the compiler doesn't invert the order of those
> assignments?

Now the compiler doesn't even have a chance to reorder anything.

:)

> It seems reasonable to me, and it's consistent with another posting I just
> made. In theory, you could get arbitrarily fine-grained with the
> information you pass to the compilers this way.

Yeah, I thought it could be a simple and straightforward method for passing
all sorts of critical information about your custom functions directly to
the compiler.

> Unfortunately, pragmas are
> defined to be inherently platform-specific, and my guess is that it'd be
> very difficult to get the C or C++ committees to standardize pragmas.

That's what I thought. However, I think the idea, at least, justifies a
thoughtful discussion in the C/C++ committees.

> A similar idea is something like Microsoft's attributes, which, oddly
> enough, might be easier to push, because it's introducing something brand
> new rather than changing the semantics of something that currently exists.

Humm...

Ziv Caspi

Feb 4, 2005, 7:22:51 PM

"Scott Meyers" <Use...@aristeia.com> wrote in message

news:MPG.1c6c8f8e5...@news.hevanet.com...

> On Thu, 3 Feb 2005 16:58:10 -0800, SenderX wrote:
>> I use volatile for source code documentation only. That about how usefull
>> it
>> really is wrt this kind of stuff.
>
> So if you want to write something like this:
>
> int data;
> bool dataIsReady;
>
> ...
>
> data = 22; // set data to communicate to other threads
> dataIsReady = true; // let other threads know that they can read data
>
> How do you ensure that the compiler doesn't invert the order of those
> assignments?

The answer depends on whom you ask, and so is not very helpful.

Some people (Microsoft, in particular) hold that code generation of
multi-threaded programs must preserve the (very loose) rules of C/C++, and
so simply declaring both variables as volatile means that the code generated
by the compiler will have the assignment to data precede the assignment to
dataIsReady.

Others (many of them on this group) contend that C/C++ is explicitly *not*
about MT programs, and you can't rely on any guarantees.

In any case, C/C++ provides no mechanism to prevent the processor itself
from reordering, so even if you belong to the first group, you get no
standard guarantees.

>> What do think about this "simple" strategy???
>
> It seems reasonable to me, and it's consistent with another posting I just
> made. In theory, you could get arbitrarily fine-grained with the
> information you pass to the compilers this way. Unfortunately, pragmas
> are defined to be inherently platform-specific, and my guess is that it'd
> be very difficult to get the C or C++ committees to standardize pragmas. A
> similar idea is something like Microsoft's attributes, which, oddly
> enough, might be easier to push, because it's introducing something brand
> new rather than changing the semantics of something that currently exists.

Microsoft has already provided some guarantees here, as specified in
http://www.microsoft.com/whdc/driver/kernel/MP_issues.mspx. In particular,
it treats volatile reads as having acquire semantics, and volatile writes as
having release semantics, so if you target CL 14 (or later), you have a
solution. I've not heard of other C/C++ compilers that provide similar
guarantees, which probably is more a testament to my ignorance than anything
else :-)

Note that the guarantees we currently provide hold only for the platforms
Windows and CL run on -- x86, x64, and Itanium. We currently don't provide a
guaranteed "future-proof" model for future platforms, although some of us
would really like us to do so...

HTH,
Ziv Caspi

DISCLAIMER: I work for Microsoft. Opinions expressed here are my own, and
not my employer's. I don't work for the compiler team or the Windows team,
and the above is my understanding of their position, which might be wrong.

Ziv Caspi

Feb 4, 2005, 6:59:46 PM

"Scott Meyers" <Use...@aristeia.com> wrote in message

news:MPG.1c6c96414...@news.hevanet.com...

> On Tue, 01 Feb 2005 23:14:30 -0800, Gianni Mariani wrote:
>> All aquire does is to guarentee that any load (memory fetch) operations,
>> possibly many, that have been requested before the barrier instruction
>> are completed before any subsequent memory fetch operations.
>
> I post this with great trepedation, because I've gotten this backwards
> several times before, but my understanding is that an acquire guarantees
> that subsequent memory operations will not take place before any
> operations
> preceding the acquire, i.e., that memory references "after" the barrier
> (in
> program order) won't migrate up to "before" the barrier. However, it's a
> unidirectional barrier, so memory operations preceding the barrier may
> migrate down to after it. Conceptually, we can move memory operations
> into
> the critical section, but we can't move opertions inside the critical
> section to above the acquire (i.e., out of the critical section).
>
> Did I get it wrong again, did I misread what you wrote, or is there a
> misstatement above?

No, you got it correctly. See also
http://www.microsoft.com/whdc/driver/kernel/MP_issues.mspx

* Acquire semantics mean that the results of the operation are visible
before the results of any operation that appears after it in the code
* Release semantics mean that the results of the operation are visible after
the results of any operation that appears before it in the code

HTH,
Ziv Caspi
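
The one-way nature of acquire and release described above can be sketched with a small spinlock. This is only an illustration, assuming a C++11 compiler (the `<atomic>` header postdates this thread); `SpinLock` and `count_with_spinlock` are hypothetical names, not anything from the Microsoft document. The acquire on entry keeps the critical section's accesses from moving up and out, and the release on exit keeps them from moving down and out.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// Hypothetical minimal spinlock, for illustration only.
class SpinLock {
public:
    void lock() {
        // Acquire: accesses after this point may not move above it.
        while (locked_.exchange(true, std::memory_order_acquire)) {
            std::this_thread::yield();
        }
    }
    void unlock() {
        // Release: accesses before this point may not move below it.
        locked_.store(false, std::memory_order_release);
    }
private:
    std::atomic<bool> locked_{false};
};

// Two threads increment a plain (non-atomic, non-volatile) int under
// the lock. Returns the final count.
int count_with_spinlock(int per_thread) {
    SpinLock lock;
    int counter = 0;
    auto work = [&] {
        for (int i = 0; i < per_thread; ++i) {
            lock.lock();
            ++counter;  // protected by the acquire/release pair
            lock.unlock();
        }
    };
    std::thread t1(work);
    std::thread t2(work);
    t1.join();
    t2.join();
    assert(counter == 2 * per_thread);
    return counter;
}
```

Calling `count_with_spinlock(10000)` should yield 20000; nothing stronger than acquire on entry and release on exit is needed for mutual exclusion of this kind.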

Alexander Terekhov

unread,

Feb 5, 2005, 8:17:07 AM2/5/05

to

Ziv Caspi wrote:
[...]

> Microsoft has already provided some guarantees here, as specified in
> http://www.microsoft.com/whdc/driver/kernel/MP_issues.mspx. In particular,
> it treats volatile reads as having acquire semantics, and volatile writes as
> having release semantics, so if you target CL 14 (or later), you have a
> solution.

I doubt it. KeMemoryBarrier* idiocy and braindead illustrations from
that piece speak volumes to the contrary.

regards,
alexander.

Joseph Seigh

unread,

Feb 5, 2005, 11:33:44 AM2/5/05

to

On Sat, 5 Feb 2005 02:22:51 +0200, Ziv Caspi <zi...@netvision.net.il> wrote:

> Microsoft has already provided some guarantees here, as specified in
> http://www.microsoft.com/whdc/driver/kernel/MP_issues.mspx. In particular,
> it treats volatile reads as having acquire semantics, and volatile writes as
> having release semantics, so if you target CL 14 (or later), you have a
> solution. I've not heard of other C/C++ compilers that provide similar
> guarantees, which probably is more a testament to my ignorance than anything
> else :-)

Strictly speaking, acquire and release aren't accurate characterizations.
Volatiles are totally ordered with respect to other volatiles separated by
sequence points, and only at the level of the compiler-generated code.

>
> Note that the guarantees we currently provide hold only for the platforms
> Windows and CL run on -- x86, x64, and Itanium. We currently don't provide a
> guaranteed "future-proof" model for future platforms, although some of us
> would really like us to do so...

The semantics of volatile are implementation-dependent, so Microsoft can
implement volatile with those semantics. However, Microsoft should make
it explicitly clear that those are guarantees only provided by Microsoft's
C/C++ compiler and not by the C/C++ standard or necessarily any other
compiler. In other words, such behavior may be non-portable. It's
considered good form to document non-standard behavior.

As for future-proofing, I thought Microsoft was pushing for CLR to
go into C somehow.
--
Joe Seigh

Scott Meyers

unread,

Feb 5, 2005, 12:05:56 PM2/5/05

to

On Sat, 5 Feb 2005 02:22:51 +0200, Ziv Caspi wrote:
> Others (many of them on this group) contend that C/C++ is explicitly *not*
> about MT programs, and you can't rely on any guarantees.

Well, the C++ standard seems pretty clear to me that the sequence of reads
and writes to volatile data separated by sequence points must be preserved
by a conforming compiler. I can't tell from a quick glance whether the C99
standard offers the same guarantee. So what is the basis for the "can't
rely on any guarantees" camp? That there are many nonconforming compilers?
That the guarantee doesn't exist in C?

I think we all agree that volatile alone can't solve the problem, because
it affects only compilers, not hardware-based instruction reorderings.

Scott

Scott Meyers

unread,

Feb 5, 2005, 12:12:58 PM2/5/05

to

On Fri, 4 Feb 2005 22:08:12 -0800, SenderX wrote:
> > How do you ensure that the compiler doesn't invert the order of those
> > assignments?
>
> Now the compiler doesn't even have a chance to reorder anything.

In other words, you prevent the reordering of the assignments by not making
the assignments :-)

Scott

Joseph Seigh

unread,

Feb 5, 2005, 12:54:10 PM2/5/05

to

On Sat, 5 Feb 2005 09:05:56 -0800, Scott Meyers <Use...@aristeia.com> wrote:

> On Sat, 5 Feb 2005 02:22:51 +0200, Ziv Caspi wrote:
>> Others (many of them on this group) contend that C/C++ is explicitly *not*
>> about MT programs, and you can't rely on any guarantees.
>
> Well, the C++ standard seems pretty clear to me that the sequence of reads
> and writes to volatile data separated by sequence points must be preserved
> by a conforming compiler. I can't tell from a quick glance whether the C99
> standard offers the same guarantee. So what is the basis for the "can't
> rely on any guarantees" camp? That there are many nonconforming compilers?
> That the guarantee doesn't exist in C?

The problem is C can't really articulate what exactly the guarantee is.
When they originally defined volatile, they had only a very limited
notion of concurrency. Actually none. There was some notion of
asynchronicity but nothing that was actually useful. There were
unix signals but that was unix, not C which was supposed to be OS
independent. There was debugging which lets you see intermediate
states of storage. You'd have to have some sort of explicit support
for debuggers looking at intermediate storage state from the C standard
and extrapolate from that.

>
> I think we all agree that volatile alone can't solve the problem, because
> it affects only compilers, not hardware-based instruction reorderings.

It also affects only inter-volatile ordering. You'd have to declare all
shared data volatile which would adversely affect performance.
--
Joe Seigh
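
For contrast with the "declare everything volatile" approach, here is a minimal sketch of standalone barriers, assuming C++11 `std::atomic_thread_fence` (which did not exist when this thread was written). The names `payload`, `ready`, `producer`, `consumer`, and `run_fence_demo` are hypothetical. Only the flag is atomic; the payload is an ordinary int, and the fences order it relative to the flag for both the compiler and the hardware.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

int payload = 0;                 // plain shared data, not volatile
std::atomic<bool> ready{false};  // the only atomic object

void producer() {
    payload = 42;
    // Standalone release fence: the payload write may not sink below it.
    std::atomic_thread_fence(std::memory_order_release);
    ready.store(true, std::memory_order_relaxed);
}

int consumer() {
    while (!ready.load(std::memory_order_relaxed)) {
        std::this_thread::yield();
    }
    // Standalone acquire fence: the payload read may not rise above it.
    std::atomic_thread_fence(std::memory_order_acquire);
    return payload;
}

// Runs one producer and one consumer; returns what the consumer saw.
int run_fence_demo() {
    payload = 0;
    ready.store(false);
    int seen = 0;
    std::thread c([&] { seen = consumer(); });
    std::thread p(producer);
    p.join();
    c.join();
    assert(seen == 42);
    return seen;
}
```

The fences here are tied to the relaxed flag accesses next to them, which is one way of reading the "standalone versus attached to a load or store" question: even a standalone fence only does useful work in combination with some access that the other thread can observe.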

Alexander Terekhov

unread,

Feb 5, 2005, 12:47:24 PM2/5/05

to

Scott Meyers wrote:

[... volatiles and preservation ...]

> So what is the basis for the "can't rely on any guarantees" camp?

Same basis as with respect to preservation of sequence of std IO
calls to dev/null (with destination known in advance so to speak).
The thing is that C/C++ volatile abuse isn't really "observable
behavior" under single-threaded C/C++ standard(s), you know.

regards,
alexander.

Neill Clift [MSFT]

unread,

Feb 5, 2005, 6:15:10 PM2/5/05

to

"Alexander Terekhov" <tere...@web.de> wrote in message
news:4204C753...@web.de...

>
> I doubt it. KeMemoryBarrier* idiocy and braindead illustrations from
> that piece speak volumes to the contrary.
>

I have seen some discussion from you on this point but it's
basically quite hard to deduce your point from many of your
posts. I have had to ask you for clarification on things before
but I haven't expended that energy for all of them.
To be honest I think the name
KeMemoryBarrierWithoutFence was a poor choice for a
statement name. I would have picked a name that made it
more obvious that we were preventing compiler reordering
around this point.
I have seen you complain about the spinlock unlock
routine that uses a volatile store. I call this the unlock
optimization when I talk about it. With release semantics
in the compiler (or a statement to prevent compiler
reordering) and release semantics for the processor
via special instructions we believe this is a valid
optimization. Clearly there are some sequences that
may not work with the unlock optimization that
would work with a full barrier:

lock ();
a = 1;
unlock ();

if (b) {
    xxx;
}

The read of b might be rearranged to before the
assignment of a. Algorithms like Peterson's that rely
on a write followed by a read can't be made to work
by inserting an unlock before the reads. I don't believe
this is a problem in practice. Anyone doing this has
references to memory locations that can change
outside of locks and hence has to know about
barriers.

You have also complained about the fact that
we don't document that
InterlockedCompareExchange is not a barrier
in the failure case. I have confirmed that I
believe it's our intention. You have complained
that you don't believe it's the case for some future
supported XBOX platform I have no knowledge of.
You also suggested we don't do the right thing on
the alpha. I believe we do have a full barrier
for this call such that you can't get rearrangements
like this between the assignment of a and b:

a = 1;
if (InterlockedCompareExchange (&z, 1, 0) != 0) { // fails
    b = 2;
}

I believe there is a problem with something like this:

if ((a = InterlockedCompareExchangePointer (&z, p1, p2)) != p2) { // fails
    a->val = 1;
}

Here the assignment to a->val may be reordered to before
the interlocked operation (well before the load locked etc
but not before the mb). All ancient history as we don't
develop for this platform now.

So if you explain your issues in a way I can understand,
clearly stating what you think is an issue, I could send
details to the appropriate people and try to effect
change if I agree.
Neill.
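
The write-followed-by-read pattern mentioned above (the core of Peterson's and Dekker's algorithms) can be sketched as a small litmus test. This assumes C++11 atomics and uses hypothetical names (`both_saw_zero`, `count_both_zero`). With sequentially consistent operations, at least one thread must observe the other's store. If the stores were weakened to release and the loads to acquire, the language would permit both loads to return 0, which is exactly the reordering a release-only unlock allows.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// One round of store-then-load on two flags. Returns true if both
// threads read 0, i.e. each load took effect before the other store.
bool both_saw_zero() {
    std::atomic<int> a{0};
    std::atomic<int> b{0};
    int r1 = -1;
    int r2 = -1;
    std::thread t1([&] {
        a.store(1, std::memory_order_seq_cst);
        r1 = b.load(std::memory_order_seq_cst);
    });
    std::thread t2([&] {
        b.store(1, std::memory_order_seq_cst);
        r2 = a.load(std::memory_order_seq_cst);
    });
    t1.join();
    t2.join();
    return r1 == 0 && r2 == 0;
}

// Repeats the round and counts how often both loads returned 0.
// With seq_cst ordering the count is required to stay at zero.
int count_both_zero(int rounds) {
    assert(rounds > 0);
    int hits = 0;
    for (int i = 0; i < rounds; ++i) {
        if (both_saw_zero()) {
            ++hits;
        }
    }
    return hits;
}
```

On x86 a compiler typically implements the seq_cst store with an `xchg` or a store followed by `mfence`, which is the full barrier that a plain release store lacks.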

Neill Clift [MSFT]

unread,

Feb 5, 2005, 6:28:02 PM2/5/05

to

"Alexander Terekhov" <tere...@web.de> wrote in message

news:420506AC...@web.de...

>
> Same basis as with respect to preservation of sequence of std IO
> calls to dev/null (with destination known in advance so to speak).
> The thing is that C/C++ volatile abuse isn't really "observable
> behavior" under single-threaded C/C++ standard(s), you know.
>

In section 6.5.3 of the ANSI C standard as a footnote it says:

'A volatile declaration may be used to describe an object
corresponding to a memory-mapped input/output port...'

So compiler reordering would be visible via its effect
on a hardware device.
I got this by looking in Mr Schildt's book :-)
Neill.

David Schwartz

unread,

Feb 5, 2005, 8:26:11 PM2/5/05

to

"Scott Meyers" <Use...@aristeia.com> wrote in message

news:MPG.1c6e9cce1...@news.hevanet.com...

> Well, the C++ standard seems pretty clear to me that the sequence of reads
> and writes to volatile data separated by sequence points must be preserved
> by a conforming compiler. I can't tell from a quick glance whether the C99
> standard offers the same guarantee. So what is the basis for the "can't
> rely on any guarantees" camp? That there are many nonconforming compilers?
> That the guarantee doesn't exist in C?

This is a meaningless requirement because it doesn't say *where* the
order needs to be preserved. One could argue that an L2 cache violates this
requirement and the C standard requires you to disable the L2 cache for
volatile accesses. The problem is that the standard simply calls the order
of such accesses part of the 'observable behavior' of the program with no
concept of how or where such a thing is to be observed.

> I think we all agree that volatile alone can't solve the problem, because
> it affects only compilers, not hardware-based instruction reorderings.

It does not affect the compiler. The 'as-if' rule permits the compiler
to make any changes that don't affect the observable behavior. Since nobody
can agree how this is observable behavior, there is effectively no
restriction on compilers either.

DS

David Hopwood

unread,

Feb 5, 2005, 11:58:28 PM2/5/05

to

Neill Clift [MSFT] wrote:

> "Alexander Terekhov" <tere...@web.de> wrote:
>
>>Same basis as with respect to preservation of sequence of std IO
>>calls to dev/null (with destination known in advance so to speak).
>>The thing is that C/C++ volatile abuse isn't really "observable
>>behavior" under single-threaded C/C++ standard(s), you know.
>
> In section 6.5.3 of the ANSI C standard as a footnote it says:
>
> 'A volatile declaration may be used to describe an object
> corresponding to a memory-mapped input/output port...'
>
> So compiler reordering would be visible via its effect
> on a hardware device.

If there is in fact any hardware device involved.

--
David Hopwood <david.nosp...@blueyonder.co.uk>

Scott Meyers

unread,

Feb 6, 2005, 9:04:08 AM2/6/05

to

On Sat, 5 Feb 2005 17:26:11 -0800, David Schwartz wrote:
> This is a meaningless requirement because it doesn't say *where* the
> order needs to be preserved. One could argue that an L2 cache violates this
> requirement and the C standard requires you to disable the L2 cache for
> volatile accesses. The problem is that the standard simply calls the order
> of such accesses part of the 'observable behavior' of the program with no
> concept of how or where such a thing is to be observed.

I'd imagine it's the observable behavior of the abstract machine, not any
real machine, since the entire standard involves only an abstract machine.
My take would be if x and y are volatile and the generated code (program
order) accesses x before y, the compiler is off the hook, regardless of
what happens on any real machine at runtime. After all, single-threaded
programs will always behave as if x is accessed before y,
regardless of what the hardware does (at least that's my understanding),
and the standards have no concept of more than one thread. I can imagine
programmers wanting stronger guarantees, but I can't imagine compiler
writers offering weaker guarantees. Are there real compilers where use of
volatile does not have the effect of totally ordering accesses in the
generated code to volatile data?

Scott

Joseph Seigh

unread,

Feb 6, 2005, 11:26:37 AM2/6/05

to

On Sun, 6 Feb 2005 06:04:08 -0800, Scott Meyers <Use...@aristeia.com> wrote:

> [...] I can imagine
> programmers wanting stronger guarantees, but I can't imagine compiler
> writers offering weaker guarantees. Are there real compilers where use of
> volatile does not have the effect of totally ordering accesses in the
> generated code to volatile data?
>

You should ask that question in the C/C++ newsgroups where the compiler
writers hang out. Cross-post to this newsgroup because I'd be interested
in what their response, if any, would be.

If they do respond, ask if they'd mind putting the guarantees in writing.

Bear in mind that the fact that Posix had to create a unilateral Posix
compliance certification of C compilers should tell you something about
how cooperative the C compiler community has been in the past.

--
Joe Seigh

David Schwartz

unread,

Feb 6, 2005, 3:50:54 PM2/6/05

to

"Scott Meyers" <Use...@aristeia.com> wrote in message

news:MPG.1c6fc390e...@news.hevanet.com...

> On Sat, 5 Feb 2005 17:26:11 -0800, David Schwartz wrote:

>> This is a meaningless requirement because it doesn't say *where* the
>> order needs to be preserved. One could argue that an L2 cache violates
>> this requirement and the C standard requires you to disable the L2 cache
>> for volatile accesses. The problem is that the standard simply calls the
>> order of such accesses part of the 'observable behavior' of the program
>> with no concept of how or where such a thing is to be observed.

> I'd imagine it's the observable behavior of the abstract machine, not any
> real machine, since the entire standard involves only an abstract machine.

Exactly. The problem is, real machines don't have a (single,
well-defined) point from which they can be observed. This makes the
observation requirement meaningless.

> My take would be if x and y are volatile and the generated code (program
> order) accesses x before y, the compiler is off the hook, regardless of
> what happens on any real machine at runtime.

Nonsense. The C++ standard applies to the entire machine, not just the
compiler. A compiler is not conforming to the C++ standard just because it
generates code that would conform (in the sense of an abstract machine) on
some hypothetical hardware; the system as a whole complies if the compiler
generates conforming code when run on a particular piece of hardware.

The standard is about an abstract machine, not compiled code. In fact,
it doesn't even require the compiler to generate object code at all. It just
requires certain particular results.

> After all, single-threaded
> programs will always behave as if x is accessed before y,
> regardless of what the hardware does (at least that's my understanding),

The sentence above may or may not be true, but it's definitely not about
the C++ standard. The C++ standard is not modified by how particular
hardware implementations act. It's either comprehensible in terms of an
abstract machine or it's not. The observation requirement for volatile
accesses is, quite literally, incomprehensible in terms of an abstract
machine.

> and the standards have no concept of more than one thread. I can imagine
> programmers wanting stronger guarantees, but I can't imagine compiler
> writers offering weaker guarantees. Are there real compilers where use of
> volatile does not have the effect of totally ordering accesses in the
> generated code to volatile data?

The standard is not about the generated code itself, it's about what the
generated code does when it's run on the hardware. You cannot have a
conforming C++ compiler whose target is "no hardware in particular". The C++
standard is in terms of an abstract machine and a conforming compiler must
conform on some particular piece of hardware.

One could argue that hardware on which the concept of observability of
memory accesses is impossible makes it impossible to write a conforming C++
compiler. On modern x86 systems, you *cannot* enforce the order of volatile
variable accesses in the sense that the C++ standard appears to require.
However, you can't just arbitrarily pick one point in the implementation and
say "ahh, that's where the C++ standard was talking about observing, between
the compiler and the processor executing the compiled code" because between
the processor and the memory controller is an equally valid point of
observation.

DS

SenderX

unread,

Feb 7, 2005, 12:46:42 AM2/7/05

to

> So if you explain your issues in a way I can understand,
> clearly stating what you think is an issue, I could send
> details to the appropriate people and try to effect
> change if I agree.

Here are some questions that need clarification in your documentation:

1. Can external data depend on the return value from a failed CAS?

2. Do the WaitFor_xxx_Object(s) APIs have acquire semantics?

3. How about ReleaseSemaphore? Does that have release semantics?

4:

static volatile int a = 0, b = 0;

a = 1;
b = 2;
ReleaseSemaphore( sema, ... );

AnotherThread
------------------
WaitForSingleObject( sema, ... );
if ( a + b != 3 ) { abort(); }

Does this example work with the memory visibility model that Microsoft
semaphores conform to?

Humm... ;)

Add this to your documentation:

ReleaseSemaphore
--------------------

A release barrier is executed before the semaphore increment. This means
that all preceding loads and stores will be visible before the increment.

ReleaseSemaphore - WaitFor_xxx API
--------------------

An acquire barrier is executed after the semaphore decrement. This means
that the decrement will be visible before any subsequent loads and stores.

I wonder if Alex would approve of this type of documentation...

:O
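
The wording proposed above (release ordering on the increment, acquire ordering on the decrement) can be modelled with a toy semaphore. This is a sketch under stated assumptions: it uses C++11 atomics, `SpinSemaphore` and `run_semaphore_demo` are hypothetical names, and it says nothing about how Win32 semaphores are actually implemented. It only shows that the proposed semantics are sufficient for example 4 with plain, non-volatile `a` and `b`.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// Toy counting semaphore that spins instead of blocking.
class SpinSemaphore {
public:
    void release() {
        // Release ordering: earlier writes are visible to whoever
        // observes this increment with acquire ordering.
        count_.fetch_add(1, std::memory_order_release);
    }
    void wait() {
        int c = count_.load(std::memory_order_relaxed);
        for (;;) {
            if (c > 0) {
                // Acquire ordering on the successful decrement.
                if (count_.compare_exchange_weak(c, c - 1,
                                                 std::memory_order_acquire,
                                                 std::memory_order_relaxed)) {
                    return;
                }
                // On failure c already holds the current count; retry.
            } else {
                std::this_thread::yield();
                c = count_.load(std::memory_order_relaxed);
            }
        }
    }
private:
    std::atomic<int> count_{0};
};

// Mirrors example 4: one thread writes a and b and then releases,
// the other waits and then reads both. Returns a + b as seen by the waiter.
int run_semaphore_demo() {
    int a = 0;
    int b = 0;
    SpinSemaphore sema;
    int sum = 0;
    std::thread waiter([&] {
        sema.wait();
        sum = a + b;
    });
    a = 1;
    b = 2;
    sema.release();
    waiter.join();
    assert(sum == 3);
    return sum;
}
```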

Neill Clift [MSFT]

unread,

Feb 7, 2005, 1:37:21 AM2/7/05

to

"SenderX" <x...@xxx.com> wrote in message
news:uOCdnVM_fvS...@comcast.com...

>> So if you explain your issues in a way I can understand,
>> clearly stating what you think is an issue, I could send
>> details to the appropriate people and try to effect
>> change if I agree.
>
> Here are some questions that need clarification in your documentation:
>
> 1. Can external data depend on the return value from a failed CAS?

I don't understand what you're asking.

>
> 2. Do the WaitFor_xxx_Object(s) APIs have acquire semantics?

WaitForSingleObject etc as well as signallers like SetEvent have to
be full barriers. You can only signal a thread to do something if the
signal is a barrier to operations you did prior to the signal. The same
argument would apply to waits. Win32 programs would likely not
work without this and we wouldn't violate it.

Alexander Terekhov

unread,

Feb 7, 2005, 5:17:07 AM2/7/05

to

"Neill Clift [MSFT]" wrote:
>
> "Alexander Terekhov" <tere...@web.de> wrote in message
> news:420506AC...@web.de...
> >
> > Same basis as with respect to preservation of sequence of std IO
> > calls to dev/null (with destination known in advance so to speak).
> > The thing is that C/C++ volatile abuse isn't really "observable
> > behavior" under single-threaded C/C++ standard(s), you know.
> >
>
> In section 6.5.3 of the ANSI C standard as a footnote it says:
>
> 'A volatile declaration may be used to describe an object
> corresponding to a memory-mapped input/output port...'

Yes. Note that I said "volatile abuse". Simply put, when you write
something like

volatile int i = 1;

int main() {
    return --i;
}

you'd better be prepared for a smart compiler to transform it to

int main() {
}

regards,
alexander.

Alexander Terekhov

unread,

Feb 7, 2005, 5:24:13 AM2/7/05

to

SenderX wrote:
[...]

> I wonder if Alex would approve of this type of documentation...

My take on this issue (memory isolation and atomic<> aside for
a moment) can be found here:

http://www.google.de/groups?selm=41C301DF.5B50EDB7%40web.de

regards,
alexander.

Alexander Terekhov

unread,

Feb 7, 2005, 6:13:10 AM2/7/05

to

"Neill Clift [MSFT]" wrote:
[...]

> I have seen you complain about the spinlock unlock
> routine that uses a volatile store.

If your volatiles have Java-post-JSR-133-like "release+" semantics
for writes (note: totally unneeded for standard C/C++ semantics --
sig_atomic_t statics for async signals and auto locals for jumps),
then you don't need that idiotic "WithoutFence". If your volatiles
don't have Java-post-JSR-133-like "release+" semantics for writes,
then that "WithoutFence" thing won't help (on Itanic MP).

[... InterlockedCompareExchange ...]

> You also suggested we don't do the right thing on
> the alpha.

I suggested you post the code. Still waiting.

> I believe we do have a full barrier

I believe that barrier makes little sense in the case of
comparison failure.

regards,
alexander.

Joseph Seigh

unread,

Feb 7, 2005, 7:47:45 AM2/7/05

to

On Mon, 07 Feb 2005 11:17:07 +0100, Alexander Terekhov <tere...@web.de> wrote:

>
> Yes. Note that I said "volatile abuse". Simply put, when you write
> something like
>
> volatile int i = 1;
>
> int main() {
>     return --i;
> }
>
> you'd better be prepared for a smart compiler to transform it to
>
> int main() {
> }
>

The variable i doesn't have an external attribute. It can't be
seen anyway.

--
Joe Seigh

Alexander Terekhov

unread,

Feb 7, 2005, 7:46:03 AM2/7/05

to

And since the C/C++ standards don't concern themselves with threads,
use of "internal" volatiles by multiple threads doesn't constitute
"external attribute" in your terminology. IOW, C/C++ compiler can
simply ignore volatile abuse and treat all such variables as
nonvolatile.

regards,
alexander.

Joseph Seigh

unread,

Feb 7, 2005, 8:00:51 AM2/7/05

to

On Mon, 07 Feb 2005 12:13:10 +0100, Alexander Terekhov <tere...@web.de> wrote:

>
> "Neill Clift [MSFT]" wrote:
> [...]

>> I believe we do have a full barrier

>
> I believe that barrier makes little sense in the case of
> comparison failure.
>

What are you saying? That failed compare and swaps shouldn't
have memory barrier semantics?

--
Joe Seigh

Alexander Terekhov

unread,

Feb 7, 2005, 7:57:23 AM2/7/05

to

Yes (actually "unspecified"). Just like with failed
pthread_mutex_trylock(). See XBD 4.10. "Unless explicitly stated
otherwise, if one of the above functions returns an error, it is
unspecified whether the invocation causes memory to be
synchronized."

regards,
alexander.

Joseph Seigh

unread,

Feb 7, 2005, 8:17:58 AM2/7/05

to

Yes, but you didn't have any threads in your example. You need to come
up with more realistic examples.

--
Joe Seigh

Joseph Seigh

unread,

Feb 7, 2005, 8:22:29 AM2/7/05

to

I'm aware of having failed CAS logic in singleton, DCL, etc... logic.
I'm not sure where trylock fits in.

--
Joe Seigh

Alexander Terekhov

unread,

Feb 7, 2005, 8:44:58 AM2/7/05

to

Joseph Seigh wrote:
[...]

> Yes, but you didn't have any threads in your example. You need to come
> up with more realistic examples.

Threads and standard C/C++ volatiles are irrelevant concepts with
respect to each other. Addition of threads doesn't change anything.

regards,
alexander.

Alexander Terekhov

unread,

Feb 7, 2005, 9:08:33 AM2/7/05

to

Joseph Seigh wrote:
[...]

> I'm aware of having failed CAS logic in singleton, DCL, etc... logic.

That logic is flawed. CAS is no substitute for load with associated
acquire barrier.

regards,
alexander.

Neill Clift [MSFT]

unread,

Feb 7, 2005, 11:52:59 AM2/7/05

to

"Alexander Terekhov" <tere...@web.de> wrote in message

news:42074D46...@web.de...

>
> "Neill Clift [MSFT]" wrote:
> [...]
>> I have seen you complain about the spinlock unlock
>> routine that uses a volatile store.
>
> If your volatiles have Java-post-JSR-133-like "release+" semantics
> for writes (note: totally unneeded for standard C/C++ semantics --
> sig_atomic_t statics for async signals and auto locals for jumps),
> then you don't need that idiotic "WithoutFence". If your volatiles
> don't have Java-post-JSR-133-like "release+" semantics for writes,
> then that "WithoutFence" thing won't help (on Itanic MP).

I think you're missing the point that our compiler is changing.
It's neither of these two states currently.
In shipped compilers we only honour ordering volatile to volatile
statements. On IA64 we are generating acquire and release
semantics for volatile references. It may well be that in the
future we shift to an acquire/release model for volatile references.
We could have already done so as I don't work on the compiler.

>
> [... InterlockedCompareExchange ...]
>
>> You also suggested we don't do the right thing on
>> the alpha.
>
> I suggested you post the code. Still waiting.

Yeah, like that's going to happen.

>
>> I believe we do have a full barrier
>
> I believe that barrier makes little sense in the case of
> comparison failure.
>

Well, the problem with that is that in the failure case the CAS
still returns a usable value. I could easily construct a reasonable
program taking advantage of this. Callers would likely expect
ordering here. It's likely many existing programs take advantage
of this. Even if you define it as undefined, unless people actually
see something go wrong they will likely code to the implementation.
For us the penetration of x86 forces its rules in many areas.
Neill.

SenderX

unread,

Feb 7, 2005, 2:41:18 PM2/7/05

to

> > Here are some questions that need clarification in your documentation:
> >
> > 1. Can external data depend on the return value from a failed CAS?
>
> I don't understand what you're asking.

static CMyData my_data;
static volatile LONG flag = 0;

my_data.setup();
InterlockedCompareExchange( &flag, 1, 0 );

AnotherConcurrentThread
------------------

while ( ! InterlockedCompareExchange( &flag, 0, 0 ) ) { Sleep( 0 ); _asm pause; }
// flag is set
my_data.use();

Is this safe?

Joseph Seigh

unread,

Feb 7, 2005, 3:53:22 PM2/7/05

to

while (flag != 1)
    sleep(0);
InterlockedCompareExchange(&flag, 1, 1);

would be better. DCL would be a less contrived example of
the usage you're trying to show, e.g. initFreeQueue here

http://groups-beta.google.com/group/comp.programming.threads/msg/31803c3398658e06

Microsoft should just say that their synchronization functions
"synchronize" memory and not bother to define memory synchronization
which Posix doesn't bother to define either. :)

--
Joe Seigh

Alexander Terekhov

unread,

Feb 7, 2005, 4:05:44 PM2/7/05

to

SenderX wrote:
[...]

> static CMyData my_data;
> static volatile LONG flag = 0;

atomic<bool> flag(false);

>
> my_data.setup();
> InterlockedCompareExchange( &flag, 1, 0 );

flag.store(true, msync::rel);

>
> AnotherConcurrentThread
> ------------------
>
> while ( ! InterlockedCompareExchange( &flag, 0, 0 ) ) { Sleep( 0 ); _asm

while (!flag.load(msync::acq)) ...

regards,
alexander.
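
The `atomic<>` notation in the reply above is a proposal sketch, not a shipping API. A rough equivalent, assuming the later C++11 `std::atomic` (with `memory_order_release` and `memory_order_acquire` standing in for `msync::rel` and `msync::acq`), might look like the following. `MyData` and `run_flag_demo` are hypothetical stand-ins for `CMyData` and the two-thread scenario.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// Hypothetical stand-in for CMyData.
struct MyData {
    int value = 0;
    void setup() { value = 7; }
    int use() const { return value; }
};

// One thread sets up the data and raises the flag with a release store;
// the other spins on an acquire load and then uses the data.
// Returns what the reader observed.
int run_flag_demo() {
    MyData my_data;
    std::atomic<bool> flag{false};
    int seen = 0;
    std::thread reader([&] {
        while (!flag.load(std::memory_order_acquire)) {
            std::this_thread::yield();
        }
        // flag is set, so setup() happened before this point
        seen = my_data.use();
    });
    my_data.setup();
    flag.store(true, std::memory_order_release);
    reader.join();
    assert(seen == 7);
    return seen;
}
```

Here the ordering is attached to the load and the store themselves, so no read-modify-write operation is needed just to obtain a barrier.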

Neill Clift [MSFT]

unread,

Feb 7, 2005, 4:11:25 PM2/7/05

to

"SenderX" <x...@xxx.com> wrote in message

news:v5GdnRpcBK9...@comcast.com...

Yes. As another poster mentions a less contrived example
would be using say InterlockedCompareExchange to do
one time initialization and if the call fails using the return
as the initialized value.
I would consider it a bug if we didn't honour this and push
for it to be fixed in current platforms.

Alexander Terekhov

unread,

Feb 7, 2005, 4:16:27 PM2/7/05

to

Joseph Seigh wrote:
[...]

> DCL would be a less contrived example of
> the usage you're trying to show, e.g. initFreeQueue here
>
> http://groups-beta.google.com/group/comp.programming.threads/msg/31803c3398658e06

It's lockless DCCI, not "DCL" (DCSI).

http://groups.google.de/groups?selm=415BD983.E2DA2114%40web.de

regards,
alexander.

Alexander Terekhov

unread,

Feb 7, 2005, 4:28:47 PM2/7/05

to

"Neill Clift [MSFT]" wrote:
[...]

> InterlockedCompareExchange to do
> one time initialization and if the call fails using the return
> as the initialized value.
> I would consider it a bug if we didn't honour this and push
> for it to be fixed in current platforms.

Your logic nicely illustrates superiority of "the original"
IBM style CAS. ;-)

regards,
alexander.

SenderX

unread,

Feb 7, 2005, 6:49:21 PM2/7/05

to

>> Is this safe?
>>
>
> Yes. As another poster mentions a less contrived example
> would be using say InterlockedCompareExchange to do
> one time initialization and if the call fails using the return
> as the initialized value.

If the initialized value was a pointer to a new object, would the pointed to
object be fully visible?

Neill Clift [MSFT]

unread,

Feb 7, 2005, 10:36:30 PM2/7/05

to

"SenderX" <x...@xxx.com> wrote in message

news:F6-dnZxMFdI...@comcast.com...

I mentioned this case in another post. I believe that the answer
should be yes and would likely log bugs if I found a current
platform with an issue like this. Clearly you would need
ordering on the creation side also.
Neill.
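
The one-time initialization case discussed here can be sketched with a pointer compare-and-swap. This assumes C++11 atomics, not the Win32 call itself: `compare_exchange_strong` takes separate success and failure orderings, so the failure path can be given acquire semantics explicitly. `Config`, `g_config`, `get_config`, and `run_init_demo` are hypothetical names.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

struct Config {  // hypothetical payload
    int value;
};

std::atomic<Config*> g_config{nullptr};

// Returns the shared Config, creating it on first use.
Config* get_config() {
    Config* current = g_config.load(std::memory_order_acquire);
    if (current != nullptr) {
        return current;
    }
    Config* fresh = new Config{42};
    // Success: the release half publishes *fresh (ordering on the
    // creation side). Failure: `current` receives the winner's pointer,
    // and acquire ordering makes the winner's object contents visible.
    if (g_config.compare_exchange_strong(current, fresh,
                                         std::memory_order_acq_rel,
                                         std::memory_order_acquire)) {
        return fresh;
    }
    delete fresh;  // lost the race; use the value the failed CAS returned
    return current;
}

// Two threads race to initialize; both must end up with the same object.
bool run_init_demo() {
    Config* p1 = nullptr;
    Config* p2 = nullptr;
    std::thread t1([&] { p1 = get_config(); });
    std::thread t2([&] { p2 = get_config(); });
    t1.join();
    t2.join();
    assert(p1 == p2);
    return p1 == p2 && p1->value == 42;
}
```

The single `Config` is intentionally never freed, as is common for process-lifetime singletons. If the failure ordering were relaxed instead of acquire, the loser could receive a valid pointer without any guarantee that the pointed-to object is fully visible, which is the situation the question above is probing.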

David Schwartz

unread,

Feb 8, 2005, 3:27:51 PM2/8/05

to

"Ziv Caspi" <zi...@netvision.net.il> wrote in message
news:cu2548$189$3...@news2.netvision.net.il...

> In any case, C/C++ provides no mechanism to prevent the processor itself
> from reordering, so even if you belong to the first group, you get no
> standard guarantees.

I can't believe it! How can you possibly argue that the C/C++ standard
imposes requirements on the *compiler* that the *processor* is free to
violate and still comply with the standard? That's utterly absurd. If the
compiler can't generate code that prevents the processor from violating the
C/C++ standard, the compiler does not conform. Period.

> Note that the guarantees we currently provide hold only for the platforms
> Windows and CL run on -- x86, x64, and Itanium. We currently don't provide
> a guaranteed "future-proof" model for future platforms, although some of
> us would really like us to do so...

In other words, it is *not* the official position that the C/C++
standard requires such things.

DS