
Memory visibility and MS Interlocked instructions


Scott Meyers (Aug 25, 2005, 12:12:59 PM)

Suppose thread W (the writer) writes a value to variable x and thread R
(the reader) later reads the value of x. On a relaxed memory architecture,
my understanding has been that to guarantee that R sees the value written
by W, W must follow the write with a release membar and R must precede the
read with an acquire membar. The membars might be used directly, or they
might be used indirectly via locks, e.g., both W and R could access x after
acquiring the correct mutex. Conceptually (though not operationally), I
think of the release membar as forcing W's local memory to be flushed to
main memory and the acquire membar as forcing R's local memory to be
synched with main memory.
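
For concreteness, here is a minimal sketch of the handshake as I
understand it, using hypothetical release_membar()/acquire_membar()
primitives (these names, and use(), are made up, not any particular
API):

int x;
volatile int ready = 0;

/* Thread W */
void writer(void)
{
    x = 42;            /* the write */
    release_membar();  /* "flush": publish W's writes */
    ready = 1;
}

/* Thread R */
void reader(void)
{
    if (ready)
    {
        acquire_membar();  /* "sync": R's view before reading x */
        use(x);            /* sees 42 if ready was seen as 1 */
    }
}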

What's important about my understanding is that guaranteeing that R sees the
correct value depends on both W and R taking actions. It's not enough for
W alone to use a membar or a lock, and it's not enough for R alone to use a
membar or a lock: both must do something to ensure that W's write is
visible to R.

This model does not seem to be consistent with the documented semantics of
Microsoft's Interlocked instructions at
http://msdn.microsoft.com/library/default.asp?url=/library/en-us/dllproc/base/about_synchronization.asp,
which "ensure that previous read and write requests have completed and are
made visible to other processors, and ensure that no subsequent read
or write requests have started." The page gives this example:

BOOL volatile fValueHasBeenComputed = FALSE;

void CacheComputedValue()
{
    if (!fValueHasBeenComputed)
    {
        iValue = ComputeValue();
        InterlockedExchange((LONG*)&fValueHasBeenComputed, TRUE);
    }
}

The InterlockedExchange function ensures that the value of iValue is
updated for all processors before the value of fValueHasBeenComputed is
set to TRUE.

What confuses me is that here only the writer needs to take an action to
guarantee that readers will see things in the proper order, where my
understanding had been that the reader, too, would have to take some action
before reading iValue and fValueHasBeenComputed to ensure that it didn't
get a stale value for one or both.

Obviously I'm missing something. Can somebody please clear up my
confusion?

Thanks,

Scott

David Schwartz (Aug 25, 2005, 1:05:11 PM)

"Scott Meyers" <use...@aristeia.com> wrote in message
news:11grrhq...@corp.supernews.com...

> Obviously I'm missing something. Can somebody please clear up my
> confusion?

The MS Interlocked instructions can only work on platforms where the
compiler puts in any necessary memory barriers on all accesses to
'volatile' variables. This happens to be very easy to do on x86.
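
For instance (a sketch in GCC syntax; the exact incantation varies by
compiler), the "barrier" for a volatile access on x86 can be
compiler-only and cost no instructions at all:

/* Forbids the compiler from reordering or caching memory accesses
   across this point; emits nothing into the instruction stream. */
#define compiler_barrier() __asm__ __volatile__("" ::: "memory")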

DS


Joe Seigh (Aug 25, 2005, 1:07:16 PM)

No, you didn't miss anything. Another incorrect DCL example. Although
they haven't shown the reader code or how ComputeValue handles concurrent
invocations. The "ComputedValue" could be write-only for all we know. :)

--
Joe Seigh

When you get lemons, you make lemonade.
When you get hardware, you make software.

Scott Meyers (Aug 25, 2005, 1:14:33 PM)

David Schwartz wrote:
> The MS Interlocked instructions can only work on platforms where the
> compiler puts in any necessary memory barriers on all accesses to
> 'volatile' variables. This happens to be very easy to do on x86.

But this would require that MS insert membars on all volatile accesses,
because there is, in general, no way to know whether another part of the
program uses an interlocked instruction. Do MS and MS-compatible compilers
really do that?

Scott

David Schwartz (Aug 25, 2005, 1:35:15 PM)

"Scott Meyers" <use...@aristeia.com> wrote in message
news:11grv57...@corp.supernews.com...

> David Schwartz wrote:

On x86, it's not needed. I'm not sure about other platforms.

DS


David Schwartz (Aug 25, 2005, 1:37:11 PM)

"David Schwartz" <dav...@webmaster.com> wrote in message
news:dekvgl$vj1$1...@nntp.webmaster.com...

>> But this would require that MS insert membars on all volatile accesses,
>> because there is, in general, no way to know whether another part of the
>> program uses an interlocked instruction. Do MS and MS-compatible
>> compilers
>> really do that?

> On x86, it's not needed. I'm not sure about other platforms.

Here's the famous quote from the Linux kernel:

* For now, "wmb()" doesn't actually do anything, as all
* Intel CPU's follow what Intel calls a *Processor Order*,
* in which all writes are seen in the program order even
* outside the CPU.

DS


Alexander Terekhov (Aug 25, 2005, 1:38:19 PM)

Scott Meyers wrote:

[... busted CacheComputedValue()/FetchComputedValue() example ...]

atomic<int> iValue;
atomic<BOOL> fValueHasBeenComputed(FALSE);
int ComputeValue();
// http://tinyurl.com/68jav
#pragma isolated_call(ComputeValue)

void CacheComputedValue()
{
    if (!fValueHasBeenComputed.load(msync::naked_competing))
    {
        iValue.store(ComputeValue(), msync::naked_competing);
        fValueHasBeenComputed.store(TRUE, msync::ssb);
    }
}

BOOL FetchComputedValue(int *piResult)
{
    if (fValueHasBeenComputed.load(msync::cchlb_true))
    {
        *piResult = iValue.load(msync::naked_competing);
        return TRUE;
    }
    else
        return FALSE;
}

To Peter: I've extended the cc* stuff with path-specific variants so
that you can give the compiler a hint about which path doesn't really
need isync.

regards,
alexander.

John Hickin (Aug 25, 2005, 1:31:51 PM)

I think they are confused, not you :-)

At the risk of falling into a trap, here goes:

LONG volatile fValueHasBeenComputedState = 0;

void CacheComputedValue()
{
    switch (InterlockedExchangeAdd(&fValueHasBeenComputedState, (LONG)1))
    {
    case 0:
        iValue = ComputeValue();
        InterlockedExchange(&fValueHasBeenComputedState, 2);
        break;
    case 1:
        while (InterlockedExchangeAdd(&fValueHasBeenComputedState, (LONG)0)
               == (LONG)1) { }
        // fall through
    default:
        // nothing to do
        ;
    }
}

"Scott Meyers" <use...@aristeia.com> wrote in message

news:11grrhq...@corp.supernews.com...

David Schwartz (Aug 25, 2005, 1:43:39 PM)

"John Hickin" <hic...@nortelnetworks.com> wrote in message
news:dekva7$p36$1...@zcars129.ca.nortel.com...

> LONG volatile fValueHasBeenComputedState = 0;
>
> void CacheComputedValue()
> {
>     switch (InterlockedExchangeAdd(&fValueHasBeenComputedState, (LONG)1))
>     {
>     case 0:
>         iValue = ComputeValue();
>         InterlockedExchange(&fValueHasBeenComputedState, 2);
>         break;
>     case 1:
>         while (InterlockedExchangeAdd(&fValueHasBeenComputedState, (LONG)0)
>                == (LONG)1) { }
>         // fall through
>     default:
>         // nothing to do
>         ;
>     }
> }

This is a disaster on a machine with more than 2 CPUs or
hyper-threading. In the more than 2 CPUs case, two CPUs in the 'while' loop
will totally saturate the FSB, causing the CPU that needs to release the
lock to be unable to do so. In the hyper-threaded case, the virtual CPU in
the while loop will steal resources from the thread that's doing useful
work.

DS


Alexander Terekhov (Aug 25, 2005, 1:45:43 PM)

Scott Meyers wrote:
[...]

> But this would require that MS insert membars on all volatile accesses,
> because there is, in general, no way to know whether another part of the
> program uses an interlocked instruction. Do MS and MS-compatible compilers
> really do that?

Nobody really knows what they do.

http://groups.google.de/group/comp.programming.threads/msg/63b5c4eccdbc7528
http://groups.google.de/group/comp.programming.threads/msg/13cbf9a3e446bef0

regards,
alexander.

David Schwartz (Aug 25, 2005, 2:13:21 PM)

"Alexander Terekhov" <tere...@web.de> wrote in message
news:430E03C7...@web.de...

> Scott Meyers wrote:
> [...]
>> But this would require that MS insert membars on all volatile accesses,
>> because there is, in general, no way to know whether another part of the
>> program uses an interlocked instruction. Do MS and MS-compatible
>> compilers
>> really do that?

> Nobody really knows what they do.

And nobody knows how much of what they do is defined behavior that we
can rely on or just "it happens to work on today's machines" behavior.

DS


John Hickin (Aug 25, 2005, 2:14:31 PM)

"David Schwartz" <dav...@webmaster.com> wrote in message
news:del00d$vps$1...@nntp.webmaster.com...
>

>
> This is a disaster on a machine with more than 2 CPUs or
> hyper-threading. In the more than 2 CPUs case, two CPUs in the 'while'
loop
> will totally saturate the FSB, causing the CPU that needs to release the
> lock to be unable to do so. In the hyper-threaded case, the virtual CPU in
> the while loop will steal resources from the thread that's doing useful
> work.

So I was correct about falling into a trap :-)

Regards, John.


David Schwartz (Aug 25, 2005, 2:26:45 PM)

"John Hickin" <hic...@nortelnetworks.com> wrote in message
news:del1q8$6f2$1...@zcars129.ca.nortel.com...

Yeah. You can fix the HT problem by putting a 'rep nop' in the 'while'
loop. You can fix the 3+ CPUs problem by using a read-only instruction to
spin on and calling the locked function only if the read suggests the
compare/swap will succeed.

However, there are more traps you can fall into. ;)

DS


Scott Meyers (Aug 25, 2005, 3:07:23 PM)

Joe Seigh wrote:
> No, you didn't miss anything. Another incorrect DCL example. Although
> they haven't shown the reader code or how ComputeValue handles concurrent
> invocations. The "ComputedValue" could be write-only for all we know. :)

May I assume then that my understanding is correct that both writer and
reader must participate in a handshake to ensure that writer changes to
memory are visible to readers in a relaxed memory architecture? If so,
would it be reasonable to conclude that the semantics of the interlocked
instructions are not generally implementable, though they may be
implementable on particular architectures?

Thanks,

Scott

Joe Seigh (Aug 25, 2005, 3:39:31 PM)

Yes on the first part. I'm not sure what you mean by the second part.
Can you simulate them if they're not available natively? Yes: you
can use a spin lock to guarantee atomicity, plus help from the kernel
(probably implemented as a syscall) to prevent the thread from being
preempted while holding the lock. The operation will be of fixed
duration and thus be "lock-free" by definition.

Alexander Terekhov (Aug 25, 2005, 3:42:35 PM)

Scott Meyers wrote:
[...]

> May I assume then that my understanding is correct that both writer and
> reader must participate in a handshake to ensure that writer changes to
> memory are visible to readers in a relaxed memory architecture?

Not really (if you mean assembly level and/or atomic<>).

> If so,
> would it be reasonable to conclude that the semantics of the interlocked
> instructions are not generally implementable, though they may be
> implementable on particular architectures?

But MS interlocked stuff is brain-dead anyway.

regards,
alexander.

Alexander Terekhov (Aug 25, 2005, 3:52:22 PM)

Joe Seigh wrote:
[...]

> Yes on the first part.

No on the first part to the extent that "... thread W (the writer)
writes a value to variable x and thread R (the reader) later reads
the value of x ... to guarantee that R sees the value written by
W, W must follow the write with a release membar and R must
precede the read with an acquire membar" is not true on shared
memory MP hardware level where acquire and release merely ensure
ordering with respect to other accesses.

regards,
alexander.

Peter Dimov (Aug 25, 2005, 4:01:17 PM)

Alexander Terekhov wrote:

> BOOL FetchComputedValue(int *piResult)
> {
>     if (fValueHasBeenComputed.load(msync::cchlb_true))
>     {
>         *piResult = iValue.load(msync::naked_competing);
>         return TRUE;
>     }
>     else
>         return FALSE;
> }
>
> To Peter: I've extended the cc* stuff with path-specific variants so
> that you can give the compiler a hint about which path doesn't really
> need isync.

Almost missed it.

Can you give approximate PPC translation for the expression

fValueHasBeenComputed.load(msync::cchlb_true)

assuming that the compiler is not smart/atomics-aware and can't just
insert the isync after the 'if'? (IOW, it can't analyze and optimize
the 'load' in context... a library implementation, for example.)

Marcin 'Qrczak' Kowalczyk (Aug 25, 2005, 4:01:57 PM)

"David Schwartz" <dav...@webmaster.com> writes:

>> But this would require that MS insert membars on all volatile
>> accesses, because there is, in general, no way to know whether
>> another part of the program uses an interlocked instruction.
>> Do MS and MS-compatible compilers really do that?
>
> On x86, it's not needed. I'm not sure about other platforms.

I guess MS doesn't care about other platforms.

--
__("< Marcin Kowalczyk
\__/ qrc...@knm.org.pl
^^ http://qrnik.knm.org.pl/~qrczak/

Scott Meyers (Aug 25, 2005, 4:04:37 PM)

Alexander Terekhov wrote:
> No on the first part to the extent that "... thread W (the writer)
> writes a value to variable x and thread R (the reader) later reads
> the value of x ... to guarantee that R sees the value written by
> W, W must follow the write with a release membar and R must
> precede the read with an acquire membar" is not true on shared
> memory MP hardware level where acquire and release merely ensure
> ordering with respect to other accesses.

I was actually asking about the truth of this statement:

  Both writer and reader must participate in a handshake to ensure that
  writer changes to memory are visible to readers in a relaxed memory
  architecture.

What struck me most about the documentation of the interlocked instructions
was that it suggested there was no need for a handshake, i.e., if the
writer called an interlocked routine, there was no need for a reader to do
anything special when reading. From the responses I've seen in this thread,
I get the impression that a handshake *is* required, that both writer and
reader must do something to ensure that the reader sees what the writer
last wrote, at least in the general case.

As for your specific statement above, I gather you mean that on some
architectures, more specific membars may be necessary rather than just
release by the writer and acquire by the reader. Is that correct?

Scott

Alexander Terekhov (Aug 25, 2005, 4:23:34 PM)

Peter Dimov wrote:
[...]

> Can you give approximate PPC translation for the expression
>
> fValueHasBeenComputed.load(msync::cchlb_true)
>
> assuming that the compiler is not smart/atomics-aware and can't just
> insert the isync after the 'if'? (IOW, it can't analyze and optimize
> the 'load' in context... a library implementation, for example.)

B.2.3 Safe Fetch (Book II):

---
In this example it is assumed that the address of the storage
operand to be loaded is in GPR 3, the contents of the storage
operand are returned in GPR 4, ...

lwz r4,0(r3) #load shared data
cmpw r4,r4 #set CR0 to "equal"
bne- $-8 #branch never taken
---

and just add isync after "branch never taken".

regards,
alexander.

Joe Seigh (Aug 25, 2005, 4:31:33 PM)

There's some conditional logic involved and possibly some atomicity
involved. Are you merely quibbling or are you claiming there are
membars with no observable effect?

Alexander Terekhov (Aug 25, 2005, 5:07:18 PM)

Scott Meyers wrote:
[...]

>
> Alexander Terekhov wrote:
> > No on the first part to the extent that "... thread W (the writer)
> > writes a value to variable x and thread R (the reader) later reads
> > the value of x ... to guarantee that R sees the value written by
> > W, W must follow the write with a release membar and R must
> > precede the read with an acquire membar" is not true on shared
> > memory MP hardware level where acquire and release merely ensure
> > ordering with respect to other accesses.
>
> I was actually asking about the truth of this statement:
>
> Both writer and reader must participate in a handshake to ensure that
> writer changes to memory are visible to readers in a relaxed memory
> architecture.

If all you care about is visibility of one single memory location
(your "variable x") then it will (eventually) become visible without
any acquire or release (on hardware level).

>
> What struck me most about the documentation of the interlocked instructions
> was that it suggested there was no need for a handshake, i.e., if the
> writer called an interlocked routine, there was no need for a reader to do
> anything special when reading. From the responses I've seen in this thread,
> I get the impression that a handshake *is* required, that both writer and
> reader must do something to ensure that the reader sees what the writer
> last wrote, at least in the general case.

In the general case with more than one single memory location, you
do need release [sink barrier] -> acquire [hoist barrier] handshake.

>
> As for your specific statement above, I gather you mean that on some
> architectures, more specific membars may be necessary rather than just
> release by the writer and acquire by the reader. Is that correct?

On some software DSMs, acquire/release operations cause lazy/eager
[respectively] propagation of modified data. I've never heard of
such hardware.

regards,
alexander.

David Schwartz (Aug 25, 2005, 5:23:19 PM)

"Marcin 'Qrczak' Kowalczyk" <qrc...@knm.org.pl> wrote in message
news:87br3lz...@qrnik.zagroda...

> "David Schwartz" <dav...@webmaster.com> writes:

>>> But this would require that MS insert membars on all volatile
>>> accesses, because there is, in general, no way to know whether
>>> another part of the program uses an interlocked instruction.
>>> Do MS and MS-compatible compilers really do that?

>> On x86, it's not needed. I'm not sure about other platforms.

> I guess MS doesn't care about other platforms.

They certainly didn't when they first developed the Interlocked*
functions, or their memory model for WIN32.

DS


Peter Dimov (Aug 25, 2005, 5:35:37 PM)

Alexander Terekhov wrote:
> Peter Dimov wrote:
> [...]
> > Can you give approximate PPC translation for the expression
> >
> > fValueHasBeenComputed.load(msync::cchlb_true)
> >
> > assuming that the compiler is not smart/atomics-aware and can't just
> > insert the isync after the 'if'? (IOW, it can't analyze and optimize
> > the 'load' in context... a library implementation, for example.)
>
> lwz r4,0(r3) #load shared data
> cmpw r4,r4 #set CR0 to "equal"
> bne- $-8 #branch never taken
> ---
>
> and just add isync after "branch never taken".

That's what I thought. In this case there is no difference between
cchlb_true and cchlb (and ccacq, and even plain acq because of the fake
control dependency.)

Joe Seigh (Aug 25, 2005, 5:52:50 PM)

You're not entirely sure what Alexander is talking about either?
As far as I can tell from his comments about ordering not being
enough, he thinks the memory ops have to complete in some cases,
which is not true as far as I know. Usually you only need those
kinds of memory barriers for putting boundaries on hardware
errors which can be imprecise and you want to ensure the error gets
reported in the right place. E.g. process switching and IPC.
For normal synchronization you don't need it because there is
no error isolation between threads.

Alexander Terekhov (Aug 25, 2005, 6:25:10 PM)

Peter Dimov wrote:
[...]

> That's what I thought. In this case there is no difference between
> cchlb_true and cchlb (and ccacq, and even plain acq because of the fake
> control dependency.)

Yep. http://tinyurl.com/7nkrt

regards,
alexander.

Alexander Terekhov (Aug 25, 2005, 6:29:13 PM)

Joe Seigh wrote:
[...]

> > That's what I thought. In this case there is no difference between
> > cchlb_true and cchlb (and ccacq, and even plain acq because of the fake
> > control dependency.)
> >
>
> You're not entirely sure what Alexander is talking about either?

http://tinyurl.com/amlh4

regards,
alexander.

Joe Seigh (Aug 25, 2005, 8:13:09 PM)

So what's the point other than a control dependent barrier is slower
than a plain load/load and is no more functional?

Alexander Terekhov (Aug 26, 2005, 4:59:29 AM)

Joe Seigh wrote:
[...]

> So what's the point other than a control dependent barrier is slower
> than a plain load/load and is no more functional?

It's no slower. The speed and effects of the dumbest implementation of
op(msync::cchlb{_path}) are the same as op(msync::acq). It can be done
better.

regards,
alexander.

Peter Dimov (Aug 26, 2005, 5:48:58 AM)

But to be done better, it requires a _very_ smart compiler, right?

if( load(&v, ccacq_true) == 0 )
{
    // #1
}
else
{
    // #2
}

load r1, v
cmp r1, 0
bne @2

@1:
isync
#1
b @3

@2:
#2

@3:

Alexander Terekhov (Aug 26, 2005, 6:04:52 AM)

Peter Dimov wrote:
[...]

> But to be done better, it requires a _very_ smart compiler, right?

Nah. Just smart. ;-)

>
> if( load(&v, ccacq_true) == 0 )
> {
>     // #1
> }
> else
> {
>     // #2
> }
>
> load r1, v
> cmp r1, 0
> bne @2
>
> @1:
> isync
> #1
> b @3
>
> @2:
> #2
>
> @3:

Yep. And the compiler can hoist loads and stores on the #2 and #3 paths
(move 'em above the if... suppose that control always reaches #3).

regards,
alexander.

Alexander Terekhov (Aug 26, 2005, 6:24:03 AM)

Alexander Terekhov wrote:
[...]

> > if( load(&v, ccacq_true) == 0 )
> > {
> >     // #1
> > }
> > else
> > {
> >     // #2
> > }
> >
> > load r1, v
> > cmp r1, 0
> > bne @2
> >
> > @1:
> > isync
> > #1
> > b @3
> >
> > @2:
> > #2
> >
> > @3:
>
> Yep. And the compiler can hoist loads and stores on the #2 and #3 paths
> (move 'em above the if... suppose that control always reaches #3).

With respect to stores on #2, I mean that the implementation can
speculatively *commit* stores on #2 and simply undo 'em in the case
of misprediction. Plain ccacq would not allow that.

regards,
alexander.

Joe Seigh (Aug 26, 2005, 7:33:31 AM)

You have an api that is way more complicated than it probably needs to be.
Sun has #LoadLoad, #LoadStore, etc... which gives you all the different
memory ordering that you would need. You seem to have a lot more
variations which aren't well defined and with no examples of what situations
they would be needed in. What situations wouldn't Sun's membars work in
that yours would?

Note that load_depends is a special case as it's a sort of poor man's
acquire membar on platforms that don't have real ones. But it's not
the same as acquire so you have to be a little careful in its use.

Alexander Terekhov (Aug 26, 2005, 8:09:46 AM)

Joe Seigh wrote:
[...]

> Sun has #LoadLoad, #LoadStore, etc... which gives you all the different
> memory ordering that you would need.

Sun has a whole bunch of (compound) bidirectional fences. A subset
of Sun's fences can be used to implement unidirectional hoist/sink
stuff and also barrier(msync) intrinsic with "true == (msync &
msync_rel) && (msync & msync_acq)" precondition

ssb|hsb -> StoreStore // eieio or {lw}sync
ssb|hlb -> StoreLoad // sync
slb|hsb -> LoadStore // {lw}sync
slb|hlb -> LoadLoad // {lw}sync
rel|hsb -> StoreStore+LoadStore // {lw}sync
rel|hlb -> StoreLoad+LoadLoad // sync
ssb|acq -> StoreStore+StoreLoad // sync
slb|acq -> LoadStore+LoadLoad // {lw}sync
rel|acq -> Sledgehammer proper // sync

on Sun's hardware, but Sun's model sucks because bidirectional
constraints are just way too heavy. As for cc/dd stuff, go read
D.3.3. It says that cchsb (in a somewhat dumb incarnation) is implied
on Sparc hardware (for load{-modify-store} stuff) just like ddhlb,
ddhsb, and ddacq (ddhlb+ddhsb). But cchlb does require a fence (in
RMO): the same (trailing) MEMBAR #LoadLoad as ccacq (cchlb + cchsb).

That does NOT mean that compilers can't be more efficient than
Sun's hardware regarding reordering constraints (I mean compiler
reordering).

regards,
alexander.

Joe Seigh (Aug 26, 2005, 8:25:48 AM)

Alexander Terekhov wrote:
> Joe Seigh wrote:
> [...]
>
>>Sun has #LoadLoad, #LoadStore, etc... which gives you all the different
>>memory ordering that you would need.
>
>
> Sun has a whole bunch of (compound) bidirectional fences. A subset
> of Sun's fences can be used to implement unidirectional hoist/sink
> stuff and also barrier(msync) intrinsic with "true == (msync &
> msync_rel) && (msync & msync_acq)" precondition

Whoa! Define bidirectional and unidirectional first.

>
> ssb|hsb -> StoreStore // eieio or {lw}sync
> ssb|hlb -> StoreLoad // sync
> slb|hsb -> LoadStore // {lw}sync
> slb|hlb -> LoadLoad // {lw}sync
> rel|hsb -> StoreStore+LoadStore // {lw}sync
> rel|hlb -> StoreLoad+LoadLoad // sync
> ssb|acq -> StoreStore+StoreLoad // sync
> slb|acq -> LoadStore+LoadLoad // {lw}sync

Sun's mnemonics seem more intuitive than pirate mnemonics.
Unless you're trying to get ready for Talk Like a Pirate Day,
which is Sept. 19, BTW.

> rel|acq -> Sledgehammer proper // sync

LoadLoad+LoadStore+StoreStore+StoreLoad I assume.

>
> on Sun's hardware, but Sun's model sucks because bidirectional
> constraints are just way too heavy. As for cc/dd stuff, go read
> D.3.3. It says that cchsb (in a somewhat dumb incarnation) is implied
> on Sparc hardware (for load{-modify-store} stuff) just like ddhlb,
> ddhsb, and ddacq (ddhlb+ddhsb). But cchlb does require a fence (in
> RMO): the same (trailing) MEMBAR #LoadLoad as ccacq (cchlb + cchsb).
>
> That does NOT mean that compilers can't be more efficient than
> Sun's hardware regarding reordering constraints (I mean compiler
> reordering).

What do compilers have to do with anything, other than that you need
to address ordering issues with them also, i.e. the "membar" api
needs to address both. For example a "LoadLoad" api would need
a hardware LoadLoad membar and LoadLoad ordering by the compiler.
>
> regards,
> alexander.

Alexander Terekhov (Aug 26, 2005, 9:09:24 AM)

Joe Seigh wrote: ...

<quote sources=tremblant.pdf>

1.2 Ordering constraints

Perhaps the most contentious aspect of the design of the atomics
library has turned out to be the set of ordering constraints. This
matters for several reasons:

1. Lock-free algorithms often require very limited and specific
constraints on the order in which memory operations become visible
to other threads. Even relatively limited constraints such as
“release”, may be too broad, and impose unneeded constraints on
both the compiler and hardware.

2. Different processors often provide certain very limited
constraints at small or no additional cost, where the cost to
enforce something like an “acquire” constraint may be more major.

[...]

3. As discussed briefly below, adding further constraints appears
to significantly complicate the memory model.

At the moment, there is no clear consensus

</quote>

Why don't you simply spend your weekend studying the archives, Joe.

jupiter.robustserver.com/pipermail/cpp-threads_decadentplace.org.uk

regards,
alexander.

Joe Seigh (Aug 26, 2005, 9:37:05 AM)

I look at it occasionally but I don't spend a lot of time looking at
it if I see you're barking up the wrong tree. You don't know how
to define semantics so you're looking at various implementations to
see if that gives you any ideas how to go about it, which it hasn't
so far.

I already *know* how to define semantics. What's more, I work with
extremely esoteric (too esoteric according to some) synchronization
and nothing you guys appear to be doing would be of any use for me.

And you still haven't stated what you mean by bidirectional and
unidirectional.

Alexander Terekhov (Aug 26, 2005, 9:41:08 AM)

Joe Seigh wrote: ...

<quote sources=tremblant.pdf>

1.1 Atomics library approach

We have been discussing a library design in which primitives atomically
read and/or update a memory location, and may optionally provide some
memory ordering guarantees.

Nearly all lock-free algorithms require both atomic, i.e. indivisible,
operations on certain pieces of data, as well as a mechanism for
specifying ordering constraints. Allowing them to be combined in a
single operation (as opposed to providing separate atomic operations
and “memory barriers”) makes it possible to cleanly and precisely
express the programmer’s intent, avoids some unnecessary constraints
on particularly compiler reordering, and makes it easy to take
advantage of hardware primitives that often combine them.

[...]

On the vast majority of existing X86 processors, the “load” and “load
with acquire ordering semantics” primitives could use the same hardware
implementation, and would merely impose different compiler constraints.
Other processor architectures are split as to whether they require
separate implementations. The same applies to “store” and “store with
release semantics”.

We expect that in the final version of the library, operations such as
load and store will be parameterized in some form with an ordering
constraint

</quote>

Why don't you simply spend your weekend studying the archives, Joe.

jupiter.robustserver.com/pipermail/cpp-threads_decadentplace.org.uk

regards,
alexander.

Joe Seigh (Aug 26, 2005, 10:30:35 AM)

You're talking about atomic<T>. Memory barriers are an implementation
detail, not part of the semantics. Unless of course you don't know
how to specify semantics without specifying an implementation. But it's
not a good idea to conflate one with the other.


>
> Why don't you simply spend your weekend studying the archives, Joe.
>
> jupiter.robustserver.com/pipermail/cpp-threads_decadentplace.org.uk
>

Why can't you directly answer specific questions like what is meant
by bidirectional and unidirectional? Instead you blindside the
discussions with pirate talk semantics which nobody has a clue what
you're talking about.

Alexander Terekhov (Aug 26, 2005, 11:26:22 AM)

Joe Seigh wrote:
[...]

> Why can't you directly answer specific questions like what is meant
> by bidirectional and unidirectional?

I'm reluctant because you're going to forget it almost immediately and
I'll just waste time and bandwidth once again. But ok,

http://groups.google.de/group/comp.programming.threads/msg/3f519417b2a619c5
http://groups.google.de/group/comp.programming.threads/msg/a08095e4e4b61155

regards,
alexander.

Alexander Terekhov (Aug 26, 2005, 11:37:43 AM)

Marcin 'Qrczak' Kowalczyk wrote:
>
> "David Schwartz" <dav...@webmaster.com> writes:
>
> >> But this would require that MS insert membars on all volatile
> >> accesses, because there is, in general, no way to know whether
> >> another part of the program uses an interlocked instruction.
> >> Do MS and MS-compatible compilers really do that?
> >
> > On x86, it's not needed. I'm not sure about other platforms.
>
> I guess MS doesn't care about other platforms.

Actually MS doesn't seem to care much about the implications on X86 either.
Regarding making C/C++ volatiles sequentially consistent [SC] like
in revised Java (and its .Net clone so to speak):

<quote sources=tremblant.pdf, annotations added>

Consider having thread A execute the following, where initially x,
y and z are all zero:

atomic_store(&x, 1); [annotation: SC volatile store]
r1 = atomic_load(&y); [annotation: SC volatile load]
if (!r1) ++z;

while thread B executes:

atomic_store(&y, 1); [annotation: SC volatile store]
r2 = atomic_load(&x); [annotation: SC volatile load]
if (!r2) ++z;

Under a sequentially consistent interpretation, one of the atomic
store operations must execute first. Hence r1 and r2 cannot both be
zero, and hence there is no data race involving z. There are data
races involving x and y, but those accesses are made through the
atomic operations library, and hence must be allowed. Atomic
accesses are not meaningful if there is no data race.

The difficulty is that there are strong reasons to support variants
of atomic store and atomic load that allow them to be reordered, i.e.
that allow the atomic load to become visible to other threads before
the atomic store.

For example, preventing this reordering on some common X86 processors
incurs a penalty of over 100 processor cycles in each thread. Both
the ordinary load and store operations, as well as the acquire and
release versions from the preceding section, will allow this
reordering.

For variants that allow reordering, the above program should really
invoke undefined semantics, since r1 and r2 can both be zero, and
hence there is a data race on the ordinary variable accesses to z.

</quote>

To Joe: why don't you simply spend your weekend studying the archives.

jupiter.robustserver.com/pipermail/cpp-threads_decadentplace.org.uk

regards,
alexander.

Joe Seigh (Aug 26, 2005, 11:50:39 AM)

One issue you seem to be touching upon is whether memory accesses have to
complete before some instruction "executes", or not initiate until after
some instruction "executes". I haven't seen any situations, outside of
context switching by the OS, where this is required. I haven't seen any
examples from you where you think this is required, i.e. it won't work
without it. AFAIK, all you need is relative ordering of the memory
accesses. You can't directly observe the instruction "execution" anyway.

Alexander Terekhov (Aug 26, 2005, 12:04:09 PM)

Joe Seigh wrote:
[...]

> One issue you seem to be touching upon is whether memory accesses have to
> complete before some instruction "executes", or not initiate until after
> some instruction "executes".

It only seems that way to you. Actually, it's about "touching" shared
memory (performing accesses that yield observable results stored/loaded
to/from shared memory), not initiation or completion of instructions
(whatever that means).

regards,
alexander.

Joe Seigh (Aug 26, 2005, 12:19:57 PM)

Alexander Terekhov wrote:
> To Joe: why don't you simply spend your weekend studying the archives.
>
> jupiter.robustserver.com/pipermail/cpp-threads_decadentplace.org.uk
>

Done.

1) None of you know how to define semantics for atomic ops.
2) You have no examples, let alone compelling ones, on why
anything more complex than simple memory ordering needs to be
exposed at the api level rather than left as an implementation
decision. *

* The one exception I'm aware of (because I use it myself) is atomic
load depends and we can get away with that because so much of lock-free
depends on pointer swizzling.

Alexander Terekhov (Aug 26, 2005, 12:40:54 PM)

Joe Seigh wrote:
>
> Alexander Terekhov wrote:
> > To Joe: why don't you simply spend your weekend studying the archives.
> >
> > jupiter.robustserver.com/pipermail/cpp-threads_decadentplace.org.uk
> >
>
> Done.

Where can I buy that supersonic accelerator drug you're on?

>
> 1) None of you know how to define semantics for atomic ops.

Feel free to illuminate me.

> 2) You have no examples, let alone compelling ones, on why
> anything more complex than simple memory ordering needs to be
> exposed at the api level rather than left as an implementation
> decision.

I'm just curious: what sort of complex implementation of simple memory
ordering do you have in your mind?

regards,
alexander.

Joe Seigh (Aug 26, 2005, 12:57:33 PM)

Alexander Terekhov wrote:
> Joe Seigh wrote:
>
>>Alexander Terekhov wrote:
>>
>>>To Joe: why don't you simply spend your weekend studying the archives.
>>>
>>>jupiter.robustserver.com/pipermail/cpp-threads_decadentplace.org.uk
>>>
>>
>>Done.
>
>
> Where can I buy that supersonic accelerator drug you're on?

There wasn't that much to catch up on.

>
>
>>1) None of you know how to define semantics for atomic ops.
>
>
> Feel free to illuminate me.

It's not my job. The burden of proof is on you since you've
taken it on yourselves to define this stuff. You do have
the excuse that there's been no good examples of prior work
in this area, Posix having made a miserable failure of it
themselves.

>
>
>>2) You have no examples, let alone compelling ones, on why
>> anything more complex than simple memory ordering needs to be
>> exposed at the api level rather than left as an implementation
>> decision.
>
>
> I'm just curious: what sort of complex implementation of simple memory
> ordering do you have in your mind?

Wrong way around. I asked you the question. What usage example
requires api-level semantics more complex than simple memory
ordering? I already gave an example of the one exception I'm
aware of.

>
> regards,
> alexander.

Peter Dimov (Aug 26, 2005, 3:51:46 PM)

Alexander Terekhov wrote:
> Scott Meyers wrote:
>
> [... busted CacheComputedValue()/FetchComputedValue() example ...]
>
> atomic<int> iValue;
> atomic<BOOL> fValueHasBeenComputed(FALSE);
> int ComputeValue();
> // http://tinyurl.com/68jav
> #pragma isolated_call(ComputeValue)
>
> void CacheComputedValue()
> {
>     if (!fValueHasBeenComputed.load(msync::naked_competing))
>     {
>         iValue.store(ComputeValue(), msync::naked_competing);

iValue doesn't seem to need atomicity... but that's not my point today.

>         fValueHasBeenComputed.store(TRUE, msync::ssb);
>     }
> }
>
> BOOL FetchComputedValue(int *piResult)
> {
>     if (fValueHasBeenComputed.load(msync::cchlb_true))
>     {
>         *piResult = iValue.load(msync::naked_competing);
>         return TRUE;
>     }
>     else
>         return FALSE;
> }
>
> To Peter: I've extended the cc* stuff with path-specific variants so
> that you can give the compiler a hint about which path doesn't really
> need isync.

This doesn't stand a chance, but something along those lines:

if( fValueHasBeenComputed.load(msync::ccacq) )
{
    ccacq_barrier();
    *piResult = iValue.load(msync::naked_competing);
    return TRUE;
}
else
    return FALSE;

is perfectly implementable. load/ccacq will map to either load/acq or
load/none, and ccacq_barrier() will map to no-op or #loadLoad/isync,
respectively.

Luke Elliott (Aug 26, 2005, 4:27:43 PM)

Scott Meyers wrote:

[snip]

>
> Can somebody please clear up my
> confusion?
>

All the replies so far seem confusing... How about:

Apparently not.

Peter Dimov (Aug 26, 2005, 4:32:46 PM)

True. I'll give it a try.

Peter Dimov (Aug 26, 2005, 4:41:27 PM)

Scott Meyers wrote:

[...]

> What's important about my undertanding is that guaranteeing that R sees the
> correct value depends on both W and R taking actions. It's not enough for
> W alone to use a membar or a lock, and it's not enough for R alone to use a
> membar or a lock: both must do something to ensure that W's write is
> visible to R.
>
> This model does not seem to be consistent with the documented semantics of
> Microsoft's Interlocked instructions at
> http://msdn.microsoft.com/library/default.asp?url=/library/en-us/dllproc/base/about_synchronization.asp,
> which "ensure that previous read and write requests have completed and are
> made visible to other processors, and ensure that that no subsequent read
> or write requests have started." The page gives this example:
>
> BOOL volatile fValueHasBeenComputed = FALSE;
>
> void CacheComputedValue()
> {
>     if (!fValueHasBeenComputed)
>     {
>         iValue = ComputeValue();
>         InterlockedExchange((LONG*)&fValueHasBeenComputed, TRUE);
>     }
> }
>
> The InterlockedExchange function ensures that the value of iValue is
> updated for all processors before the value of fValueHasBeenComputed is
> set to TRUE.
>
> What confuses me is that here only the writer needs to take an action to
> guarantee that readers will see things in the proper order, where my
> understanding had been that the reader, too, would have to take some action
> before reading iValue and fValueHasBeenComputed to ensure that it didn't
> get a stale value for one or both.

Correct. Consider this reader:

if( fValueHasBeenComputed )
{
    // do something with iValue
}

Nothing stops the compiler or the hardware from reading iValue before
fValueHasBeenComputed. Even if the writes to iValue and
fValueHasBeenComputed are ordered by the writer, it is still possible
for the reader to access an uninitialized iValue because of a
speculative load.

This can be fixed with a mutex, in which case the lock release/acquire
"handshake" will provide the necessary ordering. It can also be fixed
by attaching an "acquire" label to the load of fValueHasBeenComputed,
which will prevent the speculative execution of the subsequent loads
(and stores, but these generally don't cross a conditional branch).
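
Concretely, a sketch of a fixed reader (load_acquire() is a
hypothetical primitive standing in for whatever acquire-labeled load
the platform provides):

BOOL FetchComputedValue(int *piResult)
{
    if (load_acquire(&fValueHasBeenComputed))  /* acquire-labeled load */
    {
        /* the acquire forbids hoisting this load above the flag check */
        *piResult = iValue;
        return TRUE;
    }
    return FALSE;
}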

Chris Thomasson (Aug 26, 2005, 6:32:14 PM)

> What confuses me is that here only the writer needs to take an action to
> guarantee that readers will see things in the proper order, where my
> understanding had been that the reader, too, would have to take some
> action
> before reading iValue and fValueHasBeenComputed to ensure that it didn't
> get a stale value for one or both.

> BOOL volatile fValueHasBeenComputed = FALSE;
>
> void CacheComputedValue()
> {
> if (!fValueHasBeenComputed)

^^^^^^^^^^

Relying on the compiler to inject a data-dependent acquire, hoist, 'whatever'
barrier for volatile loads is probably not a good practice...

;)


> {
> iValue = ComputeValue();
> InterlockedExchange((LONG*)&fValueHasBeenComputed, TRUE);

^^^^^^^^^^

IMO, the atomic exchange has to ensure that fValueHasBeenComputed is
atomically set to TRUE "after" ComputeValue()'s effects are made
visible. The ordering for the writer would have to go something like this:

1( call init_func ) => 2( set atomic_flag )

The state transition would require a release barrier in order to ensure that
state 1's effects are fully visible to state 2. This kind of brings up
another question:


Will InterlockedExchange ensure that the ( fValueHasBeenComputed == TRUE )
condition is not made visible "before" ComputeValue()'s effects are made
visible? Humm... This example of DCL "may" be busted on the writer's end as
well...


> Obviously I'm missing something.

Na.

:)


Joe Seigh (Aug 27, 2005, 11:25:24 AM)

Joe Seigh wrote:
> * The one exception I'm aware of (because I use it myself) is atomic
> load depends and we can get away with that because so much of lock-free
> depends on pointer swizzling.
>

Speaking of that, what do most people think are its semantics? A qualified
acquire (#LoadLoad+#LoadStore) or just a qualified #LoadLoad?

I believe Linux uses the former. The only platform that requires an explicit
memory barrier, Alpha, has no #LoadLoad equivalent, just a full membar and
a #StoreStore membar. And current dependent loads on the other platforms have
acquire semantics AFAIK.

I'm thinking of making the atomic_load_depends in my set of atomics just
provide #LoadLoad semantics. Because I strongly suspect that Intel and/or
AMD will break the dependent load hack down the road. If you have acquire
semantics, then you will be forced to use MFENCE which will affect
performance more than just using LFENCE for #LoadLoad semantics.

If you go the #LoadLoad route, then writing to shared objects accessed by
dependent load will require explicit synchronization to ensure full acquire
semantics, something that is likely being used anyway.

I'm trying to avoid being blindsided by Intel/AMD who seem to have almost
no awareness of what's going on in synchronization.

Alexander Terekhov (Aug 27, 2005, 12:23:09 PM)

Joe Seigh wrote:
>
> Joe Seigh wrote:
> > * The one exception I'm aware of (because I use it myself) is atomic
> > load depends and we can get away with that because so much of lock-free
> > depends on pointer swizzling.
> >
>
> Speaking of that, what do most people think are its semantics? A qualified
> acquire (#LoadLoad+#LoadStore) or just a qualified #LoadLoad?

Unidirectional ddacq (ddhlb+ddhsb). The problem is that the compiler may
be inclined to turn your data dependency into a control condition, for
which you'd need ccacq (cchlb+cchsb), not ddacq. Note also that while
cchsb is implied on all existing hardware, cchlb is not.

regards,
alexander.

Joe Seigh (Aug 27, 2005, 12:43:42 PM)

Compiler issues aside. And it's not existing hardware I'm worried about.
A lock-free linked list traversal wins over a locked version because
in part the lock-free list node accesses aren't any more expensive
than the locked ones. That would change if you had to put in full
membars. Although the tests I did with hazard pointers w/ and w/o
membars didn't show the former doing all that horribly. But I think
I'm getting mixed signals here. Do full memory barriers matter and
if not why is anyone wasting time trying to optimize them?

David Hopwood (Aug 27, 2005, 3:05:57 PM)

Joe Seigh wrote:
> Joe Seigh wrote:
>
>> * The one exception I'm aware of (because I use it myself) is atomic
>> load depends and we can get away with that because so much of lock-free
>> depends on pointer swizzling.
>
> Speaking of that, what do most people think are its semantics? A qualified
> acquire (#LoadLoad+#LoadStore) or just a qualified #LoadLoad?

AFAICS, only the latter is needed in most cases. For example, consider a
publisher/subscriber pattern. The publisher needs a #StoreStore; the subscriber
needs a dependent #LoadLoad. The subscriber does not store to the published
data at all, so there is no reason why it would need a #LoadStore.
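
A sketch of that pattern, using barriers in the style of the Linux
macros mentioned earlier in the thread (smp_wmb() is the #StoreStore;
smp_read_barrier_depends() is the dependent #LoadLoad, a no-op
everywhere except Alpha; node/head are illustrative names):

struct node { int payload; };
struct node *volatile head;  /* the published pointer */

/* Publisher */
void publish(struct node *n)
{
    n->payload = 42;
    smp_wmb();  /* #StoreStore: initialize before publishing */
    head = n;
}

/* Subscriber: the load of n->payload depends on the load of head */
int subscribe(void)
{
    struct node *n = head;
    smp_read_barrier_depends();  /* dependent #LoadLoad */
    return n ? n->payload : -1;
}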

> I believe Linux uses the former. The only platform that requires an
> explicit memory barrier, Alpha, has no #LoadLoad equivalent, just a full
> membar and a #StoreStore membar. And current dependend loads on the other
> platforms have acquire semantics AFAIK.
>
> I'm thinking of making the atomic_load_depends in my set of atomics just
> provide #LoadLoad semantics. Because I strongly suspect that Intel and/or
> AMD will break the dependent load hack down the road.

Given that it is basically just a coincidence that it happens to work on the
current processor implementations, that's quite possible.

> If you have acquire semantics, then you will be forced to use MFENCE which
> will affect performance more than just using LFENCE for #LoadLoad semantics.
>
> If you go the #LoadLoad route, then writing to shared objects accessed by
> dependent load will require exlicit synchronization to ensure full acquire
> semantics, something that is likely being used anyway.
>
> I'm trying to avoid being blindsided by Intel/AMD who seem to have almost
> no awareness of what's going on in synchronization.

Hey, that's not fair. Intel and AMD's documentation clearly do *not*
guarantee anything about dependent loads. If it breaks, tough. This is
no different from any other implementation-defined behaviour. I see no
reason why Intel or AMD should be constrained to continue to support
every random property of their current processor models that some bunch
of hackers might rely on without any justification from the docs.

--
David Hopwood <david.nosp...@blueyonder.co.uk>

Alexander Terekhov (Aug 27, 2005, 3:33:20 PM)

David Hopwood wrote:
[...]

> Intel and AMD's documentation clearly do *not*
> guarantee anything about dependent loads.

Because they guarantee (albeit in a somewhat confusing language)
processor consistency model with classic "full" acquire for loads.

http://groups.google.de/group/comp.lang.c++.moderated/msg/40e6f068496500a7

regards,
alexander.

Joe Seigh (Aug 27, 2005, 3:40:54 PM)

David Hopwood wrote:

> Joe Seigh wrote:
>>
>> I'm trying to avoid being blindsided by Intel/AMD who seem to have almost
>> no awareness of what's going on in synchronization.
>
>
> Hey, that's not fair. Intel and AMD's documentation clearly do *not*
> guarantee anything about dependent loads. If it breaks, tough. This is
> no different from any other implementation-defined behaviour. I see no
> reason why Intel or AMD should be constrained to continue to support
> every random property of their current processor models that some bunch
> of hackers might rely on without any justification from the docs.
>

The Linux kernel developers are a bunch of hackers? :)
Actually, no one is relying on it to work. That's what
the wrapper macros are for. They let you add a membar if
the implementation dependent stuff breaks. But it would be ironic
if Intel inadvertently breaks the very stuff people are
using to make multi-core processors more scalable. We're
just trying to save Intel from themselves. It's not a
correctness of implementation issue, it's a performance
of implementation issue.

Alexander Terekhov (Aug 27, 2005, 3:48:02 PM)

Joe Seigh wrote:
[...]

> just trying to save Intel from themselves. It's not a
> correctness of implementation issue, it's a performance

Under x86 memory model, all loads (including dependent ones) behave
in-order with respect to preceding loads. Processor can perform out-
of-order speculative loads but they never yield incorrect results
(processor detects memory ordering violations and rolls back).

regards,
alexander.

David Hopwood (Aug 27, 2005, 7:39:48 PM)

Alexander Terekhov wrote:
> David Hopwood wrote:
> [...]
>
>> Intel and AMD's documentation clearly do *not*
>>guarantee anything about dependent loads.

On re-reading the docs, scratch "clearly" :-)

> Because they guarantee (albeit in a somewhat confusing language)
> processor consistency model with classic "full" acquire for loads.
>
> http://groups.google.de/group/comp.lang.c++.moderated/msg/40e6f068496500a7

I don't see it. If it is true that:

# Under x86 memory model, all loads (including dependent ones) behave
# in-order with respect to preceding loads. Processor can perform out-
# of-order speculative loads but they never yield incorrect results
# (processor detects memory ordering violations and rolls back).

and also that

# In a multiprocessor system, the following rules apply:
# * Individual processors obey the same rules as in a single
# processor system.
# * Writes by a single processor are observed in the same
# order by all other processors.
# [...]

then what's the point of the lfence instruction? Presumably it isn't just
a no-op? An example of an ordering that is prevented by adding an lfence
might help me to understand this.

Incidentally, are we talking about just 32-bit x86 here, or also AMD64/EM64T?

--
David Hopwood <david.nosp...@blueyonder.co.uk>

David Hopwood (Aug 27, 2005, 7:44:29 PM)

Joe Seigh wrote:
> David Hopwood wrote:
>> Joe Seigh wrote:
>>>
>>> I'm trying to avoid being blindsided by Intel/AMD who seem to have
>>> almost no awareness of what's going on in synchronization.
>>
>> Hey, that's not fair. Intel and AMD's documentation clearly do *not*
>> guarantee anything about dependent loads. If it breaks, tough. This is
>> no different from any other implementation-defined behaviour. I see no
>> reason why Intel or AMD should be constrained to continue to support
>> every random property of their current processor models that some bunch
>> of hackers might rely on without any justification from the docs.
>
> The Linux kernel developers are a bunch of hackers? :)

They'd tell you so themselves ;-)

> Actually, no one is relying on it to work. That's what
> the wrapper macros are for. They let you add a membar if
> the implementation dependent stuff breaks. But it would be ironic
> if Intel inadvertently breaks the very stuff people are
> using to make multi-core processors more scalable. We're
> just trying to save Intel from themselves. It's not a
> correctness of implementation issue, it's a performance
> of implementation issue.

That's not clear at all. If dependent loads break, some lfence instructions
will have to be added. But they will only have to be added in places where
the code is actually relying on a load-load constraint, whereas the current
semantics (whatever they are :-) potentially affect the performance of *every*
load. If Intel or AMD broke dependent loads, I assume it would be because
they'd benchmarked the change and found that there was a significant performance
gain. I don't know what the ratio of all loads to membars is, but it's got to
be very high, so just a tiny improvement in performance due to relaxing the
constraints on loads *could* vastly outweigh the cost of the added lfences.

--
David Hopwood <david.nosp...@blueyonder.co.uk>

Joe Seigh (Aug 28, 2005, 7:45:02 AM)

The official memory model states otherwise so you probably shouldn't
use the term "memory model" that way. You mean load ordering as
implemented. It's not clear whether you're talking about processor
order or memory order here. If the latter it seems like a lot of
exotic bus snooping with an expensive performance hit just to enforce
a memory model which is not even the official one. Seems even more strange
since they clearly and explicitly state exceptions to store order
for performance reasons.

Joe Seigh (Aug 28, 2005, 8:03:39 AM)

Intel most likely benchmarks based on their official memory model, and
they'd have no way of distinguishing between normal loads and loads that
rely on dependent loads for proper ordering, loads that would require
LFENCE after the fact. So their projections of performance improvement
would only be based on current LFENCE usage, not future LFENCE usage,
which would be much greater. So the true effect of the change wouldn't
be known until after the processors got changed and present software
(e.g. the Linux kernel) got changed to run on the new processors correctly.

I'm not too worried about LFENCE now. I'm assuming a reasonably optimal
implementation will be about as expensive as a dependent load in situations
where all the accesses are dependent anyway. It could be a problem if
Intel implements it as a serializing instruction rather than as an ordering
instruction. And MFENCE could be a problem if you're required to use it
instead because of your atomic api semantics since to avoid store penalties
you'd have to avoid *all* stores, even ones onto local non-shared memory.

Alexander Terekhov (Aug 28, 2005, 12:07:28 PM)

David Hopwood wrote:
[...]
> then what's the point of the lfence instruction?

SSE* fences are meant to control out-of-order SSE* writes of strings
(sfence/mfence) and disable speculation (to cache stuff in order) of
loads (lfence/mfence). SSE* stuff and ordering observable on "system
bus" aside for a moment, the x86 memory model (processor consistency)
did't change since 486. See also Intel Itanium Architecture Software
Developer's Manual 6.3.4: "IA-32 instructions are mapped into the
Itanium memory ordering model as follows...".

regards,
alexander.

David Hopwood (Aug 28, 2005, 12:23:40 PM)

Joe Seigh wrote:
> David Hopwood wrote:
>> Joe Seigh wrote:
>>
>>> Actually, no one is relying on it to work. That's what
>>> the wrapper macros are for. They let you add a membar if
>>> the implementation dependent stuff breaks. But it would be ironic
>>> if Intel inadvertently breaks the very stuff people are
>>> using to make multi-core processors more scalable. We're
>>> just trying to save Intel from themselves. It's not a
>>> correctness of implementation issue, it's a performance
>>> of implementation issue.
>>
>> That's not clear at all. If dependent loads break, some lfence
>> instructions will have to be added. But they will only have to be
>> added in places where the code is actually relying on a load-load
>> constraint, whereas the current semantics (whatever they are :-)
>> potentially affect the performance of *every* load. If Intel or AMD
>> broke dependent loads, I assume it would be because they'd benchmarked
>> the change and found that there was a significant performance gain.
>> I don't know what the ratio of all loads to membars is, but it's got
>> to be very high, so just a tiny improvement in performance due to
>> relaxing the constraints on loads *could* vastly outweigh the cost
>> of the added lfences.
>
> Intel most likely benchmarks based on their official memory model and
> they'd have no way of distinguishing between normal loads and loads that
> rely on dependent loads for proper ordering, loads that would require

> LFENCE after the fact. So their projections of performance improvement
> would only be based on current LFENCE usage, not future LFENCE usage
> which would be much greater.

I would prefer Intel and AMD to benchmark based on code that actually exists
and that follows the memory model *as specified*, than to speculate about
the performance of future versions of code that doesn't currently follow
the memory model (if I'm correct that it doesn't).

> So the true effect of the change wouldn't be known until after the
> processors got changed and present software (e.g. Linux kernel) got
> changed to run on the new processors correctly.

C'est la vie.

> I'm not too worried about LFENCE now. I'm assuming a reasonably optimal
> implementation will be about as expensive as a dependent load in situations
> where all the accesses are dependent anyway.

Right, there's no reason why it should be any more expensive than that.

--
David Hopwood <david.nosp...@blueyonder.co.uk>

David Hopwood

unread,
Aug 28, 2005, 3:32:07 PM8/28/05
to
Alexander Terekhov wrote:
> David Hopwood wrote:
> [...]
>
>>then what's the point of the lfence instruction?
>
> SSE* fences are meant to control out-of-order SSE* writes of strings
> (sfence/mfence)

That much makes sense. For simplicity, let's exclude string writes and
anything that changes cache behaviour from the usual defaults. Let's also
focus exclusively on what is visible to programs and not what happens on
the system bus.

> and disable speculation (to cache stuff in order) of loads (lfence/mfence).

Then I'm still confused.

Stores by each processor occur in program order, that's clear. You're saying
that stores made by processor 1 can nevertheless be loaded by processor 2
out of program order. I see that there could be memory models and
implementations for which this is possible, e.g. due to speculation.
But how is it consistent with saying that all loads have acquire semantics?

Example. Start with x == y == 0.

Processor 1:
a) x := 1
b) y := 1

Processor 2:
c) i := y
d) j := x

For a processor ordering model in which loads have acquire semantics,
{i == 1, j == 0} is not possible. If the effects of speculation can be
visible and need to be inhibited by an lfence between c) and d), then
this outcome is possible. Which is it for IA-32?

(And if anyone knows, which for AMD64 and EM64T?)
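To make the question concrete, here is how the reader side would look if an
lfence really were needed (a sketch only, GCC-style inline asm on IA-32 with
SSE2; whether the fence is required is exactly what I'm asking):

/* Reader side of the litmus test above, with a conservatively
   placed lfence between the two loads. */
volatile int x = 0, y = 0;

void reader(int *i, int *j)
{
    *i = y;                                       /* c) i := y */
    __asm__ __volatile__ ("lfence" ::: "memory"); /* needed or not? */
    *j = x;                                       /* d) j := x */
}

If {i == 1, j == 0} really is impossible without the lfence, then the fence
is dead weight here.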

Some Googling turned up this description of the PPro (caveat: from 1997)
by Mike Haertel of Intel:
<http://mail-index.netbsd.org/tech-kern/1997/05/06/0000.html>

# The Pentium Pro's memory ordering model is called "processor ordering"
# and is a formalization of the 486's semantics. The 486 had
# a write-through cache with write queue to memory which was
# not snooped by loads on other processors.
#
# Loosely speaking, this means the ordering of events originating
# from any one processor in the system, as observed by other processors,
# is always the same. However, different observers are allowed
# to disagree on the interleaving of events from two or more processors.
#
# The PPro does speculative and out-of-order loads. However,
# it has a mechanism called the "memory order buffer" to ensure
# that the above memory ordering model is not violated. Load
# and store instructions do not get retired until the processor
# can prove there are no memory ordering violations in the actual
# order of execution that was used. Stores do not get sent to
# memory until they are ready to be retired. If the processor
# detects a memory ordering violation, it discards all unretired
# operations (including the offending memory operation) and
# restarts execution at the oldest unretired instruction.
#
# i.e. when a violation is detected the MOB whacks the machine ... :-)

Which is all fine, but if speculative loads have no effect on the memory
model, I still have no idea what the point of lfence is. Unless it is only
needed when the memory ordering is weakened using MTRRs etc.?

> SSE* stuff and ordering observable on "system
> bus" aside for a moment, the x86 memory model (processor consistency)
> didn't change since 486. See also Intel Itanium Architecture Software
> Developer's Manual 6.3.4: "IA-32 instructions are mapped into the
> Itanium memory ordering model as follows...".

OK, but I'm skeptical about relying on that, because it documents an
implementation of IA-32 that *could* have stronger ordering guarantees
than IA-32 itself.

--
David Hopwood <david.nosp...@blueyonder.co.uk>

Chris Thomasson

unread,
Aug 28, 2005, 6:06:16 PM8/28/05
to

> Will InterlockedExchange ensure that the ( fValueHasBeenComputed == TRUE )
> condition is not made visible "before" ComputeValue(...)'s effects are
> made visible? Humm... This example of DCL "may" be busted on the writer's
> end as well...

Humm, missed this part...


>> The InterlockedExchange function ensures that the value of iValue is
>> updated for all processors before the value of fValueHasBeenComputed is
>> set to TRUE.

Note how it doesn't say anything about ComputeValue(...); it just says that
iValue will be updated first. However, I guess that guarantees that
ComputeValue(...)'s effects will be fully visible before the flag is set. Does
anyone totally trust that assertion without disassembling the Interlocked
APIs?

;)
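
FWIW, here is a hypothetical reader-side counterpart that the MSDN page
doesn't show (the names and types are mine, not Microsoft's); the commented
placeholder is the part this thread keeps arguing about:

/* Hypothetical reader for the MSDN example. On a sufficiently
   relaxed architecture, the load of the flag would need acquire
   semantics before iValue could safely be read. */
BOOL FetchComputedValue(LONG *pResult)
{
    if (fValueHasBeenComputed)    /* read the flag */
    {
        /* an acquire barrier would go here on a weaker model */
        *pResult = iValue;        /* read the data */
        return TRUE;
    }
    return FALSE;
}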


Alexander Terekhov

unread,
Aug 29, 2005, 4:53:06 AM8/29/05
to

David Hopwood wrote:
[...]

> Stores by each processor occur in program order, that's clear. You're saying
> that stores made by processor 1 can nevertheless be loaded by processor 2
> out of program order. I see that there could be memory models and
> implementations for which this is possible, e.g. due to speculation.
> But how is it consistent with saying that all loads have acquire semantics?

http://groups.google.de/group/comp.arch/msg/c363731d2680ba8e

>
> Example. Start with x == y == 0.
>
> Processor 1:
> a) x := 1
> b) y := 1
>
> Processor 2:
> c) i := y
> d) j := x
>
> For a processor ordering model in which loads have acquire semantics,
> {i == 1, j == 0} is not possible. If the effects of speculation can be
> visible and need to be inhibited by an lfence between c) and d), then

For not weakened "ordinary" memory, the effects are not visible unless
you sit on the "system bus" or care about discards.

> this outcome is possible. Which is it for IA-32?

{i == 1, j == 0} is not possible (for not weakened "ordinary" memory).

>
> (And if anyone knows, which for AMD64 and EM64T?)

{i == 1, j == 0} is not possible (for not weakened "ordinary" memory).

[...]


> Which is all fine, but if speculative loads have no effect on the memory
> model, I still have no idea what the point of lfence is. Unless it is only
> needed when the memory ordering is weakened using MTRRs etc.?

That too.

http://groups.google.de/group/comp.programming.threads/msg/8d27f54bd4e85814

regards,
alexander.

Alexander Terekhov

unread,
Aug 29, 2005, 6:04:21 AM8/29/05
to

Joe Seigh wrote:
[...]

> The official memory model states otherwise so you probably shouldn't
> use the term "memory model" that way. You mean load ordering as
> implemented. It's not clear whether you're talking about processor
> order or memory order here.

In Intel speak, "memory order[ing]" is "the order in which the
processor issues reads (loads) and writes (stores) through the system
bus to system memory." (7.2. MEMORY ORDERING). Weakened memory and SSE
stuff aside for a moment, the *memory model* is processor consistency
(aka processor ordering in Intel speak) with all loads having acquire
semantics. SPO (Speculative Processor Ordering) implementation doesn't
break it. End of story.

regards,
alexander.

Joe Seigh

unread,
Aug 29, 2005, 7:39:29 AM8/29/05
to

"1. Reads can be carried out speculatively and in any order."

Alexander Terekhov

unread,
Aug 29, 2005, 7:55:52 AM8/29/05
to

Joe Seigh wrote:
[...]

> "1. Reads can be carried out speculatively and in any order."

http://groups.google.de/group/comp.lang.c++.moderated/msg/40e6f068496500a7
("If you think it through, you will see..."

regards,
alexander.

Alexander Terekhov

unread,
Aug 29, 2005, 8:18:16 AM8/29/05
to

Alexander Terekhov wrote:
[...]
> Weakened memory

http://www.intel.com/design/pentiumII/applnots/24442201.pdf

regards,
alexander.

Joe Seigh

unread,
Aug 29, 2005, 8:40:00 AM8/29/05
to

It's not clear at all. Processor consistency is a totally meaningless term
to use for a multiprocessor memory model.

The above statement can be rewritten as

1. Reads can be carried out speculatively.
2. Reads can be carried out in any order.

You seem to be reading it as something else.

If you had a memory model that states reads were in order (a la SPARC TSO) and
under the covers allowed out-of-order speculative reads, the reads would have
to be invalidated if *any* memory accesses by *any* processor were observed on
the system bus before any prior reads completed. Which means that as soon
as you went from single processor to multi-processor you would see a
significant degradation in performance which would get worse the more
processors you added. More processors, more memory bus activity.

This being the case, I don't think anyone at Intel would find themselves bound to
an unofficial memory model from the 90's when the official one would yield
significantly better performance.

And AMD clearly doesn't either. Go read *their* architecture manuals.

Alexander Terekhov

unread,
Aug 29, 2005, 8:48:44 AM8/29/05
to

Joe Seigh wrote:
[...]

> > http://groups.google.de/group/comp.lang.c++.moderated/msg/40e6f068496500a7
> > ("If you think it through, you will see..."
> >
>
> It's not clear at all. Processor consistency is a totally meaningless term
> to use for a multiprocessor memory model.

Processor consistency is the term of art for a multiprocessor *memory
model*.

Go google it.

And, BTW, regarding the term "memory model":

http://rsim.cs.uiuc.edu/~sadve/jmm/sc-.pdf

"The memory model is the interface between the system and the programmer
that defines the values that a read in a program is allowed to return."

As for SSE fencing...

ftp://download.intel.com/design/Pentium4/manuals/25366516.pdf

----
In general, ***WC semantics*** require software to ensure coherence,
with respect to other processors and other system agents (such as
graphics cards). Appropriate use of synchronization and fencing must
be performed for producer-consumer usage models.

[...]

The SFENCE (store fence) instruction provides greater control over
the ordering of store operations when using ***weakly-ordered***
memory types.

[...]

The SFENCE (Store Fence) instruction controls write ordering by
creating a fence for memory store operations. This instruction
guarantees that the result of every store instruction that precedes
the store fence in program order is globally visible before any
store instruction that follows the fence. The SFENCE instruction
provides an efficient way of ensuring ordering between procedures
that produce ***weakly-ordered*** data and procedures that consume
that data.

[...]

SSE2 extensions introduce two new fence instructions (LFENCE and
MFENCE) as companions to the SFENCE instruction introduced with SSE
extensions.

The LFENCE instruction establishes a memory fence for loads. It
guarantees ordering between two loads and prevents speculative loads
from passing the load fence (that is, no speculative loads are allowed
until all loads specified before the load fence have been
carried out).

The MFENCE instruction combines the functions of LFENCE and SFENCE by
establishing a memory fence for both loads and stores. It guarantees
that all loads and stores specified before the fence are globally
observable prior to any loads or stores being carried out after the
fence.
----

So apart from the store-load fencing provided by mfence, the SSE fence
instructions really have nothing to do with "ordinary" memory under the
processor consistency memory model (load:acquire, store:release,
and locked-stuff:release+acquire).
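
Or, summarizing the claim in one place (my tabulation, not vendor text):

  plain load     ->  load.acquire
  plain store    ->  store.release
  LOCKed RMW     ->  acquire + release (plus store-load ordering)
  sfence/lfence  ->  matter only for weakly-ordered (WC/SSE) memory
  mfence         ->  additionally orders a store followed by a load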

Got it now?

regards,
alexander.

Joe Seigh

unread,
Aug 29, 2005, 11:17:35 AM8/29/05
to
Alexander Terekhov wrote:
> Joe Seigh wrote:
> [...]
>
>>>http://groups.google.de/group/comp.lang.c++.moderated/msg/40e6f068496500a7
>>>("If you think it through, you will see..."
>>>
>>
>>It's not clear at all. Processor consistency is a totally meaningless term
>>to use for a multiprocessor memory model.
>
>
> Processor consistency is the term of the art for multiprocessor *memory
> model*.
>
> Go google it.

I was thinking of processor logical order but it doesn't matter. It may
be useful for building HPC clusters but it's useless for SMMP (shared
memory multi-processing). That's because shared memory is the arbiter
in synchronization. I.e. it's the order of reads and writes to memory
that counts, not the order by specific processors. Processor consistency
even explicitly states stores by different processors aren't necessarily
read in the order they appear in memory.

You may be thinking that doesn't matter because processor consistency
gives you release and acquire semantics between two processors. But
what if you have 3 (or more) processors? For example, processor A
initializes an object and stores its address in X. Processor B reads X. So
far, so good. All the stores by A are read in proper order by B.
Now processor B stores the object's address in Y and processor C reads Y.
Now there's a problem. There's no guarantee that processor C will see
the writes by A in proper order (i.e. relative to the read of Y). So
processor consistency doesn't give you acquire and release as they are
commonly understood.
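
In code form (a plain C sketch with my own hypothetical types; no barriers
shown, since whether any are needed is the question):

typedef struct { int data; } object_t;  /* hypothetical payload */

object_t obj;
object_t *volatile X = 0;
object_t *volatile Y = 0;

void processor_A(void)
{
    obj.data = 42;
    X = &obj;                 /* publish to B */
}

void processor_B(void)
{
    object_t *p;
    while ((p = X) == 0) ;    /* wait for A's publication */
    Y = p;                    /* republish to C */
}

void processor_C(void)
{
    object_t *p;
    while ((p = Y) == 0) ;    /* wait for B's republication */
    /* is p->data == 42 guaranteed here? that's the question */
}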

[...]
>
> Got it now?

And you *really* do need to read the AMD docs.

Sean Kelly

unread,
Aug 29, 2005, 11:24:34 AM8/29/05
to
Joe Seigh wrote:
>
> This being the case, I don't think anyone at Intel would find themselves bound to
> an unofficial memory model from the 90's when the official one would yield
> significantly better performance.

I don't know that they have a choice. If they changed the
implementation in a way that effected a visible change in program
behavior it would break a massive amount of code. I assume that were
something like this to happen it would be via an optional processing
model.


Sean

Sean Kelly

unread,
Aug 29, 2005, 11:33:06 AM8/29/05
to
Joe Seigh wrote:
>
> You may be thinking that doesn't matter because processor consistency
> gives you release and acquire semantics between two processors. But
> what if you have 3 (or more) processors? For example, processor A
> initializes an object and stores its address in X. Processor B reads X. So
> far, so good. All the stores by A are read in proper order by B.
> Now processor B stores the object's address in Y and processor C reads Y.
> Now there's a problem. There's no guarantee that processor C will see
> the writes by A in proper order (i.e. relative to the read of Y). So
> processor consistency doesn't give you acquire and release as they are
> commonly understood.

So assuming this were the case, how would memory ordering be achieved
on Intel/AMD? The instruction set has precious few (i.e. no)
instructions to achieve this.

> And you *really* do need to read the AMD docs.

I've got them but have been preferring the Intel docs as they're a bit
more readable. I assume this is in the section on the memory model?


Sean

Joe Seigh

unread,
Aug 29, 2005, 12:08:25 PM8/29/05
to
Sean Kelly wrote:
> Joe Seigh wrote:
>
>>You may be thinking that doesn't matter because processor consistency
>>gives you release and acquire semantics between two processors. But
>>what if you have 3 (or more) processors? For example, processor A
>>initializes an object and stores its address in X. Processor B reads X. So
>>far, so good. All the stores by A are read in proper order by B.
>>Now processor B stores the object's address in Y and processor C reads Y.
>>Now there's a problem. There's no guarantee that processor C will see
>>the writes by A in proper order (i.e. relative to the read of Y). So
>>processor consistency doesn't give you acquire and release as they are
>>commonly understood.
>
>
> So assuming this were the case, how would memory ordering be achieved
> on Intel/AMD? The instruction set has precious few (i.e. no)
> instructions to achieve this.

Any of the serializing instructions, e.g. cpuid, lock, etc.... Linux
uses a dummy XCHG against the stack (implied LOCK). The xFENCE instructions
are offered as being more efficient since all they basically do is serialize
and not do something else as well.
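
For example, the dummy-XCHG idiom could be wrapped like this (a GCC inline
asm sketch of the idea, not Linux's actual source):

/* Full memory barrier via XCHG's implicit LOCK; works on any
   IA-32 processor, no SSE2 required. */
static inline void full_barrier(void)
{
    int tmp = 0;
    __asm__ __volatile__ ("xchgl %%eax, %0"
                          : "+m" (tmp)
                          :
                          : "eax", "memory");
}

The point is the implicit LOCK; a plain store to tmp would order nothing.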


>
>
>>And you *really* do need to read the AMD docs.
>
>
> I've got them but have been preferring the Intel docs as they're a bit
> more readable. I assume this is in the section on the memory model?
>

Yes, in "AMD64 Architecture Programmer’s Manual Volume 2: System Programming",
chapter 7.

"Out-of-order reads are allowed. Out-of-order reads can occur
as a result of out-of-order instruction execution or
speculative execution. The processor can read memory
out-of-order to allow out-of-order execution to proceed."

Seems pretty clear IMO. They don't mention it as far as I can see offhand,
but out-of-order execution would also allow stores to occur before logically
previous reads in some cases, so MFENCE might be safer for release semantics
than SFENCE.
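
To spell the concern out (GCC inline asm sketch; data and flag are
hypothetical shared variables):

extern volatile int data, flag;

/* sfence orders only store-store; a logically earlier *load*
   could still be reordered past the store to flag. */
void publish_sfence(int v)
{
    data = v;
    __asm__ __volatile__ ("sfence" ::: "memory");
    flag = 1;
}

/* mfence orders loads and stores, covering that case too. */
void publish_mfence(int v)
{
    data = v;
    __asm__ __volatile__ ("mfence" ::: "memory");
    flag = 1;
}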

If there's a question of whether important apps are using a stricter
de facto memory model, AMD probably knows the answer already.

Peter Dimov

unread,
Aug 29, 2005, 12:21:53 PM8/29/05
to
Joe Seigh wrote:

> If you had a memory model that states reads were in order (a la SPARC TSO)
> and under the covers allowed out-of-order speculative reads, ...

That's the x86 memory model.

> the reads would have to be invalidated if *any* memory accesses by *any*
> processor were observed on the system bus before any prior reads completed.

I don't see why. Only stores affecting the speculatively carried-out loads
will invalidate them. If CPU #1 reads X, Y and Z out of order and CPU #2
writes to W, there is no reason to discard the values of X, Y and Z.

Peter Dimov

unread,
Aug 29, 2005, 12:32:18 PM8/29/05
to
Joe Seigh wrote:

> You may be thinking that doesn't matter because processor consistency
> gives you release and acquire semantics between two processors. But
> what if you have 3 (or more) processors? For example, processor A
> initializes an object and stores its address in X. Processor B reads X. So
> far, so good. All the stores by A are read in proper order by B.
> Now processor B stores the object's address in Y and processor C reads Y.
> Now there's a problem. There's no guarantee that processor C will see
> the writes by A in proper order (i.e. relative to the read of Y). So
> processor consistency doesn't give you acquire and release as they are
> commonly understood.

Processor consistency behaves "as if" there is system memory (cache is
transparent) and CPUs have store queues that can satisfy their own
loads.

So CPUs #3..#N will see the same sequence if CPUs #1 and #2 perform
stores.

Discrepancies in store order only appear because CPU #1 can observe its
own stores early.

In your example, C will see the stores by A and B in order.

David Hopwood

unread,
Aug 29, 2005, 12:35:19 PM8/29/05
to
Alexander Terekhov wrote:
> Joe Seigh wrote:
> [...]
>
>>The official memory model states otherwise so you probably shouldn't
>>use the term "memory model" that way. You mean load ordering as
>>implemented. It's not clear whether you're talking about processor
>>order or memory order here.
>
> In Intel speak, "memory order[ing]" is "the order in which the
> processor issues reads (loads) and writes (stores) through the system
> bus to system memory." (7.2. MEMORY ORDERING).

*Oh*. Now I understand why their documentation doesn't make sense.
They're taking an aspect of the implementation and calling it a memory
model.

> Weakened memory and SSE
> stuff aside for a moment, the *memory model* is processor consistency
> (aka processor ordering in Intel speak) with all loads having acquire
> semantics. SPO (Speculative Processor Ordering) implementation doesn't
> break it. End of story.

Is then <http://gee.cs.oswego.edu/dl/jmm/cookbook.html> incorrect when
it says that SPO requires lfence to implement a LoadLoad barrier?

--
David Hopwood <david.nosp...@blueyonder.co.uk>

Alexander Terekhov

unread,
Aug 29, 2005, 12:43:38 PM8/29/05
to

Joe Seigh wrote:
>
> Sean Kelly wrote:
> > Joe Seigh wrote:
> >
> >>You may be thinking that doesn't matter because processor consistency
> >>gives you release and acquire semantics between two processors. But
> >>what if you have 3 (or more) processors? For example, processor A
> >>initializes an object and stores its address in X. Processor B reads X. So
> >>far, so good. All the stores by A are read in proper order by B.
> >>Now processor B stores the object's address in Y and processor C reads Y.
> >>Now there's a problem. There's no guarantee that processor C will see
> >>the writes by A in proper order (i.e. relative to the read of Y). So
> >>processor consistency doesn't give you acquire and release as they are
> >>commonly understood.
> >
> >
> > So assuming this were the case, how would memory ordering be achieved
> > on Intel/AMD? The instruction set has precious few (i.e. no)
> > instructions to achieve this.
>
> Any of the serializing instructions, e.g. cpuid, lock, etc.... Linux
> uses a dummy XCHG against the stack (implied LOCK). The xFENCE instructions
> are offered as being more efficient since all they basically do is serialize
> and not do something else as well.
  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Serialize what? (cpuid aside for a moment.) Chapter and verse please.

Please show some pseudo code with *FENCE that you think is needed to
ensure visibility of data published by A in X to C after republishing
it in Y by B.

I think you've been googling too much. ;-)

regards,
alexander.

Alexander Terekhov

unread,
Aug 29, 2005, 12:51:05 PM8/29/05
to

David Hopwood wrote:
[...]

> Is then <http://gee.cs.oswego.edu/dl/jmm/cookbook.html> incorrect when
> it says that SPO requires lfence to implement a LoadLoad barrier?

Ha!

< Forward Inline >

-------- Original Message --------
Message-ID: <e52efbe105062...@mail.gmail.com>
Date: Tue, 21 Jun 2005 18:24:31 +0200
From: Alexander Terekhov <snip>
To: Doug Lea <snip>
Subject: "Speculative Processor Ordering"

G'Day,

http://gee.cs.oswego.edu/dl/jmm/cookbook.html

----
x86-SPO
Allegedly upcoming Intel x86, AMD Opteron, and possibly others. Intel
calls consistency properties for these "Speculative Processor
Ordering" (SPO). (As of this writing, no existing x86 or x86-64
processors are known to be SPO. All are PO.) See above, plus AMD
x86-64 Architecture Programmer's Manual Volume 2: System Programming
----

I allege that with respect to ordinary (not-weakened, non-SSE-stuff, etc.)
memory visibility, x86-SPO is nothing but classic processor consistency
(same as x86-PO).

The (ugly) specs simply try to explain implementation behavior to the
observer sitting on the "System Bus", not the memory model (which is
processor consistency) as seen by the program (i.e. apart from the
activities seen on that "System Bus" thingy).

IOW, "LoadLoad" is implied (as far as the JMM is concerned) on both x86-PO
and x86-SPO, and lfence can be safely omitted as long as you don't care what
happens/can be observed on the "System Bus".

Now please prove me wrong. ;-)

TIA.

regards,
alexander.

Joe Seigh

unread,
Aug 29, 2005, 12:58:52 PM8/29/05
to
Peter Dimov wrote:

> Joe Seigh wrote:
>
>
>>the reads would have to be invalidated if *any* memory accesses by *any*
>>processor were observed on the system bus before any prior reads completed.
>
>
> I don't see why. Only stores affecting the speculatively carried loads
> will invalidate. If CPU #1 reads X, Y and Z out of order and CPU #2
> writes to W, there is no reason to discard the values of X, Y and Z.
>
Possibly. I was thinking of mixes of loads and stores but that would
probably require explicit load/store and store/load membars, which may
moot the whole issue. I wouldn't say it was safe without giving it
a bit of thought first.

Joe Seigh

unread,
Aug 29, 2005, 1:08:24 PM8/29/05
to
I googled per Alexander's suggestion
http://www.cs.nmsu.edu/~pfeiffer/classes/573/notes/consistency.html
which defines it thus:

"Writes done by a single processor are received by all other processors
in the order in which they were issued, but writes from different
processors may be seen in a different order by different processors."

Torsten Robitzki

unread,
Aug 29, 2005, 1:12:24 PM8/29/05
to
Scott Meyers wrote:

<snip>

> What confuses me is that here only the writer needs to take an action to
> guarantee that readers will see things in the proper order, where my
> understanding had been that the reader, too, would have to take some action
> before reading iValue and fValueHasBeenComputed to ensure that it didn't
> get a stale value for one or both.
>
> Obviously I'm missing something. Can somebody please clear up my
> confusion?

It really depends on the architecture and the compiler/threading library.
If you have an architecture that only reorders writes, you have to emit a
membar() between the write to the data and the write to the flag.
Otherwise other CPUs will not be able to see the changes to the data
and flag in the given order. The memory barrier will tell the hardware
not to reorder writes to memory across this membar.

In this case it will be sufficient to make sure that the compiler won't
reorder the reads on the reading side, which might be given as one of
the variables involved being volatile qualified.

If the hardware reorders reads and writes, one has to make sure that
reads aren't reordered by the hardware too.
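
For example (writer side of the snippet under discussion;
membar_store_store() is a hypothetical primitive):

void CacheComputedValueWithMembar(void)
{
    if (!fValueHasBeenComputed)
    {
        iValue = ComputeValue();
        membar_store_store();   /* data must become visible before flag */
        fValueHasBeenComputed = TRUE;
    }
}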

regards
Torsten

Joe Seigh

unread,
Aug 29, 2005, 1:13:35 PM8/29/05
to
David Hopwood wrote:
> Alexander Terekhov wrote:
>
>> Joe Seigh wrote:
>> [...]
>>
>>> The official memory model states otherwise so you probably shouldn't
>>> use the term "memory model" that way. You mean load ordering as
>>> implemented. It's not clear whether you're talking about processor
>>> order or memory order here.
>>
>>
>> In Intel speak, "memory order[ing]" is "the order in which the
>> processor issues reads (loads) and writes (stores) through the system
>> bus to system memory." (7.2. MEMORY ORDERING).
>
>
> *Oh*. Now I understand why their documentation doesn't make sense.
> They're taking an aspect of the implementation and calling it a memory
> model.
>

If I understand it correctly, the speculative loads aren't observable by
programs using normal memory. You'd need a scope on the system bus to
notice them. If loads are indeed in order, then they shouldn't have
used the words "speculative" or "out-of-order" w.r.t. loads.

Joe Seigh

unread,
Aug 29, 2005, 1:26:59 PM8/29/05
to
Alexander Terekhov wrote:
> Joe Seigh wrote:
>
>>Sean Kelly wrote:
>>
>>>Joe Seigh wrote:
>>
>>Any of the serializing instructions, e.g. cpuid, lock, etc.... Linux
>>uses a dummy XCHG against the stack (implied LOCK). The xFENCE instructions
>>are offered as being more efficient since all they basically do is serialize
>>and not do something else as well.
  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
>
>
> Serialize what? (cpuid aside for a moment.) Chapter and verse please.

"7.4. SERIALIZING INSTRUCTIONS
The IA-32 architecture defines several serializing instructions. These instructions force the
processor to complete all modifications to flags, registers, and memory by previous instructions
and to drain all buffered writes to memory before the next instruction is fetched and executed.
...
The following instructions are serializing instructions:
• Privileged serializing instructions—MOV (to control register),MOV (to debug register),
WRMSR, INVD, INVLPG, WBINVD, LGDT, LLDT, LIDT, and LTR.
• Non-privileged serializing instructions—CPUID, IRET, and RSM.
• Non-privileged memory ordering instructions—SFENCE, LFENCE, and MFENCE."

LOCK isn't serializing apparently, just ordering, but I think there's an
unresolved issue with fetches moving across LOCK.
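
For completeness, CPUID as a serializing barrier could be wrapped like this
(GCC inline asm sketch; note the clobbered registers):

/* Serializing barrier for processors that predate the *FENCE
   instructions. CPUID with EAX=0 is non-privileged. */
static inline void serialize(void)
{
    int eax = 0;
    __asm__ __volatile__ ("cpuid"
                          : "+a" (eax)
                          :
                          : "ebx", "ecx", "edx", "memory");
}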

>
> Please show some pseudo code with *FENCE that you think is needed to
> ensure visibility of data published by A in X to C after republishing
> it in Y by B.

I already proved processor consistency doesn't work the way you think
it does. QED.

Alexander Terekhov

unread,
Aug 29, 2005, 2:05:54 PM8/29/05
to

Joe Seigh wrote:
[...]

> I already proved processor consistency doesn't work the way you think
> it does.

http://research.compaq.com/wrl/people/kourosh/papers/1990-rc-isca.pdf
http://research.compaq.com/wrl/people/kourosh/papers/1993-tr-68.pdf

> QED.

QED. ;-)

regards,
alexander.

Alexander Terekhov

unread,
Aug 29, 2005, 2:20:09 PM8/29/05
to

Torsten Robitzki wrote:
[...]

> In this case it will be sufficient to make sure that the compiler won't
> reorder the reads on the reading side, which might be given as one of
> the variables involved being volatile qualified.
  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

What makes you think so?

regards,
alexander.

Joe Seigh

unread,
Aug 29, 2005, 2:24:28 PM8/29/05
to
Alexander Terekhov wrote:
> Joe Seigh wrote:
> [...]
>
>>I already proved processor consistency doesn't work the way you think
>>it does.
>
>
> http://research.compaq.com/wrl/people/kourosh/papers/1990-rc-isca.pdf
> http://research.compaq.com/wrl/people/kourosh/papers/1993-tr-68.pdf

Same thing.
I do not think this "processor consistency" means what you think it means.
>
>
>>QED.
>
>
> QED. ;-)

I do not think this QED means what you think it means.

Alexander Terekhov

unread,
Aug 29, 2005, 2:23:05 PM8/29/05
to

I mean: what makes you think that the compiler won't reorder when "one of
the variables involved being volatile qualified"?

regards,
alexander.

Alexander Terekhov

unread,
Aug 29, 2005, 2:35:41 PM8/29/05
to

Joe Seigh wrote:
>
> Alexander Terekhov wrote:
> > Joe Seigh wrote:
> > [...]
> >
> >>I already proved processor consistency doesn't work the way you think
> >>it does.
> >
> >
> > http://research.compaq.com/wrl/people/kourosh/papers/1990-rc-isca.pdf
> > http://research.compaq.com/wrl/people/kourosh/papers/1993-tr-68.pdf
>
> Same thing.
> I do not think this "processor consistency" means what you think it means.

Show me your proof for this processor consistency.

regards,
alexander.

Peter Dimov

unread,
Aug 29, 2005, 2:41:02 PM8/29/05
to
Joe Seigh wrote:
> Peter Dimov wrote:

> > Processor consistency behaves "as if" there is system memory (cache is
> > transparent) and CPUs have store queues that can satisfy their own
> > loads.
> >
> > So CPUs #3..#N will see the same sequence if CPUs #1 and #2 perform
> > stores.
> >
> > Discrepancies in store order only appear because CPU #1 can observe its
> > own stores early.
> >
> > In your example, C will see the stores by A and B in order.
> >
> I googled per Alexander's suggestion
> http://www.cs.nmsu.edu/~pfeiffer/classes/573/notes/consistency.html
> which defines it thus:
>
> "Writes done by a single processor are received by all other processors
> in the order in which they were issued, but writes from different
> processors may be seen in a different order by different processors."

Well, the x86 model (and SPARC TSO, IIUC) is stronger than that. I
don't know of an architecture that implements the above kind of
processor consistency.

x86 is the best compromise between programmability and performance. I
doubt that it will go away or be replaced by a weaker model. All weaker
models were killed by it.

Under the x86 model, if A writes X and B writes Y, processors C-F will
all see the same sequence of writes, X,Y (without loss of generality).
However B can observe Y,X. So the above definition does describe x86,
just not in its entirety.

Alexander Terekhov

unread,
Aug 29, 2005, 2:46:48 PM8/29/05
to

Peter Dimov wrote:
[...]

> x86 is the best compromise between programmability and performance. I
> doubt that it will go away or be replaced by a weaker model. All weaker
> models were killed by it.
  ^^^^^^^^^^^^^^^^^^^^^^^^^

Well, it's too early to consider... uhmm, for example, CELL to be
killed by x86. No?

regards,
alexander.

Joe Seigh

unread,
Aug 29, 2005, 2:57:02 PM8/29/05
to

I already did. The store by processor B into Y is *after* the stores
into the object by processor A, since they preceded the store into
X by processor A. But since the stores into the object are by processor
A and the store into Y is by processor B, different processors, you can't
infer that you'll read them in the order they were written, as *explicitly*
stated by the definition of processor consistency.

Normal memory barriers enforce the order of accesses to memory, so it doesn't
matter if different processors do the stores as long as you have
memory barriers in the right places.
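
Concretely, continuing the C sketch from my earlier post (membar_acquire()
and membar_release() are hypothetical primitives, assumed here to order
accesses to memory globally, not just per-processor):

void processor_A(void)
{
    obj.data = 42;
    membar_release();         /* order the data before the publication */
    X = &obj;
}

void processor_B(void)
{
    object_t *p;
    while ((p = X) == 0) ;
    membar_acquire();         /* see A's stores... */
    membar_release();         /* ...and order them before republication */
    Y = p;
}

void processor_C(void)
{
    object_t *p;
    while ((p = Y) == 0) ;
    membar_acquire();
    /* A's stores are now visible: p->data == 42 */
}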

Joe Seigh

unread,
Aug 29, 2005, 3:04:06 PM8/29/05
to
Peter Dimov wrote:
> Joe Seigh wrote:
>>I googled per Alexander's suggestion
>>http://www.cs.nmsu.edu/~pfeiffer/classes/573/notes/consistency.html
>>which defined it as thus
>>
>> "Writes done by a single processor are received by all other processors
>> in the order in which they were issued, but writes from different
>> processors may be seen in a different order by different processors."
>
>
> Well, the x86 model (and SPARC TSO, IIUC) is stronger than that. I
> don't know of an architecture that implements the above kind of
> processor consistency.
>
> x86 is the best compromise between programmability and performance. I
> doubt that it will go away or be replaced by a weaker model. All weaker
> models were killed by it.

Unless I'm implementing a synchronization primitive, I usually use the
synchronization primitives, which guarantee correctness no matter what
the memory model (within reason, i.e. you can port to it). I don't
think anyone thinks Posix is all that bad (as long as you don't compare
it to lock-free. :) ).

Alexander Terekhov

unread,
Aug 29, 2005, 3:09:59 PM8/29/05
to

Joe Seigh wrote:
[...]

> >>>http://research.compaq.com/wrl/people/kourosh/papers/1990-rc-isca.pdf
> >>>http://research.compaq.com/wrl/people/kourosh/papers/1993-tr-68.pdf
> >>
> >>Same thing.
> >>I do not think this "processor consistency" means what you think it means.
> >
> >
> > Show me your proof for this processor consistency.
> >
>
> I already did. The store by processor B into Y is *after* the stores
> into the object by processor A, since they preceded the store into
> X by processor A. But since the stores into the object are by processor
> A and the store into Y is by processor B, different processors, you can't
> infer that you'll read them in the order they were written, as *explicitly*
> stated by the definition of processor consistency.

You mean as explicitly stated by definition in "2.2 Processor Consistency"
(1990-rc-isca.pdf) subject to "Extension to Dubois’ Abstraction" (1993-tr-
68.pdf) "Performing a Memory Request"?

How so?

regards,
alexander.

Torsten Robitzki

unread,
Aug 29, 2005, 3:30:47 PM8/29/05
to
Hi Alexander,

Alexander Terekhov wrote:

Because the cited Microsoft documentation showed some snippet of code
they claimed to work. Only one of the involved variables was volatile
qualified. It might be that Microsoft compilers don't reorder _any_
access around a volatile variable.

The OP asked if, in general, a reader thread has to take any special
actions to read a sequence of changes in the same order as written in a
relaxed memory model.

I've constructed a case where it will not be necessary, and maybe Alpha
might be such a case.

regards
Torsten

Joe Seigh

unread,
Aug 29, 2005, 3:38:00 PM8/29/05
to
Yes. That doesn't affect the definition of processor consistency. It's defining
what "Performing a Memory Request" means, specifically with respect to a
single memory location. It doesn't affect stores by different processors to
different memory locations. You want sequential consistency, I think.