
Intel x86 memory model question


Joe Seigh

Aug 30, 2005, 8:21:27 AM
The question isn't what the x86 memory model is. If you
want to discuss that, you are welcome to join the fray on
c.p.t. The question is why can't, or why doesn't, Intel
want to document the x86 memory model, since apparently
what is in the System Programming Guide is *not* the
memory model. I.e., not as far as program-observable
behavior is concerned, though it may be if you have
tracing scopes attached to the memory bus.

Is this some kind of Intel State Secret? Is writing
correct multi-threaded programs not in Intel's interest?

--
Joe Seigh

When you get lemons, you make lemonade.
When you get hardware, you make software.

Mitch...@aol.com

Aug 30, 2005, 3:55:03 PM
I didn't find it in the Intel book I have (Pentium Pro).

But chapter 7 in Volume 2 of AMD x86-64 Architecture Programmer's
Manual (System Programming) describes AMD's side of the situation,
starting on page 191 of the Purple Volume.

The problem is that, when you consider the number of memory types {UC, CD,
WC, WP, WT and WB}, no simplistic statement can fully address what
the programmer can assume about memory and its ordering properties.
WriteBack (cacheable) memory is, however, Processor Consistent.

Joe Seigh

Aug 30, 2005, 5:09:23 PM

The argument being presented in c.p.t. is that processor consistency
implies loads are in order, perhaps instigated by something Andy Glew
said about this here
http://groups.google.com/group/comp.arch/msg/96ec4a9fb75389a2

AFAICT, this is not true for 3 or more processors. E.g.

processor 1 stores into X
processor 2 sees the store by 1 into X and stores into Y

So, by causal reasoning, the store into Y occurred after the store into X.

processor 3 loads from Y
processor 3 loads from X

If loads were in order you could infer that if processor 3
sees the new value of Y then it will see the new value of X.
But the rules for processor consistency *clearly* state that
you will necessarily see stores by different processors in
order.

While there are still ordering constraints on the loads, they
don't have to be strictly in order, as Andy incorrectly infers.

Joe Seigh

Aug 30, 2005, 5:41:34 PM
Joe Seigh wrote:
>
> processor 1 stores into X
> processor 2 sees the store by 1 into X and stores into Y
>
> So, by causal reasoning, the store into Y occurred after the store into X.
>
> processor 3 loads from Y
> processor 3 loads from X
>
> If loads were in order you could infer that if processor 3
> sees the new value of Y then it will see the new value of X.
> But the rules for processor consistency *clearly* state that
> you will necessarily see stores by different processors in
> order.
That should be:

But the rules for processor consistency *clearly* state that
you will not necessarily see stores by different processors in
order.


already...@yahoo.com

Aug 30, 2005, 6:42:20 PM

Joe Seigh wrote:
> The question isn't what is the x86 memory model. If you
> want to discuss that, you are welcome to join the fray on
> c.p.t. The question is why can't or why doesn't Intel
> want to document the x86 memory model since apparently
> what is in the System Programming Guide is *not* the
> memory model. I.e. not as far as program observable
> behavior is concerned though it may be if you have
> tracing scopes attached to the memory bus.
>

I don't understand what's particularly wrong with paragraph 7.2.2
ftp://download.intel.com/design/Pentium4/manuals/25366816.pdf
Could you be a bit more specific?

> Is this some kind of Intel State Secret? Is writing
> correct multi-threaded programs not in Intel's interest?
>

Obviously, writing correct multi-threaded SMP programs is in Intel's
interest. However, according to my understanding, Intel couldn't care
less about _lockless_ multi-threaded SMP programs. The reasons are
clear:
1. That's such a tiny niche!
2. The average programmer can't do it correctly regardless of the quality
of the documentation.

Joe Seigh

Aug 30, 2005, 7:25:17 PM
already...@yahoo.com wrote:
> Joe Seigh wrote:
>
>>The question isn't what is the x86 memory model. If you
>>want to discuss that, you are welcome to join the fray on
>>c.p.t. The question is why can't or why doesn't Intel
>>want to document the x86 memory model since apparently
>>what is in the System Programming Guide is *not* the
>>memory model. I.e. not as far as program observable
>>behavior is concerned though it may be if you have
>>tracing scopes attached to the memory bus.
>>
>
>
> I don't understand what's particularly wrong with paragraph 7.2.2
> ftp://download.intel.com/design/Pentium4/manuals/25366816.pdf
> Could you be a bit more specific.

Some people are interpreting processor consistency as implying
that reads are in order, and the statement
  1. Reads can be carried out speculatively and in any order.
as applying only to speculative reads (the commit criterion being
in order at time of commit).


>
>
>>Is this some kind of Intel State Secret? Is writing
>>correct multi-threaded programs not in Intel's interest?
>>
>
>
> Obviously, writing correct multi-threaded SMP programs is in Intel's
> interest. However, according to my understanding, Intel couldn't care
> less about _lockless_ multi-threaded SMP programs. The reasons are
> clear:
> 1. That's such a tiny niche!
> 2. Average programmer can't do it correctly regardless of the quality
> of documentation.
>

You package it as part of a (hopefully) easy-to-use API such as a
synchronized queue (which can use locks or be lock-free in the
implementation).
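
As a sketch of what such an API could look like - C with POSIX threads,
and all names here illustrative rather than taken from any real library:

#include <pthread.h>
#include <stddef.h>

struct node { struct node *next; void *data; };

struct queue {
    pthread_mutex_t lock;      /* could be replaced by lock-free logic */
    struct node *head, *tail;
};

#define QUEUE_INIT { PTHREAD_MUTEX_INITIALIZER, NULL, NULL }

/* Callers never see the synchronization, so the locked implementation
   below could later be swapped for a lock-free one without changing
   the interface. */
void queue_push(struct queue *q, struct node *n)
{
    n->next = NULL;
    pthread_mutex_lock(&q->lock);
    if (q->tail) q->tail->next = n; else q->head = n;
    q->tail = n;
    pthread_mutex_unlock(&q->lock);
}

struct node *queue_pop(struct queue *q)   /* NULL if the queue is empty */
{
    struct node *n;
    pthread_mutex_lock(&q->lock);
    n = q->head;
    if (n) { q->head = n->next; if (!q->head) q->tail = NULL; }
    pthread_mutex_unlock(&q->lock);
    return n;
}

A caller just declares "struct queue q = QUEUE_INIT;" and calls
queue_push/queue_pop; whether the inside uses a mutex or LOCK CMPXCHG
is invisible to it.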

Eric P.

Aug 31, 2005, 11:06:35 AM
Joe Seigh wrote:
>
> Joe Seigh wrote:
> >
> > processor 1 stores into X
> > processor 2 sees the store by 1 into X and stores into Y
> >
> > So, by causal reasoning, the store into Y occurred after the store into X.
> >
> > processor 3 loads from Y
> > processor 3 loads from X
> >
> > If loads were in order you could infer that if processor 3
> > sees the new value of Y then it will see the new value of X.
> > But the rules for processor consistency *clearly* state that
> > you will necessarily see stores by different processors in
> > order.
> that should be
>
> But the rules for processor consistency *clearly* state that
> you will not necessarily see stores by different processors in
> order.

I see what you are getting at, but for this to occur the new value
of Y would have to arrive at P3 before the new value of X from P1,
implying the msg from P2 to P3 somehow passed the msg from P1 to P3.
This would mean that no update order at all could be concluded
and the whole system would break.

Since they clearly do function, this is obviously not how they work :-)

Eric

Joe Seigh

Aug 31, 2005, 12:29:19 PM

It turns out the x86 memory model is defined; it's just not defined in the
IA-32 manuals, which is where you would expect it to be defined. It's defined
in the Itanium manuals and is equivalent to the Sparc TSO memory model.

2.1.2 Loads and Stores
In the Itanium architecture, a load instruction has either unordered or acquire semantics while a
store instruction has either unordered or release semantics. By using acquire loads (ld.acq) and
release stores (st.rel), the memory reference stream of an Itanium-based program can be made to
operate according to the IA-32 ordering model. The Itanium architecture uses this behavior to
provide IA-32 compatibility. That is, an Itanium acquire load is equivalent to an IA-32 load and an
Itanium release store is equivalent to an IA-32 store, from a memory ordering perspective.

Seongbae Park

Aug 31, 2005, 12:46:14 PM
Joe Seigh <jsei...@xemaps.com> wrote:
...

> It turns out the x86 memory model is defined, it's just not defined in the
> IA-32 manuals which is where you would expect it to be defined. It's defined
> in the Itanium manuals and is equivalent to Sparc TSO memory model.
>
> 2.1.2 Loads and Stores
> In the Itanium architecture, a load instruction has either unordered or acquire semantics while a
> store instruction has either unordered or release semantics. By using acquire loads (ld.acq) and
> release stores (st.rel), the memory reference stream of an Itanium-based program can be made to
> operate according to the IA-32 ordering model. The Itanium architecture uses this behavior to
> provide IA-32 compatibility. That is, an Itanium acquire load is equivalent to an IA-32 load and an
> Itanium release store is equivalent to an IA-32 store, from a memory ordering perspective.

I suspect the above paragraph is stronger than what it really wanted to say.
It seems that the intention was to say
that Itanium can correctly emulate x86 by running effectively in a TSO mode,
since x86's memory model is not stronger than TSO.

On http://blogs.msdn.com/cbrumme/archive/2003/05/17/51445.aspx:
> the memory model for X86 can be described as:
> 1. All stores are actually store.release.
> 2. All loads are normal loads.
> 3. Any use of the LOCK prefix (e.g. 'LOCK CMPXCHG' or 'LOCK INC') creates a full fence.
--
#pragma ident "Seongbae Park, compiler, http://blogs.sun.com/seongbae/"

Joe Seigh

Aug 31, 2005, 3:45:06 PM
Seongbae Park wrote:
> Joe Seigh <jsei...@xemaps.com> wrote:
> ...
>
>>It turns out the x86 memory model is defined, it's just not defined in the
>>IA-32 manuals which is where you would expect it to be defined. It's defined
>>in the Itanium manuals and is equivalent to Sparc TSO memory model.
>>
>> 2.1.2 Loads and Stores
>> In the Itanium architecture, a load instruction has either unordered or acquire semantics while a
>> store instruction has either unordered or release semantics. By using acquire loads (ld.acq) and
>> release stores (st.rel), the memory reference stream of an Itanium-based program can be made to
>> operate according to the IA-32 ordering model. The Itanium architecture uses this behavior to
>> provide IA-32 compatibility. That is, an Itanium acquire load is equivalent to an IA-32 load and an
>> Itanium release store is equivalent to an IA-32 store, from a memory ordering perspective.
>
>
> I suspect the above paragraph is stronger than what it really wanted to say.
> It seems that the intention was to say
> that Itanium can correctly emulate x86 by running effectively in a TSO mode,
> since x86's memory model is not stronger than TSO.
>

Hmm, that's possible. Even if you take IA-32's loads as being unordered, they're
not entirely unordered, due to the processor consistency model. It's likely that
nobody uses processor consistency as a programming memory model, but since Intel
specified it as part of the memory model, they have to adhere to it for
compatibility reasons. Is this the reason Itanium runs so slow in IA-32 mode?
Because it has to use ld.acq instead of ld for IA-32 loads? All because they
used a memory model that was more convenient for hardware architects than for
programmers?

Seongbae Park

Aug 31, 2005, 5:57:58 PM
Seongbae Park <Seongb...@Sun.COM> wrote:
> Joe Seigh <jsei...@xemaps.com> wrote:
> ...
>> It turns out the x86 memory model is defined, it's just not defined in the
>> IA-32 manuals which is where you would expect it to be defined. It's defined
>> in the Itanium manuals and is equivalent to Sparc TSO memory model.
>>
>> 2.1.2 Loads and Stores
>> In the Itanium architecture, a load instruction has either unordered or acquire semantics while a
>> store instruction has either unordered or release semantics. By using acquire loads (ld.acq) and
>> release stores (st.rel), the memory reference stream of an Itanium-based program can be made to
>> operate according to the IA-32 ordering model. The Itanium architecture uses this behavior to
>> provide IA-32 compatibility. That is, an Itanium acquire load is equivalent to an IA-32 load and an
>> Itanium release store is equivalent to an IA-32 store, from a memory ordering perspective.
>
> I suspect the above paragraph is stronger than what it really wanted to say.
> It seems that the intention was to say
> that Itanium can correctly emulate x86 by running effectively in a TSO mode,
> since x86's memory model is not stronger than TSO.

I take this back.
Actually the above statement depends on whether IA64 is RCsc or RCpc.
If it is RCpc, then by definition all special accesses are PC in RCpc,
and turning all accesses into special accesses just turns it into PC.
If it is RCsc, then it is not really TSO but SC, which is stronger than PC
and hence can run the program correctly.

I didn't bother to look at the IA64 manual - anybody care to comment on this? -
but I suspect that IA64 is RCpc and the manual is exactly correct after all.

Eric P.

Aug 31, 2005, 6:02:34 PM

I think the underlying question you asked about the x86 is:

Does the Intel Processor Consistency model require processors
to wait for all other processors to acknowledge receipt of their
invalidates before any are allowed to use the new value?

The section 7.2.2 memory ordering info does not define an answer.

This would likely depend on the bus protocol details.
It might be implemented by having P1 send an invalidate X to P2
and not reply to a request from P2 for a read of the new value of
X until it had received the invalidate acknowledgment from P3.

I haven't paid any attention to the IA64 acquire/release mechanism
as I figure I'll never run into it, so I'm not sure if that is
the same as a release.

Eric

Ricardo Bugalho

Sep 1, 2005, 6:22:44 AM
On Wed, 31 Aug 2005 21:57:58 +0000, Seongbae Park wrote:

> I didn't bother to look at IA64 manual - anybody care to comment on this ?
> but I suspect that IA64 is RCpc and the manual is exactly correct after
> all.

It's RCpc indeed.

Ricardo Bugalho

Sep 1, 2005, 6:36:45 AM
On Wed, 31 Aug 2005 18:02:34 -0400, Eric P. wrote:


>
> I think the underlying question you asked about the x86 is:
>
> Does the Intel Processor Consistency model require processors to wait
> for all other processors to acknowledge receipt of their invalidates
> before any are allowed to use the new value?
>

It does not.
The most straightforward example is buffered store forwarding: when a CPU
writes a value into memory, it can read it again directly from the store
buffer, even before it tries to make it visible to other processors.
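
In C, the standard illustration looks like this (a sketch; initially
X == Y == 0, p1 and p2 run on different processors, and volatile is
only meant to keep the compiler from reordering):

/* Store-buffer forwarding litmus test. The outcome
   r1 == 1, r3 == 1, r2 == 0, r4 == 0 is possible: each processor
   reads its own store early, out of its own store buffer, before
   that store has become visible to the other processor. */
volatile int X = 0, Y = 0;
int r1, r2, r3, r4;

void p1(void) { X = 1; r1 = X; r2 = Y; }
void p2(void) { Y = 1; r3 = Y; r4 = X; }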

Alexander Terekhov

Sep 1, 2005, 7:23:29 AM

Not quite. Release stores to *WB* memory are constrained to ensure
"remote write atomicity". Classic RCpc is weaker in this respect
(and that's what makes RC != TSO). You better not rely on this
property because emulating it on CELLs (for example) will make your
ports run really slow. ;-)

regards,
alexander.

Alexander Terekhov

Sep 1, 2005, 7:25:30 AM
Err..

Alexander Terekhov wrote:
>
> Ricardo Bugalho wrote:
> >
> > On Wed, 31 Aug 2005 21:57:58 +0000, Seongbae Park wrote:
> >
> > > I didn't bother to look at IA64 manual - anybody care to comment on this ?
> > > but I suspect that IA64 is RCpc and the manual is exactly correct after
> > > all.
> >
> > It's RCpc indeed.
>
> Not quite. Release stores to *WB* memory are constrained to ensure
> "remote write atomicity". Classic RCpc is weaker in this respect
> (and that's what makes RC != TSO). You better not rely on this

(The "RC" in "RC != TSO" above should read "PC" - PC, not RC.)

Joe Seigh

Sep 1, 2005, 7:37:54 AM

So what does "manual is exactly correct" mean in this case? Are
IA-32 loads equivalent to IA64 ld.acq, and not equivalent
to IA64 ld? I.e., the latter can't emulate an IA-32 load in all cases.

Alexander Terekhov

Sep 1, 2005, 7:49:57 AM

Joe Seigh wrote:
[...]

> Are IA-32 loads equivalent to IA64 ld.acq and they are not equivalent
> to IA64 ld?

The ordering constraints are equivalent for IA32 loads and IA64 acquire
loads. But IA64 release stores to WB memory are more constrained than PC
stores, and IA32-under-IA64 effectively runs in TSO for WB memory, not
PC.

regards,
alexander.

Eric P.

Sep 1, 2005, 12:46:35 PM

I meant with regard to other processors, not to itself.

Within a processor, yes, the docs explicitly state that
data from buffered writes can be forwarded to waiting reads.
As I understand it, while such local forwarding can have consequences
for consistency models, presumably because it allows subsequent
instructions to complete earlier than they otherwise would have,
it should not have an effect on remote data update ordering.

In short, store to load forwarding, in and of itself, would not
allow a new value of Y to arrive at P3 before the new value of X.

For this to occur seems to me to require both of:
(a) the cache protocol to distribute updates in a non atomic manner by
allowing a new value to be available before all acks are received.
(b) the bus topology and protocol to somehow allow a message to get
from P1 to P2 then P2 to P3 passing the one from P1 to P3,
possibly due to an error and retransmit.

Eric

Andy Glew

Sep 2, 2005, 2:51:50 PM

Bottom quoting: asbestos donned!

I think that Joe Seigh has incorrectly assumed that processor
consistency implies (a) a global ordering of all loads, and (b) causal
ordering.

This is not true. At least, I am fairly certain that there is a
causal ordering memory model that is intermediate in semantics between
processor consistency and sequential consistency. (Google finds lots of
papers; I specifically recall Mosberger's survey.) And I do not
believe that I have ever seen a proof that processor consistency
implies a global ordering of all loads; I don't think such a proof
exists; I would be interested to see it if it does; and I strongly
suspect that there is a proof that orderings consistent with processor
consistency may violate causal ordering. Indeed, Joe may have
provided one.

(I do confess that I have occasionally wanted to move from processor
consistency to causal consistency, mainly because causal consistency
sounds like it should be easier to make proofs for; but I am not sure
if causal consistency is any easier to implement than sequential
consistency. Since sequential consistency is easy enough to
implement, I suspect that if we tighten up the memory model we will go
all the way.)

Nearly all statements in processor consistency are local.
For processors Pi, i = ...

Each Pi has a set of instructions Pi.Ij, some of which are loads, some
of which are stores. Notationally Pi.Lj and Pi.Sj, where the index
sets for Lj and Sj are not necessarily contiguous.

Each Pi also sees external stores in some order Pi.Xk.

The sequence of external stores seen by Pi, Pi.Xk, can be formed out
of an interleaving the set of stores from all other processors Pm.Sj,
m!=i. The only real constraint is that in this interleaving all of
the stores from a particular processor Pm.Sj appear in the order in
which they occurred on that processor; stores from a given processor
are not reordered in the sequence.

The sequence of external stores Pi.Xk is not necessarily equal to
Pj.Xk, for different processors i and j. I.e. although stores from
any single processor are performed in order at any other processor,
other processors do not necessarily see stores from different
processors interleaved in the same order. I.e. there is no single
global store order.
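
(A concrete instance: if P1 performs stores A1 then A2, and P2 performs
stores B1 then B2, then P3 may observe the external sequence A1, B1, A2, B2
while P4 observes B1, B2, A1, A2. Both observations respect the
per-processor store orders, yet P3 sees A1 before B1 and P4 sees B1 before
A1, so no single global interleaving is consistent with both.)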

Instruction execution at a single Pi proceeds as if one instruction at
a time were executed, with some interleaving of the external stores
Pi.Xk. I.e. from the point of view of the local processor, its loads
Pi.Lj are performed in order, and in order with the local stores
Pi.Sj. More specifically, there can be constructed an ordering Pi.Mx
which is an interleaving of Pi.Ij (and hence Pi.Lj and Pi.Sj) and
Pi.Xk, and local processor execution is consistent with such an
ordering Pi.Mx.

Note: we say "there can be constructed an ordering". But, so far as I
know, there is no easy way to construct such an ordering for a
particular processor. We know that one could be constructed, but we
don't know what it is. And there is certainly no easy way to construct
it in an online manner.

And, again: there need not be a global ordering of stores from all
processors. And nor need there be a global ordering of loads.

A formal model must make a few more statements about the limited forms
of causality that are maintained in a processor-consistent system.
(E.g. two-party causality; three-party causality is not maintained, to
the best of my knowledge.) And, to be perfectly honest, I forget what
statements need to be made to differentiate between the two sub-types
of processor consistency: Gharachorloo type I and type II, where in
the latter you can forward from a store buffer (an implementation
consideration).

---

As Mitch says, the above can be briefly stated: WB memory is processor
consistent, type II. Describing the interaction of other memory types
is more complicated.

---

I do not know or care very much what the Itanium processor manual says
about x86 memory ordering. I wouldn't be surprised if they got it
wrong; or, as in the examples Joe provided, described a mapping which
has explanatory value, but not definitional value.

---

Joe Seigh

Sep 2, 2005, 3:42:40 PM
Andy Glew wrote:
>
> Bottom quoting: asbestos donned!
>
> I think that Joe Seigh has incorrectly assumed that processor
> consistency implies (a) a global ordering of all loads, and (b) causal
> ordering.

I think I was trying to prove that you couldn't imply global ordering
of loads.

Part of the problem is that there are two target groups of programmers for
the memory model here. Processor consistency is alright if you're
doing HPC/parallel programming but isn't very useful if you're doing
general multi-threaded programming. There, all you really care about
is what the implicit global ordering between the various combinations
of loads and stores is, and what memory barriers to use for the combinations
where ordering isn't defined.

In the ia32 docs, it's a little muddied because of the mention of
speculative loads. Nonetheless, I had assumed that loads weren't
ordered and that LFENCE or some other memory barrier or serializing
instruction was needed for global ordering of loads. However, there
were some who claimed LFENCE wasn't needed. And the documentation
wasn't explicit enough to definitively counter their claims. And
it had to be really explicit, given the rather incomprehensible
arguments they were presenting.

I've basically decided to ignore these people for now and stick with
my original interpretation of the ia32 memory model.
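
For what it's worth, under that interpretation the reading side of the
earlier example would be coded something like this (C with GCC inline
assembly; a sketch only - LFENCE requires SSE2, and whether the fence
is needed, or even sufficient, is exactly what is in dispute here):

volatile int X, Y;    /* shared; stored to by other processors */

void p3(void)         /* the "processor 3" role from the example */
{
    int u, v;
    u = Y;
    __asm__ __volatile__ ("lfence" ::: "memory");  /* intended as #LoadLoad */
    v = X;
    /* interpretation under test: u == new Y implies v == new X */
    (void)u; (void)v;
}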

Alexander Terekhov

Sep 3, 2005, 7:58:10 AM

Joe Seigh wrote:
[...]

> In the ia32 docs, it's a little muddied because of the mention of
> speculative loads. None the less I had assumed that loads weren't
> ordered and that LFENCE or some other memory barrier or serializing
> instruction was needed for global ordering of loads.

Neither will give you "global ordering of loads". Loads on ia32 are
in-order with respect to other loads and subsequent stores (by the
same processor). The only thing that differentiates PC from TSO is
the lack of remote write atomicity (in IA64 formal memory model
speak). Implementations (e.g. SPO) of course can do all sorts of
tricks to improve performance, but that doesn't change the memory
model. You're in denial.

regards,
alexander.

Joe Seigh

Sep 3, 2005, 8:35:01 AM

Whatever. I'm going to use LFENCE for situations where I'd use
#LoadLoad on sparc (generic, not assuming TSO). And it's not
because I'm in denial. It's because nothing you say is
comprehensible. It's possible you are making some kind of
valid technical point but I have no way of telling.

Alexander Terekhov

Sep 3, 2005, 8:52:27 AM

Joe Seigh wrote:
[...]

> Whatever. I'm going to use LFENCE for situations where I'd use
> #LoadLoad on sparc (generic, not assuming TSO).

You mean RMO? Reportedly, RMO is vaporware, so yeah, you'll get the
same "useful" effect on Sparc as on ia32 (weakly ordered WC memory
aside for a moment): none whatsoever.

regards,
alexander.

Joe Seigh

Sep 3, 2005, 9:46:18 AM
In the same sense that Sparc documentation assumes the weakest possible
architected memory model when documenting usage of its memory barriers.

I know that some sparc processors only implement TSO and Solaris assumes
and requires TSO (so far).

It's possible Intel processors are all effectively implemented as TSO, but we're
talking about the architected memory model and have to assume that unless
writing model-dependent code.

I like how you sidestepped whether LFENCE or some serializing instruction
is required in some situations between successive loads on Intel ia32 processors.
We're assuming weakly ordered memory, I think - whatever the typical multiprocessor
Intel box meant to run Linux or Windows uses. Whatever "write-back cacheable"
is.


This whole thing is bizarre. Any other architecture, e.g. IBM Z architecture,
powerpc, sparc, alpha, ... and there's no problem in discussing whether
memory barriers are needed in certain situations. Only in Intel ia32 and only
when Alexander participates. However, if you filter out any comments by
Alexander then the problem goes away. I should have put in an Alexander filter
earlier. Then I wouldn't have raised this issue in the first place, which
has probably put *me* in a few filters. :)

Alexander Terekhov

Sep 3, 2005, 9:58:21 AM

Joe Seigh wrote:
[...]

> We're assuming weakly ordered memory I think, whatever the typical multiprocessor
> Intel box meant to run Linux or windows uses. Whatever "write-back cacheable"
> is.

It means PC (apart from the non-temporal weakly ordered stuff) under x86
native (not Itanicized x86, i.e. TSO for WB instead of PC), and you don't
need LFENCE under PC.

regards,
alexander.

Eric P.

Sep 3, 2005, 10:02:37 AM
Joe Seigh wrote:

>
> Alexander Terekhov wrote:
> >
> > Neither will give you "global ordering of loads". Loads on ia32 are
> > in-order with respect to other loads and subsequent stores (by the
> > same processor). The only thing that differentiates PC from TSO is
> > the lack of remote write atomicity (in IA64 formal memory model
> > speak). Implementations (e.g. SPO) of course can do all sorts of
> > tricks to improve performance, but that doesn't change the memory
> > model. You're in denial.
> >
>
> Whatever. I'm going to use LFENCE for situations where I'd use
> #LoadLoad on sparc (generic, not assuming TSO). And it's not
> because I'm in denial. It's because nothing you say is
> comprehensible. It's possible you are making some kind of
> valid technical point but I have no way of telling.

As I understand it, the key to causal ordering is Atomic Visibility,
whereby a write becomes visible simultaneously to all processors
other than the one that issued the write. According to Gharachorloo,
Processor Consistency does not require updates to be Atomically Visible
and, in theory, allows non-causal ordering of the kind in your
example. TSO does require Atomic Visibility.

The reason PC allows this rather dubious ordering appears to be so
as to not disallow caches using a Write Update (as opposed to Write
Invalidate) coherency protocol. Imposing Atomic Visibility on a
Write Update cache would be very difficult because each cache would
receive the updated value but then each would have to prevent that
value from being used until all peers had ack'ed. Imposing Atomic
Visibility on a Write Invalidate cache is much easier - just don't
give out the new value until all invalidate ack's are received.

(Others have pointed out, however, that Write Update caches are
undesirable for other reasons, so PC appears to give up atomicity in
order to gain the ability to use a cache design that no one wants.
Go figure.)

The text of the LFENCE instruction in the Intel instruction manual says
"Performs a serializing operation on all load-from-memory instructions
that were issued prior the LFENCE instruction. This serializing
operation guarantees that every load instruction that precedes in
program order the LFENCE instruction is globally visible before any
load instruction that follows the LFENCE instruction is globally
visible. The LFENCE instruction is ordered with respect to load
instructions, other LFENCE instructions,"...

This seems to provide the guarantees of global visibility, and
therefore causality, that you are looking for.

Eric

Alexander Terekhov

Sep 3, 2005, 10:34:56 AM

"Eric P." wrote:
[...]

> The text of LFENCE instruction in the Intel instruction manual says
> "Performs a serializing operation on all load-from-memory instructions
> that were issued prior the LFENCE instruction. This serializing
> operation guarantees that every load instruction that precedes in
> program order the LFENCE instruction is globally visible before any
> load instruction that follows the LFENCE instruction is globally
> visible. The LFENCE instruction is ordered with respect to load
> instructions, other LFENCE instructions,"...
>
> seems to provide the guarantees for globally visibility and

What does "global visibility" means for loads under PC?

> therefore causality that you are looking for.

So where do you put the fence, then?

: processor 1 stores into X
: processor 2 sees the store by 1 into X and stores into Y
: processor 3 loads from Y
: processor 3 loads from X

regards,
alexander.

Alexander Terekhov

Sep 3, 2005, 11:35:45 AM

Joe Seigh wrote:

[... filters ...]

< Forward Quoted >

Newsgroups: comp.programming.threads
Subject: Re: Memory visibility and MS Interlocked instructions
From: David Hopwood <david.nosp...@blueyonder.co.uk>

-------- Original Message --------

David Hopwood wrote:
>
> Alexander Terekhov wrote:
> > Andy Glew of Intel (sorta) confirmed that x86 is classic PC.
> >
> > http://groups.google.de/group/comp.arch/msg/7200ec152c8cca0c
>
> Joe Seigh wrote:
> > The argument being presented in c.p.t. is that processor consistency
> > implies loads are in order, perhaps instigated by something Andy Glew
> > said about this here
> > http://groups.google.com/group/comp.arch/msg/96ec4a9fb75389a2
>
> and in another post:
> | "loads in order" means #LoadLoad between loads.
>
> > AFAICT, this is not true for 3 or more processors. E.g.
> >
> > processor 1 stores into X
> > processor 2 sees the store by 1 into X and stores into Y
> >
> > So, by causal reasoning, the store into Y occurred after the store into X.
>
> Processor consistency is weaker than causal consistency, remember.
>
> > processor 3 loads from Y
> > processor 3 loads from X
> >
> > If loads were in order you could infer that if processor 3
> > sees the new value of Y then it will see the new value of X.
>
> No.
>
> Start with X == Y == 0.
>
> P1: X := 1
>
> P2: t := X;
>     if (t == 1) Y := 1
>
> P3: u := Y
>     #LoadLoad // or acquire
>     v := X
>
> {u == 1, v == 0} is possible. This is because P2 and P3 might see
> the stores to X and Y in a different order, because they are made
> by different processors. The #LoadLoad does not prevent this.
>
> > But the rules for processor consistency *clearly* state that
> > you will [not] necessarily see stores by different processors in
> > order.
> >
> > While there are still ordering constraints on the loads they
> > don't have to be strictly in order as Andy incorrectly infers.
>
> #LoadLoad between loads does not imply that you will necessarily
> see stores by different processors in a single global order. That
> is what you appear to be misunderstanding. In other words, there
> is nothing inconsistent between what Andy Glew's post said, and
> Alexander's assertion that load on x86 implies load.acq.
>
> --
> David Hopwood <david.nosp...@blueyonder.co.uk>

regards,
alexander.

Eric P.

Sep 3, 2005, 11:30:54 AM
Alexander Terekhov wrote:
>
> "Eric P." wrote:
> [...]
> > The text of LFENCE instruction in the Intel instruction manual says
> > "Performs a serializing operation on all load-from-memory instructions
> > that were issued prior the LFENCE instruction. This serializing
> > operation guarantees that every load instruction that precedes in
> > program order the LFENCE instruction is globally visible before any
> > load instruction that follows the LFENCE instruction is globally
> > visible. The LFENCE instruction is ordered with respect to load
> > instructions, other LFENCE instructions,"...
> >
> > seems to provide the guarantees for globally visibility and
>
> What does "global visibility" means for loads under PC?

Point taken.

> > therefore causality that you are looking for.
>
> So where do you put the fence, then?
>
> : processor 1 stores into X
> : processor 2 sees the store by 1 into X and stores into Y
> : processor 3 loads from Y
> : processor 3 loads from X
>
> regards,
> alexander.

I was wondering that myself. How about:
P3:
LD X
LFENCE
LD Y
LFENCE
LD X

This does seem a terrible price to pay for the 'advantages' one
gets from giving up Atomic Visibility.

In practice I would be surprised if this could ever really occur.
When Joe posted the example, I thought it was impossible.
I was surprised to find that it was, in theory, possible, at
least according to the Gharachorloo definition of PC.
I would be more surprised if there was one programmer in a
million who did not consider this a hardware bug or who wrote
code that took this into account. I'd bet people code to TSO.

Eric

Joe Seigh

Sep 3, 2005, 12:04:03 PM
Alexander Terekhov wrote:
> So where do you put the fence, then?
>
> : processor 1 stores into X
> : processor 2 sees the store by 1 into X and stores into Y
> : processor 3 loads from Y
> : processor 3 loads from X
>

Since this was my example, I should clarify. It was meant to
show that PC alone wasn't sufficient to guarantee that, if processor
3 saw the store into Y by processor 2, it would also see the
store into X by processor 1.

My understanding of the ia32 memory model is that you
need a fence instruction between the loads by processor 3,
and a fence between the load and store by processor 2, to
make the guarantee work.
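
A sketch of that placement in C with GCC inline assembly (MFENCE used
as the conservative barrier; whether these fences actually restore the
guarantee under PC is disputed below):

volatile int X, Y;    /* initially 0; P1 just does X = 1 */

void p2(void)
{
    while (X != 1)                                  /* see P1's store into X */
        ;
    __asm__ __volatile__ ("mfence" ::: "memory");   /* load X -> store Y */
    Y = 1;
}

void p3(void)
{
    int u, v;
    u = Y;
    __asm__ __volatile__ ("mfence" ::: "memory");   /* between the two loads */
    v = X;
    (void)u; (void)v;    /* claimed guarantee: u == 1 implies v == 1 */
}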

Alexander Terekhov

Sep 3, 2005, 12:21:43 PM

"Eric P." wrote:
[...]

> I was wondering that myself. How about:
> P3:
> LD X
> LFENCE
> LD Y
> LFENCE
> LD X

That won't change anything. For causality, you need to CAS X on P3.

>
> This does seem a terrible price to pay for the 'advantages' one
> gets from giving up Atomic Visability.

Power architecture also doesn't guarantee atomic visibility.

Here's full-modes (apart from dd/cc stuff) load intrinsic in pseudo
code for CELLs and XBOXes. ;-)

http://tinyurl.com/83r9b

Constraint calculator:

http://tinyurl.com/9vamz

regards,
alexander.

Alexander Terekhov

Sep 3, 2005, 12:33:08 PM

Joe Seigh wrote:
[...]

> My understanding of the ia32 memory model is that you
> need a fence instruction between the loads by processor 3

And what are you going to do on a (hypothetical) quad 486 (or
some other old ia32) box without SSE fences? ;-)

regards,
alexander.

Alexander Terekhov

Sep 3, 2005, 12:40:44 PM

Joe Seigh wrote:
[...]

> My understanding of the ia32 memory model is that you
> need a fence instruction between the loads by processor 3

LFENCE/#LoadLoad is implied by processor consistency.

> and a fence between the load and store by processor 2 to
> make the guarantee work.

#LoadStore fence for P2 (load X ... store Y) is also implied by
processor consistency.

So what's the point?

regards,
alexander.

Eric P.

Sep 3, 2005, 3:30:40 PM
Alexander Terekhov wrote:
>
> "Eric P." wrote:
> [...]
> > I was wondering that myself. How about:
> > P3:
> > LD X
> > LFENCE
> > LD Y
> > LFENCE
> > LD X
>
> That won't change anything. For causality, you need to CAS X on P3.

Yeah. X could change again after the first fence. Silly me. :-)
I was trying to avoid the fact that the LFENCE definition does NOT
require all queued invalidates to be delivered before proceeding.
That might allow the update to X to remain outstanding.

It would be simpler if they had used definitions like for the
Alpha Memory Barrier MB instruction:

"MB and CALL_PAL IMB force all preceding writes to at least reach
their respective coherency points. This does not mean that main-memory
writes have been done, just that the order of the eventual writes is
committed.

MB and CALL_PAL IMB also force all queued cache invalidates to be
delivered to the local caches before starting any subsequent reads
(that may otherwise cache hit on stale data) or writes (that may
otherwise write the cache, only to have the write effectively
overwritten by a late-delivered invalidate)."

> Power architecture also doesn't guarantee atomic visibility.

Not that it is relevant to the x86, but a PowerPC 750 manual that
I have from 1999 says

"3.3.5.1 Performed Loads and Stores
The PowerPC architecture defines a performed load operation as one
that has the addressed memory location bound to the target register
of the load instruction. The architecture defines a performed store
operation as one where the stored value is the value that any other
processor will receive when executing a load operation."

This would seem to indicate that, at least for that model,
it used atomic visibility. It still needs sync instructions
to prevent load & store reordering or bypassing.

Eric

Alexander Terekhov

Sep 5, 2005, 8:17:17 AM

Andy Glew wrote:
[...]

> briefly stated: WB memory is processor consistent, type II.

Would you please confirm that, in order to get SC semantics for x86 WB
memory, I just need to replace all loads by lock-cmpxchg with 42 in the
accumulator and simply use the resulting value in the accumulator after
cmpxchg as the load operation result... which would also provide
store-load fencing inside cmpxchg with respect to the load from DEST?

TIA.

regards,
alexander.

Eric P.

Sep 5, 2005, 11:04:48 AM
Alexander Terekhov wrote:
>
> "Eric P." wrote:
> [...]
> > I was wondering that myself. How about:
> > P3:
> > LD X
> > LFENCE
> > LD Y
> > LFENCE
> > LD X
>
> That won't change anything. For causality, you need to CAS X on P3.

Does the following basically reflect your reasoning?

Scenario:

processor 1 stores into X
processor 2 sees the store by 1 into X and stores into Y
processor 3 loads from Y
processor 3 loads from X

1) Processor Consistency intrinsically allows P3 to have a new
value for Y and a stale value for X. This can be accomplished,
for example, by allowing P1 to hand out new values for X to
some peers before ensuring all old values of X are invalid.

There may be an invalidate X winging its way from P1 to P3,
but there is no guarantee when it will arrive (other than that it
does so before the next store by P1 arrives at P3).

2) SFENCE "guarantees that the results of every store instruction
that precedes the store fence in program order is globally visible
before any store instruction that follows the fence."
This is intended for use with weak ordered memory types.

The guarantee is that the value will be 'globally visible' at
some time in the future and before the next store, NOT that it
will be globally visible at the end of the SFENCE.

When used with normal Processor Consistency and Write Back cacheable
memory, this is exactly the same guarantee as PC provides; therefore
the SFENCE does nothing to change invalidate delivery.

3) LFENCE does not explicitly guarantee to drain all pending
invalidates for a processor. However even assuming that was
just a documentation oversight and that it really does drain them,
since there is no guarantee that P3 will have received its
invalidate, an LFENCE on P3 does not guarantee X is not stale.
P3 can still receive the new Y, LFENCE to drain the invalidates
and read the old X.

(I considered whether LFENCE might perform a 'global sync' by
communicating with all peers and ensure there were no outstanding
invalidates/updates in flight to itself before the drain in order
to ensure X was up to date. However I don't believe this would
work unless the global sync was itself atomic.)

4) The only way to guarantee that a processor has the most recent
value of a location is to take ownership of the variable,
and that requires a write. Since we actually want to read X,
we use CAS (x86 LOCK CMPXCHG) to read the most recent value.

So in the presence of Processor Consistency, with its lack of
Atomic Visibility, the causally consistent sequence is:

P3:
      LD   Y, r1       ; load Y
Loop:
      LD   X, r2       ; load the current X
      CAS  X, r2, r2   ; locked compare-and-swap: forces ownership of X
      BEZ  Loop        ; retry if X changed between the LD and the CAS

Eric

Alexander Terekhov

Sep 5, 2005, 11:59:42 AM

"Eric P." wrote:
[...]

> Does the following basically reflect your reasoning:

[... 1 - 3 ...]

Yes.

> 4) The only way to guarantee that a processor has the most recent
> value of a location is to take ownership of the variable,
> and that requires a write. Since we actually want to read X,

^^^^^^^^^^^^^^^^^^^^^^^^^

That's the key.

> we use CAS (x86 LOCK CMPXCHG) to read the most recent value.
>
> So in the presence of Processor Consistency, with its lack of
> Atomic Visibility, then the causally consistent sequence is:
>
> P3:
> LD Y, r1
> Loop:
> LD X, r2
> CAS X, r2, r2
> BEZ Loop

That will work too, but you don't really need to LD X and loop on
CAS compare failure given that x86's cmpxchg always makes a write.
"The destination operand is written back if the comparison fails;
otherwise, the source operand is written into the destination. (The
processor never produces a locked read without also producing a
locked write.)"

So just do cmpxchg(&X, 42, 42) which will perform locked read-write
(with its read part store-load fenced from prior writes, I infer).
You'll get classic SC if you replace all loads with cmpxchg(&X, 42,
42). That's my understanding, and I'm eagerly awaiting confirmation
from Andy Glew and/or someone from Intel hanging out on the C++
memory model mailing list.

http://tinyurl.com/aqgjj
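
In C with GCC inline assembly, the proposed load-via-locked-CMPXCHG
might look like this (a sketch of the recipe above, not a construction
anyone has confirmed correct):

/* Read *p by way of LOCK CMPXCHG. EAX holds the expected value (42);
   if *p != 42 the comparison fails, *p is written back unchanged, and
   its value lands in EAX; if *p == 42, then 42 is stored and nothing
   changes. Either way the read is part of a locked read-modify-write
   and the function returns the value of *p. */
static inline int sc_load(volatile int *p)
{
    int val = 42;                     /* arbitrary expected value */
    __asm__ __volatile__ ("lock; cmpxchgl %2, %1"
                          : "+a" (val), "+m" (*p)
                          : "r" (42)
                          : "memory", "cc");
    return val;
}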

regards,
alexander.

David Hopwood

Sep 5, 2005, 2:03:16 PM
Eric P. wrote:
> Joe Seigh wrote:
>>Alexander Terekhov wrote:
>>
>>>Neither will give you "global ordering of loads". Loads on ia32 are
>>>in-order with respect to other loads and subsequent stores (by the
>>>same processor). The only thing that differentiates PC from TSO is
>>>the lack of remote write atomicity (in IA64 formal memory model
>>>speak). Implementations (e.g. SPO) of course can do all sorts of
>>>tricks to improve performance, but that doesn't change the memory
>>>model. You're in denial.
>>
>>Whatever. I'm going to use LFENCE for situations where I'd use
>>#LoadLoad on sparc (generic, not assuming TSO). And it's not
>>because I'm in denial. It's because nothing you say is
>>comprehensible. It's possible you are making some kind of
>>valid technical point but I have no way of telling.
>
> As I understand it, the key to causal ordering is Atomic Visibility
> whereby a write becomes visible simultaneously to all processors
> other than the one that issued the write. According to Gharachorloo,
> Processor Consistency does not require updates to be Atomically Visible
> and, in theory, allows non-causal ordering of the kind in your
> example. TSO does require Atomic Visibility.

Right.

[...]


> The text of LFENCE instruction in the Intel instruction manual says
> "Performs a serializing operation on all load-from-memory instructions
> that were issued prior the LFENCE instruction. This serializing
> operation guarantees that every load instruction that precedes in
> program order the LFENCE instruction is globally visible before any
> load instruction that follows the LFENCE instruction is globally
> visible. The LFENCE instruction is ordered with respect to load
> instructions, other LFENCE instructions,"...
>
> seems to provide the guarantees for globally visibility and
> therefore causality that you are looking for.

It's not entirely clear what "globally visible" in the Intel manual
is supposed to mean in the terminology of
<http://research.compaq.com/wrl/people/kourosh/papers/1993-tr-68.pdf>,
but I think it means just "performed" (with respect to all processors),
*not* "globally performed".

--
David Hopwood <david.nosp...@blueyonder.co.uk>

David Hopwood

Sep 5, 2005, 2:14:29 PM
Joe Seigh wrote:
> Alexander Terekhov wrote:
>
>> So where do you put the fence, then?
>>
>> : processor 1 stores into X
>> : processor 2 sees the store by 1 into X and stores into Y
>> : processor 3 loads from Y
>> : processor 3 loads from X
>
> Since this was my example I should clarify. It was meant to
> show that PC alone wasn't sufficient to guarantee that if processor
> 3 saw the store into Y by processor 2 that it would see the
> store into X by processor 1.
>
> My understanding of the ia32 memory model is that you
> need a fence instruction between the loads by processor 3
> and a fence between the load and store by processor 2 to
> make the guarantee work.

My understanding is that if the claimed problem exists at all, adding
these fences won't fix it (as far as the model is concerned, possibly
as opposed to implementation details of specific chips).

--
David Hopwood <david.nosp...@blueyonder.co.uk>

Alexander Terekhov

Sep 5, 2005, 2:27:51 PM

David Hopwood wrote:

[... SSE2 LFENCE ...]

> It's not entirely clear what "globally visible" in the Intel manual

It's just a copy&paste leftover from the SSE1 SFENCE description.

regards,
alexander.

Joe Seigh

Sep 5, 2005, 4:21:57 PM

The architected memory model as opposed to the implemented one?

"Despite the fact that Pentium 4, Intel Xeon, and P6 family
processors support processor ordering, Intel does not guarantee that future processors will
support this model. To make software portable to future processors, it is recommended that operating
systems provide critical region and resource control constructs and API's (application
program interfaces) based on I/O, locking, and/or serializing instructions be used to synchronize
access to shared areas of memory in multiple-processor systems."

That one? And what do people think the memory model that only
"I/O, locking, and/or serializing instructions" can synchronize is?

David Hopwood

Sep 5, 2005, 5:21:12 PM
Joe Seigh wrote:
> David Hopwood wrote:
>> Joe Seigh wrote:
>>> Alexander Terekhov wrote:
>>>
>>>> So where do you put the fence, then?
>>>>
>>>> : processor 1 stores into X
>>>> : processor 2 sees the store by 1 into X and stores into Y
>>>> : processor 3 loads from Y
>>>> : processor 3 loads from X
>>>
>>> Since this was my example I should clarify. It was meant to
>>> show that PC alone wasn't sufficient to guarantee that if processor
>>> 3 saw the store into Y by processor 2 that it would see the
>>> store into X by processor 1.
>>>
>>> My understanding of the ia32 memory model is that you
>>> need a fence instruction between the loads by processor 3
>>> and a fence between the load and store by processor 2 to
>>> make the guarantee work.
>>
>> My understanding is that if the claimed problem exists at all, adding
>> these fences won't fix it (as far as the model is concerned, possibly
>> as opposed to implementation details of specific chips).
>
> The architected memory model as opposed to the implemented one?

Yes, that's what I said.

> "Despite the fact that Pentium 4, Intel Xeon, and P6 family
> processors support processor ordering, Intel does not guarantee that
> future processors will support this model. To make software portable
> to future processors, it is recommended that operating systems provide
> critical region and resource control constructs and API's (application
> program interfaces) based on I/O, locking, and/or serializing
> instructions be used to synchronize access to shared areas of
> memory in multiple-processor systems."

This is all perfectly sensible. "Future processors" from Intel are not
necessarily ISA-compatible with x86 anyway. For example, you need to
recompile to use long mode in EM64T. Also note that it doesn't say
"future x86 processors". Maybe they were talking about Itanic.

Even if they weren't talking about IA-64 or a different mode, it's
still a good idea to avoid dependencies on the memory model in
*applications*, since it is more difficult to change all apps that
have such dependencies than it is to change threading libraries in OS
and language implementations. In fact OS/lang-impl maintainers half
expect stuff to rot on new hardware, and hopefully remember what they
depended on. Application maintainers generally don't (if they ever
understood it in the first place). This is what I've been saying
consistently.

Anyway, this issue doesn't have anything to do with what we were talking
about, which is whether the current architected x86 model allows a
particular behaviour.

> That one? And what do people think the memory model that only
> "I/O, locking, and/or serializing instructions" can synchronize is?

You're overanalysing a fairly loosely worded recommendation.

--
David Hopwood <david.nosp...@blueyonder.co.uk>

Joe Seigh

Sep 5, 2005, 6:32:20 PM
David Hopwood wrote:
> Joe Seigh wrote:
>
>> "Despite the fact that Pentium 4, Intel Xeon, and P6 family
>> processors support processor ordering, Intel does not guarantee that
>> future processors will support this model. To make software portable
>> to future processors, it is recommended that operating systems provide
>> critical region and resource control constructs and API's (application
>> program interfaces) based on I/O, locking, and/or serializing
>> instructions be used to synchronize access to shared areas of
>> memory in multiple-processor systems."
>
>
> This is all perfectly sensible. "Future processors" from Intel are not
> necessarily ISA-compatible with x86 anyway. For example, you need to
> recompile to use long mode in EM64T. Also note that it doesn't say
> "future x86 processors". Maybe they were talking about Itanic.
>
> Even if they weren't talking about IA-64 or a different mode, it's
> still a good idea to avoid dependencies on the memory model in
> *applications*, since it is more difficult to change all apps that
> have such dependencies than it is to change threading libraries in OS
> and language implementations. In fact OS/lang-impl maintainers half
> expect stuff to rot on new hardware, and hopefully remember what they
> depended on. Application maintainers generally don't (if they ever
> understood it in the first place). This is what I've been saying
> consistently.

Yes, your aversion to anarchist application programmers doing their
own thing is well known. :)

>
> Anyway, this issue doesn't have anything to do with what we were talking
> about, which is whether the current architected x86 model allows a
> particular behaviour.
>
>> That one? And what do people think the memory model that only
>> "I/O, locking, and/or serializing instructions" can synchronize is?
>
>
> You're overanalysing a fairly loosely worded recommendation.
>

I'm not sure what you're saying here. That all future processors
from Intel that don't have processor ordering won't be x86? And
that the synchronization instructions in these future processors
won't be similar to the ones in x86? That Intel is telling people
in an x86 manual to start writing portable code not now but when
they get to the future processor? That's a little strange even for
Intel.

David Hopwood

Sep 5, 2005, 8:26:40 PM

Right, I am absolutely convinced that the roles of application
programmer and infrastructure programmer should be clearly separated
(even if there are a few people with the ability and expertise needed
to successfully do both).

>> Anyway, this issue doesn't have anything to do with what we were talking
>> about, which is whether the current architected x86 model allows a
>> particular behaviour.
>>
>>> That one? And what do people think the memory model that only
>>> "I/O, locking, and/or serializing instructions" can synchronize is?
>>
>> You're overanalysing a fairly loosely worded recommendation.
>
> I'm not sure what you're saying here. That all future processors
> from Intel that don't have processor ordering won't be x86?

Well, they won't be x86-as-we-know-it. OSes, compilers, etc. will
have to be changed to run on or generate code for this new x86-like
thing, and changes in the memory model will probably be only one issue
they need to deal with.

> And that the synchronization intructions in these future processors
> won't be similar to the one's in x86? That Intel is telling people
> in an x86 manual to start writing portable code not now but when
> they get to the future processor?

Of course not. Read what they actually wrote.

--
David Hopwood <david.nosp...@blueyonder.co.uk>

Andy Glew

Sep 5, 2005, 8:22:22 PM
Alexander Terekhov <tere...@web.de> writes:

> So just do cmpxchg(&X, 42, 42) which will perform locked read-write
> (with its read part store-load fenced from prior writes, I infer).
> You'll get classic SC if you replace all loads with cmpxchg(&X, 42,
> 42). That's my understanding, and I'm eagerly awaiting confirmation
> from Andy Glew and/or someone from Intel hanging at C++ memory model
> mailing list.

42, eh? Sounds like a joke: Goodbye, and thanks for all the thrash...

I think that the overall intention is that placing MFENCE before and
after every memory reference is supposed to get you SC semantics.
However, MFENCE, LFENCE, and SFENCE were defined after my time, and I
suspect that their definitions are not quite complete enough for what
you want. In particular, *FENCE really only work wrt WC cacheable
memory, and do not drain external buffers such as may occur in bus
bridges. In general, the P6 and Wmt families' mechanism for ensuring
ordering, waiting for global observability, only works for perfectly
vanilla WC cacheable memory, and is frequently violated wrt other
memory types. So I do not want to guarantee that it will work for
things like WC cached memory that is private to a graphics
accelerator.

You may be right that using the cmpxchg as you describe achieves SC on
x86. However, I need to think about it a bit more, since the
reasoning you provide is implementation specific, not architectural.

(Note that an atomic RMW like cmpxchg could well be implemented
without any fencing semantics. I.e. atomic RMWs and memory
ordering/fencing are independent concepts. I argued for this in
Itanium; I am trying to remember if x86 required that the two be mixed
up together. I can't see why it should have... I.e. I am sure that
using cmpxchg as you describe need not provide SC on a reasonable
computer architecture. I just need to find out if x86 mixed the two up
for some legacy reasons. In the meantime, using the fences would be my
recommendation.)


> > 4) The only way to guarantee that a processor has the most recent
> > value of a location is to take ownership of the variable,
> > and that requires a write. Since we actually want to read X,
> ^^^^^^^^^^^^^^^^^^^^^^^^^
>
> That's the key.
>
> > we use CAS (x86 LOCK CMPXCHG) to read the most recent value.

Flawed argument.

It is entirely possible to imagine implementations of CAS that do not
write the variable if the value is unchanged.

> That will work too, but you don't really need to LD X and loop on
> CAS compare failure given that x86's cmpxchg always makes a write.
> "The destination operand is written back if the comparison fails;
> otherwise, the source operand is written into the destination. (The
> processor never produces a locked read without also producing a
> locked write.)"

You are confusing implementation with semantics.

Joe Seigh

Sep 5, 2005, 9:13:46 PM
David Hopwood wrote:
> Joe Seigh wrote:
>
>> David Hopwood wrote:
>>>
>>>> That one? And what do people think the memory model that only
>>>> "I/O, locking, and/or serializing instructions" can synchronize is?
>>>
>>>
>>> You're overanalysing a fairly loosely worded recommendation.
>>
>>
>> I'm not sure what you're saying here. That all future processors
>> from Intel that don't have processor ordering won't be x86?
>
>
> Well, they won't be x86-as-we-know-it. OSes, compilers, etc. will
> have to be changed to run on or generate code for this new x86-like
> thing, and changes in the memory model will probably be only one issue
> they need to deal with.
>
>> And that the synchronization intructions in these future processors
>> won't be similar to the one's in x86? That Intel is telling people
>> in an x86 manual to start writing portable code not now but when
>> they get to the future processor?
>
>
> Of course not. Read what they actually wrote.
>

I did. It sounded to me like they said if you want to write
portable code, don't assume processor ordering but use the
locking and serializing instructions instead on the current
processors.

Alexander Terekhov

unread,
Sep 6, 2005, 5:01:26 AM9/6/05
to

Andy Glew wrote:
[...]

> I think that the overall intention is that placing MFENCE before and
> after every memory reference is supposed to get you SC semantics.

But without remote write atomicity, I suppose. And, BTW, that's what
revised Java volatiles do. I mean JSR-133 memory model.

> However, MFENCE, LFENCE, and SFENCE were defined after my time, and I
> suspect that their definitions are not quite complete enough for what
> you want. In particular, *FENCE really only work wrt WB cacheable
> memory, and do not drain external buffers such as may occur in bus
> bridges.

My reading of the specs is that MFENCE is guaranteed to provide
store-load barrier.

P1: X = 1; R1 = Y;
P2: Y = 1; R2 = X;

(R1, R2) = (0, 0) is allowed under pure PC, but

P1: X = 1; MFENCE; R1 = Y;
P2: Y = 1; MFENCE; R2 = X;

(R1, R2) = (0, 0) is NOT allowed.
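
In compiler-intrinsic form, the same litmus test looks like this (a
sketch only; assumes SSE2's _mm_mfence from <emmintrin.h>, and p1/p2
are invented names for the code running on the two processors):

    #include <emmintrin.h>   /* _mm_mfence (SSE2) */

    volatile int X = 0, Y = 0;
    int R1, R2;

    void p1(void) { X = 1; _mm_mfence(); R1 = Y; }   /* runs on P1 */
    void p2(void) { Y = 1; _mm_mfence(); R2 = X; }   /* runs on P2 */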



> In general, the P6 and Wmt families' mechanism for ensuring
> ordering, waiting for global observability, only works for perfectly
> vanilla WB cacheable memory, and is frequently violated wrt other
> memory types. So I do not want to guarantee that it will work for
> things like WC cached memory that is private to a graphics
> accelerator.

I want to know whether MFENCE provides store-load barrier for WB
memory.

>
> You may be right that using the cmpxchg as you describe achieves SC on
> x86. However, I need to think about it a bit more, since the
> reasoning you provide is implementation specific, not architectural.

I'm just reading the specs.

CMPXCHG on x86 always performs a (hopefully StoreLoad+LoadLoad fenced)
load followed by a (LoadStore+StoreStore fenced) store (plus trailing
MFENCE, so to speak). Locked CMPXCHG is supposed to be "fully fenced".

Regarding safety net for remote write atomicity, I rely on the
following CMPXCHG wording:

"The destination operand is written back if the comparison fails;
otherwise, the source operand is written into the destination.
(The processor never produces a locked read without also
producing a locked write.)"

I suspect that (locked) XADD(addr, 0) will also work... but I'm
somewhat missing strong language about mandatory write as in CMPXCHG.
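
For illustration, here is the kind of cmpxchg-based read I mean (a
sketch in GCC inline assembly; the name locked_read is invented, and
whether this actually yields SC is exactly the question):

    /* Read *p via LOCK CMPXCHG: eax starts at 0; if *p != 0 the old
       value is loaded into eax, and if *p == 0 the (equal) value 0
       is written back.  Either way a locked read and a locked write
       occur, and eax ends up holding the old contents of *p. */
    static inline int locked_read(volatile int *p)
    {
        int old = 0;
        __asm__ __volatile__ ("lock; cmpxchgl %2, %1"
                              : "+a" (old), "+m" (*p)
                              : "r" (0)
                              : "cc", "memory");
        return old;
    }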

[... cmpxchg could well be implemented without any fencing ...]

"Locked operations are atomic with respect to all other memory
operations and all externally visible events. Only instruction
fetch and page table accesses can pass locked instructions. Locked
instructions can be used to synchronize data written by one
processor and read by another processor.

For the P6 family processors, locked operations serialize all
outstanding load and store operations (that is, wait for them to
complete). This rule is also true for the Pentium 4 and Intel Xeon
processors, with one exception: load operations that reference
weakly ordered memory types (such as the WC memory type) may not
be serialized."

> You are confusing implementation with semantics.

Fix the specs, then.

And explain how one can achieve classic SC semantics for WB memory.

regards,
alexander.

David Hopwood

unread,
Sep 6, 2005, 7:26:04 AM9/6/05
to
Joe Seigh wrote:
> David Hopwood wrote:
>> Joe Seigh wrote:
>>
>>> I'm not sure what you're saying here. That all future processors
>>> from Intel that don't have processor ordering won't be x86?
>>
>> Well, they won't be x86-as-we-know-it. OSes, compilers, etc. will
>> have to be changed to run on or generate code for this new x86-like
>> thing, and changes in the memory model will probably be only one issue
>> they need to deal with.
>>
>>> And that the synchronization intructions in these future processors
>>> won't be similar to the one's in x86? That Intel is telling people
>>> in an x86 manual to start writing portable code not now but when
>>> they get to the future processor?
>>
>> Of course not. Read what they actually wrote.
>
> I did. It sounded to me like they said if you want to write
> portable code, don't assume processor ordering but use the
> locking and serializing instructions instead on the current
> processors.

But OSes, thread libraries and language implementations *aren't* portable
code.

--
David Hopwood <david.nosp...@blueyonder.co.uk>

Joe Seigh

unread,
Sep 6, 2005, 7:53:19 AM9/6/05
to
David Hopwood wrote:
> Joe Seigh wrote:
>
>> David Hopwood wrote:
>>>
>>> Of course not. Read what they actually wrote.
>>
>>
>> I did. It sounded to me like they said if you want to write
>> portable code, don't assume processor ordering but use the
>> locking and serializing instructions instead on the current
>> processors.
>
>
> But OSes, thread libraries and language implementations *aren't* portable
> code.
>

I do not think that word means what you think it means.

Note that I am an ex-kernel developer and have created enough synchronization
api's that run on totally different platforms. I've created an atomically
thread-safe reference-counted smart pointer that has two totally different
implementations on two different architectures. Given that Sun Microsystems'
research division couldn't manage to do this and could only do it on an
obsolete architecture, I'd say I have a pretty good idea what portability is
and what its issues are.

Joe Seigh

unread,
Sep 6, 2005, 8:02:06 AM9/6/05
to
Alexander Terekhov wrote:

> Andy Glew wrote:
>
>
>>You are confusing implementation with semantics.
>
>
> Fix the specs, then.

I think you can assume that the serializing stuff does the right thing.
If you have strong reason to believe otherwise, then you should short
Intel stock, as you'd stand a pretty good chance of making a fortune.
Basically, no OS would work correctly on an Intel-based multi-processor
server and Intel would be out of that business. Also, Intel would be
screwed in the multi-core workstation and desktop market, as it would be
too late to fix the current processors going into production.

David Hopwood

unread,
Sep 6, 2005, 8:54:52 AM9/6/05
to
Joe Seigh wrote:
> David Hopwood wrote:
>> Joe Seigh wrote:
>>> David Hopwood wrote:
>>>
>>>> Of course not. Read what they actually wrote.
>>>
>>> I did. It sounded to me like they said if you want to write
>>> portable code, don't assume processor ordering but use the
>>> locking and serializing instructions instead on the current
>>> processors.
>>
>> But OSes, thread libraries and language implementations *aren't* portable
>> code.
>
> I do not think that word means what you think it means.
>
> Note that I am an ex-kernel developer and have created enough
> sychronization api's that run on totally different platforms.

You are totally missing the point. OSes, thread libraries and language
implementations have some code that needs to be adapted to each hardware
architecture. If the memory model were to change in future processors
that are otherwise x86-like, this code would have to change. It's not a
big deal, because this platform-specific code is maintained by people who
know how to change it, and because there are few enough OSes, thread
libraries, and language implementations for the total effort involved
not to be very great. It would, however, be a big deal if existing x86
*applications* stopped working on an otherwise x86-compatible processor.

--
David Hopwood <david.nosp...@blueyonder.co.uk>

Joe Seigh

unread,
Sep 6, 2005, 9:49:18 AM9/6/05
to

I am talking about that. You insist on maintaining that I advocate
that applications hardcode platform-specific assembly code into their
source. I have never advocated that.

But when you design these api's you have to have a pretty good idea
what kinds of things can be ported and what assumptions you are making
about the memory model. Since I've actually done this kind of stuff,
I probably have a much better idea than you do what the actual issues
are.

And yes, there isn't any assumption about the memory model that can't
be broken by a hardware designer. The only thing that keeps hardware
companies from breaking widely used api's like Posix pthreads is that
they might go out of business if they did. Hence, shorting Intel stock
might be a good idea if you believe they did do that. But saying
that we should only use widespread api's and not ever create any
new ones is ridiculous.

Eric P.

unread,
Sep 6, 2005, 10:26:49 AM9/6/05
to
Alexander Terekhov wrote:
>
> My reading of the specs is that MFENCE is guaranteed to provide
> store-load barrier.
>
> P1: X = 1; R1 = Y;
> P2: Y = 1; R2 = X;
>
> (R1, R2) = (0, 0) is allowed under pure PC, but
>
> P1: X = 1; MFENCE; R1 = Y;
> P2: Y = 1; MFENCE; R2 = X;
>
> (R1, R2) = (0, 0) is NOT allowed.

Are you sure you are not being inconsistent in example 2 here?
(wrt what you answered yesterday about S/LFENCE).

If MFENCE is just an SFENCE+LFENCE, and neither of those guarantees
delivery or receipt of invalidates, then P1 can have a stale Y
and P2 a stale X. The MFENCE does nothing but prevent bypassing.

Eric

Eric P.

unread,
Sep 6, 2005, 10:58:29 AM9/6/05
to

Forget it, I see. With two processors Y can be stale on P1,
or X stale on P2, but not both.

Eric

Alexander Terekhov

unread,
Sep 6, 2005, 11:29:51 AM9/6/05
to

"Eric P." wrote:
>
> Alexander Terekhov wrote:
> >
> > My reading of the specs is that MFENCE is guaranteed to provide
> > store-load barrier.
> >
> > P1: X = 1; R1 = Y;
> > P2: Y = 1; R2 = X;
> >
> > (R1, R2) = (0, 0) is allowed under pure PC, but
> >
> > P1: X = 1; MFENCE; R1 = Y;
> > P2: Y = 1; MFENCE; R2 = X;
> >
> > (R1, R2) = (0, 0) is NOT allowed.
>
> Are you sure you are not being inconsistent in example 2 here?
> (wrt what you answered yesterday about S/LFENCE).

PC implies both LFENCE and SFENCE ordering constraints. I don't
think that you've got the invalidation stuff entirely accurate, but
the basic logic is correct.

>
> If MFENCE is just an SFENCE+LFENCE,

No.

SFENCE is a store-store barrier and LFENCE is a load-load barrier.

store-store + load-load != store-load.

MFENCE ensures that preceding writes are made globally visible
before subsequent reads are performed (store-load barrier)...
plus it imposes all other PC ordering constraints (load-load +
load-store + store-store).
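
To see the gap concretely, a sketch (same shape as my earlier litmus
test; _mm_sfence and _mm_lfence as provided by <emmintrin.h>):

    #include <emmintrin.h>   /* _mm_sfence, _mm_lfence */

    volatile int X = 0, Y = 0;
    int R1, R2;

    /* SFENCE+LFENCE in place of MFENCE: the older store may still
       sit in the store buffer while the younger load completes, so
       (R1, R2) == (0, 0) remains possible. */
    void p1(void) { X = 1; _mm_sfence(); _mm_lfence(); R1 = Y; }
    void p2(void) { Y = 1; _mm_sfence(); _mm_lfence(); R2 = X; }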

regards,
alexander.

Alexander Terekhov

unread,
Sep 14, 2005, 4:07:29 AM9/14/05
to
Hey Mr. andy...@intel.com,

you better fix the specs, really. It's not funny anymore.

http://msdn.microsoft.com/msdnmag/issues/05/10/MemoryModels/default.aspx

"When multiprocessor systems based on the x86 architecture were being
designed, the designers needed a memory model that would make most
programs just work, while still allowing the hardware to be reasonably
efficient. The resulting specification requires writes from a
single processor to remain in order with respect to other writes, but
does not constrain reads at all.

Unfortunately, a guarantee about write order means nothing if reads
are unconstrained. After all, it does not matter that A is written
before B if every reader reading B followed by A has reads reordered
so that the post-update value of B and the pre-update value of A is
seen. The end result is the same: write order seems reversed. Thus,
as specified, the x86 model does not provide any stronger guarantees
than the ECMA model.

It is my belief, however, that the x86 processor actually implements
a slightly different memory model than is documented. While this model
has never failed to correctly predict behavior in my experiments, and
it is consistent with what is publicly known about how the hardware
works, it is not in the official specification. New processors might
break it."

regards,
alexander.

Joe Seigh

unread,
Sep 14, 2005, 8:09:42 AM9/14/05
to
Alexander Terekhov wrote:
> Hey Mr. andy...@intel.com,
>
> you better fix the specs, really. It's not funny anymore.
>
> http://msdn.microsoft.com/msdnmag/issues/05/10/MemoryModels/default.aspx
>
It's pretty clear from Andy's comments and from the technical documentation
that Intel's technical writers aren't entirely sure who their audience
actually is and mix up the specification, which is of interest to programmers,
and the implementation, which is of interest to engineers. Andy's last
comment, which appeared to me to be about implementation, certainly didn't
help.

It also doesn't help that Intel has a tradition of not architecting multi-processing
support up front, instead defining it on the fly as support gets added, in clear
contrast to how other companies have documented multi-processing support in their
architectures. You had companies building Intel-based multi-processors before Intel
even supported multi-processing, which meant the memory model they implemented may
or may not have matched what Intel later documented as the official memory model.
This apparently remains the tradition, and there's a comment to this effect in the
Intel documentation:

"Also, software should not depend on processor ordering in situations where
the system hardware does not support this memory-ordering model."

Alexander Terekhov

unread,
Oct 22, 2005, 10:56:00 AM10/22/05
to

Joe Seigh wrote: ...

http://www.decadentplace.org.uk/pipermail/cpp-threads/2005-October/000728.html

<quote>

Enough people from Intel who can speak authoritatively about this
for me to confidently believe it have said (a) "locked" instructions
and mfence DO have global ordering properties on current and
near-future x86s (b) Intel now realizes that this should have been
documented and will try to ensure that it is (c) Intel does not want
to promise that this will hold forever, and might be interested in
engaging with different language-level standards groups to see if
there is a way to weaken total SC-ness of lock/volatile/atomic specs
to avoid multiple observer ordering agreement requirements that do not
impact practical programs.

</quote>

regards,
alexander.

Joe Seigh

unread,
Oct 22, 2005, 12:25:38 PM10/22/05
to
I'm not on cpp-threads anymore (the moderation time lag was insane), so I can't
reply there. However, I would think the standards groups would want to avoid
a meta memory model in their definitions, to give the hardware as much
flexibility as possible in its memory model.

David Hopwood

unread,
Oct 22, 2005, 9:16:30 PM10/22/05
to
Alexander Terekhov wrote:
> Joe Seigh wrote: ...
>
> http://www.decadentplace.org.uk/pipermail/cpp-threads/2005-October/000728.html
>
> <quote>
>
> Enough people from Intel who can speak authoritatively about this
> for me to confidently believe it have said (a) "locked" instructions
> and mfence DO have global ordering properties on current and
> near-future x86s

This is rather vague. I assume it must mean at least that locked instructions
and mfence perform in a total order that is the same for all processors. But
no program is going to use *just* locked accesses and mfences. So how does
this interact with other accesses?

> (b) Intel now realizes that this should have been
> documented and will try to ensure that it is (c) Intel does not want
> to promise that this will hold forever, and might be interested in
> engaging with different language-level standards groups to see if
> there is a way to weaken total SC-ness of lock/volatile/atomic specs
> to avoid multiple observer ordering agreement requirements that do not
> impact practical programs.
>
> </quote>

Are these instructions typically executed frequently enough for any relaxation
of their ordering semantics to be worthwhile?

Intel (and AMD) should IMHO wait until the software and standards situation has
settled down a bit before attempting to weaken their hardware. The x86 variant
of processor consistency, with total ordering for some instructions, is fine
for the time being. It just needs to be better documented (e.g. something like
<http://www.intel.com/design/itanium/downloads/25142901.pdf> for x86[-64]).

--
David Hopwood <david.nosp...@blueyonder.co.uk>
