Longtime readers of comp.arch may know
a) that I was involved in defining Intel's implementation of memory
ordering (the P6 MOB, etc.), and that I am actually a little bit proud
of that work.
b) that I have long been embarrassed by Intel's documentation of
this memory ordering model.
I am happy to report that a much improved document describing Intel's
memory ordering model has been posted to
>Intel® 64 Architecture Memory Ordering White Paper
>This document provides information for Intel 64 architecture memory ordering
>at a level that is architecturally visible to software. The principles and examples
>provide software writers with a clear understanding of the results that different
>sequences of memory access instructions may produce.
The guys who got this pushed through Intel deserve much credit. I wish
that I could congratulate them here by name, but that seems not to be
possible.
The document is not perfect (e.g. it is not allowed to use the term
"atomic"), but it is greatly better than what has been available in
the past.
I expect that people who need to write parallel programs on Intel
processors will find this clarifies many heretofore muddy issues.
One particular issue: every so often somebody talks about a parallel
algorithm on Intel processors, and I have felt obliged to point out
that the "official" Intel memory ordering model, as it was documented
in the SDM (Software Developers' Manual), and as I defined it ever so
long ago, did not guarantee causality.
Initially x = y = 0
P1: x = 1
P2: if( x ) y = 1
P3: ry = y
    rx = x
    if( ry )
        if( !rx )
            assert( ! "it should not be possible to have ry==1 and rx==0" );
This was NOT guaranteed to work, heretofore.
The new white paper says that causality is guaranteed: "In a
multiprocessor system, memory ordering obeys causality (memory
ordering respects transitive visibility)."
Indeed, all implementations have provided causality; now it is
architecturally guaranteed.
This is good, because many programmers implicitly assume causality.
This is a personal post. It is not approved by (or disapproved by) my
employer, Intel. I tell you my employer so that you can account for
any bias I may have.
> I am happy to report that a much improved document describing Intel's
> memory ordering model has been posted to
(Drop the colon on the end, otherwise you'll get an error page!)
>> Intel® 64 Architecture Memory Ordering White Paper
>> This document provides information for Intel 64 architecture memory ordering
>> at a level that is architecturally visible to software. The principles and examples
>> provide software writers with a clear understanding of the results that different
>> sequences of memory access instructions may produce.
> The guys who got this pushed through Intel deserve much credit. I wish
> that I could congratulate them here by name, but that seems not to be
> possible.
Please tell them from me that I really appreciate their effort, the
white paper is very clear, and documents a very nice and imho intuitive
set of rules.
The fact that they more or less exactly mirror the way I many years ago
assumed it had to work is just a coincidence, of course. :-)
> The new white paper says that causality is guaranteed: "In a
> multiprocessor system, memory ordering obeys causality (memory
> ordering respects transitive visibility)."
> Indeed, all implementations have provided causality; now it is
> architecturally guaranteed.
> This is good, because many programmers implicitly assume causality.
Exactly right. In fact, I'd say this goes for _every_ programmer, except
those who've been bitten by something like the Alpha (lack of) rules.
"almost all programming can be viewed as an exercise in caching"
Finally! The folks over on comp.programming.threads are going to enjoy this
one.
Thank you Intel!
Is causality guaranteed even when x and/or y are I/O addresses (e.g.
memory-mapped device registers)?
I have to admit that I am puzzled why Intel would not regard it as to
its advantage to explain, as clearly and simply as possible, how to
effectively use its products.
Of course, they may well feel that writing manuals is their
obligation, and writing textbooks is someone else's.
I see from the page you referenced that Intel is now using the term
"Intel 64 Architecture" instead of EM64T to refer to the 64-bit
extensions to the x86 architecture originally developed at AMD - it
really *isn't* another way of saying IA-64 (Itanium).
No, the paper states explicitly that the memory ordering guarantees only
hold for "normal" memory, i.e. Write Back or Write Combining.
Afair, IO is excluded.
On Sep 8, 12:21 am, "Chris Thomasson" <cris...@comcast.net> wrote:
I'm happy that this document has been made available, and well
done to the guys at Intel who made it happen.
The presentation leaves something to be desired. Specifically,
section 2.4 (intra-processor forwarding of stores is allowed)
contradicts the rules that precede it. So the document reads
like "here are some rules, and here is a case where those rules
don't apply, and here are the rest of the rules"!
I understand why the authors wrote it that way. They are trying
to stick to observable effects for programs, and to avoid talking
about implementations. So they leave the model, about which they
are asserting constraints, implicit. Until 2.4, that works ok,
because the model is quite simple. But in 2.4, they reveal the
store queues. So we learn that the model is actually more
complicated, and that the preceding rules don't quite fit it.
IMHO, they would be better off defining their model up-front.
With the model defined, 2.1 - 2.3 could explain precisely how
that model is constrained. Yes, it might smell like an overview
of an implementation, but then, 2.4 smells like an implementation
anyway.
(And an appendix with the rules restated in an appropriate
formalism would be icing on the cake!)
Just so people are aware, AMD's most recent edition of the AMD64
Architecture Programmer's Manual (rev 3.13, Vol 2, section 7.2,
available at http://developer.amd.com/devguides.jsp) has similar
clarifications on memory ordering for Opteron systems. (We realized
that our documentation was somewhat lacking as well, and that people
really do need to know this stuff... :) )
not officially speaking on behalf of AMD
[The moderator is still out flying and has only briefly stopped
in Aurora OR for a test flight. Expect delays until Tuesday.]
I haven't found anything surprising in this white-paper - though the
existence of the white-paper is certainly a welcome thing (thanks for
posting it here). 2.4 looks like a special case of 2.3 to me...
We have a concrete explanation of Intel's memory model. No more debates!
Intel's, perhaps. But there's AMD to consider as well. The AMD64
docs suggest that independent loads can be reordered. So although the
Intel whitepaper says that the Intel64 will not do this, it's not a
terribly useful guarantee unless AMD follows suit. As it is, I'm not
terribly keen on using synchronization methods simply to guarantee
acquire behavior for loads, but that's what seems to be necessary at
the moment.
I take this back. Looking at the latest docs, it appears that the
fence instructions are now usable for ordering all accesses rather
than just those from streaming ops. I don't really mind an occasional
fence, it was the "lock cas" for a load-acquire that bothered me.
[C.p. Moderator: you take that back? You take that BACK?
Does this mean that I now have to issue a Cancel signal?]
s/I take this back. L/Actually, l
[Mod: 8^) ]
[C.p. Mod.: OK guys, humor is OK, but keep the attribution down and trim
your quotes.]
Section 7.2 of the AMD64 Architecture Programmer's Manual, Volume 2:
System Programming, Rev 3.13, suggests that loads can't be reordered on
AMD either:
"Loads do not pass previous loads (loads are not re-ordered). Stores do not
pass previous stores (stores are not re-ordered)"
It looks to me like the AMD memory ordering is pretty much the same as the
Intel one.
Just Software Solutions Ltd - http://www.justsoftwaresolutions.co.uk
Registered in England, Company Number 5478976.
Registered Office: 15 Carrallack Mews, St Just, Cornwall, TR19 7UL
> Suggests that loads can't be reordered on AMD either:
> "Loads do not pass previous loads (loads are not re-ordered). Stores do not
> pass previous stores (stores are not re-ordered)"
> It looks to me like the AMD memory ordering is pretty much the same as the
> Intel one.
According to Paul E. McKenney:
(see Table 1)
AMD x64 and Intel x86-64 are the same.
That's good to hear. For what it's worth, the clauses regarding out-
of-order reads are what confused me. Specifically 7.1.1: "Out-of-
order reads are allowed to the extent that they can be performed
transparently to software, such that the appearance of in-order
execution is maintained." Since this clause is in the single-
processor section, I assumed it implied that stores may be observed
out of order in a multiprocessor environment. Upon re-reading
however, I suppose this is the relevant clause (in section 7.2):
"Stores from a processor appear to be committed to the memory system
in program order." That's as close as I can come to a statement that
loads may not be reordered.
I don't think this particular clause applies. It is preceded by the
statement: "All loads, stores and I/O operations from a single
processor appear to occur in program order /to the code running on
that processor... in this context/." (emphasis mine). However, a
later clause in a different section states that "Stores from a
processor appear to be committed to the memory system in program
order," which I believe does effectively cover load reordering.
Something a bit more explicit would have been preferable, but it'll
do.
Upon further reflection, I'm not sure if the "Stores from a processor
appear to be committed to the memory system in program order" clause
is sufficient to imply that load reordering does not occur. It
certainly suggests that stores are not reordered, but an alternate
interpretation could be that the "memory system" observes the stores
in order, but a CPU may still execute loads out-of-order. I apologize
for quibbling over semantics, but the meaning of "appear" is not
entirely clear in the statement quoted above. I will trust that Intel
and AMD are of a mind to provide a common memory model, but it would
be nice if the AMD spec were a bit clearer on this point.
> According to Paul E. McKenney:http://www.rdrop.com/users/paulmck/scalability/paper/ordering.2006.08...
> (see Table 1)
> AMD x64 and Intel x86-64 is the same.
I don't see x86-64 mentioned anywhere in that article. Also, Table 1
has a 'Y' in the AMD64 box under "loads reordered after loads." And
to support this interpretation, Paul has stated in the C++ memory
model discussions that loads may be reordered on AMD64. Therefore, I
am hoping that his article and comments are out of date and that this
is no longer possible. I don't suppose someone can clarify?
The only case I can think of where load reordering would matter is if a later
load could see an earlier write. I think the first example in 7.2 has this
covered for a single processor writing --- the load of A MUST see the store to
A if the preceding load of B has seen the store to B. For dependent stores
across multiple processors, this is covered by the example on page 164:
"Dependent stores between different processors appear to occur in program
Any other load reordering is not detectable, since there is no guarantee of
the visibility of the stores anyway.
>> According to Paul E. McKenney:http://www.rdrop.com/users/paulmck/scalability/paper/ordering.2006.08...
>> (see Table 1)
>> AMD x64 and Intel x86-64 is the same.
> I don't see x86-64 mentioned anywhere in that article. Also, Table 1
> has a 'Y' in the AMD64 box under "loads reordered after loads." And
> to support this interpretation, Paul has stated in the C++ memory
> model discussions that loads may be reordered on AMD64. Therefore, I
> am hoping that his article and comments are out of date and that this
> is no longer possible. I don't suppose someone can clarify?
Well, Paul's article is dated August 2006, which is almost a year before the
new specs were published.
Good point. I guess load reordering doesn't really matter so long as
stores are properly ordered.
> For dependent stores
> across multiple processors, this is covered by the example on page 164:
> "Dependent stores between different processors appear to occur in program
> order."
This was encouraging to see. It eliminates the situation which has
been discussed here in the past:
x = y = 0
// Thread A
x = 1
// Thread B
if( x == 1 ) y = 1
// Thread C
if( y == 1 ) assert( x == 1 ); // previously could fail
What are we to take away from the following quote:
"This document contains information which Intel may change at any time
without notice. Do
not finalize a design with this information."
CYA in action.
In this case, it simply means that you should use CPUID to verify that
you're running on a cpu which existed at the time this document was published.
On anything newer, a warning (at installation time?) _might_ be in
order.
> Oh shi%t!
[Moderator: please try to write more lines than you attribute.]