Does anyone have a readable document, or alternatively source which
demonstrates when and where sfence, lfence and mfence instructions are
required for programming atomic operations on P4 and 686?
I have heard conflicting reports that the fence instructions are not
required on SMP P4, but I doubt this. Information on this seems to be
pretty scarce and any help would be very welcome.
Thanks.
Chris
FWIW, here is my "current" take on the x86:
http://groups.google.com/group/comp.programming.threads/msg/68ba70e66d6b6ee9?hl=en
This brief description seems to cover x86 and UltraSPARC T1 TSO. That is,
every explicit memory barrier operation is a nop, except #StoreLoad... A store
followed by a load from a different location can be reordered on x86 or
SPARC V9...
My experimental implementation of Peterson's Algorithm demonstrates the need
for a #StoreLoad barrier on x86:
Notice how there is no explicit barrier for the "unlock" functions... Again,
this is because "current" x86 stores automatically take care of #LoadStore
dependences...
http://groups.google.com/group/comp.programming.threads/msg/ca2f1af4552233df
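For readers who cannot follow the link, here is a minimal C++ sketch of the same idea. It is not the code from the linked post: the names PetersonLock and run_peterson are made up for illustration, and it assumes a C++11 compiler with std::atomic. The lock side uses sequentially consistent stores and loads, which is where the #StoreLoad ordering comes from (on x86 a compiler typically emits xchg, or mov followed by mfence, for such a store). The unlock side is only a release store.

```cpp
#include <atomic>
#include <thread>

// Two-thread Peterson lock. All names are illustrative assumptions,
// not taken from the linked post.
class PetersonLock {
    std::atomic<bool> flag_[2];
    std::atomic<int> turn_;

public:
    PetersonLock() : turn_(0) {
        flag_[0].store(false);
        flag_[1].store(false);
    }

    void lock(int self) {
        const int other = 1 - self;
        // seq_cst stores followed by seq_cst loads: the loads below may
        // not pass these stores. This is the #StoreLoad ordering.
        flag_[self].store(true, std::memory_order_seq_cst);
        turn_.store(other, std::memory_order_seq_cst);
        while (flag_[other].load(std::memory_order_seq_cst) &&
               turn_.load(std::memory_order_seq_cst) == other) {
            std::this_thread::yield();  // be polite while spinning
        }
    }

    void unlock(int self) {
        // Release store only; no #StoreLoad is requested on the way out.
        flag_[self].store(false, std::memory_order_release);
    }
};

// Hypothetical driver: two threads increment a plain counter under the lock
// and the final count is returned.
inline long run_peterson(int iterations) {
    PetersonLock lock;
    long counter = 0;
    auto worker = [&](int self) {
        for (int i = 0; i < iterations; ++i) {
            lock.lock(self);
            ++counter;
            lock.unlock(self);
        }
    };
    std::thread t0(worker, 0);
    std::thread t1(worker, 1);
    t0.join();
    t1.join();
    return counter;
}
```

Weakening the two stores in lock() to relaxed or release is the experiment to try if you want to see the algorithm break; as far as I can tell the C++ model no longer guarantees mutual exclusion in that case.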
That was a trick to exploit the fact that in TSO model, stores are
"basically" equivalent to:
1. #LoadStore|#StoreStore > Release barrier
2. Perform the actual store
http://groups.google.com/group/comp.programming.threads/browse_frm/thread/0e07adf138f0091d/ca2f1af4552233df?#ca2f1af4552233df
(read all of this)
!!> Please note that Intel explicitly states that these rules may not hold
true for "future" x86 memory models... So always have a "backup" plan that
uses the lfence, sfence, and mfence instructions in the "correct" places...
http://appcore.home.comcast.net/
http://appcore.home.comcast.net/appcore/src/cpu/i686/ac_i686_gcc_asm.html
Here is my implementation of a "simple" x86 assembly based atomic operations
abstraction... This code uses mfence, so you may have to change this if your
processor doesn't support SSE2, IIRC...
I found this article pretty good:
<http://www.linuxjournal.com/article/8211>
Relaxed memory modes aside for a moment, IA32's alleged memory model is
processor consistency (see below). But the official documentation sucks miserably.
>
> I found this article pretty good:
> <http://www.linuxjournal.com/article/8211>
I find it pretty bad. A much better 'article' is this:
http://citeseer.ist.psu.edu/gharachorloo95memory.html
regards,
alexander.
# acquire & release semantics ( st.rel / ld.acq ) are held for loads and
# stores to the same locations; you sometimes have to use full membars (
# mfence, LOCK prefix ) to enforce the ordering of a "critical-sequence" of
# loads/stores to different locations on IA-32.
Where did you get this from (specifically the part after the semicolon)?
AFAICS, with the usual caveats (ordinary loads/stores, cached memory), the Intel
and AMD arch manuals don't say anything that would make the memory ordering of
loads/stores to different locations less strict than the ordering for a single
location.
--
David Hopwood <david.nosp...@blueyonder.co.uk>
# 2. An operation is reordered with a store only if the operation accesses
# a different location than does the store.
This is plainly wrong. P4 arch manual volume 3 section 7.2.2:
| [...] enhancements in the Pentium 4, Intel Xeon, and P6 family processors
| are:
[...]
| * Store-buffer forwarding, when a read passes a write to the same memory
| location.
Unfortunately, when an article gets something as basic as this wrong, you
have to treat the whole article as unreliable, no matter how plausible it
may otherwise seem.
--
David Hopwood <david.nosp...@blueyonder.co.uk>
x=1;
if(x!=1)
throw memory_error;
could fire (yes, even single-threaded).
Regards,
Chris Noonan
http://groups.google.com/group/comp.programming.threads/msg/2f2ec4c60b6a5d7c
http://groups.google.com/group/comp.programming.threads/msg/d4f9b41bbadbff95
Critical sections can "overlap", so the acquire portion of "one" mutex can
rise above the release portion of "another"....
In addition to this... IIRC, I think there is some reference(s) to this
specific aspect of x86 mem-model... It should be related to Linux, or
something, and something called "plan 9", IIRC..Humm .. We should ask Alex:
Any reference in Linux kernel about x86 and the ordering of
"load-after-store"?
IMHO, x86 does NOT automatically take care of #StoreLoad dependences;
therefore, you need mfence or LOCK for things like mutexes, etc...
I think we need to ask Andy about this "specific" question... I "think" I
may be "correct" on this one. The "fact" that you have to use a barrier for
the "lock" portion of a mutex on x86 seems to support my opinion:
> AFAICS, with the usual caveats (ordinary loads/stores, cached memory), the
> Intel and AMD arch manuals don't say anything that would make the memory
> ordering of loads/stores to different locations less strict than the
> ordering for a single location.
Humm... To be more precise:
we are discussing the ordering of stores-to-loads; ie #StoreLoad
dependencies, stores "visible" before loads to other locations "applied"...
In my interpretation of the documents, Yikes!, x86 is perfectly free to
optimize this any which way it wants to...
With respect to the ordering of loads-to-"stores"; #LoadStore dependencies,
loads "applied" before stores to other locations made "visible"... The x86
and UltraSPARC T1, seem to honor this...
Therefore IMO for x86 and for UltraSPARC T1:
- the "lock" portion of a mutex, which has an inherent #StoreLoad dependency
has to use an explicit #StoreLoad barrier...
- the "unlock" portion of a mutex, which has an inherent #LoadStore
dependency does "not" have to use an "explicit" #LoadStore barrier...
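A rough C++ sketch of that split, under my own assumptions: SpinLock and its members are hypothetical names, not code from AppCore. The C++ model itself only asks for acquire on the lock side. On x86 the exchange compiles to xchg, which carries an implicit LOCK and so acts as the full barrier being discussed, while the unlock is an ordinary store with release ordering.

```cpp
#include <atomic>

// Minimal test-and-set spinlock. Hypothetical names, for illustration only.
class SpinLock {
    std::atomic<bool> locked_{false};

public:
    // Returns true if the lock was free and is now held by the caller.
    bool try_lock() {
        // Atomic read-modify-write. On x86 this is a locked instruction,
        // which also orders earlier stores before later loads.
        return !locked_.exchange(true, std::memory_order_acquire);
    }

    void lock() {
        while (!try_lock()) {
            // spin until the holder releases the lock
        }
    }

    void unlock() {
        // Plain store with release ordering: no explicit fence instruction
        // is needed for this on x86.
        locked_.store(false, std::memory_order_release);
    }
};
```

Usage is the usual lock(), critical section, unlock() pattern; try_lock() is only there so the behaviour can be checked from a single thread.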
I don't see the relevance of these posts. Let me rephrase the question to be
more specific:
Yes, on x86 for write-back cached memory, ordinary loads have acquire semantics
and ordinary stores have release semantics. But this does not depend on whether
the loads and stores are to the same location. AFAIU, each acquire is a *global*
acquire barrier, and each release is a *global* release barrier.
[...]
> In addition to this... IIRC, I think there is some reference(s) to this
> specific aspect of x86 mem-model... It should be related to Linux, or
> something, and something called "plan 9", IIRC..Humm .. We should ask Alex:
> Any reference in Linux kernel about x86 and the ordering of
> "load-after-store"?
>
> IMHO, x86 does NOT automatically take care of #StoreLoad dependences;
> therefore, you need mfence or LOCK for things like mutexes, etc...
Yes, but that's not what I was asking. Sorry if I was unclear.
--
David Hopwood <david.nosp...@blueyonder.co.uk>
Oh, now that you've got us started, expect another 20 posts or more... :-)
(The Intel x86 docs on memory ordering are really bad, which is why no-one
here can agree on what they say.)
--
David Hopwood <david.nosp...@blueyonder.co.uk>
Humm... I really don't believe that to be completely true because on x86...
Well...
Take the following sequence:
1: X.Store.Release
2: Y.Load.Acquire
IMO, "Does not prevent" x86 from reordering it into:
2: Y.Load.Acquire
1: X.Store.Release
So the release barrier associated with the store to location X does not
affect the load from location Y... In order to force the processor to honor
the sequence you need to add a barrier before the load to location Y.
1: X.Store.Release
2: membar #StoreLoad|#StoreStore
3: Y.Load.Acquire
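The three-step sequence can be tried out as the classic "store buffering" test. This is only a sketch in C++11 atomics, with hypothetical names (SbResult, store_buffer_trial, count_both_zero). A seq_cst fence stands in for membar #StoreLoad, and on x86 it is usually compiled to mfence or a locked instruction. Each thread stores to its own variable and then loads the other one; with the fence in place, the two loads should never both return 0.

```cpp
#include <atomic>
#include <thread>

struct SbResult {
    int r0;  // what thread 0 read from y
    int r1;  // what thread 1 read from x
};

// One run of the test: each thread does store, full fence, load.
inline SbResult store_buffer_trial() {
    std::atomic<int> x{0};
    std::atomic<int> y{0};
    int r0 = -1;
    int r1 = -1;

    std::thread t0([&] {
        x.store(1, std::memory_order_release);                // 1: X.Store.Release
        std::atomic_thread_fence(std::memory_order_seq_cst);  // 2: #StoreLoad
        r0 = y.load(std::memory_order_acquire);               // 3: Y.Load.Acquire
    });
    std::thread t1([&] {
        y.store(1, std::memory_order_release);
        std::atomic_thread_fence(std::memory_order_seq_cst);
        r1 = x.load(std::memory_order_acquire);
    });
    t0.join();
    t1.join();
    return SbResult{r0, r1};
}

// Counts how many trials ended with both loads seeing 0, the outcome the
// fence is supposed to rule out.
inline int count_both_zero(int trials) {
    int hits = 0;
    for (int i = 0; i < trials; ++i) {
        SbResult r = store_buffer_trial();
        if (r.r0 == 0 && r.r1 == 0) {
            ++hits;
        }
    }
    return hits;
}
```

Removing the two fences leaves only release/acquire, and then the "both zero" outcome is permitted. Whether you actually observe it depends on timing, so a clean run without the fences proves nothing.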
A "concrete example" of this comes in the implementation of SMR on x86...
In particular, the code which allows you to acquire a hazard pointer... Here is
my implementation of such code:
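.align 16
.globl ac_i686_lfgc_smr_activate
ac_i686_lfgc_smr_activate:
movl 4(%esp), %edx    # edx = 1st arg: where the hazard pointer gets stored
movl 8(%esp), %ecx    # ecx = 2nd arg: address of the shared pointer
ac_i686_lfgc_smr_activate_reload:
movl (%ecx), %eax     # load the current value of the shared pointer
movl %eax, (%edx)     # publish it as the hazard pointer (the store)
mfence                # #StoreLoad: store must be visible before the re-load
cmpl (%ecx), %eax     # re-load the shared pointer and compare
jne ac_i686_lfgc_smr_activate_reload    # it changed in between: retry
ret                   # eax = the pointer that is now protected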
There is a nasty #StoreLoad dependency inherent in hazard pointers;
therefore, the mfence instruction is simply "required" here on x86, unless
you know all of the tricks (rcu-smr, hint, hint), but that is another topic
altogether... I am sorry if this response is off-topic wrt your questions...
I think I will ask Andy about this over in comp.arch...
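For comparison, roughly the same loop written with C++11 atomics. Treat it as a sketch: smr_activate, hazard_slot and shared are names I picked for illustration, and the seq_cst fence is assumed to play the role of the mfence in the assembly above.

```cpp
#include <atomic>

// Publish the current value of `shared` in `hazard_slot` and return it once
// a re-read confirms that `shared` did not change in between.
template <typename T>
T* smr_activate(std::atomic<T*>& hazard_slot, const std::atomic<T*>& shared) {
    for (;;) {
        T* p = shared.load(std::memory_order_relaxed);        // movl (%ecx), %eax
        hazard_slot.store(p, std::memory_order_relaxed);      // movl %eax, (%edx)
        std::atomic_thread_fence(std::memory_order_seq_cst);  // mfence
        if (shared.load(std::memory_order_acquire) == p) {    // cmpl (%ecx), %eax
            return p;                                         // ret
        }
        // shared changed after the hazard pointer was published: retry
    }
}
```

The sketch covers only the reader side. As far as I understand it, the reclaiming side needs its own full barrier between unlinking a node and scanning the hazard slots, otherwise the fence here has nothing to pair with.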
Right, but that's what you would expect.
A "store with release semantics" is
#LoadStore | #StoreStore;
store
A "load with acquire semantics" is
load;
#LoadLoad | #StoreStore
So the sequence above is equivalent to
#LoadStore | #StoreStore;
store X;
load Y;
#LoadLoad | #StoreStore
which obviously does not prevent the reordering. It has nothing to do with
the load and store being to different locations.
--
David Hopwood <david.nosp...@blueyonder.co.uk>
I see...
Okay... Yes. I now believe you are correct... I was looking at it from a
different point of view... However, it still seems like it helps me "sketch
out" where to place all of the memory barriers in my algorithms because
every time one of them uses logic that includes a 'store followed by a load
to another location' it immediately raises a "red flag" in my mind. It
forces me to find out if "any" part of the flagged algorithm can cope with
this "specific" reordering; some parts can, some cannot... SMR hazard
pointer is an example of an algorithm that cannot cope with the reordering,
and needs a barrier on x86 and UltraSPARC T1...
Humm, I guess I may be unknowingly "suffering" from a personal problem that
seems to consist of frequently applying my own "personal lexicon" in
"public" forums... Sorry for any confusion....
;)
Humm, I guess that means it's time for me to write a little Dictionary, or
something along those lines... lol...
Thanks David.
:)
Sorry, cut-and-paste error. That should be #LoadLoad | #LoadStore.
> So the sequence above is equivalent to
>
> #LoadStore | #StoreStore;
> store X;
> load Y;
> #LoadLoad | #StoreStore
^^^^^^^^^^^
Same here.
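With the acquire side read as #LoadLoad | #LoadStore, the two definitions map fairly directly onto C++11 fences. The following is a sketch only: message_pass is a hypothetical name, and the correspondence between C++ fences and SPARC-style membars is approximate rather than exact. A release fence before a relaxed store plays the part of the "store with release semantics", and an acquire fence after a relaxed load plays the part of the "load with acquire semantics".

```cpp
#include <atomic>
#include <thread>

// Passes a plain int from one thread to another using fence-based
// release/acquire, and returns what the consumer saw.
inline int message_pass() {
    int payload = 0;
    std::atomic<bool> ready{false};
    int seen = 0;

    std::thread producer([&] {
        payload = 42;
        // roughly #LoadStore | #StoreStore, then the store
        std::atomic_thread_fence(std::memory_order_release);
        ready.store(true, std::memory_order_relaxed);
    });
    std::thread consumer([&] {
        // the load, then roughly #LoadLoad | #LoadStore
        while (!ready.load(std::memory_order_relaxed)) {
            std::this_thread::yield();
        }
        std::atomic_thread_fence(std::memory_order_acquire);
        seen = payload;
    });
    producer.join();
    consumer.join();
    return seen;
}
```

Neither fence orders an earlier store against a later load, which is the point made above: store X followed by load Y can still be reordered unless a full barrier sits between them.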
[...]
http://groups.google.com/group/comp.programming.threads/msg/8ae09f9e9bea21b9
This shows where to put the barriers in relationship to both loads and
stores...
What does the SFENCE instruction embedded in that code do?
Here is the specification of SFENCE from the Intel manual:
"Performs a serializing operation on all store-to-memory instructions
that were issued prior the SFENCE instruction. This serializing operation
guarantees that every store instruction that precedes in program order
the SFENCE instruction is globally visible before any store instruction
that follows the SFENCE instruction is globally visible. The SFENCE
instruction is ordered with respect store instructions, other SFENCE
instructions, any MFENCE instructions, and any serializing instructions
(such as the CPUID instruction). It is not ordered with respect to load
instructions or the LFENCE instruction."
But on Pentium processors, writes become visible anyway in strict
program order (at least with the normal type of memory). What effect
does SFENCE have?
Chris
Yes:
http://groups.google.com/group/comp.programming.threads/msg/f2c59ced973e75dd
I guess I should have clarified that it shows exactly where to place the
lfence and sfence barriers "if" they were indeed required on a "future"
x86... Or, I guess I should have posted more links to the discussion which
covers all of this...
;)
Did you mean #LoadLoad | #LoadStore?
Herb
---
Herb Sutter (www.gotw.ca) (www.pluralsight.com/blogs/hsutter)
Convener, ISO WG21 (C++ standards committee) (www.gotw.ca/iso)
Architect, Developer Division, Microsoft (www.gotw.ca/microsoft)
Chris.
He posted a correction here:
http://groups.google.com/group/comp.programming.threads/msg/f3719d477431a942?hl=en
Yes. (I posted a followup correcting this, but you may not have seen it.)
--
David Hopwood <david.nosp...@blueyonder.co.uk>
It devastates the pipeline and therefore, destroys performance.