Memory fence instructions on x86

PillMonsta

Jul 24, 2006, 5:40:40 PM

Dear all,

Does anyone have a readable document, or alternatively source which
demonstrates when and where sfence, lfence and mfence instructions are
required for programming atomic operations on P4 and 686?
I have heard conflicting reports that the fence instructions are not
required on SMP P4, but I doubt this. Information on this seems to be
pretty scarce and any help would be very welcome.

Thanks.

Chris

Chris Thomasson

Jul 24, 2006, 9:13:49 PM

"PillMonsta" <ch...@chrisbird.com> wrote in message
news:1153777240....@h48g2000cwc.googlegroups.com...

FWIW, here is my "current" take on the x86:

http://groups.google.com/group/comp.programming.threads/msg/68ba70e66d6b6ee9?hl=en

This brief description seems to cover x86 and UltraSPARC T1 TSO. That is,
every explicit memory barrier operation is a nop, except #StoreLoad... A store
followed by a load from a different location can be reordered on x86 or SPARC V9...

My experimental implementation of Peterson's Algorithm demonstrates the need
for a #StoreLoad barrier on x86:

http://groups.google.com/group/comp.programming.threads/browse_frm/thread/c49c0658e2607317/1e45b4b16bad9784?lnk=gst&q=chris+thomasson+peterson&rnum=1#1e45b4b16bad9784
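For readers who cannot reach the linked post, here is a minimal C++11 sketch of Peterson's algorithm for two threads. It is not the code from that post; `PetersonLock` and `run_peterson` are hypothetical names. The entry protocol uses the default `memory_order_seq_cst`, which x86 compilers typically implement with `xchg` (or `mov` plus `mfence`) for stores, so the store-then-load ordering is enforced. The unlock is a plain release store.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// Hypothetical two-thread Peterson lock; thread ids are 0 and 1.
struct PetersonLock {
    std::atomic<bool> flag[2];
    std::atomic<int> turn;

    PetersonLock() : turn(0) {
        flag[0].store(false);
        flag[1].store(false);
    }

    void lock(int me) {
        assert(me == 0 || me == 1);
        const int other = 1 - me;
        flag[me].store(true);   // seq_cst store: announce interest
        turn.store(other);      // seq_cst store: give way to the other thread
        // The loads below must not be satisfied before the stores above are
        // visible. This is the store-then-load (#StoreLoad) requirement.
        while (flag[other].load() && turn.load() == other) {
            std::this_thread::yield();
        }
    }

    void unlock(int me) {
        // Release store only; no additional barrier is requested here.
        flag[me].store(false, std::memory_order_release);
    }
};

// Two threads increment a plain counter under the lock.
long run_peterson(long iterations) {
    PetersonLock lock;
    long counter = 0;
    auto worker = [&](int me) {
        for (long i = 0; i < iterations; ++i) {
            lock.lock(me);
            ++counter;
            lock.unlock(me);
        }
    };
    std::thread t0(worker, 0);
    std::thread t1(worker, 1);
    t0.join();
    t1.join();
    return counter;
}
```

If the two stores in `lock` were weakened to release stores with no fence, the loads could be satisfied early and both threads could enter the critical section.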

Notice how there is no explicit barrier for the "unlock" functions... Again,
this is because "current" x86 stores automatically take care of #LoadStore
dependences...

http://groups.google.com/group/comp.programming.threads/msg/ca2f1af4552233df

That was a trick to exploit the fact that in TSO model, stores are
"basically" equivalent to:

1. #LoadStore|#StoreStore > Release barrier
2. Perform the actual store
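In C++11 terms this is roughly a release store paired with an acquire load. A minimal sketch, assuming hypothetical names (`publish_and_consume`, `payload`, `ready`); on x86 with ordinary write-back memory, both operations are expected to compile to plain `mov` instructions with no fence:

```cpp
#include <atomic>
#include <thread>

// The producer writes plain data and then publishes it with a release store.
// The consumer spins on an acquire load and then reads the data.
int publish_and_consume() {
    int payload = 0;                    // ordinary, non-atomic data
    std::atomic<bool> ready{false};
    int seen = -1;

    std::thread producer([&] {
        payload = 42;                                   // earlier store
        ready.store(true, std::memory_order_release);   // cannot move above it
    });
    std::thread consumer([&] {
        while (!ready.load(std::memory_order_acquire)) {
            std::this_thread::yield();
        }
        seen = payload;                 // guaranteed to observe 42
    });

    producer.join();
    consumer.join();
    return seen;
}
```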

http://groups.google.com/group/comp.programming.threads/browse_frm/thread/0e07adf138f0091d/ca2f1af4552233df?#ca2f1af4552233df
(read all of this)

!!> Please note that Intel explicitly states that these rules may not hold
true for "future" x86 memory models... So always have a "backup" plan that
uses the lfence, sfence, and mfence instructions in the "correct" places...

http://appcore.home.comcast.net/
http://appcore.home.comcast.net/appcore/src/cpu/i686/ac_i686_gcc_asm.html

Here is my implementation of a "simple" x86 assembly based atomic operations
abstraction... This code uses mfence, so you may have to change this if your
processor doesn't support SSE2, IIRC...

Clifford Heath

Jul 25, 2006, 4:06:32 AM

PillMonsta wrote:
> Does anyone have a readable document, or alternatively source which
> demonstrates when and where sfence, lfence and mfence instructions are
> required for programming atomic operations on P4 and 686?

I found this article pretty good:
<http://www.linuxjournal.com/article/8211>

Alexander Terekhov

Jul 25, 2006, 9:24:14 AM

Clifford Heath wrote:
>
> PillMonsta wrote:
> > Does anyone have a readable document, or alternatively source which
> > demonstrates when and where sfence, lfence and mfence instructions are
> > required for programming atomic operations on P4 and 686?

Relaxed memory modes aside for a moment, IA32's alleged memory model is
processor consistency (see below). But official docu sucks miserably.

>
> I found this article pretty good:
> <http://www.linuxjournal.com/article/8211>

I find it pretty bad. A much better 'article' is this:

http://citeseer.ist.psu.edu/gharachorloo95memory.html

regards,
alexander.

David Hopwood

Jul 25, 2006, 9:52:42 AM

Chris Thomasson wrote:

> "PillMonsta" <ch...@chrisbird.com> wrote:
>
>>Dear all,
>>
>>Does anyone have a readable document, or alternatively source which
>>demonstrates when and where sfence, lfence and mfence instructions are
>>required for programming atomic operations on P4 and 686?
>>I have heard conflicting reports that the fence instructions are not
>>required on SMP P4, but I doubt this. Information on this seems to be
>>pretty scarce and any help would be very welcome.
>

> FWIW, here is my "current" take on the x86:
>
> http://groups.google.com/group/comp.programming.threads/msg/68ba70e66d6b6ee9?hl=en

# acquire & release semantics ( st.rel / ld.acq ) are held for loads and
# stores to the same locations; you sometimes have to use full membars (
# mfence, LOCK prefix ) to enforce the ordering of a "critical-sequence" of
# loads/stores to different locations on IA-32.

Where did you get this from (specifically the part after the semicolon)?

AFAICS, with the usual caveats (ordinary loads/stores, cached memory), the Intel
and AMD arch manuals don't say anything that would make the memory ordering of
loads/stores to different locations less strict than the ordering for a single
location.

--
David Hopwood <david.nosp...@blueyonder.co.uk>

David Hopwood

Jul 25, 2006, 9:52:45 AM

# 2. An operation is reordered with a store only if the operation accesses
# a different location than does the store.

This is plainly wrong. P4 arch manual volume 3 section 7.2.2:

| [...] enhancements in the Pentium 4, Intel Xeon, and P6 family processors
| are:
[...]
| * Store-buffer forwarding, when a read passes a write to the same memory
| location.

Unfortunately, when an article gets something as basic as this wrong, you
have to treat the whole article as unreliable, no matter how plausible it
may otherwise seem.

--
David Hopwood <david.nosp...@blueyonder.co.uk>

chris noonan

Jul 25, 2006, 2:12:46 PM

David Hopwood wrote:

> Clifford Heath wrote:
> # 2. An operation is reordered with a store only if the operation accesses
> # a different location than does the store.
>
> This is plainly wrong. P4 arch manual volume 3 section 7.2.2:
>
> | [...] enhancements in the Pentium 4, Intel Xeon, and P6 family processors
> | are:
> [...]
> | * Store-buffer forwarding, when a read passes a write to the same memory
> | location.
>

Perhaps that asterisked sentence would be better written:
"*Store-buffer forwarding, when a read would otherwise pass a write
to the same memory location."

The Intel documents explain Pentium memory ordering in a certain
stylised way; a good way IMO, but the words used can be misleading.
A read cannot be allowed to *effectively* pass a write to the same
location, otherwise the fragment:

    x=1;
    if(x!=1)
        throw memory_error;

could fire (yes, even single-threaded).

Regards,
Chris Noonan

PillMonsta

Jul 25, 2006, 4:49:06 PM

Thanks guys, that gives me enough reading for the moment...

Chris Thomasson

Jul 25, 2006, 5:15:23 PM

"David Hopwood" <david.nosp...@blueyonder.co.uk> wrote in message
news:Kmpxg.17634$b9....@fe1.news.blueyonder.co.uk...

> Chris Thomasson wrote:
>> "PillMonsta" <ch...@chrisbird.com> wrote:
>>
>>>Dear all,
>>>
>>>Does anyone have a readable document, or alternatively source which
>>>demonstrates when and where sfence, lfence and mfence instructions are
>>>required for programming atomic operations on P4 and 686?
>>>I have heard conflicting reports that the fence instructions are not
>>>required on SMP P4, but I doubt this. Information on this seems to be
>>>pretty scarce and any help would be very welcome.
>>
>> FWIW, here is my "current" take on the x86:
>>
>> http://groups.google.com/group/comp.programming.threads/msg/68ba70e66d6b6ee9?hl=en
>
> # acquire & release semantics ( st.rel / ld.acq ) are held for loads and
> # stores to the same locations; you sometimes have to use full membars (
> # mfence, LOCK prefix ) to enforce the ordering of a "critical-sequence" of
> # loads/stores to different locations on IA-32.
>
> Where did you get this from (specifically the part after the semicolon)?

http://groups.google.com/group/comp.programming.threads/msg/2f2ec4c60b6a5d7c

http://groups.google.com/group/comp.programming.threads/msg/d4f9b41bbadbff95

Critical sections can "overlap", so the acquire portion of "one" mutex can
rise above the release portion of "another"....

In addition to this... IIRC, I think there is some reference(s) to this
specific aspect of x86 mem-model... It should be related to Linux, or
something, and something called "plan 9", IIRC..Humm .. We should ask Alex:
Any reference in Linux kernel about x86 and the ordering of
"load-after-store"?

IMHO, x86 does NOT automatically take care of #StoreLoad dependences;
therefore, you need mfence or LOCK for things like mutexes, etc...

I think we need to ask Andy about this "specific" question... I "think" I
may be "correct" on this one. The "fact" that you have to use a barrier for
the "lock" portion of a mutex on x86 seems to support my opinion:

> AFAICS, with the usual caveats (ordinary loads/stores, cached memory), the
> Intel and AMD arch manuals don't say anything that would make the memory
> ordering of loads/stores to different locations less strict than the
> ordering for a single location.

Humm... To be more precise:

we are discussing the ordering of stores-to-loads; ie #StoreLoad
dependencies, stores "visible" before loads to other locations "applied"...
In my interpretation of the documents, Yikes!, x86 is perfectly free to
optimize this any which way it wants to...

With respect to the ordering of loads-to-"stores"; #LoadStore dependencies,
loads "applied" before stores to other locations made "visible"... The x86
and UltraSPARC T1, seem to honor this...

Therefore IMO for x86 and for UltraSPARC T1:

- the "lock" portion of a mutex, which has an inherent #StoreLoad dependency
has to use an explicit #StoreLoad barrier...

- the "unlock" portion of a mutex, which has an inherent #LoadStore
dependency does "not" have to use an "explicit" #LoadStore barrier...
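The lock/unlock asymmetry described above can be sketched with a simple spinlock. This is an illustration under stated assumptions, not code from this thread: `SpinLock` and `run_spinlock` are hypothetical names. On x86 the `exchange` is expected to compile to `xchg`, which is implicitly locked and acts as a full barrier, while the release store in `unlock` is expected to be a plain `mov`.

```cpp
#include <atomic>
#include <thread>
#include <vector>

class SpinLock {
    std::atomic<bool> locked_{false};
public:
    void lock() {
        // Atomic read-modify-write. On x86 this is a locked instruction,
        // so it also provides the full barrier on the "lock" side.
        while (locked_.exchange(true, std::memory_order_acquire)) {
            std::this_thread::yield();
        }
    }
    void unlock() {
        // Release store: earlier loads and stores may not sink below it.
        // No explicit barrier instruction is requested here.
        locked_.store(false, std::memory_order_release);
    }
};

// Several threads increment a plain counter under the lock.
long run_spinlock(int threads, long iterations) {
    SpinLock lock;
    long counter = 0;
    std::vector<std::thread> pool;
    for (int t = 0; t < threads; ++t) {
        pool.emplace_back([&] {
            for (long i = 0; i < iterations; ++i) {
                lock.lock();
                ++counter;
                lock.unlock();
            }
        });
    }
    for (auto& th : pool) {
        th.join();
    }
    return counter;
}
```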

David Hopwood

Jul 25, 2006, 7:41:34 PM

Chris Thomasson wrote:
> "David Hopwood" <david.nosp...@blueyonder.co.uk> wrote:

>>Chris Thomasson wrote:
>>
>>>FWIW, here is my "current" take on the x86:
>>>
>>>http://groups.google.com/group/comp.programming.threads/msg/68ba70e66d6b6ee9?hl=en
>>
>># acquire & release semantics ( st.rel / ld.acq ) are held for loads and
>># stores to the same locations; you sometimes have to use full membars (
>># mfence, LOCK prefix ) to enforce the ordering of a "critical-sequence"
>># of loads/stores to different locations on IA-32.
>>
>>Where did you get this from (specifically the part after the semicolon)?
>
> http://groups.google.com/group/comp.programming.threads/msg/2f2ec4c60b6a5d7c
>
> http://groups.google.com/group/comp.programming.threads/msg/d4f9b41bbadbff95

I don't see the relevance of these posts. Let me rephrase the question to be
more specific:

Yes, on x86 for write-back cached memory, ordinary loads have acquire semantics
and ordinary stores have release semantics. But this does not depend on whether
the loads and stores are to the same location. AFAIU, each acquire is a *global*
acquire barrier, and each release is a *global* release barrier.

[...]

> In addition to this... IIRC, I think there is some reference(s) to this
> specific aspect of x86 mem-model... It should be related to Linux, or
> something, and something called "plan 9", IIRC..Humm .. We should ask Alex:
> Any reference in Linux kernel about x86 and the ordering of
> "load-after-store"?
>
> IMHO, x86 does NOT automatically take care of #StoreLoad dependences;
> therefore, you need mfence or LOCK for things like mutexes, etc...

Yes, but that's not what I was asking. Sorry if I was unclear.

--
David Hopwood <david.nosp...@blueyonder.co.uk>

David Hopwood

Jul 25, 2006, 7:44:22 PM

PillMonsta wrote:
> Thanks guys, that gives me enough reading for the moment...

Oh, now that you've got us started, expect another 20 posts or more... :-)

(The Intel x86 docs on memory ordering are really bad, which is why no-one
here can agree on what they say.)

--
David Hopwood <david.nosp...@blueyonder.co.uk>

Chris Thomasson

Jul 27, 2006, 12:09:42 PM

"David Hopwood" <david.nosp...@blueyonder.co.uk> wrote in message

news:O_xxg.18853$b9.1...@fe1.news.blueyonder.co.uk...

> Chris Thomasson wrote:
>> "David Hopwood" <david.nosp...@blueyonder.co.uk> wrote:
>>>Chris Thomasson wrote:
>>>
>>>>FWIW, here is my "current" take on the x86:
>>>>
>>>>http://groups.google.com/group/comp.programming.threads/msg/68ba70e66d6b6ee9?hl=en
>>>
>>># acquire & release semantics ( st.rel / ld.acq ) are held for loads and
>>># stores to the same locations; you sometimes have to use full membars (
>>># mfence, LOCK prefix ) to enforce the ordering of a "critical-sequence"
>>># of loads/stores to different locations on IA-32.
>>>
>>>Where did you get this from (specifically the part after the semicolon)?
>>
>> http://groups.google.com/group/comp.programming.threads/msg/2f2ec4c60b6a5d7c
>>
>> http://groups.google.com/group/comp.programming.threads/msg/d4f9b41bbadbff95
>
> I don't see the relevance of these posts. Let me rephrase the question to be
> more specific:
>
> Yes, on x86 for write-back cached memory, ordinary loads have acquire
> semantics and ordinary stores have release semantics. But this does not
> depend on whether the loads and stores are to the same location. AFAIU,
> each acquire is a *global* acquire barrier, and each release is a *global*
> release barrier.

Humm... I really don't believe that to be completely true because on x86...
Well...

Take the following sequence:

1: X.Store.Release
2: Y.Load.Acquire

IMO, "Does not prevent" x86 from reordering it into:

2: Y.Load.Acquire
1: X.Store.Release

So the release barrier associated with the store to location X does not
affect the load from location Y... In order to force the processor to honor
the sequence you need to add a barrier before the load from location Y.

1: X.Store.Release
2: membar #StoreLoad|#StoreStore
3: Y.Load.Acquire
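The three-step sequence above corresponds to the classic "store buffering" test. A minimal C++11 sketch, assuming hypothetical names (`count_both_zero`, `x`, `y`, `r0`, `r1`): each thread does a release store, a full fence, then an acquire load of the other location. With the fence in place, the outcome where both threads read 0 is forbidden. On x86 the fence is expected to become `mfence` or a locked instruction.

```cpp
#include <atomic>
#include <thread>

// Returns how many of `rounds` runs ended with both threads reading 0.
int count_both_zero(int rounds) {
    int both_zero = 0;
    for (int i = 0; i < rounds; ++i) {
        std::atomic<int> x{0};
        std::atomic<int> y{0};
        int r0 = -1;
        int r1 = -1;

        std::thread t0([&] {
            x.store(1, std::memory_order_release);                // 1: X.Store.Release
            std::atomic_thread_fence(std::memory_order_seq_cst);  // 2: #StoreLoad
            r0 = y.load(std::memory_order_acquire);               // 3: Y.Load.Acquire
        });
        std::thread t1([&] {
            y.store(1, std::memory_order_release);
            std::atomic_thread_fence(std::memory_order_seq_cst);
            r1 = x.load(std::memory_order_acquire);
        });
        t0.join();
        t1.join();

        if (r0 == 0 && r1 == 0) {
            ++both_zero;
        }
    }
    return both_zero;
}
```

Removing the two fences makes the both-zero outcome legal, and it can be observed on real x86 hardware because each store may still sit in its core's store buffer when the following load executes.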

A "concrete example" of this comes in the implementation of SMR on x86...
Particularly the code which allows you to acquire a hazard pointer... Here is
my implementation of such code:

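        .align 16
        .globl ac_i686_lfgc_smr_activate
ac_i686_lfgc_smr_activate:
        movl 4(%esp), %edx      # arg 1: destination (hazard pointer slot)
        movl 8(%esp), %ecx      # arg 2: address of the shared pointer

ac_i686_lfgc_smr_activate_reload:
        movl (%ecx), %eax       # load the shared pointer
        movl %eax, (%edx)       # store it into the hazard pointer slot
        mfence                  # #StoreLoad: order the store before the reload
        cmpl (%ecx), %eax       # did the shared pointer change meanwhile?
        jne ac_i686_lfgc_smr_activate_reload    # yes: try again
        ret                     # no: the loaded pointer is returned in %eax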

There is a nasty #StoreLoad dependency inherent in hazard pointers;
therefore, the mfence instruction is simply "required" here on x86, unless
you know all of the tricks (rcu-smr, hint, hint) but that is another topic
altogether... I am sorry if this response is off-topic wrt your questions...
I think I will ask Andy about this over in comp.arch...
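For comparison, the same load/publish/fence/recheck loop can be sketched in portable C++11. This is an illustration only, not the AppCore API: `Node`, `smr_activate`, `smr_demo`, `hazard_slot` and `shared` are hypothetical names. The `seq_cst` fence plays the role of the mfence in the assembly above.

```cpp
#include <atomic>

struct Node {
    int value;
};

// Publish a hazard pointer for the object currently referenced by `shared`
// and return it once the publication is known to be consistent.
Node* smr_activate(std::atomic<Node*>& hazard_slot,
                   const std::atomic<Node*>& shared) {
    Node* p = shared.load(std::memory_order_acquire);
    for (;;) {
        hazard_slot.store(p, std::memory_order_relaxed);      // publish
        std::atomic_thread_fence(std::memory_order_seq_cst);  // #StoreLoad
        Node* q = shared.load(std::memory_order_acquire);     // re-check
        if (q == p) {
            return p;   // slot held p while shared still pointed at p
        }
        p = q;          // shared changed: retry with the new value
    }
}

// Single-threaded smoke test of the happy path.
bool smr_demo() {
    Node n{7};
    std::atomic<Node*> shared{&n};
    std::atomic<Node*> slot{nullptr};
    Node* p = smr_activate(slot, shared);
    return p == &n && slot.load() == &n;
}
```

Without the fence, the re-check load could be satisfied before the hazard pointer store becomes visible to other processors, so a reclaiming thread could scan the slots, miss the pointer, and free the node.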

David Hopwood

Jul 27, 2006, 7:51:51 PM

Chris Thomasson wrote:
> "David Hopwood" <david.nosp...@blueyonder.co.uk> wrote:
>
>> Yes, on x86 for write-back cached memory, ordinary loads have acquire
>> semantics and ordinary stores have release semantics. But this does not
>> depend on whether the loads and stores are to the same location. AFAIU,
>> each acquire is a *global* acquire barrier, and each release is a *global*
>> release barrier.
>
> Humm... I really don't believe that to be completely true because on x86...
> Well...
>
> Take the following sequence:
>
> 1: X.Store.Release
> 2: Y.Load.Acquire
>
> IMO, "Does not prevent" x86 from reordering it into:
>
> 2: Y.Load.Acquire
> 1: X.Store.Release

Right, but that's what you would expect.
A "store with release semantics" is

#LoadStore | #StoreStore;
store

A "load with acquire semantics" is

load;
#LoadLoad | #StoreStore

So the sequence above is equivalent to

#LoadStore | #StoreStore;
store X;
load Y;
#LoadLoad | #StoreStore

which obviously does not prevent the reordering. It has nothing to do with
the load and store being to different locations.

--
David Hopwood <david.nosp...@blueyonder.co.uk>

Chris Thomasson

Jul 27, 2006, 9:36:55 PM

"David Hopwood" <david.nosp...@blueyonder.co.uk> wrote in message

news:rkcyg.24883$9d4....@fe2.news.blueyonder.co.uk...

> Chris Thomasson wrote:
>> "David Hopwood" <david.nosp...@blueyonder.co.uk> wrote:
>>
>>> Yes, on x86 for write-back cached memory, ordinary loads have acquire
>>> semantics and ordinary stores have release semantics. But this does not
>>> depend on whether the loads and stores are to the same location. AFAIU,
>>> each acquire is a *global* acquire barrier, and each release is a *global*
>>> release barrier.
>>
>> Humm... I really don't believe that to be completely true because on x86...
>> Well...
>>

[...]

>
> Right, but that's what you would expect.

[...]

> which obviously does not prevent the reordering. It has nothing to do with
> the load and store being to different locations.

I see...

Okay... Yes. I now believe you are correct... I was looking at it from a
different point of view... However, it still seems like it helps me "sketch
out" where to place all of the memory barriers in my algorithms because
every time one of them uses logic that includes a 'store followed by a load
to another location' it immediately raises a "red flag" in my mind. It
forces me to find out if "any" part of the flagged algorithm can cope with
this "specific" reordering; some parts can, some cannot... SMR hazard
pointer is an example of an algorithm that cannot cope with the reordering,
and needs a barrier on x86 and UltraSPARC T1...

Humm, I guess I may be unknowingly "suffering" from a personal problem that
seems to consist of frequently applying my own "personal lexicon" in
"public" forums... Sorry for any confusion....

;)

Humm, I guess that means it's time for me to write a little Dictionary, or
something along those lines... lol...

Thanks David.

:)

David Hopwood

Jul 27, 2006, 11:14:59 PM

David Hopwood wrote:
> Chris Thomasson wrote:
[...]

>>Take the following sequence:
>>
>>1: X.Store.Release
>>2: Y.Load.Acquire
>>
>>IMO, "Does not prevent" x86 from reordering it into:
>>
>>2: Y.Load.Acquire
>>1: X.Store.Release
>
> Right, but that's what you would expect.
> A "store with release semantics" is
>
> #LoadStore | #StoreStore;
> store
>
> A "load with acquire semantics" is
>
> load;
> #LoadLoad | #StoreStore

              ^^^^^^^^^^^

Sorry, cut-and-paste error. That should be #LoadLoad | #LoadStore.

> So the sequence above is equivalent to
>
> #LoadStore | #StoreStore;
> store X;
> load Y;
> #LoadLoad | #StoreStore

              ^^^^^^^^^^^

Same here.

Chris Thomasson

Jul 31, 2006, 9:01:50 PM

"PillMonsta" <ch...@chrisbird.com> wrote in message
news:1153777240....@h48g2000cwc.googlegroups.com...

> Dear all,
>
> Does anyone have a readable document, or alternatively source which
> demonstrates when and where sfence, lfence and mfence instructions are
> required for programming atomic operations on P4 and 686?

[...]

http://groups.google.com/group/comp.programming.threads/msg/8ae09f9e9bea21b9

This shows where to put the barriers in relationship to both loads and
stores...

chris noonan

Aug 1, 2006, 4:16:22 PM

Chris Thomasson wrote:

> "PillMonsta" wrote:
> > Does anyone have a readable document, or alternatively source which
> > demonstrates when and where sfence, lfence and mfence instructions are
> > required for programming atomic operations on P4 and 686?
>
> [...]
>
> http://groups.google.com/group/comp.programming.threads/msg/8ae09f9e9bea21b9
>
> This shows where to put the barriers in relationship to both loads and
> stores...

What does the SFENCE instruction embedded in that code do?

Here is the specification of SFENCE from the Intel manual:

"Performs a serializing operation on all store-to-memory instructions
that were issued prior the SFENCE instruction. This serializing operation
guarantees that every store instruction that precedes in program order
the SFENCE instruction is globally visible before any store instruction
that follows the SFENCE instruction is globally visible. The SFENCE
instruction is ordered with respect store instructions, other SFENCE
instructions, any MFENCE instructions, and any serializing instructions
(such as the CPUID instruction). It is not ordered with respect to load
instructions or the LFENCE instruction."

But on Pentium processors, writes become visible anyway in strict
program order (at least with the normal type of memory). What effect
does SFENCE have?

Chris

Chris Thomasson

Aug 1, 2006, 9:29:23 PM

"chris noonan" <use...@leapheap.co.uk> wrote in message
news:1154463382.3...@i42g2000cwa.googlegroups.com...

Yes:
http://groups.google.com/group/comp.programming.threads/msg/f2c59ced973e75dd

I guess I should have clarified that it shows exactly where to place the
lfence and sfence barriers "if" they were indeed required on a "future"
x86... Or, I guess I should have posted more links to the discussion which
covers all of this...

;)

Herb Sutter

Aug 3, 2006, 2:03:49 PM

David Hopwood <david.nosp...@blueyonder.co.uk> wrote:
>A "store with release semantics" is
>
> #LoadStore | #StoreStore;
> store
>
>A "load with acquire semantics" is
>
> load;
> #LoadLoad | #StoreStore

Did you mean #LoadLoad | #LoadStore?

Herb

---
Herb Sutter (www.gotw.ca) (www.pluralsight.com/blogs/hsutter)

Convener, ISO WG21 (C++ standards committee) (www.gotw.ca/iso)
Architect, Developer Division, Microsoft (www.gotw.ca/microsoft)

Chris

Aug 3, 2006, 3:41:32 PM

What is the relative hit of putting in lfence/sfence, should you want
to future proof your code?

Chris.

Chris Thomasson

Aug 3, 2006, 4:09:20 PM

"Herb Sutter" <hsu...@gotw.ca> wrote in message
news:hge4d2dkj5b1u3k21...@4ax.com...

> David Hopwood <david.nosp...@blueyonder.co.uk> wrote:
>>A "store with release semantics" is
>>
>> #LoadStore | #StoreStore;
>> store
>>
>>A "load with acquire semantics" is
>>
>> load;
>> #LoadLoad | #StoreStore
>
> Did you mean #LoadLoad | #LoadStore?
>

He posted a correction here:

http://groups.google.com/group/comp.programming.threads/msg/f3719d477431a942?hl=en

David Hopwood

Aug 5, 2006, 10:09:03 AM

Herb Sutter wrote:
> David Hopwood <david.nosp...@blueyonder.co.uk> wrote:
>
>>A "store with release semantics" is
>>
>> #LoadStore | #StoreStore;
>> store
>>
>>A "load with acquire semantics" is
>>
>> load;
>> #LoadLoad | #StoreStore
>
> Did you mean #LoadLoad | #LoadStore?

Yes. (I posted a followup correcting this, but you may not have seen it.)

--
David Hopwood <david.nosp...@blueyonder.co.uk>

Chris Thomasson

Aug 10, 2006, 12:22:13 AM

"Chris" <ch...@chrisbird.com> wrote in message
news:1154634092.4...@i42g2000cwa.googlegroups.com...

It devastates the pipeline and, therefore, destroys performance.