
x86 memory barrier: why does Linux prefer MFENCE to Locked ADD?


Dexuan Cui

Mar 3, 2016, 10:05:38 AM
to linux-...@vger.kernel.org, Thomas Gleixner, Ingo Molnar, H. Peter Anvin, David Howells, Paul E. McKenney, linux-...@vger.kernel.org
Hi,
My understanding of arch/x86/include/asm/barrier.h is that Linux
prefers {L,S,M}FENCE: Locked ADD is only used on x86_32 platforms that
don't support XMM2 (SSE2).

However, it looks like people say Locked ADD is much faster than the FENCE
instructions, even on modern Intel CPUs like Haswell; e.g., please see
these three sources:

" 11.5.1 Locked Instructions as Memory Barriers
Optimization
Use locked instructions to implement Store/Store and Store/Load barriers.
"
http://support.amd.com/TechDocs/47414_15h_sw_opt_guide.pdf

"lock addl %(rsp), 0 is a better solution for StoreLoad barrier ":
http://shipilev.net/blog/2014/on-the-fence-with-dependencies/

"...locked instruction are more efficient barriers...":
http://www.pvk.ca/Blog/2014/10/19/performance-optimisation-~-writing-an-essay/

I also found that FreeBSD prefers Locked Add.

So, I'm curious why Linux prefers MFENCE.
I guess I may be missing something.

I tried to google the question, but didn't find an answer.

Thanks,
-- Dexuan


Ingo Molnar

Mar 3, 2016, 10:27:51 AM
to Dexuan Cui, linux-...@vger.kernel.org, Thomas Gleixner, Ingo Molnar, H. Peter Anvin, David Howells, Paul E. McKenney, linux-...@vger.kernel.org, Michael S. Tsirkin, Peter Zijlstra
It's being worked on, see this thread on lkml from a few weeks ago:

C Jan 13 Michael S. Tsir | [PATCH v3 0/4] x86: faster mb()+documentation tweaks
C Jan 13 Michael S. Tsir | ├─>[PATCH v3 1/4] x86: add cc clobber for addl
C Jan 13 Michael S. Tsir | ├─>[PATCH v3 2/4] x86: drop a comment left over from X86_OOSTORE
C Jan 13 Michael S. Tsir | ├─>[PATCH v3 3/4] x86: tweak the comment about use of wmb for IO
C Jan 13 Michael S. Tsir | ├─>[PATCH v3 4/4] x86: drop mfence in favor of lock+addl

The 4th patch replaces MFENCE with a locked ADDL instruction.
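
For readers unfamiliar with the two variants, here is a minimal user-space sketch of what is being compared. It is not the text of the patch. It assumes x86-64 and GCC/Clang inline assembly; the names mb_mfence, mb_lock_addl and publish_then_check are made up for illustration, and the stack offset used by the actual patch may differ.

```c
#include <stdint.h>

#if !defined(__x86_64__)
#error "This sketch assumes x86-64 and GCC/Clang inline assembly."
#endif

/* Full barrier using MFENCE. */
static inline void mb_mfence(void)
{
    __asm__ __volatile__("mfence" ::: "memory");
}

/*
 * Full barrier using a locked add of zero to the word at the stack pointer.
 * Memory contents are unchanged, but EFLAGS is modified, hence the "cc"
 * clobber (hypothetical helper, for illustration only).
 */
static inline void mb_lock_addl(void)
{
    __asm__ __volatile__("lock; addl $0,0(%%rsp)" ::: "memory", "cc");
}

/*
 * Store-then-load pattern that needs a StoreLoad barrier between the two
 * accesses: publish our flag, then look at the other side's flag.
 */
static inline int publish_then_check(volatile int *mine,
                                     volatile int *theirs,
                                     int use_lock_add)
{
    *mine = 1;
    if (use_lock_add)
        mb_lock_addl();
    else
        mb_mfence();
    return *theirs;
}
```

Both helpers order the earlier store against the later load for ordinary write-back memory; which one is cheaper on a given CPU is a matter of measurement.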

Thanks,

Ingo

Peter Zijlstra

Mar 3, 2016, 10:35:06 AM
to Ingo Molnar, Dexuan Cui, linux-...@vger.kernel.org, Thomas Gleixner, Ingo Molnar, H. Peter Anvin, David Howells, Paul E. McKenney, linux-...@vger.kernel.org, Michael S. Tsirkin

Michael S. Tsirkin

Mar 3, 2016, 1:36:04 PM
to Peter Zijlstra, Ingo Molnar, Dexuan Cui, linux-...@vger.kernel.org, Thomas Gleixner, Ingo Molnar, H. Peter Anvin, David Howells, Paul E. McKenney, linux-...@vger.kernel.org
> lkml.kernel.org/r/1453921746-16178-1...@redhat.comZ

It's ready as far as I am concerned.
Basically we are just waiting for ack from hpa.

--
MST

H. Peter Anvin

Mar 3, 2016, 2:06:34 PM
to Michael S. Tsirkin, Peter Zijlstra, Ingo Molnar, Dexuan Cui, linux-...@vger.kernel.org, Thomas Gleixner, Ingo Molnar, David Howells, Paul E. McKenney, linux-...@vger.kernel.org
And I'm still discussing this with the hardware people. It seems we can do this for *most* things, but not all; the question is where exactly we need to do something different.
--
Sent from my Android device with K-9 Mail. Please excuse brevity and formatting.

Peter Zijlstra

Jun 3, 2016, 9:39:37 AM
to H. Peter Anvin, Michael S. Tsirkin, Ingo Molnar, Dexuan Cui, linux-...@vger.kernel.org, Thomas Gleixner, Ingo Molnar, David Howells, Paul E. McKenney, linux-...@vger.kernel.org
On Thu, Mar 03, 2016 at 11:05:43AM -0800, H. Peter Anvin wrote:
> >> latest version here:
> >>
> >> lkml.kernel.org/r/1453921746-16178-1...@redhat.comZ
> >
> >It's ready as far as I am concerned.
> >Basically we are just waiting for ack from hpa.
>
> And I'm still discussing this with the hardware people. It seems we
> can do this for *most* things, but not all; the question is where
> exactly we need to do something different.

Anything on this?

Michael S. Tsirkin

Aug 3, 2016, 12:36:54 AM
to H. Peter Anvin, Peter Zijlstra, Ingo Molnar, Dexuan Cui, linux-...@vger.kernel.org, Thomas Gleixner, Ingo Molnar, David Howells, Paul E. McKenney, linux-...@vger.kernel.org
On Thu, Mar 03, 2016 at 11:05:43AM -0800, H. Peter Anvin wrote:
> >It's ready as far as I am concerned.
> >Basically we are just waiting for ack from hpa.
>
> And I'm still discussing this with the hardware people. It seems we
> can do this for *most* things, but not all; the question is where
> exactly we need to do something different.

I'm guessing there's still no update?

There's a decent chance that, without documentation, a bunch of current
uses are actually broken. See for example
http://marc.info/?l=linux-kernel&m=145400059304553&w=2
which, going by the manual, fixes an smp_mb() misuse around clflush. Or maybe not?

Henrique de Moraes Holschuh

Aug 3, 2016, 9:01:45 AM
to Michael S. Tsirkin, H. Peter Anvin, Peter Zijlstra, Ingo Molnar, Dexuan Cui, linux-...@vger.kernel.org, Thomas Gleixner, Ingo Molnar, David Howells, Paul E. McKenney, linux-...@vger.kernel.org
On Wed, 03 Aug 2016, Michael S. Tsirkin wrote:
> > And I'm still discussing this with the hardware people. It seems we
> > can do this for *most* things, but not all; the question is where
> > exactly we need to do something different.

Let's hope the "hardware guys" get back to you soon :(


HSD162/BDM116 MOVNTDQA From WC Memory May Pass Earlier Locked
Instructions

Problem: An execution of (V)MOVNTDQA (streaming load instruction)
that loads from WC (write combining) memory may appear to pass an
earlier locked instruction that accesses a different cache line.

Implication: Software that expects a lock to fence subsequent
(V)MOVNTDQA instructions may not operate properly.

Workaround: None identified. Software that relies on a locked
instruction to fence subsequent executions of (V)MOVNTDQA should
insert an MFENCE instruction between the locked instruction and
subsequent (V)MOVNTDQA instruction.
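
The workaround described above can be sketched as follows. This is only an illustration under stated assumptions: x86-64 with SSE4.1, GCC/Clang inline assembly, a hypothetical function name, and a caller that passes a 16-byte-aligned source. The erratum concerns WC memory; on ordinary write-back memory MOVNTDQA behaves like a normal load, so running this in user space shows the instruction sequence only, not the erratum.

```c
#include <emmintrin.h>

#if !defined(__x86_64__)
#error "This sketch assumes x86-64 and GCC/Clang inline assembly."
#endif

/*
 * Locked instruction, then MFENCE, then the streaming load.
 * load_after_locked_op is a hypothetical name; src must be 16-byte aligned.
 */
static inline __m128i load_after_locked_op(int *counter, const __m128i *src)
{
    __m128i v;

    /* The locked instruction that software was relying on as a fence. */
    __asm__ __volatile__("lock; addl $1,%0" : "+m"(*counter) : : "memory", "cc");

    /* Extra MFENCE inserted between the locked op and the streaming load. */
    __asm__ __volatile__("mfence" ::: "memory");

    /* Streaming load (SSE4.1 MOVNTDQA). */
    __asm__ __volatile__("movntdqa %1,%0" : "=x"(v) : "m"(*src) : "memory");

    return v;
}
```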



SKL079 MOVNTDQA From WC Memory May Pass Earlier MFENCE Instructions

Problem: An execution of MOVNTDQA or VMOVNTDQA that loads from WC
(write combining) memory may appear to pass an earlier execution of
the MFENCE instruction.

Implication: When this erratum occurs, an execution of MOVNTDQA or
VMOVNTDQA may appear to execute before memory operations that
precede the earlier MFENCE instruction. Software that uses MFENCE
to order subsequent executions of the MOVNTDQA instructions may not
operate properly.

Workaround: It is possible for the BIOS to contain a workaround for
this erratum. For the steppings affected, see the Summary Table of
Changes.


These are just examples. Intel might have other errata related to
*FENCE or LOCK, and AMD might have its share of model-specific LOCK or
*FENCE oddities as well (I didn't check).

Note that Skylake is broken in exactly the opposite way from Haswell and
Broadwell. Fortunately, Skylake could be fixed through a microcode
update, but still...

The point is that we indeed need to be careful if we want to switch away
from *FENCE.

--
"One disk to rule them all, One disk to find them. One disk to bring
them all and in the darkness grind them. In the Land of Redmond
where the shadows lie." -- The Silicon Valley Tarot
Henrique Holschuh

Michael S. Tsirkin

Aug 3, 2016, 9:12:16 AM
to Henrique de Moraes Holschuh, H. Peter Anvin, Peter Zijlstra, Ingo Molnar, Dexuan Cui, linux-...@vger.kernel.org, Thomas Gleixner, Ingo Molnar, David Howells, Paul E. McKenney, linux-...@vger.kernel.org
Are any of these used in kernel though?

Henrique de Moraes Holschuh

Aug 3, 2016, 7:20:10 PM
to Michael S. Tsirkin, H. Peter Anvin, Peter Zijlstra, Ingo Molnar, Dexuan Cui, linux-...@vger.kernel.org, Thomas Gleixner, Ingo Molnar, David Howells, Paul E. McKenney, linux-...@vger.kernel.org
On Wed, 03 Aug 2016, Michael S. Tsirkin wrote:
> Are any of these used in kernel though?

These specific errata were not the point of my post; rather, it was the
fact that errata related to *FENCE and LOCKed instructions exist.

I didn't verify whether something attempts to use non-temporal loads or
stores from WC memory in the kernel.