[llvm-dev] RFC: non-temporal fencing in LLVM IR


JF Bastien via llvm-dev

Jan 13, 2016, 2:16:34 AM
to llvm-dev, Hans Boehm
Hello, fencing enthusiasts!

TL;DR: We'd like to propose an addition to the LLVM memory model requiring non-temporal accesses be surrounded by non-temporal load barriers and non-temporal store barriers, and we'd like to add such orderings to the fence IR opcode.

We are open to different approaches, hence this email instead of a patch.


Who's "we"?

Philip Reames brought this to my attention, and we've had numerous discussions with Hans Boehm on the topic. Any mistakes below are my own, all the clever bits are theirs.


Why?

Ignore non-temporals for a moment: on most x86 targets LLVM generates an mfence for seq_cst atomic fencing. One could instead use a locked idempotent atomic access to top-of-stack, such as lock or4i [RSP-8] 0. Philip has measured this as equivalent on micro-benchmarks, but as ~25% faster in macro-benchmarks (other codebases confirm this). There's one problem with this approach: non-temporal accesses on x86 are only ordered by fence instructions! This means that code using non-temporal accesses can't rely on LLVM's fence opcode to do the right thing; it instead has to rely on architecture-specific _mm*fence intrinsics.
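
For readers who think in C++ rather than IR, here is a minimal sketch of the kind of seq_cst fence under discussion. The variable and function names are hypothetical. Which instruction the fence lowers to (mfence, a locked stack operation, or something else) is the backend's choice and is not visible in the source.

```cpp
#include <atomic>

// Hypothetical flags shared by two threads (classic store-buffering shape).
std::atomic<int> flag_a{0};
std::atomic<int> flag_b{0};

// Thread A: announce ourselves, then look at the other side.
int announce_a_then_read_b() {
  flag_a.store(1, std::memory_order_relaxed);
  // Becomes `fence seq_cst` in LLVM IR; the x86 backend picks the instruction.
  std::atomic_thread_fence(std::memory_order_seq_cst);
  return flag_b.load(std::memory_order_relaxed);
}

// Thread B: the mirror image of thread A.
int announce_b_then_read_a() {
  flag_b.store(1, std::memory_order_relaxed);
  std::atomic_thread_fence(std::memory_order_seq_cst);
  return flag_a.load(std::memory_order_relaxed);
}
```

If each function runs on its own thread, the two seq_cst fences rule out both calls returning 0. Nothing in this sketch involves non-temporal accesses, which is exactly the case the fence does not currently promise to cover.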


But wait! Who said developers need to issue any type of fence when using non-temporals?

Well, the LLVM memory model sure didn't. The x86 memory model does (volume 3 section 8.2.2 Memory Ordering) but LLVM targets more than x86 and the backends are free to ignore the !nontemporal metadata, and AFAICT the x86 backend doesn't add those fences.

Therefore even without the above optimization the LLVM language reference is incorrect: non-temporals should be bracketed by barriers. This applies even without threading! Non-temporal accesses aren't guaranteed to interact well with regular accesses, which means that regular loads cannot move "down" a non-temporal barrier, and regular stores cannot move "up" a non-temporal barrier.


Why not just have the compiler add the fences?

LLVM could do this, either as a per-backend thing or a hookable pass such as AtomicExpandPass. It seems more natural to ask the programmer to express intent, just as is done with atomics. In fact, a backend is currently free to ignore !nontemporal on load and store and could therefore generate only half of what's requested, leading to incorrect code. That would of course be silly: backends should either honor all !nontemporal or none of them, but who knows what the middle-end does.

Put another way: some optimized C libraries use non-temporal accesses (when string instructions aren't du jour) and they terminate their copying with an sfence. It's a de-facto convention, the ABI doesn't say anything, but let's avoid divergence.
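
A minimal sketch of that convention, assuming an x86 target with SSE2. The function name and its preconditions (16-byte-aligned destination, size a multiple of 16) are made up for illustration and not taken from any particular C library; on other targets the sketch simply falls back to memcpy.

```cpp
#include <cstddef>
#include <cstring>
#if defined(__SSE2__)
#include <emmintrin.h>  // _mm_loadu_si128, _mm_stream_si128, _mm_sfence
#endif

// Hypothetical helper: copy `bytes` (a multiple of 16) into a 16-byte-aligned
// destination, using streaming stores where SSE2 is available.
void streaming_copy(void* dst, const void* src, std::size_t bytes) {
#if defined(__SSE2__)
  auto* d = static_cast<__m128i*>(dst);
  auto* s = static_cast<const __m128i*>(src);
  for (std::size_t i = 0; i < bytes / 16; ++i) {
    __m128i chunk = _mm_loadu_si128(s + i);  // ordinary (temporal) load
    _mm_stream_si128(d + i, chunk);          // non-temporal store
  }
  // The de-facto convention: end the copy with an sfence so the streaming
  // stores are ordered before any later stores.
  _mm_sfence();
#else
  std::memcpy(dst, src, bytes);  // fallback: no non-temporal stores involved
#endif
}
```

The trailing sfence is the part a target-independent store barrier would replace in portable code.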

Aside: one day we may live in the fence elimination promised land where fences are exactly where they need to be, no more, no less.


Isn't x86's lfence just a no-op?

Yes, but we're proposing the addition of a target-independent non-temporal load barrier. It'll be up to the x86 backend to make it an X86ISD::MEMBARRIER and other backends to get it right (hint: it's not always a no-op).


Won't this optimization cause coherency misses? C++ accesses the thread stack concurrently all the time!

Maybe, but then it isn't much of an optimization if it's slowing code down. LLVM doesn't just target C++, and it's really up to the backend to decide whether one fence type is better than another (on x86, whether a locked top-of-stack idempotent operation is better than mfence). Other languages have private stacks where this isn't an issue, and where the stack top can reasonably be assumed to be in cache.


How will this affect non-user-mode code (i.e. kernel code)?

Kernel code still has to ask for _mm_mfence if it wants mfence: C11 and C++11 barriers aren't specified as a specific instruction.


Is it safe to access top-of-stack?

AFAIK yes, and the ABI-specified red zone has our back (or front if the stack grows up ☻).


What about non-x86 architectures?

Architectures such as ARMv8 support non-temporal instructions and require barriers such as DMB nshld to order loads and DMB nshst to order stores.

Even ARM's address-dependency rule (a.k.a. the ill-fated std::memory_order_consume) fails to hold with non-temporals:
LDR X0, [X3]
LDNP X2, X1, [X0] // X0 may not be loaded when the instruction executes!


Who uses non-temporals anyways?

That's an awfully personal question!

Philip Reames via llvm-dev

Jan 13, 2016, 12:45:42 PM
to JF Bastien, llvm-dev, Hans Boehm


On 01/12/2016 11:16 PM, JF Bastien wrote:
> Ignore non-temporals for a moment: on most x86 targets LLVM generates an mfence for seq_cst atomic fencing. One could instead use a locked idempotent atomic access to top-of-stack, such as lock or4i [RSP-8] 0. Philip has measured this as equivalent on micro-benchmarks, but as ~25% faster in macro-benchmarks (other codebases confirm this). There's one problem with this approach: non-temporal accesses on x86 are only ordered by fence instructions! This means that code using non-temporal accesses can't rely on LLVM's fence opcode to do the right thing; it instead has to rely on architecture-specific _mm*fence intrinsics.

Just to clarify: the proposal to change the implementation of seq_cst is arguably separate from this proposal. It will go through normal patch review once the semantics are addressed. Whatever we end up doing with seq_cst, we currently have a semantic hole in our specification around non-temporals that needs to be addressed.

Another approach would be to define the current fences as fencing non-temporals and introduce new ones that don't. Either approach is workable. I believe that new fences for non-temporals are the appropriate choice, given that this would more closely match existing practice.

We could also consider forward-serializing bitcode to the stronger form, whichever choice we make. That would be the conservatively correct thing to do for older bitcode, which might be assuming stronger semantics than our barriers explicitly provided.

John Brawn via llvm-dev

Jan 13, 2016, 1:32:48 PM
to JF Bastien, llvm...@lists.llvm.org, nd, Hans Boehm

> What about non-x86 architectures?
>
> Architectures such as ARMv8 support non-temporal instructions and require barriers such as DMB nshld to order loads and DMB nshst to order stores.
>
> Even ARM's address-dependency rule (a.k.a. the ill-fated std::memory_order_consume) fails to hold with non-temporals:
>
> LDR X0, [X3]
> LDNP X2, X1, [X0] // X0 may not be loaded when the instruction executes!

What exactly do you mean by ‘X0 may not be loaded’ in your example here? If you mean that the LDNP could start executing with the value of X0 from before the LDR, e.g. initially X0=0x100, the LDR loads X0=0x200 but the LDNP uses the old value of X0=0x100, then I don’t think that’s true. According to section C3.2.4 of the ARMv8 ARM ARM other observers may observe the LDR and the LDNP in the wrong order, but the CPU executing the instructions will observe them in program order.

I have no idea if that affects anything in this RFC though.

John

JF Bastien via llvm-dev

Jan 13, 2016, 1:44:54 PM
to John Brawn, llvm...@lists.llvm.org, nd, Hans Boehm
On Wed, Jan 13, 2016 at 10:32 AM, John Brawn <John....@arm.com> wrote:

> What exactly do you mean by ‘X0 may not be loaded’ in your example here? If you mean that the LDNP could start executing with the value of X0 from before the LDR, e.g. initially X0=0x100, the LDR loads X0=0x200 but the LDNP uses the old value of X0=0x100, then I don’t think that’s true. According to section C3.2.4 of the ARMv8 ARM ARM other observers may observe the LDR and the LDNP in the wrong order, but the CPU executing the instructions will observe them in program order.


I haven't touched ARMv8 in a few years so I'm rusty on the non-temporal details for that ISA. I lifted this example from here:

http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.den0024a/CJACGJJF.html

Which is correct?


> I have no idea if that affects anything in this RFC though.

Agreed, but I don't want to be misleading! The current example serves as a good justification for non-temporal read barriers; it would be a shame to justify myself on incorrect data :-)

Tim Northover via llvm-dev

Jan 13, 2016, 1:59:31 PM
to JF Bastien, llvm...@lists.llvm.org, nd, Hans Boehm
> I haven't touched ARMv8 in a few years so I'm rusty on the non-temporal
> details for that ISA. I lifted this example from here:
>
> http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.den0024a/CJACGJJF.html
>
> Which is correct?

FWIW, I agree with John here. The example I'd give for the unexpected
behaviour allowed in the spec is:

.Lwait_for_data:
    ldr x0, [x3]
    cbz x0, .Lwait_for_data
    ldnp x2, x1, [x0]

where another thread first writes to a buffer then tells us where that
buffer is. For a normal ldp, the address dependency rule means we
don't need a barrier or acquiring load to ensure we see the real data
in the buffer. For ldnp, we would need a barrier to prevent stale
data.

I suspect this is actually even closer to the x86 situation than what
the guide implies (which looks like a straight-up exposed pipeline to
me, beyond even what Alpha would have done).
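
A rough C++ rendering of the publish pattern described above, with hypothetical names throughout. The consumer uses an acquire load, which plays the role of the barrier that the ldnp case would need; it is a sketch of the source-level shape only and does not itself emit any non-temporal instruction.

```cpp
#include <atomic>

// Hypothetical two-word buffer, mirroring the pair that ldp/ldnp would read.
struct Buffer {
  long lo;
  long hi;
};

// Null until the producer has filled a buffer and published its address.
std::atomic<Buffer*> g_published{nullptr};

// "Another thread first writes to a buffer then tells us where that buffer is."
void producer(Buffer* buf) {
  buf->lo = 1;
  buf->hi = 2;
  g_published.store(buf, std::memory_order_release);
}

// Spin until the address shows up, then read the data through it.
long consumer_sum() {
  Buffer* p = nullptr;
  while ((p = g_published.load(std::memory_order_acquire)) == nullptr) {
    // wait for data, like the ldr/cbz loop
  }
  // With acquire ordering these reads see the producer's writes regardless of
  // whether the address dependency alone would have been enough.
  return p->lo + p->hi;
}
```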

Cheers.

Tim.
_______________________________________________
LLVM Developers mailing list
llvm...@lists.llvm.org
http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev

Hans Boehm via llvm-dev

Jan 14, 2016, 12:56:11 AM
to Tim Northover, llvm...@lists.llvm.org, nd
I agree with Tim's assessment for ARM.  That's interesting; I wasn't previously aware of that instruction.

My understanding is that Alpha would have the same problem for normal loads.

I'm all in favor of more systematic handling of the fences associated with x86 non-temporal accesses.

AFAICT, nontemporal loads and stores seem to have different fencing rules on x86, none of them very clear.  Nontemporal stores should probably ideally use an SFENCE.  Locked instructions seem to be documented to work with MOVNTDQA.  In both cases, there seems to be only empirical evidence as to which side(s) of the nontemporal operations they should go on?

I finally decided that I was OK with using a LOCKed top-of-stack update as a fence in Java on x86.  I'm significantly less enthusiastic for C++.  I also think that risks unexpected coherence miss problems, though they would probably be very rare.  But they would be very surprising if they did occur.


Hal Finkel via llvm-dev

Jan 14, 2016, 3:51:31 PM
to Philip Reames, llvm-dev, Hans Boehm
Hi JF, Philip,

Clang currently has __builtin_nontemporal_store and __builtin_nontemporal_load. How will the usage model for those change?

Thanks again,
Hal


--
Hal Finkel
Assistant Computational Scientist
Leadership Computing Facility
Argonne National Laboratory

JF Bastien via llvm-dev

Jan 14, 2016, 4:02:33 PM
to Hal Finkel, llvm-dev, Hans Boehm
On Thu, Jan 14, 2016 at 12:51 PM, Hal Finkel <hfi...@anl.gov> wrote:
> Hi JF, Philip,
>
> Clang currently has __builtin_nontemporal_store and __builtin_nontemporal_load. How will the usage model for those change?

I think you would use them in the same way, but you'd have to also use __builtin_nontemporal_store_fence and __builtin_nontemporal_load_fence. 

Unless we have LLVM automagically figure out where non-temporal fences should go, which I think isn't as good of an approach.

Hal Finkel via llvm-dev

Jan 14, 2016, 4:06:04 PM
to JF Bastien, llvm-dev, Hans Boehm
> I think you would use them in the same way, but you'd have to also
> use __builtin_nontemporal_store_fence and
> __builtin_nontemporal_load_fence.

So we'll add new fence intrinsics. That makes sense.

>
>
> Unless we have LLVM automagically figure out where non-temporal
> fences should go, which I think isn't as good of an approach.
>

I agree. Such a determination is likely to be too conservative in practice.

-Hal

David Majnemer via llvm-dev

Jan 14, 2016, 4:10:54 PM
to Hans Boehm, llvm...@lists.llvm.org, nd
On Wed, Jan 13, 2016 at 7:00 PM, Hans Boehm via llvm-dev <llvm...@lists.llvm.org> wrote:
> I finally decided that I was OK with using a LOCKed top-of-stack update as a fence in Java on x86. I'm significantly less enthusiastic for C++. I also think that risks unexpected coherence miss problems, though they would probably be very rare. But they would be very surprising if they did occur.

Today's LLVM already emits 'lock or %eax, (%esp)' for 'fence seq_cst'/__sync_synchronize/__atomic_thread_fence(__ATOMIC_SEQ_CST) when targeting 32-bit x86 machines which do not support mfence.  What instruction sequence should we be using instead?
 




JF Bastien via llvm-dev

Jan 14, 2016, 4:11:35 PM
to Hal Finkel, llvm-dev, Hans Boehm
On Thu, Jan 14, 2016 at 1:05 PM, Hal Finkel <hfi...@anl.gov> wrote:
> > I think you would use them in the same way, but you'd have to also
> > use __builtin_nontemporal_store_fence and
> > __builtin_nontemporal_load_fence.
>
> So we'll add new fence intrinsics. That makes sense.

Correct, and I propose that this translate to an LLVM IR barrier, with a new type of memory ordering (non-temporal load, and non-temporal store). It can't be metadata, but it could be an attribute instead (akin to how load/store have atomic and volatile attributes).

We could then add the same concept to C++ but I won't tip my hand too much ;-)


> > Unless we have LLVM automagically figure out where non-temporal
> > fences should go, which I think isn't as good of an approach.
>
> I agree. Such a determination is likely to be too conservative in practice.

Indeed, user control seems better here, especially since the user is the one who knows which memory aliases and therefore where the fence matters.

JF Bastien via llvm-dev

Jan 14, 2016, 4:13:47 PM
to David Majnemer, llvm...@lists.llvm.org, nd, Hans Boehm
On Thu, Jan 14, 2016 at 1:10 PM, David Majnemer via llvm-dev <llvm...@lists.llvm.org> wrote:


> Today's LLVM already emits 'lock or %eax, (%esp)' for 'fence seq_cst'/__sync_synchronize/__atomic_thread_fence(__ATOMIC_SEQ_CST) when targeting 32-bit x86 machines which do not support mfence. What instruction sequence should we be using instead?

Do they have non-temporal accesses in the ISA?

David Majnemer via llvm-dev

Jan 14, 2016, 4:35:56 PM
to JF Bastien, llvm...@lists.llvm.org, nd, Hans Boehm
On Thu, Jan 14, 2016 at 1:13 PM, JF Bastien <j...@google.com> wrote:
> Do they have non-temporal accesses in the ISA?

I thought not but there appear to be instructions like movntps.  mfence was introduced in SSE2 while movntps and sfence were introduced in SSE.

JF Bastien via llvm-dev

Jan 14, 2016, 4:37:47 PM
to David Majnemer, llvm...@lists.llvm.org, nd, Hans Boehm
On Thu, Jan 14, 2016 at 1:35 PM, David Majnemer <david.m...@gmail.com> wrote:


> I thought not but there appear to be instructions like movntps. mfence was introduced in SSE2 while movntps and sfence were introduced in SSE.

So the new builtin could be sfence? I think the codegen you point out for SEQ_CST is fine if we fix the memory model as suggested.

Hans Boehm via llvm-dev

Jan 14, 2016, 7:08:38 PM
to JF Bastien, llvm...@lists.llvm.org, nd
On Thu, Jan 14, 2016 at 1:37 PM, JF Bastien <j...@google.com> wrote:
> So the new builtin could be sfence? I think the codegen you point out for SEQ_CST is fine if we fix the memory model as suggested.

I agree that it's fine to use a locked instruction as a seq_cst fence if MFENCE is not available. If you have to dirty a cache line, (%esp) seems like a relatively safe one. (I'm assuming that CPUID is appreciably slower and out of the running? I haven't tried. But it also probably clobbers too many registers.) It's only the idea of writing to a memory location when MFENCE is available, and could be used instead, that seems questionable.

What exactly would the non-temporal fences be?  It seems that on x86, the load and store case may differ.  In theory, there's also a before vs. after question.  In practice code using MOVNTA seems to assume that you only need an SFENCE afterwards.  I can't back that up with spec verbiage.  I don't know about MOVNTDQA.  What about ARM?

Philip Reames via llvm-dev

Jan 14, 2016, 7:27:20 PM
to Hans Boehm, JF Bastien, llvm...@lists.llvm.org, nd


On 01/14/2016 04:05 PM, Hans Boehm via llvm-dev wrote:


> I agree that it's fine to use a locked instruction as a seq_cst fence if MFENCE is not available.

It's not clear to me this is true if the seq_cst fence is expected to fence non-temporal stores. I think in practice, you'd be very unlikely to notice a difference, but I can't point to anything in the Intel docs which justifies a lock-prefixed instruction as sufficient to fence any non-temporal access.

> If you have to dirty a cache line, (%esp) seems like a relatively safe one.

Agreed. As we discussed previously, it is possible to get false sharing in C++, but this would require one thread to be accessing information stored in the last frame of another running thread's stack. That seems sufficiently unlikely to be ignored.

> (I'm assuming that CPUID is appreciably slower and out of the running? I haven't tried. But it also probably clobbers too many registers.)

This is my belief. I haven't actually tried this experiment, but I've seen no reports that CPUID is a good choice here.

> It's only the idea of writing to a memory location when MFENCE is available, and could be used instead, that seems questionable.

While in principle I agree, it appears in practice that this tradeoff is worthwhile. The hardware doesn't seem to optimize for the MFENCE case, whereas lock-prefixed instructions appear to be handled much better.

> What exactly would the non-temporal fences be? It seems that on x86, the load and store case may differ. In theory, there's also a before vs. after question. In practice code using MOVNTA seems to assume that you only need an SFENCE afterwards. I can't back that up with spec verbiage. I don't know about MOVNTDQA. What about ARM?

I'll leave this to JF to answer. I'm not knowledgeable enough about non-temporals to answer without substantial research first.

JF Bastien via llvm-dev

Jan 15, 2016, 3:15:35 AM
to Philip Reames, llvm...@lists.llvm.org, nd, Hans Boehm
I agree that it's fine to use a locked instruction as a seq_cst fence if MFENCE is not available. 
It's not clear to me this is true if the seq_cst fence is expected to fence non-temporal stores.  I think in practice, you'd be very unlikely to notice a difference, but I can't point to anything in the Intel docs which justifies a lock prefixed instruction as sufficient to fence any non-temporal access. 

Correct, that's why changing the memory model is critical: seq_cst fence wouldn't have any guarantee w.r.t. non-temporal.


What exactly would the non-temporal fences be?  It seems that on x86, the load and store case may differ.  In theory, there's also a before vs. after question.  In practice code using MOVNTA seems to assume that you only need an SFENCE afterwards.  I can't back that up with spec verbiage.  I don't know about MOVNTDQA.  What about ARM?
I'll leave this to JF to answer.  I'm not knowledgeable enough about non-temporals to answer without substantial research first.

I'm proposing two builtins:
  • __builtin_nontemporal_load_fence
  • __builtin_nontemporal_store_fence

If I've got this right, on x86 they would respectively be a nop, and sfence.

They otherwise act as memory code motion barriers unless accesses are proven to not alias. I think it may be possible to loosen the rule so they act closer to acquire/release (allowing accesses to move into the pair) but I'm not convinced that this works for every ISA so I'd err on the side of caution (since this can be loosened later).
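
A minimal C++ sketch of how the store-side fence might be used on x86. `nontemporal_store_fence` and `nontemporal_store` are hypothetical wrappers standing in for the proposed builtin, which does not exist yet; the x86 mapping to `_mm_sfence` follows the message above, and the non-x86 fallback to a seq_cst fence is only a conservative assumption.

```cpp
#include <atomic>
#include <thread>
#if defined(__SSE2__) || defined(_M_X64)
#include <immintrin.h>
#define HAVE_X86_NT 1
#else
#define HAVE_X86_NT 0
#endif

// Hypothetical stand-in for the proposed __builtin_nontemporal_store_fence.
inline void nontemporal_store_fence() {
#if HAVE_X86_NT
    _mm_sfence();  // orders earlier non-temporal stores before later stores
#else
    // Assumption: a full fence is a safe (if slow) substitute elsewhere.
    std::atomic_thread_fence(std::memory_order_seq_cst);
#endif
}

// Hypothetical helper: non-temporal store on x86, plain store elsewhere.
inline void nontemporal_store(int* p, int v) {
#if HAVE_X86_NT
    _mm_stream_si32(p, v);
#else
    *p = v;
#endif
}

int payload[4];
std::atomic<bool> ready{false};

void producer() {
    for (int i = 0; i < 4; ++i) nontemporal_store(&payload[i], i + 1);
    nontemporal_store_fence();                     // fence after the NT stores
    ready.store(true, std::memory_order_release);  // then publish as usual
}

int consumer() {
    while (!ready.load(std::memory_order_acquire)) std::this_thread::yield();
    // A load-side fence would go here; per the message above it is a nop on x86.
    int sum = 0;
    for (int v : payload) sum += v;
    return sum;
}

int run_demo() {
    std::thread t(producer);
    int sum = consumer();
    t.join();
    return sum;
}
```

Without the fence, the release store alone is not documented to order the earlier non-temporal stores on x86, which is the gap the proposal is trying to close.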

John Brawn via llvm-dev

Jan 15, 2016, 1:22:53 PM
to Hans Boehm, Tim Northover, llvm...@lists.llvm.org, nd

> I haven't touched ARMv8 in a few years so I'm rusty on the non-temporal
> details for that ISA. I lifted this example from here:
>
> http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.den0024a/CJACGJJF.html
>
> Which is correct?

 

I’ve confirmed that this example in the Cortex-A programmers guide is wrong, and it should hopefully be corrected in a future version.

 

John

Hans Boehm via llvm-dev

Jan 15, 2016, 1:56:41 PM
to John Brawn, llvm...@lists.llvm.org, nd
It seems to me the intent of that section is intelligible to those of us who have been spending too much time dealing with these issues, but seems wrong to everyone else:  If another thread updates [X0] and then [X3] (with an intervening fence), this thread may see the new value of [X3], but the old value of [X0], violating the data dependence.  This makes it incorrect to use such a load for e.g. Java final fields without a fence.  I agree that the text is at best unclear, but presumably that was indeed the intent?
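
The publication pattern at issue can be sketched in C++ roughly as follows. This is only an illustration: `Box`, `box`, and `published` are hypothetical names, the release store plays the role of the writer's intervening fence, and the sketch assumes the reader's dependent load is compiled to an ordinary load. The concern above is that a non-temporal load such as LDNP in place of `p->value` would not honor the data dependence.

```cpp
#include <atomic>
#include <thread>

struct Box { int value; };  // hypothetical object, e.g. one with a final field

Box box{0};
std::atomic<Box*> published{nullptr};

void writer() {
    box.value = 42;                                    // update the data first
    published.store(&box, std::memory_order_release);  // then publish the pointer
}

int reader() {
    Box* p;
    // Dependency-ordered load of the pointer; compilers typically treat
    // memory_order_consume as acquire.
    while (!(p = published.load(std::memory_order_consume)))
        std::this_thread::yield();
    return p->value;  // dependent load: must not observe the old value
}

int run_once() {
    std::thread t(writer);
    int seen = reader();
    t.join();
    return seen;
}
```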

Hans Boehm via llvm-dev

Jan 15, 2016, 2:21:32 PM
to Philip Reames, llvm...@lists.llvm.org, nd
On Thu, Jan 14, 2016 at 4:27 PM, Philip Reames <list...@philipreames.com> wrote:
It's not clear to me this is true if the seq_cst fence is expected to fence non-temporal stores.  I think in practice, you'd be very unlikely to notice a difference, but I can't point to anything in the Intel docs which justifies a lock prefixed instruction as sufficient to fence any non-temporal access. 

Agreed.  I think it's not guaranteed.  And the most rational explanation for the fact that LOCK; X is faster than MFENCE seems to be that LOCK only deals with normal write-back cacheable accesses, and hence may not work for cases like this.


If you have to dirty a cache line, (%esp) seems like relatively safe one. 
Agreed.  As we discussed previously, it is possible to get false sharing in C++, but this would require one thread to be accessing information stored in the last frame of another running thread's stack.  That seems sufficiently unlikely to be ignored. 

I disagree with the reasoning, but not really with the conclusion.  Starting a thread with a lambda that captures locals by reference is likely to do this, and is a common C++ idiom, especially in textbook examples.  This is aggravated by the fact that I don't understand the hardware prefetcher, and that it sometimes seems to fetch an adjacent line.  (Note that C, unlike C++, allows implementations to make thread stacks inaccessible to other threads.  Some of us consider that a bug and would refuse to use a general purpose implementation that actually did this.  I suspect there are enough of us that it doesn't matter.)

I think a stronger argument is that the compiler is always allowed to push temporaries on the stack.  So this looks exactly as though a sequentially consistent fence required a stack temporary.
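
A small sketch of the lambda idiom mentioned above, in which a worker thread writes into the stack frame of the thread that started it. The function name `count_to` is made up for illustration, and the comment about how the fence is lowered describes the code generation under discussion, not something the C++ source controls.

```cpp
#include <atomic>
#include <thread>

int count_to(int n) {
    std::atomic<int> counter{0};  // lives in this function's stack frame
    std::thread worker([&counter, n] {
        for (int i = 0; i < n; ++i)
            counter.fetch_add(1, std::memory_order_relaxed);  // writes the parent's stack
    });
    // If this fence is lowered to a LOCKed top-of-stack update, it dirties a
    // cache line close to `counter` while the worker is also writing it.
    std::atomic_thread_fence(std::memory_order_seq_cst);
    worker.join();
    return counter.load(std::memory_order_relaxed);
}
```

The result is the same either way; the only possible difference is extra coherence traffic on the line holding `counter`.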


It's only the idea of writing to a memory location when MFENCE is available, and could be used instead, that seems questionable.
While in principle I agree, it appears in practice that this tradeoff is worthwhile.  The hardware doesn't seem to optimize for the MFENCE case whereas lock prefix instructions appear to be handled much better.
The concern is that it is actually fairly easy to get contention as a result in C++.  And programmers might think they know that certain fences shouldn't use temporaries and the rest of their code should run in registers.  But I agree this is not a completely clear call.  I wish x86 provided a plain fence instruction that handled the common case efficiently, so we could avoid these trade-offs.  (A "sequentially consistent store" instruction might be even better, in that it should largely eliminate fences and allows other optimizations.)

Hans

Hans Boehm via llvm-dev

Jan 15, 2016, 3:04:18 PM
to JF Bastien, llvm...@lists.llvm.org, nd
On Fri, Jan 15, 2016 at 12:15 AM, JF Bastien <j...@google.com> wrote:
What exactly would the non-temporal fences be?  It seems that on x86, the load and store case may differ.  In theory, there's also a before vs. after question.  In practice code using MOVNTA seems to assume that you only need an SFENCE afterwards.  I can't back that up with spec verbiage.  I don't know about MOVNTDQA.  What about ARM?
I'll leave this to JF to answer.  I'm not knowledgeable enough about non-temporals to answer without substantial research first.

I'm proposing two builtins:
  • __builtin_nontemporal_load_fence
  • __builtin_nontemporal_store_fence

If I've got this right, on x86 they would respectively be a nop, and sfence.

They otherwise act as memory code motion barriers unless accesses are proven to not alias. I think it may be possible to loosen the rule so they act closer to acquire/release (allowing accesses to move into the pair) but I'm not convinced that this works for every ISA so I'd err on the side of caution (since this can be loosened later).

What would the semantics be?  They restore the normal architectural ordering guarantees relied upon by the synchronization primitives, so that non-temporal accesses don't need to be considered when  implementing synchronization?

Then I think an SFENCE following x86 non-temporal stores would be correct. And empirically we don't need anything before a non-temporal store to order it with respect to earlier normal stores.  But I don't think the latter conclusion follows from the spec.

I looked at the MOVNTDQA non-temporal load documentation again, and I'm confused.  It sounds like so long as the memory is WB-cacheable, we may be OK without any fences.  But I can't tell that for sure.  In the WC case, a LOCKed instruction seems to be documented to work as a fence.

In the ARM LDNP case, things seem to be messy.  I don't think we currently need fences for C++, since we don't normally use the dependency-based ordering guarantees.  (Except to prevent out-of-thin-air results, which don't seem to be precluded by the ARM spec.  Intentional or bug?)  But the difference does matter when implementing Java final fields or memory_order_consume.

I'm actually getting a little worried that these things are just too idiosyncratic to reflect in portable intrinsics.