Hello, fencing enthusiasts!
TL;DR: We'd like to propose an addition to the LLVM memory model requiring non-temporal accesses be surrounded by non-temporal load barriers and non-temporal store barriers, and we'd like to add such orderings to the fence IR opcode.
We are open to different approaches, hence this email instead of a patch.
Who's "we"?
Philip Reames brought this to my attention, and we've had numerous discussions with Hans Boehm on the topic. Any mistakes below are my own, all the clever bits are theirs.
Why?
Ignore non-temporals for a moment: on most x86 targets LLVM generates an mfence for seq_cst atomic fencing. One could instead use a locked idempotent atomic access to top-of-stack such as lock or4i [RSP-8] 0. Philip has measured this as equivalent on micro-benchmarks, but as ~25% faster in macro-benchmarks (other codebases confirm this). There's one problem with this approach: non-temporal accesses on x86 are only ordered by fence instructions! This means that code using non-temporal accesses can't rely on LLVM's fence opcode to do the right thing; it instead has to rely on architecture-specific _mm*fence intrinsics.
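To make the locked top-of-stack idea concrete, here is a minimal C++ sketch. The helper name full_fence_locked_or and the function publish_then_read are hypothetical, invented for illustration; they are not LLVM or standard-library APIs. The inline assembly is only used on x86-64 with a GNU-compatible compiler, and the fallback to std::atomic_thread_fence elsewhere is an assumption of this sketch.

```cpp
#include <atomic>
#include <cassert>

// Hypothetical helper (not an LLVM or standard-library API): a full fence
// built from a locked, idempotent read-modify-write on the top of the stack.
inline void full_fence_locked_or() {
#if defined(__x86_64__) && (defined(__GNUC__) || defined(__clang__))
    // OR-ing zero into the word at the stack pointer leaves memory unchanged;
    // the LOCK prefix is what orders ordinary loads and stores around it.
    __asm__ __volatile__("lock orl $0, (%%rsp)" ::: "memory", "cc");
#else
    // Assumption of this sketch: on other targets, use the portable fence.
    std::atomic_thread_fence(std::memory_order_seq_cst);
#endif
}

// One half of a store-buffering pattern: write our flag, fence, then read
// the other side's flag. Both accesses are ordinary (temporal) accesses.
inline int publish_then_read(std::atomic<int>& mine, std::atomic<int>& theirs) {
    mine.store(1, std::memory_order_relaxed);
    full_fence_locked_or();
    return theirs.load(std::memory_order_relaxed);
}
```

Note that this only illustrates the fence for ordinary accesses. Per the paragraph above, it should not be assumed to order non-temporal accesses.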
What about non-x86 architectures?
Architectures such as ARMv8 support non-temporal instructions and require barriers such as DMB nshld to order loads and DMB nshst to order stores.
Even ARM's address-dependency rule (a.k.a. the ill-fated std::memory_order_consume) fails to hold with non-temporals:
LDR X0, [X3]
LDNP X2, X1, [X0] // X0 may not be loaded when the instruction executes!
What exactly do you mean by ‘X0 may not be loaded’ in your example here? If you mean that the LDNP
could start executing with the value of X0 from before the LDR, e.g. initially X0=0x100, the LDR loads
X0=0x200 but the LDNP uses the old value of X0=0x100, then I don’t think that’s true. According to
section C3.2.4 of the ARMv8 ARMARM other observers may observe the LDR and the LDNP in the wrong
order, but the CPU executing the instructions will observe them in program order.
I have no idea if that affects anything in this RFC though.
John
FWIW, I agree with John here. The example I'd give for the unexpected
behaviour allowed in the spec is:
.Lwait_for_data:
    ldr x0, [x3]
    cbz x0, .Lwait_for_data
    ldnp x2, x1, [x0]
where another thread first writes to a buffer then tells us where that
buffer is. For a normal ldp, the address dependency rule means we
don't need a barrier or acquiring load to ensure we see the real data
in the buffer. For ldnp, we would need a barrier to prevent stale
data.
I suspect this is actually even closer to the x86 situation than what
the guide implies (which looks like a straight-up exposed pipeline to
me, beyond even what Alpha would have done).
Cheers.
Tim.
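The message-passing pattern Tim describes can be sketched in portable C++ roughly as follows. The names Buffer, g_ready, producer and consumer are hypothetical. The sketch uses an acquire load rather than relying on the address dependency, which is the conservative choice; it does not itself emit ldnp, it only shows the shape of the publish/wait protocol under discussion.

```cpp
#include <atomic>
#include <cassert>

// Hypothetical payload standing in for the pair of registers loaded by ldp/ldnp.
struct Buffer {
    long lo;
    long hi;
};

// Plays the role of the location addressed by x3: null until data is ready.
std::atomic<Buffer*> g_ready{nullptr};

// Writer: fill the buffer first, then tell the reader where it is.
void producer(Buffer* buf) {
    assert(buf != nullptr);
    buf->lo = 1;
    buf->hi = 2;
    g_ready.store(buf, std::memory_order_release);
}

// Reader: spin until the pointer is published, then read through it.
// The acquire load guarantees the buffer contents written before the
// release store are visible, without depending on address dependencies.
long consumer() {
    Buffer* p = nullptr;
    while ((p = g_ready.load(std::memory_order_acquire)) == nullptr) {
        // busy-wait, mirroring the cbz loop above
    }
    return p->lo + p->hi;
}
```

In real use producer and consumer would run on different threads; calling producer before consumer on a single thread exercises the same code path.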
_______________________________________________
LLVM Developers mailing list
llvm...@lists.llvm.org
http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
Clang currently has __builtin_nontemporal_store and __builtin_nontemporal_load. How will the usage model for those change?
Thanks again,
Hal
----- Original Message -----
> > Hello, fencing enthusiasts!
> 
> > Who's "we"?
> 
> > Why?
> 
> > 0 . Philip has measured this as equivalent on micro-benchmarks, but
> > such as AtomicExpandPass. It seems more natural to ask the programmer
> > to express intent, just as is done with atomics. In fact, a backend is
> > currently free to ignore !nontemporal on load and store and could
> > therefore generate only half of what's requested, leading to incorrect
> > code. That would of course be silly: backends should either honor all
> > !nontemporal or none of them, but who knows what the middle-end does.
>
> > Put another way: some optimized C libraries use non-temporal accesses
> > (when string instructions aren't du jour) and they terminate their
> > copying with an sfence. It's a de-facto convention, the ABI doesn't
> > say anything, but let's avoid divergence.
>
> > Aside: one day we may live in the fence elimination promised land
> > where fences are exactly where they need to be, no more, no less.
>
> > Isn't x86's lfence just a no-op?
>
> > Yes, but we're proposing the addition of a target-independent
> > non-temporal load barrier. It'll be up to the x86 backend to make it
> > an X86ISD::MEMBARRIER and other backends to get it right (hint: it's
> > not always a no-op).
>
> > Won't this optimization cause coherency misses? C++ accesses the
> > thread stack concurrently all the time!
>
> > Maybe, but then it isn't much of an optimization if it's slowing code
> > down. LLVM doesn't just target C++, and it's really up to the backend
> > to decide whether one fence type is better than another (on x86,
> > whether a locked top-of-stack idempotent operation is better than
> > mfence). Other languages have private stacks where this isn't an
> > issue, and where the stack top can reasonably be assumed to be in
> > cache.
>
> > How will this affect non-user-mode code (i.e. kernel code)?
>
> > Kernel code still has to ask for _mm_mfence if it wants mfence: C11
> > and C++11 barriers aren't specified as a specific instruction.
>
> > Is it safe to access top-of-stack?
>
> > AFAIK yes, and the ABI-specified red zone has our back (or front if
> > the stack grows up ☻).
>
> > What about non-x86 architectures?
>
> > Architectures such as ARMv8 support non-temporal instructions and
> > require barriers such as DMB nshld to order loads and DMB nshst to
> > order stores.
>
> > Even ARM's address-dependency rule (a.k.a. the ill-fated
> > std::memory_order_consume) fails to hold with non-temporals:
>
> > > LDR X0, [X3]
> > > LDNP X2, X1, [X0] // X0 may not be loaded when the instruction executes!
>
> > Who uses non-temporals anyways?
>
> > That's an awfully personal question!
> 
-- 
Hal Finkel
Assistant Computational Scientist
Leadership Computing Facility
Argonne National Laboratory
I agree with Tim's assessment for ARM. That's interesting; I wasn't previously aware of that instruction.
My understanding is that Alpha would have the same problem for normal loads.
I'm all in favor of more systematic handling of the fences associated with x86 non-temporal accesses.
AFAICT, nontemporal loads and stores seem to have different fencing rules on x86, none of them very clear. Nontemporal stores should probably ideally use an SFENCE. Locked instructions seem to be documented to work with MOVNTDQA. In both cases, there seems to be only empirical evidence as to which side(s) of the nontemporal operations they should go on?
I finally decided that I was OK with using a LOCKed top-of-stack update as a fence in Java on x86. I'm significantly less enthusiastic for C++. I also think that risks unexpected coherence miss problems, though they would probably be very rare. But they would be very surprising if they did occur.
On Wed, Jan 13, 2016 at 10:59 AM, Tim Northover <t.p.no...@gmail.com> wrote:
> > I haven't touched ARMv8 in a few years so I'm rusty on the non-temporal
> > details for that ISA. I lifted this example from here:
> >
> > http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.den0024a/CJACGJJF.html
> >
> > Which is correct?
> FWIW, I agree with John here. The example I'd give for the unexpected
> behaviour allowed in the spec is:
> .Lwait_for_data:
>     ldr x0, [x3]
>     cbz x0, .Lwait_for_data
>     ldnp x2, x1, [x0]
> where another thread first writes to a buffer then tells us where that
> buffer is. For a normal ldp, the address dependency rule means we
> don't need a barrier or acquiring load to ensure we see the real data
> in the buffer. For ldnp, we would need a barrier to prevent stale
> data.
> I suspect this is actually even closer to the x86 situation than what
> the guide implies (which looks like a straight-up exposed pipeline to
> me, beyond even what Alpha would have done).
> Cheers.
> Tim.
----- Original Message -----
> From: "JF Bastien" <j...@google.com>
> To: "Hal Finkel" <hfi...@anl.gov>
> Cc: "Philip Reames" <list...@philipreames.com>, "Hans Boehm" <hbo...@google.com>, "llvm-dev" <llvm...@lists.llvm.org>
> Sent: Thursday, January 14, 2016 3:02:20 PM
> Subject: Re: [llvm-dev] RFC: non-temporal fencing in LLVM IR
>
> On Thu, Jan 14, 2016 at 12:51 PM, Hal Finkel <hfi...@anl.gov> wrote:
>
> > Hi JF, Philip,
> >
> > Clang currently has __builtin_nontemporal_store and
> > __builtin_nontemporal_load. How will the usage model for those
> > change?
>
> I think you would use them in the same way, but you'd have to also
> use __builtin_nontemporal_store_fence and
> __builtin_nontemporal_load_fence.
So we'll add new fence intrinsics. That makes sense.
> Unless we have LLVM automagically figure out where non-temporal
> fences should go, which I think isn't as good of an approach.
I agree. Such a determination is likely to be too conservative in practice.
-Hal
On Thu, Jan 14, 2016 at 1:37 PM, JF Bastien <j...@google.com> wrote:
> On Thu, Jan 14, 2016 at 1:35 PM, David Majnemer <david.m...@gmail.com> wrote:
> > On Thu, Jan 14, 2016 at 1:13 PM, JF Bastien <j...@google.com> wrote:
> > > On Thu, Jan 14, 2016 at 1:10 PM, David Majnemer via llvm-dev <llvm...@lists.llvm.org> wrote:
> > > > On Wed, Jan 13, 2016 at 7:00 PM, Hans Boehm via llvm-dev <llvm...@lists.llvm.org> wrote:
> > > > > I agree with Tim's assessment for ARM. That's interesting; I wasn't previously aware of that instruction.
> > > > > My understanding is that Alpha would have the same problem for normal loads.
> > > > > I'm all in favor of more systematic handling of the fences associated with x86 non-temporal accesses.
> > > > > AFAICT, nontemporal loads and stores seem to have different fencing rules on x86, none of them very clear. Nontemporal stores should probably ideally use an SFENCE. Locked instructions seem to be documented to work with MOVNTDQA. In both cases, there seems to be only empirical evidence as to which side(s) of the nontemporal operations they should go on?
> > > > > I finally decided that I was OK with using a LOCKed top-of-stack update as a fence in Java on x86. I'm significantly less enthusiastic for C++. I also think that risks unexpected coherence miss problems, though they would probably be very rare. But they would be very surprising if they did occur.
> > > > Today's LLVM already emits 'lock or %eax, (%esp)' for 'fence seq_cst'/__sync_synchronize/__atomic_thread_fence(__ATOMIC_SEQ_CST) when targeting 32-bit x86 machines which do not support mfence. What instruction sequence should we be using instead?
> > > Do they have non-temporal accesses in the ISA?
> > I thought not but there appear to be instructions like movntps. mfence was introduced in SSE2 while movntps and sfence were introduced in SSE.
> So the new builtin could be sfence? I think the codegen you point out for SEQ_CST is fine if we fix the memory model as suggested.
I agree that it's fine to use a locked instruction as a seq_cst fence if MFENCE is not available.
If you have to dirty a cache line, (%esp) seems like a relatively safe one.
(I'm assuming that CPUID is appreciably slower and out of the running? I haven't tried. But it also probably clobbers too many registers.)
It's only the idea of writing to a memory location when MFENCE is available, and could be used instead, that seems questionable.
What exactly would the non-temporal fences be? It seems that on x86, the load and store case may differ. In theory, there's also a before vs. after question. In practice code using MOVNTA seems to assume that you only need an SFENCE afterwards. I can't back that up with spec verbiage. I don't know about MOVNTDQA. What about ARM?
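For reference, the de-facto x86 convention mentioned here (non-temporal stores followed by a single SFENCE before publishing) looks roughly like the sketch below. The function names fill_streaming and fill_and_publish are hypothetical. The intrinsics _mm_stream_si32 and _mm_sfence are the standard SSE2/SSE ones; on targets without SSE2 the sketch falls back to plain stores, which is an assumption made only so the example compiles everywhere.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#if defined(__SSE2__)
#include <emmintrin.h>  // _mm_stream_si32; also pulls in _mm_sfence
#endif

// Hypothetical helper: fill dst[0..n) with value, using non-temporal
// stores where SSE2 is available and ordinary stores otherwise.
void fill_streaming(int* dst, std::size_t n, int value) {
    assert(dst != nullptr || n == 0);
    for (std::size_t i = 0; i < n; ++i) {
#if defined(__SSE2__)
        _mm_stream_si32(dst + i, value);  // weakly ordered streaming store
#else
        dst[i] = value;                   // fallback: ordinary store
#endif
    }
#if defined(__SSE2__)
    // The convention under discussion: one SFENCE after the streaming
    // stores, so they are ordered before any later store.
    _mm_sfence();
#endif
}

// Fill the buffer, then publish a flag that readers acquire.
void fill_and_publish(int* dst, std::size_t n, int value,
                      std::atomic<bool>& ready) {
    fill_streaming(dst, n, value);
    ready.store(true, std::memory_order_release);
}
```

Whether a fence is also needed before the streaming stores is exactly the open question raised above; this sketch only encodes the "SFENCE afterwards" habit.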
> I haven't touched ARMv8 in a few years so I'm rusty on the non-temporal
> details for that ISA. I lifted this example from here:
>
> http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.den0024a/CJACGJJF.html
>
> Which is correct?
I’ve confirmed that this example in the Cortex-A programmers guide is wrong, and it should
hopefully be corrected in a future version.
John
> I agree that it's fine to use a locked instruction as a seq_cst fence if MFENCE is not available.
It's not clear to me this is true if the seq_cst fence is expected to fence non-temporal stores. I think in practice, you'd be very unlikely to notice a difference, but I can't point to anything in the Intel docs which justifies a lock prefixed instruction as sufficient to fence any non-temporal access.
> If you have to dirty a cache line, (%esp) seems like a relatively safe one.
Agreed. As we discussed previously, it is possible to get false sharing in C++, but this would require one thread to be accessing information stored in the last frame of another running thread's stack. That seems sufficiently unlikely to be ignored.
> It's only the idea of writing to a memory location when MFENCE is available, and could be used instead, that seems questionable.
While in principle I agree, it appears in practice that this tradeoff is worthwhile. The hardware doesn't seem to optimize for the MFENCE case whereas lock prefix instructions appear to be handled much better.
> What exactly would the non-temporal fences be? It seems that on x86, the load and store case may differ. In theory, there's also a before vs. after question. In practice code using MOVNTA seems to assume that you only need an SFENCE afterwards. I can't back that up with spec verbiage. I don't know about MOVNTDQA. What about ARM?
I'll leave this to JF to answer. I'm not knowledgeable enough about non-temporals to answer without substantial research first.
I'm proposing two builtins:
- __builtin_nontemporal_load_fence
- __builtin_nontemporal_store_fence
If I've got this right, on x86 they would respectively be a nop and sfence.
They otherwise act as memory code motion barriers unless accesses are proven to not alias. I think it may be possible to loosen the rule so they act closer to acquire/release (allowing accesses to move into the pair) but I'm not convinced that this works for every ISA so I'd err on the side of caution (since this can be loosened later).
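Since the two fence builtins are only proposed here, the sketch below uses hypothetical stand-in functions (nontemporal_load_fence, nontemporal_store_fence, nt_store, nt_load, round_trip) to show how the x86 mapping described above might look from user code. Clang's existing __builtin_nontemporal_store and __builtin_nontemporal_load are used when the compiler reports them; everything else, including the conservative seq_cst fallback on non-x86 targets, is an assumption of this sketch and not a statement about what LLVM would emit.

```cpp
#include <atomic>
#include <cassert>
#if defined(__SSE__)
#include <xmmintrin.h>  // _mm_sfence
#endif

#ifndef __has_builtin
#define __has_builtin(x) 0  // compilers without the query: assume no builtin
#endif

// Hypothetical stand-in for the proposed __builtin_nontemporal_load_fence.
inline void nontemporal_load_fence() {
#if defined(__x86_64__) || defined(__i386__)
    // x86 mapping from the thread: no instruction, only a barrier that
    // stops the compiler from moving memory accesses across this point.
    std::atomic_signal_fence(std::memory_order_seq_cst);
#else
    // Assumption: a conservative placeholder for other ISAs.
    std::atomic_thread_fence(std::memory_order_seq_cst);
#endif
}

// Hypothetical stand-in for the proposed __builtin_nontemporal_store_fence.
inline void nontemporal_store_fence() {
#if defined(__SSE__)
    std::atomic_signal_fence(std::memory_order_seq_cst);  // compiler barrier
    _mm_sfence();                                         // x86 mapping: sfence
#else
    // Assumption: a conservative placeholder where sfence is unavailable.
    std::atomic_thread_fence(std::memory_order_seq_cst);
#endif
}

// Non-temporal store when Clang's builtin exists, ordinary store otherwise.
inline void nt_store(int* p, int v) {
    assert(p != nullptr);
#if __has_builtin(__builtin_nontemporal_store)
    __builtin_nontemporal_store(v, p);
#else
    *p = v;
#endif
}

// Non-temporal load when Clang's builtin exists, ordinary load otherwise.
inline int nt_load(int* p) {
    assert(p != nullptr);
#if __has_builtin(__builtin_nontemporal_load)
    return __builtin_nontemporal_load(p);
#else
    return *p;
#endif
}

// Store, fence both ways, then load back from the same location.
inline int round_trip(int* slot, int v) {
    nt_store(slot, v);
    nontemporal_store_fence();
    nontemporal_load_fence();
    return nt_load(slot);
}
```

Keeping both fences as full code motion barriers matches the cautious starting point described above; loosening them toward acquire/release semantics would be a later change.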