> Spatial Prefetcher: This prefetcher strives to complete every cache line fetched to the L2 cache with the pair line that completes it to a 128-byte aligned chunk.

> On Pentium M, Intel Core Solo, Intel Core Duo processors, and processors based on Intel Core microarchitecture; a synchronization variable should be placed alone and in separate cache line to avoid false-sharing. Software must not allow a synchronization variable to span across page boundary.
I don't think your code does proper alignment. You malloc the array of padded_long structs, and malloc does not respect the aligned attribute on structs, as far as I remember; the attribute only works for statically and stack-allocated structs, AFAIR. Maybe put in an assert on the address to verify alignment. I would use the non-portable posix_memalign, the portable but sometimes unsupported aligned_alloc, or just allocate extra memory and find the right aligned boundary myself.
You should test with multiple NUMA nodes, or false sharing becomes true sharing at LLC.
--
You received this message because you are subscribed to the Google Groups "mechanical-sympathy" group.
To unsubscribe from this group and stop receiving emails from it, send an email to mechanical-symp...@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.
This. Also, I think the (Intel) adjacent sector prefetch is a feature enabled through the BIOS. I think that will pull the adjacent line into L1, whereas the spatial prefetcher is probably for streaming accesses that are loading into L2.
Also, I'd run the bench without atomic RMW ops - just relaxed (atomic) or volatile writes.
On Sun, Jan 29, 2017 at 1:27 PM Avi Kivity <a...@scylladb.com> wrote:
You should test with multiple NUMA nodes, or false sharing becomes true sharing at LLC.
On 01/29/2017 07:04 PM, Duarte Nunes wrote:
Hi all,
In the latest Intel optimization manual, we can read in section "2.3.5.4 Data Prefetching":
Spatial Prefetcher: This prefetcher strives to complete every cache line fetched to the L2 cache with the pair line that completes it to a 128-byte aligned chunk.
I take this to mean that adjacent cache lines are brought together from memory (as often as possible). Indeed, there is code (e.g. JCTools, Folly) that assumes the false sharing range is 128 bytes and pads accordingly.
However, doing some more exegesis on the manual reveals there is no mention of prefetching in the context of false sharing, and save for the NetBurst microarchitectures, all advice seems to be to place variables in different cache lines:
On Pentium M, Intel Core Solo, Intel Core Duo processors, and processors based on Intel Core microarchitecture; a synchronization variable should be placed alone and in separate cache line to avoid false-sharing. Software must not allow a synchronization variable to span across page boundary.
Similarly, in the Linux kernel the false sharing range seems to be just a cache line (64 bytes).
I myself saw no difference between values placed 1 or 2 cache lines apart when running tests to demonstrate false sharing (https://gist.github.com/duarten/b7ee60b4412596440a97498d87bf402e), but that's only relevant for the microarchitecture I'm on (Haswell-E).
Am I missing something?
Cheers,
Duarte
On Sunday, January 29, 2017 at 8:16:53 PM UTC+1, Vitaly Davidovich wrote:
> This. Also, I think the (Intel) adjacent sector prefetch is a feature enabled through BIOS. I think that will pull the adjacent line to L1, whereas the spatial prefetcher is probably for streaming accesses that are loading L2.

I think that was for Pentium 4, for the NetBurst microarchitecture. I can't find any reference to it in the new manuals. The spatial prefetcher is different from the streamer (section 2.3.5.4).

> Also, I'd run the bench without atomic ops - just relaxed (atomic) or volatile writes.

It's not clear to me why that would matter?
I am not sure how much I would trust an AWS machine in general for benchmarks.