Prefetching and false sharing


Duarte Nunes

Jan 29, 2017, 12:04:52 PM
to mechanical-sympathy
Hi all,

In the latest Intel optimization manual, we can read in section "2.3.5.4 Data Prefetching":

Spatial Prefetcher: This prefetcher strives to complete every cache line fetched to the L2 cache with the pair line that completes it to a 128-byte aligned chunk.

I take this to mean that adjacent cache lines are brought together from memory (as often as possible). Indeed, there is code (e.g. JCTools, Folly) that assumes the false sharing range is 128 bytes and pads accordingly.
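For reference, the 128-byte convention those libraries follow can be sketched like this (hypothetical type name, standard C++; this is my illustration, not JCTools/Folly code):

```cpp
#include <atomic>

// Hypothetical counter padded to the presumed 128-byte false-sharing
// range (two 64-byte lines), in the style of JCTools/Folly padding.
// alignas both aligns the struct and rounds sizeof up to 128, so
// adjacent array elements occupy distinct 128-byte aligned chunks.
struct alignas(128) padded_counter {
    std::atomic<long> value{0};
};

static_assert(sizeof(padded_counter) == 128, "one counter per 128-byte chunk");
static_assert(alignof(padded_counter) == 128, "starts on a 128-byte boundary");
```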

However, further exegesis of the manual reveals no mention of prefetching in the context of false sharing, and save for the NetBurst microarchitecture, all the advice is to place variables in separate cache lines:

On Pentium M, Intel Core Solo, Intel Core Duo processors, and processors based on Intel Core microarchitecture; a synchronization variable should be placed alone and in separate cache line to avoid false-sharing. Software must not allow a synchronization variable to span across page boundary.

Similarly, in the Linux kernel the false sharing range seems to be just a cache line (64 bytes).

In my own tests demonstrating false sharing (https://gist.github.com/duarten/b7ee60b4412596440a97498d87bf402e), I saw no difference whether values are 1 or 2 cache lines apart, but that's only relevant for the microarchitecture I'm on (Haswell-E).

Am I missing something?

Cheers,
Duarte

Rajiv Kurian

Jan 29, 2017, 12:26:14 PM
to mechanical-sympathy
I don't think your code does proper alignment. You malloc the array of padded_long structs, and malloc does not respect the alignment attribute on structs, as far as I remember; the attribute only works for stack-allocated structs. Maybe put in an assert on the address to verify alignment. I would use the non-portable posix_memalign, the portable but sometimes unsupported aligned_alloc, or just allocate extra memory and find the right aligned boundary myself.
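A sketch of that suggestion (the `padded_long` here is a hypothetical stand-in for the gist's struct; posix_memalign is POSIX-only):

```cpp
#include <cstddef>
#include <cstdlib>

// Hypothetical stand-in for the gist's padded struct.
struct alignas(128) padded_long {
    long value;
};

// malloc only guarantees alignof(max_align_t) (typically 16 bytes), so it
// may ignore the 128-byte alignment requested above. posix_memalign asks
// for the boundary explicitly; returns nullptr on failure.
padded_long* alloc_padded(std::size_t n) {
    void* p = nullptr;
    if (posix_memalign(&p, alignof(padded_long), n * sizeof(padded_long)) != 0)
        return nullptr;
    return static_cast<padded_long*>(p);
}
```

std::aligned_alloc (C11/C++17) or over-allocating with malloc and rounding the pointer up by hand are the alternatives mentioned above.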

Duarte Nunes

Jan 29, 2017, 12:54:10 PM
to mechanical-sympathy


On Sunday, January 29, 2017 at 6:26:14 PM UTC+1, Rajiv Kurian wrote:
I don't think your code does proper alignment. You malloc the array of padded_long structs, and malloc does not respect the alignment attribute on structs, as far as I remember; the attribute only works for stack-allocated structs. Maybe put in an assert on the address to verify alignment. I would use the non-portable posix_memalign, the portable but sometimes unsupported aligned_alloc, or just allocate extra memory and find the right aligned boundary myself.

It works for me. I can put in the assert or just add a uint64_t pad[0/64/128] field to the struct.

Avi Kivity

Jan 29, 2017, 1:27:03 PM
to mechanica...@googlegroups.com

You should test with multiple NUMA nodes, or false sharing becomes true sharing at LLC.
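For placing the writer threads on cores known to sit on different NUMA nodes, a Linux-only sketch (the CPU-id-to-node mapping comes from lscpu or libnuma; error handling kept minimal):

```cpp
#include <sched.h>  // sched_setaffinity, CPU_ZERO/CPU_SET (glibc, _GNU_SOURCE)

// Pin the calling thread to a single CPU so each writer in a false-sharing
// test can be bound to a core on a chosen NUMA node. Returns true on success.
bool pin_to_cpu(int cpu) {
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    return sched_setaffinity(0, sizeof(set), &set) == 0;  // 0 = this thread
}
```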

--
You received this message because you are subscribed to the Google Groups "mechanical-sympathy" group.
To unsubscribe from this group and stop receiving emails from it, send an email to mechanical-symp...@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Vitaly Davidovich

Jan 29, 2017, 2:16:53 PM
to mechanica...@googlegroups.com
This.

Also, I think the (Intel) adjacent sector prefetch is a feature enabled through BIOS.  I think that will pull the adjacent line to L1, whereas the spatial prefetcher is probably for streaming accesses that are loading L2.

Also, I'd run the bench without atomic ops - just relaxed (atomic) or volatile writes.
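A sketch of the two inner-loop variants (on x86 a relaxed store compiles to a plain MOV, while fetch_add compiles to a LOCK'd read-modify-write):

```cpp
#include <atomic>

// Relaxed stores still dirty the cache line, so false sharing is still
// exercised through the coherence protocol, but without a locked
// instruction inflating the absolute numbers.
void writer_store(std::atomic<long>& slot, long iters) {
    for (long i = 0; i < iters; ++i)
        slot.store(i, std::memory_order_relaxed);
}

// The atomic-op variant: a locked read-modify-write per iteration.
void writer_rmw(std::atomic<long>& slot, long iters) {
    for (long i = 0; i < iters; ++i)
        slot.fetch_add(1, std::memory_order_relaxed);
}
```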
--
Sent from my phone

Duarte Nunes

Jan 29, 2017, 2:33:46 PM
to mechanical-sympathy


On Sunday, January 29, 2017 at 7:27:03 PM UTC+1, Avi Kivity wrote:

You should test with multiple NUMA nodes, or false sharing becomes true sharing at LLC.



You're right, of course. With multiple NUMA nodes I clearly see a difference. For 128-byte padding:

1339 mticks, node 1, cpu 13
1875 mticks, node 1, cpu 13
2208 mticks, node 1, cpu 31
2259 mticks, node 0, cpu 4
2267 mticks, node 1, cpu 24
2278 mticks, node 1, cpu 28
2286 mticks, node 1, cpu 26
2290 mticks, node 1, cpu 10
2289 mticks, node 1, cpu 13
2330 mticks, node 1, cpu 25
2332 mticks, node 1, cpu 27
2307 mticks, node 1, cpu 12
2361 mticks, node 1, cpu 15
2361 mticks, node 1, cpu 8
2337 mticks, node 0, cpu 20
2395 mticks, node 0, cpu 7
2402 mticks, node 0, cpu 6
2403 mticks, node 1, cpu 30
2385 mticks, node 1, cpu 14
2416 mticks, node 0, cpu 22
2419 mticks, node 0, cpu 23
2430 mticks, node 0, cpu 5
2438 mticks, node 0, cpu 17
2435 mticks, node 0, cpu 21
2451 mticks, node 0, cpu 1
2454 mticks, node 0, cpu 2
2461 mticks, node 0, cpu 3
2473 mticks, node 0, cpu 18
2474 mticks, node 0, cpu 19
3149 mticks, node 0, cpu 4
4182 mticks, node 1, cpu 29
4647 mticks, node 0, cpu 16
main 4653 mticks, node 1, cpu 15

For 64-byte padding:

8690 mticks, node 1, cpu 28
9003 mticks, node 1, cpu 13
9430 mticks, node 0, cpu 5
9789 mticks, node 1, cpu 28
9869 mticks, node 1, cpu 14
9873 mticks, node 1, cpu 26
9890 mticks, node 0, cpu 17
9898 mticks, node 1, cpu 27
9904 mticks, node 1, cpu 31
9911 mticks, node 1, cpu 30
9998 mticks, node 1, cpu 15
10042 mticks, node 0, cpu 23
10068 mticks, node 1, cpu 25
10076 mticks, node 0, cpu 18
10077 mticks, node 1, cpu 10
10079 mticks, node 1, cpu 14
10093 mticks, node 1, cpu 12
10139 mticks, node 0, cpu 5
10148 mticks, node 0, cpu 19
10159 mticks, node 1, cpu 29
10207 mticks, node 0, cpu 7
10214 mticks, node 0, cpu 2
10240 mticks, node 1, cpu 24
10259 mticks, node 0, cpu 16
10261 mticks, node 0, cpu 1
10292 mticks, node 0, cpu 3
10319 mticks, node 0, cpu 0
10326 mticks, node 0, cpu 22
10353 mticks, node 0, cpu 21
10663 mticks, node 1, cpu 11
10903 mticks, node 0, cpu 23
11318 mticks, node 0, cpu 20
main 11319 mticks, node 1, cpu 15

And finally, no padding:

16174 mticks, node 0, cpu 6
16441 mticks, node 0, cpu 22
27226 mticks, node 1, cpu 15
28862 mticks, node 0, cpu 22
29535 mticks, node 0, cpu 17
31532 mticks, node 1, cpu 29
32141 mticks, node 0, cpu 5
32601 mticks, node 0, cpu 4
32917 mticks, node 1, cpu 28
33464 mticks, node 1, cpu 10
33677 mticks, node 0, cpu 2
34329 mticks, node 0, cpu 19
34673 mticks, node 1, cpu 30
34840 mticks, node 0, cpu 21
35238 mticks, node 0, cpu 1
35386 mticks, node 0, cpu 23
35435 mticks, node 0, cpu 6
35638 mticks, node 1, cpu 15
35673 mticks, node 0, cpu 16
35940 mticks, node 0, cpu 2
36180 mticks, node 1, cpu 15
36747 mticks, node 0, cpu 4
36902 mticks, node 0, cpu 3
39147 mticks, node 1, cpu 29
39424 mticks, node 1, cpu 13
39815 mticks, node 1, cpu 10
40235 mticks, node 1, cpu 11
40313 mticks, node 1, cpu 9
41448 mticks, node 1, cpu 12
42617 mticks, node 1, cpu 24
43273 mticks, node 1, cpu 14
43288 mticks, node 1, cpu 15
main 43324 mticks, node 1, cpu 11

 

Duarte Nunes

Jan 29, 2017, 2:34:39 PM
to mechanical-sympathy
Forgot to mention that I had to use an AWS machine, where CPU counters are not available, so I'm not posting those.

Duarte Nunes

Jan 29, 2017, 2:37:15 PM
to mechanical-sympathy


On Sunday, January 29, 2017 at 8:16:53 PM UTC+1, Vitaly Davidovich wrote:
This.

Also, I think the (Intel) adjacent sector prefetch is a feature enabled through BIOS.  I think that will pull the adjacent line to L1, whereas the spatial prefetcher is probably for streaming accesses that are loading L2.


I think that was for the Pentium 4, i.e. the NetBurst microarchitecture. I can't find any reference to it in the newer manuals. The spatial prefetcher is different from the streamer (section 2.3.5.4).

 

Also, I'd run the bench without atomic ops - just relaxed (atomic) or volatile writes.


It's not clear to me why that would matter?

 


Vitaly Davidovich

Jan 29, 2017, 3:07:08 PM
to mechanica...@googlegroups.com
Ok, it's possible that it no longer exists.

As for not using atomic ops: it's just to minimize the impact of atomics on the absolute numbers and isolate the test to the false-sharing aspect. The relative performance would still show the effect, but I'd think the absolute numbers would be different.

Avi Kivity

Jan 29, 2017, 3:59:12 PM
to mechanica...@googlegroups.com
Surely there's a small lab with a few NUMA machines you can access somewhere?

Greg Young

Jan 29, 2017, 6:00:26 PM
to mechanica...@googlegroups.com
I am not sure how much I would trust an AWS machine in general for benchmarks.


--
Studying for the Turing test

Duarte Nunes

Jan 29, 2017, 6:13:57 PM
to mechanica...@googlegroups.com
On Mon, Jan 30, 2017 at 12:00 AM Greg Young <gregor...@gmail.com> wrote:
I am not sure how much I would trust an AWS machine in general for benchmarks

I'll repeat the tests on real hardware, but since I was mainly interested in relative numbers, I think it's okay.

Duarte Nunes

Jan 31, 2017, 5:13:00 PM
to mechanical-sympathy
I found a small lab with NUMA machines. The results match what I observed on AWS.

For 64-byte alignment:

$ perf stat -e instructions,cycles,cache-misses,L1-dcache-load-misses,LLC-load-misses,node-load-misses ./a.out 12

    24,833,266,980      instructions              #    0.44  insn per cycle           (83.36%)
    56,995,776,488      cycles                                                        (83.33%)
        49,379,695      cache-misses                                                  (83.31%)
        84,920,864      L1-dcache-load-misses                                         (83.33%)
        25,296,056      LLC-load-misses                                               (83.35%)
        25,023,297      node-load-misses                                              (66.72%)

       1.577588839 seconds time elapsed


For 128-byte alignment:

    23,692,465,156      instructions              #    0.92  insn per cycle           (83.41%)
    25,846,004,438      cycles                                                        (83.31%)
         5,961,883      cache-misses                                                  (83.31%)
         7,139,079      L1-dcache-load-misses                                         (83.34%)
         3,008,220      LLC-load-misses                                               (83.35%)
         2,991,075      node-load-misses                                              (66.77%)

       1.084108707 seconds time elapsed


With false sharing:

    24,115,447,790      instructions              #    0.07  insn per cycle           (83.33%)
   343,250,966,435      cycles                                                        (83.33%)
       118,898,229      cache-misses                                                  (83.33%)
       192,488,985      L1-dcache-load-misses                                         (83.34%)
        25,084,968      LLC-load-misses                                               (83.34%)
        24,735,178      node-load-misses                                              (66.66%)

       6.573246235 seconds time elapsed

The question remains why the Intel manuals don't suggest 128-byte alignment and why it seemingly hasn't been adopted in the Linux kernel. Maybe to save space, since the difference is less significant?
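Worth noting: C++17 standardized a portable name for exactly this padding granularity, and implementations are free to report 64 or 128 on x86-64. A sketch (the feature-test guard is needed because library support arrived late, e.g. GCC 12):

```cpp
#include <cstddef>
#include <new>

// C++17's name for the false-sharing range; fall back conservatively to
// 128 bytes (the value discussed in this thread) when the standard
// library doesn't provide it.
#ifdef __cpp_lib_hardware_interference_size
constexpr std::size_t fs_range = std::hardware_destructive_interference_size;
#else
constexpr std::size_t fs_range = 128;
#endif

struct alignas(fs_range) isolated_counter {
    long value;
};

static_assert(sizeof(isolated_counter) % 64 == 0, "padded to whole cache lines");
```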

-Duarte

PS: This is the hardware:

Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                32
On-line CPU(s) list:   0-31
Thread(s) per core:    2
Core(s) per socket:    8
Socket(s):             2
NUMA node(s):          2
Vendor ID:             GenuineIntel
CPU family:            6
Model:                 63
Model name:            Intel(R) Xeon(R) CPU E5-2630 v3 @ 2.40GHz
Stepping:              2
CPU MHz:               1200.292
CPU max MHz:           3200.0000
CPU min MHz:           1200.0000
BogoMIPS:              4794.50
Virtualization:        VT-x
L1d cache:             32K
L1i cache:             32K
L2 cache:              256K
L3 cache:              20480K
NUMA node0 CPU(s):     0,2,4,6,8,10,12,14,16,18,20,22,24,26,28,30
NUMA node1 CPU(s):     1,3,5,7,9,11,13,15,17,19,21,23,25,27,29,31


Nitsan Wakart

Feb 6, 2017, 9:18:21 AM
to mechanica...@googlegroups.com
I'm a bit late to the party. Yes, JCTools pads to 128 bytes; this was based on measurement and was visible on non-NUMA setups. See notes here:
Look for the comparison between Y8 and Y83, which compares two identical queues that differ only in padding size (64 vs 128). Tests were run on Ubuntu 13.04/JDK7u40/i7-4...@2.40GHz (so a Haswell laptop CPU, three generations old at the time).