Prefetching and false sharing

Duarte Nunes

Jan 29, 2017, 12:04:52 PM
to mechanical-sympathy
Hi all,

In the latest Intel optimization manual, we can read in section "2.3.5.4 Data Prefetching":

Spatial Prefetcher: This prefetcher strives to complete every cache line fetched to the L2 cache with the pair line that completes it to a 128-byte aligned chunk.

I take this to mean that adjacent cache lines are brought together from memory (as often as possible). Indeed, there is code (e.g. JCTools, Folly) that assumes the false sharing range is 128 bytes and pads accordingly.

However, some more exegesis on the manual reveals no mention of prefetching in the context of false sharing, and save for the NetBurst microarchitecture, all the advice seems to be to place variables in different cache lines:

On Pentium M, Intel Core Solo, Intel Core Duo processors, and processors based on Intel Core microarchitecture; a synchronization variable should be placed alone and in separate cache line to avoid false-sharing. Software must not allow a synchronization variable to span across page boundary.

Similarly, in the Linux kernel the false sharing range seems to be just a cache line (64 bytes).

I myself saw no difference between placing values 1 or 2 cache lines apart when running tests to demonstrate false sharing (https://gist.github.com/duarten/b7ee60b4412596440a97498d87bf402e), but that's only relevant for the microarchitecture I'm testing on (Haswell-E).
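
(For those who don't want to follow the link: the test boils down to something like the sketch below. This is illustrative, not the exact gist code.)

#include <atomic>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <thread>

// Bytes between the two counters: 64 (one line apart) or 128 (two).
constexpr std::size_t kPad = 64;

struct alignas(128) shared_counters {
    std::atomic<std::uint64_t> a{0};
    char pad[kPad];
    std::atomic<std::uint64_t> b{0};
};

int main() {
    shared_counters s;  // stack allocation honors the alignas
    auto spin = [](std::atomic<std::uint64_t>& v) {
        for (std::uint64_t i = 0; i < 100000000; ++i)
            v.fetch_add(1, std::memory_order_relaxed);
    };
    std::thread t1(spin, std::ref(s.a));
    std::thread t2(spin, std::ref(s.b));
    t1.join();
    t2.join();
}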

Am I missing something?

Cheers,
Duarte

Rajiv Kurian

Jan 29, 2017, 12:26:14 PM
to mechanical-sympathy
I don't think your code does proper alignment. You malloc the array of padded_long structs, and as far as I remember malloc does not respect the aligned attribute on structs; the alignment only works for stack-allocated objects. Maybe put in an assert on the address to verify alignment. I would use the non-portable posix_memalign, the portable but sometimes unsupported aligned_alloc, or just allocate extra memory and find the right aligned boundary myself.
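
For example (a sketch, assuming C++17 for std::aligned_alloc; error handling omitted):

#include <cassert>
#include <cstdint>
#include <cstdlib>

struct alignas(128) padded_long {
    std::uint64_t value;
};

int main() {
    // std::aligned_alloc: the size must be a multiple of the alignment
    // (sizeof(padded_long) is already 128 here because of the alignas).
    void* mem = std::aligned_alloc(alignof(padded_long),
                                   16 * sizeof(padded_long));

    // POSIX alternative:
    //   posix_memalign(&mem, alignof(padded_long), 16 * sizeof(padded_long));
    // Or over-allocate with malloc and round up to the boundary by hand:
    //   void* raw = std::malloc(16 * sizeof(padded_long) + 127);
    //   mem = reinterpret_cast<void*>(
    //       (reinterpret_cast<std::uintptr_t>(raw) + 127) & ~std::uintptr_t{127});

    // The assert that verifies the alignment actually holds:
    assert(reinterpret_cast<std::uintptr_t>(mem) % alignof(padded_long) == 0);
    std::free(mem);
}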

Duarte Nunes

Jan 29, 2017, 12:54:10 PM
to mechanical-sympathy


On Sunday, January 29, 2017 at 6:26:14 PM UTC+1, Rajiv Kurian wrote:
I don't think your code does proper alignment. You malloc the array of padded_long structs, and as far as I remember malloc does not respect the aligned attribute on structs; the alignment only works for stack-allocated objects. Maybe put in an assert on the address to verify alignment. I would use the non-portable posix_memalign, the portable but sometimes unsupported aligned_alloc, or just allocate extra memory and find the right aligned boundary myself.

It works for me. I can put in the assert or just add a uint64_t pad[0/64/128] field to the struct.
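
That is, something like the following (illustrative; a zero-length pad array is a GNU extension, so the unpadded case just drops the field):

struct padded_long {
    std::uint64_t value;
    std::uint64_t pad[15];  // 7 gives a 64-byte stride, 15 gives 128
};
static_assert(sizeof(padded_long) == 128, "unexpected struct size");

// plus a runtime check on the actual addresses, e.g.:
//   assert(reinterpret_cast<std::uintptr_t>(&values[0]) % 64 == 0);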

Avi Kivity

Jan 29, 2017, 1:27:03 PM
to mechanica...@googlegroups.com

You should test with multiple NUMA nodes, or false sharing becomes true sharing at LLC.
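
(E.g. by pinning the benchmark threads to CPUs known to sit on different nodes; a sketch with a hypothetical pin_to_cpu helper:)

#include <pthread.h>
#include <sched.h>

// Pin the calling thread to a given CPU; pick CPUs on different NUMA
// nodes (lscpu or "numactl --hardware" shows the CPU-to-node map).
void pin_to_cpu(int cpu) {
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
}

// e.g. one thread calls pin_to_cpu(0) and the other pin_to_cpu(1),
// assuming those CPUs are on different nodes.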


Vitaly Davidovich

Jan 29, 2017, 2:16:53 PM
to mechanica...@googlegroups.com
This.

Also, I think the (Intel) adjacent sector prefetch is a feature enabled through the BIOS.  I think that will pull the adjacent line into L1, whereas the spatial prefetcher is probably for streaming accesses that load into L2.

Also, I'd run the bench without atomic ops - just relaxed (atomic) or volatile writes.
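
Something like this, assuming the v/kIters names from the benchmark loop:

// Instead of an atomic read-modify-write per iteration:
//     v.fetch_add(1, std::memory_order_relaxed);
// use a plain relaxed store, so what's measured is (mostly) just the
// coherence traffic caused by the sharing:
for (std::uint64_t i = 0; i < kIters; ++i)
    v.store(i, std::memory_order_relaxed);
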
--
Sent from my phone

Duarte Nunes

Jan 29, 2017, 2:33:46 PM
to mechanical-sympathy


On Sunday, January 29, 2017 at 7:27:03 PM UTC+1, Avi Kivity wrote:

You should test with multiple NUMA nodes, or false sharing becomes true sharing at LLC.



You're right, of course. With multiple NUMA nodes I clearly see a difference. For 128-byte padding:

1339 mticks, node 1, cpu 13
1875 mticks, node 1, cpu 13
2208 mticks, node 1, cpu 31
2259 mticks, node 0, cpu 4
2267 mticks, node 1, cpu 24
2278 mticks, node 1, cpu 28
2286 mticks, node 1, cpu 26
2290 mticks, node 1, cpu 10
2289 mticks, node 1, cpu 13
2330 mticks, node 1, cpu 25
2332 mticks, node 1, cpu 27
2307 mticks, node 1, cpu 12
2361 mticks, node 1, cpu 15
2361 mticks, node 1, cpu 8
2337 mticks, node 0, cpu 20
2395 mticks, node 0, cpu 7
2402 mticks, node 0, cpu 6
2403 mticks, node 1, cpu 30
2385 mticks, node 1, cpu 14
2416 mticks, node 0, cpu 22
2419 mticks, node 0, cpu 23
2430 mticks, node 0, cpu 5
2438 mticks, node 0, cpu 17
2435 mticks, node 0, cpu 21
2451 mticks, node 0, cpu 1
2454 mticks, node 0, cpu 2
2461 mticks, node 0, cpu 3
2473 mticks, node 0, cpu 18
2474 mticks, node 0, cpu 19
3149 mticks, node 0, cpu 4
4182 mticks, node 1, cpu 29
4647 mticks, node 0, cpu 16
main 4653 mticks, node 1, cpu 15

For 64:

8690 mticks, node 1, cpu 28
9003 mticks, node 1, cpu 13
9430 mticks, node 0, cpu 5
9789 mticks, node 1, cpu 28
9869 mticks, node 1, cpu 14
9873 mticks, node 1, cpu 26
9890 mticks, node 0, cpu 17
9898 mticks, node 1, cpu 27
9904 mticks, node 1, cpu 31
9911 mticks, node 1, cpu 30
9998 mticks, node 1, cpu 15
10042 mticks, node 0, cpu 23
10068 mticks, node 1, cpu 25
10076 mticks, node 0, cpu 18
10077 mticks, node 1, cpu 10
10079 mticks, node 1, cpu 14
10093 mticks, node 1, cpu 12
10139 mticks, node 0, cpu 5
10148 mticks, node 0, cpu 19
10159 mticks, node 1, cpu 29
10207 mticks, node 0, cpu 7
10214 mticks, node 0, cpu 2
10240 mticks, node 1, cpu 24
10259 mticks, node 0, cpu 16
10261 mticks, node 0, cpu 1
10292 mticks, node 0, cpu 3
10319 mticks, node 0, cpu 0
10326 mticks, node 0, cpu 22
10353 mticks, node 0, cpu 21
10663 mticks, node 1, cpu 11
10903 mticks, node 0, cpu 23
11318 mticks, node 0, cpu 20
main 11319 mticks, node 1, cpu 15

And finally, no padding:

16174 mticks, node 0, cpu 6
16441 mticks, node 0, cpu 22
27226 mticks, node 1, cpu 15
28862 mticks, node 0, cpu 22
29535 mticks, node 0, cpu 17
31532 mticks, node 1, cpu 29
32141 mticks, node 0, cpu 5
32601 mticks, node 0, cpu 4
32917 mticks, node 1, cpu 28
33464 mticks, node 1, cpu 10
33677 mticks, node 0, cpu 2
34329 mticks, node 0, cpu 19
34673 mticks, node 1, cpu 30
34840 mticks, node 0, cpu 21
35238 mticks, node 0, cpu 1
35386 mticks, node 0, cpu 23
35435 mticks, node 0, cpu 6
35638 mticks, node 1, cpu 15
35673 mticks, node 0, cpu 16
35940 mticks, node 0, cpu 2
36180 mticks, node 1, cpu 15
36747 mticks, node 0, cpu 4
36902 mticks, node 0, cpu 3
39147 mticks, node 1, cpu 29
39424 mticks, node 1, cpu 13
39815 mticks, node 1, cpu 10
40235 mticks, node 1, cpu 11
40313 mticks, node 1, cpu 9
41448 mticks, node 1, cpu 12
42617 mticks, node 1, cpu 24
43273 mticks, node 1, cpu 14
43288 mticks, node 1, cpu 15
main 43324 mticks, node 1, cpu 11

 

Duarte Nunes

Jan 29, 2017, 2:34:39 PM
to mechanical-sympathy
Forgot to mention that I had to use an AWS machine, where CPU counters are not available, so I'm not posting those.

Duarte Nunes

Jan 29, 2017, 2:37:15 PM
to mechanical-sympathy


On Sunday, January 29, 2017 at 8:16:53 PM UTC+1, Vitaly Davidovich wrote:
This.

Also, I think the (Intel) adjacent sector prefetch is a feature enabled through the BIOS.  I think that will pull the adjacent line into L1, whereas the spatial prefetcher is probably for streaming accesses that load into L2.


I think that was for the Pentium 4, i.e. the NetBurst microarchitecture. I can't find any reference to it in the newer manuals. The spatial prefetcher is different from the streamer (section 2.3.5.4).

 

Also, I'd run the bench without atomic ops - just relaxed (atomic) or volatile writes.


It's not clear to me why that would matter.

 


Vitaly Davidovich

Jan 29, 2017, 3:07:08 PM
to mechanica...@googlegroups.com
Ok, it's possible that it no longer exists.

As for not using the atomic ops: it's just to minimize the impact of atomics on the absolute numbers and isolate the test to just the false-sharing aspect.  The relative performance would still show the effects, but I'd think the absolute numbers would be different.
--
Sent from my phone


Avi Kivity

Jan 29, 2017, 3:59:12 PM
to mechanica...@googlegroups.com
Surely there's a small lab with a few NUMA machines you can access somewhere?

Greg Young

Jan 29, 2017, 6:00:26 PM
to mechanica...@googlegroups.com
I am not sure how much I would trust an AWS machine in general for benchmarks.


--
Studying for the Turing test

Duarte Nunes

Jan 29, 2017, 6:13:57 PM
to mechanica...@googlegroups.com
On Mon, Jan 30, 2017 at 12:00 AM Greg Young <gregor...@gmail.com> wrote:
I am not sure how much I would trust an AWS machine in general for benchmarks.

I'll repeat the tests on real hardware, but since I was mainly interested in relative numbers, I think it's okay.

Duarte Nunes

Jan 31, 2017, 5:13:00 PM
to mechanical-sympathy
I found a small lab with NUMA machines. The results match what I observed on AWS.

For 64-byte alignment:

$ perf stat -e instructions,cycles,cache-misses,L1-dcache-load-misses,LLC-load-misses,node-load-misses ./a.out 12

    24,833,266,980      instructions              #    0.44  insn per cycle           (83.36%)
    56,995,776,488      cycles                                                        (83.33%)
        49,379,695      cache-misses                                                  (83.31%)
        84,920,864      L1-dcache-load-misses                                         (83.33%)
        25,296,056      LLC-load-misses                                               (83.35%)
        25,023,297      node-load-misses                                              (66.72%)

       1.577588839 seconds time elapsed


For 128-byte alignment:

    23,692,465,156      instructions              #    0.92  insn per cycle           (83.41%)
    25,846,004,438      cycles                                                        (83.31%)
         5,961,883      cache-misses                                                  (83.31%)
         7,139,079      L1-dcache-load-misses                                         (83.34%)
         3,008,220      LLC-load-misses                                               (83.35%)
         2,991,075      node-load-misses                                              (66.77%)

       1.084108707 seconds time elapsed


With false sharing:

    24,115,447,790      instructions              #    0.07  insn per cycle           (83.33%)
   343,250,966,435      cycles                                                        (83.33%)
       118,898,229      cache-misses                                                  (83.33%)
       192,488,985      L1-dcache-load-misses                                         (83.34%)
        25,084,968      LLC-load-misses                                               (83.34%)
        24,735,178      node-load-misses                                              (66.66%)

       6.573246235 seconds time elapsed

The question remains why the Intel manuals don't suggest 128-byte alignment and why it seemingly hasn't been adopted in the Linux kernel. Maybe to save space, since the difference is less significant?

-Duarte

PS: This is the hardware:

Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                32
On-line CPU(s) list:   0-31
Thread(s) per core:    2
Core(s) per socket:    8
Socket(s):             2
NUMA node(s):          2
Vendor ID:             GenuineIntel
CPU family:            6
Model:                 63
Model name:            Intel(R) Xeon(R) CPU E5-2630 v3 @ 2.40GHz
Stepping:              2
CPU MHz:               1200.292
CPU max MHz:           3200.0000
CPU min MHz:           1200.0000
BogoMIPS:              4794.50
Virtualization:        VT-x
L1d cache:             32K
L1i cache:             32K
L2 cache:              256K
L3 cache:              20480K
NUMA node0 CPU(s):     0,2,4,6,8,10,12,14,16,18,20,22,24,26,28,30
NUMA node1 CPU(s):     1,3,5,7,9,11,13,15,17,19,21,23,25,27,29,31


Nitsan Wakart

Feb 6, 2017, 9:18:21 AM
to mechanica...@googlegroups.com
I'm a bit late to the party. Yes, JCTools pads to 128 bytes; this was based on measurement and was visible on non-NUMA setups. See notes here:
Look for the comparison between Y8 and Y83, which compares two queues identical except for the padding size (64 vs 128). Tests were run on Ubuntu 13.04/JDK7u40/i7-4...@2.40GHz (so a three-generations-old Intel Haswell laptop CPU).