Kyle,
False sharing (in the context of cache lines) is a problem only when the same memory field (and its containing cache line) is repeatedly and very hotly accessed by multiple threads, and hotly written to by at least one of them. Since a writing core must exclusively own the cache line it is writing to, such access patterns cause the cache line to "fly around" between cores on many of the accesses, dramatically reducing the effectiveness of the processor's nearest cache levels. It is important to note that false sharing (of this type) only degrades hot field access from L1-cache hitting to cache missing (possibly L2 missing or L3 missing, depending on CPU architecture and cache topology). When enough hyper-threads are contending for the same cache line (many more than 5; think 100), the serialization on cache line access can result in access speeds even slower than un-contended DRAM access, but those are still fast compared to disk or SSD access speeds.
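To make the "hotly written by multiple threads" pattern concrete, here is a minimal sketch (class and field names are my own, not from your benchmark): two threads each hammer their *own* slot of an AtomicLongArray, but the two slots sit on the same 64-byte cache line (a typical x86 line size; an assumption), so every write forces the line to bounce between cores. Moving one slot a full line away removes the bouncing without changing what the program computes.

```java
import java.util.concurrent.atomic.AtomicLongArray;

// Minimal false-sharing sketch (illustrative, not a rigorous benchmark).
public class FalseSharingSketch {
    // 16 longs span two 64-byte cache lines on typical x86 (an assumption).
    static final AtomicLongArray counters = new AtomicLongArray(16);

    // Hammer two slots from two threads and report elapsed nanoseconds.
    //   slotA == 0, slotB == 1  -> same cache line (false sharing)
    //   slotA == 4, slotB == 12 -> different lines ("padded", no bouncing)
    static long hammer(int slotA, int slotB, int iters) throws InterruptedException {
        Thread a = new Thread(() -> { for (int i = 0; i < iters; i++) counters.incrementAndGet(slotA); });
        Thread b = new Thread(() -> { for (int i = 0; i < iters; i++) counters.incrementAndGet(slotB); });
        long t0 = System.nanoTime();
        a.start(); b.start(); a.join(); b.join();
        return System.nanoTime() - t0;
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println("same line (ns):       " + hammer(0, 1, 1_000_000));
        System.out.println("different lines (ns): " + hammer(4, 12, 1_000_000));
    }
}
```

Both runs produce identical counter values; only the timing differs, which is exactly the "L1 hit degraded to cache miss" effect described above.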
I wouldn't call the pattern you are testing here "false sharing". You have *actual* sharing of the atomic long index, and sometimes sharing of a written word's cache line, but the same written-to mapped word is not repeatedly written, and the same mapped cache line does not repeatedly fly around between cores. [The actually-shared atomic counter does fly around between cores, with the same limitations that false sharing would have.]
There are two main factors that can make the non-padded version perform better than the padded version:
1. You have "constructive sharing" going on: the threads are each continuously bringing new mapped values from DRAM into the CPU cache, in a pattern that is not going to be ideal for the hardware prefetchers (the stride as seen from each thread's point of view will not be constant, since the atomic counter interleaving is not perfect, and while 8 longs fit in a single cache line, there are 5 threads racing, so a constant stride is unlikely). Threads running in the non-padded version can therefore likely amortize at least some of their L3 cache misses: when the counter gets tightly interleaved, only one thread brings the cache line in from DRAM to L3, while the others miss in their L1/L2 to L3 and do not need to perform a DRAM access. In the padded version, each access will involve an L3 cache miss, resulting in up to 5x the miss rate and consumed memory bandwidth. Note that since you are interleaving your update operations with an atomic increment (which is strongly ordered), cache miss latency is likely to dominate (bandwidth cannot come close to saturation with only one miss in flight).
2. If I/O speed is a limiting factor in the execution (e.g. if the mapped pages need to be read from and eventually written back to disk or SSD, either direction may be the bottleneck), the non-padded version needs only 1/5 as many pages per second to keep up with the workload. [The benchmark is likely small enough that it is hitting in the file cache, and dirty pages are not under pressure to be written back.]
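The arithmetic behind point 1 can be sketched in a few lines (sizes assumed: 8-byte longs, 64-byte cache lines): counter values i and j land on the same cache line whenever i/8 == j/8, so when the shared counter hands out tightly interleaved values, up to 8 threads' writes hit one line, and only the first write pays the DRAM fill.

```java
// Tiny arithmetic sketch of the "constructive sharing" point above.
public class LineSharing {
    static final int LONGS_PER_LINE = 64 / 8;  // 64-byte line, 8-byte long (assumption)

    static long lineOf(long wordIndex) {
        return wordIndex / LONGS_PER_LINE;
    }

    public static void main(String[] args) {
        // Five threads grabbing consecutive counter values 40..44:
        // all five writes fall on cache line 5, so one DRAM fill serves all five.
        for (long i = 40; i <= 44; i++) {
            System.out.println("word " + i + " -> cache line " + lineOf(i));
        }
    }
}
```

In the padded version, by contrast, every counter value maps to its own line, so every write is its own DRAM fill.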
The fact that your padded single-thread results show a sustained rate of >4.5GB/sec (11.145ns per 64 bytes) suggests that there is no I/O bottleneck at play, and that we are looking at differences in DRAM-to-L3 cache missing behavior as a key factor.
You may want to experiment with a variant that pre-interleaves the lines in the padded version. E.g. still share an atomic counter, but not for the index [sharing purely to make sure the AtomicLong contention remains similar], and have each thread write in a fixed interleave pattern (e.g. thread number + (64 bytes * thread-local-incrementing-index)). My *guess* is that the resulting perfect stride pattern will dramatically improve performance compared to the current version, as the hardware prefetchers will be able to stream in the L3 misses and hide their latencies.
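One possible reading of that variant, sketched below (buffer sizes, names, and the exact addressing formula are my assumptions, and a heap ByteBuffer stands in for your mapped file): each thread still bumps the shared AtomicLong purely to keep the contention similar, but ignores its value and instead writes cache line threadId + numThreads * localIndex, giving every thread a constant 64 * numThreads byte stride that a hardware prefetcher can stream.

```java
import java.nio.ByteBuffer;
import java.util.concurrent.atomic.AtomicLong;

// Sketch of a fixed-interleave variant (illustrative assumptions throughout).
public class FixedInterleave {
    static final int NUM_THREADS = 5;
    static final int LINES_PER_THREAD = 1_000;
    static final int LINE = 64;                               // assumed cache line size

    // Heap buffer standing in for the mapped file in the real benchmark.
    static final ByteBuffer buf =
            ByteBuffer.allocate(NUM_THREADS * LINES_PER_THREAD * LINE);
    static final AtomicLong sharedCounter = new AtomicLong(); // kept only for contention

    static void writer(int threadId) {
        for (int i = 0; i < LINES_PER_THREAD; i++) {
            sharedCounter.incrementAndGet();                  // preserve AtomicLong traffic
            long line = threadId + (long) NUM_THREADS * i;    // fixed per-thread interleave
            buf.putLong((int) (line * LINE), line);           // absolute put: one word per line
        }
    }

    public static void main(String[] args) throws InterruptedException {
        Thread[] ts = new Thread[NUM_THREADS];
        for (int t = 0; t < NUM_THREADS; t++) {
            final int id = t;
            ts[t] = new Thread(() -> writer(id));
            ts[t].start();
        }
        for (Thread t : ts) t.join();
        System.out.println("counter = " + sharedCounter.get());
    }
}
```

The absolute putLong(index, value) form is used deliberately: it does not touch the buffer's position, and each thread's line indices are disjoint, so no two threads ever write the same cache line.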