Kyle,
False sharing (in the context of cache lines) is a problem only when the same memory field (and its containing cache line) is repeatedly and very hotly accessed by multiple threads, and hotly written to by at least one of them. Since a writing core must exclusively own the cache line it is writing to, such access patterns cause the cache line to "fly around" between cores on many of the accesses, dramatically reducing the effectiveness of the processor's nearest cache levels. It is important to note that false sharing (of this type) only degrades hot field access from L1-cache hitting to cache missing (possibly L2 missing or L3 missing, depending on CPU architecture and cache topology). When enough hyper-threads are contending for the same cache line (many more than 5; think 100), the serialization on cache line access can result in access speeds even slower than un-contended DRAM access, but those are still fast compared to disk or SSD access speeds.
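To make the "hotly written by multiple threads" pattern concrete, here is a minimal sketch (class and field names are my own, not from your benchmark): two threads each hammer their *own* slot of an AtomicLongArray, but the two slots sit on the same 64-byte cache line (a typical x86 line size; an assumption), so every write forces the line to bounce between cores. Moving one slot a full line away removes the bouncing without changing what the program computes.

```java
import java.util.concurrent.atomic.AtomicLongArray;

// Minimal false-sharing sketch (illustrative, not a rigorous benchmark).
public class FalseSharingSketch {
    // 16 longs span two 64-byte cache lines on typical x86 (an assumption).
    static final AtomicLongArray counters = new AtomicLongArray(16);

    // Hammer two slots from two threads and report elapsed nanoseconds.
    //   slotA == 0, slotB == 1  -> same cache line (false sharing)
    //   slotA == 4, slotB == 12 -> different lines ("padded", no bouncing)
    static long hammer(int slotA, int slotB, int iters) throws InterruptedException {
        Thread a = new Thread(() -> { for (int i = 0; i < iters; i++) counters.incrementAndGet(slotA); });
        Thread b = new Thread(() -> { for (int i = 0; i < iters; i++) counters.incrementAndGet(slotB); });
        long t0 = System.nanoTime();
        a.start(); b.start(); a.join(); b.join();
        return System.nanoTime() - t0;
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println("same line (ns):       " + hammer(0, 1, 1_000_000));
        System.out.println("different lines (ns): " + hammer(4, 12, 1_000_000));
    }
}
```

Both runs produce identical counter values; only the timing differs, which is exactly the "L1 hit degraded to cache miss" effect described above.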
I wouldn't call the pattern you are testing here "false sharing". You have *actual* sharing of the atomic long index, and sometimes sharing of a written word's cache line, but the same written-to mapped word is not repeatedly written, and the same mapped cache line does not repeatedly fly around between cores. [The actually-shared atomic counter does fly around between cores, with the same limitations that false sharing would have.]
There are two main factors that can make the non-padded version perform better than the padded version:
1. You have "constructive sharing" going on: the threads are each continuously bringing new mapped values from DRAM into the CPU cache, in a pattern that is not going to be ideal for the hardware prefetchers (the stride as seen from each thread's point of view will not be constant, since the atomic counter interleaving is not perfect, and while 8 longs fit in a single cache line, there are 5 threads racing, so a constant stride is unlikely). Threads running in the non-padded version can therefore likely amortize at least some of their L3 cache misses: when the counter gets tightly interleaved, only one thread brings the cache line in from DRAM to L3, while the others miss in their L1/L2 to L3 and do not need to perform a DRAM access. In the padded version, each access will involve an L3 cache miss, resulting in up to 5x the miss rate and consumed memory bandwidth. Note that since you are interleaving your update operations with an atomic increment (which is strongly ordered), cache miss latency is likely to dominate (bandwidth cannot come close to saturation with only one miss in flight).
2. If I/O speed is a limiting factor in the execution (e.g. if the mapped pages need to be read from and eventually written back to disk or SSD, either direction may be the bottleneck), the non-padded version needs only 1/5 as many pages per second to keep up with the workload. [The benchmark is likely small enough that it is hitting in the file cache, and dirty pages are not under pressure to be written back.]
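The arithmetic behind point 1 can be sketched in a few lines (sizes assumed: 8-byte longs, 64-byte cache lines): counter values i and j land on the same cache line whenever i/8 == j/8, so when the shared counter hands out tightly interleaved values, up to 8 threads' writes hit one line, and only the first write pays the DRAM fill.

```java
// Tiny arithmetic sketch of the "constructive sharing" point above.
public class LineSharing {
    static final int LONGS_PER_LINE = 64 / 8;  // 64-byte line, 8-byte long (assumption)

    static long lineOf(long wordIndex) {
        return wordIndex / LONGS_PER_LINE;
    }

    public static void main(String[] args) {
        // Five threads grabbing consecutive counter values 40..44:
        // all five writes fall on cache line 5, so one DRAM fill serves all five.
        for (long i = 40; i <= 44; i++) {
            System.out.println("word " + i + " -> cache line " + lineOf(i));
        }
    }
}
```

In the padded version, by contrast, every counter value maps to its own line, so every write is its own DRAM fill.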
The fact that your padded single-thread results show a sustained rate of >4.5GB/sec (11.145ns per 64 bytes) suggests that there is no I/O bottleneck at play, and that we are looking at differences in DRAM-to-L3 cache missing behavior as a key factor.
You may want to experiment with a variant that pre-interleaves the lines in the padded version. E.g. still share an atomic counter, but not for the index [sharing purely to make sure the AtomicLong contention remains similar], and have each thread write in a fixed interleave pattern (e.g. thread number + (64 bytes * thread-local-incrementing-index)). My *guess* is that the resulting perfect stride pattern will dramatically improve performance compared to the current version, as the hardware prefetchers will be able to stream in the L3 misses and hide their latencies.
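One possible reading of that variant, sketched below (buffer sizes, names, and the exact addressing formula are my assumptions, and a heap ByteBuffer stands in for your mapped file): each thread still bumps the shared AtomicLong purely to keep the contention similar, but ignores its value and instead writes cache line threadId + numThreads * localIndex, giving every thread a constant 64 * numThreads byte stride that a hardware prefetcher can stream.

```java
import java.nio.ByteBuffer;
import java.util.concurrent.atomic.AtomicLong;

// Sketch of a fixed-interleave variant (illustrative assumptions throughout).
public class FixedInterleave {
    static final int NUM_THREADS = 5;
    static final int LINES_PER_THREAD = 1_000;
    static final int LINE = 64;                               // assumed cache line size

    // Heap buffer standing in for the mapped file in the real benchmark.
    static final ByteBuffer buf =
            ByteBuffer.allocate(NUM_THREADS * LINES_PER_THREAD * LINE);
    static final AtomicLong sharedCounter = new AtomicLong(); // kept only for contention

    static void writer(int threadId) {
        for (int i = 0; i < LINES_PER_THREAD; i++) {
            sharedCounter.incrementAndGet();                  // preserve AtomicLong traffic
            long line = threadId + (long) NUM_THREADS * i;    // fixed per-thread interleave
            buf.putLong((int) (line * LINE), line);           // absolute put: one word per line
        }
    }

    public static void main(String[] args) throws InterruptedException {
        Thread[] ts = new Thread[NUM_THREADS];
        for (int t = 0; t < NUM_THREADS; t++) {
            final int id = t;
            ts[t] = new Thread(() -> writer(id));
            ts[t].start();
        }
        for (Thread t : ts) t.join();
        System.out.println("counter = " + sharedCounter.get());
    }
}
```

The absolute putLong(index, value) form is used deliberately: it does not touch the buffer's position, and each thread's line indices are disjoint, so no two threads ever write the same cache line.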