I was playing with memory traversal speed using Unsafe vs native primitive arrays, and found that for a simple sum, Unsafe can only get to about 80% of the throughput of a long array. I'm wondering if there are any ways to improve the throughput to get it closer to what native array provides.
Unsafe code:
final int s = size;
final long addr = address;
for (int i = 0; i < s; i++) {
sum += PlatformDependent.UNSAFE.getLong(addr + i * 8);
}
return sum;
Array code:
long sum = 0;
for (int i = 0; i < size; i++) {
sum += longArray[i];
}
return sum;
JMH result:
Benchmark (size) Mode Cnt Score Error Units
UnsafeBenchmark.arrayTraversal 1048576 thrpt 2 2616.060 ± NaN ops/s
UnsafeBenchmark.unsafeTraversalIncrement1 1048576 thrpt 2 2198.652 ± NaN ops/s
JIT was able to unroll both loops, but for arrayTraversal, JIT was able to figure out (no surprise since array access is heavily optimized) it is just pointer increment, and thus compiled into the following to use a single instruction to fetch the values at fixed increments:
0x000000010a8c9b98: add 0x18(%r10,%r11,8),%rbx
0x000000010a8c9b9d: add 0x20(%r10,%r11,8),%rbx
0x000000010a8c9ba2: add 0x28(%r10,%r11,8),%rbx
......
0x000000010a8c9bd4: add 0x78(%r10,%r11,8),%rbx
0x000000010a8c9bd9: add 0x80(%r10,%r11,8),%rbx
In the case of Unsafe, JIT was not able to optimize out the pointer arithmetics, and had extra instructions to run those. As a result, each of the add above becomes:
0x000000010a8c9080: mov %r10d,%r9d
0x000000010a8c9083: add $0x60,%r9d
0x000000010a8c9087: mov %r10d,%r8d
0x000000010a8c908a: add $0x58,%r8d
0x000000010a8c908e: movslq %r9d,%r11
0x000000010a8c9091: mov %r11,0x18(%rsp)
0x000000010a8c9096: movslq %r8d,%r11
0x000000010a8c9099: mov %r11,0x20(%rsp)
...
0x000000010a8c9114: mov %rdi,%rbx
0x000000010a8c9117: add (%rbx,%r11,1),%rcx
Even though the eventual memory access was one instruction, it had to go through 9 instructions to get the right address. I was hoping Unsafe would provide LLVM-like speed for this particular case (minus the SIMD support), but a bit disappointed to find the generated instructions not as efficient as I wanted.
I tried playing with some other things (such as incrementing i 8 byte at a time -- which actually ended up worse because JIT wasn't able to unroll that loop) but haven't found a way yet. Are there other ways to improve this throughput?
Thanks.