unsafe vs array memory traversal throughput

733 views
Skip to first unread message

Reynold Xin

unread,
Mar 13, 2015, 3:49:30 AM3/13/15
to mechanica...@googlegroups.com
I was playing with memory traversal speed using Unsafe vs native primitive arrays, and found that for a simple sum, Unsafe can only get to about 80% of the throughput of a long array. I'm wondering if there are any ways to improve the throughput to get it closer to what native array provides.

Unsafe code:

    final int s = size;
    final long addr = address;
    for (int i = 0; i < s; i++) {
      sum += PlatformDependent.UNSAFE.getLong(addr + i * 8);
    }
    return sum;

Array code:

    long sum = 0;
    for (int i = 0; i < size; i++) {
      sum += longArray[i];
    }
    return sum;


JMH result: 

Benchmark                                   (size)   Mode  Cnt     Score   Error  Units
UnsafeBenchmark.arrayTraversal             1048576  thrpt    2  2616.060 ±   NaN  ops/s
UnsafeBenchmark.unsafeTraversalIncrement1  1048576  thrpt    2  2198.652 ±   NaN  ops/s


JIT was able to unroll both loops, but for arrayTraversal, JIT was able to figure out (no surprise since array access is heavily optimized) it is just pointer increment, and thus compiled into the following to use a single instruction to fetch the values at fixed increments:

  0x000000010a8c9b98: add    0x18(%r10,%r11,8),%rbx
  0x000000010a8c9b9d: add    0x20(%r10,%r11,8),%rbx
  0x000000010a8c9ba2: add    0x28(%r10,%r11,8),%rbx
  ......
  0x000000010a8c9bd4: add    0x78(%r10,%r11,8),%rbx
  0x000000010a8c9bd9: add    0x80(%r10,%r11,8),%rbx


In the case of Unsafe, JIT was not able to optimize out the pointer arithmetics, and had extra instructions to run those. As a result, each of the add above becomes:
 
  0x000000010a8c9080: mov    %r10d,%r9d
  0x000000010a8c9083: add    $0x60,%r9d
  0x000000010a8c9087: mov    %r10d,%r8d
  0x000000010a8c908a: add    $0x58,%r8d
  0x000000010a8c908e: movslq %r9d,%r11
  0x000000010a8c9091: mov    %r11,0x18(%rsp)
  0x000000010a8c9096: movslq %r8d,%r11
  0x000000010a8c9099: mov    %r11,0x20(%rsp)
  ...
  0x000000010a8c9114: mov    %rdi,%rbx
  0x000000010a8c9117: add    (%rbx,%r11,1),%rcx

  
Even though the eventual memory access was one instruction, it had to go through 9 instructions to get the right address. I was hoping Unsafe would provide LLVM-like speed for this particular case (minus the SIMD support), but a bit disappointed to find the generated instructions not as efficient as I wanted.

I tried playing with some other things (such as incrementing i 8 byte at a time -- which actually ended up worse because JIT wasn't able to unroll that loop) but haven't found a way yet. Are there other ways to improve this throughput?

I've put my whole benchmark code, result, and assembly instructions here: https://gist.github.com/rxin/ae6d7692e58c03a92861

Thanks.

Nitsan Wakart

unread,
Mar 13, 2015, 5:00:26 AM3/13/15
to mechanica...@googlegroups.com
@Benchmark
public long unsafeTraversalAddressIncrement() {
    long sum = 0;
    int s = size;
    long addr = address;
    for (int i = 0; i < s; i++) {
      sum += UNSAFE.getLong(addr);
      addr+=8;
    }
    return sum;
}
This performs closer to the mark in my rudimentary benchmarking.
The for loop in your Increment8 will have a safepoint poll in because the counter is a long.



--
You received this message because you are subscribed to the Google Groups "mechanical-sympathy" group.
To unsubscribe from this group and stop receiving emails from it, send an email to mechanical-symp...@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.


Reynold Xin

unread,
Mar 13, 2015, 5:25:04 AM3/13/15
to mechanica...@googlegroups.com
Thanks, Nitsan.

I was just reading your blog post that touched the same topic http://psy-lob-saw.blogspot.com/2014/11/the-mythical-modulo-mask.html?m=1

This makes sense. For a project I was using I was going to use long indexes, but given loops over longs could introduce safepoint polls that prevent loop unrolling, I'm reevaluating this design. 


Vitaly Davidovich

unread,
Mar 13, 2015, 7:15:35 AM3/13/15
to mechanica...@googlegroups.com

You should forward this to hotspot-compiler openjdk mailing list; enhancement to JIT to do strength reduction of the address calculation.

sent from my phone

Aleksey Shipilev

unread,
Mar 13, 2015, 7:55:44 AM3/13/15
to mechanica...@googlegroups.com
On 03/13/2015 10:49 AM, Reynold Xin wrote:
> Even though the eventual memory access was one instruction, it had to go
> through 9 instructions to get the right address. I was hoping Unsafe
> would provide LLVM-like speed for this particular case (minus the SIMD
> support), but a bit disappointed to find the generated instructions not
> as efficient as I wanted.

Lesson: sun.misc.Unsafe != sun.misc.SuperFast.

Even though in some cases it can be used to overcome the Java language
semantics, e.g. bounds checks, accessing raw memory, etc., I think the
belief that once you start using Unsafe any code that touches it would
run awesomely fast is wishful thinking.

Seriously though, there are some known issues:
https://bugs.openjdk.java.net/browse/JDK-8003585
https://bugs.openjdk.java.net/browse/JDK-8074124

...and the same unrolling thing also popped out in some of my test a
week ago, but I wait for JDK-8003585 to be resolved, because there is an
anecdotal evidence the patch for JDK-8003585 helps to solve it.

Thanks,
-Aleksey.

signature.asc

Reynold Xin

unread,
Mar 13, 2015, 4:30:51 PM3/13/15
to mechanica...@googlegroups.com
Thanks for commenting. I forwarded the thread to hotspot-compiler-dev also, although it is awaiting moderator approval. 

Roman Leventov

unread,
Mar 14, 2015, 9:55:23 AM3/14/15
to mechanica...@googlegroups.com
Reply all
Reply to author
Forward
0 new messages