Unsafe.setMemory() and copyMemory() vs. Unsafe.putByte()


Kyle Downey

unread,
Aug 25, 2015, 5:58:27 PM8/25/15
to mechanical-sympathy
This is my first time posting to mechanical-sympathy.

I am seeing a consistent difference in a microbenchmark between calling Unsafe.putByte() for each byte in a byte array vs. either (a) setMemory() to zero out all bytes, or (b) copyMemory() to copy the data in the byte[] array into the native memory. I would have expected the bulk memory writes to be faster than making multiple calls to update memory byte-by-byte, but am seeing the opposite, at least for the small byte[] arrays I'm testing.
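The two write paths being compared can be sketched as follows. This is a minimal standalone sketch under my own class and method names, not the original benchmark code; `Unsafe` is obtained via the usual reflection trick since it has no public accessor.

```java
import java.lang.reflect.Field;
import sun.misc.Unsafe;

// Sketch contrasting the two write paths: a per-byte putByte() loop
// vs. a single bulk copyMemory() call from the array into native memory.
public class UnsafeWriteSketch {
    static final Unsafe UNSAFE = loadUnsafe();

    static Unsafe loadUnsafe() {
        try {
            Field f = Unsafe.class.getDeclaredField("theUnsafe");
            f.setAccessible(true);
            return (Unsafe) f.get(null);
        } catch (ReflectiveOperationException e) {
            throw new AssertionError(e);
        }
    }

    // (a) byte-by-byte: one putByte() call per element
    static void putBytes(long address, byte[] src) {
        for (int i = 0; i < src.length; i++) {
            UNSAFE.putByte(address + i, src[i]);
        }
    }

    // (b) bulk: one copyMemory() call from the array body into native memory
    static void copyBytes(long address, byte[] src) {
        UNSAFE.copyMemory(src, Unsafe.ARRAY_BYTE_BASE_OFFSET, null, address, src.length);
    }

    public static void main(String[] args) {
        byte[] data = "hello".getBytes();
        long addr = UNSAFE.allocateMemory(data.length);
        putBytes(addr, data);
        char first = (char) UNSAFE.getByte(addr);             // 'h'
        copyBytes(addr, data);
        char last = (char) UNSAFE.getByte(addr + data.length - 1); // 'o'
        UNSAFE.freeMemory(addr);
        System.out.println("" + first + last);
    }
}
```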

This is on MacOS X 10.5, JDK 1.8.0_51 on a MacBook Pro with a 2.8 GHz Intel Core i7, benchmarked with JMH settings:

# Warmup: 5 iterations, 1 s each
# Measurement: 20 iterations, 1 s each
# Timeout: 10 min per iteration
# Threads: 1 thread, will synchronize iterations
# Benchmark mode: Throughput, ops/time

Here is the difference from essentially a one-line change to replace my putBytes() call with copyMemory():

OffHeapFastStringPerfTest.appendFastStringNoPrealloc                 thrpt   20  1895.290 ± 15.210  ops/s
OffHeapFastStringPerfTest.appendFastStringNoPreallocUsingCopyMemory  thrpt   20  1415.249 ± 16.747  ops/s

Is this a well-known difference, and is there something about the way these operations have been implemented in Java 8 that slows them down?


Michael Barker

unread,
Aug 25, 2015, 6:10:25 PM8/25/15
to mechanica...@googlegroups.com
Any chance that you could post the benchmark?

--
You received this message because you are subscribed to the Google Groups "mechanical-sympathy" group.
To unsubscribe from this group and stop receiving emails from it, send an email to mechanical-symp...@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Vitaly Davidovich

unread,
Aug 25, 2015, 6:20:38 PM8/25/15
to mechanical-sympathy

setMemory isn't backed by an intrinsic, so you get a JNI penalty.  copyMemory is, but what size are you copying? You mention small arrays, which, depending on what we mean, could be quicker to blast through with no memcpy setup.  Try larger sizes.  Finally, look at the generated assembly to see the difference.

sent from my phone

Kyle Downey

unread,
Aug 25, 2015, 7:08:47 PM8/25/15
to mechanical-sympathy
Vitaly,

Good call -- here are the results appending an 89-byte array instead of a 3-byte array: copyMemory is now faster than both StringBuffer and the putByte()-based version of the string append operation.

OffHeapFastStringPerfTest.appendFastStringNoPrealloc                 thrpt   20  1062.834 ±  9.683  ops/s
OffHeapFastStringPerfTest.appendFastStringNoPreallocUsingCopyMemory  thrpt   20  2625.247 ± 26.639  ops/s

Kyle Downey

unread,
Aug 25, 2015, 8:48:12 PM8/25/15
to mechanical-sympathy
I ran it with 4-byte and 32-byte inputs, this time in AverageTime mode and with some simplifications to the benchmark so it does just a single append. After a few rounds of tests I determined that the cut-off (for this hardware) is around 4 bytes; at this length or smaller it seems you are better off doing a loop with putByte() instead of copyMemory(), so I modified the code to switch modes for small arrays.
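The size-based dispatch described above might look something like this. A hypothetical sketch: the class name and threshold constant are illustrative, not taken from the original code; only the 4-byte cutoff comes from the measurements in this thread.

```java
import java.lang.reflect.Field;
import sun.misc.Unsafe;

// Below a small cutoff, write byte-by-byte with putByte(); at or above it,
// issue a single bulk copyMemory() call.
public class AppendDispatchSketch {
    static final int COPY_MEMORY_THRESHOLD = 4; // empirical cutoff from the benchmark
    static final Unsafe UNSAFE = loadUnsafe();

    static Unsafe loadUnsafe() {
        try {
            Field f = Unsafe.class.getDeclaredField("theUnsafe");
            f.setAccessible(true);
            return (Unsafe) f.get(null);
        } catch (ReflectiveOperationException e) {
            throw new AssertionError(e);
        }
    }

    static void append(long address, byte[] src) {
        if (src.length <= COPY_MEMORY_THRESHOLD) {
            for (int i = 0; i < src.length; i++) {  // small input: the loop wins
                UNSAFE.putByte(address + i, src[i]);
            }
        } else {                                    // larger input: bulk copy wins
            UNSAFE.copyMemory(src, Unsafe.ARRAY_BYTE_BASE_OFFSET, null, address, src.length);
        }
    }

    public static void main(String[] args) {
        byte[] small = {1, 2, 3};                 // takes the putByte() path
        byte[] large = {9, 8, 7, 6, 5};           // takes the copyMemory() path
        long addr = UNSAFE.allocateMemory(8);
        append(addr, small);
        byte s = UNSAFE.getByte(addr + 2);
        append(addr, large);
        byte l = UNSAFE.getByte(addr + 4);
        UNSAFE.freeMemory(addr);
        System.out.println(s + "," + l);
    }
}
```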

Benchmark                                                  Mode  Cnt   Score   Error  Units
OffHeapFastStringPerfTest.appendBaselineStringBuilder      avgt   20   8.913 ± 0.509  ns/op
OffHeapFastStringPerfTest.appendBaselineStringBuilderLong  avgt   20  29.393 ± 0.643  ns/op
OffHeapFastStringPerfTest.appendFastString                 avgt   20   6.247 ± 0.127  ns/op
OffHeapFastStringPerfTest.appendFastStringLong             avgt   20  10.869 ± 0.094  ns/op

Vitaly Davidovich

unread,
Aug 25, 2015, 9:43:59 PM8/25/15
to mechanical-sympathy

I'm somewhat surprised the cutoff is 4 bytes - I'd have expected larger.  Have you looked at the assembly by chance for both versions?

sent from my phone


Kyle Downey

unread,
Aug 26, 2015, 10:13:18 PM8/26/15
to mechanica...@googlegroups.com
I posted a complete example to GitHub that demonstrates the behavior, including a subset of the original code:

https://github.com/kyle-downey/cloudwall-lab

The JMH test on my machine shows that it's always preferable to use copyMemory() down to 4 bytes -- which is the point where the run times are about equal. I've set up what I need on this machine to disassemble tomorrow, but if you want to play with it the code is all there.

Benchmark                                                 Mode  Cnt   Score   Error  Units
FastAppenderBenchmark.appendBaselineStringBuilder12Bytes  avgt   20   9.617 ± 0.640  ns/op
FastAppenderBenchmark.appendBaselineStringBuilder32Bytes  avgt   20  29.584 ± 2.168  ns/op
FastAppenderBenchmark.appendBaselineStringBuilder4Bytes   avgt   20   9.516 ± 0.404  ns/op
FastAppenderBenchmark.appenderAlwaysCopyMemory12Bytes     avgt   20   5.720 ± 0.407  ns/op
FastAppenderBenchmark.appenderAlwaysCopyMemory32Bytes     avgt   20   4.827 ± 0.253  ns/op
FastAppenderBenchmark.appenderAlwaysCopyMemory4Bytes      avgt   20   5.629 ± 0.293  ns/op
FastAppenderBenchmark.appenderAlwaysPutBytes12Bytes       avgt   20   8.615 ± 0.520  ns/op
FastAppenderBenchmark.appenderAlwaysPutBytes32Bytes       avgt   20  17.414 ± 1.120  ns/op
FastAppenderBenchmark.appenderAlwaysPutBytes4Bytes        avgt   20   5.540 ± 0.479  ns/op


Vitaly Davidovich

unread,
Aug 26, 2015, 10:35:03 PM8/26/15
to mechanical-sympathy

Kyle,

A few quick suggestions:

1) Since you appear to be using StringBuilder as the baseline, I'd size those instances appropriately up front.  In particular, the 32-byte case will cause a resize.

2) Remove the asserts.  It's just unnecessary code noise (it shouldn't impact perf in this case since the methods are still within the frequent-code inline threshold).

3) Don't branch based on input array length.  Again, it'll get predicted well by the CPU, but it's noise and may cause the compiler to do something odd (unlikely, but without assembly I cannot tell).  Create an abstract class with 2 concrete impls instead.

4) Manually hoist loop-invariant calculations out of the putByte loop (i.e. address + startIndex).  The compiler *should* pick that up, but without assembly it's hard to say (plus you're not trying to test that aspect).

5) My hunch is the compiler is not unrolling the putByte loop because it doesn't know whether the stores alias with the loads.  Try manually unrolling, say, an 8-byte loop and see if anything changes.

6) Immaterial to the perf, but I'd make the unsafe field final or just remove it entirely (assuming NativeBytes.UNSAFE is static final, it'll become a JIT constant).
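The manual unrolling from #5, with the loop invariant from #4 hoisted, might be sketched like this. Illustrative only; the class and method names are not from the original code.

```java
import java.lang.reflect.Field;
import sun.misc.Unsafe;

// Manually 8-way-unrolled putByte() loop with the (address + i) base
// computed once per iteration, plus a scalar tail for the last 0-7 bytes.
public class UnrolledPutBytesSketch {
    static final Unsafe UNSAFE = loadUnsafe();

    static Unsafe loadUnsafe() {
        try {
            Field f = Unsafe.class.getDeclaredField("theUnsafe");
            f.setAccessible(true);
            return (Unsafe) f.get(null);
        } catch (ReflectiveOperationException e) {
            throw new AssertionError(e);
        }
    }

    static void putBytesUnrolled(long address, byte[] src) {
        int i = 0;
        int limit = src.length - 7;
        for (; i < limit; i += 8) {        // main loop: 8 stores per iteration
            long base = address + i;       // hoisted invariant, computed once
            UNSAFE.putByte(base,     src[i]);
            UNSAFE.putByte(base + 1, src[i + 1]);
            UNSAFE.putByte(base + 2, src[i + 2]);
            UNSAFE.putByte(base + 3, src[i + 3]);
            UNSAFE.putByte(base + 4, src[i + 4]);
            UNSAFE.putByte(base + 5, src[i + 5]);
            UNSAFE.putByte(base + 6, src[i + 6]);
            UNSAFE.putByte(base + 7, src[i + 7]);
        }
        for (; i < src.length; i++) {      // tail: remaining 0-7 bytes
            UNSAFE.putByte(address + i, src[i]);
        }
    }

    public static void main(String[] args) {
        byte[] src = new byte[13];
        for (int i = 0; i < src.length; i++) src[i] = (byte) (i + 1);
        long addr = UNSAFE.allocateMemory(src.length);
        putBytesUnrolled(addr, src);
        System.out.println(UNSAFE.getByte(addr) + "," + UNSAFE.getByte(addr + 12));
        UNSAFE.freeMemory(addr);
    }
}
```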

sent from my phone

Vitaly Davidovich

unread,
Aug 26, 2015, 10:38:25 PM8/26/15
to mechanical-sympathy

Argh -- #6 should say make unsafe static final, not just final.

sent from my phone

Kyle Downey

unread,
Aug 27, 2015, 4:04:37 PM8/27/15
to mechanica...@googlegroups.com
GitHub is updated with the changes above. The significant impact came from switching to the 8-byte loop unrolling when using putByte(), and from pre-sizing StringBuilder for the 32-byte case so it's not an overly flattering baseline. The optimized putByte() case matched copyMemory, so I think you are right, Vitaly: the compiler missed this optimization.

Benchmark                                                 Mode  Cnt   Score   Error  Units
FastAppenderBenchmark.appendBaselineStringBuilder12Bytes  avgt   20  21.576 ± 1.477  ns/op
FastAppenderBenchmark.appendBaselineStringBuilder32Bytes  avgt   20  28.352 ± 0.861  ns/op
FastAppenderBenchmark.appendBaselineStringBuilder4Bytes   avgt   20  23.098 ± 0.652  ns/op
FastAppenderBenchmark.appenderAlwaysCopyMemory12Bytes     avgt   20   5.479 ± 0.068  ns/op
FastAppenderBenchmark.appenderAlwaysCopyMemory32Bytes     avgt   20   4.581 ± 0.092  ns/op
FastAppenderBenchmark.appenderAlwaysCopyMemory4Bytes      avgt   20   5.349 ± 0.109  ns/op
FastAppenderBenchmark.appenderAlwaysPutBytes12Bytes       avgt   20   6.947 ± 0.139  ns/op
FastAppenderBenchmark.appenderAlwaysPutBytes32Bytes       avgt   20   4.419 ± 0.041  ns/op
FastAppenderBenchmark.appenderAlwaysPutBytes4Bytes        avgt   20   5.594 ± 0.058  ns/op
