Unsafe.setMemory() and copyMemory() vs. Unsafe.putByte()


Kyle Downey

unread,
Aug 25, 2015, 5:58:27 PM8/25/15
to mechanical-sympathy
This is my first time posting to mechanical-sympathy.

I am seeing a consistent difference in a microbenchmark between calling Unsafe.putByte() for each byte in a byte array vs. either (a) setMemory() to zero out all bytes, or (b) copyMemory() to copy the data in the byte[] array into the native memory. I would have expected the bulk memory writes to be faster than making multiple calls to update memory byte-by-byte, but am seeing the opposite, at least for the small byte[] arrays I'm testing.
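The two write paths being compared can be sketched as follows. This is a minimal standalone sketch under my own class and method names, not the original benchmark code; `Unsafe` is obtained via the usual reflection trick since it has no public accessor.

```java
import java.lang.reflect.Field;
import sun.misc.Unsafe;

// Sketch contrasting the two write paths: a per-byte putByte() loop
// vs. a single bulk copyMemory() call from the array into native memory.
public class UnsafeWriteSketch {
    static final Unsafe UNSAFE = loadUnsafe();

    static Unsafe loadUnsafe() {
        try {
            Field f = Unsafe.class.getDeclaredField("theUnsafe");
            f.setAccessible(true);
            return (Unsafe) f.get(null);
        } catch (ReflectiveOperationException e) {
            throw new AssertionError(e);
        }
    }

    // (a) byte-by-byte: one putByte() call per element
    static void putBytes(long address, byte[] src) {
        for (int i = 0; i < src.length; i++) {
            UNSAFE.putByte(address + i, src[i]);
        }
    }

    // (b) bulk: one copyMemory() call from the array body into native memory
    static void copyBytes(long address, byte[] src) {
        UNSAFE.copyMemory(src, Unsafe.ARRAY_BYTE_BASE_OFFSET, null, address, src.length);
    }

    public static void main(String[] args) {
        byte[] data = "hello".getBytes();
        long addr = UNSAFE.allocateMemory(data.length);
        putBytes(addr, data);
        char first = (char) UNSAFE.getByte(addr);             // 'h'
        copyBytes(addr, data);
        char last = (char) UNSAFE.getByte(addr + data.length - 1); // 'o'
        UNSAFE.freeMemory(addr);
        System.out.println("" + first + last);
    }
}
```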

This is on MacOS X 10.5, JDK 1.8.0_51 on a MacBook Pro with a 2.8 GHz Intel Core i7, benchmarked with JMH settings:

# Warmup: 5 iterations, 1 s each
# Measurement: 20 iterations, 1 s each
# Timeout: 10 min per iteration
# Threads: 1 thread, will synchronize iterations
# Benchmark mode: Throughput, ops/time

Here is the difference from essentially a one-line change to replace my putBytes() call with copyMemory():

OffHeapFastStringPerfTest.appendFastStringNoPrealloc                 thrpt   20  1895.290 ± 15.210  ops/s
OffHeapFastStringPerfTest.appendFastStringNoPreallocUsingCopyMemory  thrpt   20  1415.249 ± 16.747  ops/s

Is this a well-known difference, and is there something about the way these operations have been implemented in Java 8 that slows them down?


Michael Barker

unread,
Aug 25, 2015, 6:10:25 PM8/25/15
to mechanica...@googlegroups.com
Any chance that you could post the benchmark?

--
You received this message because you are subscribed to the Google Groups "mechanical-sympathy" group.
To unsubscribe from this group and stop receiving emails from it, send an email to mechanical-symp...@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Vitaly Davidovich

unread,
Aug 25, 2015, 6:20:38 PM8/25/15
to mechanical-sympathy

setMemory isn't backed by an intrinsic, so you get a JNI penalty.  copyMemory is, but what size are you copying? You mention small arrays, which, depending on what we mean, could be quicker to blast through with no memcpy setup.  Try larger sizes.  Finally, look at the generated assembly to see the difference.

sent from my phone

Kyle Downey

unread,
Aug 25, 2015, 7:08:47 PM8/25/15
to mechanical-sympathy
Vitaly,

Good call -- here are the results appending an 89-byte array instead of a 3-byte array: copyMemory is now faster than both StringBuffer and the putByte()-based version of the string append operation.

OffHeapFastStringPerfTest.appendFastStringNoPrealloc                 thrpt   20  1062.834 ±  9.683  ops/s
OffHeapFastStringPerfTest.appendFastStringNoPreallocUsingCopyMemory  thrpt   20  2625.247 ± 26.639  ops/s

Kyle Downey

unread,
Aug 25, 2015, 8:48:12 PM8/25/15
to mechanical-sympathy
I ran it with 4-byte and 32-byte inputs, this time in AverageTime mode and with some simplifications to the benchmark so it does just a single append. After a few rounds of tests I determined that the cut-off (for this hardware) is around 4 bytes; at this length or smaller it seems you are better off doing a loop with putByte() instead of copyMemory(), so I modified the code to switch modes for small arrays.
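The size-based dispatch described above might look something like this. A hypothetical sketch: the class name and threshold constant are illustrative, not taken from the original code; only the 4-byte cutoff comes from the measurements in this thread.

```java
import java.lang.reflect.Field;
import sun.misc.Unsafe;

// Below a small cutoff, write byte-by-byte with putByte(); at or above it,
// issue a single bulk copyMemory() call.
public class AppendDispatchSketch {
    static final int COPY_MEMORY_THRESHOLD = 4; // empirical cutoff from the benchmark
    static final Unsafe UNSAFE = loadUnsafe();

    static Unsafe loadUnsafe() {
        try {
            Field f = Unsafe.class.getDeclaredField("theUnsafe");
            f.setAccessible(true);
            return (Unsafe) f.get(null);
        } catch (ReflectiveOperationException e) {
            throw new AssertionError(e);
        }
    }

    static void append(long address, byte[] src) {
        if (src.length <= COPY_MEMORY_THRESHOLD) {
            for (int i = 0; i < src.length; i++) {  // small input: the loop wins
                UNSAFE.putByte(address + i, src[i]);
            }
        } else {                                    // larger input: bulk copy wins
            UNSAFE.copyMemory(src, Unsafe.ARRAY_BYTE_BASE_OFFSET, null, address, src.length);
        }
    }

    public static void main(String[] args) {
        byte[] small = {1, 2, 3};                 // takes the putByte() path
        byte[] large = {9, 8, 7, 6, 5};           // takes the copyMemory() path
        long addr = UNSAFE.allocateMemory(8);
        append(addr, small);
        byte s = UNSAFE.getByte(addr + 2);
        append(addr, large);
        byte l = UNSAFE.getByte(addr + 4);
        UNSAFE.freeMemory(addr);
        System.out.println(s + "," + l);
    }
}
```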

Benchmark                                                  Mode  Cnt   Score   Error  Units
OffHeapFastStringPerfTest.appendBaselineStringBuilder      avgt   20   8.913 ± 0.509  ns/op
OffHeapFastStringPerfTest.appendBaselineStringBuilderLong  avgt   20  29.393 ± 0.643  ns/op
OffHeapFastStringPerfTest.appendFastString                 avgt   20   6.247 ± 0.127  ns/op
OffHeapFastStringPerfTest.appendFastStringLong             avgt   20  10.869 ± 0.094  ns/op

Vitaly Davidovich

unread,
Aug 25, 2015, 9:43:59 PM8/25/15
to mechanical-sympathy

I'm somewhat surprised the cutoff is 4 bytes - I'd have expected larger.  Have you looked at the assembly by chance for both versions?

sent from my phone


Kyle Downey

unread,
Aug 26, 2015, 10:13:18 PM8/26/15
to mechanica...@googlegroups.com
I posted a complete example to GitHub that demonstrates the behavior, including a subset of the original code:

https://github.com/kyle-downey/cloudwall-lab

The JMH test on my machine shows that it's always preferable to use copyMemory() down to 4 bytes -- which is the point where the run times are about equal. I've set up what I need on this machine to disassemble tomorrow, but if you want to play with it the code is all there.

Benchmark                                                 Mode  Cnt   Score   Error  Units
FastAppenderBenchmark.appendBaselineStringBuilder12Bytes  avgt   20   9.617 ± 0.640  ns/op
FastAppenderBenchmark.appendBaselineStringBuilder32Bytes  avgt   20  29.584 ± 2.168  ns/op
FastAppenderBenchmark.appendBaselineStringBuilder4Bytes   avgt   20   9.516 ± 0.404  ns/op
FastAppenderBenchmark.appenderAlwaysCopyMemory12Bytes     avgt   20   5.720 ± 0.407  ns/op
FastAppenderBenchmark.appenderAlwaysCopyMemory32Bytes     avgt   20   4.827 ± 0.253  ns/op
FastAppenderBenchmark.appenderAlwaysCopyMemory4Bytes      avgt   20   5.629 ± 0.293  ns/op
FastAppenderBenchmark.appenderAlwaysPutBytes12Bytes       avgt   20   8.615 ± 0.520  ns/op
FastAppenderBenchmark.appenderAlwaysPutBytes32Bytes       avgt   20  17.414 ± 1.120  ns/op
FastAppenderBenchmark.appenderAlwaysPutBytes4Bytes        avgt   20   5.540 ± 0.479  ns/op


Vitaly Davidovich

unread,
Aug 26, 2015, 10:35:03 PM8/26/15
to mechanical-sympathy

Kyle,

A few quick suggestions:

1) Since you appear to be using StringBuilder as the baseline, I'd size those instances appropriately up front.  In particular, the 32-byte case will cause a resize.

2) Remove the asserts.  It's just unnecessary code noise (it shouldn't impact perf in this case since the methods are still within the frequent-code inline threshold).

3) Don't branch based on input array length.  Again, it'll get predicted well by the CPU, but it's noise and may cause the compiler to do something odd (unlikely, but without assembly I cannot tell).  Create an abstract class with 2 concrete impls instead.

4) Manually hoist loop-invariant calculations out of the putByte loop (i.e. address + startIndex).  The compiler *should* pick that up, but without assembly it's hard to say (plus you're not trying to test that aspect).

5) My hunch is the compiler is not unrolling the putByte loop because it doesn't know whether the stores alias with the loads.  Try manually unrolling, say, an 8-byte loop and see if anything changes.

6) Immaterial to the perf, but I'd make the unsafe field final or just remove it entirely (assuming NativeBytes.UNSAFE is static final, it'll become a JIT constant).
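The manual unrolling from #5, with the loop invariant from #4 hoisted, might be sketched like this. Illustrative only; the class and method names are not from the original code.

```java
import java.lang.reflect.Field;
import sun.misc.Unsafe;

// Manually 8-way-unrolled putByte() loop with the (address + i) base
// computed once per iteration, plus a scalar tail for the last 0-7 bytes.
public class UnrolledPutBytesSketch {
    static final Unsafe UNSAFE = loadUnsafe();

    static Unsafe loadUnsafe() {
        try {
            Field f = Unsafe.class.getDeclaredField("theUnsafe");
            f.setAccessible(true);
            return (Unsafe) f.get(null);
        } catch (ReflectiveOperationException e) {
            throw new AssertionError(e);
        }
    }

    static void putBytesUnrolled(long address, byte[] src) {
        int i = 0;
        int limit = src.length - 7;
        for (; i < limit; i += 8) {        // main loop: 8 stores per iteration
            long base = address + i;       // hoisted invariant, computed once
            UNSAFE.putByte(base,     src[i]);
            UNSAFE.putByte(base + 1, src[i + 1]);
            UNSAFE.putByte(base + 2, src[i + 2]);
            UNSAFE.putByte(base + 3, src[i + 3]);
            UNSAFE.putByte(base + 4, src[i + 4]);
            UNSAFE.putByte(base + 5, src[i + 5]);
            UNSAFE.putByte(base + 6, src[i + 6]);
            UNSAFE.putByte(base + 7, src[i + 7]);
        }
        for (; i < src.length; i++) {      // tail: remaining 0-7 bytes
            UNSAFE.putByte(address + i, src[i]);
        }
    }

    public static void main(String[] args) {
        byte[] src = new byte[13];
        for (int i = 0; i < src.length; i++) src[i] = (byte) (i + 1);
        long addr = UNSAFE.allocateMemory(src.length);
        putBytesUnrolled(addr, src);
        System.out.println(UNSAFE.getByte(addr) + "," + UNSAFE.getByte(addr + 12));
        UNSAFE.freeMemory(addr);
    }
}
```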

sent from my phone

Vitaly Davidovich

unread,
Aug 26, 2015, 10:38:25 PM8/26/15
to mechanical-sympathy

Argh -- #6 should say make unsafe static final, not just final.

sent from my phone

Kyle Downey

unread,
Aug 27, 2015, 4:04:37 PM8/27/15
to mechanica...@googlegroups.com
GitHub is updated with the changes above. The significant impact came from switching to the 8-byte loop unrolling when using putByte(), and from pre-sizing StringBuilder for the 32-byte case so it's not an overly flattering baseline. The optimized putByte() case matched copyMemory, so I think you are right, Vitaly: the compiler missed this optimization.

Benchmark                                                 Mode  Cnt   Score   Error  Units
FastAppenderBenchmark.appendBaselineStringBuilder12Bytes  avgt   20  21.576 ± 1.477  ns/op
FastAppenderBenchmark.appendBaselineStringBuilder32Bytes  avgt   20  28.352 ± 0.861  ns/op
FastAppenderBenchmark.appendBaselineStringBuilder4Bytes   avgt   20  23.098 ± 0.652  ns/op
FastAppenderBenchmark.appenderAlwaysCopyMemory12Bytes     avgt   20   5.479 ± 0.068  ns/op
FastAppenderBenchmark.appenderAlwaysCopyMemory32Bytes     avgt   20   4.581 ± 0.092  ns/op
FastAppenderBenchmark.appenderAlwaysCopyMemory4Bytes      avgt   20   5.349 ± 0.109  ns/op
FastAppenderBenchmark.appenderAlwaysPutBytes12Bytes       avgt   20   6.947 ± 0.139  ns/op
FastAppenderBenchmark.appenderAlwaysPutBytes32Bytes       avgt   20   4.419 ± 0.041  ns/op
FastAppenderBenchmark.appenderAlwaysPutBytes4Bytes        avgt   20   5.594 ± 0.058  ns/op
