I'm trying to copy an array of bytes and return the copy. There are two ways of doing this: allocating a new array and filling it with System.arraycopy, or calling Arrays.copyOf. Both have intrinsic versions in OpenJDK. I was hoping that the Arrays.copyOf version would skip the initial zeroing of the destination array and thus be slightly faster. However, benchmarking them shows that neither one is consistently faster than the other. I'm not that familiar with assembly, so I may be reading it wrong, but the loop in "Hottest Region 2" of the perfasm output below looks like copyOf is still zeroing the array. Can that be right?
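For reference, a minimal sketch of the two approaches being compared. The class and method names are illustrative placeholders and are not taken from the benchmark that produced this output; only System.arraycopy and Arrays.copyOf are the real library calls.

```java
import java.util.Arrays;

// Hypothetical helper class, for illustration only.
final class CopyVariants {

    // Variant 1: explicit allocation followed by a copy.
    // Java semantics require a new array to be zero-initialized, so the JIT
    // has to either zero it or prove that every element is overwritten.
    static byte[] copyWithArraycopy(byte[] src) {
        byte[] dst = new byte[src.length];
        System.arraycopy(src, 0, dst, 0, src.length);
        return dst;
    }

    // Variant 2: allocation and copy expressed as a single library call.
    static byte[] copyWithCopyOf(byte[] src) {
        return Arrays.copyOf(src, src.length);
    }

    public static void main(String[] args) {
        // 1048576 bytes, the same size as the benchmark parameter in the log.
        byte[] src = new byte[1 << 20];
        Arrays.fill(src, (byte) 42);

        byte[] a = copyWithArraycopy(src);
        byte[] b = copyWithCopyOf(src);

        // Both variants should produce distinct arrays with identical contents.
        System.out.println(Arrays.equals(a, src) && Arrays.equals(b, src));
        System.out.println(a != src && b != src);
    }
}
```

Both variants return a fresh array with the same contents, so any difference between them would come only from how the JIT handles the allocation and the copy.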
# JMH 1.17.3 (released 56 days ago)
# VM version: JDK 1.8.0_92, VM 25.92-b14
# VM invoker: ~/Downloads/jdk1.8.0_92/jre/bin/java
# VM options: -server -Xms2g -Xmx2g -XX:+UnlockDiagnosticVMOptions -XX:LogFile=/tmp/thelogs -XX:+LogCompilation -XX:+PrintNMethods -XX:+PrintNativeNMethods -XX:+PrintAssembly -XX:+PrintInlining -XX:PrintAssemblyOptions=syntax -XX:+PrintCompilation
# Warmup: 10 iterations, 1 s each
# Measurement: 1 iterations, 1 s each
# Timeout: 10 min per iteration
# Threads: 1 thread, will synchronize iterations
# Benchmark mode: Sampling time
# Benchmark: Benchmark.arrcopy
# Parameters: (size = 1048576)
# Run progress: 0.00% complete, ETA 00:00:22
# Fork: 1 of 1
# Preparing profilers: LinuxPerfAsmProfiler
# Profilers consume stdout and stderr from target VM, use -v EXTRA to copy to console
# Warmup Iteration 1: 3064247.755 ±(99.9%) 1101881.397 ns/op
# Warmup Iteration 2: 2104675.705 ±(99.9%) 116863.885 ns/op
# Warmup Iteration 3: 1933284.271 ±(99.9%) 73166.251 ns/op
# Warmup Iteration 4: 2022234.170 ±(99.9%) 31234.827 ns/op
# Warmup Iteration 5: 2054921.988 ±(99.9%) 171828.509 ns/op
# Warmup Iteration 6: 1984621.714 ±(99.9%) 70876.189 ns/op
# Warmup Iteration 7: 2029098.580 ±(99.9%) 32632.162 ns/op
# Warmup Iteration 8: 1888186.445 ±(99.9%) 27772.059 ns/op
# Warmup Iteration 9: 1956954.176 ±(99.9%) 110294.977 ns/op
# Warmup Iteration 10: 1961879.592 ±(99.9%) 75665.475 ns/op
Iteration 1: 2082120.533 ±(99.9%) 31688.951 ns/op
arrcopy·p0.00: 1851392.000 ns/op
arrcopy·p0.50: 2059264.000 ns/op
arrcopy·p0.90: 2231910.400 ns/op
arrcopy·p0.95: 2273280.000 ns/op
arrcopy·p0.999: 4194304.000 ns/op
arrcopy·p0.9999: 4194304.000 ns/op
arrcopy·p1.00: 4194304.000 ns/op
# Processing profiler results: LinuxPerfAsmProfiler
Result "arrcopy":
N = 480
mean = 2082120.533 ±(99.9%) 31688.951 ns/op
Histogram, ns/op:
[1000000.000, 1250000.000) = 0
[1250000.000, 1500000.000) = 0
[1500000.000, 1750000.000) = 0
[1750000.000, 2000000.000) = 166
[2000000.000, 2250000.000) = 276
[2250000.000, 2500000.000) = 26
[2500000.000, 2750000.000) = 6
[2750000.000, 3000000.000) = 1
[3000000.000, 3250000.000) = 1
[3250000.000, 3500000.000) = 1
[3500000.000, 3750000.000) = 1
[3750000.000, 4000000.000) = 1
[4000000.000, 4250000.000) = 1
[4250000.000, 4500000.000) = 0
[4500000.000, 4750000.000) = 0
Percentiles, ns/op:
p(0.0000) = 1851392.000 ns/op
p(50.0000) = 2059264.000 ns/op
p(90.0000) = 2231910.400 ns/op
p(95.0000) = 2273280.000 ns/op
p(99.9000) = 4194304.000 ns/op
p(99.9900) = 4194304.000 ns/op
p(99.9990) = 4194304.000 ns/op
p(99.9999) = 4194304.000 ns/op
p(100.0000) = 4194304.000 ns/op
Secondary result "·asm":
PrintAssembly processed: 156748 total address lines.
Perf output processed (skipped 10.568 seconds):
Column 1: cycles (1148 events)
Column 2: instructions (1051 events)
Hottest code regions (>10.00% "cycles" events):
....[Hottest Region 1]..............................................................................
runtime stub, StubRoutines::jlong_disjoint_arraycopy (24 bytes)
0x00007fd8a505278e: neg %rdx
╭ 0x00007fd8a5052791: jmpq Stub::jlong_disjoint_arraycopy+72 0x0x7fd8a50527c8
│↗ 0x00007fd8a5052796: mov 0x8(%rdi,%rdx,8),%rax
││ 0x00007fd8a505279b: mov %rax,0x8(%rcx,%rdx,8)
││ 0x00007fd8a50527a0: inc %rdx
│╰ 0x00007fd8a50527a3: jne Stub::jlong_disjoint_arraycopy+22 0x0x7fd8a5052796
│ 0x00007fd8a50527a5: xor %rax,%rax
│ 0x00007fd8a50527a8: leaveq
│ 0x00007fd8a50527a9: retq
│ 0x00007fd8a50527aa: nopw 0x0(%rax,%rax,1)
0.17% 0.19% │ ↗ 0x00007fd8a50527b0: vmovdqu -0x38(%rdi,%rdx,8),%ymm0
37.11% 12.94% │ │ 0x00007fd8a50527b6: vmovdqu %ymm0,-0x38(%rcx,%rdx,8)
5.66% 1.52% │ │ 0x00007fd8a50527bc: vmovdqu -0x18(%rdi,%rdx,8),%ymm1
0.87% 0.38% │ │ 0x00007fd8a50527c2: vmovdqu %ymm1,-0x18(%rcx,%rdx,8)
5.14% 0.48% ↘ │ 0x00007fd8a50527c8: add $0x8,%rdx
╰ 0x00007fd8a50527cc: jle Stub::jlong_disjoint_arraycopy+48 0x0x7fd8a50527b0
0x00007fd8a50527ce: sub $0x4,%rdx
╭ 0x00007fd8a50527d2: jg Stub::jlong_disjoint_arraycopy+100 0x0x7fd8a50527e4
│ 0x00007fd8a50527d4: vmovdqu -0x18(%rdi,%rdx,8),%ymm0
│ 0x00007fd8a50527da: vmovdqu %ymm0,-0x18(%rcx,%rdx,8)
│ 0x00007fd8a50527e0: add $0x4,%rdx
↘ 0x00007fd8a50527e4: (bad)
0x00007fd8a50527e7: rol $0xf5,%ch
0x00007fd8a50527ea: out %eax,(%dx)
0x00007fd8a50527eb: leaveq
....................................................................................................
48.95% 15.51% <total for region 1>
....[Hottest Region 2]..............................................................................
C1, level 3, java.lang.String::hashCode, version 1 (8 bytes)
0x00007fd8a5106814: mov %ecx,0x8(%rax)
0x00007fd8a5106817: mov %ebx,0xc(%rax)
0x00007fd8a510681a: mov 0xa(%rdx),%cl
0x00007fd8a510681d: and $0xff,%rcx
0x00007fd8a5106824: sub %rcx,%rsi
0x00007fd8a5106827: add %rax,%rcx
0x00007fd8a510682a: sub $0x0,%rsi
╭ 0x00007fd8a510682e: je 0x00007fd8a5106845
│ 0x00007fd8a5106834: xor %rdi,%rdi
│ 0x00007fd8a5106837: shr $0x3,%rsi
4.36% 19.70% │↗ 0x00007fd8a510683b: mov %rdi,-0x8(%rcx,%rsi,8)
32.49% 49.57% ││ 0x00007fd8a5106840: dec %rsi
0.17% │╰ 0x00007fd8a5106843: jne 0x00007fd8a510683b
↘ 0x00007fd8a5106845: retq
0x00007fd8a5106846: push %rbp
0x00007fd8a5106847: mov %rsp,%rbp
0x00007fd8a510684a: mov %rsp,-0x28(%rsp)
0x00007fd8a510684f: sub $0x80,%rsp
0x00007fd8a5106856: mov %rax,0x78(%rsp)
0x00007fd8a510685b: mov %rcx,0x70(%rsp)
0x00007fd8a5106860: mov %rdx,0x68(%rsp)
0x00007fd8a5106865: mov %rbx,0x60(%rsp)
0x00007fd8a510686a: mov %rbp,0x50(%rsp)
....................................................................................................
37.02% 69.27% <total for region 2>
....[Hottest Regions]...............................................................................
48.95% 15.51% runtime stub StubRoutines::jlong_disjoint_arraycopy (24 bytes)
37.02% 69.27% C1, level 3 java.lang.String::hashCode, version 1 (8 bytes)
8.62% 9.32% [kernel.kallsyms] [unknown] (0 bytes)
1.31% 1.05% C1, level 3 java.util.zip.ZipCoder::getBytes, version 582 (9 bytes)
0.61% 0.76% C1, level 3 java.util.zip.ZipCoder::getBytes, version 582 (17 bytes)
0.52% 0.95% C1, level 3 java.lang.String::hashCode, version 1 (5 bytes)
0.44% 0.95% C1, level 3 java.util.zip.ZipCoder::getBytes, version 582 (24 bytes)
0.17% 0.29% C1, level 3 java.util.zip.ZipCoder::getBytes, version 582 (10 bytes)
0.09% [kernel.kallsyms] [unknown] (0 bytes)
0.09% 0.10% [kernel.kallsyms] [unknown] (0 bytes)
0.09% [kernel.kallsyms] [unknown] (0 bytes)
0.09% [kernel.kallsyms] [unknown] (0 bytes)
0.09% [kernel.kallsyms] [unknown] (0 bytes)
0.09% [kernel.kallsyms] [unknown] (0 bytes)
0.09% [kernel.kallsyms] [unknown] (0 bytes)
0.09% [kernel.kallsyms] [unknown] (0 bytes)
0.09% [kernel.kallsyms] [unknown] (0 bytes)
0.09% [kernel.kallsyms] [unknown] (0 bytes)
0.09% [kernel.kallsyms] [unknown] (0 bytes)
0.09% 0.10% [kernel.kallsyms] [unknown] (0 bytes)
1.13% 1.71% <...other 27 warm regions...>
....................................................................................................
99.83% 100.00% <totals>
....[Hottest Methods (after inlining)]..............................................................
48.95% 15.51% runtime stub StubRoutines::jlong_disjoint_arraycopy
37.54% 70.22% C1, level 3 java.lang.String::hashCode, version 1
9.76% 10.18% [kernel.kallsyms] [unknown]
3.22% 3.71% C1, level 3 java.util.zip.ZipCoder::getBytes, version 582
0.09% interpreter iconst_0 3 iconst_0
0.09% interpreter if_icmpge 162 if_icmpge
0.09% interpreter fast_aputfield 211 fast_aputfield
0.09% 0.19% C1, level 3 Benchmark::arrcopy, version 455
....................................................................................................
99.83% 99.81% <totals>
....[Distribution by Source]........................................................................
48.95% 15.51% runtime stub
40.85% 74.22% C1, level 3
9.76% 10.18% [kernel.kallsyms]
0.26% interpreter
....................................................................................................
99.83% 100.00% <totals>