System.arraycopy vs Arrays.copyOf


Carl Mastrangelo

Feb 9, 2017, 11:33:12 AM
to mechanical-sympathy
Hi,

I'm trying to copy an array of bytes and return the copy.  I noticed that there are two ways of doing this: System.arraycopy and Arrays.copyOf, both of which have intrinsic versions in OpenJDK.  I was hoping that the Arrays.copyOf version would skip the initial zeroing of the destination array, and thus be slightly faster.  However, benchmarking them seems to show that neither one is consistently faster than the other.  I'm not that familiar with assembly, so I may be reading it wrong.  That said, reading the "Hottest Region 2" below, it looks like copyOf is still zeroing?  Can that be right?

Benchmark:

@State(Scope.Benchmark)
public class Benchmark {

  @Param({"1048576"})
  public int size;

  public long[] src;
  
  @Setup
  public void setUp() throws Exception {
    src = new long[size];
    for (int i = 0; i < src.length; i++) {
      src[i] = i;
    }
  }

  /**
   * Allocates the destination array explicitly, then copies into it with System.arraycopy.
   */
  @Benchmark
  @BenchmarkMode(Mode.SampleTime)
  @OutputTimeUnit(TimeUnit.NANOSECONDS)
  public long[] syscopy() {
    long[] dest = new long[src.length];
    System.arraycopy(src, 0, dest, 0, src.length);
    return dest;
  }

  /**
   * Lets Arrays.copyOf allocate the destination array and copy into it in one call.
   */
  @Benchmark
  @BenchmarkMode(Mode.SampleTime)
  @OutputTimeUnit(TimeUnit.NANOSECONDS)
  public long[] arrcopy() {
    return Arrays.copyOf(src, src.length);
  }
}

And the output for arrcopy:

# JMH 1.17.3 (released 56 days ago)
# VM version: JDK 1.8.0_92, VM 25.92-b14
# VM invoker: ~/Downloads/jdk1.8.0_92/jre/bin/java
# VM options: -server -Xms2g -Xmx2g -XX:+UnlockDiagnosticVMOptions -XX:LogFile=/tmp/thelogs -XX:+LogCompilation -XX:+PrintNMethods -XX:+PrintNativeNMethods -XX:+PrintAssembly -XX:+PrintInlining -XX:PrintAssemblyOptions=syntax -XX:+PrintCompilation
# Warmup: 10 iterations, 1 s each
# Measurement: 1 iterations, 1 s each
# Timeout: 10 min per iteration
# Threads: 1 thread, will synchronize iterations
# Benchmark mode: Sampling time
# Benchmark: Benchmark.arrcopy
# Parameters: (size = 1048576)

# Run progress: 0.00% complete, ETA 00:00:22
# Fork: 1 of 1
# Preparing profilers: LinuxPerfAsmProfiler 
# Profilers consume stdout and stderr from target VM, use -v EXTRA to copy to console
# Warmup Iteration   1: 3064247.755 ±(99.9%) 1101881.397 ns/op
# Warmup Iteration   2: 2104675.705 ±(99.9%) 116863.885 ns/op
# Warmup Iteration   3: 1933284.271 ±(99.9%) 73166.251 ns/op
# Warmup Iteration   4: 2022234.170 ±(99.9%) 31234.827 ns/op
# Warmup Iteration   5: 2054921.988 ±(99.9%) 171828.509 ns/op
# Warmup Iteration   6: 1984621.714 ±(99.9%) 70876.189 ns/op
# Warmup Iteration   7: 2029098.580 ±(99.9%) 32632.162 ns/op
# Warmup Iteration   8: 1888186.445 ±(99.9%) 27772.059 ns/op
# Warmup Iteration   9: 1956954.176 ±(99.9%) 110294.977 ns/op
# Warmup Iteration  10: 1961879.592 ±(99.9%) 75665.475 ns/op
Iteration   1: 2082120.533 ±(99.9%) 31688.951 ns/op
                 arrcopy·p0.00:   1851392.000 ns/op
                 arrcopy·p0.50:   2059264.000 ns/op
                 arrcopy·p0.90:   2231910.400 ns/op
                 arrcopy·p0.95:   2273280.000 ns/op
                 arrcopy·p0.99:   3109150.720 ns/op
                 arrcopy·p0.999:  4194304.000 ns/op
                 arrcopy·p0.9999: 4194304.000 ns/op
                 arrcopy·p1.00:   4194304.000 ns/op

# Processing profiler results: LinuxPerfAsmProfiler 


Result "arrcopy":
  N = 480
  mean = 2082120.533 ±(99.9%) 31688.951 ns/op

  Histogram, ns/op:
    [1000000.000, 1250000.000) = 0 
    [1250000.000, 1500000.000) = 0 
    [1500000.000, 1750000.000) = 0 
    [1750000.000, 2000000.000) = 166 
    [2000000.000, 2250000.000) = 276 
    [2250000.000, 2500000.000) = 26 
    [2500000.000, 2750000.000) = 6 
    [2750000.000, 3000000.000) = 1 
    [3000000.000, 3250000.000) = 1 
    [3250000.000, 3500000.000) = 1 
    [3500000.000, 3750000.000) = 1 
    [3750000.000, 4000000.000) = 1 
    [4000000.000, 4250000.000) = 1 
    [4250000.000, 4500000.000) = 0 
    [4500000.000, 4750000.000) = 0 

  Percentiles, ns/op:
      p(0.0000) = 1851392.000 ns/op
     p(50.0000) = 2059264.000 ns/op
     p(90.0000) = 2231910.400 ns/op
     p(95.0000) = 2273280.000 ns/op
     p(99.0000) = 3109150.720 ns/op
     p(99.9000) = 4194304.000 ns/op
     p(99.9900) = 4194304.000 ns/op
     p(99.9990) = 4194304.000 ns/op
     p(99.9999) = 4194304.000 ns/op
    p(100.0000) = 4194304.000 ns/op

Secondary result "·asm":
PrintAssembly processed: 156748 total address lines.
Perf output processed (skipped 10.568 seconds):
 Column 1: cycles (1148 events)
 Column 2: instructions (1051 events)

Hottest code regions (>10.00% "cycles" events):

....[Hottest Region 1]..............................................................................
runtime stub, StubRoutines::jlong_disjoint_arraycopy (24 bytes) 

                        0x00007fd8a505278e: neg    %rdx
                  ╭     0x00007fd8a5052791: jmpq   Stub::jlong_disjoint_arraycopy+72 0x0x7fd8a50527c8
                  │↗    0x00007fd8a5052796: mov    0x8(%rdi,%rdx,8),%rax
                  ││    0x00007fd8a505279b: mov    %rax,0x8(%rcx,%rdx,8)
                  ││    0x00007fd8a50527a0: inc    %rdx
                  │╰    0x00007fd8a50527a3: jne    Stub::jlong_disjoint_arraycopy+22 0x0x7fd8a5052796
                  │     0x00007fd8a50527a5: xor    %rax,%rax
                  │     0x00007fd8a50527a8: leaveq 
                  │     0x00007fd8a50527a9: retq   
                  │     0x00007fd8a50527aa: nopw   0x0(%rax,%rax,1)
  0.17%    0.19%  │ ↗   0x00007fd8a50527b0: vmovdqu -0x38(%rdi,%rdx,8),%ymm0
 37.11%   12.94%  │ │   0x00007fd8a50527b6: vmovdqu %ymm0,-0x38(%rcx,%rdx,8)
  5.66%    1.52%  │ │   0x00007fd8a50527bc: vmovdqu -0x18(%rdi,%rdx,8),%ymm1
  0.87%    0.38%  │ │   0x00007fd8a50527c2: vmovdqu %ymm1,-0x18(%rcx,%rdx,8)
  5.14%    0.48%  ↘ │   0x00007fd8a50527c8: add    $0x8,%rdx
                    ╰   0x00007fd8a50527cc: jle    Stub::jlong_disjoint_arraycopy+48 0x0x7fd8a50527b0
                        0x00007fd8a50527ce: sub    $0x4,%rdx
                     ╭  0x00007fd8a50527d2: jg     Stub::jlong_disjoint_arraycopy+100 0x0x7fd8a50527e4
                     │  0x00007fd8a50527d4: vmovdqu -0x18(%rdi,%rdx,8),%ymm0
                     │  0x00007fd8a50527da: vmovdqu %ymm0,-0x18(%rcx,%rdx,8)
                     │  0x00007fd8a50527e0: add    $0x4,%rdx
                     ↘  0x00007fd8a50527e4: (bad)  
                        0x00007fd8a50527e7: rol    $0xf5,%ch
                        0x00007fd8a50527ea: out    %eax,(%dx)
                        0x00007fd8a50527eb: leaveq 
....................................................................................................
 48.95%   15.51%  <total for region 1>

....[Hottest Region 2]..............................................................................
C1, level 3, java.lang.String::hashCode, version 1 (8 bytes) 

                      0x00007fd8a5106814: mov    %ecx,0x8(%rax)
                      0x00007fd8a5106817: mov    %ebx,0xc(%rax)
                      0x00007fd8a510681a: mov    0xa(%rdx),%cl
                      0x00007fd8a510681d: and    $0xff,%rcx
                      0x00007fd8a5106824: sub    %rcx,%rsi
                      0x00007fd8a5106827: add    %rax,%rcx
                      0x00007fd8a510682a: sub    $0x0,%rsi
                  ╭   0x00007fd8a510682e: je     0x00007fd8a5106845
                  │   0x00007fd8a5106834: xor    %rdi,%rdi
                  │   0x00007fd8a5106837: shr    $0x3,%rsi
  4.36%   19.70%  │↗  0x00007fd8a510683b: mov    %rdi,-0x8(%rcx,%rsi,8)
 32.49%   49.57%  ││  0x00007fd8a5106840: dec    %rsi
  0.17%           │╰  0x00007fd8a5106843: jne    0x00007fd8a510683b
                  ↘   0x00007fd8a5106845: retq   
                      0x00007fd8a5106846: push   %rbp
                      0x00007fd8a5106847: mov    %rsp,%rbp
                      0x00007fd8a510684a: mov    %rsp,-0x28(%rsp)
                      0x00007fd8a510684f: sub    $0x80,%rsp
                      0x00007fd8a5106856: mov    %rax,0x78(%rsp)
                      0x00007fd8a510685b: mov    %rcx,0x70(%rsp)
                      0x00007fd8a5106860: mov    %rdx,0x68(%rsp)
                      0x00007fd8a5106865: mov    %rbx,0x60(%rsp)
                      0x00007fd8a510686a: mov    %rbp,0x50(%rsp)
....................................................................................................
 37.02%   69.27%  <total for region 2>

....[Hottest Regions]...............................................................................
 48.95%   15.51%       runtime stub  StubRoutines::jlong_disjoint_arraycopy (24 bytes) 
 37.02%   69.27%        C1, level 3  java.lang.String::hashCode, version 1 (8 bytes) 
  8.62%    9.32%  [kernel.kallsyms]  [unknown] (0 bytes) 
  1.31%    1.05%        C1, level 3  java.util.zip.ZipCoder::getBytes, version 582 (9 bytes) 
  0.61%    0.76%        C1, level 3  java.util.zip.ZipCoder::getBytes, version 582 (17 bytes) 
  0.52%    0.95%        C1, level 3  java.lang.String::hashCode, version 1 (5 bytes) 
  0.44%    0.95%        C1, level 3  java.util.zip.ZipCoder::getBytes, version 582 (24 bytes) 
  0.17%    0.29%        C1, level 3  java.util.zip.ZipCoder::getBytes, version 582 (10 bytes) 
  0.09%           [kernel.kallsyms]  [unknown] (0 bytes) 
  0.09%    0.10%  [kernel.kallsyms]  [unknown] (0 bytes) 
  0.09%           [kernel.kallsyms]  [unknown] (0 bytes) 
  0.09%           [kernel.kallsyms]  [unknown] (0 bytes) 
  0.09%           [kernel.kallsyms]  [unknown] (0 bytes) 
  0.09%           [kernel.kallsyms]  [unknown] (0 bytes) 
  0.09%           [kernel.kallsyms]  [unknown] (0 bytes) 
  0.09%           [kernel.kallsyms]  [unknown] (0 bytes) 
  0.09%           [kernel.kallsyms]  [unknown] (0 bytes) 
  0.09%           [kernel.kallsyms]  [unknown] (0 bytes) 
  0.09%           [kernel.kallsyms]  [unknown] (0 bytes) 
  0.09%    0.10%  [kernel.kallsyms]  [unknown] (0 bytes) 
  1.13%    1.71%  <...other 27 warm regions...>
....................................................................................................
 99.83%  100.00%  <totals>

....[Hottest Methods (after inlining)]..............................................................
 48.95%   15.51%       runtime stub  StubRoutines::jlong_disjoint_arraycopy 
 37.54%   70.22%        C1, level 3  java.lang.String::hashCode, version 1 
  9.76%   10.18%  [kernel.kallsyms]  [unknown] 
  3.22%    3.71%        C1, level 3  java.util.zip.ZipCoder::getBytes, version 582 
  0.09%                 interpreter  iconst_0  3 iconst_0  
  0.09%                 interpreter  if_icmpge  162 if_icmpge  
  0.09%                 interpreter  fast_aputfield  211 fast_aputfield  
  0.09%    0.19%        C1, level 3  Benchmark::arrcopy, version 455 
....................................................................................................
 99.83%   99.81%  <totals>

....[Distribution by Source]........................................................................
 48.95%   15.51%       runtime stub
 40.85%   74.22%        C1, level 3
  9.76%   10.18%  [kernel.kallsyms]
  0.26%                 interpreter
....................................................................................................
 99.83%  100.00%  <totals>



Norman Maurer

Feb 9, 2017, 11:41:28 AM
to mechanica...@googlegroups.com
As far as I know both do zero out. 
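For context, Java requires a freshly allocated array to be observably zero-filled, and Arrays.copyOf pads with zeros when the requested length exceeds the source length. A minimal sketch of that behaviour (the class and method names are illustrative and not from the thread; it shows only the language-level semantics, not what the JIT emits):

```java
import java.util.Arrays;

final class ZeroingDemo {
    // A freshly allocated long[] must read as all zeros.
    static long[] fresh(int n) {
        return new long[n];
    }

    // Arrays.copyOf pads the tail with zeros when newLength > src.length.
    static long[] grow(long[] src, int newLength) {
        return Arrays.copyOf(src, newLength);
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(fresh(3)));                    // [0, 0, 0]
        System.out.println(Arrays.toString(grow(new long[] {7, 8}, 4)));  // [7, 8, 0, 0]
    }
}
```

The zero fill can be skipped only when the compiler can prove that every element is overwritten before anyone can read it.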
--
You received this message because you are subscribed to the Google Groups "mechanical-sympathy" group.
To unsubscribe from this group and stop receiving emails from it, send an email to mechanical-symp...@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Jean-Philippe BEMPEL

Feb 10, 2017, 5:33:41 AM
to mechanical-sympathy
Hi Carl,


Cheers

Nikolay Tsankov

Feb 15, 2017, 4:27:09 AM
to mechanica...@googlegroups.com
I strongly suspect that you are not giving it enough time to warm up.

The zeroing happens when you allocate the destination array, and HotSpot can often optimize it away if you copy into the array right after allocation. As Mr. Shipilev pointed out, there are cases where it does not do so yet, e.g. when you use a field for the length parameter.
To test this assumption, I added another benchmark to your test and ran it with a longer warmup:
  @Benchmark
  public long[] arraycopy_field() {
    long[] dst = new long[size];
    System.arraycopy(src, 0, dst, 0, size);
    return dst;
  }
Results:
Benchmark                    (size)  Mode  Cnt        Score       Error  Units
ArrayBench.arraycopy_field  1048576  avgt   30  1907102.105 ± 30966.659  ns/op       <- zeroing slows it down
ArrayBench.arrcopy          1048576  avgt   30  1402851.015 ±  4220.862  ns/op       <- no zeroing
ArrayBench.syscopy          1048576  avgt   30  1422812.870 ± 26030.386  ns/op       <- no zeroing
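To put the scores in perspective, each operation copies 1048576 longs of 8 bytes each, i.e. 8 MiB. A rough conversion of the ns/op scores above into copy throughput, counting only the bytes written to the destination (the class name is illustrative; the three scores are taken from the table):

```java
import java.util.Locale;

final class CopyThroughput {
    // Bytes written per operation: 1048576 longs * 8 bytes = 8388608 bytes.
    static final long BYTES_PER_OP = 1_048_576L * Long.BYTES;

    // Bytes per nanosecond is numerically equal to GB/s (10^9 bytes per second).
    static double gigabytesPerSecond(double nanosPerOp) {
        return BYTES_PER_OP / nanosPerOp;
    }

    public static void main(String[] args) {
        double field = 1907102.105;   // arraycopy_field, ns/op
        double arrcopy = 1402851.015; // arrcopy, ns/op
        double syscopy = 1422812.870; // syscopy, ns/op

        System.out.printf(Locale.ROOT, "arraycopy_field: %.2f GB/s%n", gigabytesPerSecond(field));   // 4.40
        System.out.printf(Locale.ROOT, "arrcopy:         %.2f GB/s%n", gigabytesPerSecond(arrcopy)); // 5.98
        System.out.printf(Locale.ROOT, "syscopy:         %.2f GB/s%n", gigabytesPerSecond(syscopy)); // 5.90
        // Relative cost of the slower variant compared with arrcopy.
        System.out.printf(Locale.ROOT, "field / arrcopy: %.2fx%n", field / arrcopy);                // 1.36
    }
}
```

On these numbers the field-length variant takes roughly 0.5 ms longer per 8 MiB copy than the other two.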

I have not looked at the assembly though, so I might be wrong.
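The three shapes being compared can be sketched side by side without JMH. This is only an illustration of the code patterns, under the assumption stated in this thread that the length source affects whether the zeroing is elided; it does not measure anything, and the class name CopyShapes is hypothetical:

```java
import java.util.Arrays;

final class CopyShapes {
    final int size;    // plays the role of the benchmark's size field
    final long[] src;

    CopyShapes(int size) {
        this.size = size;
        this.src = new long[size];
        for (int i = 0; i < size; i++) {
            src[i] = i;
        }
    }

    // Allocation and copy length both come from src.length.
    long[] syscopy() {
        long[] dest = new long[src.length];
        System.arraycopy(src, 0, dest, 0, src.length);
        return dest;
    }

    // Allocation and copy length come from a separate field; per the thread,
    // the zeroing may not be elided in this shape.
    long[] arraycopyField() {
        long[] dst = new long[size];
        System.arraycopy(src, 0, dst, 0, size);
        return dst;
    }

    // Allocation and copy are handled inside Arrays.copyOf.
    long[] arrcopy() {
        return Arrays.copyOf(src, src.length);
    }

    public static void main(String[] args) {
        CopyShapes shapes = new CopyShapes(1024);
        // All three variants produce the same contents; only the generated code may differ.
        System.out.println(Arrays.equals(shapes.syscopy(), shapes.arraycopyField()));
        System.out.println(Arrays.equals(shapes.syscopy(), shapes.arrcopy()));
    }
}
```

All three return identical copies, so any timing difference comes from how the JIT compiles each shape, which is best confirmed by inspecting the generated assembly.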
