I was really rooting for System.arraycopy since it is an intrinsic. I'm also surprised that Java code can beat arraycopy / Unsafe for longs, which seems to indicate that the latter operate on 4-byte chunks instead of 8-byte chunks.
Is this (the fact that they don't operate on 8-byte - or perhaps even larger - chunks) a bug / a known shortcoming?
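For reference, the variants under discussion probably look something like the following sketch (names and loop shapes are illustrative, not necessarily identical to the linked benchmark):

```java
import java.util.Arrays;

public class CopyVariants {
    // forward element-by-element copy
    static void manualCopy(long[] src, long[] dst) {
        for (int i = 0; i < src.length; i++) dst[i] = src[i];
    }

    // backward element-by-element copy
    static void manualCopyDec(long[] src, long[] dst) {
        for (int i = src.length - 1; i >= 0; i--) dst[i] = src[i];
    }

    // intrinsified bulk copy
    static void arrayCopy(long[] src, long[] dst) {
        System.arraycopy(src, 0, dst, 0, src.length);
    }

    public static void main(String[] args) {
        long[] src = new long[1024];
        for (int i = 0; i < src.length; i++) src[i] = i;
        long[] a = new long[1024], b = new long[1024], c = new long[1024];
        manualCopy(src, a);
        manualCopyDec(src, b);
        arrayCopy(src, c);
        if (!Arrays.equals(src, a) || !Arrays.equals(src, b) || !Arrays.equals(src, c))
            throw new AssertionError("copy variants disagree");
        System.out.println("all three variants agree");
    }
}
```

All three are semantically equivalent for disjoint arrays; the question is purely which machine code the JIT emits for each.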
--
You received this message because you are subscribed to the Google Groups "mechanical-sympathy" group.
To unsubscribe from this group and stop receiving emails from it, send an email to mechanical-symp...@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.
Dude, perfasm is awesome! Now I just need to stare at the AT&T syntax until my head stops hurting :)
What I think I found (I posted the entire output here [1]):
Bytes:
- manual copy (both increasing / decreasing) uses vmovq w/o loop unrolling [2] (so moving 8 bytes at once)
- System.arraycopy uses tight and unrolled loop of vmovdqu [3] (moving 32 bytes / 256 bits every iteration)
- Unsafe.copyMemory does the same
Shorts / Ints / Longs:
- manual copy: vmovdqu
- System.arraycopy / Unsafe.copyMemory: unrolled vmovdqu
Object references (using compressed Oops - just added these):
- manual copy: some complicated combinations of movs / shrs [4]
- System.arraycopy: unrolled vmovdqu (couldn't get Unsafe.copyMemory to work :-))
[1] https://github.com/gpanther/benchmark-arraycopy/blob/master/perfasm.txt
[2] https://github.com/gpanther/benchmark-arraycopy/blob/master/perfasm.txt#L135
https://github.com/gpanther/benchmark-arraycopy/blob/master/perfasm.txt#L483
[3] https://github.com/gpanther/benchmark-arraycopy/blob/master/perfasm.txt#L919
[4] https://github.com/gpanther/benchmark-arraycopy/blob/master/perfasm.txt#L4613
The reason for doing the random pad first is that array copying (and other streaming operations) can be highly sensitive to vector-sized alignment, both of the individual arrays (hence the initial random pad) and of the relative alignment between the source and the destination. Depending on the platform, alignments to 8, 16 (e.g. SSE/AVX), 32 (e.g. AVX2), or 64 (e.g. AVX512/AVX3) bytes can have a significant effect on performance.
Now getting back to my original question: do the following two pieces of code differ substantially regarding the alignment of the resulting objects? (assuming that pad, source and destination are fields on the object)
pad = new byte[new Random().nextInt(1024)];
source = new long[size];
destination = new long[size];
System.gc();
And:
source = new long[size];
destination = new long[size];
System.gc();
My reasoning as to why these are the same (with regard to alignment): the System.gc() call at the end compacts the heap and relocates both arrays, so any offset introduced by the random pad should be undone by the time we benchmark.
If we want to test the influence of alignment on copying, wouldn't it be better to allocate a byte[] array and arraycopy from/to it using random starting offsets? (given that arraycopy moves 32 bytes per iteration anyway)
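A minimal sketch of that idea (the timing harness is omitted - a real run would wrap each (srcOff, dstOff) combination in a JMH benchmark; the array sizes and offset ranges here are arbitrary):

```java
import java.util.Arrays;
import java.util.Random;

public class AlignmentCopy {
    public static void main(String[] args) {
        Random rnd = new Random(42);
        byte[] backing = new byte[1 << 16]; // one large backing array
        rnd.nextBytes(backing);
        byte[] dst = new byte[1 << 16];
        int len = 1024;
        // Vary the source/destination offsets to shift the relative alignment
        // of the copied region; a benchmark would time each combination.
        for (int srcOff = 0; srcOff < 64; srcOff += 8) {
            for (int dstOff = 0; dstOff < 64; dstOff += 8) {
                System.arraycopy(backing, srcOff, dst, dstOff, len);
                // sanity check that the copy is correct
                if (!Arrays.equals(Arrays.copyOfRange(backing, srcOff, srcOff + len),
                                   Arrays.copyOfRange(dst, dstOff, dstOff + len))) {
                    throw new AssertionError("copy mismatch at " + srcOff + "/" + dstOff);
                }
            }
        }
        System.out.println("all offset combinations copied correctly");
    }
}
```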
Cheers,
Attila
Only on my phone now but I think the version I looked at had the destination, rather than the source, array filled with random data.
In my mind a better way to measure alignment effects would be the following:
I did exactly that here [1] (although I didn't try all the combinations). You can see the results drawn on candlestick charts here [2] and here [3] (I also attached it to the email for convenience). The first chart is for an i7 and the second one is for a Xeon.
Both of them show strong bi-modality, so alignment definitely matters - and I'm still unsure how the benchmark attached to the Jira issue varies the alignment, other than through the pure randomness of memory allocation.
Also, perhaps the arraycopy code could account for alignment by copying the first/last few elements differently and only copying the aligned data in the main loop.
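That head/tail peeling idea might look roughly like this in scalar Java (VECTOR_ELEMS and the alignment-by-index scheme are illustrative stand-ins for what a stub would do with real addresses):

```java
public class PeeledCopy {
    static final int VECTOR_ELEMS = 4; // 4 longs = 32 bytes, one AVX2 vector

    static void copy(long[] src, int srcOff, long[] dst, int dstOff, int len) {
        int i = 0;
        // head: peel element-wise until the destination index is vector-aligned
        while (i < len && (dstOff + i) % VECTOR_ELEMS != 0) {
            dst[dstOff + i] = src[srcOff + i];
            i++;
        }
        // main loop: whole vectors (an unrolled scalar stand-in here)
        for (; i + VECTOR_ELEMS <= len; i += VECTOR_ELEMS) {
            dst[dstOff + i]     = src[srcOff + i];
            dst[dstOff + i + 1] = src[srcOff + i + 1];
            dst[dstOff + i + 2] = src[srcOff + i + 2];
            dst[dstOff + i + 3] = src[srcOff + i + 3];
        }
        // tail: the remaining 0..VECTOR_ELEMS-1 elements
        for (; i < len; i++) {
            dst[dstOff + i] = src[srcOff + i];
        }
    }

    public static void main(String[] args) {
        long[] src = new long[100];
        for (int k = 0; k < src.length; k++) src[k] = k;
        long[] dst = new long[100];
        copy(src, 3, dst, 5, 90);
        for (int k = 0; k < 90; k++) {
            if (dst[5 + k] != src[3 + k]) throw new AssertionError("mismatch at " + k);
        }
        System.out.println("peeled copy correct");
    }
}
```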
Cheers,
Attila
[2] https://github.com/gpanther/benchmark-arraycopy/blob/master/alignment-i7.png
[3] https://github.com/gpanther/benchmark-arraycopy/blob/master/alignment-xeon.png
Now it tests three scenarios:
Performance counter stats for 'jdk-9/bin/java -Djol.tryWithSudo=true -jar target/benchmarks.jar BenchmarkLongArrayCopyStandalone2.arraycopy':
41886,813709 task-clock (msec) # 0,881 CPUs utilized
4.845 context-switches # 0,116 K/sec
509 cpu-migrations # 0,012 K/sec
22.246 page-faults # 0,531 K/sec
133.105.085.141 cycles # 3,178 GHz
<not supported> stalled-cycles-frontend
<not supported> stalled-cycles-backend
193.885.831.009 instructions # 1,46 insns per cycle
33.125.314.936 branches # 790,829 M/sec
265.929.932 branch-misses # 0,80% of all branches
47,536583457 seconds time elapsed
Benchmark (size) Mode Cnt Score Error Units
BenchmarkLongArrayCopyStandalone2.arraycopy 1024 avgt 20 194.047 ± 0.005 ns/op
BenchmarkLongArrayCopyStandalone2.manualCopy 1024 avgt 20 263.778 ± 0.036 ns/op
BenchmarkLongArrayCopyStandalone2.manualCopy_Dec 1024 avgt 20 196.387 ± 0.040 ns/op
2.49% 1.28% │ ↗ │││ 0x00007f79d5773a70: vmovdqu -0x38(%rdi,%rdx,8),%ymm0
2.61% 1.60% │ │ │││ 0x00007f79d5773a76: vmovdqu %ymm0,-0x38(%rcx,%rdx,8)
3.39% 0.84% │ │ │││ 0x00007f79d5773a7c: vmovdqu -0x18(%rdi,%rdx,8),%ymm1
28.53% 16.22% │ │ │││ 0x00007f79d5773a82: vmovdqu %ymm1,-0x18(%rcx,%rdx,8)
56.50% 73.29% ↘ │ │││ 0x00007f79d5773a88: add $0x8,%rdx
╰ │││ 0x00007f79d5773a8c: jle Stub::jlong_disjoint_arraycopy+48 0x00007f79d5773a70
2.68% 1.67% │││││ │↗ 0x00007f233c9a98b0: vmovdqu -0x8(%r11,%rbx,8),%ymm0
0.24% 0.21% │││││ ││ 0x00007f233c9a98b7: vmovdqu %ymm0,-0x8(%r10,%rbx,8)
10.44% 11.96% │││││ ││ 0x00007f233c9a98be: vmovdqu -0x28(%r11,%rbx,8),%ymm0
0.03% 0.05% │││││ ││ 0x00007f233c9a98c5: vmovdqu %ymm0,-0x28(%r10,%rbx,8)
8.94% 10.19% │││││ ││ 0x00007f233c9a98cc: vmovdqu -0x48(%r11,%rbx,8),%ymm0
0.11% 0.05% │││││ ││ 0x00007f233c9a98d3: vmovdqu %ymm0,-0x48(%r10,%rbx,8)
9.62% 10.50% │││││ ││ 0x00007f233c9a98da: vmovdqu -0x68(%r11,%rbx,8),%ymm0
0.08% 0.11% │││││ ││ 0x00007f233c9a98e1: vmovdqu %ymm0,-0x68(%r10,%rbx,8)
9.28% 7.94% │││││ ││ 0x00007f233c9a98e8: vmovdqu -0x88(%r11,%rbx,8),%ymm0
0.20% 0.03% │││││ ││ 0x00007f233c9a98f2: vmovdqu %ymm0,-0x88(%r10,%rbx,8)
10.83% 8.90% │││││ ││ 0x00007f233c9a98fc: vmovdqu -0xa8(%r11,%rbx,8),%ymm0
0.17% 0.08% │││││ ││ 0x00007f233c9a9906: vmovdqu %ymm0,-0xa8(%r10,%rbx,8)
10.49% 8.92% │││││ ││ 0x00007f233c9a9910: vmovdqu -0xc8(%r11,%rbx,8),%ymm0
0.08% 0.06% │││││ ││ 0x00007f233c9a991a: vmovdqu %ymm0,-0xc8(%r10,%rbx,8)
8.87% 6.60% │││││ ││ 0x00007f233c9a9924: vmovdqu -0xe8(%r11,%rbx,8),%ymm0
0.12% 0.05% │││││ ││ 0x00007f233c9a992e: vmovdqu %ymm0,-0xe8(%r10,%rbx,8)
│││││ ││ ;*lastore {reexecute=0 rethrow=0 return_oop=0}
│││││ ││ ; - net.greypanther.BenchmarkLongArrayCopyStandalone2::manualCopy_Dec@20 (line 105)
8.55% 7.71% │││││ ││ 0x00007f233c9a9938: add $0xffffffffffffffe0,%ebx ;*iinc {reexecute=0 rethrow=0 return_oop=0}
│││││ ││ ; - net.greypanther.BenchmarkLongArrayCopyStandalone2::manualCopy_Dec@21 (line 104)
0.03% │││││ ││ 0x00007f233c9a993b: cmp $0x1e,%ebx
│││││ │╰ 0x00007f233c9a993e: jg 0x00007f233c9a98b0 ;*iflt {reexecute=0 rethrow=0 return_oop=0}
│││││ │ ; - net.greypanther.BenchmarkLongArrayCopyStandalone2::manualCopy_Dec@8 (line 104)
You may want to list the specific processor model in each case. "i7" and "Xeon" are not specific enough to know what core model is used. Vector sizes and processor optimization of streaming copy loops (and for increasing vs. decreasing directions) can vary a bunch across generations of cores.
You should also throw 64-byte alignments into the mix. Starting with Skylake cores ("v5" in Xeon models) the AVX vectors go that wide.
Sent from Gil's iPhone
> On Feb 27, 2016, at 7:34 AM, Attila-Mihaly Balazs <dify...@gmail.com> wrote:
>
> Long story short, I think that the JVM does much more aggressive loop
> unrolling for manualCopy_Dec which seems to be a good match for the
> particular model of i7.
>
> Cheers,
> Attila
>
Well, you seem to be where we (Sergey and I) were three weeks ago, but
now we know significantly more. You might want to read the comments and
see the data in the original bug report:
https://bugs.openjdk.java.net/browse/JDK-8150730
Hello all,
Sorry for the delay - I now found some more free time to follow up on this thread.
First: what are the exact processor models used?
I test on my laptop (Intel(R) Core(TM) i7-4600U) and a server (Xeon(R) CPU E5-2665).
Second, Gil / Aleksey: I agree that alignment plays a big role in the behaviour of array copy. I even produced some charts ([1] and [2]) trying to quantify its impact. You can clearly see a bi-modal distribution (i.e. some combinations of source / destination alignment work well, others work poorly). The numbers were produced by this benchmark [3], which does the following:
- allocates the arrays at startup statically
- disables JMH forking so that all tests reuse the exact same arrays
- find the "zero point" in the arrays (the first byte the address of which is 64 byte aligned)
- benchmarks from there
BTW, Gil, I see no reason why the JVM would need to access the array's length since arrays can't be resized in Java (i.e. their length can be cached).
On Thursday, March 31, 2016, Attila-Mihaly Balazs <dify...@gmail.com> wrote:
> BTW, Gil, I see no reason why the JVM would need to access the array's length since arrays can't be resized in Java (i.e. their length can be cached).
In short, range checks.
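The range-check point is actually observable from Java: System.arraycopy validates the whole range up front (which requires both lengths) and leaves the destination untouched on failure, while a manual loop is bounds-checked store by store and can complete partially. A small demonstration:

```java
import java.util.Arrays;

public class RangeChecks {
    public static void main(String[] args) {
        long[] src = new long[8];
        Arrays.fill(src, 7L);
        long[] dst = new long[4];

        // System.arraycopy checks the full range before copying anything:
        // the failed call leaves dst untouched.
        try {
            System.arraycopy(src, 0, dst, 0, 8);
        } catch (IndexOutOfBoundsException expected) {
            if (dst[0] != 0) throw new AssertionError("arraycopy partially copied");
        }

        // The manual loop is bounds-checked per element: the first four
        // elements land in dst before the fifth store throws.
        try {
            for (int i = 0; i < 8; i++) dst[i] = src[i];
        } catch (ArrayIndexOutOfBoundsException expected) {
            if (dst[0] != 7) throw new AssertionError("manual loop copied nothing");
        }
        System.out.println("arraycopy: all-or-nothing; manual loop: partial");
    }
}
```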
--
Sent from my phone
> - disables JMH forking so that all tests reuse the exact same arrays
Disabling forking is bad for JMH; it should be used only for
debugging, not for actual performance runs.
> BTW, Gil, I see no reason why the JVM would need to access the array's length since arrays can't be resized in Java (i.e. their length can be cached).
In short, range checks.
Cheers,
Attila