CMOVcc v. Jcc (probably one for Gil)


Michael Barker

Jan 1, 2013, 12:17:26 AM
to mechanica...@googlegroups.com
Hi,

During a course on HPC that I took a couple of months ago, we talked
about the cost of branch mis-prediction and ways to avoid it. One
simple case for functions like max(a, b) or abs(a) was to use CMP and
CMOV (conditional move) instead of CMP and a conditional jump (Jcc).
Branches in functions like max/abs are often hard to predict
correctly, as it is common for both sides to be taken with roughly
equal probability. There is an additional cost in that CMOV is
effectively a 3-argument instruction (it implicitly consumes the
flags register). It was also recommended that, in C, using the
ternary expression for abs (a < 0 ? -a : a) was better than a pure
arithmetic-based solution (e.g. see Hacker's Delight section 2-4), as
the CMOV version compiles to fewer instructions.
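
To make that last point concrete, here is my own illustration
(written in Java rather than C, and not taken from the course notes):

    // Illustration only: two ways to write abs for a 32-bit int. The ternary
    // form gives the compiler a clean diamond to turn into CMP+CMOVcc (or
    // CMP+Jcc); the Hacker's Delight 2-4 style form is branch-free but, as
    // noted above, tends to compile to more instructions than the CMOV version.
    static int absTernary(int a) {
        return a < 0 ? -a : a;
    }

    static int absArithmetic(int a) {
        int mask = a >> 31;        // 0 if a >= 0, -1 (all bits set) if a < 0
        return (a ^ mask) - mask;  // flips the bits and adds 1 only when a < 0
    }

(Both share the usual caveat that Integer.MIN_VALUE has no positive
counterpart, just like Math.abs.)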

Playing with a little bit of C code:

int max(int a, int b)
{
    return a < b ? b : a;
}

If I do a standard, unoptimised compile (with the Clang compiler),
the following assembler is generated:

_max:
0000000100000e30 pushq %rbp
0000000100000e31 movq %rsp,%rbp
0000000100000e34 movl %edi,0xfc(%rbp)
0000000100000e37 movl %esi,0xf8(%rbp)
0000000100000e3a movl 0xfc(%rbp),%esi
0000000100000e3d cmpl 0xf8(%rbp),%esi
0000000100000e40 jge 0x100000e51
0000000100000e46 movl 0xf8(%rbp),%eax
0000000100000e49 movl %eax,0xf4(%rbp)
0000000100000e4c jmp 0x100000e57
0000000100000e51 movl 0xfc(%rbp),%eax
0000000100000e54 movl %eax,0xf4(%rbp)
0000000100000e57 movl 0xf4(%rbp),%eax
0000000100000e5a popq %rbp
0000000100000e5b ret
0000000100000e5c nopl 0x00(%rax)

With the most basic optimisation level (-O1), the ternary pattern is
spotted and the CMOV instruction is used:

_max:
0000000100000e70 pushq %rbp
0000000100000e71 movq %rsp,%rbp
0000000100000e74 cmpl %esi,%edi
0000000100000e76 cmovll %esi,%edi
0000000100000e79 movl %edi,%eax
0000000100000e7b popq %rbp
0000000100000e7c ret
0000000100000e7d nopl (%rax)

If I benchmark them, the -O1 version is twice as fast (although that
could be due to the other optimisations included at -O1; I was unable
to apply just the CMOV optimisation in isolation).

However, with HotSpot (1.7.0u4) the generated code uses the JGE instruction:

# {method} 'max' '(II)I' in 'java/lang/Math'
# parm0:    rsi       = int
# parm1:    rdx       = int
#           [sp+0x20]  (sp of caller)
0x000000010243be80: sub    $0x18,%rsp
0x000000010243be87: mov    %rbp,0x10(%rsp)      ;*synchronization entry
                                                ; - java.lang.Math::max@-1 (line 802)
0x000000010243be8c: cmp    %edx,%esi
0x000000010243be8e: jge    0x000000010243be9e   ;*if_icmplt
                                                ; - java.lang.Math::max@2 (line 802)
0x000000010243be90: mov    %edx,%eax            ;*ireturn
                                                ; - java.lang.Math::max@10 (line 802)
0x000000010243be92: add    $0x10,%rsp
0x000000010243be96: pop    %rbp
0x000000010243be97: test   %eax,-0xd4ce9d(%rip) # 0x00000001016ef000
                                                ;   {poll_return}
0x000000010243be9d: retq
0x000000010243be9e: mov    %esi,%eax
0x000000010243bea0: jmp    0x000000010243be92

Is this a missing optimisation in HotSpot, or is there some other
reason why sticking with a conditional jump is a better idea? I think
it would be slightly trickier to spot the ternary pattern in the Java
bytecode, as it does not have an explicit instruction; it could even
be done as an intrinsic directly on the Math.min/abs/max methods.
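
For reference, the ternary in Math.max(int, int) is just an ordinary
compare-and-branch diamond at the bytecode level, roughly the
following (javap-style and abbreviated, but consistent with the
@2/@10 bytecode indices in the disassembly above):

     0: iload_0
     1: iload_1
     2: if_icmplt     9    // the diamond the JIT would have to recognise
     5: iload_0
     6: goto          10
     9: iload_1
    10: ireturn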

Mike.

Gil Tene

Jan 3, 2013, 4:01:14 AM
to mechanica...@googlegroups.com
Strange, I see CMOVcc used by the -server compiler in both Oracle JDK 1.7.0_10 and 1.7.0_03 (hard to believe it temporarily went away in 1.7.0_04). Maybe it's something about how your micro-benchmark is exercising the code?

My crude micro-benchmark loops the following:

    long MathMaxSpeedLoop(long loopCount) {
        long sum = 0;
        for (long i = 0; i < loopCount; i++) {
            sum += Math.max(loopCount & 0x80, i & 0x80);
        }
        return sum;
    }

(I call it once for warmup with a loop count of 50000, and once for timing with a loop count of 4000000000L)

and with -XX:+UnlockDiagnosticVMOptions -XX:+PrintAssembly I see roughly the same thing for both Oracle JDK 1.7.0_10 and 1.7.0_03

...
  # {method} 'max' '(JJ)J' in 'java/lang/Math'
  # parm0:    rsi:rsi   = long
  # parm1:    rdx:rdx   = long
  #           [sp+0x20]  (sp of caller)
  0x00007f19850612c0: push   %rbp
  0x00007f19850612c1: sub    $0x10,%rsp
  0x00007f19850612c5: nop                       ;*synchronization entry
                                                ; - java.lang.Math::max@-1 (line 816)
  0x00007f19850612c6: cmp    %rdx,%rsi
  0x00007f19850612c9: mov    %rsi,%rax
  0x00007f19850612cc: cmovl  %rdx,%rax          ;*lreturn
                                                ; - java.lang.Math::max@11 (line 816)
  0x00007f19850612d0: add    $0x10,%rsp
  0x00007f19850612d4: pop    %rbp
  0x00007f19850612d5: test   %eax,0x980fd25(%rip)        # 0x00007f198e871000
                                                ;   {poll_return}
  0x00007f19850612db: retq   
  0x00007f19850612dc: hlt    
... 

And the hot loop looks like this:

...
  0x00007f0a0505fa40: mov    %r11,%r8
  0x00007f0a0505fa43: and    $0x80,%r8          ;*land
                                                ; - perf.org.HdrHistogram.GenericPerfTest::MathMaxSpeedLoop@23 (line 50)
  0x00007f0a0505fa4a: add    $0x1,%r11          ;*ladd
                                                ; - perf.org.HdrHistogram.GenericPerfTest::MathMaxSpeedLoop@32 (line 49)
  0x00007f0a0505fa4e: cmp    %r8,%r10
  0x00007f0a0505fa51: mov    %r10,%r9
  0x00007f0a0505fa54: cmovl  %r8,%r9
  0x00007f0a0505fa58: add    %r9,%rax           ; OopMap{off=91}
                                                ;*goto
                                                ; - perf.org.HdrHistogram.GenericPerfTest::MathMaxSpeedLoop@35 (line 49)
  0x00007f0a0505fa5b: test   %eax,0xaa3759f(%rip)        # 0x00007f0a0fa97000
                                                ;*goto
                                                ; - perf.org.HdrHistogram.GenericPerfTest::MathMaxSpeedLoop@35 (line 49)
                                                ;   {poll}
  0x00007f0a0505fa61: cmp    %rdx,%r11
  0x00007f0a0505fa64: jl     0x00007f0a0505fa40  ;*ifge
....


Zing similarly uses cmov, but it seems to take this a step further, unrolling the loop to do these 4-at-a-time and scheduling those 4 andi/cmp/mov/cmov combos a bit so that they interleave. For this specific tight-loop-means-nothing-in-the-real-world micro-benchmark, the Zing-generated code completes about 1.5x faster. (The relevant code snippet below is taken straight from ZVision's CPU/code-blob profiling screen.)

...
  %      samples  address     instruction                encoding
                  0x5002bac6  nop [rax*1+rax+0] // 10 byte nop   0x66660f1f840000000000
  9.48%  489      0x5002bad0  lea8 rdi,[rdx*1+0x7]       0x488d3c1507000000
                  0x5002bad8  lea8 rsi,[rdx*1+0x5]       0x488d341505000000
  1.82%  94       0x5002bae0  lea8 rcx,[rdx*1+0x6]       0x488d0c1506000000
  4.71%  243      0x5002bae8  mov8 rdx,r10               0x4c89d2
  4.94%  255      0x5002baeb  mov8 rbp,rdx               0x4889d5
                  0x5002baee  and4i ebp,0x80             0x81e580000000
  0.06%  3        0x5002baf4  and4i edi,0x80             0x81e780000000
  5.16%  266      0x5002bafa  and4i ecx,0x80             0x81e180000000
  4.67%  241      0x5002bb00  and4i esi,0x80             0x81e680000000
  0.04%  2        0x5002bb06  lea8 r10,[rdx*1+0x4]       0x4c8d141504000000
  0.02%  1        0x5002bb0e  cmp8 r09,rsi               0x4c3bce
  6.26%  323      0x5002bb11  mov8 rbx,r09               0x4c89cb
  3.59%  185      0x5002bb14  cmov8l rbx,rsi             0x480f4cde
  7.48%  386      0x5002bb18  cmp8 r09,rbp               0x4c3bcd
  1.34%  69       0x5002bb1b  mov8 rsi,r09               0x4c89ce
  3.22%  166      0x5002bb1e  cmov8l rsi,rbp             0x480f4cf5
  8.05%  415      0x5002bb22  cmp8 r09,rcx               0x4c3bc9
  0.08%  4        0x5002bb25  mov8 rbp,r09               0x4c89cd
  4.94%  255      0x5002bb28  cmov8l rbp,rcx             0x480f4ce9
  6.57%  339      0x5002bb2c  cmp8 r09,rdi               0x4c3bcf
  0.06%  3        0x5002bb2f  mov8 rcx,r09               0x4c89c9
  3.68%  190      0x5002bb32  cmov8l rcx,rdi             0x480f4ccf
  7.17%  370      0x5002bb36  add8 rsi,rax               0x4803f0
  0.10%  5        0x5002bb39  add8 rbx,rsi               0x4803de
  3.51%  181      0x5002bb3c  add8 rbx,rbp               0x4803dd
  4.94%  255      0x5002bb3f  lea8 rax,[rcx*1+rbx]       0x488d040b
  8.09%  417      0x5002bb43  cmp8 r10,r11               0x4d3bd3
                  0x5002bb46  jl 0x5002bad0 // perf.org.HdrHistogram.GenericPerfTest.MathMaxSpeedLoop(J)J+0x98   0x7c88
...

Michael Barker

Jan 3, 2013, 3:17:46 PM
to Gil Tene, mechanica...@googlegroups.com
If I use Gil's version I see the CMOV every time.  If I change it slightly:

public static void main(String[] args)
{
    Random random = new Random();
    long max = 0;
    
    max = run(random.nextLong(), 50000);
    if (max == 0)
    {
        throw new RuntimeException();
    }
    
    max = run(random.nextLong(), 4000000000L);
    if (max == 0)
    {
        throw new RuntimeException();
    }
}

private static long run(long a, long loopCount)
{
    long result = -1;
    for (long i = 0; i < loopCount; i++)
    {
        result += Math.max(a & 0x80, i & 0x80);
    }
    return result;
}

Run with: java -XX:+UnlockDiagnosticVMOptions -XX:CompileCommand=print,*Math.max -XX:-Inline -cp bin MaxCalculation

I've seen it compile to both a CMOVL and a JGE after running it multiple times, so it is definitely not a missing optimisation; I suspect it is another case of HotSpot being smarter than me.

Mike.


tcn

Oct 12, 2013, 6:14:17 AM
to mechanica...@googlegroups.com, Gil Tene


On Thursday, January 3, 2013 9:17:46 PM UTC+1, mikeb01 wrote:

I've seen it compile to both a CMOVL and a JGE after running multiple times.  So definitely not a missing optimisation, I suspect it is another case of Hotspot being smarter than me.

 

GCC is easy: -O3 emits cmovl, -O2 does not and is significantly slower.

Since I have had doubts about SIMD vectorization in the JVM in the past (well, I still have them...), the following claim caught my attention:

"The latest JDK-7 release of HotSpot emits integer cmoves for x86/x64 architectures but not 
floating-point cmoves."

Turns out that "integer" appears to actually mean int and not long, but that's still only part of the answer.

So, changing "sum" from long to int, running the JVM [1] with -Xcomp [2], and moving the if-then into its own method eventually resulted in cmovl being emitted.
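
Roughly the arrangement that finally produced the cmovl, as an untested sketch (the class and method names are made up):

// Sketch only: int sum, with the if-then pulled out into its own tiny method,
// run with -Xcomp as in [2]. The names here are invented.
public class CmovProbe {

    // small, branch-only body for the JIT to chew on
    static int pick(int a, int b) {
        return a < b ? b : a;
    }

    public static void main(String[] args) {
        int sum = 0;                      // int, not long
        for (int i = 0; i < 100000000; i++) {
            sum += pick(i & 0x80, 64);
        }
        System.out.println(sum);          // keep the result live
    }
}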

Well, not sure what lesson to learn from that :-\

[1] Oracle 7u40, x64
[2] java -XX:+UnlockDiagnosticVMOptions '-XX:CompileCommand=print,*.main' -XX:PrintAssemblyOptions=intel,hdis-help -Xcomp sort | grep -i cmov

Michael Barker

Oct 12, 2013, 6:25:58 AM
to mechanica...@googlegroups.com, Gil Tene
I suspect (but need to do further analysis) that HotSpot is actually looking at the predictability of the branch condition, which would explain why I saw both patterns being generated.  If a condition can be correctly predicted, CMP+Jcc will be faster than CMP+CMOVcc.
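
The sort of experiment I have in mind is roughly the following (a rough sketch rather than a proper benchmark; the class name and data sizes are invented): run the same Math.max loop over predictable data in one JVM and over random data in another, then compare what PrintAssembly shows for each.

import java.util.Random;

public class PredictabilityProbe {

    static long loop(long[] values) {
        long sum = 0;
        for (long v : values) {
            sum += Math.max(v, 64);   // same call site, different input mix
        }
        return sum;
    }

    public static void main(String[] args) {
        boolean predictable = args.length > 0 && args[0].equals("predictable");
        Random random = new Random();
        long[] data = new long[1 << 20];
        for (int i = 0; i < data.length; i++) {
            // predictable: the comparison in Math.max always goes the same way;
            // otherwise: a coin flip per element
            data[i] = predictable ? 128 : (random.nextBoolean() ? 0 : 128);
        }
        long sum = 0;
        for (int run = 0; run < 100; run++) {   // enough work to get loop() compiled
            sum += loop(data);
        }
        System.out.println(sum);
    }
}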

Mike.

