CMOVcc v. Jcc (probably one for Gil)


Michael Barker

Jan 1, 2013, 12:17:26 AM
to mechanica...@googlegroups.com
Hi,

During a course on HPC that I took a couple of months ago, we talked
about the cost of branch mis-prediction and ways to avoid it. One
simple case for functions like max(a, b) or abs(a) was to use CMP and
CMOV (conditional move) instead of CMP and a conditional jump (Jcc).
Branches in functions like max/abs are often hard to predict
correctly, as it is common for both sides to be taken with roughly
equal probability. There is an additional cost in that CMOV is
effectively a 3-argument instruction (it implicitly consumes the
flags register). It was also recommended that, in C, using the
ternary expression for abs (a < 0 ? -a : a) was better than a pure
arithmetic-based solution (e.g. see Hacker's Delight section 2-4), as
the CMOV version compiles to fewer instructions.
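
To make that last point concrete, here is my own illustration
(written in Java rather than C, and not taken from the course notes):

    // Illustration only: two ways to write abs for a 32-bit int. The ternary
    // form gives the compiler a clean diamond to turn into CMP+CMOVcc (or
    // CMP+Jcc); the Hacker's Delight 2-4 style form is branch-free but, as
    // noted above, tends to compile to more instructions than the CMOV version.
    static int absTernary(int a) {
        return a < 0 ? -a : a;
    }

    static int absArithmetic(int a) {
        int mask = a >> 31;        // 0 if a >= 0, -1 (all bits set) if a < 0
        return (a ^ mask) - mask;  // flips the bits and adds 1 only when a < 0
    }

(Both share the usual caveat that Integer.MIN_VALUE has no positive
counterpart, just like Math.abs.)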

Playing with a little bit of C code:

int max(int a, int b)
{
    return a < b ? b : a;
}

If I do a standard, unoptimised compile (with the Clang compiler),
the following assembler is generated:

_max:
0000000100000e30 pushq %rbp
0000000100000e31 movq %rsp,%rbp
0000000100000e34 movl %edi,0xfc(%rbp)
0000000100000e37 movl %esi,0xf8(%rbp)
0000000100000e3a movl 0xfc(%rbp),%esi
0000000100000e3d cmpl 0xf8(%rbp),%esi
0000000100000e40 jge 0x100000e51
0000000100000e46 movl 0xf8(%rbp),%eax
0000000100000e49 movl %eax,0xf4(%rbp)
0000000100000e4c jmp 0x100000e57
0000000100000e51 movl 0xfc(%rbp),%eax
0000000100000e54 movl %eax,0xf4(%rbp)
0000000100000e57 movl 0xf4(%rbp),%eax
0000000100000e5a popq %rbp
0000000100000e5b ret
0000000100000e5c nopl 0x00(%rax)

With the most basic optimisation level (-O1), the ternary pattern is
spotted and the CMOV instruction is used:

_max:
0000000100000e70 pushq %rbp
0000000100000e71 movq %rsp,%rbp
0000000100000e74 cmpl %esi,%edi
0000000100000e76 cmovll %esi,%edi
0000000100000e79 movl %edi,%eax
0000000100000e7b popq %rbp
0000000100000e7c ret
0000000100000e7d nopl (%rax)

If I benchmark them, the -O1 version is twice as fast (although that
could be due to the other optimisations included at -O1; I was unable
to apply just the CMOV optimisation in isolation).

However, with HotSpot (1.7.0u4) the generated code uses the JGE instruction:

# {method} 'max' '(II)I' in 'java/lang/Math'
# parm0:    rsi       = int
# parm1:    rdx       = int
#           [sp+0x20]  (sp of caller)
0x000000010243be80: sub    $0x18,%rsp
0x000000010243be87: mov    %rbp,0x10(%rsp)      ;*synchronization entry
                                                ; - java.lang.Math::max@-1 (line 802)
0x000000010243be8c: cmp    %edx,%esi
0x000000010243be8e: jge    0x000000010243be9e   ;*if_icmplt
                                                ; - java.lang.Math::max@2 (line 802)
0x000000010243be90: mov    %edx,%eax            ;*ireturn
                                                ; - java.lang.Math::max@10 (line 802)
0x000000010243be92: add    $0x10,%rsp
0x000000010243be96: pop    %rbp
0x000000010243be97: test   %eax,-0xd4ce9d(%rip) # 0x00000001016ef000
                                                ;   {poll_return}
0x000000010243be9d: retq
0x000000010243be9e: mov    %esi,%eax
0x000000010243bea0: jmp    0x000000010243be92

Is this a missing optimisation in HotSpot, or is there some other
reason why sticking with a conditional jump is a better idea? I think
it would be slightly trickier to spot the ternary pattern in the Java
bytecode, as it does not have an explicit instruction; it could even
be done as an intrinsic directly on the Math.min/abs/max methods.
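
For reference, the ternary in Math.max(int, int) is just an ordinary
compare-and-branch diamond at the bytecode level, roughly the
following (javap-style and abbreviated, but consistent with the
@2/@10 bytecode indices in the disassembly above):

     0: iload_0
     1: iload_1
     2: if_icmplt     9    // the diamond the JIT would have to recognise
     5: iload_0
     6: goto          10
     9: iload_1
    10: ireturn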

Mike.

Gil Tene

Jan 3, 2013, 4:01:14 AM
to mechanica...@googlegroups.com
Strange, I see CMOVcc used by the -server compiler in both Oracle JDK 1.7.0_10 and 1.7.0_03 (hard to believe it temporarily went away in 1.7.0_04). Maybe it's something about how your micro-benchmark is exercising the code?

My crude micro-benchmark loops the following:

    long MathMaxSpeedLoop(long loopCount) {
        long sum = 0;
        for (long i = 0; i < loopCount; i++) {
            sum += Math.max(loopCount & 0x80, i & 0x80);
        }
        return sum;
    }

(I call it once for warmup with a loop count of 50000, and once for timing with a loop count of 4000000000L)

and with -XX:+UnlockDiagnosticVMOptions -XX:+PrintAssembly I see roughly the same thing for both Oracle JDK 1.7.0_10 and 1.7.0_03

...
  # {method} 'max' '(JJ)J' in 'java/lang/Math'
  # parm0:    rsi:rsi   = long
  # parm1:    rdx:rdx   = long
  #           [sp+0x20]  (sp of caller)
  0x00007f19850612c0: push   %rbp
  0x00007f19850612c1: sub    $0x10,%rsp
  0x00007f19850612c5: nop                       ;*synchronization entry
                                                ; - java.lang.Math::max@-1 (line 816)
  0x00007f19850612c6: cmp    %rdx,%rsi
  0x00007f19850612c9: mov    %rsi,%rax
  0x00007f19850612cc: cmovl  %rdx,%rax          ;*lreturn
                                                ; - java.lang.Math::max@11 (line 816)
  0x00007f19850612d0: add    $0x10,%rsp
  0x00007f19850612d4: pop    %rbp
  0x00007f19850612d5: test   %eax,0x980fd25(%rip)        # 0x00007f198e871000
                                                ;   {poll_return}
  0x00007f19850612db: retq   
  0x00007f19850612dc: hlt    
... 

And the hot loop looks like this:

...
  0x00007f0a0505fa40: mov    %r11,%r8
  0x00007f0a0505fa43: and    $0x80,%r8          ;*land
                                                ; - perf.org.HdrHistogram.GenericPerfTest::MathMaxSpeedLoop@23 (line 50)
  0x00007f0a0505fa4a: add    $0x1,%r11          ;*ladd
                                                ; - perf.org.HdrHistogram.GenericPerfTest::MathMaxSpeedLoop@32 (line 49)
  0x00007f0a0505fa4e: cmp    %r8,%r10
  0x00007f0a0505fa51: mov    %r10,%r9
  0x00007f0a0505fa54: cmovl  %r8,%r9
  0x00007f0a0505fa58: add    %r9,%rax           ; OopMap{off=91}
                                                ;*goto
                                                ; - perf.org.HdrHistogram.GenericPerfTest::MathMaxSpeedLoop@35 (line 49)
  0x00007f0a0505fa5b: test   %eax,0xaa3759f(%rip)        # 0x00007f0a0fa97000
                                                ;*goto
                                                ; - perf.org.HdrHistogram.GenericPerfTest::MathMaxSpeedLoop@35 (line 49)
                                                ;   {poll}
  0x00007f0a0505fa61: cmp    %rdx,%r11
  0x00007f0a0505fa64: jl     0x00007f0a0505fa40  ;*ifge
....


Zing similarly uses cmov, but it seems to take this a step further, unrolling the loop to do these 4-at-a-time and scheduling those 4 andi/cmp/mov/cmov combos a bit so that they interleave. For this specific tight-loop-means-nothing-in-the-real-world micro-benchmark, the Zing-generated code completes about 1.5x faster. (The relevant code snippet below is taken straight from ZVision's CPU/code-blob profiling screen.)

...
  %      samples  address     instruction                encoding
                  0x5002bac6  nop [rax*1+rax+0] // 10 byte nop   0x66660f1f840000000000
  9.48%  489      0x5002bad0  lea8 rdi,[rdx*1+0x7]       0x488d3c1507000000
                  0x5002bad8  lea8 rsi,[rdx*1+0x5]       0x488d341505000000
  1.82%  94       0x5002bae0  lea8 rcx,[rdx*1+0x6]       0x488d0c1506000000
  4.71%  243      0x5002bae8  mov8 rdx,r10               0x4c89d2
  4.94%  255      0x5002baeb  mov8 rbp,rdx               0x4889d5
                  0x5002baee  and4i ebp,0x80             0x81e580000000
  0.06%  3        0x5002baf4  and4i edi,0x80             0x81e780000000
  5.16%  266      0x5002bafa  and4i ecx,0x80             0x81e180000000
  4.67%  241      0x5002bb00  and4i esi,0x80             0x81e680000000
  0.04%  2        0x5002bb06  lea8 r10,[rdx*1+0x4]       0x4c8d141504000000
  0.02%  1        0x5002bb0e  cmp8 r09,rsi               0x4c3bce
  6.26%  323      0x5002bb11  mov8 rbx,r09               0x4c89cb
  3.59%  185      0x5002bb14  cmov8l rbx,rsi             0x480f4cde
  7.48%  386      0x5002bb18  cmp8 r09,rbp               0x4c3bcd
  1.34%  69       0x5002bb1b  mov8 rsi,r09               0x4c89ce
  3.22%  166      0x5002bb1e  cmov8l rsi,rbp             0x480f4cf5
  8.05%  415      0x5002bb22  cmp8 r09,rcx               0x4c3bc9
  0.08%  4        0x5002bb25  mov8 rbp,r09               0x4c89cd
  4.94%  255      0x5002bb28  cmov8l rbp,rcx             0x480f4ce9
  6.57%  339      0x5002bb2c  cmp8 r09,rdi               0x4c3bcf
  0.06%  3        0x5002bb2f  mov8 rcx,r09               0x4c89c9
  3.68%  190      0x5002bb32  cmov8l rcx,rdi             0x480f4ccf
  7.17%  370      0x5002bb36  add8 rsi,rax               0x4803f0
  0.10%  5        0x5002bb39  add8 rbx,rsi               0x4803de
  3.51%  181      0x5002bb3c  add8 rbx,rbp               0x4803dd
  4.94%  255      0x5002bb3f  lea8 rax,[rcx*1+rbx]       0x488d040b
  8.09%  417      0x5002bb43  cmp8 r10,r11               0x4d3bd3
                  0x5002bb46  jl 0x5002bad0 // perf.org.HdrHistogram.GenericPerfTest.MathMaxSpeedLoop(J)J+0x98   0x7c88
...

Michael Barker

Jan 3, 2013, 3:17:46 PM
to Gil Tene, mechanica...@googlegroups.com
If I use Gil's version I see the CMOV every time.  If I change it slightly:

public static void main(String[] args)
{
    Random random = new Random();
    long max = 0;
    
    max = run(random.nextLong(), 50000);
    if (max == 0)
    {
        throw new RuntimeException();
    }
    
    max = run(random.nextLong(), 4000000000L);
    if (max == 0)
    {
        throw new RuntimeException();
    }
}

private static long run(long a, long loopCount)
{
    long result = -1;
    for (long i = 0; i < loopCount; i++)
    {
        result += Math.max(a & 0x80, i & 0x80);
    }
    return result;
}

Run with: java -XX:+UnlockDiagnosticVMOptions -XX:CompileCommand=print,*Math.max -XX:-Inline -cp bin MaxCalculation

I've seen it compile to both a CMOVL and a JGE after running it multiple times, so it is definitely not a missing optimisation; I suspect it is another case of HotSpot being smarter than me.

Mike.


tcn

Oct 12, 2013, 6:14:17 AM
to mechanica...@googlegroups.com, Gil Tene


On Thursday, January 3, 2013 9:17:46 PM UTC+1, mikeb01 wrote:

I've seen it compile to both a CMOVL and a JGE after running multiple times.  So definitely not a missing optimisation, I suspect it is another case of Hotspot being smarter than me.

 

GCC is easy: -O3 emits cmovl, -O2 does not and is significantly slower.

Since I have had doubts about SIMD vectorization in the JVM in the past (well, I still have them...), the following claim caught my attention:

"The latest JDK-7 release of HotSpot emits integer cmoves for x86/x64 architectures but not 
floating-point cmoves."

Turns out that "integer" appears to actually mean int and not long, but that's still only part of the answer.

So, changing "sum" from long to int, running the JVM [1] with -Xcomp [2], and moving the if-then into its own method eventually resulted in cmovl being emitted.
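
Roughly the arrangement that finally produced the cmovl, as an untested sketch (the class and method names are made up):

// Sketch only: int sum, with the if-then pulled out into its own tiny method,
// run with -Xcomp as in [2]. The names here are invented.
public class CmovProbe {

    // small, branch-only body for the JIT to chew on
    static int pick(int a, int b) {
        return a < b ? b : a;
    }

    public static void main(String[] args) {
        int sum = 0;                      // int, not long
        for (int i = 0; i < 100000000; i++) {
            sum += pick(i & 0x80, 64);
        }
        System.out.println(sum);          // keep the result live
    }
}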

Well, not sure what lesson to learn from that :-\

[1] Oracle 7u40, x64
[2] java -XX:+UnlockDiagnosticVMOptions '-XX:CompileCommand=print,*.main' -XX:PrintAssemblyOptions=intel,hdis-help -Xcomp sort | grep -i cmov

Michael Barker

Oct 12, 2013, 6:25:58 AM
to mechanica...@googlegroups.com, Gil Tene
I suspect (but need to do further analysis) that HotSpot is actually looking at the predictability of the branch condition, which would explain why I saw both patterns being generated.  If a condition can be correctly predicted, CMP+Jcc will be faster than CMP+CMOVcc.
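
The sort of experiment I have in mind is roughly the following (a rough sketch rather than a proper benchmark; the class name and data sizes are invented): run the same Math.max loop over predictable data in one JVM and over random data in another, then compare what PrintAssembly shows for each.

import java.util.Random;

public class PredictabilityProbe {

    static long loop(long[] values) {
        long sum = 0;
        for (long v : values) {
            sum += Math.max(v, 64);   // same call site, different input mix
        }
        return sum;
    }

    public static void main(String[] args) {
        boolean predictable = args.length > 0 && args[0].equals("predictable");
        Random random = new Random();
        long[] data = new long[1 << 20];
        for (int i = 0; i < data.length; i++) {
            // predictable: the comparison in Math.max always goes the same way;
            // otherwise: a coin flip per element
            data[i] = predictable ? 128 : (random.nextBoolean() ? 0 : 128);
        }
        long sum = 0;
        for (int run = 0; run < 100; run++) {   // enough work to get loop() compiled
            sum += loop(data);
        }
        System.out.println(sum);
    }
}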

Mike.

