Strange, I see CMOVcc used by the -server compiler in both Oracle JDK 1.7.0_10 and 1.7.0_03 (hard to believe it temporarily went away in 1.7.0_04). Maybe it's something about how your micro-benchmark is exercising the code?
My crude micro-benchmark loops the following:
    long MathMaxSpeedLoop(long loopCount) {
        long sum = 0;
        for (long i = 0; i < loopCount; i++) {
            sum += Math.max(loopCount & 0x80, i & 0x80);
        }
        return sum;
    }
(I call it once for warmup with a loop count of 50000, and once for timing with a loop count of 4000000000L.)
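For anyone who wants to reproduce this, here is a minimal sketch of how such a driver might look. The class name MathMaxBench, the static modifier, and the optional command-line override are my assumptions for illustration; only the loop body and the two loop counts come from the description above.

```java
// Hypothetical driver class; the name and the main() wrapper are illustrative only.
class MathMaxBench {

    // Same loop as above, made static so main() can call it directly.
    static long MathMaxSpeedLoop(long loopCount) {
        long sum = 0;
        for (long i = 0; i < loopCount; i++) {
            sum += Math.max(loopCount & 0x80, i & 0x80);
        }
        return sum;
    }

    public static void main(String[] args) {
        // Loop counts as described above; an optional first argument
        // overrides the timed count (that override is an assumption).
        long warmupCount = 50000L;
        long timedCount = args.length > 0 ? Long.parseLong(args[0]) : 4000000000L;

        // Warmup call, so the method gets compiled before the timed call.
        long warmupSum = MathMaxSpeedLoop(warmupCount);

        long start = System.nanoTime();
        long timedSum = MathMaxSpeedLoop(timedCount);
        long elapsedNanos = System.nanoTime() - start;

        // Print the sums so the results are actually used.
        System.out.println("warmup sum = " + warmupSum);
        System.out.println("timed sum  = " + timedSum);
        System.out.println("ns per iteration = " + (double) elapsedNanos / timedCount);
    }
}
```

Running it with -server plus the diagnostic flags mentioned next should let you compare the generated code on your own JVM build.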
With -XX:+UnlockDiagnosticVMOptions -XX:+PrintAssembly, I see roughly the same thing for both Oracle JDK 1.7.0_10 and 1.7.0_03:
...
# {method} 'max' '(JJ)J' in 'java/lang/Math'
# parm0: rsi:rsi = long
# parm1: rdx:rdx = long
# [sp+0x20] (sp of caller)
0x00007f19850612c0: push %rbp
0x00007f19850612c1: sub $0x10,%rsp
0x00007f19850612c5: nop ;*synchronization entry
; - java.lang.Math::max@-1 (line 816)
0x00007f19850612c6: cmp %rdx,%rsi
0x00007f19850612c9: mov %rsi,%rax
0x00007f19850612cc: cmovl %rdx,%rax ;*lreturn
; - java.lang.Math::max@11 (line 816)
0x00007f19850612d0: add $0x10,%rsp
0x00007f19850612d4: pop %rbp
0x00007f19850612d5: test %eax,0x980fd25(%rip) # 0x00007f198e871000
; {poll_return}
0x00007f19850612db: retq
0x00007f19850612dc: hlt
...
And the hot loop looks like this:
...
0x00007f0a0505fa40: mov %r11,%r8
0x00007f0a0505fa43: and $0x80,%r8 ;*land
; - perf.org.HdrHistogram.GenericPerfTest::MathMaxSpeedLoop@23 (line 50)
0x00007f0a0505fa4a: add $0x1,%r11 ;*ladd
; - perf.org.HdrHistogram.GenericPerfTest::MathMaxSpeedLoop@32 (line 49)
0x00007f0a0505fa4e: cmp %r8,%r10
0x00007f0a0505fa51: mov %r10,%r9
0x00007f0a0505fa54: cmovl %r8,%r9
0x00007f0a0505fa58: add %r9,%rax ; OopMap{off=91}
;*goto
; - perf.org.HdrHistogram.GenericPerfTest::MathMaxSpeedLoop@35 (line 49)
0x00007f0a0505fa5b: test %eax,0xaa3759f(%rip) # 0x00007f0a0fa97000
;*goto
; - perf.org.HdrHistogram.GenericPerfTest::MathMaxSpeedLoop@35 (line 49)
; {poll}
0x00007f0a0505fa61: cmp %rdx,%r11
0x00007f0a0505fa64: jl 0x00007f0a0505fa40 ;*ifge
....
Zing similarly uses cmov, but it seems to take this a step further, unrolling the loop to do these 4 at a time and scheduling those 4 andi/cmp/mov/cmov combos a bit to allow them to interleave. For this specific tight-loop-means-nothing-in-the-real-world micro-benchmark, the Zing-generated code completes about 1.5x faster. The relevant code snippet, taken right from ZVision's CPU/code-blob profiling screen:
...
| 0x5002bac6 | nop [rax*1+rax+0] // 10 byte nop | 0x66660f1f840000000000 |
9.48% | 489 | 0x5002bad0 | lea8 rdi,[rdx*1+0x7] | 0x488d3c1507000000 |
| | 0x5002bad8 | lea8 rsi,[rdx*1+0x5] | 0x488d341505000000 |
1.82% | 94 | 0x5002bae0 | lea8 rcx,[rdx*1+0x6] | 0x488d0c1506000000 |
4.71% | 243 | 0x5002bae8 | mov8 rdx,r10 | 0x4c89d2 |
4.94% | 255 | 0x5002baeb | mov8 rbp,rdx | 0x4889d5 |
| | 0x5002baee | and4i ebp,0x80 | 0x81e580000000 |
0.06% | 3 | 0x5002baf4 | and4i edi,0x80 | 0x81e780000000 |
5.16% | 266 | 0x5002bafa | and4i ecx,0x80 | 0x81e180000000 |
4.67% | 241 | 0x5002bb00 | and4i esi,0x80 | 0x81e680000000 |
0.04% | 2 | 0x5002bb06 | lea8 r10,[rdx*1+0x4] | 0x4c8d141504000000 |
0.02% | 1 | 0x5002bb0e | cmp8 r09,rsi | 0x4c3bce |
6.26% | 323 | 0x5002bb11 | mov8 rbx,r09 | 0x4c89cb |
3.59% | 185 | 0x5002bb14 | cmov8l rbx,rsi | 0x480f4cde |
7.48% | 386 | 0x5002bb18 | cmp8 r09,rbp | 0x4c3bcd |
1.34% | 69 | 0x5002bb1b | mov8 rsi,r09 | 0x4c89ce |
3.22% | 166 | 0x5002bb1e | cmov8l rsi,rbp | 0x480f4cf5 |
8.05% | 415 | 0x5002bb22 | cmp8 r09,rcx | 0x4c3bc9 |
0.08% | 4 | 0x5002bb25 | mov8 rbp,r09 | 0x4c89cd |
4.94% | 255 | 0x5002bb28 | cmov8l rbp,rcx | 0x480f4ce9 |
6.57% | 339 | 0x5002bb2c | cmp8 r09,rdi | 0x4c3bcf |
0.06% | 3 | 0x5002bb2f | mov8 rcx,r09 | 0x4c89c9 |
3.68% | 190 | 0x5002bb32 | cmov8l rcx,rdi | 0x480f4ccf |
7.17% | 370 | 0x5002bb36 | add8 rsi,rax | 0x4803f0 |
0.10% | 5 | 0x5002bb39 | add8 rbx,rsi | 0x4803de |
3.51% | 181 | 0x5002bb3c | add8 rbx,rbp | 0x4803dd |
4.94% | 255 | 0x5002bb3f | lea8 rax,[rcx*1+rbx] | 0x488d040b |
8.09% | 417 | 0x5002bb43 | cmp8 r10,r11 | 0x4d3bd3 |
| | 0x5002bb46 | jl 0x5002bad0 // perf.org.HdrHistogram.GenericPerfTest.MathMaxSpeedLoop(J)J+0x98 | 0x7c88 |
...
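To make the unrolling easier to picture, here is a rough source-level sketch of what a 4-at-a-time version of the loop would look like if written by hand. This is only my illustration of the idea, not what either compiler literally does; the class name UnrolledMaxSketch and the method unrolledBy4 are hypothetical, and I am assuming the loop-invariant (loopCount & 0x80) is hoisted into a local, which is how I read the repeated compares against r09 in the listing.

```java
// Hypothetical illustration class; names are not from the original benchmark.
class UnrolledMaxSketch {

    // Reference loop, same as the micro-benchmark above.
    static long MathMaxSpeedLoop(long loopCount) {
        long sum = 0;
        for (long i = 0; i < loopCount; i++) {
            sum += Math.max(loopCount & 0x80, i & 0x80);
        }
        return sum;
    }

    // Hand-unrolled by 4: four independent and/max computations per trip,
    // which gives the CPU room to overlap them.
    static long unrolledBy4(long loopCount) {
        long limit = loopCount & 0x80; // loop-invariant operand (assumed hoisted)
        long sum = 0;
        long i = 0;
        for (; i + 3 < loopCount; i += 4) {
            long m0 = Math.max(limit, i & 0x80);
            long m1 = Math.max(limit, (i + 1) & 0x80);
            long m2 = Math.max(limit, (i + 2) & 0x80);
            long m3 = Math.max(limit, (i + 3) & 0x80);
            sum += m0 + m1 + m2 + m3;
        }
        // Remainder iterations when loopCount is not a multiple of 4.
        for (; i < loopCount; i++) {
            sum += Math.max(limit, i & 0x80);
        }
        return sum;
    }
}
```

Both methods return the same sum for any loop count; the only difference is how much independent work each loop trip exposes.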