Weird Performance Difference between int and long ...

Rüdiger Möller

Nov 7, 2013, 3:18:24 PM

to mechanica...@googlegroups.com

http://stackoverflow.com/questions/19844048/why-is-long-slower-than-int-in-x64-java#19845277

I could reproduce this even with a correct warm-up and some adjustments; the difference remains.

package de.ruedigermoeller.reallive.play;

public class XYBench {


    long l;
    int i;


    public static void main(String[] args) {
        new XYBench().main();
    }


    public void main() {
        long time;
        System.out.println("Starting the warm-up phase");
        runIntTest();
        runLongTest();
        System.out.println("Warm-up phase done");


        System.out.println("Starting the timing phase (long)");
        time = System.nanoTime();
        runLongTest();
        time = (System.nanoTime() - time) / 1000 / 1000;
        System.out.println("Finished the long loop in " + time + "ms");


        System.out.println("Starting the timing phase (int)");
        time = System.nanoTime();
        runIntTest();
        time = (System.nanoTime() - time) / 1000 / 1000;
        System.out.println("Finished the int loop in " + time + "ms");


    }


    void runIntTest() {
        i = Integer.MAX_VALUE;
        while (!decrementAndCheckInt()) { }
    }


    boolean decrementAndCheckInt() {
        return i-- < 0;
    }


    void runLongTest() {
        l = Integer.MAX_VALUE;
        while (!decrementAndCheckLong()) { }
    }


    boolean decrementAndCheckLong() {
        return l-- < 0L;
    }
}

results in:

Starting the timing phase (long)

Finished the long loop in 1126ms

Starting the timing phase (int)

Finished the int loop in 34ms

Rüdiger Möller

Nov 7, 2013, 4:09:21 PM

to mechanica...@googlegroups.com

Apparently it's because the JIT performs loop unrolling for the int version only ...
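For readers unfamiliar with the term, here is a hand-written Java approximation of what "loop unrolling" means for this countdown. It is only a sketch: HotSpot performs the transformation on its internal representation, not on source code. The class and method names are made up for illustration, and the step of 16 is borrowed from the `add r11d,0xfffffff0` instruction in the disassembly posted later in this thread.

```java
// Hand-written illustration only; names are hypothetical and this is not HotSpot output.
final class UnrollSketch {

    // Plain form: one decrement and one comparison per iteration,
    // the local-variable equivalent of `while (!(i-- < 0)) { }`.
    static int countDownSimple(int start) {
        int v = start;
        while (v-- >= 0) { }
        return v;
    }

    // Unrolled form: a main loop that retires 16 decrements per pass,
    // followed by a clean-up loop for the remaining iterations.
    static int countDownUnrolled(int start) {
        int v = start;
        if (v < 0) {
            return v - 1;      // the plain loop decrements exactly once in this case
        }
        while (v > 13) {       // main loop: 16 iterations folded into one subtraction
            v -= 16;
        }
        while (v > -2) {       // clean-up loop: at most 15 single steps
            v--;
        }
        return v;
    }

    public static void main(String[] args) {
        System.out.println(countDownSimple(1_000_000));   // prints -2
        System.out.println(countDownUnrolled(1_000_000)); // prints -2
    }
}
```

Both methods end at the same value, but the unrolled one executes roughly a sixteenth of the compare-and-branch instructions, which is the kind of saving that would explain a large gap between the two timings.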

Norman Maurer

Nov 8, 2013, 1:07:39 AM

to mechanica...@googlegroups.com

I wonder what the reason is here... Is it a known "limitation"?

I always thought it would do this for long as well.


--
You received this message because you are subscribed to the Google Groups "mechanical-sympathy" group.
To unsubscribe from this group and stop receiving emails from it, send an email to mechanical-symp...@googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.

Martin Thompson

Nov 8, 2013, 2:25:18 AM

to mechanica...@googlegroups.com, norman...@googlemail.com

While I was having fun writing microbenchmarks, Cliff Click pointed out to me a few times that there are fewer optimisations for long ops than for int ops.


Kirk Pepperdine

Nov 8, 2013, 2:26:12 AM

to mechanica...@googlegroups.com, norman...@googlemail.com

I've also seen the lack of optimizations for longs... for some unknown reason....

----
Kirk Pepperdine
Principal Consultant
http://www.kodewerk.com
Tel: +36 60 213 6543
skype: kcpeppe
twitter: @kcpeppe

Java Champion

NetBeans Dream Team


Norman Maurer

Nov 8, 2013, 2:55:25 AM

to Kirk Pepperdine, mechanica...@googlegroups.com

Quite interesting... Will keep this in mind.

Thanks

Georges Gomes

Nov 8, 2013, 3:20:06 AM

to mechanica...@googlegroups.com, norman...@googlemail.com


Jean-Philippe BEMPEL

Nov 8, 2013, 3:43:40 AM

to mechanica...@googlegroups.com

Hello Rüdiger,

To me, the micro-benchmark is flawed (once again).

If you enable the PrintCompilation option, you will see:

Starting the warm-up phase

53 1 benchmark.IntLongBench::decrementAndCheckInt (18 bytes)

54 1% benchmark.IntLongBench::runIntTest @ 6 (14 bytes)

155 2 benchmark.IntLongBench::decrementAndCheckLong (20 bytes)

156 2% benchmark.IntLongBench::runLongTest @ 7 (15 bytes)

Warm-up phase done

Starting the timing phase (long)

2481 3 benchmark.IntLongBench::runLongTest (15 bytes)

Finished the long loop in 2320ms

Starting the timing phase (int)

4802 4 benchmark.IntLongBench::runIntTest (14 bytes)

Finished the int loop in 96ms

So the first compilation gives the OSR version. Maybe OSR optimizations really are different for int and long; I have not checked yet.

But if you call the run methods twice each during warm-up, it is a totally different story:

Starting the warm-up phase

54 1 benchmark.IntLongBench::decrementAndCheckInt (18 bytes)

54 1% benchmark.IntLongBench::runIntTest @ 6 (14 bytes)

157 2 benchmark.IntLongBench::decrementAndCheckLong (20 bytes)

157 2% benchmark.IntLongBench::runLongTest @ 7 (15 bytes)

2477 3 benchmark.IntLongBench::runIntTest (14 bytes)

2573 4 benchmark.IntLongBench::runLongTest (15 bytes)

Warm-up phase done

Starting the timing phase (long)

Finished the long loop in 3079ms

Starting the timing phase (int)

Finished the int loop in 3081ms

So I would not jump to the conclusion that long is generally less optimized than int (Cliff Click's remark aside). It depends on the context; it may be true when only the OSR version is in play, but that remains to be confirmed.

I will check the generated asm.
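A minimal sketch of a harness built around that idea: invoke each test several times before timing it, so that later calls can enter the normally compiled method instead of continuing in the OSR version. The class, the `Counter` workload and the round counts are all assumptions made for illustration. Whether the standard compilation has actually been installed after a given number of rounds depends on the VM and its flags, so the result should still be checked with -XX:+PrintCompilation.

```java
import java.util.concurrent.TimeUnit;

// Hypothetical harness; names and round counts are illustrative assumptions.
final class WarmupHarness {

    // Invoke the task repeatedly so that later calls enter the method from the top.
    static void warmUp(Runnable task, int rounds) {
        for (int r = 0; r < rounds; r++) {
            task.run();
        }
    }

    // Time a single invocation and report it in milliseconds.
    static long timeMillis(Runnable task) {
        long start = System.nanoTime();
        task.run();
        return TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - start);
    }

    // Stand-in workload: counts a long field down, like runLongTest in the benchmark.
    static final class Counter implements Runnable {
        final long startValue;
        long value;
        int runs;

        Counter(long startValue) {
            this.startValue = startValue;
        }

        @Override
        public void run() {
            runs++;
            value = startValue;
            while (value-- >= 0L) { }
        }
    }

    public static void main(String[] args) {
        Counter counter = new Counter(Integer.MAX_VALUE);
        warmUp(counter, 3);
        // Several timed rounds make a change between compiled versions visible.
        for (int round = 0; round < 3; round++) {
            System.out.println("round " + round + ": " + timeMillis(counter) + "ms");
        }
    }
}
```

Printing several timed rounds rather than one makes it easier to spot a run that was still executing a different compiled version of the method.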

Rüdiger Möller

Nov 8, 2013, 8:46:16 AM

to

Hello Jean,

I cannot reproduce that. I ran the benchmark in an endless loop and the results stay the same, so where do you see the flaw?

...

Starting the timing phase (long)

Finished the long loop in 1121ms

Starting the timing phase (int)

Finished the int loop in 34ms

Starting the timing phase (long)

Finished the long loop in 1117ms

Starting the timing phase (int)

Finished the int loop in 34ms

Starting the timing phase (long)

...

while( true ) {


            System.out.println("Starting the timing phase (long)");
            time = System.nanoTime();
            runLongTest();
            time = (System.nanoTime() - time) / 1000 / 1000;
            System.out.println("Finished the long loop in " + time + "ms");


            System.out.println("Starting the timing phase (int)");
            time = System.nanoTime();
            runIntTest();
            time = (System.nanoTime() - time) / 1000 / 1000;
            System.out.println("Finished the int loop in " + time + "ms");
        }

The funny thing is that this:

    boolean decrementAndCheckLong() {
        lo = lo - 1L;
        return lo < -1L;
    }

improves the long test from ~1150ms to ~750ms (i7).
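A quick way to convince oneself that the rewritten method behaves like the original over the range the benchmark covers. This is a sketch with made-up names, and the field is called `l` as in the first listing. The two forms differ only at Long.MIN_VALUE, where the subtraction wraps around, and the benchmark never gets there.

```java
// Hypothetical check; class and method names are illustrative.
final class DecrementVariants {
    long l;

    // Original form from the benchmark: compare the old value, then decrement.
    boolean postDecrement() {
        return l-- < 0L;
    }

    // Rewritten form: decrement first, then compare the new value against -1.
    boolean decrementThenCompare() {
        l = l - 1L;
        return l < -1L;
    }

    public static void main(String[] args) {
        for (long start = -5; start <= 5; start++) {
            DecrementVariants a = new DecrementVariants();
            DecrementVariants b = new DecrementVariants();
            a.l = start;
            b.l = start;
            boolean ra = a.postDecrement();
            boolean rb = b.decrementThenCompare();
            // Both the returned flag and the stored field must match.
            if (ra != rb || a.l != b.l) {
                throw new AssertionError("variants differ at " + start);
            }
        }
        System.out.println("variants agree");
    }
}
```

Since the results are identical here, the timing difference would have to come from the code the JIT generates for each form rather than from the amount of work done.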

Jean-Philippe BEMPEL

Nov 9, 2013, 8:09:09 AM

to mechanica...@googlegroups.com

Hello Rüdiger,

I have modified the warmup phase like this:

System.out.println("Starting the warm-up phase");

runIntTest();

runLongTest();

runIntTest();

runLongTest();

System.out.println("Warm-up phase done");

I ran it on JDK 7 update 40 x64 on Windows:

Starting the warm-up phase

69 1 benchmark.IntLongBench::decrementAndCheckInt (18 bytes)

70 2 % benchmark.IntLongBench::runIntTest @ 6 (14 bytes)

325 3 benchmark.IntLongBench::decrementAndCheckLong (20 bytes)

325 4 % benchmark.IntLongBench::runLongTest @ 7 (15 bytes)

2656 5 benchmark.IntLongBench::runIntTest (14 bytes)

2752 6 benchmark.IntLongBench::runLongTest (15 bytes)

Warm-up phase done

Starting the timing phase (long)

Finished the long loop in 3079ms

Starting the timing phase (int)

Finished the int loop in 3080ms

On Linux RHEL 6.3:

[root@archi-srv benchs]# /root/jdk/jdk1.7.0_40/bin/java -showversion -XX:+PrintCompilation -cp . benchmark.IntLongBench

java version "1.7.0_40"

Java(TM) SE Runtime Environment (build 1.7.0_40-b43)

Java HotSpot(TM) 64-Bit Server VM (build 24.0-b56, mixed mode)

Starting the warm-up phase

614 1 benchmark.IntLongBench::decrementAndCheckInt (18 bytes)

615 2 % benchmark.IntLongBench::runIntTest @ 6 (14 bytes)

741 3 benchmark.IntLongBench::decrementAndCheckLong (20 bytes)

741 4 % benchmark.IntLongBench::runLongTest @ 7 (15 bytes)

2685 5 benchmark.IntLongBench::runIntTest (14 bytes)

2766 6 benchmark.IntLongBench::runLongTest (15 bytes)

Warm-up phase done

Starting the timing phase (long)

Finished the long loop in 2625ms

Starting the timing phase (int)

Finished the int loop in 2625ms


Jean-Philippe BEMPEL

Nov 9, 2013, 8:27:50 AM

to mechanica...@googlegroups.com

Looking at the generated code, the OSR versions of the runXXXTest methods look different:

runIntTest OSR:

[Verified Entry Point]

[Constants]

# {method} 'runIntTest' '()V' in 'benchmark/IntLongBench'

0x00000000026319c0: int3

0x00000000026319c1: data32 data32 nop WORD PTR [rax+rax*1+0x0]

0x00000000026319cc: data32 data32 xchg ax,ax

0x00000000026319d0: mov DWORD PTR [rsp-0x6000],eax

0x00000000026319d7: push rbp

0x00000000026319d8: sub rsp,0x30

0x00000000026319dc: mov rbx,QWORD PTR [rdx]

0x00000000026319df: mov rcx,rdx

0x00000000026319e2: movabs r10,0x6753fff0

0x00000000026319ec: call r10

0x00000000026319ef: mov r11d,DWORD PTR [rbx+0x8] ; implicit exception: dispatches to 0x0000000002631a5d

0x00000000026319f3: cmp r11d,0xef84b2c1 ; {oop('benchmark/IntLongBench')}

0x00000000026319fa: jne 0x0000000002631a4d ;*aload_0

; - benchmark.IntLongBench::runIntTest@6 (line 44)

0x00000000026319fc: mov r11d,DWORD PTR [rbx+0xc] ;*getfield i

; - benchmark.IntLongBench::decrementAndCheckInt@2 (line 49)

; - benchmark.IntLongBench::runIntTest@7 (line 44)

0x0000000002631a00: mov r10d,r11d

0x0000000002631a03: dec r10d ;*aload_0

; - benchmark.IntLongBench::runIntTest@6 (line 44)

0x0000000002631a06: dec r11d ;*isub

; - benchmark.IntLongBench::decrementAndCheckInt@7 (line 49)

; - benchmark.IntLongBench::runIntTest@7 (line 44)

0x0000000002631a09: mov DWORD PTR [rbx+0xc],r11d ;*putfield i

; - benchmark.IntLongBench::decrementAndCheckInt@8 (line 49)

; - benchmark.IntLongBench::runIntTest@7 (line 44)

0x0000000002631a0d: cmp r11d,r10d

0x0000000002631a10: jg 0x0000000002631a06 ;*ifeq

; - benchmark.IntLongBench::runIntTest@10 (line 44)

0x0000000002631a12: cmp r11d,0xd

0x0000000002631a16: jle 0x0000000002631a2e

0x0000000002631a18: nop DWORD PTR [rax+rax*1+0x0]

;*aload_0

; - benchmark.IntLongBench::runIntTest@6 (line 44)

0x0000000002631a20: add r11d,0xfffffff0 ;*isub

; - benchmark.IntLongBench::decrementAndCheckInt@7 (line 49)

; - benchmark.IntLongBench::runIntTest@7 (line 44)

0x0000000002631a24: mov DWORD PTR [rbx+0xc],r11d ;*putfield i

; - benchmark.IntLongBench::decrementAndCheckInt@8 (line 49)

; - benchmark.IntLongBench::runIntTest@7 (line 44)

0x0000000002631a28: cmp r11d,0xd

0x0000000002631a2c: jg 0x0000000002631a20 ;*ifeq

; - benchmark.IntLongBench::runIntTest@10 (line 44)

0x0000000002631a2e: cmp r11d,0xfffffffe

0x0000000002631a32: jle 0x0000000002631a41 ;*aload_0

; - benchmark.IntLongBench::runIntTest@6 (line 44)

0x0000000002631a34: dec r11d ;*isub

; - benchmark.IntLongBench::decrementAndCheckInt@7 (line 49)

; - benchmark.IntLongBench::runIntTest@7 (line 44)

0x0000000002631a37: mov DWORD PTR [rbx+0xc],r11d ;*putfield i

; - benchmark.IntLongBench::decrementAndCheckInt@8 (line 49)

; - benchmark.IntLongBench::runIntTest@7 (line 44)

0x0000000002631a3b: cmp r11d,0xfffffffe

0x0000000002631a3f: jg 0x0000000002631a34

0x0000000002631a41: add rsp,0x30

0x0000000002631a45: pop rbp

0x0000000002631a46: test DWORD PTR [rip+0xfffffffffde8e5b4],eax # 0x00000000004c0000

; {poll_return}

0x0000000002631a4c: ret

runLongTest OSR:

[Verified Entry Point]

[Constants]

# {method} 'runLongTest' '()V' in 'benchmark/IntLongBench'

0x000000000262fea0: int3

0x000000000262fea1: data32 data32 nop WORD PTR [rax+rax*1+0x0]

0x000000000262feac: data32 data32 xchg ax,ax

0x000000000262feb0: mov DWORD PTR [rsp-0x6000],eax

0x000000000262feb7: push rbp

0x000000000262feb8: sub rsp,0x30

0x000000000262febc: mov rbx,QWORD PTR [rdx]

0x000000000262febf: mov rcx,rdx

0x000000000262fec2: movabs r10,0x6753fff0

0x000000000262fecc: call r10

0x000000000262fecf: mov r11d,DWORD PTR [rbx+0x8] ; implicit exception: dispatches to 0x000000000262ff25

0x000000000262fed3: cmp r11d,0xef84b2c1 ; {oop('benchmark/IntLongBench')}

0x000000000262feda: jne 0x000000000262ff16 ;*aload_0

; - benchmark.IntLongBench::runLongTest@7 (line 55)

0x000000000262fedc: mov r10,QWORD PTR [rbx+0x10] ;*getfield l

; - benchmark.IntLongBench::decrementAndCheckLong@2 (line 60)

; - benchmark.IntLongBench::runLongTest@8 (line 55)

0x000000000262fee0: jmp 0x000000000262fee5

0x000000000262fee2: mov r10,r8 ;*aload_0

; - benchmark.IntLongBench::runLongTest@7 (line 55)

0x000000000262fee5: test r10,r10

0x000000000262fee8: jl 0x000000000262ff0e ;*ifge

; - benchmark.IntLongBench::decrementAndCheckLong@13 (line 60)

; - benchmark.IntLongBench::runLongTest@8 (line 55)

0x000000000262feea: xor r11d,r11d ;*invokevirtual decrementAndCheckLong

; - benchmark.IntLongBench::runLongTest@8 (line 55)

0x000000000262feed: mov r8,r10

0x000000000262fef0: dec r8 ;*lsub

; - benchmark.IntLongBench::decrementAndCheckLong@7 (line 60)

; - benchmark.IntLongBench::runLongTest@8 (line 55)

0x000000000262fef3: mov QWORD PTR [rbx+0x10],r8 ; OopMap{rbx=Oop off=87}

;*ifeq

; - benchmark.IntLongBench::runLongTest@11 (line 55)

0x000000000262fef7: test DWORD PTR [rip+0xfffffffffde90103],eax # 0x00000000004c0000

; {poll}

0x000000000262fefd: test r10,r10

0x000000000262ff00: jge 0x000000000262fee2

0x000000000262ff02: add rsp,0x30

0x000000000262ff06: pop rbp

0x000000000262ff07: test DWORD PTR [rip+0xfffffffffde900f3],eax # 0x00000000004c0000

; {poll_return}

0x000000000262ff0d: ret

But the non-OSR versions are very, very similar (apart from DWORD vs. QWORD moves).

runIntTest non-OSR:

[Constants]

# {method} 'runIntTest' '()V' in 'benchmark/IntLongBench'

# [sp+0x20] (sp of caller)

0x000000000262fb80: mov r10d,DWORD PTR [rdx+0x8]

0x000000000262fb84: shl r10,0x3

0x000000000262fb88: cmp rax,r10

0x000000000262fb8b: jne 0x0000000002607a60 ; {runtime_call}

0x000000000262fb91: data32 xchg ax,ax

0x000000000262fb94: nop DWORD PTR [rax+rax*1+0x0]

0x000000000262fb9c: data32 data32 xchg ax,ax

[Verified Entry Point]

0x000000000262fba0: sub rsp,0x18

0x000000000262fba7: mov QWORD PTR [rsp+0x10],rbp ;*synchronization entry

; - benchmark.IntLongBench::runIntTest@-1 (line 43)

0x000000000262fbac: mov DWORD PTR [rdx+0xc],0x7ffffffe

;*putfield i

; - benchmark.IntLongBench::decrementAndCheckInt@8 (line 49)

; - benchmark.IntLongBench::runIntTest@7 (line 44)

0x000000000262fbb3: mov r11d,0x7ffffffe

0x000000000262fbb9: mov r10d,0x7ffffffd

0x000000000262fbbf: jmp 0x000000000262fbcd

0x000000000262fbc1: mov r9d,r10d

0x000000000262fbc4: dec r9d ;*isub

; - benchmark.IntLongBench::decrementAndCheckInt@7 (line 49)

; - benchmark.IntLongBench::runIntTest@7 (line 44)

0x000000000262fbc7: mov r11d,r10d

0x000000000262fbca: mov r10d,r9d ;*aload_0

; - benchmark.IntLongBench::runIntTest@6 (line 44)

0x000000000262fbcd: test r11d,r11d

0x000000000262fbd0: jl 0x000000000262fbf0 ;*ifge

; - benchmark.IntLongBench::decrementAndCheckInt@11 (line 49)

; - benchmark.IntLongBench::runIntTest@7 (line 44)

0x000000000262fbd2: xor r8d,r8d ;*invokevirtual decrementAndCheckInt

; - benchmark.IntLongBench::runIntTest@7 (line 44)

0x000000000262fbd5: mov DWORD PTR [rdx+0xc],r10d ; OopMap{rdx=Oop off=89}

;*ifeq

; - benchmark.IntLongBench::runIntTest@10 (line 44)

0x000000000262fbd9: test DWORD PTR [rip+0xfffffffffde90421],eax # 0x00000000004c0000

; {poll}

0x000000000262fbdf: test r11d,r11d

0x000000000262fbe2: jge 0x000000000262fbc1 ;*getfield i

; - benchmark.IntLongBench::decrementAndCheckInt@2 (line 49)

; - benchmark.IntLongBench::runIntTest@7 (line 44)

0x000000000262fbe4: add rsp,0x10

0x000000000262fbe8: pop rbp

0x000000000262fbe9: test DWORD PTR [rip+0xfffffffffde90411],eax # 0x00000000004c0000

; {poll_return}

0x000000000262fbef: ret

runLongTest non-OSR:

[Constants]

# {method} 'runLongTest' '()V' in 'benchmark/IntLongBench'

# [sp+0x20] (sp of caller)

0x000000000262f880: mov r10d,DWORD PTR [rdx+0x8]

0x000000000262f884: shl r10,0x3

0x000000000262f888: cmp rax,r10

0x000000000262f88b: jne 0x0000000002607a60 ; {runtime_call}

0x000000000262f891: data32 xchg ax,ax

0x000000000262f894: nop DWORD PTR [rax+rax*1+0x0]

0x000000000262f89c: data32 data32 xchg ax,ax

[Verified Entry Point]

0x000000000262f8a0: sub rsp,0x18

0x000000000262f8a7: mov QWORD PTR [rsp+0x10],rbp ;*synchronization entry

; - benchmark.IntLongBench::runLongTest@-1 (line 54)

0x000000000262f8ac: mov QWORD PTR [rdx+0x10],0x7ffffffe

;*putfield l

; - benchmark.IntLongBench::decrementAndCheckLong@8 (line 60)

; - benchmark.IntLongBench::runLongTest@8 (line 55)

0x000000000262f8b4: mov r10d,0x7ffffffe

0x000000000262f8ba: mov r11d,0x7ffffffd

0x000000000262f8c0: jmp 0x000000000262f8ce

0x000000000262f8c2: mov r8,r11

0x000000000262f8c5: dec r8 ;*lsub

; - benchmark.IntLongBench::decrementAndCheckLong@7 (line 60)

; - benchmark.IntLongBench::runLongTest@8 (line 55)

0x000000000262f8c8: mov r10,r11

0x000000000262f8cb: mov r11,r8 ;*aload_0

; - benchmark.IntLongBench::runLongTest@7 (line 55)

0x000000000262f8ce: test r10,r10

0x000000000262f8d1: jl 0x000000000262f8f1 ;*ifge

; - benchmark.IntLongBench::decrementAndCheckLong@13 (line 60)

; - benchmark.IntLongBench::runLongTest@8 (line 55)

0x000000000262f8d3: xor r8d,r8d ;*invokevirtual decrementAndCheckLong

; - benchmark.IntLongBench::runLongTest@8 (line 55)

0x000000000262f8d6: mov QWORD PTR [rdx+0x10],r11 ; OopMap{rdx=Oop off=90}

;*ifeq

; - benchmark.IntLongBench::runLongTest@11 (line 55)

0x000000000262f8da: test DWORD PTR [rip+0xfffffffffde90720],eax # 0x00000000004c0000

; {poll}

0x000000000262f8e0: test r10,r10

0x000000000262f8e3: jge 0x000000000262f8c2 ;*getfield l

; - benchmark.IntLongBench::decrementAndCheckLong@2 (line 60)

; - benchmark.IntLongBench::runLongTest@8 (line 55)

0x000000000262f8e5: add rsp,0x10

0x000000000262f8e9: pop rbp

0x000000000262f8ea: test DWORD PTR [rip+0xfffffffffde90710],eax # 0x00000000004c0000

; {poll_return}

0x000000000262f8f0: ret

Which confirms my observation from the benchmark (so in fact there was no need to benchmark ;-))

Rüdiger Möller

Nov 10, 2013, 1:36:33 PM

to mechanica...@googlegroups.com

Thanks for the clarification, Jean. I am impressed :-)
