Forward vs backward loop over array performance


Eugene Morozov

Oct 11, 2013, 6:19:43 PM
to mechanica...@googlegroups.com
Hello!

Here is a video I watched recently, and I'd like to do a simple check of it in Java. I chose to iterate over an array forward and backward. The code follows.

import java.util.Random;

public class ArrReading {

    private static long tmp = 0;
    private static int[] arr = new int[1000000];

    static {
        System.out.println("Initializes array...");
        Random R = new Random(1);
        for (int i = 0; i < arr.length; i++) {
            arr[i] = R.nextInt();
        }
    }

    public static void main(String[] args) {
        ArrReading bb = new ArrReading();
        System.out.println("Doing warm up");
        bb.warmUp();
        System.out.println("Doing experiment");
        bb.runExperiment();
    }

    public void warmUp() {
        final int[] ints = {1, 2, 3, 4, 5};

        for (int i = 0; i < 10000; i++) {
            decrementFor(ints);
            incrementFor(ints);
        }
    }

    public void runExperiment() {
        long time2 = System.nanoTime();
        tmp += decrementFor(arr);
        long tD = System.nanoTime() - time2;

        long time = System.nanoTime();
        tmp += incrementFor(arr);
        long tI = System.nanoTime() - time;

        System.out.println("++ took: " + tI);
        System.out.println("-- took: " + tD);
        System.out.println("ratio [--/++]=" + (tD * 1.0 / tI));
    }

    public int decrementFor(int[] ar) {
        int sum = 0;
        for (int i = ar.length - 1; i >= 0; i--) {
            sum += ar[i];
        }
        return sum;
    }

    public int incrementFor(int[] ar) {
        int sum = 0;
        for (int i = 0; i < ar.length; i++) {
            sum += ar[i];
        }
        return sum;
    }
}

I believe I've got correctly compiled methods, with no OSR. Here is partial output from -XX:+PrintCompilation:

    278  23       java.util.concurrent.atomic.AtomicLong::compareAndSet (13 bytes)
    282   1%      own.my.incdec.ArrReading::<clinit> @ 30 (54 bytes)
Doing warm up
    299  24       own.my.incdec.ArrReading::decrementFor (25 bytes)
    300  25       own.my.incdec.ArrReading::incrementFor (24 bytes)
Doing experiment
++ took: 354000
-- took: 413000
ratio [--/++]=1.1666666666666667

I have compiled my class ArrReading using this JDK:
$ java -version
java version "1.6.0_29"
Java(TM) SE Runtime Environment (build 1.6.0_29-b11-402-11M3527)
Java HotSpot(TM) 64-Bit Server VM (build 20.4-b02-402, mixed mode)

and ran it using both 1.6.0_29-b11-402-11M3527 and 1.7.0_40-b43. Results for the two Java versions differ, but the interesting issue is that they are not stable: they vary from run to run. I've calculated the mean and standard deviation of the [--/++] ratio:
              jdk 1.6         jdk 1.7
mean          1.0048932375    1.1780869565
stdev         0.1039612282    0.0786770806

I must say that for both cases I gathered a GC log, and in both cases GC never ran even once. Eden was only about 20% full, so there was no reason for it to run. Another interesting fact is that the HotSpot of JDK 1.7 seems more performant here, since it compiles these methods around the 171st millisecond as opposed to around the 300th on JDK 1.6:
    151    8 %           ArrReading::<clinit> @ 30 (54 bytes)
Doing warm up
    169    9             ArrReading::incrementFor (24 bytes)
    171   10             ArrReading::decrementFor (25 bytes)


Do you have any explanation for such behavior?
Why do the numbers differ from run to run? How can I debug what's going on there?

Thank you!

Rüdiger Möller

Oct 11, 2013, 7:03:03 PM
to
My take (there are more hardware-skilled people here ...):

1) Ensure you have turned off dynamic clocking of your CPU.
2) Since the test performance will depend mostly on L1/L2 cache hits, the jitter you are observing is expected, as OS thread context switches may distort cache hit rates randomly (one mitigation is sketched below).
3) HotSpot recognizes and optimizes common loop patterns, so iteration order should not make a big difference.
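One way to mitigate the jitter in point 2 on Linux is to pin the JVM to a single core (a hypothetical invocation; ArrReading is the class from the first post):

$ taskset -c 3 java ArrReading    # run the whole JVM pinned to core 3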

Gil Tene

Oct 11, 2013, 10:36:51 PM
to mechanica...@googlegroups.com
Eugene, some notes:

1. You are measuring and comparing two single occurrences of a very short operation, comparing a 354usec single loop with a 413usec single loop. That single case of a 59usec difference can be caused by "atmospheric noise". If you want to deduce anything about consistent speed here, you should be comparing the total time to execute something like ~1M occurrences of each of these loops.

2. You should warm up your actual test code, not the methods that it calls. If you run the benchmark looping long enough for numbers to mean anything, the decrementFor and incrementFor methods would probably both be inlined into runExperiment(), causing a compilation part-way thru the test unless you warm up the actual runExperiment() code first. The easiest way to do this is to have runExperiment() take an iteration count, and separately loop decrementFor and incrementFor (reporting total time for each loop). You would then call runExperiment() twice directly from main: once with e.g. 50,000 count for warmup, and once with e.g. 1M count for the actual timing (see the sketch after these notes).

3. 10,000 is a dangerous number to use. Too close to the compile threshold. When the compile threshold is 10,000, I use at least 5x that to make sure I'm not on the edge of something...

4. Your 4MB array is probably sitting comfortably in the L3 cache, and missing in L1 and L2. You may want to play with different sizes to see what the cache levels do to your conclusions (e.g. 10KB, 100KB, 1MB, 100MB for L1-hitting, L2-hitting, L3-hitting, and L3-missing respectively).

5. You should probably detail the hardware and OS config you did this on. Since these loop behaviors will be dominated mostly by prefetch engine behavior and L2<->L3 cache bandwidth, what the actual setup is really matters.
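A minimal sketch of the point-2 restructuring (the counts are illustrative, not Gil's exact prescription; tmp and arr are the static fields from the original ArrReading class):

public void runExperiment(int reps) {
    // Time many occurrences of each loop in bulk instead of one pass each.
    long start = System.nanoTime();
    for (int r = 0; r < reps; r++) {
        tmp += decrementFor(arr);   // backward loop
    }
    long tD = System.nanoTime() - start;

    start = System.nanoTime();
    for (int r = 0; r < reps; r++) {
        tmp += incrementFor(arr);   // forward loop
    }
    long tI = System.nanoTime() - start;

    System.out.println("++ took: " + tI + ", -- took: " + tD
            + ", ratio [--/++]=" + (tD * 1.0 / tI));
}

// In main():
//     bb.runExperiment(50000);    // warm-up run, results discarded
//     bb.runExperiment(1000000);  // measured run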

-- Gil.

Kirk Pepperdine

Oct 12, 2013, 2:19:19 AM
to mechanica...@googlegroups.com

On 2013-10-12, at 4:36 AM, Gil Tene <g...@azulsystems.com> wrote:

> Eugene, some notes:
>
> 1. You are measuring and comparing two single occurrences of a very short operation, comparing a 354usec single loop with a 413usec single loop. That single case of a 59usec difference can be caused by "atmospheric noise". If you want to deduce anything about consistent speed here, you should be comparing the total time to execute something like ~1M occurrences of each of these loops.
>
> 2. You should warm up your actual test code, not the methods that it calls. If you run the benchmark looping long enough for numbers to mean anything, the decrementFor and incrementFor methods would probably both be inlined into runExperiment(), causing a compilation part-way thru the test unless you warm up the actual runExperiment() code first. The easiest way to do this is to have runExperiment() take an iteration count, and separately loop decrementFor and incrementFor (reporting total time for each loop). You would then call runExperiment() twice directly from main: once with e.g. 50,000 count for warmup, and once with e.g. 1M count for the actual timing.
>
> 3. 10,000 is a dangerous number to use. Too close to the compile threshold. When the compile threshold is 10,000, I use at least 5x that to make sure I'm not on the edge of something...

And throw away the results that sit on the left side of that edge.
>
> 4. Your 4MB array is probably sitting comfortably in the L3 cache, and missing in L1 and L2. You may want to play with different sizes to see what the cache levels do to your conclusions (e.g. 10KB, 100KB, 1MB, 100MB for L1-hitting, L2-hitting, L3-hitting, and L3-missing respectively).
>
> 5. You should probably detail the hardware and OS config you did this on. Since these loop behaviors will be dominated mostly by prefetch engine behavior and L2<->L3 cache bandwidth, what the actual setup is really matters.

I must say this is a classic MBM (micro-benchmark) in that it's difficult to understand what is *really* happening and consequently what you are *really* measuring.

-- Kirk

Martin Thompson

Oct 12, 2013, 5:49:37 AM
to mechanica...@googlegroups.com
From a hardware perspective, Intel CPUs employ 3 types of prefetchers depending on which microarchitecture you execute on. The following is for Sandy Bridge. The L1 cache has a streaming prefetcher that goes forward only, and an instruction-pointer prefetcher that can spot patterns in either direction. The L2 cache has a streaming prefetcher that works in both directions, and a spatial prefetcher that fetches adjacent cache lines.

So you are more likely to get good performance going forward, but that is not guaranteed: it depends on which prefetcher has decided to help your particular code.

In addition to the points on writing microbenchmarks made by Gil: you need to work over a reasonable range to emphasise the effects. The prefetchers kick in when a pattern of cache misses is detected, usually 5.

Also note that prefetching is disabled if a fence is in the pipeline. This is one of the reasons why overuse of CAS can hurt performance.
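To make that last point concrete, here is a contrived sketch (an illustration, not code from the thread): replacing the plain accumulator with an AtomicLong puts a CAS, and therefore ordering semantics, into every iteration.

import java.util.concurrent.atomic.AtomicLong;

// On JDK 6/7, addAndGet is a CAS loop; the lock-prefixed instruction acts as
// a full fence on x86, so the loop loses its prefetch-friendly streaming flow.
static long sumWithCas(int[] ar) {
    AtomicLong sum = new AtomicLong();
    for (int i = 0; i < ar.length; i++) {
        sum.addAndGet(ar[i]);
    }
    return sum.get();
}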

Martin Grajcar

Oct 12, 2013, 8:07:37 AM
to mechanica...@googlegroups.com
On my Core i5-2400 (Sandy Bridge, L2: 4 × 256 KB, L3: 6 MB) I'm getting very similar results for both directions (and also for the foreach loop). I was expecting a performance drop somewhere near the L2 size, i.e., between 1e4 and 1e5 integers, but there's none. There's a small drop near the L3 size, i.e. between 1e6 and 1e7 integers.

The shortest array (10 elements) is the slowest (per element) because of the loop overhead. From 1e7 elements on, the time per element increases from 0.3 ns to about 0.4 ns. For these sizes the backward loop is slower by about 0.01 to 0.03 ns per element. No idea where this comes from, as failed prefetching would have a much bigger effect.

The parameter is the decimal logarithm of the array length.
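For reference, the foreach variant mentioned above might look like this (a sketch; the exact benchmarked code isn't shown here):

// Enhanced-for version of the forward loop; for arrays this compiles down to
// an ordinary indexed loop in the bytecode.
public int forEachSum(int[] ar) {
    int sum = 0;
    for (int v : ar) {
        sum += v;
    }
    return sum;
}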

Martin Grajcar

Oct 12, 2013, 12:56:33 PM
to mechanica...@googlegroups.com
It just occurred to me that the loop was slowed down by the sequential addition into a single accumulator, so I unrolled it twice and used two accumulators instead of one (unrolling more doesn't help). I dropped the foreach benchmark as it can't easily be modified to use more accumulators.

Now the forward loop is consistently faster by some 5-10%, but who knows what's going on?
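The unrolling described above might look roughly like this (a reconstruction, not the exact code that was benchmarked):

// Two independent accumulators break the serial dependency chain on a single
// `sum`, letting the CPU overlap additions from consecutive elements.
public int incrementForUnrolled(int[] ar) {
    int sum0 = 0, sum1 = 0;
    int i = 0;
    for (; i + 1 < ar.length; i += 2) {
        sum0 += ar[i];
        sum1 += ar[i + 1];
    }
    if (i < ar.length) {    // odd-length tail
        sum0 += ar[i];
    }
    return sum0 + sum1;
}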


Eugene Morozov

Oct 12, 2013, 6:04:19 PM
to mechanica...@googlegroups.com
Gil, thanks a lot for your answer; I found it really useful. Please find my comments inline.

On Sat, Oct 12, 2013 at 6:36 AM, Gil Tene <g...@azulsystems.com> wrote:
> Eugene, some notes:
>
> 1. You are measuring and comparing two single occurrences of a very short operation, comparing a 354usec single loop with a 413usec single loop. That single case of a 59usec difference can be caused by "atmospheric noise". If you want to deduce anything about consistent speed here, you should be comparing the total time to execute something like ~1M occurrences of each of these loops.
>
> 2. You should warm up your actual test code, not the methods that it calls. If you run the benchmark looping long enough for numbers to mean anything, the decrementFor and incrementFor methods would probably both be inlined into runExperiment(), causing a compilation part-way thru the test unless you warm up the actual runExperiment() code first. The easiest way to do this is to have runExperiment() take an iteration count, and separately loop decrementFor and incrementFor (reporting total time for each loop). You would then call runExperiment() twice directly from main: once with e.g. 50,000 count for warmup, and once with e.g. 1M count for the actual timing.

That's indeed correct in the case where I call runExperiment in a loop for some time: -XX:+PrintCompilation then shows it first being OSRed and then compiled. With the current code it's not even possible to measure that, so I will probably adjust the code.

> 3. 10,000 is a dangerous number to use. Too close to the compile threshold. When the compile threshold is 10,000, I use at least 5x that to make sure I'm not on the edge of something...

Fortunately I'm not guessing here, because -XX:+PrintCompilation said the method is compiled. Running it longer than that gains nothing. If the compiler said the method was compiled after 4,500 iterations, then that would be enough and I'd stop there =)

> 4. Your 4MB array is probably sitting comfortably in the L3 cache, and missing in L1 and L2. You may want to play with different sizes to see what the cache levels do to your conclusions (e.g. 10KB, 100KB, 1MB, 100MB for L1-hitting, L2-hitting, L3-hitting, and L3-missing respectively).

I'm on a MacBook Pro 9,2 with an Ivy Bridge 3210M, which has L2 = 256 KB (per core) and L3 = 3 MB. Interestingly, though, I didn't notice anything specific in the ++ vs -- difference for this test. Of course the longer the array, the more time it takes, but the ratio stays pretty much the same.

> 5. You should probably detail the hardware and OS config you did this on. Since these loop behaviors will be dominated mostly by prefetch engine behavior and L2<->L3 cache bandwidth, what the actual setup is really matters.

I believe Rüdiger gave the simplest and thus most likely correct explanation: there is always some noise in the OS, with many unpredictable thread and context switches.


 


Eugene Morozov

Oct 12, 2013, 6:11:33 PM
to mechanica...@googlegroups.com
Martin, could you please explain in more detail why you believe sequential addition might slow down the loop? I didn't get it.

I have also seen you used a while loop, but a for loop might be unrolled by the compiler itself. What's the point of preferring while over for?
I think the incrementing loop might be transformed more aggressively than the decrementing one by the HotSpot compiler, and thus the forward-vs-backward measurements wouldn't be comparable; the idea of the comparison itself would be lost. As far as I know (I haven't had a chance to prove it), in the case of a while loop the compiler cannot do the same. Is that correct, or do you have a different point of view (or experience)?


Martin Grajcar

Oct 12, 2013, 6:27:12 PM
to mechanica...@googlegroups.com
On Sun, Oct 13, 2013 at 12:04 AM, Eugene Morozov <evgeny.a...@gmail.com> wrote:
 
> > 3. 10,000 is a dangerous number to use. Too close to the compile threshold. When the compile threshold is 10,000, I use at least 5x that to make sure I'm not on the edge of something...
> Fortunately I'm not guessing here, because -XX:+PrintCompilation said the method is compiled. Running it longer than that gains nothing. If the compiler said the method was compiled after 4,500 iterations, then that would be enough and I'd stop there =)

What about tiered compilation? For this code it might not apply, but for something more complicated the speed changes many times. And it also varies between runs and whatnot, see [1] and [2].
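(For what it's worth, on JDK 7 tiered compilation can be toggled explicitly to take this variable out of play; hypothetical invocations:)

$ java -XX:+PrintCompilation -XX:-TieredCompilation ArrReading   # server compiler only
$ java -XX:+PrintCompilation -XX:+TieredCompilation ArrReading   # tiered mode forced on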
 
> > 5. You should probably detail the hardware and OS config you did this on. Since these loop behaviors will be dominated mostly by prefetch engine behavior and L2<->L3 cache bandwidth, what the actual setup is really matters.
>
> I believe Rüdiger gave the simplest and thus most likely correct explanation: there is always some noise in the OS, with many unpredictable thread and context switches.

Actually for smaller sizes it's dominated by the sequential addition to `sum`.

Martin Grajcar

Oct 12, 2013, 6:55:28 PM
to mechanica...@googlegroups.com
On Sun, Oct 13, 2013 at 12:11 AM, Eugene Morozov <evgeny.a...@gmail.com> wrote:
> Martin, could you please explain in more detail why you believe sequential addition might slow down the loop? I didn't get it.

That's simple: my CPU can run up to 4 instructions per cycle. So a loop like

while (reps-->0) {
    a ^= b; b += c; c ^= d; d++;
}

can execute at 1 cycle per iteration (assuming loop unrolling to eliminate the counter overhead). In a loop like yours

for (int i = ar.length - 1; i >= 0; i--) {
    sum += ar[i];
}

which after some unrolling looks like

for (...) {
    sum += ar[i];
    sum += ar[i+1];
    sum += ar[i+2];
    sum += ar[i+3];
}
handle_remaining_part;

there's not enough parallelism. Actually, each line is a single instruction, which must wait for the previous one to finish (more precisely, the memory operand gets fetched earlier as this is a separate micro-op, but the addition has to wait for the other operand).

It looks like the JIT can do the unrolling, but it's not smart enough to introduce the "per-line" counters and sum them up afterwards (the way I did).
 
> I have also seen you used a while loop, but a for loop might be unrolled by the compiler itself. What's the point of preferring while over for?

Saving chars to type. For the compiler it's exactly the same.
 
> I think the incrementing loop might be transformed more aggressively than the decrementing one by the HotSpot compiler, and thus the forward-vs-backward measurements wouldn't be comparable; the idea of the comparison itself would be lost. As far as I know (I haven't had a chance to prove it), in the case of a while loop the compiler cannot do the same. Is that correct, or do you have a different point of view (or experience)?

I guess you're viewing a for-loop as something special (like we did when learning theoretical computer science), assuming a known number of iterations. But Java ignores that and treats the for-loop as a fancy way to write a while-loop. I guess the information about "for vs. while" is lost already when compiling to bytecode. This would be stupid if there were no subsequent optimizations, but the loop gets analyzed and unrolled and what not... I'm sure I've measured things like this and never seen a speed difference between for and while.

No matter how you write it, the speed is always the same (unless you do something really original to fool the compiler). This makes sense as the optimizer should use the same code representation for equivalent code as much as possible.

Eugene Morozov

Oct 15, 2013, 6:22:44 PM
to mechanica...@googlegroups.com
Some more specifics regarding the original forward vs backward comparison. My colleague bound my loops to a particular CPU core, and here are some measurements. He used an Intel utility to gather the stats.

EXEC : instructions per nominal CPU cycle
IPC  : instructions per CPU cycle

backward_1K
Core (SKT) | EXEC | IPC  | FREQ  | AFREQ | L3MISS | L2MISS | L3HIT | L2HIT | L3CLK | L2CLK  | READ  | WRITE | TEMP
   3    0     3.25   2.65   1.23    1.23      16 K     77 K    0.79    0.66    0.00    0.00     N/A     N/A     16
   3    0     3.25   2.66   1.22    1.22      27 K     70 K    0.62    0.60    0.00    0.00     N/A     N/A     12

backward_32K
   3    0     3.23   2.64   1.22    1.23      23 K    162 K    0.85    1.00    0.00    0.00     N/A     N/A      9
   3    0     3.24   2.64   1.23    1.23    9501      145 K    0.93    1.00    0.00    0.00     N/A     N/A     10

backward_4M
   3    0     2.55   2.08   1.22    1.23      24 M     25 M    0.02    0.10    0.13    0.00     N/A     N/A      6
   3    0     2.55   2.08   1.22    1.23      24 M     25 M    0.02    0.10    0.13    0.00     N/A     N/A      6


forward_1K
   3    0     1.60   1.30   1.23    1.23      27 K    103 K    0.73    0.68    0.00    0.00     N/A     N/A     10
   3    0     1.59   1.30   1.22    1.22      27 K     71 K    0.61    0.61    0.00    0.00     N/A     N/A     12

forward_32K
   3    0     1.42   1.16   1.22    1.23      30 K    239 K    0.87    0.99    0.00    0.00     N/A     N/A     13
   3    0     1.43   1.16   1.23    1.23    8026      162 K    0.95    1.00    0.00    0.00     N/A     N/A     13

forward_4M
   3    0     1.07   0.88   1.22    1.22      26 M     27 M    0.02    0.01    0.14    0.00     N/A     N/A     13
   3    0     1.05   0.87   1.21    1.22      26 M     27 M    0.02    0.01    0.14    0.00     N/A     N/A     13

What is interesting is that there is a good correlation between duration and the EXEC parameter. Forward is faster, but its EXEC is lower. According to the description, the backward loop retires more instructions per CPU cycle. That didn't make sense to me: for a fixed number of instructions, time = instructions / (IPC × clock rate), so more instructions per cycle should mean less time.

* It's obvious that my intuitive explanation is not correct, but is there a simple explanation of what's happening with these EXEC and IPC numbers?
* Are there other ways to comprehend what's going on, apart from looking at the assembly code of what was really run?

> > I think the incrementing loop might be transformed more aggressively than the decrementing one by the HotSpot compiler, and thus the forward-vs-backward measurements wouldn't be comparable; the idea of the comparison itself would be lost. As far as I know (I haven't had a chance to prove it), in the case of a while loop the compiler cannot do the same. Is that correct, or do you have a different point of view (or experience)?

> I guess you're viewing a for-loop as something special (like we did when learning theoretical computer science), assuming a known number of iterations.

Yep, I do that. According to the page... I've just read it again, and it actually says nothing about for or while loops specifically; it's just about loops.
 
> But Java ignores that and treats the for-loop as a fancy way to write a while-loop. I guess the information about "for vs. while" is lost already when compiling to bytecode.

I admit that it's lost when compiled, but somehow decompilers do know whether those loops are for or while.
Here is an example: the output of javap for a class file compiled by JDK 1.6, which contains two methods, main and main2, with a for and a while loop counting from 0 up to 10, each printing the loop variable.
public static void main(java.lang.String[]);
  Code:
   0: iconst_0
   1: istore_1
   2: iload_1
   3: bipush 10
   5: if_icmpge 21
   8: getstatic #2; //Field java/lang/System.out:Ljava/io/PrintStream;
   11: iload_1
   12: invokevirtual #3; //Method java/io/PrintStream.println:(I)V
   15: iinc 1, 1
   18: goto 2
   21: return

public static void main2(java.lang.String[]);
  Code:
   0: iconst_0
   1: istore_1
   2: iload_1
   3: bipush 10
   5: if_icmpge 21
   8: getstatic #2; //Field java/lang/System.out:Ljava/io/PrintStream;
   11: iload_1
   12: invokevirtual #3; //Method java/io/PrintStream.println:(I)V
   15: iinc 1, 1
   18: goto 2
   21: return
 
Try to guess which one is which, or maybe they are both for loops? =)
But Java Decompiler understood it perfectly and showed each correctly. If that's lost in the bytecode, then how did it do that?

That's the real code:
    public static void main(String[] args) {
        for (int i = 0; i < 10; i++) {
            System.out.println(i);
        }
    }
    
    public static void main2(String[] args) {
        int i = 0;
        while (i < 10) {
            System.out.println(i);
            i++;
        }
    }