public double[] evaluate(double input[]) {
    double a;
    //System.arraycopy(input, 0, activation[0], 0, input.length);
    activation[0] = input;
    //for (int i = 0; i < input.length; i++)
    //    activation[0][i] = input[i];
    for (int i = 1; i < layers; i++) {
        double activation_col[] = activation[i - 1];
        double activation_col_res[] = activation[i];
        double weight_matr[][] = weight[i - 1];
        for (int j = 0; j < activation_col_res.length; j++) {
            double weight_col[] = weight_matr[j];
            double acc = 0;
            for (int k = 0; k < activation_col.length; k++) {
                // ************* variant goes here *************
            }
            activation_col_res[j] = g(acc);
        }
    }
    setChanged();
    notifyObservers();
    return activation[layers - 1];
}
variant 1:
    a = activation_col[k];
    a *= weight_col[k];
    acc += a;
variant 2:
    acc += activation_col[k] * weight_col[k];
Variant 1 is 10% faster than variant 2.
I have a weight matrix of about 1200x400 elements.
Variant 1 averages 4.82 ms.
Variant 2 averages 5.24 ms.
Does this performance difference make any sense?
Does anyone have tips for consistently writing the faster code?
Thanks,
Dimitri
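For anyone who wants to reproduce this, here is a minimal, self-contained sketch of the two inner-loop variants as standalone methods (the class name, array sizes, and fill values are invented for the sketch; they are not from Dimitri's actual network):

```java
public class VariantDemo {
    // Variant 1: explicit temporary, as in the original post.
    static double dot1(double[] act, double[] w) {
        double acc = 0;
        for (int k = 0; k < act.length; k++) {
            double a = act[k];
            a *= w[k];
            acc += a;
        }
        return acc;
    }

    // Variant 2: fused multiply-add expression.
    static double dot2(double[] act, double[] w) {
        double acc = 0;
        for (int k = 0; k < act.length; k++) {
            acc += act[k] * w[k];
        }
        return acc;
    }

    public static void main(String[] args) {
        double[] act = new double[400];
        double[] w = new double[400];
        for (int i = 0; i < act.length; i++) {
            act[i] = i * 0.001;
            w[i] = 1.0;
        }
        // Both variants accumulate the same terms in the same order,
        // so the results are bit-identical.
        System.out.println(dot1(act, w) == dot2(act, w));
    }
}
```

Since both loops perform the identical sequence of floating-point operations, any timing difference comes purely from how the compiler/JIT translates them, not from the arithmetic itself.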
If you are curious, look at the generated Java bytecodes (use javap).
But the difference isn't important: another compiler, JIT, or processor
architecture could give you different results.
> Does anyone have tips for consistently writing the faster code?
- Measure before optimizing
- Focus on algorithmic improvements
--
Diomidis Spinellis
Code Quality: The Open Source Perspective (Addison-Wesley 2006)
http://www.spinellis.gr/codequality?cljp
> variant 1:
>     a = activation_col[k];
>     a *= weight_col[k];
>     acc += a;
> variant 2:
>     acc += activation_col[k] * weight_col[k];
>
> Variant 1 is 10% faster than variant 2.
That is a puzzle. First have a look at the bytecodes generated;
see http://mindprod.com/jgloss/disassembler.html
--
Canadian Mind Products, Roedy Green.
http://mindprod.com Java custom programming, consulting and coaching.
>
> Does this performance difference make any sense?
>
> Does anyone have tips for consistently writing the faster code?
You try variants and measure. Use an AOT compiler, or java -server,
which optimises harder but takes longer to come up to speed.
see http://mindprod.com/jgloss/aot.html
http://mindprod.com/jgloss/javaexe.html
From what I can see this particular code manipulates floating point
numbers. This is one of the (not common) cases where a good Java
compiler and JIT-based runtime system should be able to give you the
same performance as the one you would achieve with (say) C.
You are unlikely to gain anything from moving to assembly, unless you
can utilize instructions that compilers don't typically support. For
example, I see that your code calculates some type of dot product: if
your data allows it, you could gain by using Intel's SSE SIMD extensions
or AMD's 3DNow!. Another possibility would be to move your code to be
executed on a 3D graphics card's hardware.
http://www.ics.forth.gr/eHealth/publications/papers/2005/PCI2005.pdf
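Short of SIMD intrinsics (which plain Java does not expose), a software-level analogue is manual unrolling with independent accumulators, which lets the CPU overlap the floating-point additions. A hedged sketch (class and method names are mine, not from the thread; note that it changes the summation order, so results can differ from the sequential loop in the last bits):

```java
public class UnrolledDot {
    // 4-way unrolled dot product with four independent accumulators.
    static double dot(double[] a, double[] b) {
        double s0 = 0, s1 = 0, s2 = 0, s3 = 0;
        int k = 0;
        int limit = a.length - 3;
        for (; k < limit; k += 4) {
            s0 += a[k] * b[k];
            s1 += a[k + 1] * b[k + 1];
            s2 += a[k + 2] * b[k + 2];
            s3 += a[k + 3] * b[k + 3];
        }
        // Handle the leftover elements when the length isn't a multiple of 4.
        for (; k < a.length; k++) {
            s0 += a[k] * b[k];
        }
        return (s0 + s1) + (s2 + s3);
    }

    public static void main(String[] args) {
        double[] a = {1, 2, 3, 4, 5};
        double[] b = {1, 1, 1, 1, 1};
        System.out.println(dot(a, b)); // prints 15.0
    }
}
```

Whether this actually helps depends on the JIT and the CPU; it is exactly the kind of change you must measure rather than assume.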
> I've found a difference of about 10% in the execution of 2 version of
> this method:
[...]
> variant 1:
>     a = activation_col[k];
>     a *= weight_col[k];
>     acc += a;
> variant 2:
>     acc += activation_col[k] * weight_col[k];
>
> Variant 1 is 10% faster than variant 2.
It's very difficult to estimate exactly what will have marginal effects at this
level of optimisation. The JIT is pretty clever, the CPU does lots of
optimisations of its own, cache effects and alignment effects combine to
confound estimation too. I have no idea what the explanation might be here; it
could be an effect of the (JITed form of the) first expression being able to
make better use of multiple arithmetic units within the CPU, but that's nothing
more than a guess.
You don't mention whether you are using the -client or -server JVM, so I'm
guessing that you aren't aware of how that choice affects the kind of
optimisations that the JIT will do. Rule of thumb: -server is a good deal
better at optimising arithmetic code. FWIW, on the only micro-benchmark I've
run recently (and one micro-benchmark means almost nothing on its own), the
server JVM was doing essentially the same optimisations as the best C++
compiler I have access to (the one in MS VS.2003).
Similarly, you don't mention what CPU you are running on, so I'm guessing that
you are not aware of how details of chip architecture affect how fast a
specific expression of an algorithm can run. Agner Fog has a fascinating (but
long) guide to optimisation for Pentium family processors on this page:
http://www.agner.org/assem/
which also has some links. If nothing else, it should put you off the idea of
re-implementing in assembly ;-)
-- chris
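To check which VM your benchmark is actually running on (the -client vs. -server distinction Chris mentions), you can print the standard system properties; this tiny sketch is an assumption-free way to document your setup when reporting numbers:

```java
public class JvmInfo {
    public static void main(String[] args) {
        // On Sun JDKs of this era this prints e.g.
        // "Java HotSpot(TM) Server VM" or "... Client VM".
        System.out.println(System.getProperty("java.vm.name"));
        System.out.println(System.getProperty("java.version"));
        System.out.println(System.getProperty("os.arch"));
    }
}
```

Including this output alongside benchmark results makes the numbers reproducible and comparable.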
> variant 1:
>     a = activation_col[k];
>     a *= weight_col[k];
>     acc += a;
> variant 2:
>     acc += activation_col[k] * weight_col[k];
>
> Variant 1 is 10% faster than variant 2.
> I have a weight matrix of about 1200x400 elements.
> Variant 1 averages 4.82 ms.
> Variant 2 averages 5.24 ms.
>
> Does this performance difference make any sense?
How exactly did you measure performance? Did you give the JVM some
warm-up runs before doing the real measurement?
Cheers
robert
What exactly do you mean by "wait a little"? For the JVM's
optimization to kick in you have to actually execute the code several
times before you start measuring. Also, you should run each variant
several times to get better measurement accuracy.
robert
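Robert's two points (warm up first, then average over many runs) can be sketched as a minimal timing harness. This is only an illustration of the pattern, not a rigorous benchmark; the class name, iteration counts, and array sizes are invented:

```java
import java.util.Arrays;

public class WarmupBench {
    static double dot(double[] a, double[] b) {
        double acc = 0;
        for (int k = 0; k < a.length; k++) {
            acc += a[k] * b[k];
        }
        return acc;
    }

    public static void main(String[] args) {
        double[] a = new double[1200];
        double[] b = new double[1200];
        Arrays.fill(a, 0.5);
        Arrays.fill(b, 2.0);

        double sink = 0;
        // Warm-up: execute the hot method enough times that the JIT
        // compiles it before any timing starts.
        for (int i = 0; i < 10000; i++) {
            sink += dot(a, b);
        }

        // Measure many repetitions and report the average per call.
        int reps = 1000;
        long t0 = System.nanoTime();
        for (int i = 0; i < reps; i++) {
            sink += dot(a, b);
        }
        long t1 = System.nanoTime();

        System.out.printf("avg %.1f ns per call%n", (t1 - t0) / (double) reps);
        // Use the accumulated result so the JIT cannot eliminate the loop.
        if (sink < 0) System.out.println(sink);
    }
}
```

The `sink` accumulator matters: without consuming the result, a clever JIT may dead-code-eliminate the loop entirely, and you end up timing nothing.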