public double[] evaluate(double input[]) {
    double a;
    //System.arraycopy(input, 0, activation[0], 0, input.length);
    activation[0] = input;
    //for (int i = 0; i < input.length; i++)
    //    activation[0][i] = input[i];
    for (int i = 1; i < layers; i++) {
        double activation_col[] = activation[i - 1];
        double activation_col_res[] = activation[i];
        double weight_matr[][] = weight[i - 1];
        for (int j = 0; j < activation_col_res.length; j++) {
            double weight_col[] = weight_matr[j];
            double acc = 0;
            for (int k = 0; k < activation_col.length; k++) {
                // ************* variant goes here *************
            }
            activation_col_res[j] = g(acc);
        }
    }
    setChanged();
    notifyObservers();
    return activation[layers - 1];
}
variant 1:
    a = activation_col[k];
    a *= weight_col[k];
    acc += a;
variant 2:
    acc += activation_col[k] * weight_col[k];
Variant 1 is 10% faster than variant 2.
I have a weight matrix of about 1200x400 elements.
Variant 1 averages 4.82 ms.
Variant 2 averages 5.24 ms.
Does this performance difference make any sense?
Does anyone have tips for consistently writing the faster code?
Thanks,
Dimitri
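For anyone who wants to reproduce this, here is a minimal, self-contained sketch of the two inner-loop variants as standalone methods (the class name, array sizes, and fill values are invented for the sketch; they are not from Dimitri's actual network):

```java
public class VariantDemo {
    // Variant 1: explicit temporary, as in the original post.
    static double dot1(double[] act, double[] w) {
        double acc = 0;
        for (int k = 0; k < act.length; k++) {
            double a = act[k];
            a *= w[k];
            acc += a;
        }
        return acc;
    }

    // Variant 2: fused multiply-add expression.
    static double dot2(double[] act, double[] w) {
        double acc = 0;
        for (int k = 0; k < act.length; k++) {
            acc += act[k] * w[k];
        }
        return acc;
    }

    public static void main(String[] args) {
        double[] act = new double[400];
        double[] w = new double[400];
        for (int i = 0; i < act.length; i++) {
            act[i] = i * 0.001;
            w[i] = 1.0;
        }
        // Both variants accumulate the same terms in the same order,
        // so the results are bit-identical.
        System.out.println(dot1(act, w) == dot2(act, w));
    }
}
```

Since both loops perform the identical sequence of floating-point operations, any timing difference comes purely from how the compiler/JIT translates them, not from the arithmetic itself.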
If you are curious, look at the generated Java bytecodes (use javap).
But the difference isn't important: another compiler, JIT, or processor
architecture could give you different results.
> Does anyone have tips for consistently writing the faster code?
- Measure before optimizing
- Focus on algorithmic improvements
--
Diomidis Spinellis
Code Quality: The Open Source Perspective (Addison-Wesley 2006)
http://www.spinellis.gr/codequality?cljp
> variant 1:
>     a = activation_col[k];
>     a *= weight_col[k];
>     acc += a;
> variant 2:
>     acc += activation_col[k] * weight_col[k];
>
> Variant 1 is 10% faster than variant 2.
That is a puzzle. First have a look at the bytecodes generated;
see http://mindprod.com/jgloss/disassembler.html
--
Canadian Mind Products, Roedy Green.
http://mindprod.com Java custom programming, consulting and coaching.
>
> Does this performance difference make any sense?
>
> Does anyone have tips for consistently writing the faster code?
You try variants and measure. Use an AOT compiler, or java -server,
which optimises harder but takes longer to come up to speed.
see http://mindprod.com/jgloss/aot.html
http://mindprod.com/jgloss/javaexe.html
From what I can see this particular code manipulates floating point
numbers. This is one of the (not common) cases where a good Java
compiler and JIT-based runtime system should be able to give you the
same performance as the one you would achieve with (say) C.
You are unlikely to gain anything from moving to assembly, unless you
can utilize instructions that compilers don't typically support. For
example, I see that your code calculates some type of dot product: if
your data allows it, you could gain by using Intel's SSE SIMD extensions
or AMD's 3DNow!. Another possibility would be to move your code to be
executed on a 3D graphics card's hardware.
http://www.ics.forth.gr/eHealth/publications/papers/2005/PCI2005.pdf
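Short of SIMD intrinsics (which plain Java does not expose), a software-level analogue is manual unrolling with independent accumulators, which lets the CPU overlap the floating-point additions. A hedged sketch (class and method names are mine, not from the thread; note that it changes the summation order, so results can differ from the sequential loop in the last bits):

```java
public class UnrolledDot {
    // 4-way unrolled dot product with four independent accumulators.
    static double dot(double[] a, double[] b) {
        double s0 = 0, s1 = 0, s2 = 0, s3 = 0;
        int k = 0;
        int limit = a.length - 3;
        for (; k < limit; k += 4) {
            s0 += a[k] * b[k];
            s1 += a[k + 1] * b[k + 1];
            s2 += a[k + 2] * b[k + 2];
            s3 += a[k + 3] * b[k + 3];
        }
        // Handle the leftover elements when the length isn't a multiple of 4.
        for (; k < a.length; k++) {
            s0 += a[k] * b[k];
        }
        return (s0 + s1) + (s2 + s3);
    }

    public static void main(String[] args) {
        double[] a = {1, 2, 3, 4, 5};
        double[] b = {1, 1, 1, 1, 1};
        System.out.println(dot(a, b)); // prints 15.0
    }
}
```

Whether this actually helps depends on the JIT and the CPU; it is exactly the kind of change you must measure rather than assume.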
> I've found a difference of about 10% in the execution of 2 version of
> this method:
[...]
> variant 1:
>     a = activation_col[k];
>     a *= weight_col[k];
>     acc += a;
> variant 2:
>     acc += activation_col[k] * weight_col[k];
>
> Variant 1 is 10% faster than variant 2.
It's very difficult to estimate exactly what will have marginal effects at this
level of optimisation. The JIT is pretty clever, the CPU does lots of
optimisations of its own, cache effects and alignment effects combine to
confound estimation too. I have no idea what the explanation might be here; it
could be an effect of the (JITed form of the) first expression being able to
make better use of multiple arithmetic units within the CPU, but that's nothing
more than a guess.
You don't mention whether you are using the -client or -server JVM, so I'm
guessing that you aren't aware of how that choice affects the kind of
optimisations that the JIT will do. Rule of thumb: -server is a good deal
better at optimising arithmetic code. FWIW, on the only micro-benchmark I've
run recently (and one micro-benchmark means almost nothing on its own), the
server JVM was doing essentially the same optimisations as the best C++
compiler I have access to (the one in MS VS.2003).
Similarly, you don't mention what CPU you are running on, so I'm guessing that
you are not aware of how details of chip architecture affect how fast a
specific expression of an algorithm can run. Agner Fog has a fascinating (but
long) guide to optimisation for Pentium family processors on this page:
http://www.agner.org/assem/
which also has some links. If nothing else, it should put you off the idea of
re-implementing in assembly ;-)
-- chris
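To check which VM your benchmark is actually running on (the -client vs. -server distinction Chris mentions), you can print the standard system properties; this tiny sketch is an assumption-free way to document your setup when reporting numbers:

```java
public class JvmInfo {
    public static void main(String[] args) {
        // On Sun JDKs of this era this prints e.g.
        // "Java HotSpot(TM) Server VM" or "... Client VM".
        System.out.println(System.getProperty("java.vm.name"));
        System.out.println(System.getProperty("java.version"));
        System.out.println(System.getProperty("os.arch"));
    }
}
```

Including this output alongside benchmark results makes the numbers reproducible and comparable.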
> variant 1:
>     a = activation_col[k];
>     a *= weight_col[k];
>     acc += a;
> variant 2:
>     acc += activation_col[k] * weight_col[k];
>
> Variant 1 is 10% faster than variant 2.
> I have a weight matrix of about 1200x400 elements.
> Variant 1 averages 4.82 ms.
> Variant 2 averages 5.24 ms.
>
> Does this performance difference make any sense?
How exactly did you measure performance? Did you give the JVM some
warm-up runs before doing the real measurement?
Cheers
robert
What exactly do you mean by "wait a little"? For the JVM's
optimization to kick in you have to actually execute the code several
times before you start measuring. Also, you should run each variant
several times to get better measurement accuracy.
robert
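Robert's two points (warm up first, then average over many runs) can be sketched as a minimal timing harness. This is only an illustration of the pattern, not a rigorous benchmark; the class name, iteration counts, and array sizes are invented:

```java
import java.util.Arrays;

public class WarmupBench {
    static double dot(double[] a, double[] b) {
        double acc = 0;
        for (int k = 0; k < a.length; k++) {
            acc += a[k] * b[k];
        }
        return acc;
    }

    public static void main(String[] args) {
        double[] a = new double[1200];
        double[] b = new double[1200];
        Arrays.fill(a, 0.5);
        Arrays.fill(b, 2.0);

        double sink = 0;
        // Warm-up: execute the hot method enough times that the JIT
        // compiles it before any timing starts.
        for (int i = 0; i < 10000; i++) {
            sink += dot(a, b);
        }

        // Measure many repetitions and report the average per call.
        int reps = 1000;
        long t0 = System.nanoTime();
        for (int i = 0; i < reps; i++) {
            sink += dot(a, b);
        }
        long t1 = System.nanoTime();

        System.out.printf("avg %.1f ns per call%n", (t1 - t0) / (double) reps);
        // Use the accumulated result so the JIT cannot eliminate the loop.
        if (sink < 0) System.out.println(sink);
    }
}
```

The `sink` accumulator matters: without consuming the result, a clever JIT may dead-code-eliminate the loop entirely, and you end up timing nothing.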