Breeze slower with netlib than without it, vector array operations are faster than matrix

Ivan Nikolaev

Aug 20, 2015, 11:58:30
to Scala Breeze
Hello everyone,

I am learning how to use Breeze and have stumbled upon surprising results.
I am doing some simple matrix/vector addition and multiplication operations,
as well as calculating exp and vector sum. When I load the netlib-java library
through Maven, I get significantly worse performance than without it.
Also, when I use an array of vectors instead of a matrix I get better performance,
which my MATLAB experience tells me should be the other way
around.

I am running all this on a MacBook Pro with a 2.3 GHz Intel Core i7 processor.

Here are the run results:

Without netlib:
Aug 20, 2015 2:58:04 PM com.github.fommil.netlib.BLAS <clinit>
WARNING: Failed to load implementation from: com.github.fommil.netlib.NativeSystemBLAS
Aug 20, 2015 2:58:04 PM com.github.fommil.netlib.BLAS <clinit>
WARNING: Failed to load implementation from: com.github.fommil.netlib.NativeRefBLAS
Vector version took 26.927885357 seconds
Matrix version took 87.750030486 seconds

With netlib:
Aug 20, 2015 3:02:04 PM com.github.fommil.jni.JniLoader liberalLoad
INFO: successfully loaded /var/folders/w6/gnnw541s7v548y5yqd9sqfvc0000gn/T/jniloader8864377321849022943netlib-native_system-osx-x86_64.jnilib
Vector version runs: 148.106227193 seconds
Matrix version runs: 240.031840978 seconds

For comparison, a MEX function in MATLAB that performs the same calculation:
Elapsed time is 15.725586 seconds.


Any help would be greatly appreciated. Attached is the Scala object that I
use to measure performance.

I am going to have to do a lot of optimisation in Breeze. If you have any tips for
useful resources, please share them.


Best regards,
Ivan

GaussMixtureTransform.scala

David Hall

Aug 20, 2015, 12:09:25
to scala-...@googlegroups.com
There's a moderation queue for new posters, and I'm pretty much the only one who checks it on a regular basis.

I recently (last weekend) did some profiling and noticed some pretty severe overhead for BLAS with small to medium sized vectors (and insufficient speedups at high dimensions). It was so bad I actually changed the code to stop using BLAS for level 1 (vector/vector) operations, except for really, really big dot products. Dot product on small vectors (n <= 6) in particular is now about two orders of magnitude faster.

A small micro-optimization: line 9 (-gamma * diff dot diff) creates a temporary because of Scala's precedence rules: it parses as ((-gamma * diff) dot diff), so the whole vector is scaled before the dot product. Better to do (-gamma * (diff dot diff)).
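
To make the precedence point concrete, here is a minimal sketch in plain Scala. The `Vec` type, the `ScalarOps` wrapper and the `allocations` counter are hypothetical stand-ins written for this illustration, not Breeze classes; they only show how the two expressions parse and how many temporary vectors each one creates.

```scala
object PrecedenceDemo {
  // Hypothetical counter of temporary vectors created by `scalar * vec`.
  var allocations = 0

  // Hypothetical minimal vector type; this is not Breeze's DenseVector.
  final class Vec(val data: Array[Double]) {
    def dot(that: Vec): Double = {
      require(data.length == that.data.length, "length mismatch")
      var s = 0.0
      var i = 0
      while (i < data.length) { s += data(i) * that.data(i); i += 1 }
      s
    }
  }

  // Enables `scalar * vec`; every call allocates a new scaled vector.
  implicit class ScalarOps(private val a: Double) {
    def *(v: Vec): Vec = {
      allocations += 1
      new Vec(v.data.map(_ * a))
    }
  }

  def main(args: Array[String]): Unit = {
    val gamma = 0.5
    val diff = new Vec(Array(1.0, 2.0, 3.0))

    // `dot` is an alphanumeric operator, so it binds more loosely than `*`:
    // this parses as (-gamma * diff) dot diff and builds a temporary Vec.
    val viaTemporary = -gamma * diff dot diff
    val temporariesFirst = allocations

    // Parenthesized: the dot product runs first, then a scalar multiply.
    val scalarOnly = -gamma * (diff dot diff)
    val temporariesSecond = allocations - temporariesFirst

    println(s"unparenthesized: $viaTemporary, temporaries: $temporariesFirst")
    println(s"parenthesized:   $scalarOnly, temporaries: $temporariesSecond")
  }
}
```

Both forms give the same number; the difference is only the extra vector allocated on every call, which adds up inside a hot loop.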

Would you mind my including your code in the repository for benchmarking purposes?

-- David

--
You received this message because you are subscribed to the Google Groups "Scala Breeze" group.
To unsubscribe from this group and stop receiving emails from it, send an email to scala-breeze...@googlegroups.com.
To post to this group, send email to scala-...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/scala-breeze/ff6a5fcd-393d-4c5a-a0ec-e808074d8b9e%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Ivan Nikolaev

Aug 20, 2015, 12:59:14
to Scala Breeze
OK, well, in a way I'm glad to hear it, because I just spent the last several hours trying to get netlib to work on Linux.
It's a real pain in the ***. Considering that I plan to run this in a Hadoop cluster, I'm glad it'll be faster without netlib :)

Thank you for the optimisation; it seemed to help by a couple of seconds. If you have any more suggestions, please tell.
Do you think I would gain a performance improvement by using Java with primitive double arrays?

Feel free to use and modify my code for benchmarking in your repository.

David Hall

Aug 20, 2015, 13:44:52
to scala-...@googlegroups.com
netlib is still faster for BLAS level 3 (M/M) operations, though I should benchmark that again since it's been a while and I have a much better pure-JVM matrix multiply than I used to.

As for switching to pure Java stuff, it depends on how much work you put into it. There's going to be a certain amount of overhead in using Breeze. It's pretty minimal in the places where I've carefully optimized it (assuming your vectors aren't too short), but that depends on HotSpot behaving, etc. I have had to be pretty careful because of the way Breeze uses views of vectors. In particular, operations that I've optimized for dense vectors are typically broken into 2 to 3 separate loops, making fast paths for common versions that the JVM can optimize more effectively (unit strides, 0 offsets, etc.). I haven't done that everywhere yet, which means there are probably a bunch of relatively easy wins if I go hunting for them.
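
The fast-path idea can be sketched roughly as follows. This is an illustration under assumptions, not Breeze's actual code: `VecView` and `addInto` are hypothetical names, and the view is modelled as a flat array plus an offset and a stride.

```scala
object StridedAddDemo {
  // Hypothetical view type: element i lives at data(offset + i * stride).
  final case class VecView(data: Array[Double], offset: Int, stride: Int, length: Int)

  // In-place a += b, with a fast path for the common contiguous case.
  def addInto(a: VecView, b: VecView): Unit = {
    require(a.length == b.length, "length mismatch")
    if (a.offset == 0 && b.offset == 0 && a.stride == 1 && b.stride == 1) {
      // Fast path: zero offsets and unit strides, so one index serves both arrays.
      val ad = a.data
      val bd = b.data
      var i = 0
      while (i < a.length) { ad(i) += bd(i); i += 1 }
    } else {
      // General path: each view keeps its own running index.
      var ai = a.offset
      var bi = b.offset
      var i = 0
      while (i < a.length) {
        a.data(ai) += b.data(bi)
        ai += a.stride
        bi += b.stride
        i += 1
      }
    }
  }
}
```

The fast path has simpler index arithmetic, which is the kind of loop a JIT compiler tends to handle well; the general loop stays as the fallback for sliced or strided views.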

One thing that Breeze currently doesn't do a good job of is operation fusion. For example, A += B * C is currently two loops (one for the matrix multiply, one for the addition), when BLAS has support for doing that in one go. I'm pretty sure MATLAB does that kind of optimization. That's of course easy to do if you roll your own loops. I've started to think about how to do it in Scala, but I haven't gotten that far on it yet.
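
For readers rolling their own loops, a rough sketch of the difference between the unfused and fused forms is below. It uses plain arrays holding square column-major matrices; `unfused` and `fused` are hypothetical helper names, and neither is how Breeze or BLAS implements it. (The BLAS routine in question is gemm, which computes C := alpha * A * B + beta * C in a single call.)

```scala
object FusionDemo {
  // Column-major n x n matrices in flat arrays: element (i, j) is at i + j * n.

  // Unfused: materialize A * B in a temporary, then add it to C in a second pass.
  def unfused(c: Array[Double], a: Array[Double], b: Array[Double], n: Int): Unit = {
    val tmp = new Array[Double](n * n)
    for (j <- 0 until n; k <- 0 until n; i <- 0 until n)
      tmp(i + j * n) += a(i + k * n) * b(k + j * n)
    for (idx <- 0 until n * n) c(idx) += tmp(idx)
  }

  // Fused: accumulate the products straight into C, with no temporary
  // and no second pass over the result.
  def fused(c: Array[Double], a: Array[Double], b: Array[Double], n: Int): Unit = {
    var j = 0
    while (j < n) {
      var k = 0
      while (k < n) {
        val bkj = b(k + j * n)
        var i = 0
        while (i < n) { c(i + j * n) += a(i + k * n) * bkj; i += 1 }
        k += 1
      }
      j += 1
    }
  }
}
```

The two versions can differ in the last bits of floating-point rounding, because the fused loop adds each product to C directly instead of summing the products first.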

-- David

Ivan Nikolaev

Aug 21, 2015, 05:01:25
to Scala Breeze
Okay, thank you. I think I'll stick with Breeze.

Just out of curiosity, is there any way to switch netlib on and off from within the program? So, let's say you do lots of additions and multiplications on small vectors without netlib, then put them all in a big matrix, turn on netlib and run some eigenvalue decomposition on it, for example?

David Hall

Aug 21, 2015, 13:53:21
to scala-...@googlegroups.com
Glad to hear it!

There's no way to turn it off, since a lot of implementations use it and don't have fallbacks. I'm adding intelligent checks as I optimize, but...

Anyway, I took a somewhat deeper look at your code yesterday. A big perf problem in your matrix implementation is that you're doing operations on row vectors rather than column vectors. Breeze matrices are column major by default, so it's important to do operations on columns at a time rather than rows at a time. (So, prefer matrix(::, *) to matrix(*, ::).) That doesn't fix the problem, but it turns a 3x slowdown into a 1.5x slowdown. The rest is a more complicated performance problem that I'll get to eventually, but it takes some major changes to the operators.
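
Why column-at-a-time matters can be seen from the memory layout alone. The sketch below uses a plain flat array rather than a Breeze matrix, and `columnSums` and `rowSums` are hypothetical helper names; it assumes only that element (i, j) of a column-major matrix is stored at index i + j * rows.

```scala
object ColumnMajorDemo {
  // Column-major storage: element (i, j) of a rows x cols matrix is at i + j * rows.

  // Column at a time: the inner loop walks adjacent memory (stride 1).
  def columnSums(data: Array[Double], rows: Int, cols: Int): Array[Double] = {
    val out = new Array[Double](cols)
    var j = 0
    while (j < cols) {
      var s = 0.0
      var i = 0
      while (i < rows) { s += data(i + j * rows); i += 1 }
      out(j) = s
      j += 1
    }
    out
  }

  // Row at a time: the inner loop jumps `rows` elements on every step,
  // which is much less cache-friendly for large matrices.
  def rowSums(data: Array[Double], rows: Int, cols: Int): Array[Double] = {
    val out = new Array[Double](rows)
    var i = 0
    while (i < rows) {
      var s = 0.0
      var j = 0
      while (j < cols) { s += data(i + j * rows); j += 1 }
      out(i) = s
      i += 1
    }
    out
  }
}
```

The column-wise loop corresponds to the matrix(::, *) pattern recommended above, and the row-wise loop to matrix(*, ::).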
