Native Libraries slower than Java code

Amit Chandak

Sep 17, 2014, 9:59:13 AM

to jblas...@googlegroups.com

Hi,
I am trying to use jblas with native libraries to get the best performance.
I wanted to verify the performance boost from the native libraries (NativeBlas.java).
On my machine, the native code (NativeBlas.java) is slower than the plain Java code (DoubleMatrix.java, which uses SimpleBlas.java; see lines 184-187).

To verify that everything is set up correctly, I ran the benchmark code; its output is attached.
I then benchmarked the two approaches on a simple matrix multiplication.

For the same array sizes, this is the performance on my machine (note: no other heavy processes were running).
My machine runs Red Hat 4.1.2-46, and I upgraded GCC to GCC 4.8.

No. of ops: 200000000, Time taken: 140182000 nanoseconds (matrix multiplication using DoubleMatrix's mmul function)
No. of ops: 200000000, Time taken: 569076000 nanoseconds (matrix multiplication using NativeBlas's dgemv function)

Please find the Java code (similar to ATLASDoubleMultiplicationBenchmark.java) attached, and share your input.

Thanks,
Amit

TestJblasMultiplication.java

JBlasBenchmarkOutput.txt

Mikio Braun

Sep 17, 2014, 10:50:27 AM

to jblas...@googlegroups.com

Hi Amit,

As explained here
https://github.com/mikiobraun/jblas/wiki/Java-Native-Code-Background
calling native code involves copying the data back and forth between
Java memory and native memory, so an operation like matrix-vector
multiplication cannot benefit from native code. That's also why
DoubleMatrix mmul uses Java code for that case.
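
A rough way to see why the copy dominates for matrix-vector products but not for matrix-matrix products is to compare the floating-point work with the number of doubles that have to cross the Java/native boundary. The sketch below is only a back-of-the-envelope cost model, not a measurement and not jblas code; the class and method names are made up for illustration, and it assumes that every operand, including the output, is copied once.

```java
// Back-of-the-envelope cost model (illustrative names, not part of jblas).
// Assumption: every operand is copied once across the Java/native boundary.
class CopyCostSketch {
    // y = A*x with an n-by-n matrix A: about 2*n^2 floating-point operations.
    static double gemvFlops(long n) {
        return 2.0 * n * n;
    }

    // Doubles moved for y = A*x: A (n^2 values), x (n values), y (n values).
    static double gemvDoublesMoved(long n) {
        return (double) n * n + 2.0 * n;
    }

    // C = A*B with n-by-n matrices: about 2*n^3 floating-point operations.
    static double gemmFlops(long n) {
        return 2.0 * n * n * n;
    }

    // Doubles moved for C = A*B: A, B, and C, each with n^2 values.
    static double gemmDoublesMoved(long n) {
        return 3.0 * n * n;
    }

    public static void main(String[] args) {
        for (long n : new long[] {10, 100, 1000}) {
            double gemvRatio = gemvFlops(n) / gemvDoublesMoved(n);
            double gemmRatio = gemmFlops(n) / gemmDoublesMoved(n);
            System.out.printf("n=%d  gemv flops per copied double: %.2f  gemm: %.2f%n",
                    n, gemvRatio, gemmRatio);
        }
    }
}
```

Under these assumptions the matrix-vector ratio stays below 2 no matter how large n gets, while the matrix-matrix ratio grows like 2n/3, so only the latter has enough work per copied element to amortize the transfer.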

If you want to see some impressive speedups, go for matrix-matrix
multiplication with dgemm.

-M

> --
> You received this message because you are subscribed to the Google Groups
> "jblas-users" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to jblas-users...@googlegroups.com.
> To post to this group, send email to jblas...@googlegroups.com.
> Visit this group at http://groups.google.com/group/jblas-users.
> For more options, visit https://groups.google.com/d/optout.

--
Mikio Braun - http://blog.mikiobraun.de, http://twitter.com/mikiobraun

Amit Chandak

Sep 18, 2014, 3:50:45 AM

to jblas...@googlegroups.com

Hi Mikio,
I think it has to do with the size of the matrices rather than the kind of operation. I experimented with different matrix sizes and compared three operations: DoubleMatrix.mmul, NativeBlas.dgemv, and NativeBlas.dgemm.
Please find the results and code attached.
I am not sure why the results differ from what you described.

Thanks,
Amit

TestJblasMultiplication.java

JBlasMatrixMulBenchmarks.txt

Mikio Braun

Sep 18, 2014, 8:38:59 AM

to jblas...@googlegroups.com

Hi Amit,

OK, it took a while to go through your results. When reporting times, it's
often better to divide by 1e9 to get seconds rather than nanoseconds (so many
digits to parse...).
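
For illustration, here is a minimal sketch of that conversion applied to the two timings from the first post. The class and method names are made up; only the operation count and the nanosecond values come from the thread.

```java
// Illustrative helper for reporting timings in seconds (names are made up).
class TimingReport {
    // Converts a duration in nanoseconds (e.g. a System.nanoTime() difference) to seconds.
    static double toSeconds(long nanos) {
        return nanos / 1e9;
    }

    // Throughput in operations per second for `ops` operations taking `nanos` nanoseconds.
    static double opsPerSecond(long ops, long nanos) {
        return ops / toSeconds(nanos);
    }

    public static void main(String[] args) {
        long ops = 200_000_000L;          // operation count reported in the first post
        long mmulNanos = 140_182_000L;    // DoubleMatrix mmul timing from the first post
        long dgemvNanos = 569_076_000L;   // NativeBlas dgemv timing from the first post

        System.out.printf("mmul:  %.3f s, %.3e ops/s%n",
                toSeconds(mmulNanos), opsPerSecond(ops, mmulNanos));
        System.out.printf("dgemv: %.3f s, %.3e ops/s%n",
                toSeconds(dgemvNanos), opsPerSecond(ops, dgemvNanos));
    }
}
```

Reported this way, the two runs come out at roughly 0.14 s and 0.57 s, which is much easier to compare at a glance.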

The numbers you report are correct, but the reason dgemv seems faster is
that it only does matrix-vector multiplication, even if you feed it
two matrices. There's no check for that on the native BLAS side. So dgemv
only does a fraction of the work of dgemm.
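
To make this concrete, here is a small pure-Java sketch with naive stand-ins for the two routines. These are not the real BLAS or jblas implementations (the real dgemv and dgemm also take scaling factors, leading dimensions, and strides); the class and method names are illustrative, and the sketch assumes column-major storage. Passing the backing array of a second matrix as the "vector" makes the matrix-vector routine read only its first column.

```java
import java.util.Arrays;

// Naive stand-ins for dgemv and dgemm (illustrative, column-major storage assumed).
class GemvVersusGemm {
    // y = A * x for an m-by-n matrix A. Reads exactly n entries of x and
    // ignores anything beyond them, so no error is raised if x is longer.
    static double[] gemv(int m, int n, double[] a, double[] x) {
        double[] y = new double[m];
        for (int j = 0; j < n; j++) {
            for (int i = 0; i < m; i++) {
                y[i] += a[i + j * m] * x[j];
            }
        }
        return y;
    }

    // C = A * B for an m-by-k matrix A and a k-by-n matrix B.
    static double[] gemm(int m, int n, int k, double[] a, double[] b) {
        double[] c = new double[m * n];
        for (int j = 0; j < n; j++) {
            for (int p = 0; p < k; p++) {
                for (int i = 0; i < m; i++) {
                    c[i + j * m] += a[i + p * m] * b[p + j * k];
                }
            }
        }
        return c;
    }

    public static void main(String[] args) {
        double[] a = {1, 3, 2, 4}; // column-major storage of [[1, 2], [3, 4]]
        double[] b = {5, 7, 6, 8}; // column-major storage of [[5, 6], [7, 8]]

        // Feeding the matrix b to gemv uses only its first column (5, 7).
        System.out.println(Arrays.toString(gemv(2, 2, a, b)));    // [19.0, 43.0]
        // The full product has two columns: (19, 43) and (22, 50).
        System.out.println(Arrays.toString(gemm(2, 2, 2, a, b))); // [19.0, 43.0, 22.0, 50.0]
    }
}
```

For n-by-n inputs the matrix-vector call therefore does about 1/n of the arithmetic of the full product, which is why its timing cannot be compared directly with dgemm or mmul.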

If you take that into account, you get the following picture:

Cases 1, 2, and 4: matrix-matrix multiplication

mmul uses dgemm internally, so the two are equally fast; dgemv does not
compute a matrix-matrix product.

Case 3: matrix-vector multiplication

mmul recognizes that this is actually a matrix-vector multiplication and
switches to Java code, which is faster than both dgemv and dgemm because
those copy the data to native memory and back.

I hope this clears it up. The main mistake you made in your thinking
was that you thought dgemv would also do matrix-matrix multiplication.

Best,

-M

Debasish Das

Dec 13, 2014, 2:31:15 PM

to jblas...@googlegroups.com

Hi Mikio,

Is it possible to ask the JVM not to copy the data back and forth between Java memory and native memory? Does it copy only the result, or are the matrix and vector also copied from the JVM to native memory every time we call dgemv?

Thanks.

Deb

Mikio Braun

Dec 15, 2014, 9:16:51 AM

to jblas...@googlegroups.com

Hi Deb,

No, unfortunately that's not possible unless you take care of it yourself. The JNI does not provide mechanisms for this.

I've played around with handling the copying myself, using, for example, a DirectBuffer, but so far that has been slower than the JNI version.
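
For readers unfamiliar with direct buffers, the Java side of such an experiment might look roughly like the sketch below. This is not the code from those experiments and not part of jblas; the class and method names are made up. It only shows allocating off-heap memory with java.nio and copying a double[] in and out explicitly. A native routine would still have to obtain the buffer's address on the C side (JNI offers GetDirectBufferAddress for that), which is not shown here.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.DoubleBuffer;

// Illustrative sketch of manual copying through a direct buffer (not jblas code).
class DirectBufferSketch {
    // Allocates off-heap storage for n doubles in the platform's native byte order.
    static DoubleBuffer allocate(int n) {
        return ByteBuffer.allocateDirect(n * Double.BYTES)
                .order(ByteOrder.nativeOrder())
                .asDoubleBuffer();
    }

    // Explicit copy from a Java array into the direct buffer, starting at index 0.
    static void copyIn(double[] src, DoubleBuffer dst) {
        dst.clear();   // resets position and limit; does not erase the contents
        dst.put(src);
    }

    // Explicit copy of the first n doubles from the direct buffer into a new Java array.
    static double[] copyOut(DoubleBuffer src, int n) {
        double[] out = new double[n];
        src.clear();
        src.get(out);
        return out;
    }

    public static void main(String[] args) {
        double[] data = {1.0, 2.0, 3.0, 4.0};
        DoubleBuffer buffer = allocate(data.length);
        copyIn(data, buffer);
        double[] back = copyOut(buffer, data.length);
        System.out.println("direct: " + buffer.isDirect() + ", first value: " + back[0]);
    }
}
```

The copies are still there in this approach; they are just done by hand, once per buffer, instead of on every native call.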

Ultimately, one probably needs a higher-level optimizer that looks at a whole expression before it is evaluated and tries to push as much computation as possible to the native side, as you would probably do with GPU-side computation.

Hope this helps,

Mikio
