A native extension for Vectorz?


Matt Revelle

Jan 7, 2015, 5:38:49 PM
to numerica...@googlegroups.com
I'm not getting acceptable performance from existing sparse matrix libraries in Java and am starting to think the only way to improve this is with a native library. There are plenty of native matrix libraries with support for sparse matrices and a few have Java bindings. However, they require copying data between the JVM and the native lib. 

I'd like to propose something different: a native extension for Vectorz which can perform costly operations in native code, but without copying data back and forth between the JVM and the native lib. This would be implemented by subclassing matrix classes and overriding selected methods to call native functions. The native functions would be implemented with JNI and use GetDoubleArrayElements and/or GetPrimitiveArrayCritical to access double arrays, ideally without the overhead of copying the elements. The JNI docs (http://docs.oracle.com/javase/8/docs/technotes/guides/jni/spec/functions.html#array_operations) state that whether an array is copied depends on the particular JVM, so more work is necessary to determine whether copy-free access is possible on Oracle's JVM (HotSpot).
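
For illustration, the Java side of such an extension might look something like the sketch below (class, method, and library names here are hypothetical, not existing Vectorz API):

    // Hypothetical native kernel entry points. On the native (C) side, the
    // implementation would pin the arrays rather than copy them, e.g.:
    //   jdouble *y = (*env)->GetPrimitiveArrayCritical(env, jy, &isCopy);
    //   ... compute ...
    //   (*env)->ReleasePrimitiveArrayCritical(env, jy, y, 0);
    public class NativeOps {
        static {
            System.loadLibrary("vectorz_native"); // hypothetical library name
        }

        // y += alpha * x, operating directly on the JVM's double[] storage
        // whenever the JVM supports pinning (i.e. isCopy comes back false).
        public static native void addMultiple(double[] y, double[] x, double alpha);
    }

A matrix subclass would then override the corresponding method to delegate to NativeOps instead of the pure-Java loop.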

I'm going to do another profiling session with SparseRowMatrix and SparseColumnMatrix from Vectorz in case I missed bottlenecks in my program code or in Vectorz. In the meantime, before this work begins, any feedback is appreciated.

-Matt

P.S. Mike, apologies if this is antithetical to Vectorz. =)

Mike Anderson

Jan 7, 2015, 8:10:01 PM
to numerica...@googlegroups.com
On Thursday, 8 January 2015 06:38:49 UTC+8, Matt Revelle wrote:
I'm not getting acceptable performance from existing sparse matrix libraries in Java and am starting to think the only way to improve this is with a native library. There are plenty of native matrix libraries with support for sparse matrices and a few have Java bindings. However, they require copying data between the JVM and the native lib. 

May well be the case - though if you have bottlenecks in vectorz-clj I'm happy to take a look and fix them. Not all the code paths are optimised yet, especially around sparse matrices; I've had 10x speedups pretty much every time I've optimised a slow code path that I was hitting regularly.

In particular, sparse code efficiency often depends on being able to scan two indexes simultaneously and take action based only on the corresponding non-zero values. This is the only way to get maximally efficient O(N * density) operations on two vectors. I've done this for a few major operations, but certainly not everything.
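
To make that concrete, here is a minimal sketch of such a simultaneous scan for a sparse dot product, assuming the common layout of a sorted index array plus a parallel value array (illustrative names, not the exact Vectorz API):

    public class SparseOps {
        // Dot product of two sparse vectors, each given as a sorted index
        // array with a parallel value array. A single simultaneous scan
        // touches only the stored elements: O(nnzA + nnzB).
        static double sparseDot(int[] idxA, double[] valA, int[] idxB, double[] valB) {
            double sum = 0.0;
            int i = 0, j = 0;
            while (i < idxA.length && j < idxB.length) {
                if (idxA[i] == idxB[j]) {
                    sum += valA[i++] * valB[j++]; // both vectors non-zero here
                } else if (idxA[i] < idxB[j]) {
                    i++; // index present only in A: contributes zero
                } else {
                    j++; // index present only in B: contributes zero
                }
            }
            return sum;
        }
    }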
 

I'd like to propose something different: a native extension for Vectorz which can perform costly operations in native code, but without copying data back and forth between the JVM and the native lib. This would be implemented by subclassing matrix classes and overriding selected methods to call native functions. The native functions would be implemented with JNI and use GetDoubleArrayElements and/or GetPrimitiveArrayCritical to access double arrays, ideally without the overhead of copying the elements. The JNI docs (http://docs.oracle.com/javase/8/docs/technotes/guides/jni/spec/functions.html#array_operations) state that whether an array is copied depends on the particular JVM, so more work is necessary to determine whether copy-free access is possible on Oracle's JVM (HotSpot).

Sounds like a plausible design. You should be able to implement it by subclassing the right Vectorz abstract classes relatively easily.

The only really fiddly bit will probably be ensuring that the in-memory representation exactly matches what the native library requires. If you have to do a memory allocation + copy, you will probably lose any advantage that native code gives you.
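
As a concrete example of what "matching the representation" means: most native sparse libraries exchange data in CSR (compressed sparse row) form, which is just three plain arrays. A sketch (not any particular library's exact layout):

    // CSR layout, a de-facto interchange format for native sparse libraries.
    // If the JVM-side class stores exactly these three arrays, a native
    // routine can read them in place; any other layout forces allocate + copy.
    public class CsrMatrix {
        final int rows, cols;
        final int[] rowPtr;    // length rows + 1; row r spans rowPtr[r]..rowPtr[r+1]
        final int[] colIdx;    // length nnz; column index of each stored value
        final double[] values; // length nnz; the stored non-zero values

        CsrMatrix(int rows, int cols, int[] rowPtr, int[] colIdx, double[] values) {
            this.rows = rows; this.cols = cols;
            this.rowPtr = rowPtr; this.colIdx = colIdx; this.values = values;
        }
    }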
 

I'm going to do another profiling session with SparseRowMatrix and SparseColumnMatrix from Vectorz in case I missed bottlenecks in my program code or in Vectorz. In the meantime, before this work begins, any feedback is appreciated.

-Matt

P.S. Mike, apologies if this is antithetical to Vectorz. =)

Not at all - I have considered doing something similar myself, although I was planning to do it for GPU rather than regular native code (possibly using JCublas). I don't *personally* think the difference between Vectorz and native code (maybe 2x?) is significant for the kind of work that I do, but ~10-100x improvements from utilising the GPU is potentially interesting.

In both cases, the right thing would be to make a new library with both Vectorz and whatever native code is required as a dependency. The new classes should then still be fully compatible with vectorz-clj (as long as the relevant AMatrix / AVector interfaces are fully implemented)

Matt Revelle

Jan 8, 2015, 1:02:43 AM
to numerica...@googlegroups.com


On Wednesday, January 7, 2015 8:10:01 PM UTC-5, Mike Anderson wrote:
On Thursday, 8 January 2015 06:38:49 UTC+8, Matt Revelle wrote:
<snip> 

I'm going to do another profiling session with SparseRowMatrix and SparseColumnMatrix from Vectorz in case I missed bottlenecks in my program code or in Vectorz. In the meantime, before this work begins, any feedback is appreciated.

-Matt

P.S. Mike, apologies if this is antithetical to Vectorz. =)

Not at all - I have considered doing something similar myself, although I was planning to do it for GPU rather than regular native code (possibly using JCublas). I don't *personally* think the difference between Vectorz and native code (maybe 2x?) is significant for the kind of work that I do, but ~10-100x improvements from utilising the GPU is potentially interesting.

In both cases, the right thing would be to make a new library with both Vectorz and whatever native code is required as a dependency. The new classes should then still be fully compatible with vectorz-clj (as long as the relevant AMatrix / AVector interfaces are fully implemented)

I like the idea of using GPUs. Not sure if sparse matrices would benefit, but I don't know much about programming GPUs. Will keep that in mind.

Mike Anderson

Jan 8, 2015, 4:30:44 AM
to numerica...@googlegroups.com
I've not looked at sparse GPU implementations, but it certainly seems possible: see e.g.

Mike Anderson

Jan 13, 2015, 7:33:23 PM
to numerica...@googlegroups.com
FWIW, the back-end code for Neanderthal looks pretty suitable as a way of interfacing with BLAS/ATLAS. I see no reason why this couldn't be plugged into a Vectorz INDArray or AMatrix wrapper.




Dragan Djuric

Jan 14, 2015, 4:41:32 AM
to numerica...@googlegroups.com
It is not suitable, because it uses Buffers, and I suppose all the other implementations use arrays.
Buffers are used for two reasons:
- It is not guaranteed that arrays occupy a contiguous block of memory, and the JVM may move them from time to time, so GetPrimitiveArrayCritical is not as suitable as working with buffers
- Buffers allow much more efficient subvectors, submatrices and other operations in Java without copying (see the sketch below)

So, if you want to use neanderthal-atlas for Vectorz, you have two choices:
- Copy the arrays into buffers each time you call a native method, to avoid JNI copying the arrays into native memory, which is pointless since you are still copying.
- Implement Vectorz through Buffers, thus reinventing Neanderthal...
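
A minimal sketch of the zero-copy subvector point, using only standard java.nio (nothing Neanderthal-specific):

    import java.nio.ByteBuffer;
    import java.nio.ByteOrder;
    import java.nio.DoubleBuffer;

    public class SubvectorDemo {
        public static void main(String[] args) {
            // A direct buffer lives outside the GC heap, so its address is
            // stable and native code can find it via GetDirectBufferAddress.
            DoubleBuffer v = ByteBuffer.allocateDirect(8 * 100)
                    .order(ByteOrder.nativeOrder())
                    .asDoubleBuffer();

            // A subvector over elements [10, 60) as a view - no copying.
            DoubleBuffer sub = v.duplicate();
            sub.position(10);
            sub.limit(60);
            DoubleBuffer view = sub.slice();

            view.put(0, 42.0);             // writes through to v[10]
            System.out.println(v.get(10)); // prints 42.0
        }
    }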

Matt Revelle

Jan 14, 2015, 3:34:50 PM
to numerica...@googlegroups.com
Thanks for the info. I will take a look at adding sparse matrix support to Neanderthal. It makes more sense to add core.matrix support to Neanderthal than to add Buffer-backed classes to Vectorz. Though I'm still interested in a GPU extension for Vectorz, since copying is necessary there anyway.

Dragan, do you have any ideas for using Buffers to represent sparse matrices/vectors?

Mike Anderson

Jan 14, 2015, 7:40:53 PM
to numerica...@googlegroups.com
Hi Dragan,

Vectorz is not limited to arrays - in fact many of the existing implementations don't use arrays. Vectorz is ultimately defined by abstractions (e.g. INDArray, AMatrix) and can use any underlying storage (including NIO buffers).

In fact, it only took me 10 minutes to create a buffer-backed Vectorz matrix class as a proof-of-concept.
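
Something along these lines (a standalone sketch; a real version would extend Vectorz's AMatrix and override whatever abstract methods the current version requires):

    import java.nio.ByteBuffer;
    import java.nio.ByteOrder;
    import java.nio.DoubleBuffer;

    // Dense row-major matrix stored in a direct buffer. Plugging this into
    // Vectorz would mainly mean routing get/set/rowCount/columnCount (and
    // friends) through this storage.
    public class BufferMatrix {
        private final int rows, cols;
        private final DoubleBuffer data;

        public BufferMatrix(int rows, int cols) {
            this.rows = rows;
            this.cols = cols;
            this.data = ByteBuffer.allocateDirect(8 * rows * cols)
                    .order(ByteOrder.nativeOrder())
                    .asDoubleBuffer();
        }

        public int rowCount() { return rows; }
        public int columnCount() { return cols; }
        public double get(int i, int j) { return data.get(i * cols + j); }
        public void set(int i, int j, double value) { data.put(i * cols + j, value); }
    }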


Adding native BLAS operations using neanderthal-atlas should be relatively trivial.

Advantages of going this route:
- You get full core.matrix compatibility (including optimised protocols) essentially for free via vectorz-clj
- You get higher dimensional (N-dimensional array) support for free via SliceArray and other vectorz implementations
- Easy integration with JVM Vectorz (which is very fast for many other operations, e.g. O(1) slicing, transposes, views etc.)

Because of all this, I genuinely think that a Vectorz extension using ATLAS/BLAS (possibly via neanderthal-atlas bindings) is the most practical way to get powerful native numerical support in idiomatic Clojure. Happy to discuss of course if anyone has a better idea!

Mike Anderson

Jan 14, 2015, 8:14:09 PM
to numerica...@googlegroups.com
Hi Matt - see my other post but I think Buffer-backed classes in Vectorz are trivial to support.

I'd encourage trying a combination of a Vectorz BufferMatrix extension + neanderthal-atlas for the native code so that all the rest of core.matrix comes for free.

Matt Revelle

Jan 15, 2015, 4:33:15 PM
to numerica...@googlegroups.com
OK, I took a look at neanderthal-atlas and it definitely would be straightforward to use for native, dense matrices.

I'm interested in how Dragan plans to support sparse matrices in Neanderthal, since ATLAS (BLAS and LAPACK) doesn't support them.


Piotr Moczurad

Mar 17, 2015, 7:29:17 AM
to numerica...@googlegroups.com
Hello,

I hope you don't mind me digging up an old thread. The native extensions for Vectorz are one of this year's GSoC projects, so I guess it's still on.

The GPU implementation is certainly feasible; I once did a similar thing for Scala. NVIDIA has the cuSPARSE library to handle sparse matrices and it's really cool (and there's a Java port: http://www.jcuda.org/jcuda/jcusparse/JCusparse.html). The issue of copying here is a little more subtle than with pure-CPU native code: actually, it's not that bad to do even a lot of copying between host and device if that means keeping both units busy (for real speedups you need a hybrid CPU-GPU implementation, like the ones from V. Volkov).
 
However, keep in mind that GPU programming comes with its own set of problems and the increase in code complexity is inevitable. And the speedup may not be so obvious. That's perhaps why it seems to be losing momentum lately, in favour of CPU-based solutions.

That said, if you think it's a good idea we can give it a try: the libraries are there and programming the GPU is fun.

Mike Anderson

Mar 18, 2015, 3:58:36 AM
to numerica...@googlegroups.com
Yes, it is certainly timely to resurrect this thread.

Worth noting that a nice thing with core.matrix and vectorz-clj is that you can effectively mix and match implementations. So I imagine a quite common workflow could be:
- Do everything at first in pure core.matrix / vectorz-clj on the JVM
- Only if the need for extra performance arises then use the GPU/native variant for the specific ops that need it (big matrix multiplication and factorisation, most likely) 

Piotr Moczurad

Mar 18, 2015, 5:05:46 PM
to numerica...@googlegroups.com
Ok, that sounds good. Do you think that's in the scope of the GSOC project? (I'm actually happy doing both CPU and GPU bindings, so I'm fine with whatever you think is more important)

Mike Anderson

Mar 18, 2015, 9:35:21 PM
to numerica...@googlegroups.com
Ultimately it is up to you as the applicant to propose a scope / plan for GSoC.

I think both native and GPU are important, so take your pick. If you think that it is feasible to do both (and by "do" I mean robustly including testing, documentation, examples etc.) then definitely go for it! I'd also think it would be fine to target one first and do the other later if there is time.

From an implementation perspective, a lot of the common code could probably be shared, but there would presumably be different dependencies. As a result you would probably end up working on multiple repos:
- core.matrix (may be some minor changes needed)
- Vectorz (may be a few extensions to base classes, extra methods etc.)
- vectorz-clj 
- vectorz-gpu
- vectorz-native

Not a big problem, but just something to be aware of as you plan. Need to think about what changes need to go upstream first etc. 

Piotr Moczurad

Mar 19, 2015, 5:18:03 PM
to numerica...@googlegroups.com
That's right, I think I would start with vectorz-native first. It seems more general and may target a broader range of users.

Matt Revelle

Mar 19, 2015, 5:52:39 PM
to numerica...@googlegroups.com
A vectorz-native would be great. It’d be nice to have native sparse matrix support. Sparse matrices seem to get neglected by most libraries in all languages and that’s an opportunity for Clojure (and I suppose Java) to shine. Vectorz has decent sparse matrix performance, but I expect a native implementation would outperform pure-JVM. I started to look for native libraries (C, C++, etc.) with support for sparse matrices and sparse-by-sparse operations, but the best I found was Eigen (http://eigen.tuxfamily.org).

My plan was to mimic Neanderthal's (http://neanderthal.uncomplicate.org) use of direct buffers to store the underlying data structures for subclasses of Eigen's various matrix types. This way there is little copying of data, and matrices can be accessed from both Clojure/Java and C++ code (via JNI and Eigen).
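
Roughly, the Java side of that plan could look like the sketch below (hypothetical names; the native side would wrap the pointer obtained from GetDirectBufferAddress in an Eigen::Map, so no copy is needed):

    import java.nio.ByteBuffer;
    import java.nio.ByteOrder;

    public class EigenBackedMatrix {
        static {
            System.loadLibrary("vectorz_eigen"); // hypothetical library name
        }

        final ByteBuffer data; // direct buffer, shared with the C++ side
        final int rows, cols;

        EigenBackedMatrix(int rows, int cols) {
            this.rows = rows;
            this.cols = cols;
            this.data = ByteBuffer.allocateDirect(8 * rows * cols)
                    .order(ByteOrder.nativeOrder());
        }

        // Native side (C++), sketched:
        //   double* p = (double*) env->GetDirectBufferAddress(buf);
        //   Eigen::Map<Eigen::MatrixXd> m(p, rows, cols);
        static native void mmulInPlace(ByteBuffer a, int ar, int ac,
                                       ByteBuffer b, int br, int bc,
                                       ByteBuffer out);
    }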

Neanderthal uses ATLAS, which appears not to include routines for sparse-by-sparse operations. It'd be worth searching for other libraries that support sparse-by-sparse operations in case there's a better option than Eigen. Also, it could be that native sparse matrix support is implemented with Eigen while a better library is used for dense-by-dense and dense-by-sparse operations.

-Matt
