> What if I wanted to use Clojure for scientific computing, and in
> particular for doing linear algebra and matrix computations a la
> MATLAB?
>
> What would my options be for representing matrices, not to mention,
> especially, MULTIDIMENSIONAL ARRAYS?
>
> Would java arrays be the way to go, or nested vectors?
There are a couple of Java libraries that you can use. Unfortunately the
efforts to make everyone agree on a single library a few years ago
have failed, so now you have to choose between a couple of incompatible
libraries.
For matrix computations and linear algebra, your best choice is probably
the Colt library developed at CERN, or the somewhat parallelized
version called Parallel Colt. Colt has arrays up to three dimensions for
a couple of data types, enough for matrix stuff but not much more.
If you need higher-dimensional arrays, e.g. to store complex data sets,
your best bet would be the netCDF implementation for Java, which
contains a very complete array library that handles any number of
dimensions. However, it provides no linear algebra at all.
Another library in this category is JAMA. It looks rather similar to
Colt, but I never looked at it more closely.
There's also an array package developed by IBM in conjunction with an
optimizing Java compiler to go with it, but apparently it is no longer
maintained.
> at the moment. For instance, once you have a multidimensional array
> (whatever kind it is) you can't, I believe, access any property of
> that object at runtime to know its rank and dimensions, like you can
> do with MATLAB or Scientific Python (NumPy), or other similar
> languages. And slicing, such as a[2:20:4, 1:9:3], which means every
> element in a with indices that go from start to finish with a certain
> step, seems to be problematic.
All the libraries listed above provide most of this functionality. Like
NumPy, they represent an array as a 1D Java array containing the data
elements plus bookkeeping information about the shape and strides.
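To make that concrete, here is a small sketch in Clojure of the flat-storage scheme. The names `strides` and `flat-index` are illustrative only, not taken from any of these libraries:

```clojure
;; Sketch of the NumPy-style layout: a flat data vector plus shape
;; and row-major strides. All names here are made up for the example.
(defn strides
  "Row-major strides for a given shape, e.g. [2 3] -> [3 1]."
  [shape]
  (vec (rest (reverse (reductions * 1 (reverse shape))))))

(defn flat-index
  "Position of a multi-index in the flat data vector."
  [shape indices]
  (reduce + (map * (strides shape) indices)))

(def a {:data [1 2 3 4 5 6] :shape [2 3]})

;; element [1 2] of the 2x3 array:
(nth (:data a) (flat-index (:shape a) [1 2])) ; => 6
```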
> Actually the whole business of having to deal with immutable (and
> lazy) data structures appears to make computing with matrices and
> multidimensional arrays a lot more difficult. Maybe this is just an
> impression. Opinions?
The array packages listed above all permit modification of the array
elements. Fully immutable arrays are probably of little practical
interest, but something similar to Clojure's transients (data types that
can be initialized once by the function that creates them and then
become immutable) would probably be sufficient to change this. There is
little practical experience with this at the moment.
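As a sketch, the "initialize, then freeze" pattern already works today with Clojure's built-in transients; `init-vector` is a hypothetical helper name:

```clojure
;; Build a vector mutably inside the creating function, then freeze
;; it with persistent!. This is the standard transient idiom.
(defn init-vector
  "Vector of n elements computed by f, filled via a transient."
  [n f]
  (persistent!
   (loop [v (transient []) i 0]
     (if (< i n)
       (recur (conj! v (f i)) (inc i))
       v))))

(init-vector 5 #(* % %)) ; => [0 1 4 9 16]
```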
Konrad.
> For matrix computations and linear algebra, your best choice is probably
> the Colt library developed at CERN, or the somewhat parallelized
> version called Parallel Colt. Colt has arrays up to three dimensions for
> a couple of data types, enough for matrix stuff but not much more.
>
> If you need higher-dimensional arrays, e.g. to store complex data sets,
> your best bet would be the netCDF implementation for Java, which
> contains a very complete array library that handles any number of
> dimensions. However, it provides no linear algebra at all.
I should have added that it is possible to convert between netCDF and
Colt arrays without a costly element-by-element copying procedure. For
example, the following code inverts a matrix read from a netCDF file
using Colt:
(ns netcdf-read
  (:import (ucar.nc2 NetcdfFile)
           (cern.colt.matrix.tdouble.impl DenseDoubleMatrix2D)
           (cern.colt.matrix.tdouble.algo DenseDoubleAlgebra)))

(defn unidata-to-colt
  [#^ucar.ma2.ArrayDouble$D2 array]
  (let [java-array-1d (.get1DJavaArray array Double)
        [rows cols]   (.getShape array)]
    ;; wrap the flat Java array without copying: offsets 0 0,
    ;; row stride cols, column stride 1
    (new DenseDoubleMatrix2D rows cols java-array-1d 0 0 cols 1 false)))

(let [matrix (with-open [ncfile (NetcdfFile/open "/Users/hinsen/Temp/matrix.nc")]
               (.read (.findVariable ncfile "matrix")))
      colt-matrix (unidata-to-colt matrix)
      la (new DenseDoubleAlgebra)
      inverse (.inverse la colt-matrix)]
  (prn inverse))
Konrad.
> Just one more thing. It's still not really clear to me if I am better
> off using Java arrays (make-array ...) or clojure vectors especially
> when dealing with multidimensional arrays. I know that if I use Java
> libraries such as Colt, I have no choice. But in general? What do you
> think?
Using Clojure's built-in data structures is almost always the best
choice until you have a good reason to use something else. In the case
of arrays, that good reason is likely to be
1) memory usage, once your data sets become big,
2) performance, either because of the size of the data sets or because
some algorithm needs very frequent access, or
3) interoperability with Java libraries such as Colt.
As long as you work with small enough datasets, stick to Clojure
vectors.
Konrad.
> What if we created a structmap (a sort of class), where we have a
> flattened out one-dimensional clojure vector containing all the data
> of our potentially multidimensional (rank n) array, together with its
> dimensions and possibly other info? I believe the C scientific
That's certainly possible. In fact, this is very much how Colt and
netCDF represent multidimensional arrays, except that they use a Java
array as the 1D element storage space.
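A minimal sketch of that representation (all names here are hypothetical, invented for the example):

```clojure
;; The proposed structmap: a flat data vector plus its dimensions.
(defstruct ndarray :data :shape)

(defn make-ndarray
  "Checks once, at creation, that data and shape are consistent."
  [shape data]
  (assert (= (count data) (apply * shape)))
  (struct ndarray (vec data) (vec shape)))

(def m (make-ndarray [2 3] [1 2 3 4 5 6]))

;; rank and dimensions are now trivially available at runtime:
(:shape m)         ; => [2 3]
(count (:shape m)) ; => 2
```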
The real question is what you gain from such an implementation.
Compared to
1) nested Clojure vectors
2) Colt or netCDF arrays
what is the practical advantage of your proposed new data structure?
One answer could be "the best of those two worlds", but do you have a
concrete application in mind where this would be a real advantage?
> I think such an approach could be more flexible. You can have the
> vector as a data pool, and the rest would be sort of like a view on
> that data (which can be changed at runtime if one chooses to). For
That's how nested Clojure vectors behave as well. With Colt and netCDF
arrays, creating such views is simple as well (but they are not
immutable).
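For nested vectors, such a view can be sketched with subvec, which shares structure with the original instead of copying elements:

```clojure
;; A 2x2 "window" into a 3x3 nested vector: rows 0-1, columns 1-2.
;; subvec shares structure with the original, so no elements are
;; copied; this is only an illustration of the idea.
(def a [[1 2 3]
        [4 5 6]
        [7 8 9]])

(def b (mapv #(subvec % 1 3) (subvec a 0 2)))

b ; => [[2 3] [5 6]]
```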
Konrad.
> these things. Why? Because they're just that: nested vectors. They're
> not truly multidimensional vectors, and the more I think about them,
> the more they really suck from that point of view. For instance, first
> of all they're not that safe to use (for these purposes): you could
> potentially have a RAGGED nested vector, which you might end up
> passing to a function that's expecting a true multidimensional array!
That's indeed one of the arguments against the use of multidimensional
Java arrays (which are nested arrays) for this purpose.
> What is one to do? Implement checks all over the place just to make
> sure that it's ok? Don't like that.
"All over the place" would be a call to a single test function. But
this need is not specific to a nested vector representation.
Basically, there are two approaches to dealing with Clojure data
structures representing multidimensional arrays:
1) Make them an abstract data structure. Client programs are not
supposed to know the internal representation, they create arrays
exclusively by calling factory functions provided for this purpose.
Functions that expect array arguments can then assume that the data
structure is consistent and needn't do any checks. The tightness of
the abstraction can be enforced to various degrees, but that's a
different issue.
2) Expose the internal representation to client code. Since any
representation in terms of standard Clojure data structures could be
invalid, all array functions need to do some check on their arguments.
In fact, the second approach would be the best argument for the use of
nested vectors, as it is a rather simple and intuitive representation
from the user's perspective. With the first approach, the internal
representation would be chosen exclusively for the convenience of the
implementation.
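The first approach might look like this sketch, where the factory does the checking once and the accessors trust it (all names hypothetical):

```clojure
;; Approach 1: construction only through a factory that validates.
(defn make-array-2d
  "The only sanctioned constructor; rejects ragged input here so
  that no later function has to check."
  [rows]
  (assert (apply = (map count rows)) "rows must have equal length")
  {:tag ::array-2d :rows (mapv vec rows)})

(defn array-2d-get
  "Relies on the invariant established by the factory; no checks."
  [a i j]
  (get-in (:rows a) [i j]))

(def a (make-array-2d [[1 2] [3 4]]))
(array-2d-get a 1 0) ; => 3
```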
> But it gets worse. I imagine, when
> dealing with nested vectors, that there's no guarantee that the data
> will be contiguous as when you're dealing with a linear vector. So, as
> far as efficiency is concerned, maybe they're not that good, but I'm
> not sure about that (don't know the implementation details.)
I don't think there is a difference. A Clojure vector is never a
contiguous block of storage.
Konrad.
> Your analysis is crystal clear and very helpful Konrad. But you
> haven't addressed the issue of dealing with useful information
> regarding the data structure itself. What if, for example, a function
> wanted to know the rank and dimensions of a multidimensional array it
> was being passed, and that array were represented by means of a nested
> vector?
I didn't address such issues because there is no point in discussing
them before making the fundamental decision whether to use an abstract
or an exposed data type for array data. That choice dictates the
priorities for everything that follows.
> Suppose we're dealing with rank n objects. Do you think it
> would be an easy task to figure all that out dealing with nested
> vectors?
If you can assume the array is well-formed, it is rather easy.
Otherwise it isn't.
> And by the way, How would you go about implementing in detail
> a check to see if a nested vector is actually an authentic
> multidimensional array or not?
That's a rather simple recursive function. The only problem is its
run-time cost.
> I honestly prefer your first case scenario. Seems much more efficient,
> less resource-consuming, and just straightforward. But I really would
> like to know what your preference is. If you had to choose, which way
> would you go?
If I were to design an array interface for Clojure, it would consist
of multimethods implemented for both an efficient array data structure
for internal use and for nested vectors. The implementation for the
latter would convert the nested vectors to the efficient structure and
do all the necessary checks. It would be there for convenience and
clarity in user code. Ideally I would then implement the same
interface for Colt arrays and netCDF arrays as well, both for having
an efficient structure for large data sets and for interoperability.
And once we are at it, why not have another implementation for sparse
arrays, using a suitable data structure?
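A minimal sketch of what such a multimethod interface could look like, dispatching on the representation (the tags and the flat-map representation are invented for the example):

```clojure
;; One generic operation, several representations. Dispatch on a
;; tag for tagged maps, and on vector? for plain nested vectors.
(defmulti ashape
  "Shape of an array, independent of its representation."
  (fn [a] (if (vector? a) ::nested-vector (::tag a))))

(defmethod ashape ::nested-vector
  [v]
  (loop [s [] x v]
    (if (vector? x)
      (recur (conj s (count x)) (nth x 0))
      s)))

(defmethod ashape ::flat
  [a]
  (:shape a))

(ashape [[1 2] [3 4]])                                   ; => [2 2]
(ashape {::tag ::flat :shape [2 3] :data [1 2 3 4 5 6]}) ; => [2 3]
```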
Unfortunately, a good design and implementation represents a lot of
work. At the moment I am not sure if we have the critical mass of
people interested in working on this to get the job done in a
reasonable amount of time.
Maps could be a good choice for sparse arrays. For dense ones, they
would represent an enormous waste of memory, and probably time as well.
Konrad.
On 28 Oct 2009, at 22:21, Rock wrote:
> I honestly prefer your first case scenario. Seems much more efficient,
> less resource-consuming, and just straightforward. But I really would
> like to know what your preference is. If you had to choose, which way
> would you go?
>>> Suppose we're dealing with rank n objects. Do you think it
>>> would be an easy task to figure all that out dealing with nested
>>> vectors?
>>
>> If you can assume the array is well-formed, it is rather easy.
>> Otherwise it isn't.
>
> Can you give an example in code? I really would like to see it.
(defn rank
  [item]
  (if (vector? item)
    (inc (rank (nth item 0)))
    0))

(rank 0)             ; => 0
(rank [1 2 3])       ; => 1
(rank [[1 2] [3 4]]) ; => 2
(defn shape
  [item]
  (if (vector? item)
    (let [element (nth item 0)
          n (count item)]
      (cons n (shape element)))
    (list)))

(shape 0)             ; => ()
(shape [1 2 3])       ; => (3)
(shape [[1 2] [3 4]]) ; => (2 2)
>>> And by the way, How would you go about implementing in detail
>>> a check to see if a nested vector is actually an authentic
>>> multidimensional array or not?
>>
>> That's a rather simple recursive function. The only problem is its
>> run-time cost.
>
> Please provide an example. In the meantime, I'll give it a shot
> myself.
(defn check-multiarray
  [item]
  (and (vector? item)
       (apply = (map shape item))
       ;; recurse so that deeper levels are checked for raggedness too
       (every? #(or (not (vector? %)) (check-multiarray %)) item)))

(check-multiarray [1 2 3])       ; => true
(check-multiarray [[1 2] [3 4]]) ; => true

(check-multiarray 1)             ; => false
(check-multiarray [[1 2] 3])     ; => false
(check-multiarray [[1 2] [3]])   ; => false
> I feel Clojure can and actually should become a language well-suited
> for scientific computation. It may not have all the prerequisites now,
> but, hey, this is Lisp, and it can be done, always! :)
I completely agree. I don't think Clojure will become a mainstream
language for science any time soon; it is too "exotic" for a
community that still sticks to Fortran. But it can find its niche, and
perhaps not even a small one.
Konrad.
> One issue with using multimethods is the resolution overhead. I
> don't think the JIT can optimize this to the extent it can optimize
> a normal polymorphic Java call.
That's worth exploring - while keeping in mind that the JIT improves
all the time.
> So you might want to use a Java interface for this instead, and
> Clojure's Java interop with macros to wrap a nice Clojury API around
> it.
Maybe. For now I would consider this premature optimization. The big
negative impact of using a Java interface is that it cannot be
implemented by existing Java classes, such as Colt arrays. They would
have to be wrapped in another object layer just to implement the
interface.
> A second problem is boxing of primitives, and a third is
> noncontiguous storage. The Java for the nonsparse stuff should
> probably use and expose Java arrays of primitives.
I can see a role both for an implementation based on Clojure vectors
and for one using Java arrays. With a multimethod interface, both can
coexist.
> If you can live without runtime polymorphism, having only compile-
> time polymorphism, the polymorphism can move back to the Clojure
> side but live in a macro;
Indeed. One could implement a polymorphic interface in one namespace
and a compile-time polymorphic interface in another namespace, making
it possible to switch easily between the two.
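For illustration, compile-time dispatch could be as simple as a macro that consults a setting at expansion time. Everything here, including the `array-impl` switch, is hypothetical:

```clojure
;; The implementation is chosen when the macro expands, so the
;; compiled code contains a direct access with no runtime dispatch.
(def array-impl :vectors) ; compile-time switch, read at expansion

(defmacro aget2
  "Two-dimensional element access for the configured implementation."
  [a i j]
  (case array-impl
    :vectors `(get-in ~a [~i ~j])
    :java    `(aget ~a (int ~i) (int ~j))))

(aget2 [[1 2] [3 4]] 1 0) ; => 3
```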
What is nice about Clojure is that many of these implementation
decisions can be changed later on without modifying the client-side API.
Konrad.