IO to core.matrix from NDArray in binary file for hyper- and multispectral images?


Ben K

Dec 29, 2014, 10:41:24 AM12/29/14
to numerica...@googlegroups.com
I'm looking at moving a library of hyperspectral routines I have, written in a mix of Matlab, IDL, Python, etc., into a Clojure library built against core.matrix. The datasets are not (at least in my view) prohibitively large. A tiny one I'm using for initial testing is [472 682 128] (as a hyperspectral image, it's essentially a 472 x 682 image with 128 color channels), so just over 40 million elements. A larger dataset could be 4k+ x 4k+ x 1000 channels.

The blocking issue for me at present is figuring out how to do fast io, reshape, and transpose. These come from ENVI datasets which are stored as flat binary files (just the flattened NDArray in whatever datatype). It's fairly trivial to write the Java nio code to read in the array, and this takes a few hundred msec for the smaller dataset (472, 682, 128), so that's not particularly a big deal. The main issue I'm encountering is getting it into the two different shapes I need to start doing any meaningful numerical analysis. Image texture problems need an image as a slice of the 3D array, versus the spectral matrix of [128 321904] or its transpose (X and Y flattened). To get a known spectrum from a scene, it's just a matter of taking the spectral slice from the dataset in whatever dimensionality.
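For concreteness, here's a minimal sketch of the kind of NIO read described above. The path, element count, and little-endian byte order are assumptions; the real byte order and interleave come from the accompanying ENVI header file.

```clojure
(import '[java.nio ByteBuffer ByteOrder]
        '[java.nio.file Files Paths])

(defn read-envi-shorts
  "Read n 16-bit samples from a flat ENVI binary file into a Java short array."
  [^String path n]
  (let [bytes (Files/readAllBytes (Paths/get path (into-array String [])))
        buf   (.order (ByteBuffer/wrap bytes) ByteOrder/LITTLE_ENDIAN)
        out   (short-array n)]
    (.get (.asShortBuffer buf) out)
    out))

;; e.g. (read-envi-shorts "scene.dat" (* 472 682 128))
```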

Calling reshape on the short array I get from IO takes from one to several minutes to go from flat to [472 682 128] (fromfile to this shape takes 30 msec in NumPy on the same box). It's presently a blocking problem for me. I'll need to follow it up by changing dimensions to [128 472 682] and then collapsing the 2nd and 3rd dimensions to get the spectral matrix. I was expecting that these would be faster, esp. with the :ndarray backend implementation (I've tried a few backends), but this doesn't seem to be the case - even just another naive reshape is really slow.
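As a sketch of the two target shapes, assuming the :vectorz backend and a flat 1D array standing in for the loaded data (BIP interleave, i.e. bands varying fastest, is assumed here; other interleaves would need a different permutation):

```clojure
(require '[clojure.core.matrix :as m])
(m/set-current-implementation :vectorz)

(def flat (m/new-array [(* 472 682 128)]))   ; stand-in for the IO result
(def cube (m/reshape flat [472 682 128]))    ; image cube: rows x cols x bands

;; with BIP ordering each row of the [321904 128] matrix is one pixel's
;; spectrum, so a single 2D transpose gives the [128 321904] spectral matrix
(def spectral (m/transpose (m/reshape cube [(* 472 682) 128])))
```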

Is there an expected/supported way of saving/reading core.matrix data I could be making use of? The performance cost isn't so bad that I couldn't just batch file conversion then work from the new format. Is there a better way to read a flat array into a shape I can make use of in core.matrix that involves targeting something for a specific impl? Is the reshape/transpose performance expected?

-B

Mike Anderson

Dec 29, 2014, 6:10:26 PM12/29/14
to numerica...@googlegroups.com
4k * 4k * 1000 is pretty big - that is ~30GB even with shorts. Might be tricky to do much in-memory with that right now?

The smaller dataset should be simple though. Reshape and transpose should be pretty fast - I get under a second for either operation with vectorz-clj 0.28.0 using the regular dense double arrays.

(use 'clojure.core.matrix)
(set-current-implementation :vectorz)

(def a (new-array [472, 682, 128]))

(time (reshape a [128 321904]))
"Elapsed time: 883.068431 msecs"

(time (transpose a))
"Elapsed time: 149.31487 msecs"

Can you post the following and I'm happy to take a look:
a) What implementation you are using (specifically what type of array are you loading the data into)
b) What code are you using to load / reshape / transpose the data?

If you are using regular nested Clojure persistent vectors (i.e. the :persistent-vector implementation), I guess you might see the kind of times you are talking about. Even then I'd hope to be able to best NumPy though so I'd like to see what code you are using so that we can have a crack at optimising it!

Likewise, the :ndarray implementation isn't yet fully optimised so happy to take a look and see if you are hitting a slow code path. 

Ben K

Dec 30, 2014, 11:05:29 AM12/30/14
to numerica...@googlegroups.com
Hi Mike - Thanks for the response! I was able to go from your example and comments to resolve the issue. It's very obvious in retrospect, but I can detail why it was non-obvious to me at the time. I had two issues:

(1) I didn't take the happy path into the library because I didn't call "array" on the Java short array returned from io.
(2) I must have had a version misalignment between vectorz-clj and core.matrix when I originally tested this case, which meant the implementation wasn't configured correctly.

One other note: the happy path still seems to be quite slow with ndarray.

What I ran into was this:
-- Starting with the :vectorz impl I tried to do the reshape on the short array, saw that it worked but took a long time.
-- After the reshape, I tried more reshapes to see if they would still take a long time -- they did (but here I suspect it had silently fallen back to the persistent-vector implementation).
-- I thought it was something intrinsic to reshape with :vectorz, tried switching to :ndarray and then :jblas, and saw the same problems because:
  a-- :ndarray really does seem to be that slow for reshape
  b-- the :jblas backend wasn't working and I didn't realize it - same case as with vectorz
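A quick sanity check for this kind of silent fallback, before timing anything, might look like:

```clojure
(require '[clojure.core.matrix :as m])
(m/set-current-implementation :vectorz)
(m/current-implementation)
;; :vectorz if the dependency resolved correctly; :persistent-vector (or an
;; exception at set-current-implementation) points to a version mismatch
```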

Workflow I followed for resolution was:

1) Following the new-array initialization path you posted, I started getting the warning/error messages about the implementation not working.
2) Updated the project to the latest vectorz-clj and core.matrix versions and set the implementation to :vectorz.
3) The implementation then seemed to be working; I tested your case with new-array and everything was fine.
4) Fixed my io path to use "clojure.core.matrix/array" on the short-array first: you can see really rough wip for io here: https://github.com/benkamphaus/rs-polyglot/blob/master/rs-clj/src/rs_clj/io.clj#L50-L58 (this is a scratch workspace, not library quality code) - now at ~2 seconds.
5) Verified that this still takes a long time with :ndarray
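In miniature, the fix in step 4 amounts to something like this (the array sizes are from the small test dataset; the short array stands in for the NIO read):

```clojure
(require '[clojure.core.matrix :as m])
(m/set-current-implementation :vectorz)

(let [raw  (short-array (* 472 682 128))  ; stand-in for the NIO read
      a    (m/array raw)                  ; copy into the vectorz implementation
      cube (m/reshape a [472 682 128])]   ; reshape now on the fast path
  (m/shape cube))
;; => [472 682 128]
```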

Thanks again, I'll be able to play with stuff pretty easily now. :)

-B

Ben K

Dec 30, 2014, 11:16:03 AM12/30/14
to numerica...@googlegroups.com
Also, re "4k * 4k * 1000 is pretty big - that is ~30Gb even with shorts. Might be tricky to do much in-memory with that right now?" 

Yeah, the plan is a memory-mapped file of some type for the initial dimensionality reduction (probably PCA), which would probably take it to 4k*4k*50 or so -- ? Then you just need to keep the image around on disk so you can read spectra from it for reference or comparison to other datasets. Depending on the resolution/coverage of the image you could also process in subsets, but you need the right scene composition for certain analyses, so with high spatial resolution you might need the scene to be large enough to contain a sufficiently heterogeneous mix of materials.
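A hypothetical sketch of the memory-mapped approach, mapping one band at a time rather than holding ~30GB in memory. BSQ (band-sequential) interleave and 2-byte little-endian samples are assumptions here:

```clojure
(import '[java.nio ByteOrder]
        '[java.nio.channels FileChannel FileChannel$MapMode]
        '[java.nio.file OpenOption Paths StandardOpenOption])

(defn read-band
  "Read band band-idx of a rows x cols BSQ-interleaved file as a short array."
  [^String path band-idx rows cols]
  (with-open [ch (FileChannel/open
                   (Paths/get path (into-array String []))
                   (into-array OpenOption [StandardOpenOption/READ]))]
    (let [band-bytes (* rows cols 2)                       ; 2 bytes per short
          buf (.order (.map ch FileChannel$MapMode/READ_ONLY
                            (* (long band-idx) band-bytes) band-bytes)
                      ByteOrder/LITTLE_ENDIAN)
          out (short-array (* rows cols))]
      (.get (.asShortBuffer buf) out)
      out)))
```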

-B

Mike Anderson

Dec 31, 2014, 12:22:39 AM12/31/14
to numerica...@googlegroups.com
Thanks for the report!

Glad to hear that you got it working nice and fast.

Can you file an issue with the slow code example so that we can track and fix it?

A tangential observation is that functions like "reshape" don't always create a new array using the current implementation; rather, they behave according to the implementation of their input. Maybe this should be revisited but it is a tricky one: not every array can be cast into every other because of element types etc. Currently (as you have discovered) the best approach is to convert to the format you want first, then start the manipulations.
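The "convert first, then manipulate" advice in miniature (a small illustrative sketch, not from the thread):

```clojure
(require '[clojure.core.matrix :as m])

(def pv [[1 2] [3 4]])           ; nested persistent vectors
(def vz (m/coerce :vectorz pv))  ; one-time conversion into vectorz
(m/reshape vz [4 1])             ; result stays in the vectorz implementation
```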