non-zero-indices as a seq of indices rather than nested collections

32 views
Skip to first unread message

Russell Dunphy

unread,
Jan 28, 2017, 5:08:14 PM1/28/17
to Numerical Clojure
Very inexperienced with matrix-y things so may be missing something obvious. Was trying to get the list of non-zero indices from a sparse matrix, but `clojure.core.matrix/non-zero-indices` returns a nested array with one fewer dimensions that its argument, rather than the seq of indices that I initially expected. To get the non-zero indices in the form I wanted I did the below, but it seems pretty inefficient - is there a better way?

Thanks,

Russell

(defn- non-zero-indices
  [m]
  (let [shape (m/shape m)
        idxs  (m/non-zero-indices m)]
    (->> (map range (butlast shape))
         (apply clojure.math.combinatorics/cartesian-product)
         (mapcat #(mapv (partial conj (vec %)) (get-in idxs %))))))

Mike Anderson

unread,
Jan 28, 2017, 10:22:18 PM1/28/17
to Numerical Clojure
Hi Russell,

What exactly are you trying to do?

There may be a better way if you look at the bigger picture. Creating a big seq of indexes and doing lookup on these on a per-element basis is pretty expensive.

  Mike.

Russell Dunphy

unread,
Jan 29, 2017, 6:34:24 AM1/29/17
to Numerical Clojure
Hi Mike, and thanks for the quick response.

There are two things I'm trying to do: the first is to write out a sparse matrix to file in the matrix market format. For this I need to doseq over only the matrix's non-zero indexes, and print them out line by line. So a sparse matrix that looks like this:

[[0.0 0.0 2.3]
 [0.0 0.7 0.0]
 [0.0 0.0 0.0]]

I want to print like this:

0 2 2.3
1 1 0.7

The second thing I would like to be able to do is to map a function over all the non-zero elements of a sparse matrix. The specific reason I want to do this is the following:

I have a large set of documents made up of about 300 different, non-overlapping subgroups of documents. I want to calculate TF-IDF for each subgroup individually, where the "document" is an individual document, but I also want to calculate an overall TF-IDF that treats each subgroup as a whole as a single "document".

I started by, for each subgroup, counting the number of documents in which each term appears. So I have one matrix per subgroup that contains the number of documents within the subgroup that contain each term. Now I want to combine those matrices into one that contains the counts of subgroups that contain each term.

My initial idea of how to achieve this was to somehow map the function (constantly 1) over the non-zero elements of the matrices and then reduce them with m/add. But I'm not sure if that's possible and/or easy. Or if there's a simple alternative way to do it. I don't have any grounding in linear algebra other than what I was able to scrape together while doing a Coursera ML course, so I am very much blundering my way round.

Thanks again for your help,

Russell

Russell Dunphy

unread,
Jan 29, 2017, 9:03:26 AM1/29/17
to Numerical Clojure
Just seen that someone asked a similar question on this mailing list not long ago (re: setting all non-zero values to 1). However I think this is a case where the answer to that reply doesn't apply, as the matrix is very sparse.

For a bit of extra context, I'm converting the terms in the documents into indices in the frequency matrix by taking the absolute value of the Murmur3 hash of the term. So the matrix (well, vector, I apologise for my muddled use of terminology) is very large and very sparse. Again, if there's a better way to do this I'm all ears!

Mike Anderson

unread,
Jan 29, 2017, 10:56:45 PM1/29/17
to Numerical Clojure
OK so I'm considering a function `emap-sparse` that will do what you want (map a function over sparse elements of an array, leaving others unchanged at zero).

This needs backend implementation support to be efficient, which should be easy enough in Vectorz but again needs some quick changes.

Not top of my priority list though... so no guarantees on timing.

Other thing you might try is:

(gt [0 1 2 3 0 1] 0)
=> [0 1 1 1 0 1]

This doesn't have acceleration for sparse arrays yet, but may be fast enough anyway as it is quite efficiently implemented (for Vectorz at least). 

Russell Dunphy

unread,
Jan 30, 2017, 3:43:05 AM1/30/17
to Numerical Clojure
Great - thanks. Would your proposed `emap-sparse` also work when you wanted to `doseq` over the indices, as in the other example of writing them to file?

No rush if this is going require a bunch of changes. I've been looking at existing Java libs, and it looks like la4j has the functionality I need, so I will use that directly in the meantime.

Thanks,

Russell
Reply all
Reply to author
Forward
0 new messages