[SciPy-User] indexation of sparse matrices

18 views
Skip to first unread message

Dmitrey

unread,
Dec 13, 2009, 8:44:52 AM12/13/09
to scipy...@scipy.org
hi all,
suppose I have a scipy.sparse matrix A and positions of elements to be extracted I = (2,3,4), J = (0,1,5).
how can I get vector val (Python list or numpy array) that contains the elements A[2,0], A[3,1], A[4,5]?
I haven't found it in scipy.sparse doc.

And, if anyone knows, how can I get it from dense numpy array A?

Thank you in advance, D.

Emmanuelle Gouillart

unread,
Dec 13, 2009, 12:34:02 PM12/13/09
to SciPy Users List
Hi Dmitrey,

what you want to do is called *fancy indexing* in numpy. Fancy
indexing consists in indexing an array with (a) sequence(s) of indices,
see
http://docs.scipy.org/doc/numpy/reference/arrays.indexing.html#advanced-indexing
for more details. It works on numpy arrays as well as sparse matrices for
the lil_matrix format. Here is an example:

>>> import numpy as np
>>> from scipy import sparse
>>> a = np.zeros((10,10))
>>> a[2, 0] = a[3, 1] = a[4, 2] = 1
>>> a[4, 5] = 2
>>> a_sparse = sparse.lil_matrix(a)
>>> a_sparse
<10x10 sparse matrix of type '<type 'numpy.float64'>'
with 4 stored elements in LInked List format>
>>> I = (2, 3, 4)
>>> J = (0, 1, 5)
>>> a[I, J]
array([ 1., 1., 2.])
>>> a_sparse[I, J]
<1x3 sparse matrix of type '<type 'numpy.float64'>'
with 3 stored elements in LInked List format>
>>> a_sparse[I, J].todense()
matrix([[ 1., 1., 2.]])

The lil_matrix is meant for supporting fancy indexing, but it is
not efficient for matrices operations such as inversion or
multiplication; you should transform your matrix to another format for
performing such operations.

Cheers,

Emmanuelle

> _______________________________________________
> SciPy-User mailing list
> SciPy...@scipy.org
> http://mail.scipy.org/mailman/listinfo/scipy-user

_______________________________________________
SciPy-User mailing list
SciPy...@scipy.org
http://mail.scipy.org/mailman/listinfo/scipy-user

Nathan Bell

unread,
Dec 14, 2009, 1:39:03 AM12/14/09
to SciPy Users List
On Sun, Dec 13, 2009 at 12:34 PM, Emmanuelle Gouillart
<emmanuelle...@normalesup.org> wrote:
>        Hi Dmitrey,
>
>        what you want to do is called *fancy indexing* in numpy.  Fancy
> indexing consists in indexing an array with (a) sequence(s) of indices,
> see
> http://docs.scipy.org/doc/numpy/reference/arrays.indexing.html#advanced-indexing
> for more details. It works on numpy arrays as well as sparse matrices for
> the lil_matrix format. Here is an example:
>
>        The lil_matrix is meant for supporting fancy indexing, but it is
> not efficient for matrices operations such as inversion or
> multiplication; you should transform your matrix to another format for
> performing such operations.
>

The CSR and CSC formats also support fast fancy indexing. Something like:
>>> A = ... some sparse matrix
>>> csr_matrix(A)[I,J]
ought to work well.

--
Nathan Bell wnb...@gmail.com
http://www.wnbell.com/

Dmitrey

unread,
Dec 14, 2009, 3:51:20 PM12/14/09
to SciPy Users List
От кого: Nathan Bell <wnb...@gmail.com>

The CSR and CSC formats also support fast fancy indexing. Something like:
>>> A = ... some sparse matrix
>>> csr_matrix(A)[I,J]
ought to work well.

I have tried all 3, lil works faster than any others, but casting to numpy ndarray and invoking it on the dense array obtained works several times faster, that is extremely important for me. Thus on my problem with sparse matrix of shape 151* 1178
each time casting it to dense before applying fancy indexing makes numerical optimization ipopt solver works faster from 130 to 60 sec. Here's the code that demonstrates the difference:

from scipy.sparse import lil_matrix, find
from numpy import where, zeros, prod, vstack, ones
from time import time
n, m = 1000, 20000
M = vstack((zeros((n, m)), ones((25, m))))
I, J = where(M)

# 1: time for lil
M = lil_matrix(M)
print 'sparsity of the matrix:', float(M.size)/prod(M.shape)

t = time()
M[I, J]
print('time elapsed with lil: %f' % (time()-t))

# 2: time for dense
t = time()
M = M.A # even with this time elapsed is less
M[I, J]
print('time elapsed with ndarray: %f' % (time()-t))

# output:
# sparsity of the matrix: 0.0243902439024
# time elapsed with lil: 17.006631
# time elapsed with ndarray: 0.710046
# as for csr and csc fancy indexation it works several times slower

So, I have filed a ticket for it.

Nathan Bell

unread,
Dec 15, 2009, 4:31:02 AM12/15/09
to SciPy Users List
2009/12/14 Dmitrey <tm...@ukr.net>:

>
> I have tried all 3, lil works faster than any others, but casting to numpy
> ndarray and invoking it on the dense array obtained works several times
> faster, that is extremely important for me. Thus on my problem with sparse
> matrix of shape 151* 1178
> each time casting it to dense before applying fancy indexing makes numerical
> optimization ipopt solver works faster from 130 to 60 sec. Here's the code
> that demonstrates the difference:
>
>
> So, I have filed a ticket for it.

Hi Dmitrey,

I think what you're asking for is simply not possible. The time spent
looking up elements in a dense matrix (a completely trivial operation)
will always be less than the lookup in sparse matrix data structures
where indexing elements is inherently more expensive.

If you want something that does the above operation really fast then you can do
>>> M = coo_matrix(M)
>>> M.data # same as M[I,J]
or
>>> M = csr_matrix(M)
>>> M.data # same as M[I,J]

Why do you want to index sparse matrices in the first place? For many
operations there are better formulations that avoid the need to do
indexing of any kind.

Dmitrey

unread,
Dec 15, 2009, 6:03:08 AM12/15/09
to SciPy Users List
От кого: Nathan Bell <wnb...@gmail.com>

2009/12/14 Dmitrey <tm...@ukr.net>:
>
> I have tried all 3, lil works faster than any others, but casting to numpy
> ndarray and invoking it on the dense array obtained works several times
> faster, that is extremely important for me. Thus on my problem with sparse
> matrix of shape 151* 1178
> each time casting it to dense before applying fancy indexing makes numerical
> optimization ipopt solver works faster from 130 to 60 sec. Here's the code
> that demonstrates the difference:
>
>
> So, I have filed a ticket for it.

Hi Dmitrey,

I think what you're asking for is simply not possible. The time spent
looking up elements in a dense matrix (a completely trivial operation)
will always be less than the lookup in sparse matrix data structures
where indexing elements is inherently more expensive.

but what about dok format? I guess it shouldn't be too expensive - it is  O(numberOfRequiredElements*log2(nonZerosNumber)), and there will be no peak memory jump as for lil.

If you want something that does the above operation really fast then you can do
>>> M = coo_matrix(M)
>>> M.data # same as M[I,J]
or

But I want to have M[I, J], not just M. Some (or lots of) required elements can be zeros and thus absent in  M.data.

Why do you want to index sparse matrices in the first place? For many
operations there are better formulations that avoid the need to do
indexing of any kind.

Because some optimization problems can have constraints gradient of size nVars*nConstraints very huge.  nVars are up to ~100'000, and nConstraints are up to 200'000'000. I will just have memory overflow if I will try casting it to dense each time.
BTW, why return type of numpy.where is int64, but type of scipy.sparse.find is numpy.int32? I had spent lots of time to search why ipopt hadn't work with that data, and it turned out it is due to this type difference (maybe some issues related to connecting Python code to C++). Also, does it mean that I cannot create sparse matrices with one dimension greater than max(int32)  instead of max(int64)?

Dag Sverre Seljebotn

unread,
Dec 15, 2009, 6:15:06 AM12/15/09
to SciPy Users List
I can confirm this, scipy.sparse uses int32 indices only. Search for
"intc" in the sources for the relevant lines.

Dag Sverre

Nathan Bell

unread,
Dec 15, 2009, 8:32:47 AM12/15/09
to SciPy Users List
2009/12/15 Dmitrey <tm...@ukr.net>:

>
> Because some optimization problems can have constraints gradient of size
> nVars*nConstraints very huge.  nVars are up to ~100'000, and nConstraints
> are up to 200'000'000. I will just have memory overflow if I will try
> casting it to dense each time.

I looked at the CSR fancy indexing code and realized it was much worse
than I originally thought so I implemented a faster path in C++ for
the operation. With SciPy r6139 (
http://projects.scipy.org/scipy/changeset/6139 ) I get a considerable
improvement with a variation of your example:

from scipy.sparse import csr_matrix, find


from numpy import where, zeros, prod, vstack, ones
from time import time
n, m = 1000, 20000
M = vstack((zeros((n, m)), ones((25, m))))
I, J = where(M)

# 1: time for lil

M = csr_matrix(M)


print 'sparsity of the matrix:', float(M.size)/prod(M.shape)

t = time()
M[I, J]

print('time elapsed with csr: %f' % (time()-t))

# 2: time for dense
t = time()
M = M.A # even with this time elapsed is less
M[I, J]
print('time elapsed with ndarray: %f' % (time()-t))

# output:
# sparsity of the matrix: 0.0243902439024

# time elapsed with csr: 0.044210
# time elapsed with ndarray: 0.197938

> BTW, why return type of numpy.where is int64, but type of scipy.sparse.find
> is numpy.int32? I had spent lots of time to search why ipopt hadn't work
> with that data, and it turned out it is due to this type difference (maybe
> some issues related to connecting Python code to C++). Also, does it mean
> that I cannot create sparse matrices with one dimension greater than
> max(int32) instead of max(int64)?

That's correct, max(int32) is the largest size you can use.

There are a few reasons for this limitation. The first is that few
people will have matrices with more than 2B rows or columns since most
meaningful matrices of that size would require 100GB+ of memory. The
second is that external libraries (SuperLU, UMFPACK, ARPACK, etc.)
don't necessarily support larger indices, or our wrappers for those
libraries don't necessarily support it right now. Other SciPy
components, like io.savemat(), would also have to be checked as well.

The C++ backend of the sparse module (sparsetools) is fully templated
so adding support for 64-bit integers is fairly trivial there.
However, it would have the downside of doubling the compilation time
and executable size.

Dmitrey

unread,
Dec 15, 2009, 9:05:46 AM12/15/09
to SciPy Users List
thank you, it work very well now for both M and M.T
Doesn't anyone know when next scipy/numpy releases are planned (approximately, of course)?
Thank you in advance, D.

Reply all
Reply to author
Forward
0 new messages