[SciPy-User] Normalizing a sparse matrix

coolhea...@gmail.com

Mar 20, 2011, 3:06:53 AM

to scipy...@scipy.org

Hi,

I have a sparse matrix with roughly 300*10000 nonzero entries, constructed from a 14000*14000 matrix. In each iteration, after performing some operations on the sparse matrix (such as multiply and dot), I have to divide each row of the corresponding dense matrix by the sum of its elements.

Since the sparse matrix format doesn't support all the required operations (such as divide), I tried converting it to a dense format and then dividing by the row sums. But this raises a MemoryError exception because the 14000*14000 matrix doesn't fit in memory.

Can someone tell me how to normalize a sparse matrix?
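
For a sense of scale, here is a rough back-of-the-envelope estimate of the two footprints. It assumes 8-byte float64 values and 4-byte int32 index arrays for the CSR format; neither dtype is stated in the post, so treat the numbers as illustrative only.

```python
# Rough memory estimate; the dtypes (float64 values, int32 indices) are assumptions.
n = 14000                 # matrix is n x n
nnz = 300 * 10000         # approximate number of stored entries

dense_bytes = n * n * 8                         # every entry stored as float64
csr_bytes = nnz * 8 + nnz * 4 + (n + 1) * 4     # data + indices + indptr

print("dense: %.2f GiB" % (dense_bytes / 2**30))
print("CSR:   %.2f MiB" % (csr_bytes / 2**20))
```

Under these assumptions the dense array needs about 1.5 GB, while the CSR representation needs about 36 MB, which is why staying in sparse format matters here.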

Warren Weckesser

Mar 20, 2011, 6:37:21 AM

to SciPy Users List

This will normalize the rows of R, a sparse matrix in CSR format:

-----

# Normalize the rows of R.
row_sums = np.array(R.sum(axis=1))[:,0]
# OR: row_sums = R.dot(np.ones(R.shape[1]))
row_indices, col_indices = R.nonzero()
R.data /= row_sums[row_indices]

-----

The attached code provides an example of that snippet in use.
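
For readers without the attachment, here is a minimal self-contained sketch of the same snippet applied to a small made-up matrix. This is not the attached file; the 3x4 values are invented for illustration, and the matrix is assumed to have a floating-point dtype and no explicitly stored zeros (otherwise R.nonzero() and R.data would not line up one-to-one).

```python
import numpy as np
from scipy import sparse

# Made-up example data; the second row is intentionally empty.
R = sparse.csr_matrix(np.array([
    [1.0, 0.0, 3.0, 0.0],
    [0.0, 0.0, 0.0, 0.0],
    [2.0, 2.0, 0.0, 4.0],
]))

# Sum of each row as a 1-D array.
row_sums = np.array(R.sum(axis=1))[:, 0]
# Row index of every nonzero entry, in the same order as R.data.
row_indices, col_indices = R.nonzero()
# Divide each stored value by the sum of its row, in place.
R.data /= row_sums[row_indices]

normalized = R.toarray()
print(normalized)
```

The empty row causes no division by zero because only stored entries are divided, and an empty row has none.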

Warren

sparse_normalize_rows_example.py

David

Mar 21, 2011, 10:04:12 PM

to SciPy Users List

On 03/20/2011 04:06 PM, coolhea...@gmail.com wrote:
> Hi,
>
> I have a sparse matrix with nearly (300*10000) entries constructed out
> of 14000*14000 matrix...In each iteration after performing some
> operations on the sparse matrix(like multiply and dot) I have to divide
> each row of the corresponding dense matrix with the sum of its elements...

It is not well documented, not really part of the public API, and rather
low-level, but you can use scipy.sparse.sparsetools. As it is
implemented in C++, it should be both CPU- and memory-efficient.

I am using the following function to normalize each row of a CSR matrix:

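import numpy as np
# Private, undocumented module; its location may differ between SciPy versions.
from scipy.sparse.sparsetools import csr_scale_rows

def normalize_pairs(pairs):
    """Normalize the rows of the pairs matrix so that sum(row) == 1 (or 0 for
    empty rows).

    Note
    ----
    Does the modification in-place."""
    factor = pairs.sum(axis=1)
    nnzeros = np.where(factor > 0)
    factor[nnzeros] = 1 / factor[nnzeros]
    # Flatten the (n_rows, 1) result into a 1-D array, one factor per row.
    factor = np.asarray(factor).ravel()

    if not pairs.format == "csr":
        raise ValueError("csr only")
    csr_scale_rows(pairs.shape[0], pairs.shape[1], pairs.indptr,
                   pairs.indices,
                   pairs.data, factor)
    return pairs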

I don't advise using this function if reliability is a concern, since it
relies on a non-public API, but it works well for matrices bigger than
the ones you are mentioning.

Cheers,

David
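
For comparison, the same row scaling can be sketched with public NumPy/SciPy calls only, by expanding the per-row factors to one value per stored entry. The helper name normalize_rows_inplace is hypothetical (it is not from the thread), the matrix is assumed to be CSR with a floating-point dtype, and, as a choice made for this sketch, rows whose sum is zero are left unchanged.

```python
import numpy as np
from scipy import sparse

def normalize_rows_inplace(mat):
    """Hypothetical helper: scale each row of a float CSR matrix in place
    so that it sums to 1. Rows whose sum is 0 are left unchanged."""
    if mat.format != "csr":
        raise ValueError("csr only")
    row_sums = np.asarray(mat.sum(axis=1)).ravel()
    # Reciprocal of each row sum; factor 1.0 where the sum is zero.
    factor = np.ones_like(row_sums, dtype=float)
    nonzero = row_sums != 0
    factor[nonzero] = 1.0 / row_sums[nonzero]
    # np.diff(indptr) is the number of stored entries in each row, so
    # np.repeat yields one factor per entry, in the same order as mat.data.
    mat.data *= np.repeat(factor, np.diff(mat.indptr))
    return mat

# Made-up example data; the second row is empty.
pairs = sparse.csr_matrix(np.array([
    [1.0, 1.0, 2.0],
    [0.0, 0.0, 0.0],
    [0.0, 5.0, 0.0],
]))
normalize_rows_inplace(pairs)
result = pairs.toarray()
print(result)
```

This trades the C++ routine for a temporary array of length nnz, which is small next to a dense copy of the matrix.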
_______________________________________________
SciPy-User mailing list
SciPy...@scipy.org
http://mail.scipy.org/mailman/listinfo/scipy-user
