Can't TfidfModel or LogEntropyModel take sparse Scipy matrices?

Johann Petrak

Apr 4, 2017, 2:29:31 PM4/4/17
to gensim
I am a bit confused about how best to go about speeding up the following
steps:
1) Read in a large corpus of documents to generate a dictionary.
   This should be done with a streaming corpus, since the original
   textual corpus does not fit into memory.
2) Reduce the dictionary.
3) Read in the corpus again and store BOW vectors as memory-efficiently as possible.
   So far this is a streaming corpus implementation that yields
   the BOW representation created by the dictionary.
4) Perform e.g. a tf-idf or log-entropy transformation.
   Ideally this would already happen on something in memory instead
   of reading in the whole corpus again.
5) Perform e.g. LSI. This can work on a sparse scipy representation
   of the corpus.

It is possible to convert the result of the tf-idf transformation to scipy
and then run the LSI on that, but this still reads the corpus several
times beforehand. So I thought: why not convert the corpus into
scipy format before doing the tf-idf transformation in the first place?
However, when I then pass the scipy sparse matrix like so
   myCorpus = gensim.matutils.corpus2csc(myCorpus)
   model = TfidfModel(myCorpus, id2word=dictionary, normalize=doNorm)

I get the following error:
  File "/usr/local/lib/python3.5/dist-packages/gensim/models/tfidfmodel.py", line 96, in __init__
    self.initialize(corpus)
  File "/usr/local/lib/python3.5/dist-packages/gensim/models/tfidfmodel.py", line 118, in initialize
    numnnz += len(bow)
  File "/usr/local/lib/python3.5/dist-packages/scipy/sparse/base.py", line 246, in __len__
    raise TypeError("sparse matrix length is ambiguous; use getnnz()"
TypeError: sparse matrix length is ambiguous; use getnnz() or shape[0]

So what would be the most memory-efficient way to keep everything in RAM once
I have my BOW vectors? Is the BOW representation created by the dictionary
comparable in memory efficiency to the scipy sparse matrix format?

If I have to use an in-memory list of BOW vectors from the dictionary instead of the scipy
sparse matrix representation for the tf-idf transform, will it still pay off to convert the
result to scipy sparse matrix format before passing it on to LSI?

Thanks,
  johann

Lev Konstantinovskiy

Apr 4, 2017, 8:30:06 PM4/4/17
to gensim
Hi Johann,

The ndarrays used in scipy sparse matrices are indeed more memory-efficient than Python lists, but as you can see they are not supported. The Tf-idf model would need some modifications to accept and return sparse matrices.

The matrix should stay sparse inside LSI, so converting it before passing it in is an advantage.

Regards
Lev

Radim Řehůřek

Apr 5, 2017, 10:07:06 PM4/5/17
to gensim
Hello Johann,

I'm not aware of the LSI implementation being able to work on scipy.sparse. Are you sure?

You say you need streaming because your corpus doesn't fit in RAM, so I'm confused about why you'd want to load it all into RAM instead.

Regardless, all models accept a streaming corpus on input, not an in-RAM scipy.sparse matrix. But if you store your corpus in something like the MatrixMarket (.mm) format, you'll still get streaming with minimal overhead.

HTH,
Radim