Can't TfidfModel or LogEntropyModel take sparse Scipy matrices?

Johann Petrak

Apr 4, 2017, 2:29:31 PM4/4/17
to gensim
I am a bit confused about how best to go about speeding up the following
steps:
1) Read in a large corpus of documents to generate a dictionary.
   This should be done with a streaming corpus, since the original
   textual corpus does not fit into memory.
2) Reduce the dictionary.
3) Read in the corpus again and store BOW vectors as memory-efficiently as possible.
   So far this is a streaming corpus implementation that yields
   the BOW representation created by the dictionary.
4) Perform e.g. a tf-idf or log-entropy transformation.
   Ideally this would already happen on something in memory instead
   of reading in the whole corpus again.
5) Perform e.g. LSI. This can work on a sparse scipy representation
   of the corpus.

It is possible to convert the result of the tf-idf transformation to scipy
and then run the LSI on that, but this still reads the corpus several
times beforehand. So I thought: why not convert the corpus into
scipy format before doing the tf-idf transformation in the first place?
However, when I then pass the scipy sparse matrix like so
   myCorpus = gensim.matutils.corpus2csc(myCorpus)
   model = TfidfModel(myCorpus, id2word=dictionary, normalize=doNorm)

I get the following error:
  File "/usr/local/lib/python3.5/dist-packages/gensim/models/tfidfmodel.py", line 96, in __init__
    self.initialize(corpus)
  File "/usr/local/lib/python3.5/dist-packages/gensim/models/tfidfmodel.py", line 118, in initialize
    numnnz += len(bow)
  File "/usr/local/lib/python3.5/dist-packages/scipy/sparse/base.py", line 246, in __len__
    raise TypeError("sparse matrix length is ambiguous; use getnnz()"
TypeError: sparse matrix length is ambiguous; use getnnz() or shape[0]

So what would be the most memory-efficient way to keep everything in RAM once
I have my BOW vectors? Is the BOW representation created by the dictionary
comparable in memory efficiency to the scipy sparse matrix format?

If I have to use an in-memory list of BOW vectors from the dictionary instead of the scipy
sparse matrix representation for the tf-idf transform, will it still pay off to convert the
result to scipy sparse matrix format before passing it on to LSI?

Thanks,
  johann

Lev Konstantinovskiy

Apr 4, 2017, 8:30:06 PM4/4/17
to gensim
Hi Johann,

The ndarrays used in scipy sparse matrices are indeed more memory-efficient than Python lists, but as you can see they are not supported. The Tf-idf model would need some modifications to accept and return sparse matrices.

The matrix should stay sparse inside LSI, so converting it before passing it in is an advantage.

Regards
Lev

Radim Řehůřek

Apr 5, 2017, 10:07:06 PM4/5/17
to gensim
Hello Johann,

I'm not aware of the LSI implementation being able to work on scipy.sparse. Are you sure?

You say you need streaming because your corpus doesn't fit in RAM, so I'm confused about why you'd want to load it all into RAM instead.

Regardless, all models accept a streaming corpus on input, not an in-RAM scipy.sparse matrix. But if you store your corpus in something like the MatrixMarket (.mm) format, you'll still get streaming with minimal overhead.

HTH,
Radim