I am a bit confused about how to best go about speeding up the following
steps:
1) Read in a large corpus of documents to generate a dictionary.
   This should be done with a streaming corpus, since the original
   textual corpus does not fit into memory.
2) Reduce the dictionary.
3) Read in the corpus again and store the BOW vectors as memory-efficiently as possible.
   So far this is a streaming corpus implementation that yields
   the BOW representation created by the dictionary.
4) Perform e.g. a tfidf or logentropy transformation.
   Ideally this would already operate on something held in memory instead
   of reading in the whole corpus again.
5) Perform e.g. LSI. This can work on a sparse scipy representation
   of the corpus.
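To make steps 1) to 3) concrete, here is a minimal plain-Python sketch of the two-pass streaming pattern I have in mind. It is only an illustration, not my actual code: the tokenizer, the `min_docs` threshold and the names `build_vocabulary` and `BowStream` are made up. In gensim terms I believe these roles are played by `corpora.Dictionary`, `Dictionary.filter_extremes` and `Dictionary.doc2bow`.

```python
from collections import Counter

def tokenize(line):
    # Hypothetical tokenizer: lowercase and split on whitespace.
    return line.lower().split()

def build_vocabulary(open_stream, min_docs=2):
    """Pass 1: count document frequencies, then keep only frequent tokens."""
    doc_freq = Counter()
    for line in open_stream():
        doc_freq.update(set(tokenize(line)))
    kept = sorted(tok for tok, df in doc_freq.items() if df >= min_docs)
    return {tok: idx for idx, tok in enumerate(kept)}

class BowStream:
    """Pass 2: re-iterable stream of BOW vectors, one document in memory at a time."""

    def __init__(self, open_stream, token2id):
        self.open_stream = open_stream
        self.token2id = token2id

    def __iter__(self):
        for line in self.open_stream():
            counts = Counter(t for t in tokenize(line) if t in self.token2id)
            yield sorted((self.token2id[t], n) for t, n in counts.items())

# Toy stand-in for a corpus on disk; a real open_stream would reopen the file.
docs = ["human computer interface", "computer system survey", "system human system"]
open_stream = lambda: iter(docs)

token2id = build_vocabulary(open_stream, min_docs=2)
bows = list(BowStream(open_stream, token2id))
print(token2id)
print(bows)
```

Every pass over `BowStream` re-reads the source, which is exactly the repeated reading I would like to avoid in the later steps.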
It is possible to convert the result of the tfidf transformation to scipy
format and then run LSI on it, but by then the corpus will already have been
read several times. So I thought, why not convert the corpus to
scipy format before doing the tfidf transformation?
However, when I then pass the scipy sparse matrix like so:

    import gensim
    from gensim.models import TfidfModel
    myCorpus = gensim.matutils.corpus2csc(myCorpus)
    model = TfidfModel(myCorpus, id2word=dictionary, normalize=doNorm)

I get the following error:
File "/usr/local/lib/python3.5/dist-packages/gensim/models/tfidfmodel.py", line 96, in __init__
self.initialize(corpus)
File "/usr/local/lib/python3.5/dist-packages/gensim/models/tfidfmodel.py", line 118, in initialize
numnnz += len(bow)
File "/usr/local/lib/python3.5/dist-packages/scipy/sparse/base.py", line 246, in __len__
raise TypeError("sparse matrix length is ambiguous; use getnnz()"
TypeError: sparse matrix length is ambiguous; use getnnz() or shape[0]
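My reading of the traceback is that `initialize` loops over the corpus and calls `len(bow)` on every document, so it wants an iterable of BOW lists rather than a matrix object. Below is a plain-Python sketch of what an adapter could look like, assuming the matrix stores one document per column in the usual compressed layout. The class name `ColumnCorpus` and the three toy arrays are made up for illustration. `gensim.matutils.Sparse2Corpus` looks like the library's own wrapper for a real scipy matrix, but I have not checked whether it avoids the error.

```python
from array import array

class ColumnCorpus:
    """Iterate column-compressed arrays as BOW documents (one column per document)."""

    def __init__(self, indptr, indices, data):
        self.indptr, self.indices, self.data = indptr, indices, data

    def __len__(self):
        # Number of documents, i.e. number of columns.
        return len(self.indptr) - 1

    def __iter__(self):
        # Column j occupies the slice indptr[j]:indptr[j + 1] of indices and data.
        for start, end in zip(self.indptr, self.indptr[1:]):
            yield list(zip(self.indices[start:end], self.data[start:end]))

# Toy matrix with two documents: {0: 1.0, 4: 2.0} and {2: 3.0}.
indptr = array("q", [0, 2, 3])
indices = array("i", [0, 4, 2])
data = array("f", [1.0, 2.0, 3.0])

docs = list(ColumnCorpus(indptr, indices, data))
print(docs)
```

Each yielded document is an ordinary list, so `len(bow)` is well defined, while the data itself stays in the compact arrays.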
So what would be the most memory-efficient way to keep everything in RAM once
I have my BOW vectors? Is the BOW representation created by the dictionary
comparable in memory efficiency to the scipy sparse matrix format?
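To get a rough feeling for the difference, here is a small plain-Python estimate. It compares a list of BOW documents (lists of `(id, count)` tuples) with the same entries packed into three flat typed arrays, which is roughly the layout a compressed sparse column matrix uses. The corpus is synthetic and the helper names are made up, so the numbers only indicate the order of magnitude, not what gensim or scipy actually allocate.

```python
import sys
from array import array

def bow_list_bytes(corpus):
    """Approximate bytes used by a list of BOW documents."""
    total = sys.getsizeof(corpus)
    for doc in corpus:
        total += sys.getsizeof(doc)
        for pair in doc:
            # Each entry costs a tuple object plus the two numbers it references.
            # CPython shares small integers, so this over-counts somewhat.
            total += sys.getsizeof(pair) + sum(sys.getsizeof(x) for x in pair)
    return total

def to_compressed_columns(corpus):
    """Pack the same entries into flat typed arrays, one column per document."""
    indptr, indices, data = array("q", [0]), array("i"), array("f")
    for doc in corpus:
        for term_id, weight in doc:
            indices.append(term_id)
            data.append(weight)
        indptr.append(len(indices))
    return indptr, indices, data

# Synthetic corpus: 1000 documents with 50 distinct terms each (made-up numbers).
corpus = [[(doc_no * 50 + k, 1 + k % 3) for k in range(50)] for doc_no in range(1000)]

list_bytes = bow_list_bytes(corpus)
packed_bytes = sum(sys.getsizeof(a) for a in to_compressed_columns(corpus))
print("list of tuples:", list_bytes, "bytes")
print("packed arrays: ", packed_bytes, "bytes")
```

The per-entry Python object overhead dominates in the list form, which is why I suspect the sparse matrix is the better format to keep in RAM.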
If I have to use an in-memory list of BOW vectors from the dictionary instead of the scipy
sparse matrix representation for the tfidf transform, will it still pay off to convert the
result to scipy sparse matrix format before passing it on to LSI?
Thanks,
johann