I am having a similar issue on Arch Linux x86_64
(Intel dual core). I am trying to run the
LSI model on the Enron dataset:
http://www.isi.edu/~adibi/Enron/Enron.htm
I use a sample of 5000 emails as the test corpus.
The code is basically copied from the tutorial:
it computes TF-IDF, then attempts to compute LSI.
Just like in the thread you mention above,
the LSI model hangs. First I tried decreasing
the chunks value.
When the chunks value is low (maybe below 5000),
I get an assertion error ("decomposition not initialized yet").
At 10000 or higher, as with the default, it hangs.
If I apply the bugfix you proposed to disable threading
(tested with chunks = 10000), I get the assertion error
instead of the hang. Logs below.
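To make clear what I mean by the chunks value: I am assuming it is the number of documents handed to the model per update step. The sketch below is only my mental model, not gensim's actual code; `iter_chunks` is a hypothetical helper, and the 4999 placeholder documents just mirror the document count in the logs.

```python
from itertools import islice


def iter_chunks(docs, chunk_size):
    """Yield successive lists of at most chunk_size documents from any iterable."""
    it = iter(docs)
    while True:
        chunk = list(islice(it, chunk_size))
        if not chunk:
            break
        yield chunk


# 4999 placeholder documents, matching the document count reported in the logs.
docs = range(4999)

# Number of chunks produced for each chunk size I tried.
chunk_counts = {}
for size in (200, 2000, 10000):
    chunk_counts[size] = sum(1 for _ in iter_chunks(docs, size))

print(chunk_counts)
```

Under that assumption, the two failing sizes (200 and 2000) split my corpus into several chunks, while 10000 and the default put everything into a single chunk.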
How can I fix this?
Kind regards,
Boris Arnoux
Numpy : 1.5.1-2
Scipy : 0.9.0-1
Gensim : 0.7.8
Running on 5000 samples, chunks = default: hangs. Log:
INFO:utils:loading Dictionary object from dict.dict
<generator object identitize at 0x274f500>
INFO:tfidfmodel:collecting document frequencies
INFO:tfidfmodel:PROGRESS: processing document #0
Number of rows returned: 252759
INFO:tfidfmodel:calculating IDF weights for 4999 documents and 19663 features (502886 matrix non-zeros)
Corpus TfIdf ready :
INFO:lsimodel:using serial LSI version on this node
INFO:lsimodel:updating SVD with new documents
Chunks = 200: assertion error. Log:
INFO:utils:loading Dictionary object from dict.dict
<generator object identitize at 0x24fa5f0>
INFO:tfidfmodel:collecting document frequencies
INFO:tfidfmodel:PROGRESS: processing document #0
Number of rows returned: 252759
INFO:tfidfmodel:calculating IDF weights for 4999 documents and 19663 features (502886 matrix non-zeros)
Corpus TfIdf ready :
INFO:lsimodel:using serial LSI version on this node
INFO:lsimodel:updating SVD with new documents
Traceback (most recent call last):
  File "performlsi.py", line 25, in <module>
    corpus_lsi = lsimodel[corpus_tfidf]
  File "/usr/lib/python2.7/site-packages/gensim-0.7.8-py2.7.egg/gensim/models/lsimodel.py", line 400, in __getitem__
    assert self.projection.u is not None, "decomposition not initialized yet"
AssertionError: decomposition not initialized yet
python2.7 performlsi.py  27.19s user 0.74s system 84% cpu 32.864 total
Chunks = 2000: assertion error, same log as for chunks = 200.
Chunks = 10000: hangs, same log as for the default.
Chunks = 10000 + bugfix (numworkers replaced by 0): assertion error. Log:
INFO:utils:loading Dictionary object from dict.dict
<generator object identitize at 0x2dec5f0>
INFO:tfidfmodel:collecting document frequencies
INFO:tfidfmodel:PROGRESS: processing document #0
Number of rows returned: 252759
INFO:tfidfmodel:calculating IDF weights for 4999 documents and 19663 features (502886 matrix non-zeros)
Corpus TfIdf ready :
INFO:lsimodel:using serial LSI version on this node
INFO:lsimodel:updating SVD with new documents
Traceback (most recent call last):
  File "performlsi.py", line 25, in <module>
    corpus_lsi = lsimodel[corpus_tfidf]
  File "/usr/lib/python2.7/site-packages/gensim-0.7.8-py2.7.egg/gensim/models/lsimodel.py", line 402, in __getitem__
    assert self.projection.u is not None, "decomposition not initialized yet"
AssertionError: decomposition not initialized yet