Stuck in the middle of lsimodel

idealogue

May 23, 2011, 4:36:16 PM
to gensim
I'm just trying to get the basic functions outlined in the tutorial
working and having trouble with the lsimodel function

The versions of the relevant modules:
Python - 2.7.1
gensim - 0.7.8
scipy - 0.8.0
numpy - 1.5.1

What seems to be happening is a problem in the call from
lsimodel.addDocuments to utils.chunksize (line 346 in lsimodel, line
375 in utils).
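
For context on the step named above: chunking means reading the document stream in fixed-size batches so that each batch can be processed as one matrix. The sketch below is a generic illustration of that idea in plain Python, not gensim's actual implementation; `chunk_stream` and its `chunk_size` parameter are hypothetical names.

```python
from itertools import islice

def chunk_stream(documents, chunk_size):
    """Yield lists of at most chunk_size documents from any iterable.

    Hypothetical helper for illustration only; not part of gensim.
    """
    iterator = iter(documents)
    while True:
        chunk = list(islice(iterator, chunk_size))
        if not chunk:
            # Stream exhausted: stop instead of looping forever.
            return
        yield chunk

# Nine toy "documents" split into chunks of four.
chunks = list(chunk_stream(range(9), 4))
print([len(c) for c in chunks])
```

A chunker of this kind terminates on any finite input, including a corpus smaller than a single chunk.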

I'm just passing the very simple example corpus of 9 documents and
that seems to be causing an infinite loop. I just want to see this
module in action before trying on a much larger corpus, so I thought
it worth debugging at this point. Here's the script and the logger
output:

dictionary = gensim.corpora.Dictionary.load('/tmp/deerwester.dict')
corpus = gensim.corpora.MmCorpus('/tmp/deerwester.mm')

tfidf = gensim.models.TfidfModel(corpus) # step 1 -- initialize a model
doc_bow = [(0, 1), (1, 1)]
print tfidf[doc_bow]
corpus_tfidf = tfidf[corpus]

lsi = gensim.models.LsiModel(corpus_tfidf, id2word=dictionary, numTopics=2)

INFO:utils:loading Dictionary object from /tmp/deerwester.dict
INFO:matutils:initializing corpus reader from /tmp/deerwester.mm
INFO:matutils:accepted corpus with 9 documents, 12 terms, 28 non-zero entries
INFO:tfidfmodel:collecting document frequencies
INFO:tfidfmodel:PROGRESS: processing document #0
INFO:tfidfmodel:calculating IDF weights for 9 documents and 12 features (28 matrix non-zeros)
[(0, 0.7071067811865476), (1, 0.7071067811865476)]
INFO:lsimodel:using serial LSI version on this node
INFO:lsimodel:updating SVD with new documents
INFO:matutils:constructing sparse document matrix
/usr/lib/python2.7/dist-packages/scipy/sparse/compressed.py:129: UserWarning: indices array has non-integer dtype (float64)
  % self.indices.dtype.name )
INFO:lsimodel:using 100 extra samples and 2 power iterations
INFO:lsimodel:1st phase: constructing (12, 102) action matrix
INFO:lsimodel:orthonormalizing (12, 102) action matrix
INFO:lsimodel:keeping 9 factors (discarding 0.000% of energy spectrum)
INFO:lsimodel:2nd phase: running dense svd on (9, 9) matrix
INFO:lsimodel:computing the final decomposition
INFO:lsimodel:keeping 2 factors (discarding 47.565% of energy spectrum)
INFO:lsimodel:processed documents up to #9
INFO:lsimodel:topic #0(1.594): 0.703*"trees" + 0.538*"graph" + 0.402*"minors" + 0.187*"survey" + 0.061*"system" + 0.060*"response" + 0.060*"time" + 0.058*"user" + 0.049*"computer" + 0.035*"interface"
INFO:lsimodel:topic #1(1.476): 0.460*"system" + 0.373*"user" + 0.332*"eps" + 0.328*"interface" + 0.320*"time" + 0.320*"response" + 0.293*"computer" + 0.280*"human" + 0.171*"survey" + -0.161*"trees"

It just hangs here indefinitely. I've left it for at least 30 minutes
with no further output. I can't imagine the LSI model takes that long
to compute on a 9-document corpus.

Any help on this would be much appreciated.

Thanks!
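
When a script hangs like this, one generic way to see where it is stuck is to print the call stack of every running thread. The sketch below uses only the standard library (`sys._current_frames`, `threading`, `traceback`); the helper names `format_all_thread_stacks` and `dump_stacks_after` are hypothetical and nothing here is specific to gensim.

```python
import sys
import threading
import traceback

def format_all_thread_stacks():
    """Return the current call stack of every running thread as text."""
    names = dict((t.ident, t.name) for t in threading.enumerate())
    lines = []
    for thread_id, frame in sys._current_frames().items():
        lines.append("Thread %s (%s):\n" % (thread_id, names.get(thread_id, "unknown")))
        lines.extend(traceback.format_stack(frame))
    return "".join(lines)

def dump_stacks_after(seconds):
    """Write all thread stacks to stderr after `seconds`, from a daemon timer."""
    timer = threading.Timer(
        seconds, lambda: sys.stderr.write(format_all_thread_stacks()))
    timer.daemon = True  # do not keep the process alive just for the dump
    timer.start()
    return timer

# Immediate dump, to show the report format.
print(format_all_thread_stacks())

# To investigate a hang, arm the timer just before the suspect call, e.g.:
# dump_stacks_after(60)
```

If the dump shows a worker thread blocked on a queue or lock while the main thread waits on it, that would point at the threading code rather than at the SVD computation itself.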

Radim

May 25, 2011, 2:46:52 PM
to gensim
Hello,

another user reported the same problem on his machine,
http://groups.google.com/group/gensim/browse_thread/thread/22ba5dd54578fb93/8c64e694f5a8ebd4?q=

I offered a quick fix in that thread which apparently worked. In the
meantime, I have also rewritten the gensim code so that the problem
shouldn't happen (the newest code is in the `develop` branch on
github, https://github.com/piskvorky/gensim ).

I couldn't reproduce the error on my machine so it's hard to test. If
you can try the latest code and let me know whether that helped, it
would be great!

Cheers,
Radim

Boris Arnoux

Jun 11, 2011, 4:34:53 AM
to gen...@googlegroups.com
Hi,

I am having a similar issue under Arch Linux x86_64 (on an Intel
dual core). I am trying to run the LSI model on the Enron dataset:
http://www.isi.edu/~adibi/Enron/Enron.htm

I use a sample of 5000 emails as a test corpus. The code is basically
copy-pasted from the tutorial: it computes tfidf, then attempts to
compute LSI.

Just like in the thread you mention above, the LSI model hangs.
First I tried decreasing the chunks value.

When the chunks value is low (maybe below 5000), it fails with an
assertion error ("decomposition not initialized yet").

At 10000 or higher, as with the default, it hangs.

If I apply the bugfix you proposed to disable threading,
I get the assertion error. Logs below.

How can I fix this?

Kind regards,
Boris Arnoux


Numpy : 1.5.1-2
Scipy : 0.9.0-1
Gensim 0.7.8


Running on 5000 samples. Chunks = default.
Hangs, log :
INFO:utils:loading Dictionary object from dict.dict
<generator object identitize at 0x274f500>


INFO:tfidfmodel:collecting document frequencies
INFO:tfidfmodel:PROGRESS: processing document #0

Number of rows returned: 252759


INFO:tfidfmodel:calculating IDF weights for 4999 documents and 19663 features (502886 matrix non-zeros)
Corpus TfIdf ready :


INFO:lsimodel:using serial LSI version on this node
INFO:lsimodel:updating SVD with new documents

Chunks = 200 : Assert error
INFO:utils:loading Dictionary object from dict.dict
<generator object identitize at 0x24fa5f0>


INFO:tfidfmodel:collecting document frequencies
INFO:tfidfmodel:PROGRESS: processing document #0

Number of rows returned: 252759
INFO:tfidfmodel:calculating IDF weights for 4999 documents and 19663 features (502886 matrix non-zeros)
Corpus TfIdf ready :


INFO:lsimodel:using serial LSI version on this node
INFO:lsimodel:updating SVD with new documents

Traceback (most recent call last):
  File "performlsi.py", line 25, in <module>
    corpus_lsi = lsimodel[corpus_tfidf]
  File "/usr/lib/python2.7/site-packages/gensim-0.7.8-py2.7.egg/gensim/models/lsimodel.py", line 400, in __getitem__
    assert self.projection.u is not None, "decomposition not initialized yet"
AssertionError: decomposition not initialized yet
python2.7 performlsi.py  27.19s user 0.74s system 84% cpu 32.864 total

Chunks=2000:
Assert error, same log

Chunks=10000:
Hangs, same log as default

Chunks=10000 + bugfix (numworkers replaced by 0):
Assert error

INFO:utils:loading Dictionary object from dict.dict
<generator object identitize at 0x2dec5f0>


INFO:tfidfmodel:collecting document frequencies
INFO:tfidfmodel:PROGRESS: processing document #0

Number of rows returned: 252759


INFO:tfidfmodel:calculating IDF weights for 4999 documents and 19663 features (502886 matrix non-zeros)
Corpus TfIdf ready :


INFO:lsimodel:using serial LSI version on this node
INFO:lsimodel:updating SVD with new documents

Traceback (most recent call last):
  File "performlsi.py", line 25, in <module>
    corpus_lsi = lsimodel[corpus_tfidf]
  File "/usr/lib/python2.7/site-packages/gensim-0.7.8-py2.7.egg/gensim/models/lsimodel.py", line 402, in __getitem__
    assert self.projection.u is not None, "decomposition not initialized yet"
AssertionError: decomposition not initialized yet


Radim

Jun 12, 2011, 12:12:36 PM
to gensim
Whoa, 5 error logs in one report :)

Anyway, it looks like your LSI init (constructor) doesn't work, the
model is not getting initialized. Can you post your code preceding the
LSI init? (=how you create the tfidf corpus).

I suspect the tfidf corpus is empty for some reason.

Best,
Radim
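
One way a corpus can end up empty, offered only as a guess since the poster's code is not shown: the `<generator object ...>` line in the logs above hints that the corpus may be a plain generator. A generator can be iterated only once, so if an earlier pass consumes it (for example, collecting document frequencies), a later pass sees no documents. The sketch below is plain Python with hypothetical names (`stream_docs`, `RestartableCorpus`) and makes no claim about gensim internals.

```python
def stream_docs():
    """Hypothetical one-shot stream of bag-of-words documents."""
    yield [(0, 1), (1, 1)]
    yield [(1, 2), (2, 1)]

class RestartableCorpus(object):
    """Hypothetical corpus wrapper: every iteration starts a fresh stream."""
    def __iter__(self):
        return stream_docs()

one_shot = stream_docs()
first_pass = sum(1 for _ in one_shot)    # consumes the generator
second_pass = sum(1 for _ in one_shot)   # nothing left to read

corpus = RestartableCorpus()
restart_first = sum(1 for _ in corpus)   # fresh stream
restart_second = sum(1 for _ in corpus)  # fresh stream again

print("one-shot generator: %d then %d documents" % (first_pass, second_pass))
print("restartable corpus: %d then %d documents" % (restart_first, restart_second))
```

The one-shot generator yields 2 documents on the first pass and 0 on the second, while the wrapper yields 2 both times. Counting the documents in the tfidf corpus right before the LSI step is a quick way to check whether this applies.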

Xolve

Jun 18, 2011, 7:39:26 AM
to gen...@googlegroups.com
I am also receiving the same error. Here is my code: http://pastebin.com/isD9QaFz