Hi all
After upgrading gensim, it stalls when trying to do LSI. The corpus
and the dictionary seem to be created just fine, and the tfidf
model and the corpus transformation also seem to be created as they
should. During these steps the processor is working hard, but as soon
as I try to create the LSI model the processor goes idle, while the
system stays fully responsive.
The code that I'm using:
*******************************************************************************************
dirs = ["C:\\Users\\Ironman\\Documents\\DTU\\00000 Master thesis\\Corpora\\oxfordandgut\\smallcorpusprocstripped.txt",
        "C:\\Users\\Ironman\\Documents\\DTU\\00000 Master thesis\\Corpora\\oxfordandgut\\largecorpusprocstripped.txt"]
dictdir = 'C:\\Users\\Ironman\\Documents\\DTU\\00000 Master thesis\\Corpora\\oxfordandgut\\oxfordandgutdictionary.dict'
corpusdir = 'C:\\Users\\Ironman\\Documents\\DTU\\00000 Master thesis\\Corpora\\oxfordandgut\\oxfordandgutcorpus.mm'

class MyCorpus(object):
    def __iter__(self):
        for dir in dirs:
            for line in open(dir):
                # assume there's one document per line, tokens separated by whitespace
                yield dictionary.doc2bow(line.lower().split())

import os, datetime
import logging
logging.root.setLevel(logging.INFO)
import gensim
from gensim import corpora, similarities, models

for dir in dirs:
    dictionary = corpora.Dictionary(line.lower().split() for line in open(dir))

# remove stop words and words that appear only once
once_ids = [tokenid for tokenid, docfreq in dictionary.dfs.iteritems() if docfreq == 1]
dictionary.filterTokens(once_ids) # remove stop words and words that appear only once
dictionary.compactify() # remove gaps in id sequence after words that were removed
print dictionary

corpus = MyCorpus()
tfidf = models.TfidfModel(corpus)
print 'created tfidf:' +" "+str(datetime.datetime.now())
corpus_tfidf = tfidf[corpus]
print 'created corpus_tfidf:' +" "+str(datetime.datetime.now())
lsi = models.LsiModel(corpus_tfidf, id2word=dictionary, numTopics=50)
print 'created lsi d=50:' +" "+str(datetime.datetime.now())
lsi.save('C:\\Users\\Ironman\\Documents\\DTU\\00000 Master thesis\\Corpora\\oxfordandgut\\50\\model.lsi')
***************************************************************************************************
And this is the output from the Python shell:
***************************************************************************************************
>>> import lsitest
INFO:root:adding document #0 to Dictionary(0 unique tokens)
INFO:dictionary:built Dictionary(53927 unique tokens) from 1427 documents (total 1801331 corpus positions)
INFO:root:adding document #0 to Dictionary(0 unique tokens)
INFO:dictionary:built Dictionary(120568 unique tokens) from 2790 documents (total 4784707 corpus positions)
Dictionary(59782 unique tokens)
INFO:tfidfmodel:collecting document frequencies
INFO:tfidfmodel:PROGRESS: processing document #0
INFO:tfidfmodel:calculating IDF weights for 4217 documents and 59782 features (2370246 matrix non-zeros)
created tfidf: 2011-04-06 19:33:43.948000
created corpus_tfidf: 2011-04-06 19:33:43.952000
INFO:lsimodel:using serial LSI version on this node
INFO:lsimodel:updating SVD with new documents
***************************************************************************************************
This is where it just stops: the processor goes idle and the job
never finishes, even though I've let it run for an eternity (that
might be a bit drastic, but still ;)
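
One way to narrow this down might be to time a full pass over the corpus stream on its own, before the LSI step, to see whether the iteration itself completes. The following is only a minimal sketch under that assumption: `time_full_pass` and `LineCorpus` are hypothetical names for illustration, not part of gensim, and `LineCorpus` is a plain stand-in for `MyCorpus` that yields token lists rather than bag-of-words vectors.

```python
import time


def time_full_pass(corpus):
    """Iterate over `corpus` once; return (document count, elapsed seconds)."""
    start = time.time()
    n_docs = 0
    for _ in corpus:
        n_docs += 1
    return n_docs, time.time() - start


class LineCorpus(object):
    """Hypothetical stand-in for MyCorpus: one document per line,
    tokens separated by whitespace. Re-opens the files on every pass,
    so it can be iterated more than once."""

    def __init__(self, paths):
        self.paths = paths

    def __iter__(self):
        for path in self.paths:
            with open(path) as handle:
                for line in handle:
                    yield line.lower().split()
```

In the script above, the same helper could be called as `time_full_pass(corpus_tfidf)` just before `models.LsiModel(...)`. If the pass finishes and reports the 4217 documents seen in the tfidf log line, the stall would more likely be in the LSI step itself than in reading the files.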
Hope someone can help me with this
Kind Regards
Jens