I tried to run the following code on my desktop (Windows 7, Core i7 with 12GB of memory):
import logging, sys, pprint
import os.path
import numpy as np
from gensim import corpora, matutils
from gensim.parsing.preprocessing import preprocess_documents, preprocess_string
from gensim.models import LdaModel, TfidfModel
bg_corpus_file = 'wikipedia_background.cor'
class MyCorpus(object):
    def __iter__(self):
        for line in open(bg_corpus_file):
            yield dictionary.doc2bow(preprocess_string(line))
print "collecting statistics about all tokens..."
dictionary = corpora.Dictionary(preprocess_string(line) for line in open(bg_corpus_file))
print "preprocessing and converting the background corpus to BOW..."
bg_corpus = MyCorpus()
print "weighting the background corpus according to TFIDF schema..."
tfidf = TfidfModel(bg_corpus)
print "training LDA model..."
lda = LdaModel(corpus=tfidf[bg_corpus], id2word=dictionary, num_topics=300, update_every=0, passes=20)
lda.save("lda.model")
#t1 = "information retrieval"
#v1 = lda[dictionary.doc2bow(preprocess_string(t1))]
#t2 = "topic modelling"
#v2 = lda[dictionary.doc2bow(preprocess_string(t2))]
#print(matutils.cossim(v1, v2))
print "DONE"
The run was terminated after 3.5 days.
Is it possible to estimate the remaining time while it is running?
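For context on the remaining-time question: gensim reports training progress through Python's standard logging module, and the script above imports logging but never configures it, so no progress messages would have been shown. The sketch below enables INFO-level logging and adds a simple linear ETA. The names estimate_remaining_seconds and TimedCorpus are hypothetical helpers, not part of gensim, and the estimate covers a single pass over the corpus under the assumption that every document costs about the same time.

```python
import logging
import time

# gensim emits its progress messages at INFO level, so logging has to be
# configured before training starts for them to become visible.
logging.basicConfig(
    format="%(asctime)s : %(levelname)s : %(message)s",
    level=logging.INFO,
)


def estimate_remaining_seconds(docs_done, docs_total, elapsed_seconds):
    """Linear extrapolation; assumes a constant cost per document."""
    if docs_done <= 0:
        raise ValueError("need at least one processed document to extrapolate")
    return elapsed_seconds * (docs_total - docs_done) / float(docs_done)


class TimedCorpus(object):
    """Hypothetical wrapper (not a gensim class) that logs an ETA per pass."""

    def __init__(self, corpus, docs_total, report_every=10000):
        self.corpus = corpus
        self.docs_total = docs_total      # must be known or counted beforehand
        self.report_every = report_every

    def __iter__(self):
        start = time.time()
        for done, doc in enumerate(self.corpus, 1):
            yield doc
            if done % self.report_every == 0:
                eta = estimate_remaining_seconds(
                    done, self.docs_total, time.time() - start)
                logging.info("%d/%d documents, about %.0f s left in this pass",
                             done, self.docs_total, eta)
```

Wrapping the streamed corpus, for example TimedCorpus(bg_corpus, docs_total), would log an estimate every report_every documents. With passes=20 the total is roughly the per-pass time multiplied by the number of passes, plus the extra pass that TfidfModel makes over the corpus.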
The computation behaves in an interesting way: memory usage repeatedly
climbs to 11-12GB, drops to 6-7GB while some calculation runs,
and then climbs to 11GB again.
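A rough back-of-envelope for the memory figures, assuming the model holds at least one dense float64 matrix of shape num_topics x vocabulary size (this is an assumption about the internals, not something measured in this run). num_topics=300 comes from the script; the vocabulary size below is a made-up placeholder, since the script never prints len(dictionary) and does not prune the dictionary.

```python
def dense_matrix_bytes(num_topics, num_terms, bytes_per_value=8):
    """Bytes needed for one dense num_topics x num_terms float64 matrix."""
    return num_topics * num_terms * bytes_per_value


# 300 topics as in the script; 1,000,000 terms is a hypothetical vocabulary size.
num_terms = 1000000
size_gb = dense_matrix_bytes(300, num_terms) / float(1024 ** 3)
print("one topic-term matrix: %.2f GB" % size_gb)
```

If several matrices of this shape are alive at the same time (for example the model state plus temporary copies during an update), a peak of a few multiples of this figure would be plausible, which may be consistent with the swings described above. Replacing the placeholder with the real len(dictionary) would give a more meaningful number.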
I want to run a unified set of experiments that tests the performance of
many models over many corpora, in order to answer several different
questions. That will be infeasible if each training run is this
time-consuming. Can you help me with this issue?
Thanks.
Amir