LDA running time

Amir H. Jadidinejad

Aug 11, 2014, 11:00:15 AM
to gen...@googlegroups.com
I tried to run the following code on my desktop (Windows 7, Core i7, 12 GB of memory):

import logging, sys, pprint
import os.path

import numpy as np

from gensim import corpora, matutils
from gensim.parsing.preprocessing import preprocess_documents, preprocess_string
from gensim.models import LdaModel, TfidfModel

bg_corpus_file = 'wikipedia_background.cor'

class MyCorpus(object):
    def __iter__(self):
        for line in open(bg_corpus_file):
            yield dictionary.doc2bow(preprocess_string(line))

print("collecting statistics about all tokens...")
dictionary = corpora.Dictionary(preprocess_string(line) for line in open(bg_corpus_file))

print("preprocessing and converting the background corpus to BOW...")
bg_corpus = MyCorpus()

print("weighting the background corpus according to TFIDF schema...")
tfidf = TfidfModel(bg_corpus)

print("training LDA model...")
lda = LdaModel(corpus=tfidf[bg_corpus], id2word=dictionary, num_topics=300, update_every=0, passes=20)

lda.save("lda.model")

#t1 = "information retrieval"
#v1 = lda[dictionary.doc2bow(preprocess_string(t1))]

#t2 = "topic modelling"
#v2 = lda[dictionary.doc2bow(preprocess_string(t2))]

#print(matutils.cossim(v1, v2))

print("DONE")

The run was terminated after 3.5 days.
Is there a way to tell how much time was remaining?

The memory usage during the computation is interesting:
it repeatedly climbs to 11-12 GB, drops to 6-7 GB while some calculation runs, and then climbs back to 11 GB.


I want to run a unified set of experiments that tests the performance of many models over many corpora to answer different questions. A time-consuming training process would make that prohibitive. Can you help me with this issue?

Thanks.
Amir

Radim Řehůřek

Aug 11, 2014, 3:42:13 PM
to gen...@googlegroups.com
Hello Amir,

thanks for using the mailing list :)

Can you post the log (at INFO level, at least) of your run?

Re. memory: I think that fluctuation could be due to temporary matrices needed during each training step. Things should be clearer once I see the log (which contains the number of terms, timings, etc.).

The training is predictably linear, so again, we'll know immediately how far we got once we see the log.
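
Because the training time grows linearly, the remaining time can be extrapolated from the progress made so far. A minimal sketch, assuming progress is counted in passes over the corpus (the posted code uses passes=20); the helper estimate_remaining_seconds and the example numbers are hypothetical and not part of gensim:

```python
def estimate_remaining_seconds(elapsed_seconds, units_done, units_total):
    """Linearly extrapolate the remaining run time from progress so far.

    A "unit" is any work item of roughly equal cost, e.g. one pass over the corpus.
    """
    if units_done <= 0:
        raise ValueError("need at least one completed unit to extrapolate")
    if units_done > units_total:
        raise ValueError("units_done cannot exceed units_total")
    seconds_per_unit = elapsed_seconds / float(units_done)
    return seconds_per_unit * (units_total - units_done)


# Hypothetical numbers: 5 of 20 passes finished after 10 hours.
remaining = estimate_remaining_seconds(10 * 3600, 5, 20)
print("estimated remaining: %.1f hours" % (remaining / 3600.0))
```

The elapsed time and the number of finished passes would have to come from the timestamps in the log, which is why the log is needed first.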

Cheers,
Radim

Amir H. Jadidinejad

Aug 11, 2014, 5:01:42 PM
to gen...@googlegroups.com
Do I have to enable logging with the following code and run it again?
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)

Or is the log for the current session stored somewhere on my machine?
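
For context: Python's standard logging module writes nothing to disk unless a handler is configured, so a run started without such a setup leaves no log file behind. A minimal sketch of a setup that keeps the INFO log in a file as well as on the console for the next run; the helper configure_logging and the file name lda_training.log are hypothetical choices for illustration:

```python
import logging


def configure_logging(log_path):
    """Send INFO-level messages from all loggers to both the console and a file."""
    formatter = logging.Formatter('%(asctime)s : %(levelname)s : %(message)s')
    root = logging.getLogger()
    root.setLevel(logging.INFO)
    for handler in (logging.StreamHandler(), logging.FileHandler(log_path)):
        handler.setFormatter(formatter)
        root.addHandler(handler)
    return root


# Call this once, before building the dictionary and training the model.
configure_logging('lda_training.log')  # hypothetical file name
logging.info("logging configured; INFO messages now also go to the log file")
```

Library loggers pass their messages up to the root logger by default, so handlers attached there also capture messages emitted during training.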