extremely slow lda?


Jason

Apr 29, 2013, 3:58:06 PM
to gensim
I've run LDA on a few different corpora with great results, but when
I run it on a large dataset, it is too slow. I suspect I am doing
something wrong, because the wiki page says:
"Creating this LDA model of Wikipedia takes about 6 hours and 20
minutes on my laptop"
My corpus has 1.5 million documents, with a mean of 304.95 words per
document and a standard deviation of 1314.35.
Processing the whole dataset, from creating the dictionary to running
the model, took 10 days on an AWS m1.large box (I think that is
equivalent to ~2 Intel cores).
My questions: am I doing something inherently wrong? I have not
tried the distributed version yet; are my times expected? Are my
modeling options set incorrectly? Is there anything else I can do to
speed this up? My main reason for wanting a speed-up is that I am
still testing and tweaking settings, and 10 days between tests
would be too painful.

I've included the relevant code.



import json
import re

import gensim
from gensim import corpora


def prepare_doc(text):
    """Lowercase, strip URLs, @mentions and markup, drop stopwords, stem."""
    text = text.lower()
    # remove each URL together with the rest of its line
    text = re.sub(r'https?:\/\/.*[\r\n]*', '', text, flags=re.MULTILINE)
    # replace @mentions with a space
    text = re.sub(r'@\w+', ' ', text, flags=re.MULTILINE)
    text = gensim.parsing.preprocessing.strip_tags(text)
    text = gensim.parsing.preprocessing.strip_punctuation(text)
    text = gensim.parsing.preprocessing.remove_stopwords(text)
    text = gensim.parsing.preprocessing.strip_short(text)
    text = gensim.parsing.preprocessing.strip_numeric(text)
    text = gensim.parsing.preprocessing.stem_text(text)
    return text


# corpus loader: streams one bag-of-words vector per input line
class MyCorpus(object):
    def __init__(self, docs_file, dictionary):
        self.docs_file = docs_file
        self.dictionary = dictionary

    def __iter__(self):
        # one JSON object per line; its 'doc' field holds
        # whitespace-separated tokens
        with open(self.docs_file) as f:
            for line in f:
                yield self.dictionary.doc2bow(json.loads(line)['doc'].split())


# docs_file and num_topics are defined elsewhere (not shown)
dictionary = corpora.Dictionary()
with open(docs_file) as f:
    for line in f:
        dictionary.add_documents([json.loads(line)['doc'].split()])
dictionary.filter_extremes(no_below=5, no_above=0.5, keep_n=100000)

corpus = MyCorpus(docs_file, dictionary)
model = gensim.models.ldamodel.LdaModel(corpus, id2word=dictionary,
                                        num_topics=num_topics, chunksize=100)
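
A side note on the streamed corpus above: MyCorpus defines no __len__,
so any code that needs the number of documents has to iterate over the
whole file just to count them. Below is a minimal sketch of one way to
avoid repeating that pass, assuming the same layout of one JSON object
per line with a 'doc' field. The class name CountedJsonLines and its
caching attribute are hypothetical, and nothing in this thread shows
how much of the total run time this would save.

```python
import json


class CountedJsonLines:
    """Stream the 'doc' field of a JSON-lines file and cache its length.

    Hypothetical helper for illustration; not part of gensim.
    """

    def __init__(self, path):
        self.path = path
        self._length = None  # filled in lazily on the first len() call

    def __iter__(self):
        # Yield one list of whitespace-separated tokens per input line.
        with open(self.path, encoding="utf-8") as handle:
            for line in handle:
                yield json.loads(line)["doc"].split()

    def __len__(self):
        # Count lines once, then reuse the cached value.
        if self._length is None:
            with open(self.path, encoding="utf-8") as handle:
                self._length = sum(1 for _ in handle)
        return self._length
```

A wrapper like this still reads the file once to count it; the point is
only that the count is computed a single time and can then be reused.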

Radim Řehůřek

Apr 29, 2013, 5:41:51 PM
to gensim
Hello Jason,

10 days is definitely not normal!

I don't see anything obviously wrong with your code... which part is
slow? Is it the last line, `model = LdaModel(...)`? How big is the
`docs_file` file?

Posting the full log (incl. timestamps) of the first few iterations
should make things clearer :)

-rr
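
For readers who want to capture a timestamped log like the one
requested here: gensim reports progress through Python's standard
logging module, so configuring the root logger before training is
enough. This is a minimal sketch; the format string mirrors the
"timestamp : LEVEL : message" layout, while the helper name
enable_timestamped_logging and the "demo" logger are hypothetical.

```python
import logging

# Timestamp, level, and message separated by " : ".
LOG_FORMAT = "%(asctime)s : %(levelname)s : %(message)s"


def enable_timestamped_logging(level=logging.INFO):
    """Send messages at `level` and above to stderr with timestamps."""
    # basicConfig does nothing if the root logger already has handlers.
    logging.basicConfig(format=LOG_FORMAT, level=level)


enable_timestamped_logging()
logging.getLogger("demo").info("starting LDA training")
```

Calling the helper once at the top of the script, before building the
dictionary or the model, makes every later INFO message carry a
timestamp, which is what allows slow steps to be located.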

Jason

Apr 30, 2013, 11:47:06 AM
to gensim
Radim, yes, creating the model is the slow part. Here is the
output; let me know if you need more information:

2013-04-30 15:10:07,633 : INFO : loading Dictionary object from
lists.dict
2013-04-30 15:10:18,148 : INFO : keeping 100000 tokens which were in
no less than 5 and no more than 796730 (=50.0%) documents
2013-04-30 15:10:21,382 : INFO : resulting dictionary:
Dictionary(100000 unique tokens)
Dictionary(100000 unique tokens)
2013-04-30 15:10:21,395 : INFO : using serial LDA version on this node
2013-04-30 15:10:41,103 : WARNING : input corpus stream has no len();
counting documents

2013-04-30 15:32:16,315 : INFO : running online LDA training, 350
topics, 1 passes over the supplied corpus of 1593460 documents,
updating model once every 100 documents
2013-04-30 15:32:16,985 : INFO : PROGRESS: iteration 0, at document
#100/1593460
2013-04-30 15:32:20,702 : INFO : 23/100 documents converged within 50
iterations
2013-04-30 15:32:24,548 : INFO : merging changes from 100 documents
into a model of 1593460 documents
2013-04-30 15:32:49,080 : INFO : topic #0: 0.090*footbal +
0.039*player + 0.020*fulham + 0.014*que + 0.012*sport + 0.012*tokio +
0.010*leagu + 0.009*premier + 0.008*nois + 0.008*hotel
2013-04-30 15:32:49,751 : INFO : topic #1: 0.141*garden + 0.035*writer
+ 0.030*organ + 0.027*health + 0.026*food + 0.025*farm + 0.025*herbal
+ 0.017*educ + 0.017*foodi + 0.016*relat
2013-04-30 15:32:50,428 : INFO : topic #2: 0.195*footbal +
0.096*player + 0.055*game + 0.049*fulham + 0.025*indi + 0.021*sport +
0.020*new + 0.018*tweet + 0.016*soccer + 0.014*leagu
2013-04-30 15:32:51,157 : INFO : topic #3: 0.021*belieb + 0.017*love +
0.017*team + 0.013*bieber + 0.012*teamfollowback + 0.011*cool +
0.009*sexi + 0.008*ladi + 0.007*justin + 0.007*beauti
2013-04-30 15:32:51,839 : INFO : topic #4: 0.013*pra + 0.012*que +
0.012*tokita + 0.011*tokio + 0.010*nois + 0.009*hotel + 0.009*lista +
0.008*meu + 0.008*love + 0.007*melhor
2013-04-30 15:32:52,569 : INFO : topic #5: 0.024*mtb + 0.019*bike +
0.016*sexi + 0.014*love + 0.013*team + 0.013*teamfollowback +
0.013*lista + 0.010*best + 0.009*friend + 0.009*ladi
2013-04-30 15:32:53,236 : INFO : topic #6: 0.137*footbal +
0.068*player + 0.033*fulham + 0.024*sport + 0.023*epub + 0.023*goro +
0.022*creat + 0.021*automat + 0.019*sfbayarea + 0.018*timelin
2013-04-30 15:32:53,905 : INFO : topic #7: 0.124*option + 0.070*trader
+ 0.057*market + 0.052*trade + 0.051*stock + 0.046*best + 0.030*financ
+ 0.020*futur + 0.017*financi + 0.015*sentiment
2013-04-30 15:32:54,573 : INFO : topic #8: 0.000*gak + 0.000*vemevetv
+ 0.000*alic + 0.000*worldalert + 0.000*alif + 0.000*businesstool +
0.000*obvious + 0.000*sugarpop + 0.000*phriend + 0.000*onetimegirl
2013-04-30 15:32:55,291 : INFO : topic #9: 0.134*lista + 0.032*listei
+ 0.019*segu + 0.014*amigo + 0.014*listado + 0.014*melhor +
0.013*friend + 0.013*pessoa + 0.012*que + 0.011*listando
2013-04-30 15:32:55,959 : INFO : topic #10: 0.081*sledui +
0.060*friend + 0.043*talk + 0.036*best + 0.023*new + 0.017*updat +
0.016*self + 0.016*ukrain + 0.015*interest + 0.014*like
2013-04-30 15:32:56,626 : INFO : topic #11: 0.034*lista + 0.026*pessoa
+ 0.022*listei + 0.022*listado + 0.019*amigo + 0.016*friend +
0.015*que + 0.014*onlin + 0.012*tedio + 0.012*seguir
2013-04-30 15:32:57,352 : INFO : topic #12: 0.000*gak + 0.000*vemevetv
+ 0.000*alic + 0.000*worldalert + 0.000*alif + 0.000*businesstool +
0.000*obvious + 0.000*sugarpop + 0.000*phriend + 0.000*onetimegirl
2013-04-30 15:32:58,019 : INFO : topic #13: 0.102*cyclist +
0.072*bicycl + 0.062*roadbik + 0.041*fan + 0.037*cycl + 0.032*bike +
0.029*road + 0.028*osaka + 0.025*?転? + 0.021*friend
2013-04-30 15:32:58,703 : INFO : topic #14: 0.000*gak + 0.000*vemevetv
+ 0.000*alic + 0.000*worldalert + 0.000*alif + 0.000*businesstool +
0.000*obvious + 0.000*sugarpop + 0.000*phriend + 0.000*onetimegirl
2013-04-30 15:32:59,433 : INFO : topic diff=346.935929, rho=1.000000
2013-04-30 15:33:00,385 : INFO : PROGRESS: iteration 0, at document
#200/1593460
2013-04-30 15:33:19,932 : INFO : 81/100 documents converged within 50
iterations
2013-04-30 15:33:23,231 : INFO : merging changes from 100 documents
into a model of 1593460 documents
2013-04-30 15:33:47,695 : INFO : topic #0: 0.090*footbal +
0.038*player + 0.020*fulham + 0.014*que + 0.012*sport + 0.012*tokio +
0.010*leagu + 0.009*premier + 0.008*nois + 0.008*hotel
2013-04-30 15:33:48,366 : INFO : topic #1: 0.096*writer + 0.073*write
+ 0.058*green + 0.044*health + 0.037*garden + 0.030*educ + 0.029*vip +
0.026*eco + 0.025*famili + 0.022*teacher
2013-04-30 15:33:49,097 : INFO : topic #2: 0.197*celeb + 0.108*footbal
+ 0.065*celebr + 0.047*soccer + 0.046*new + 0.045*industri +
0.026*player + 0.018*pro + 0.018*favourit + 0.017*tweep
2013-04-30 15:33:49,768 : INFO : topic #3: 0.052*belieb + 0.046*want +
0.037*golden + 0.035*dat + 0.035*mean + 0.035*lovelovelov +
0.035*welovebieb + 0.035*recruit + 0.033*youu + 0.033*coolest
2013-04-30 15:33:50,439 : INFO : topic #4: 0.013*pra + 0.012*que +
0.012*tokita + 0.011*tokio + 0.010*nois + 0.009*hotel + 0.009*lista +
0.008*meu + 0.008*love + 0.007*melhor
2013-04-30 15:33:51,116 : INFO : topic #5: 0.023*mtb + 0.018*bike +
0.016*love + 0.016*sexi + 0.014*legend + 0.013*team + 0.013*friend +
0.012*teamfollowback + 0.012*lista + 0.011*beauti
2013-04-30 15:33:51,847 : INFO : topic #6: 0.399*tech + 0.053*geek +
0.038*account + 0.026*new + 0.024*program + 0.018*brave + 0.018*tweet
+ 0.018*stuff + 0.018*gener + 0.017*present
2013-04-30 15:33:52,518 : INFO : topic #7: 0.055*market +
0.053*partner + 0.047*account + 0.045*sourc + 0.044*latest +
0.041*best + 0.038*option + 0.036*expert + 0.036*invest + 0.036*beat
2013-04-30 15:33:53,188 : INFO : topic #8: 0.000*gak + 0.000*vemevetv
+ 0.000*alic + 0.000*worldalert + 0.000*alif + 0.000*businesstool +
0.000*obvious + 0.000*sugarpop + 0.000*phriend + 0.000*onetimegirl
2013-04-30 15:33:53,859 : INFO : topic #9: 0.314*publicidad +
0.204*lista + 0.019*listando + 0.015*listei + 0.014*amigo + 0.012*que
+ 0.012*lindo + 0.012*obrigado + 0.010*love + 0.009*amo
2013-04-30 15:33:54,590 : INFO : topic #10: 0.118*friend +
0.046*celebr + 0.039*updat + 0.039*sledui + 0.033*talk + 0.032*tweep +
0.025*new + 0.025*daili + 0.023*rebuilt + 0.022*dynam
2013-04-30 15:33:55,262 : INFO : topic #11: 0.252*onlin + 0.048*com +
0.032*bb + 0.026*friendli + 0.019*lista + 0.017*friend + 0.016*water +
0.016*best + 0.015*pessoa + 0.013*listei
2013-04-30 15:33:55,933 : INFO : topic #12: 0.000*gak + 0.000*vemevetv
+ 0.000*alic + 0.000*worldalert + 0.000*alif + 0.000*businesstool +
0.000*obvious + 0.000*sugarpop + 0.000*phriend + 0.000*onetimegirl
2013-04-30 15:33:56,665 : INFO : topic #13: 0.127*you + 0.104*photo +
0.054*fan + 0.053*cyclist + 0.037*bicycl + 0.033*philippin +
0.032*birthdai + 0.027*thank + 0.026*daili + 0.024*celebr
2013-04-30 15:33:57,336 : INFO : topic #14: 0.000*gak + 0.000*vemevetv
+ 0.000*alic + 0.000*worldalert + 0.000*alif + 0.000*businesstool +
0.000*obvious + 0.000*sugarpop + 0.000*phriend + 0.000*onetimegirl
2013-04-30 15:33:58,141 : INFO : topic diff=0.794075, rho=0.707107
2013-04-30 15:33:59,038 : INFO : PROGRESS: iteration 0, at document
#300/1593460
2013-04-30 15:34:13,367 : INFO : 99/100 documents converged within 50
iterations
2013-04-30 15:34:16,660 : INFO : merging changes from 100 documents
into a model of 1593460 documents
2013-04-30 15:34:41,007 : INFO : topic #0: 0.090*footbal +
0.038*player + 0.020*fulham + 0.013*que + 0.012*sport + 0.012*tokio +
0.010*leagu + 0.009*premier + 0.008*nois + 0.008*hotel
2013-04-30 15:34:41,766 : INFO : topic #1: 0.114*writer + 0.049*health
+ 0.044*write + 0.034*resourc + 0.033*relat + 0.032*unschool +
0.031*folk + 0.028*educ + 0.024*green + 0.024*famili
2013-04-30 15:34:42,463 : INFO : topic #2: 0.112*journo + 0.097*new +
0.085*celeb + 0.064*citi + 0.053*journalist + 0.040*footbal +
0.034*media + 0.033*celebr + 0.030*class + 0.029*tweet
2013-04-30 15:34:43,229 : INFO : topic #3: 0.367*auto + 0.098*recruit
+ 0.050*want + 0.029*golden + 0.027*youu + 0.026*somo + 0.021*love +
0.019*worldwid + 0.018*peep + 0.017*belieb
2013-04-30 15:34:43,926 : INFO : topic #4: 0.013*pra + 0.012*que +
0.012*tokita + 0.011*tokio + 0.010*nois + 0.009*hotel + 0.009*lista +
0.008*meu + 0.008*love + 0.007*melhor
2013-04-30 15:34:44,685 : INFO : topic #5: 0.032*legend + 0.020*mtb +
0.017*teamfollowback + 0.016*gorgeou + 0.016*bike + 0.014*love +
0.014*sexi + 0.013*team + 0.011*cuti + 0.011*friend
2013-04-30 15:34:45,382 : INFO : topic #6: 0.370*tech + 0.058*geek +
0.054*program + 0.038*new + 0.036*present + 0.032*account +
0.023*gener + 0.020*tweet + 0.017*stuff + 0.016*brave
2013-04-30 15:34:46,140 : INFO : topic #7: 0.121*account + 0.084*vip +
0.070*partner + 0.065*beat + 0.065*latest + 0.060*expert + 0.041*new +
0.034*blog + 0.030*site + 0.028*fellow
2013-04-30 15:34:46,838 : INFO : topic #8: 0.000*gak + 0.000*vemevetv
+ 0.000*alic + 0.000*worldalert + 0.000*alif + 0.000*businesstool +
0.000*obvious + 0.000*sugarpop + 0.000*phriend + 0.000*onetimegirl
2013-04-30 15:34:47,597 : INFO : topic #9: 0.326*publicidad +
0.191*lista + 0.018*listando + 0.015*amigo + 0.014*listei + 0.013*que
+ 0.011*lindo + 0.011*obrigado + 0.010*love + 0.009*friend
2013-04-30 15:34:48,294 : INFO : topic #10: 0.111*friend + 0.071*dynam
+ 0.060*rebuilt + 0.059*daili + 0.052*talk + 0.045*conversationlist +
0.037*new + 0.028*convers + 0.025*import + 0.022*best
2013-04-30 15:34:49,052 : INFO : topic #11: 0.266*onlin + 0.132*com +
0.047*que + 0.035*chocol + 0.034*msn + 0.034*independ + 0.022*cool +
0.018*bb + 0.018*friend + 0.015*www
2013-04-30 15:34:49,749 : INFO : topic #12: 0.000*gak + 0.000*vemevetv
+ 0.000*alic + 0.000*worldalert + 0.000*alif + 0.000*businesstool +
0.000*obvious + 0.000*sugarpop + 0.000*phriend + 0.000*onetimegirl
2013-04-30 15:34:50,507 : INFO : topic #13: 0.180*photo + 0.110*you +
0.097*ing + 0.049*fan + 0.038*cyclist + 0.033*children + 0.026*bicycl
+ 0.023*friend + 0.023*philippin + 0.023*birthdai
2013-04-30 15:34:51,206 : INFO : topic #14: 0.000*gak + 0.000*vemevetv
+ 0.000*alic + 0.000*worldalert + 0.000*alif + 0.000*businesstool +
0.000*obvious + 0.000*sugarpop + 0.000*phriend + 0.000*onetimegirl
2013-04-30 15:34:52,040 : INFO : topic diff=0.507097, rho=0.577350
2013-04-30 15:34:53,013 : INFO : PROGRESS: iteration 0, at document
#400/1593460
2013-04-30 15:35:12,471 : INFO : 91/100 documents converged within 50
iterations
2013-04-30 15:35:15,762 : INFO : merging changes from 100 documents
into a model of 1593460 documents
2013-04-30 15:35:40,234 : INFO : topic #0: 0.089*footbal +
0.038*player + 0.020*fulham + 0.013*que + 0.012*sport + 0.011*tokio +
0.010*leagu + 0.009*premier + 0.008*nois + 0.008*hotel
2013-04-30 15:35:40,908 : INFO : topic #1: 0.192*writer + 0.046*folk +
0.037*relat + 0.035*resourc + 0.035*write + 0.028*advic + 0.025*health
+ 0.025*pet + 0.020*blogger + 0.020*educ
2013-04-30 15:35:41,577 : INFO : topic #2: 0.177*journo + 0.108*new +
0.074*favourit + 0.063*journalist + 0.048*media + 0.047*citi +
0.038*celeb + 0.029*tweet + 0.020*commentari + 0.020*class
2013-04-30 15:35:42,306 : INFO : topic #3: 0.246*auto + 0.152*recruit
+ 0.071*want + 0.062*coolest + 0.035*worldwid + 0.034*peep +
0.026*welovebieb + 0.019*golden + 0.019*love + 0.018*youu
2013-04-30 15:35:42,976 : INFO : topic #4: 0.013*pra + 0.012*que +
0.012*tokita + 0.011*tokio + 0.010*nois + 0.009*hotel + 0.009*lista +
0.008*meu + 0.008*love + 0.007*melhor
2013-04-30 15:35:43,655 : INFO : topic #5: 0.026*intern + 0.022*athlet
+ 0.022*love + 0.022*teamtwist + 0.019*teamfollowback + 0.019*team +
0.017*sexi + 0.016*cool + 0.015*best + 0.015*gorgeou
2013-04-30 15:35:44,325 : INFO : topic #6: 0.367*tech + 0.058*geek +
0.055*program + 0.038*present + 0.034*new + 0.032*account +
0.027*employe + 0.021*tweet + 0.017*profession + 0.016*gener
2013-04-30 15:35:45,056 : INFO : topic #7: 0.121*account +
0.093*latest + 0.085*vip + 0.079*partner + 0.056*beat + 0.055*expert +
0.045*futur + 0.037*new + 0.031*up + 0.031*financ
2013-04-30 15:35:45,727 : INFO : topic #8: 0.000*gak + 0.000*vemevetv
+ 0.000*alic + 0.000*worldalert + 0.000*alif + 0.000*businesstool +
0.000*obvious + 0.000*sugarpop + 0.000*phriend + 0.000*onetimegirl
2013-04-30 15:35:46,396 : INFO : topic #9: 0.387*publicidad +
0.169*lista + 0.018*handl + 0.016*listando + 0.013*amigo +
0.013*listei + 0.011*que + 0.010*lindo + 0.010*obrigado + 0.009*love
2013-04-30 15:35:47,068 : INFO : topic #10: 0.086*daili + 0.079*friend
+ 0.078*dynam + 0.074*conversationlist + 0.071*rebuilt + 0.070*talk +
0.031*new + 0.024*convers + 0.024*royal + 0.019*?пи?ок
2013-04-30 15:35:47,797 : INFO : topic #11: 0.367*onlin +
0.176*poynter + 0.083*com + 0.024*que + 0.023*skype + 0.019*friend +
0.019*friendli + 0.015*think + 0.014*cool + 0.013*chocol
2013-04-30 15:35:48,466 : INFO : topic #12: 0.000*gak + 0.000*vemevetv
+ 0.000*alic + 0.000*worldalert + 0.000*alif + 0.000*businesstool +
0.000*obvious + 0.000*sugarpop + 0.000*phriend + 0.000*onetimegirl
2013-04-30 15:35:49,135 : INFO : topic #13: 0.234*photo + 0.118*you +
0.078*ing + 0.057*children + 0.048*philippin + 0.042*fan +
0.030*cyclist + 0.021*bicycl + 0.021*thank + 0.020*friend
2013-04-30 15:35:49,864 : INFO : topic #14: 0.000*gak + 0.000*vemevetv
+ 0.000*alic + 0.000*worldalert + 0.000*alif + 0.000*businesstool +
0.000*obvious + 0.000*sugarpop + 0.000*phriend + 0.000*onetimegirl
2013-04-30 15:35:50,611 : INFO : topic diff=0.392038, rho=0.500000

2013-04-30 15:35:51,607 : INFO : PROGRESS: iteration 0, at document
#500/1593460
2013-04-30 15:36:12,225 : INFO : 94/100 documents converged within 50
iterations
2013-04-30 15:36:15,507 : INFO : merging changes from 100 documents
into a model of 1593460 documents
2013-04-30 15:36:39,910 : INFO : topic #0: 0.182*fútbol + 0.166*azul +
0.092*jugador + 0.092*futbolista + 0.050*porqu + 0.046*fan +
0.046*deportista + 0.040*segundo + 0.037*tomar + 0.029*que
2013-04-30 15:36:40,672 : INFO : topic #1: 0.184*writer + 0.177*educ +
0.047*relat + 0.034*write + 0.027*resourc + 0.025*folk + 0.024*blogger
+ 0.022*thing + 0.020*advic + 0.018*tweep
2013-04-30 15:36:41,379 : INFO : topic #2: 0.123*journo + 0.113*celeb
+ 0.112*new + 0.102*favourit + 0.044*journalist + 0.044*media +
0.036*tweet + 0.033*updat + 0.028*citi + 0.020*celebr
2013-04-30 15:36:42,141 : INFO : topic #3: 0.235*auto + 0.156*recruit
+ 0.087*want + 0.078*coolest + 0.039*worldwid + 0.032*welovebieb +
0.030*golden + 0.027*peep + 0.025*lovelovelov + 0.019*luv
2013-04-30 15:36:42,842 : INFO : topic #4: 0.013*pra + 0.012*que +
0.011*tokita + 0.011*tokio + 0.010*nois + 0.009*hotel + 0.009*lista +
0.008*meu + 0.008*love + 0.007*melhor
2013-04-30 15:36:43,544 : INFO : topic #5: 0.069*intern + 0.036*vice +
0.019*world + 0.019*cool + 0.019*athlet + 0.017*team + 0.015*teamtwist
+ 0.015*homi + 0.014*love + 0.014*know
2013-04-30 15:36:44,308 : INFO : topic #6: 0.355*tech + 0.075*geek +
0.070*program + 0.034*account + 0.034*new + 0.029*present +
0.025*interact + 0.024*london + 0.021*profession + 0.021*employe
2013-04-30 15:36:45,009 : INFO : topic #7: 0.190*account + 0.148*vip +
0.080*latest + 0.066*beat + 0.050*partner + 0.034*expert + 0.032*new +
0.032*core + 0.029*tweet + 0.025*blog
2013-04-30 15:36:45,771 : INFO : topic #8: 0.000*gak + 0.000*vemevetv
+ 0.000*alic + 0.000*worldalert + 0.000*alif + 0.000*businesstool +
0.000*obvious + 0.000*sugarpop + 0.000*phriend + 0.000*onetimegirl
2013-04-30 15:36:46,473 : INFO : topic #9: 0.386*publicidad +
0.169*lista + 0.018*handl + 0.016*listando + 0.013*amigo +
0.013*listei + 0.011*que + 0.010*lindo + 0.010*obrigado + 0.009*love
2013-04-30 15:36:47,234 : INFO : topic #10: 0.102*talk + 0.097*daili +
0.089*friend + 0.087*conversationlist + 0.074*dynam + 0.070*rebuilt +
0.031*royal + 0.024*new + 0.020*convers + 0.019*adv
2013-04-30 15:36:47,936 : INFO : topic #11: 0.479*onlin +
0.114*poynter + 0.085*com + 0.032*skype + 0.030*friend + 0.016*que +
0.012*friendli + 0.011*cool + 0.011*best + 0.010*think
2013-04-30 15:36:48,699 : INFO : topic #12: 0.000*gak + 0.000*vemevetv
+ 0.000*alic + 0.000*worldalert + 0.000*alif + 0.000*businesstool +
0.000*obvious + 0.000*sugarpop + 0.000*phriend + 0.000*onetimegirl
2013-04-30 15:36:49,410 : INFO : topic #13: 0.298*photo + 0.093*you +
0.076*philippin + 0.062*ing + 0.045*children + 0.039*cherish +
0.033*fan + 0.024*cyclist + 0.023*friend + 0.019*daili
2013-04-30 15:36:50,180 : INFO : topic #14: 0.000*gak + 0.000*vemevetv
+ 0.000*alic + 0.000*worldalert + 0.000*alif + 0.000*businesstool +
0.000*obvious + 0.000*sugarpop + 0.000*phriend + 0.000*onetimegirl
2013-04-30 15:36:51,019 : INFO : topic diff=0.463744, rho=0.447214
2013-04-30 15:36:51,943 : INFO : PROGRESS: iteration 0, at document
#600/1593460
2013-04-30 15:37:08,463 : INFO : 96/100 documents converged within 50
iterations
2013-04-30 15:37:11,697 : INFO : merging changes from 100 documents
into a model of 1593460 documents
2013-04-30 15:37:36,281 : INFO : topic #0: 0.414*porqu + 0.127*azul +
0.097*fútbol + 0.049*jugador + 0.049*futbolista + 0.038*toda +
0.025*fan + 0.024*deportista + 0.021*segundo + 0.020*tomar
2013-04-30 15:37:36,953 : INFO : topic #1: 0.208*writer + 0.146*educ +
0.040*relat + 0.035*resourc + 0.031*write + 0.026*folk + 0.021*blogger
+ 0.021*watch + 0.020*tweep + 0.018*thing
2013-04-30 15:37:37,623 : INFO : topic #2: 0.219*celeb + 0.088*new +
0.076*favourit + 0.070*journo + 0.050*present + 0.046*past +
0.037*tweet + 0.032*celebr + 0.031*journalist + 0.027*media
2013-04-30 15:37:38,293 : INFO : topic #3: 0.218*auto + 0.145*recruit
+ 0.120*want + 0.068*worldwid + 0.066*coolest + 0.062*golden +
0.022*real + 0.021*welovebieb + 0.020*peep + 0.016*lovelovelov
2013-04-30 15:37:39,023 : INFO : topic #4: 0.013*pra + 0.012*que +
0.011*tokita + 0.010*tokio + 0.010*nois + 0.009*hotel + 0.009*lista +
0.008*meu + 0.008*love + 0.006*melhor
2013-04-30 15:37:39,692 : INFO : topic #5: 0.065*intern + 0.024*vice +
0.021*basic + 0.020*friend + 0.018*cuz + 0.017*handsom + 0.016*world +
0.016*don + 0.015*team + 0.015*sexi
2013-04-30 15:37:40,360 : INFO : topic #6: 0.490*tech + 0.086*geek +
0.057*program + 0.033*new + 0.027*account + 0.020*tweet +
0.019*employe + 0.018*interact + 0.014*present + 0.011*world
2013-04-30 15:37:41,043 : INFO : topic #7: 0.217*account + 0.188*vip +
0.076*latest + 0.067*beat + 0.055*partner + 0.030*expert + 0.026*up +
0.026*new + 0.024*tweet + 0.022*core
2013-04-30 15:37:41,774 : INFO : topic #8: 0.000*gak + 0.000*vemevetv
+ 0.000*alic + 0.000*worldalert + 0.000*alif + 0.000*businesstool +
0.000*obvious + 0.000*sugarpop + 0.000*phriend + 0.000*onetimegirl
2013-04-30 15:37:42,444 : INFO : topic #9: 0.435*publicidad +
0.156*lista + 0.016*handl + 0.015*que + 0.014*listando + 0.012*amigo +
0.011*listei + 0.009*seguir + 0.009*listado + 0.008*lindo
2013-04-30 15:37:43,113 : INFO : topic #10: 0.104*talk + 0.088*friend
+ 0.083*daili + 0.075*conversationlist + 0.075*dynam + 0.072*rebuilt +
0.028*old + 0.026*new + 0.020*royal + 0.019*import
2013-04-30 15:37:43,784 : INFO : topic #11: 0.323*onlin +
0.253*poynter + 0.094*com + 0.032*sign + 0.024*friend + 0.023*best +
0.019*msn + 0.018*step + 0.014*cool + 0.014*que
2013-04-30 15:37:44,513 : INFO : topic #12: 0.000*gak + 0.000*vemevetv
+ 0.000*alic + 0.000*worldalert + 0.000*alif + 0.000*businesstool +
0.000*obvious + 0.000*sugarpop + 0.000*phriend + 0.000*onetimegirl
2013-04-30 15:37:45,181 : INFO : topic #13: 0.306*sake + 0.238*photo +
0.089*children + 0.061*you + 0.031*philippin + 0.028*friend +
0.025*rock + 0.025*thank + 0.025*ing + 0.016*fan
2013-04-30 15:37:45,852 : INFO : topic #14: 0.000*gak + 0.000*vemevetv
+ 0.000*alic + 0.000*worldalert + 0.000*alif + 0.000*businesstool +
0.000*obvious + 0.000*sugarpop + 0.000*phriend + 0.000*onetimegirl
2013-04-30 15:37:46,652 : INFO : topic diff=0.383917, rho=0.408248
2013-04-30 15:37:47,616 : INFO : PROGRESS: iteration 0, at document
#700/1593460
2013-04-30 15:38:00,464 : INFO : 98/100 documents converged within 50
iterations
2013-04-30 15:38:03,708 : INFO : merging changes from 100 documents
into a model of 1593460 documents
2013-04-30 15:38:28,161 : INFO : topic #0: 0.403*porqu + 0.193*azul +
0.091*toda + 0.062*fútbol + 0.032*jugador + 0.031*futbolista +
0.022*fan + 0.017*que + 0.016*deportista + 0.014*segundo
2013-04-30 15:38:28,928 : INFO : topic #1: 0.179*writer + 0.136*educ +
0.058*resourc + 0.040*relat + 0.035*write + 0.027*advic + 0.025*folk +
0.022*watch + 0.022*blogger + 0.022*tweep
2013-04-30 15:38:29,628 : INFO : topic #2: 0.283*celeb +
0.089*favourit + 0.068*new + 0.049*journo + 0.043*celebr + 0.041*past
+ 0.037*present + 0.034*tweet + 0.023*media + 0.022*journalist
2013-04-30 15:38:30,389 : INFO : topic #3: 0.344*recruit + 0.171*auto
+ 0.089*want + 0.057*worldwid + 0.036*coolest + 0.034*golden +
0.029*welovebieb + 0.023*real + 0.021*peep + 0.014*love
2013-04-30 15:38:31,091 : INFO : topic #4: 0.012*pra + 0.011*que +
0.011*tokita + 0.010*tokio + 0.009*nois + 0.008*hotel + 0.008*lista +
0.008*meu + 0.007*love + 0.006*melhor
2013-04-30 15:38:31,872 : INFO : topic #5: 0.083*intern + 0.020*vice +
0.019*friend + 0.018*throw + 0.017*don + 0.017*basic + 0.016*world +
0.016*teamfollowback + 0.014*team + 0.014*cuz
2013-04-30 15:38:32,603 : INFO : topic #6: 0.567*tech + 0.067*geek +
0.039*program + 0.029*new + 0.027*account + 0.022*tweet +
0.020*employe + 0.016*interact + 0.012*present + 0.010*current
2013-04-30 15:38:33,303 : INFO : topic #7: 0.211*partner +
0.176*account + 0.136*vip + 0.060*latest + 0.039*beat + 0.029*tweet +
0.028*inform + 0.027*expert + 0.026*economi + 0.022*core
2013-04-30 15:38:34,063 : INFO : topic #8: 0.000*gak + 0.000*vemevetv
+ 0.000*alic + 0.000*worldalert + 0.000*alif + 0.000*businesstool +
0.000*obvious + 0.000*sugarpop + 0.000*phriend + 0.000*onetimegirl
2013-04-30 15:38:34,762 : INFO : topic #9: 0.375*publicidad +
0.135*lista + 0.132*anda + 0.014*handl + 0.013*que + 0.012*listando +
0.010*amigo + 0.009*listei + 0.008*seguir + 0.007*listado
2013-04-30 15:38:35,521 : INFO : topic #10: 0.098*talk + 0.085*friend
+ 0.084*dynam + 0.081*conversationlist + 0.072*daili + 0.067*rebuilt +
0.042*old + 0.030*new + 0.023*interest + 0.021*best
2013-04-30 15:38:36,221 : INFO : topic #11: 0.366*onlin +
0.230*poynter + 0.093*com + 0.029*sign + 0.025*best + 0.024*friend +
0.020*msn + 0.016*step + 0.013*cool + 0.012*que
2013-04-30 15:38:36,920 : INFO : topic #12: 0.000*gak + 0.000*vemevetv
+ 0.000*alic + 0.000*worldalert + 0.000*alif + 0.000*businesstool +
0.000*obvious + 0.000*sugarpop + 0.000*phriend + 0.000*onetimegirl
2013-04-30 15:38:37,680 : INFO : topic #13: 0.248*photo + 0.247*sake +
0.128*children + 0.106*you + 0.028*rock + 0.027*friend +
0.025*philippin + 0.020*thank + 0.020*ing + 0.015*fan
2013-04-30 15:38:38,379 : INFO : topic #14: 0.000*gak + 0.000*vemevetv
+ 0.000*alic + 0.000*worldalert + 0.000*alif + 0.000*businesstool +
0.000*obvious + 0.000*sugarpop + 0.000*phriend + 0.000*onetimegirl
2013-04-30 15:38:39,275 : INFO : topic diff=0.278820, rho=0.377964
2013-04-30 15:38:40,169 : INFO : PROGRESS: iteration 0, at document
#800/1593460
2013-04-30 15:38:50,895 : INFO : 99/100 documents converged within 50
iterations
2013-04-30 15:38:54,171 : INFO : merging changes from 100 documents
into a model of 1593460 documents
2013-04-30 15:39:18,679 : INFO : topic #0: 0.324*azul + 0.285*porqu +
0.125*toda + 0.042*fan + 0.035*fútbol + 0.022*linda + 0.018*jugador +
0.018*futbolista + 0.016*que + 0.010*mundo
2013-04-30 15:39:19,345 : INFO : topic #1: 0.359*writer + 0.067*write
+ 0.066*educ + 0.033*resourc + 0.033*tweep + 0.032*blogger +
0.030*relat + 0.023*folk + 0.016*link + 0.015*watch
2013-04-30 15:39:20,016 : INFO : topic #2: 0.298*celeb +
0.085*favourit + 0.060*new + 0.052*journo + 0.041*class + 0.041*celebr
+ 0.036*past + 0.035*tweet + 0.030*present + 0.023*entertain
2013-04-30 15:39:20,683 : INFO : topic #3: 0.241*recruit + 0.191*auto
+ 0.106*want + 0.076*coolest + 0.071*worldwid + 0.042*golden +
0.037*real + 0.021*love + 0.021*like + 0.020*welovebieb
2013-04-30 15:39:21,419 : INFO : topic #4: 0.011*pra + 0.011*que +
0.010*tokita + 0.010*tokio + 0.009*nois + 0.008*hotel + 0.008*lista +
0.007*meu + 0.007*love + 0.006*melhor
2013-04-30 15:39:22,100 : INFO : topic #5: 0.094*intern + 0.041*cuz +
0.028*handsom + 0.023*teamtwist + 0.022*vice + 0.018*bodi +
0.017*friend + 0.017*homi + 0.016*don + 0.016*know
2013-04-30 15:39:22,790 : INFO : topic #6: 0.523*tech + 0.095*geek +
0.034*program + 0.030*london + 0.029*interact + 0.029*new +
0.025*account + 0.021*tweet + 0.018*employe + 0.011*present
2013-04-30 15:39:23,471 : INFO : topic #7: 0.234*partner +
0.180*account + 0.130*vip + 0.063*latest + 0.033*beat + 0.033*tweet +
0.028*up + 0.024*inform + 0.023*expert + 0.022*economi
2013-04-30 15:39:24,202 : INFO : topic #8: 0.000*gak + 0.000*vemevetv
+ 0.000*alic + 0.000*worldalert + 0.000*alif + 0.000*businesstool +
0.000*obvious + 0.000*sugarpop + 0.000*phriend + 0.000*onetimegirl
2013-04-30 15:39:24,872 : INFO : topic #9: 0.474*publicidad +
0.101*lista + 0.081*anda + 0.033*seguir + 0.029*listado + 0.026*por +
0.016*para + 0.014*que + 0.009*amigo + 0.008*handl
2013-04-30 15:39:25,543 : INFO : topic #10: 0.107*talk +
0.100*conversationlist + 0.099*friend + 0.096*dynam + 0.092*daili +
0.087*rebuilt + 0.042*convers + 0.022*old + 0.021*new + 0.018*best
2013-04-30 15:39:26,220 : INFO : topic #11: 0.402*onlin +
0.189*poynter + 0.099*com + 0.029*friend + 0.025*skype + 0.024*best +
0.024*sign + 0.016*msn + 0.013*step + 0.013*cool
2013-04-30 15:39:26,892 : INFO : topic #12: 0.000*gak + 0.000*vemevetv
+ 0.000*alic + 0.000*worldalert + 0.000*alif + 0.000*businesstool +
0.000*obvious + 0.000*sugarpop + 0.000*phriend + 0.000*onetimegirl
2013-04-30 15:39:27,619 : INFO : topic #13: 0.248*children +
0.220*photo + 0.192*sake + 0.098*you + 0.039*rock + 0.030*friend +
0.018*fan + 0.016*philippin + 0.016*thank + 0.013*ing
2013-04-30 15:39:28,286 : INFO : topic #14: 0.000*gak + 0.000*vemevetv
+ 0.000*alic + 0.000*worldalert + 0.000*alif + 0.000*businesstool +
0.000*obvious + 0.000*sugarpop + 0.000*phriend + 0.000*onetimegirl
2013-04-30 15:39:29,092 : INFO : topic diff=0.245962, rho=0.353553
2013-04-30 15:39:29,992 : INFO : PROGRESS: iteration 0, at document
#900/1593460
2013-04-30 15:39:41,739 : INFO : 97/100 documents converged within 50
iterations
2013-04-30 15:39:44,953 : INFO : merging changes from 100 documents
into a model of 1593460 documents
2013-04-30 15:40:09,534 : INFO : topic #0: 0.322*azul + 0.321*porqu +
0.109*toda + 0.040*fan + 0.028*fútbol + 0.020*linda + 0.016*que +
0.016*tomar + 0.014*jugador + 0.014*amant
2013-04-30 15:40:10,234 : INFO : topic #1: 0.289*writer + 0.117*educ +
0.055*write + 0.047*resourc + 0.044*teacher + 0.033*tweep +
0.031*blogger + 0.025*relat + 0.018*folk + 0.016*thing
2013-04-30 15:40:10,999 : INFO : topic #2: 0.341*celeb +
0.070*favourit + 0.060*celebr + 0.054*new + 0.045*journo + 0.033*tweet
+ 0.031*class + 0.027*past + 0.027*entertain + 0.022*present
2013-04-30 15:40:11,705 : INFO : topic #3: 0.266*auto + 0.187*recruit
+ 0.125*want + 0.092*worldwid + 0.059*coolest + 0.032*golden +
0.029*real + 0.023*love + 0.017*like + 0.016*welovebieb
2013-04-30 15:40:12,469 : INFO : topic #4: 0.011*pra + 0.010*que +
0.010*tokita + 0.009*tokio + 0.008*nois + 0.008*hotel + 0.007*lista +
0.007*meu + 0.007*love + 0.005*melhor
2013-04-30 15:40:13,170 : INFO : topic #5: 0.116*intern + 0.031*cuz +
0.026*teamtwist + 0.021*handsom + 0.017*know + 0.016*ass + 0.016*vice
+ 0.015*man + 0.015*friend + 0.014*men
2013-04-30 15:40:13,870 : INFO : topic #6: 0.539*tech + 0.093*geek +
0.038*program + 0.026*new + 0.025*interact + 0.022*london +
0.019*tweet + 0.018*account + 0.017*field + 0.013*employe
2013-04-30 15:40:14,630 : INFO : topic #7: 0.199*partner +
0.163*account + 0.131*vip + 0.078*economi + 0.045*latest + 0.032*tweet
+ 0.030*beat + 0.029*info + 0.026*inform + 0.026*core
2013-04-30 15:40:15,389 : INFO : topic #8: 0.000*gak + 0.000*vemevetv
+ 0.000*alic + 0.000*worldalert + 0.000*alif + 0.000*businesstool +
0.000*obvious + 0.000*sugarpop + 0.000*phriend + 0.000*onetimegirl
2013-04-30 15:40:16,089 : INFO : topic #9: 0.473*publicidad +
0.101*lista + 0.081*anda + 0.033*seguir + 0.029*listado + 0.026*por +
0.016*para + 0.014*que + 0.009*amigo + 0.008*handl
2013-04-30 15:40:16,849 : INFO : topic #10: 0.111*conversationlist +
0.110*daili + 0.108*talk + 0.096*friend + 0.095*dynam + 0.088*rebuilt
+ 0.052*convers + 0.022*new + 0.018*old + 0.017*best
2013-04-30 15:40:17,548 : INFO : topic #11: 0.392*onlin +
0.156*poynter + 0.110*com + 0.053*step + 0.034*friend + 0.027*best +
0.021*skype + 0.020*sign + 0.014*msn + 0.011*cool
2013-04-30 15:40:18,246 : INFO : topic #12: 0.000*gak + 0.000*vemevetv
+ 0.000*alic + 0.000*worldalert + 0.000*alif + 0.000*businesstool +
0.000*obvious + 0.000*sugarpop + 0.000*phriend + 0.000*onetimegirl
2013-04-30 15:40:19,004 : INFO : topic #13: 0.487*sake + 0.265*photo +
0.096*children + 0.032*you + 0.019*friend + 0.015*new + 0.013*rock +
0.008*fan + 0.006*mrchildren + 0.006*japanes
2013-04-30 15:40:19,762 : INFO : topic #14: 0.000*gak + 0.000*vemevetv
+ 0.000*alic + 0.000*worldalert + 0.000*alif + 0.000*businesstool +
0.000*obvious + 0.000*sugarpop + 0.000*phriend + 0.000*onetimegirl
2013-04-30 15:40:20,595 : INFO : topic diff=0.266508, rho=0.333333
2013-04-30 15:40:21,507 : INFO : PROGRESS: iteration 0, at document
#1000/1593460
2013-04-30 15:40:34,035 : INFO : 100/100 documents converged within 50
iterations
2013-04-30 15:40:37,333 : INFO : merging changes from 100 documents
into a model of 1593460 documents
2013-04-30 15:41:01,872 : INFO : topic #0: 0.360*azul + 0.277*porqu +
0.094*toda + 0.067*fan + 0.024*fútbol + 0.018*mundo + 0.018*linda +
0.017*que + 0.014*tomar + 0.012*jugador
2013-04-30 15:41:02,539 : INFO : topic #1: 0.209*writer + 0.149*educ +
0.104*write + 0.092*teacher + 0.043*resourc + 0.033*blogger +
0.024*tweep + 0.022*folk + 0.020*relat + 0.015*blog
2013-04-30 15:41:03,205 : INFO : topic #2: 0.344*celeb + 0.074*journo
+ 0.066*celebr + 0.053*favourit + 0.049*new + 0.047*class +
0.028*tweet + 0.027*journalist + 0.023*media + 0.023*past
2013-04-30 15:41:03,933 : INFO : topic #3: 0.266*recruit + 0.198*auto
+ 0.155*worldwid + 0.090*want + 0.051*coolest + 0.028*real +
0.026*welovebieb + 0.020*love + 0.020*golden + 0.012*like
2013-04-30 15:41:04,600 : INFO : topic #4: 0.010*pra + 0.009*que +
0.009*tokita + 0.008*tokio + 0.008*nois + 0.007*hotel + 0.007*lista +
0.006*meu + 0.006*love + 0.005*melhor
2013-04-30 15:41:05,268 : INFO : topic #5: 0.099*intern +
0.049*handsom + 0.023*promo + 0.022*cuz + 0.022*man + 0.021*ass +
0.019*teamtwist + 0.019*men + 0.016*know + 0.015*friend
2013-04-30 15:41:05,996 : INFO : topic #6: 0.571*tech + 0.096*geek +
0.030*new + 0.026*program + 0.022*interact + 0.020*employe +
0.019*tweet + 0.013*london + 0.013*account + 0.012*world
2013-04-30 15:41:06,664 : INFO : topic #7: 0.165*partner +
0.139*account + 0.131*vip + 0.064*latest + 0.062*economi + 0.045*beat
+ 0.029*up + 0.029*tweet + 0.028*core + 0.028*info
2013-04-30 15:41:07,331 : INFO : topic #8: 0.000*gak + 0.000*vemevetv
+ 0.000*alic + 0.000*worldalert + 0.000*alif + 0.000*businesstool +
0.000*obvious + 0.000*sugarpop + 0.000*phriend + 0.000*onetimegirl
2013-04-30 15:41:07,998 : INFO : topic #9: 0.581*publicidad +
0.096*lista + 0.052*anda + 0.035*listado + 0.022*seguir + 0.021*por +
0.011*para + 0.010*que + 0.007*amigo + 0.005*musica
2013-04-30 15:41:08,724 : INFO : topic #10: 0.134*daili +
0.108*conversationlist + 0.108*talk + 0.096*dynam + 0.095*friend +
0.085*rebuilt + 0.039*convers + 0.026*old + 0.021*new + 0.017*import
2013-04-30 15:41:09,391 : INFO : topic #11: 0.474*onlin +
0.113*poynter + 0.087*com + 0.080*step + 0.027*friend + 0.020*best +
0.016*bb + 0.015*skype + 0.015*que + 0.014*sign
2013-04-30 15:41:10,061 : INFO : topic #12: 0.000*gak + 0.000*vemevetv
+ 0.000*alic + 0.000*worldalert + 0.000*alif + 0.000*businesstool +
0.000*obvious + 0.000*sugarpop + 0.000*phriend + 0.000*onetimegirl
2013-04-30 15:41:10,752 : INFO : topic #13: 0.432*sake + 0.251*photo +
0.179*children + 0.029*you + 0.018*friend + 0.014*new + 0.012*rock +
0.007*fan + 0.005*mrchildren + 0.005*japanes
2013-04-30 15:41:11,483 : INFO : topic #14: 0.000*gak + 0.000*vemevetv
+ 0.000*alic + 0.000*worldalert + 0.000*alif + 0.000*businesstool +
0.000*obvious + 0.000*sugarpop + 0.000*phriend + 0.000*onetimegirl
2013-04-30 15:41:12,219 : INFO : topic diff=0.254042, rho=0.316228
2013-04-30 15:41:13,204 : INFO : PROGRESS: iteration 0, at document
#1100/1593460
2013-04-30 15:41:27,349 : INFO : 98/100 documents converged within 50
iterations
2013-04-30 15:41:30,572 : INFO : merging changes from 100 documents
into a model of 1593460 documents
2013-04-30 15:41:55,037 : INFO : topic #0: 0.360*porqu + 0.304*azul +
0.079*toda + 0.062*fan + 0.021*mundo + 0.020*fútbol + 0.019*amant +
0.015*que + 0.015*linda + 0.011*tomar
2013-04-30 15:41:55,798 : INFO : topic #1: 0.217*educ + 0.216*writer +
0.087*write + 0.084*teacher + 0.031*resourc + 0.030*blogger +
0.023*tweep + 0.019*folk + 0.016*relat + 0.015*blog
2013-04-30 15:41:56,506 : INFO : topic #2: 0.311*celeb + 0.129*journo
+ 0.060*celebr + 0.049*new + 0.043*favourit + 0.042*journalist +
0.040*class + 0.029*media + 0.024*tweet + 0.017*entertain
2013-04-30 15:41:57,267 : INFO : topic #3: 0.239*auto + 0.219*recruit
+ 0.213*worldwid + 0.075*want + 0.042*coolest + 0.026*real +
0.022*welovebieb + 0.017*love + 0.016*golden + 0.010*like
2013-04-30 15:41:57,968 : INFO : topic #4: 0.009*pra + 0.008*que +
0.008*tokita + 0.007*tokio + 0.007*nois + 0.006*hotel + 0.006*lista +
0.006*meu + 0.005*love + 0.005*melhor
2013-04-30 15:41:58,727 : INFO : topic #5: 0.148*intern + 0.037*promo
+ 0.032*handsom + 0.031*suivr + 0.016*legend + 0.016*vice + 0.016*fun
+ 0.015*friend + 0.015*male + 0.015*cuz
2013-04-30 15:41:59,425 : INFO : topic #6: 0.556*tech + 0.087*geek +
0.037*london + 0.037*program + 0.030*new + 0.026*tweet +
0.018*interact + 0.017*employe + 0.016*world + 0.012*account
2013-04-30 15:42:00,123 : INFO : topic #7: 0.170*partner +
0.143*account + 0.109*vip + 0.086*economi + 0.083*latest + 0.038*beat
+ 0.028*info + 0.027*tweet + 0.025*inform + 0.024*up
2013-04-30 15:42:00,872 : INFO : topic #8: 0.000*gak + 0.000*vemevetv
+ 0.000*alic + 0.000*worldalert + 0.000*alif + 0.000*businesstool +
0.000*obvious + 0.000*sugarpop + 0.000*phriend + 0.000*onetimegirl
2013-04-30 15:42:01,597 : INFO : topic #9: 0.454*publicidad +
0.197*anda + 0.088*lista + 0.027*listado + 0.025*por + 0.020*seguir +
0.011*para + 0.010*que + 0.009*el + 0.009*musica
2013-04-30 15:42:02,358 : INFO : topic #10: 0.132*daili + 0.119*talk +
0.110*conversationlist + 0.103*dynam + 0.093*friend + 0.089*rebuilt +
0.027*old + 0.025*convers + 0.020*new + 0.017*interest
2013-04-30 15:42:03,063 : INFO : topic #11: 0.540*onlin +
0.091*poynter + 0.086*com + 0.065*step + 0.025*friend + 0.020*best +
0.014*que + 0.013*bb + 0.012*skype + 0.011*sign
2013-04-30 15:42:03,827 : INFO : topic #12: 0.000*gak + 0.000*vemevetv
+ 0.000*alic + 0.000*worldalert + 0.000*alif + 0.000*businesstool +
0.000*obvious + 0.000*sugarpop + 0.000*phriend + 0.000*onetimegirl
2013-04-30 15:42:04,528 : INFO : topic #13: 0.537*sake + 0.166*photo +
0.101*children + 0.037*ohayo + 0.021*friend + 0.016*you +
0.015*philippin + 0.013*beer + 0.012*?????? + 0.012*u
2013-04-30 15:42:05,288 : INFO : topic #14: 0.000*gak + 0.000*vemevetv
+ 0.000*alic + 0.000*worldalert + 0.000*alif + 0.000*businesstool +
0.000*obvious + 0.000*sugarpop + 0.000*phriend + 0.000*onetimegirl
2013-04-30 15:42:06,061 : INFO : topic diff=0.220991, rho=0.301511

Radim Řehůřek

May 1, 2013, 4:27:14 PM
to gensim
Hi Jason, thanks.

Do you have a fast BLAS installed? It can make a huge difference in
performance; see http://radimrehurek.com/gensim/distributed.html .

Even topic printing seems to be slow on your machine. One second per
printed topic... That also points to slow BLAS, as printing entails
virtually just a sum of two matrices.

Btw any particular reason you're running batches of `chunksize=100`
docs at a time? The default is 2,000, but I see you set this to 100
manually. I'm asking because the merge procedure (M-step of the online
EM algorithm) seems to be rather slow on your machine (possibly because
of that slow BLAS). And a smaller chunksize means more merges...
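
To put rough numbers on that, here is a back-of-the-envelope sketch. It assumes one merge per chunk (which is what the log shows) and that the roughly 52 seconds between the PROGRESS lines for documents #1000 and #1100 is typical of every chunk of 100. The function names are made up for illustration.

```python
import math

NUM_DOCS = 1593460          # corpus size reported in the log
SECONDS_PER_CHUNK = 51.7    # 15:40:21.507 -> 15:41:13.204, one chunk of 100 docs


def num_merges(num_docs, chunksize):
    """Number of chunks, hence merges, in one pass (assuming one merge per chunk)."""
    return int(math.ceil(num_docs / float(chunksize)))


def estimated_days(num_docs, chunksize, seconds_per_chunk):
    """Extrapolate one full pass from a measured per-chunk time (a strong assumption)."""
    return num_merges(num_docs, chunksize) * seconds_per_chunk / 86400.0


print(num_merges(NUM_DOCS, 100))    # 15935
print(num_merges(NUM_DOCS, 2000))   # 797
print(round(estimated_days(NUM_DOCS, 100, SECONDS_PER_CHUNK), 1))  # 9.5
```

At chunksize=100 the extrapolation lands close to the 10 days reported, and in the log roughly half of each chunk's time (about 25 s) falls between the "merging changes" line and the first printed topic. The per-chunk time would not stay at 52 s for larger chunks, so the sketch only extrapolates the chunksize=100 case.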

Otherwise everything seems fine with the log, I don't see any other
possible problem. Unless you ran out of main memory and the machine
was swapping :)

Radim

Jason

May 3, 2013, 11:14:10 AM
to gensim
Radim,
I didn't have a proper BLAS before, but I've installed ATLAS now. Is
that the recommended BLAS for numpy on Ubuntu?
It seems to be a little faster, but not by enough to bring the total
computation time down from days. I set chunksize to 100 because the
system immediately runs out of memory at larger chunk sizes; some of
the documents are very large.

Is increasing update_every a valid approach to improving model
creation time?
If so, are there guidelines for increasing update_every without
degrading topic quality too much?
I'm not sure what else I'm doing wrong here, as I expected to get times
close to those of the Wikipedia run.

Skipper Seabold

May 3, 2013, 11:19:51 AM
to gensim
On Fri, May 3, 2013 at 11:14 AM, Jason <jaso...@gmail.com> wrote:
> Radim,
> I didn't have a proper BLAS before, but I've installed ATLAS now, is
> that the recommended BLAS for numpy on ubuntu?
> It seems to be a little faster, but it seems like not a large enough
> change to bring the total computation to less than days. The reason I
> set chunksize to 100 is because the system immediately runs out of
> memory at larger chunk sizes. It is because some of the documents are
> very large.
>
> is increasing update_every a valid approach to improving model
> creation time?
> if so, are there some guidelines for increasing update_every that will
> not result in too diminished quality of topics?
> I'm not sure what else I'm doing wrong here, as I expect to get times
> near the speed of the wikipedia dump.
>

It may not shave off hours and hours, but FWIW printing is slow.

def loop(n):
    for i in range(n):
        print(i)

def loop2(n):
    for i in range(n):
        pass

%timeit loop(10000)
10 loops, best of 3: 63 ms per loop

%timeit loop2(10000)
1000 loops, best of 3: 307 us per loop

Skipper

Radim Řehůřek

May 3, 2013, 3:22:52 PM
to gensim
Hmm. What exactly is "a little faster"? For comparison, on my laptop,
I'm getting:

>>> import numpy
>>> a, b = numpy.random.rand(350, 100000), numpy.random.rand(100000, 350)
>>> timeit numpy.dot(a, b)
1 loops, best of 3: 566 ms per loop

Note that numpy+scipy must be installed *after* installing BLAS,
because they pick up the BLAS libraries at compile (=install) time.
Re-installing BLAS will have no effect on an already installed numpy.
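
For anyone who cannot use IPython's timeit magic, the same measurement can be sketched with the standard `timeit` module. `time_dot` is a made-up helper name, and the default sizes simply copy the ones used above.

```python
import timeit

import numpy


def time_dot(rows=350, cols=100000, repeats=10):
    """Return total seconds for `repeats` products of (rows x cols) by (cols x rows)."""
    a = numpy.random.rand(rows, cols)
    b = numpy.random.rand(cols, rows)
    # timeit accepts a callable, so the matrices are built once, outside the timing
    return timeit.timeit(lambda: numpy.dot(a, b), number=repeats)
```

Calling `time_dot()` with the defaults needs a few hundred MB of RAM for the two matrices. `numpy.show_config()` prints which BLAS/LAPACK libraries numpy was built against, which helps when checking whether a rebuild actually picked up the new BLAS.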

It's also worrying you can't fit 2,000 documents into RAM... what's up
with that? Are you sure your input data is ok? 2k text documents of a
few thousand words each shouldn't be a problem...

Radim

Radim Řehůřek

May 3, 2013, 3:24:52 PM
to gensim
On May 3, 5:19 pm, Skipper Seabold <jsseab...@gmail.com> wrote:
Thanks Skipper, but I believe this is just a red herring.

In any case, the printing of topics during LDA training can be
disabled by commenting out `self.print_topics(15) # print out some
debug info at the end of each EM iteration` in the ldamodel.py file.

-rr

Jason

May 6, 2013, 6:33:14 PM
to gensim
Radim, BLAS is working fine. I looked into why my memory
was blowing up and I think I've found a bug or documentation issue.

When I pass in the dictionary to lda with:
dictionary = corpora.Dictionary.load(dictionary_file)
model = gensim.models.ldamodel.LdaModel(corpus, id2word=dictionary, num_topics=100)

The process's memory fills up until the process dies. I believe
the bug is in the lda code because the memory issue happens before it
even starts calling the corpus's __iter__ method. I tested this by
adding a print statement to my custom corpus:
class MyCorpus(object):
    def __init__(self, docs_file):
        self.docs_file = docs_file

    def __len__(self):
        return int(commands.getstatusoutput('wc -l ' + self.docs_file)[1].split()[0])

    def __iter__(self):
        for line in open(self.docs_file):
            # assume there's one document per line, tokens separated by whitespace
            print('iter')
            yield dictionary.doc2bow(json.loads(line)['doc'].split())
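
The `commands` module exists only in Python 2. As a hedged sketch, the same streaming corpus can count lines in Python instead of shelling out to `wc -l`. The class name `JsonLinesCorpus` and the `doc2bow` constructor argument are assumptions for illustration; in the code above that role is played by `dictionary.doc2bow`.

```python
import json


class JsonLinesCorpus(object):
    """Stream one JSON document per line; only one line is held in memory at a time."""

    def __init__(self, docs_file, doc2bow):
        self.docs_file = docs_file
        self.doc2bow = doc2bow      # callable: list of tokens -> bag-of-words
        self._length = None         # cached so the file is only counted once

    def __len__(self):
        if self._length is None:
            with open(self.docs_file) as f:
                self._length = sum(1 for _ in f)
        return self._length

    def __iter__(self):
        with open(self.docs_file) as f:
            for line in f:
                # assume one document per line, tokens separated by whitespace
                yield self.doc2bow(json.loads(line)['doc'].split())
```

It would be used as `JsonLinesCorpus(docs_file, dictionary.doc2bow)`, and because `__iter__` reopens the file each time, the corpus can be iterated more than once.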


If I remove the id2word argument to lda, then there is no memory
explosion and the print statement appears for each line, but then I
get this issue:
2013-05-06 22:22:19,969 : INFO : no word id mapping provided;
initializing from corpus, assuming identity

The actual dictionary file is only 81 MB.

Radim Řehůřek

May 7, 2013, 6:13:44 AM
to gensim
Hello Jason,

good, thanks for debugging, now we're getting somewhere :)

The only difference between providing id2word or not is the following
lines in ldamodel.py:

if self.id2word is None:
    logger.info("no word id mapping provided; initializing from corpus, assuming identity")
    self.id2word = utils.dict_from_corpus(corpus)
    self.num_terms = len(self.id2word)
else:
    self.num_terms = 1 + max([-1] + self.id2word.keys())

So when you *do* provide id2word, its `keys()` method is called, and
the maximum id is computed (and incremented by 1). Neither of these steps
is particularly memory/CPU intensive, so I still don't know what's
happening, but we've certainly narrowed it down.
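
To see what that expression yields, here is a small sketch that applies it to a plain dict standing in for the dictionary (an assumption; the real object is a gensim Dictionary). The `list(...)` call is added so it also runs on Python 3, where `keys()` no longer returns a list.

```python
def num_terms(id2word):
    """1 + highest word id, or 0 for an empty mapping (same expression as quoted above)."""
    return 1 + max([-1] + list(id2word.keys()))


toy = {0: "tech", 1: "geek", 7: "london"}   # made-up ids with a gap
print(num_terms(toy))   # 8: driven by the highest id, not the number of entries
print(num_terms({}))    # 0
```

So the vocabulary size the model sees is set by the highest word id, not by how many entries the mapping holds.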

Can you post your (gzipped) dictionary file? What is the highest "word
id" in your dictionary?

-rr

Jason

May 7, 2013, 8:28:08 AM
to gensim
Hi Radim,


Here is my dictionary file:
https://dl.dropboxusercontent.com/u/166636/lists.dict.gz


I'm not sure how to get the highest id for my dictionary.

Here is the code for how I build my dictionary:

dictionary = corpora.Dictionary()
for line in open(docs_file):
    dictionary.add_documents([json.loads(line)['doc'].split()])
dictionary.save(dictionary_file)

Radim Řehůřek

May 7, 2013, 1:10:51 PM
to gensim
Hello,

On May 7, 2:28 pm, Jason <jason...@gmail.com> wrote:
> Hi Radim,
>
>    Here is my dictionary file:https://dl.dropboxusercontent.com/u/166636/lists.dict.gz

that dictionary contains 2,764,907 terms. This is way too much for
LDA.

Earlier in your log, I remember you pruned this down to 100,000 terms,
which is saner (though still huge). Can you confirm that the out-of-
memory happens with the 100k dictionary, too?

A 2.7m dictionary would also explain the slowdown you experience
during training...
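
For a rough sense of scale, as a sketch: LDA keeps a topics-by-terms matrix, so its size grows linearly with the dictionary. The 8 bytes per entry (float64) and the choice to count only a single such matrix are assumptions; the real footprint depends on how many copies the implementation holds. The 100 topics come from the `num_topics=100` call earlier in the thread.

```python
def topic_matrix_bytes(num_topics, num_terms, bytes_per_entry=8):
    """Size of one dense num_topics x num_terms matrix, in bytes."""
    return num_topics * num_terms * bytes_per_entry


GIB = 1024.0 ** 3
for terms in (2764907, 100000):
    size = topic_matrix_bytes(100, terms)
    print("%d terms: %.2f GiB" % (terms, size / GIB))
# 2764907 terms: 2.06 GiB
# 100000 terms: 0.07 GiB
```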

If it's really a 100k dict you're using, can you ctrl+c your program
as it blows up the memory, to see the exact line+stacktrace? There's
nothing in the gensim code to warrant this, so I'm at a loss where to
look.

Best,
Radim

Jason

May 7, 2013, 3:31:48 PM
to gensim
Radim,
Yes, filtering to 100k words stops the memory issue. I'm working
with about 2m documents. In general, is 100k words too many, and if so,
is there a number I should shoot for? Is there any kind of
formula I should use to figure out my optimal dictionary size?

Also, on further investigation of my server's BLAS: I do have BLAS
libraries set up, but is there a particular one that would be better?

Running your example seems slow on my server:
timeit.timeit('numpy.dot(a, b)', 'import numpy; a, b = numpy.random.rand(350, 100000), numpy.random.rand(100000, 350)', number=10)
82.75257706642151

On my laptop:
10.773652076721191

I couldn't get the exact command that you sent to run. Could you let
me know what time you get for:

import timeit
timeit.timeit('numpy.dot(a, b)', 'import numpy; a, b = numpy.random.rand(350, 100000), numpy.random.rand(100000, 350)', number=10)





Here is the config of the server:
atlas_threads_info:
    libraries = ['lapack', 'ptf77blas', 'ptcblas', 'atlas']
    library_dirs = ['/usr/lib/atlas-base/atlas', '/usr/lib/atlas-base']
    define_macros = [('ATLAS_INFO', '"\\"3.8.4\\""')]
    language = f77
    include_dirs = ['/usr/include/atlas']
blas_opt_info:
    libraries = ['ptf77blas', 'ptcblas', 'atlas']
    library_dirs = ['/usr/lib/atlas-base']
    define_macros = [('ATLAS_INFO', '"\\"3.8.4\\""')]
    language = c
    include_dirs = ['/usr/include/atlas']
atlas_blas_threads_info:
    libraries = ['ptf77blas', 'ptcblas', 'atlas']
    library_dirs = ['/usr/lib/atlas-base']
    define_macros = [('ATLAS_INFO', '"\\"3.8.4\\""')]
    language = c
    include_dirs = ['/usr/include/atlas']
lapack_opt_info:
    libraries = ['lapack', 'ptf77blas', 'ptcblas', 'atlas']
    library_dirs = ['/usr/lib/atlas-base/atlas', '/usr/lib/atlas-base']
    define_macros = [('ATLAS_INFO', '"\\"3.8.4\\""')]
    language = f77
    include_dirs = ['/usr/include/atlas']
lapack_mkl_info:
    NOT AVAILABLE
blas_mkl_info:
    NOT AVAILABLE
mkl_info:
    NOT AVAILABLE

Skipper Seabold

May 7, 2013, 4:18:27 PM
to gensim
On Tue, May 7, 2013 at 3:31 PM, Jason <jaso...@gmail.com> wrote:
> Also, on more investigation into my server's BLAS, I do have BLAS
> libraries setup, but is there a better particular one I should use?
<snip>
Is this a pre-packaged atlas from some repository or was it
built/tuned for the architecture on the server?

I've had good experience moving from atlas to openblas recently based
on recommendations on the numpy mailing list. If you want more
instruction on linear algebra libraries that might be the place to
ask.

Skipper

Radim Řehůřek

May 7, 2013, 4:32:51 PM
to gensim
Hello Jason,

glad we figured it out.

On May 7, 9:31 pm, Jason <jason...@gmail.com> wrote:
>  Yes, filtering to 100k words stops the memory issue.  I'm working
> with about 2m documents. In general 100k words is too much and if so
> is there a number I should try to shoot for? Is there any kind of
> formula I should use to figure out my optimal dictionary size?

There's no such formula; it depends on the language/your goal/etc. In
general, include words that are meaningful to your app + exclude words
that are not (stop words, function words, ...). For example, academic
papers like to trim the vocabulary to ~10k. Real-world apps have to
deal with real-world words so larger vocab is needed, but it quickly
becomes diminishing returns, esp. for English. In the tutorials I used
100k.
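
As an illustration of trimming by frequency, here is a pure-Python sketch that drops tokens seen in fewer than `no_below` documents or in more than a fraction `no_above` of them, then keeps the `keep_n` tokens that appear in the most documents. gensim's `Dictionary.filter_extremes(no_below, no_above, keep_n)` takes the same three knobs; the function below is a stand-in written for this thread, not gensim's implementation, and the thresholds and toy documents are arbitrary.

```python
from collections import Counter


def prune_vocabulary(docs, no_below=2, no_above=0.5, keep_n=100000):
    """Return the set of tokens to keep, given an iterable of token lists."""
    doc_freq = Counter()
    num_docs = 0
    for tokens in docs:
        num_docs += 1
        doc_freq.update(set(tokens))   # count each token once per document
    candidates = [
        (freq, token) for token, freq in doc_freq.items()
        if freq >= no_below and freq <= no_above * num_docs
    ]
    # most frequent first; ties broken alphabetically so the result is deterministic
    candidates.sort(key=lambda pair: (-pair[0], pair[1]))
    return set(token for _, token in candidates[:keep_n])


docs = [["tech", "geek", "the"], ["tech", "the"], ["london", "the"], ["tech", "geek", "the"]]
print(sorted(prune_vocabulary(docs, no_below=2, no_above=0.75)))  # ['geek', 'tech']
```

Here "the" is dropped for appearing in every document and "london" for appearing in only one, which is the stop-word and rare-word trimming described above in miniature.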

Stemming/lemmatization can be useful to reduce the vocabulary size,
too, but I saw you're already using that :) Note that stemming can
also decrease accuracy (new vs. news), and is tricky in general for
non-English languages (mentioning this because I believe I saw some
non-latin scripts in the dictionary you sent earlier).


> Also, on more investigation into my server's BLAS, I do have BLAS
> libraries setup, but is there a better particular one I should use?
>
> running your example seems slow on my server:
> timeit.timeit('numpy.dot(a, b)','import numpy; a, b =
> numpy.random.rand(350, 100000), numpy.random.rand(100000,
> 350)',number=10)
> 82.75257706642151
>
> On my laptop:
> 10.773652076721191
>
> I couldnt get the exact command that you sent me to run, could you let
> me know what time you get for:

On my laptop, this takes about 5.8 seconds. So your server seems to be
~14x slower.

ATLAS is a fine library, but my guess is you have some generic,
unoptimized version installed. For example, in Debian, there is the
libatlas3-base package, targeting generic x86, which is rather
horrendous for modern CPUs.

HTH,
Radim

Jason

May 7, 2013, 5:40:16 PM
to gensim
Yes, I am using the generically compiled version from ubuntu. I will
compile my own version. Thanks both of you for all the help.