I'm working on a large twitter corpus, and models.ldamallet seems to be giving more coherent topics in a much shorter run time than models.ldamodel or models.ldamulticore. I think that there is topic drift in my corpus which is confusing the model when using the gensim streaming approach.
My next step is to look at the perplexity for different numbers of topics to see if I can find the 'best' number of topics.
I tried the approach suggested by Alta de Waal in this forum post: https://groups.google.com/d/msg/gensim/63cU-6DCyeo/cJjbgoT-hEoJ
But got this error: AttributeError: 'LdaMallet' object has no attribute 'bound'
What is the best way to find the perplexity when using models.ldamallet?
Is there a test for topic drift, or settings to minimise the effect of topic drift so I could use the gensim streaming approach?
thanks,
Brenda
On Tuesday, November 4, 2014 7:53:23 AM UTC+1, Brenda Moon wrote:
> What is the best way to find the perplexity when using models.ldamallet?

AFAIR, Mallet displays the perplexity to stdout -- would that be enough for you? Capturing these values programmatically should be possible too, but I haven't looked into that. Hopefully Mallet has some API call for perplexity eval too, but it's certainly not included in the wrapper.
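Scraping it out of a saved console log shouldn't be hard either -- a sketch (the `LL/token:` line format is taken from the Mallet output quoted later in this thread, so treat the regex as an assumption):

```python
import re

# Mallet periodically prints lines like "<1000> LL/token: -7.86552" to stdout.
# Given the captured log text, pull out every LL/token value and convert the
# last one to a perplexity via 2^(-LL/token).
def mallet_perplexity(log_text):
    lls = [float(m) for m in re.findall(r"LL/token:\s*(-?\d+\.\d+)", log_text)]
    if not lls:
        return None
    return 2 ** (-lls[-1])

sample = "[beta: 0.04343]\n<1000> LL/token: -7.86552\n"
print(round(mallet_perplexity(sample), 1))  # -> 233.2
```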
Speed and quality -- how do you preprocess the corpus? Do you do any parsing transformations on-the-fly, during corpus iteration? Do you use the same settings (namely alpha) in all cases?
> Is there a test for topic drift, or settings to minimise the effect of topic drift so I could use the gensim streaming approach?

Sure. Just use batch LDA (by setting update_every=0). This will make an M-step (= model update) only once after each full corpus pass. This is equivalent to the "original" Blei variational LDA.
The default version (update_every > 0) corresponds to Matt Hoffman's online variational LDA, where model update is performed once after `update_every` chunks of documents (not entire corpus).
Hello Radim,

Thanks for your quick reply. I am new to topic modelling and have been learning by following the gensim tutorials and then running a lot of tests on my data.

On 4 November 2014 20:02, Radim Řehůřek <m...@radimrehurek.com> wrote:
> AFAIR, Mallet displays the perplexity to stdout -- would that be enough for you? Capturing these values programmatically should be possible too, but I haven't looked into that. Hopefully Mallet has some API call for perplexity eval too, but it's certainly not included in the wrapper.

Yes, I've been using the console output from ldamallet; I like being able to see the topic weights as well as the weights of the words in each topic. I hadn't noticed that it reports the perplexity. On my last run it reported:

[beta: 0.04343]
<1000> LL/token: -7.86552

So that's the log likelihood per token (which I think relates to perplexity?) and the optimised beta (it's not clear to me what the beta tells me).
> Speed and quality -- how do you preprocess the corpus? Do you do any parsing transformations on-the-fly, during corpus iteration? Do you use the same settings (namely alpha) in all cases?

The corpus is made from the words remaining after the raw words from the tweets have had stemming, stopword removal and normalisation applied, then saved:
corpora.MmCorpus.serialize(data_path + 'words_cleaned_corpus_jan.mm', corpus_memory_friendly)
corpus_memory_friendly.dictionary.save(data_path + 'words_cleaned_corpus_jan.dict')
I then use:
new_dict.filter_extremes(no_below=20, no_above=0.5, keep_n=100000)
to further reduce the corpus before running LDA. I save the new dictionary and transform the corpus to match the compacted dictionary and save that.
I've done a lot of testing with different alpha settings (auto, asymmetric, symmetric) to see if that helps to get coherent topics. I have also varied many of the other parameters - I'll try to make a summary of those tests and runtimes tomorrow.
> Sure. Just use batch LDA (by setting update_every=0). This will make an M-step (=model update) only once after each full corpus pass. This is equivalent to the "original" Blei's variational LDA.

I have used batch LDA in my testing and will include those results in the summary tomorrow. Batch LDA seems a lot slower than the online variational LDA, and the new LdaMulticore doesn't seem to support batch mode.
> The default version (update_every > 0) corresponds to Matt Hoffman's online variational LDA, where model update is performed once after `update_every` chunks of documents (not entire corpus).

I was hoping to find a way to detect topic drift when using online variational LDA, because I'd like to understand how much the poor topic results are due to topic drift versus the very small documents I'm working with. Perhaps a summary of my results will help with this.
> [beta: 0.04343]
> <1000> LL/token: -7.86552
> So that's the log likelihood/token (which is perplexity I think?) and the optimised beta (not clear what the beta tells me).

2^(-LL/token), yes. So ~233.2 in this case.
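Spelling the conversion out in plain Python:

```python
# perplexity = 2^(-LL/token); checking the figure quoted above
ll_per_token = -7.86552
perplexity = 2 ** (-ll_per_token)
print(round(perplexity, 1))  # -> 233.2
```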
> I have used batch LDA in my testing, will include those results in the summary tomorrow. The batch LDA seems a lot slower than the online variational LDA, and the new multicoreLDA doesn't support batch mode.

It does -- there's a parameter called `batch` :-) Or do you mean it didn't work for you?
run, word count, alpha, iterations, heldout corpus/words, runtime(hours), perplexity
==100 topics, online training, models.LdaModel
1, 318823, symmetric, 50, 821/8088, 1.5, 13168.3
2, 318823, auto, 50, 821/8088, 1.24, 9940.6
3, 318823, auto, 10, 821/8088, 1.21, 9851.3
4, 50350, auto, 10, 821/6914, 0.28, 662.1
5, 18351, auto, 10, 821/5962, 0.12, 391.6
==100 topics, online training, models.LdaMulticore
6, 18351, symmetric, 10, 821/5962, 0.07, 465.4
7, 18351, loaded, 10, 821/5962, 0.06, 384.6
8, 18351, asymmetric, 10, 821/5962, 0.07, 571.8
==100 topics, batch training with 10 passes, models.LdaMulticore
9, 18351, symmetric, 10, 2821/20589, 0.73, 438.4
==100 topics, online training with 10 passes, models.LdaMulticore
10, 18351, symmetric, 10, 821/5962, 0.60, 492.5
==100 topics, 1000 iterations (passes), models.LdaMallet
11, 18351, optimize, 10, unknown, 0.31, 199.9 (from Mallet's LL/token, converted using 2^(-LL/token))
0.148*grade, 0.086*daily, 0.069*fail, 0.045*midterm, 0.035*young, 0.033*site, 0.029*discussion, 0.025*resource, 0.023*beauty, 0.021*user-virtualastro
0.086*teaching, 0.068*information, 0.064*law, 0.060*library, 0.054*secret, 0.052*trick, 0.040*ready, 0.031*uploaded, 0.026*attraction, 0.026*ebook
0.175*center, 0.104*please, 0.103*always, 0.091*use, 0.079*mind, 0.056*remember, 0.040*turn, 0.026*air, 0.019*shut, 0.019*fit
0.248*need, 0.073*reading, 0.060*business, 0.046*phd, 0.039*automatic, 0.029*publishing, 0.025*audio, 0.023*africa, 0.023*along, 0.017*suggestion
0.215*scientist, 0.078*truth, 0.054*bored, 0.047*snow, 0.045*break, 0.041*bird, 0.039*researcher, 0.038*winter, 0.031*bible, 0.025*james
0.107*google, 0.100*fair, 0.084*online, 0.036*global, 0.025*first, 0.020*world, 0.017*come, 0.015*push, 0.014*gone, 0.014*launch
0.116*christian, 0.094*monitor, 0.028*news, 0.009*winter, 0.009*iphone, 0.008*obama, 0.008*egypt, 0.007*year, 0.006*protest, 0.006*shark
0.125*stupid, 0.065*bracelet, 0.041*homeopathy, 0.033*pseudo, 0.029*user-donttrythis, 0.029*debunk, 0.014*toy, 0.012*still, 0.008*behind, 0.007*hashtag-agw
0.070*state, 0.031*union, 0.025*phone, 0.023*obama, 0.017*christian, 0.016*address, 0.014*monitor, 0.013*mobile, 0.010*blast, 0.010*baby
0.044*week, 0.035*thank, 0.027*essay, 0.017*think, 0.014*done, 0.014*due, 0.012*english, 0.011*sleep, 0.010*day, 0.008*re
0.083*home, 0.063*channel, 0.043*already, 0.040*set, 0.040*nerd, 0.034*soon, 0.033*new, 0.033*planet, 0.024*astrology, 0.024*mom
0.222*google, 0.144*online, 0.103*found, 0.041*dead, 0.039*issue, 0.036*available, 0.020*mother, 0.019*gift, 0.018*find, 0.017*poor
0.077*looking, 0.073*lot, 0.049*half, 0.038*park, 0.037*break, 0.025*wanted, 0.024*biggest, 0.024*kinda, 0.024*forward, 0.023*five
0.409*fair, 0.180*project, 0.061*grade, 0.047*feel, 0.019*need, 0.016*south, 0.016*work, 0.015*method, 0.014*applied, 0.011*quote
0.129*teach, 0.095*philosophy, 0.059*sex, 0.051*baby, 0.038*phd, 0.035*cuz, 0.032*symphony, 0.030*eat, 0.020*bear, 0.017*indeed
Perplexity after each of the 10 passes, batch training (run 9):
[767.3, 612.2, 556.6, 524.1, 501.6, 482.6, 469.0, 458.1, 448.1, 438.4]

Perplexity after each of the 10 passes, online training (run 10):
[473.8, 468.7, 446.9, 450.3, 481.0, 478.2, 485.2, 486.3, 487.9, 492.5]
Hello,