how to evaluate perplexity for models.ldamallet


Brenda Moon

unread,
Nov 4, 2014, 1:53:23 AM11/4/14
to gen...@googlegroups.com

I'm working on a large twitter corpus, and models.ldamallet seems to be giving more coherent topics in a much shorter run time than models.ldamodel or models.ldamulticore. I think that there is topic drift in my corpus which is confusing the model when using the gensim streaming approach.

My next step is to look at the perplexity for different numbers of topics to see if I can find the 'best' number of topics.

I tried the approach suggested by Alta de Waal in this forum post: https://groups.google.com/d/msg/gensim/63cU-6DCyeo/cJjbgoT-hEoJ

But got this error: 
AttributeError: 'LdaMallet' object has no attribute 'bound'

What is the best way to find the perplexity when using models.ldamallet?

Is there a test for topic drift, or settings to minimise the effect of topic drift so I could use the gensim streaming approach?

thanks,

Brenda

Radim Řehůřek

unread,
Nov 4, 2014, 4:02:17 AM11/4/14
to gen...@googlegroups.com
Hello Brenda,


On Tuesday, November 4, 2014 7:53:23 AM UTC+1, Brenda Moon wrote:

I'm working on a large twitter corpus, and models.ldamallet seems to be giving more coherent topics in a much shorter run time than models.ldamodel or models.ldamulticore. I think that there is topic drift in my corpus which is confusing the model when using the gensim streaming approach.

My next step is to look at the perplexity for different numbers of topics to see if I can find the 'best' number of topics.

I tried the approach suggested by Alta de Waal in this forum post: https://groups.google.com/d/msg/gensim/63cU-6DCyeo/cJjbgoT-hEoJ

But got this error: 
AttributeError: 'LdaMallet' object has no attribute 'bound'

What is the best way to find the perplexity when using models.ldamallet?

AFAIR, Mallet displays the perplexity to stdout -- would that be enough for you? Capturing these values programmatically should be possible too, but I haven't looked into that. Hopefully Mallet has some API call for perplexity eval too, but it's certainly not included in the wrapper.
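One rough way to capture those values, assuming you save Mallet's console output to a string or file first (a sketch only; the regex is based on the `<1000> LL/token: -7.86552` lines shown later in this thread and may need adjusting for other Mallet versions):

```python
import re

# Mallet periodically prints lines like "<1000> LL/token: -7.86552".
LL_LINE = re.compile(r"<(\d+)>\s*LL/token:\s*(-?\d+(?:\.\d+)?)")

def perplexities(log_text):
    """Yield (iteration, LL/token, perplexity) from captured Mallet output."""
    for m in LL_LINE.finditer(log_text):
        iteration, ll = int(m.group(1)), float(m.group(2))
        yield iteration, ll, 2 ** -ll  # perplexity = 2^(-LL/token)

sample = "[beta: 0.04343]\n<1000> LL/token: -7.86552\n"
for iteration, ll, perp in perplexities(sample):
    print(iteration, ll, round(perp, 1))  # 1000 -7.86552 233.2
```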

Speed and quality -- how do you preprocess the corpus? Do you do any parsing transformations on-the-fly, during corpus iteration? Do you use the same settings (namely alpha) in all cases?

 

Is there a test for topic drift, or settings to minimise the effect of topic drift so I could use the gensim streaming approach?


Sure. Just use batch LDA (by setting update_every=0). This will make an M-step (=model update) only once after each full corpus pass. This is equivalent to the "original" Blei's variational LDA.

The default version (update_every > 0) corresponds to Matt Hoffman's online variational LDA, where model update is performed once after `update_every` chunks of documents (not entire corpus).

Hope that helps,
Radim

 

thanks,

Brenda

Brenda Moon

unread,
Nov 4, 2014, 7:46:58 AM11/4/14
to gen...@googlegroups.com
Hello Radim,

Thanks for your quick reply. I am new to topic modelling and have been learning by following the gensim tutorials and then running a lot of tests on my data.

On 4 November 2014 20:02, Radim Řehůřek <m...@radimrehurek.com> wrote:
On Tuesday, November 4, 2014 7:53:23 AM UTC+1, Brenda Moon wrote:
What is the best way to find the perplexity when using models.ldamallet?

AFAIR, Mallet displays the perplexity to stdout -- would that be enough for you? Capturing these values programmatically should be possible too, but I haven't looked into that. Hopefully Mallet has some API call for perplexity eval too, but it's certainly not included in the wrapper.

Yes, I've been using the console output from ldamallet; I like being able to see the topic weight as well as the weight of the words in each topic. I hadn't noticed that it also reports the perplexity. On my last run it reported:

[beta: 0.04343]
<1000> LL/token: -7.86552

So that's the log likelihood/token (which is the perplexity, I think?) and the optimised beta (not clear what the beta tells me).
 
Speed and quality -- how do you preprocess the corpus? Do you do any parsing transformations on-the-fly, during corpus iteration? Do you use the same settings (namely alpha) in all cases?

The corpus is made from the words remaining after stemming, stop-word removal and normalisation have been applied to the raw tweet text, then saved:

corpora.MmCorpus.serialize(data_path + 'words_cleaned_corpus_jan.mm', corpus_memory_friendly)
corpus_memory_friendly.dictionary.save(data_path + 'words_cleaned_corpus_jan.dict')

I then use:

new_dict.filter_extremes(no_below=20, no_above=0.5, keep_n=100000)

to further reduce the corpus before running LDA. I save the new dictionary, transform the corpus to match the compacted dictionary, and save that too.

I've done a lot of testing with different alpha settings (auto, asymmetric, symmetric) to see if that helps to get coherent topics. I have also varied many of the other parameters - I'll try to make a summary of those tests and runtimes tomorrow.

Is there a test for topic drift, or settings to minimise the effect of topic drift so I could use the gensim streaming approach?

Sure. Just use batch LDA (by setting update_every=0). This will make an M-step (=model update) only once after each full corpus pass. This is equivalent to the "original" Blei's variational LDA.

I have used batch LDA in my testing and will include those results in the summary tomorrow. The batch LDA seems a lot slower than the online variational LDA, and the new LdaMulticore doesn't support batch mode.

The default version (update_every > 0) corresponds to Matt Hoffman's online variational LDA, where model update is performed once after `update_every` chunks of documents (not entire corpus).

I was hoping to find a way to detect topic drift when using online variational LDA, because I'd like to understand how much topic drift, versus the very small documents I'm working with, contributes to the poor topic results. Perhaps a summary of my results will help with this.

regards,

Brenda

Radim Řehůřek

unread,
Nov 5, 2014, 11:29:47 AM11/5/14
to gen...@googlegroups.com


On Tuesday, November 4, 2014 1:46:58 PM UTC+1, Brenda Moon wrote:
Hello Radim,

Thanks for your quick reply. I am new to topic modelling and have been learning by following the gensim tutorials and then running a lot of tests on my data.

On 4 November 2014 20:02, Radim Řehůřek <m...@radimrehurek.com> wrote:
On Tuesday, November 4, 2014 7:53:23 AM UTC+1, Brenda Moon wrote:
What is the best way to find the perplexity when using models.ldamallet?

AFAIR, Mallet displays the perplexity to stdout -- would that be enough for you? Capturing these values programmatically should be possible too, but I haven't looked into that. Hopefully Mallet has some API call for perplexity eval too, but it's certainly not included in the wrapper.

Yes, I've been using the console output from ldamallet, I like being able to see the topic weight as well as the weight of the words in the topic. I hadn't noticed that it had the perplexity. On my last run it reported:

[beta: 0.04343]
<1000> LL/token: -7.86552

So that's the log likelihood/token (which is the perplexity, I think?) and the optimised beta (not clear what the beta tells me).

2^(-LL/token), yes. So ~233.2 in this case.
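Spelled out with the figure from the run above:

```python
ll_per_token = -7.86552           # Mallet's reported LL/token
perplexity = 2 ** -ll_per_token   # perplexity = 2^(-LL/token)
print(round(perplexity, 1))       # 233.2
```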
 

 
Speed and quality -- how do you preprocess the corpus? Do you do any parsing transformations on-the-fly, during corpus iteration? Do you use the same settings (namely alpha) in all cases?

The corpus is made from words remaining after the raw words from tweets have had stemming, stopwords and normalisation applied and saved:

corpora.MmCorpus.serialize(data_path + 'words_cleaned_corpus_jan.mm', corpus_memory_friendly)
corpus_memory_friendly.dictionary.save(data_path + 'words_cleaned_corpus_jan.dict')

I then use:

new_dict.filter_extremes(no_below=20, no_above=0.5, keep_n=100000)

to further reduce the corpus before  running LDA. I save the new dictionary and transform the corpus to match the compacted dictionary and save that.

Sounds fine.

 

I've done a lot of testing with different alpha settings (auto, asymmetric, symmetric) to see if that helps to get coherent topics. I have also varied many of the other parameters - I'll try to make a summary of those tests and runtimes tomorrow.

Is there a test for topic drift, or settings to minimise the effect of topic drift so I could use the gensim streaming approach?

Sure. Just use batch LDA (by setting update_every=0). This will make an M-step (=model update) only once after each full corpus pass. This is equivalent to the "original" Blei's variational LDA.

I have used batch LDA in my testing, will include those results in the summary tomorrow. The batch LDA seems a lot slower than the online variational LDA, and the new multicoreLDA doesn't support batch mode.

It does -- there's a parameter called `batch` :-)

Or do you mean it didn't work for you?

 

The default version (update_every > 0) corresponds to Matt Hoffman's online variational LDA, where model update is performed once after `update_every` chunks of documents (not entire corpus).

I was hoping to find a way to detect topic drift when using online variational LDA because I'm interested to understand what the contribution of topic drift or the very small document size I'm working with is to the poor topic results. Perhaps a summary of my results will help with this.

Sure. Tuning LDA models is always tricky. Mallet does a great job (plus Gibbs sampling is inherently more accurate than the mean field approximations), but if you compare the two, make sure you compare against the same alpha/preprocessing, so it's apples to apples.

Let us know how your experiments went!
Radim

Brenda Moon

unread,
Nov 6, 2014, 6:26:48 AM11/6/14
to gen...@googlegroups.com
Hello Radim,


On Thursday, 6 November 2014 03:29:47 UTC+11, Radim Řehůřek wrote:
 
[beta: 0.04343]
<1000> LL/token: -7.86552

So that's the log likelihood/token (which is the perplexity, I think?) and the optimised beta (not clear what the beta tells me).

2^(-LL/token), yes. So ~233.2 in this case.

Thanks, that's made it clearer to me.

Sure. Just use batch LDA (by setting update_every=0). This will make an M-step (=model update) only once after each full corpus pass. This is equivalent to the "original" Blei's variational LDA.

I have used batch LDA in my testing, will include those results in the summary tomorrow. The batch LDA seems a lot slower than the online variational LDA, and the new multicoreLDA doesn't support batch mode.

It does -- there's a parameter called `batch` :-)

Or do you mean it didn't work for you?

I thought multicoreLDA had crashed when trying to distribute the chunks, but I've done more testing today and it's all working, so not sure what the problem was.

I'll post some comparisons of the different approaches I've tried soon, taking longer than I thought to go back through them all to decide which are useful to share.

thanks,

Brenda
 

Brenda Moon

unread,
Nov 13, 2014, 9:07:41 PM11/13/14
to gen...@googlegroups.com
I've redone my testing to make the tests easier to compare. In these tests I'm looking at one month of tweets collected using the keyword 'science' in January 2011.

I ran each of the gensim LDA models over my whole corpus with mostly the default settings. The 318,823-word corpus had no gensim filtering of the most and least frequent terms; the 50,350-word corpus used the default filtering; and the 18,351-word corpus came from removing some extra terms and raising the rare-word threshold from 5 to 20. I reduced iterations to 10 because of the short average length of my documents (tweets).

These are the final perplexity and run time results reported by gensim or mallet:



run, word count, alpha, iterations, heldout corpus/words, runtime (hours), perplexity

== 100 topics, online training, models.LdaModel
1, 318823, symmetric, 50, 821/8088, 1.5, 13168.3
2, 318823, auto, 50, 821/8088, 1.24, 9940.6
3, 318823, auto, 10, 821/8088, 1.21, 9851.3
4, 50350, auto, 10, 821/6914, 0.28, 662.1
5, 18351, auto, 10, 821/5962, 0.12, 391.6

== 100 topics, online training, models.LdaMulticore
6, 18351, symmetric, 10, 821/5962, 0.07, 465.4
7, 18351, loaded, 10, 821/5962, 0.06, 384.6
8, 18351, asymmetric, 10, 821/5962, 0.07, 571.8

== 100 topics, batch training with 10 passes, models.LdaMulticore
9, 18351, symmetric, 10, 2821/20589, 0.73, 438.4

== 100 topics, online training with 10 passes, models.LdaMulticore
10, 18351, symmetric, 10, 821/5962, 0.60, 492.5

== 100 topics, 1000 iterations (passes), models.LdaMallet
11, 18351, optimize, 10, unknown, 0.31, 199.9 (converted from Mallet's output using 2^(-LL/token))



The 'loaded' alpha used the final alpha values from the earlier gensim.LdaModel alpha='auto' run.

I'm not sure whether the perplexity from Mallet can be compared with the final perplexity results from the other gensim models, or how comparable the perplexity is between the different gensim models.

The resulting topics are not very coherent, so it is difficult to tell which are better. This is a sample of topics from run 5 (final filtered corpus, ldaModel alpha='auto'):

0.148*grade, 0.086*daily, 0.069*fail, 0.045*midterm, 0.035*young, 0.033*site, 0.029*discussion, 0.025*resource, 0.023*beauty, 0.021*user-virtualastro

0.086*teaching, 0.068*information, 0.064*law, 0.060*library, 0.054*secret, 0.052*trick, 0.040*ready, 0.031*uploaded, 0.026*attraction, 0.026*ebook

0.175*center, 0.104*please, 0.103*always, 0.091*use, 0.079*mind, 0.056*remember, 0.040*turn, 0.026*air, 0.019*shut, 0.019*fit

0.248*need, 0.073*reading, 0.060*business, 0.046*phd, 0.039*automatic, 0.029*publishing, 0.025*audio, 0.023*africa, 0.023*along, 0.017*suggestion

0.215*scientist, 0.078*truth, 0.054*bored, 0.047*snow, 0.045*break, 0.041*bird, 0.039*researcher, 0.038*winter, 0.031*bible, 0.025*james


The ones from the batch processing seem a bit clearer:

0.107*google, 0.100*fair, 0.084*online, 0.036*global, 0.025*first, 0.020*world, 0.017*come, 0.015*push, 0.014*gone, 0.014*launch

0.116*christian, 0.094*monitor, 0.028*news, 0.009*winter, 0.009*iphone, 0.008*obama, 0.008*egypt, 0.007*year, 0.006*protest, 0.006*shark

0.125*stupid, 0.065*bracelet, 0.041*homeopathy, 0.033*pseudo, 0.029*user-donttrythis, 0.029*debunk, 0.014*toy, 0.012*still, 0.008*behind, 0.007*hashtag-agw

0.070*state, 0.031*union, 0.025*phone, 0.023*obama, 0.017*christian, 0.016*address, 0.014*monitor, 0.013*mobile, 0.010*blast, 0.010*baby

0.044*week, 0.035*thank, 0.027*essay, 0.017*think, 0.014*done, 0.014*due, 0.012*english, 0.011*sleep, 0.010*day, 0.008*re


In run 10 I used the same 10 passes for the online LDA as for the batch run, to see if this helps determine whether I have topic drift. These are a sample of topics from that run:

0.083*home, 0.063*channel, 0.043*already, 0.040*set, 0.040*nerd, 0.034*soon, 0.033*new, 0.033*planet, 0.024*astrology, 0.024*mom

0.222*google, 0.144*online, 0.103*found, 0.041*dead, 0.039*issue, 0.036*available, 0.020*mother, 0.019*gift, 0.018*find, 0.017*poor

0.077*looking, 0.073*lot, 0.049*half, 0.038*park, 0.037*break, 0.025*wanted, 0.024*biggest, 0.024*kinda, 0.024*forward, 0.023*five

0.409*fair, 0.180*project, 0.061*grade, 0.047*feel, 0.019*need, 0.016*south, 0.016*work, 0.015*method, 0.014*applied, 0.011*quote

0.129*teach, 0.095*philosophy, 0.059*sex, 0.051*baby, 0.038*phd, 0.035*cuz, 0.032*symphony, 0.030*eat, 0.020*bear, 0.017*indeed


I think that these are less clear. Looking at the perplexity at the end of each pass, for the batch run it keeps dropping:
[767.3, 612.2, 556.6, 524.1, 501.6, 482.6, 469.0, 458.1, 448.1, 438.4]

While for the online run it increases again after the 3rd pass:
[473.8, 468.7, 446.9, 450.3, 481.0, 478.2, 485.2, 486.3, 487.9, 492.5]

I think this shows that I have topic drift?

Would randomly shuffling the corpus order make it possible to use online training without the topic drift being a problem?

How much clearer can I expect the topics to be?

Are there any suggestions of things I should try or anything I'm doing wrong?

I've moved onto trying to find the best number of topics, I'll write a separate post about that.

thanks,

Brenda


Ian Wood

unread,
Nov 14, 2014, 5:22:02 AM11/14/14
to gen...@googlegroups.com
I know that Mallet's alpha optimisation leads to much 'better' models (with a few caveats - you can end up with mega-topics that swamp more subtle things you're interested in). There's a paper on it where I think the perplexity scores are substantially improved. I understand it's a good idea to use lots of topics with it - if there are too many, the 'extra' ones get optimised into trivial tiny topics, effectively disappearing.

Not sure about the topic drift, but shuffling the data would seem a good idea to stop it. Perhaps comparing a run with shuffled data to what you've already got would tell you if it's happening?
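If the corpus fits in memory, the shuffle itself is one line (a sketch; for a corpus too big for RAM, you could instead shuffle a list of document indices and re-serialize in that order):

```python
import random

# Bag-of-words documents in time order (stand-ins for the real corpus).
corpus = [[(0, 2), (1, 1)], [(1, 3)], [(2, 1), (3, 2)], [(0, 1)]]

random.seed(42)            # fixed seed so the run is reproducible
shuffled = corpus[:]       # copy, so the original ordering survives
random.shuffle(shuffled)   # break the temporal ordering before training
```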

One thing I'd be careful of is retweets - repeated text flies in the face of the statistical assumptions of LDA, and draws the repeated words into the same topics in an unreasonable way. If a tweet is retweeted a thousand times or so, you'll likely end up with a topic that represents that tweet alone! 

I'd be interested to see how the models themselves compare - if it seems worth the effort, you could use Hellinger distance to find sets of similar topics, one from each model, and see which ones make most sense.
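gensim has a helper for this (gensim.matutils.hellinger, if I remember right), and the distance is simple enough to sketch directly for two topics given as probability distributions over the same vocabulary:

```python
import math

def hellinger(p, q):
    """Hellinger distance: 0 for identical distributions, 1 for disjoint ones."""
    return math.sqrt(sum((math.sqrt(a) - math.sqrt(b)) ** 2
                         for a, b in zip(p, q)) / 2)

topic_a = [0.5, 0.3, 0.2]
topic_b = [0.4, 0.4, 0.2]
print(hellinger(topic_a, topic_b))          # small: similar topics
print(hellinger(topic_a, [0.0, 0.0, 1.0]))  # larger: very different topics
```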

oli ebe

unread,
Jul 2, 2018, 11:07:27 AM7/2/18
to gensim
Hi Radim,

I am also extracting perplexity values from the LdaMallet wrapper at the moment, and I am not sure I understand why 2^(-LL/token) is the perplexity. Referring to the definition of perplexity in http://qpleple.com/perplexity-to-evaluate-topic-models/ it seems to me that LL/token is what goes into the exponential, or what am I missing here?

Best

Gautam Kishore Shahi

unread,
Feb 11, 2019, 2:33:51 PM2/11/19
to Gensim
Hello,

I am confused about LDA Mallet.

Does LDA Mallet use Gibbs sampling?

Please share your opinion.

Regards, 