Comparing two Topic Models (LDA)


Prateek Mehta

Feb 3, 2016, 5:43:21 PM
to gensim
Hello Everyone,

I am aiming to evaluate two topic models (LDA) trained on two different corpora: one on the complete set of training documents and the other on a subset of the training documents.
For that, I am using log_perplexity on a completely separate set of held-out documents as the metric.

I have two specific questions:

1. How does the quality of the training documents affect the learning of an LDA model, and how can that be quantified?
2. When training an LDA model on a subset of the training documents, it sometimes gives a warning that there were not enough updates. How do I choose the number of passes and iterations automatically and make sure that the LDA model converges?

I would be very thankful if someone could help me out here.

Prateek Mehta

 

John H

Feb 5, 2016, 3:16:24 PM
to gensim
You asked:


> 1. How does the quality of the training documents affect the learning of an LDA model, and how can that be quantified?

In order to measure or quantify anything, you first need to define precisely the thing you are trying to measure or quantify. In this case, you mention "quality of training documents". If you're feeding LDA a "bag of words", that is LDA's exclusive view of your documents: each document is a bag of words.

Are the words in those "bags" semantically meaningful, often misspelled, or full of markup from wiki pages or HTML? Or are stopwords removed, words spelled correctly, with no markup? Are the documents typically short (e.g., tweets), typically long (better for LDA), or is document length normally distributed? Those are just examples. GIGO: garbage in, garbage out.

You can always find ways to artificially reduce the "quality" of your documents and then run LDA with identical hyperparameters on the "good training set" and the "bad training set". If you're using gensim, compare perplexity between the two results. However, I'm not personally convinced that any purely human-out-of-the-loop approach is the "answer" for evaluating topic model quality.


> 2. When training an LDA model on a subset of the training documents, it sometimes gives a warning that there were not enough updates. How do I choose the number of passes and iterations automatically and make sure that the LDA model converges?

See the ldamodel "update" function on this page: https://radimrehurek.com/gensim/models/ldamodel.html

Set a maximum number of iterations that you can tolerate. My understanding is that the number of passes affects the accuracy of the topic model: the higher, the more accurate (probabilistically speaking). Based on https://groups.google.com/forum/#!topic/gensim/z0wG3cojywM, passes is a useful parameter to raise when training on relatively small corpora. I suspect that the number of passes doesn't affect convergence behavior, but I'm not sure. If it doesn't, and you're only concerned with convergence, then you only need to worry about iterations.

Regarding both of your questions, these are good reads:
https://rstudio-pubs-static.s3.amazonaws.com/79360_850b2a69980c4488b1db95987a24867a.html
http://jmlr.org/proceedings/papers/v32/tang14.pdf


Prateek Mehta

May 30, 2016, 11:53:47 PM
to gen...@googlegroups.com
Hi everyone, 

It would be really helpful if anyone could help me understand the quantity gensim returns.
model.log_perplexity(heldoutCorpus) gives me this:


          25 topics         50 topics         75 topics         100 topics
model1    -12.0598461316    -13.3652520541    -14.6806344384    -15.995936448738435
model2    -12.9461469306    -14.3757049418    -16.5143267384    -18.133616543534654
model3    -11.991266939     -13.3365578803    -14.7494131703    -16.0733079648

Would you please help me understand which one is the better model? I know perplexity has been discussed quite a few times on the mailing list, but I am still confused. Which model is better at predicting held-out documents?
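For reference, my understanding from the gensim docs is that log_perplexity returns a per-word likelihood bound (log base 2), and the perplexity gensim logs during evaluation is 2 ** (-bound). Converting the numbers above (using the 25-topic column) would look like this:

```python
# Convert gensim's per-word log_perplexity bound (log base 2) into a
# perplexity via the 2 ** (-bound) relation gensim logs during evaluation.
# Values are the 25-topic column from the table above.
bounds_25 = {"model1": -12.0598461316,
             "model2": -12.9461469306,
             "model3": -11.991266939}

# A less negative bound gives a lower (better) perplexity.
perplexities = {name: 2 ** (-b) for name, b in bounds_25.items()}
```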

regards,



