online LDA with Associated Press corpus


Andrew Wan

unread,
Oct 9, 2014, 2:43:58 PM10/9/14
to gen...@googlegroups.com
Hi, 

I'm new to gensim and topic modeling in general.  I'm trying to compare the performance of the LdaModel and LdaMallet implementations for an application, and I started with the Associated Press corpus from here: http://www.cs.princeton.edu/~blei/lda-c/.  I'm mainly concerned with the quality of the topics assigned to each document, as I'm assuming for my setting that LdaModel will be more efficient.  The AP corpus is only ~2400 documents, and I'm wondering how its size might affect the quality of the LdaModel output.  So far, I can't get it to match the quality and consistency of the topics generated by the Mallet implementation.

I'm wondering if this is to be expected on a corpus of this size, or if I'm not using the algorithm correctly on the corpus (should I do different preprocessing than I would with mallet; how should I set the number of passes, number of iterations and the decay)?  
Does the class have some variable that measures the quality of the model and whether it converged? I thought that maybe because the corpus is small, the algorithm might not be converging, but I'm not sure of the best way to push it towards convergence. E.g., should I expect similar-quality results with chunk size 200 on a 20,000-document corpus, vs. chunk size 100 on a 10,000-document corpus, vs. chunk size 100 on a 10,000-document corpus with 2 passes? Also, I read in the paper that on a smaller corpus, a smaller chunk size seemed to work well, with decay closer to 1 and with tau >> 1 (in the gensim implementation, there doesn't seem to be an option to set tau?).  Sorry if these questions don't make sense, I'm new to all this... 

Thanks, 
Andrew

Radim Řehůřek

unread,
Oct 10, 2014, 5:31:50 AM10/10/14
to gen...@googlegroups.com
Hello Andrew,


On Thursday, October 9, 2014 8:43:58 PM UTC+2, Andrew Wan wrote:
Hi, 

I'm new to gensim and topic modeling in general.  I'm trying to compare the performance of the LdaModel and LdaMallet implementations for an application, and I started with the Associated Press corpus from here: http://www.cs.princeton.edu/~blei/lda-c/.  I'm mainly concerned with the quality of the topics assigned to each document, as I'm assuming for my setting that LdaModel will be more efficient.  The AP corpus is only ~2400 documents, and I'm wondering how its size might affect the quality of the LdaModel output.  So far, I can't get it to match the quality and consistency of the topics generated by the Mallet implementation.

I'm wondering if this is to be expected on a corpus of this size, or if I'm not using the algorithm correctly on the corpus (should I do different preprocessing than I would with mallet; how should I set the number of passes, number of iterations and the decay)?  

That depends on what you're actually doing -- what parameters are you using? Can you share your code?

The biggest impact will probably come from:

a) alpha & eta hyperparameters (and their default values differ a lot between mallet vs. gensim!)
b) preprocessing, esp. stop words removal

I'm actually curious about this too; I'll compare them on that public AP dataset myself when I get time... hopefully this weekend.


Does the class have some variable that measures the quality of the model and whether it converged?

that would be `model.log_perplexity(test_corpus)` -- it outputs the model's per-word perplexity (and the variational bound).

 
I thought that maybe because the corpus is small, the algorithm might not be converging, but I'm not sure of the best way to push it towards convergence. E.g., should I expect similar-quality results with chunk size 200 on a 20,000-document corpus, vs. chunk size 100 on a 10,000-document corpus, vs. chunk size 100 on a 10,000-document corpus with 2 passes? Also, I read in the paper that on a smaller corpus, a smaller chunk size seemed to work well, with decay closer to 1 and with tau >> 1 (in the gensim implementation, there doesn't seem to be an option to set tau?).  Sorry if these questions don't make sense, I'm new to all this... 

Yes, the default parameters are optimized for large corpora. For tiny corpora (2k docs), other settings may indeed perform better. You can set decay in the parameters; `tau` is fixed at 1.0.

But I suspect most of the difference will come from a) and b) above, plus the small corpus size, rather than from tweaking micro-settings. And such micro tweaks wouldn't translate well to other corpora anyway; you'd probably have to fine-tune all over again for a different corpus.

I'll ping this thread with my results later :)

Best,
Radim




Andrew Wan

unread,
Oct 10, 2014, 4:58:37 PM10/10/14
to gen...@googlegroups.com
Hi Radim, 
Thanks a lot!  It didn't occur to me that the default settings of the hyperparameters would be so different.  I had tried setting alpha='auto' with little success, but after your post I set alpha=0.5 (I think this is the default in Mallet?), and the topics seem to have improved.  I should probably go back and understand what that parameter really means...
As for preprocessing, I am directly using the ap.dat file from Blei's site and vocab.txt as the dictionary.  I saw that the Mallet wrapper uses a stopword set from the Mallet package, so I downloaded those and removed them from the corpus before running LDA.  Doing this didn't seem to help too much (before setting alpha), nor did preprocessing with tf-idf.  Anyways, I'll keep you updated as I experiment with the other hyperparameters, but definitely let me know what you find, since I might be doing it all wrong :). 

As for code, I'm happy to send you an email with the scraps that I have, but I'm not doing anything beyond what's mentioned above and then calling 
`lda_model = models.LdaModel(corpus=corpus, id2word=dictionary, num_topics=100, passes=10)`, sometimes setting the other parameters as I'd mentioned.  I set passes to 10 because I was getting a warning about too few passes.  

Thanks again!

Christopher S. Corley

unread,
Oct 10, 2014, 5:15:35 PM10/10/14
to gensim
Just to chime in, the default alpha in Mallet is 50/num_topics for each component
of the alpha vector. A bit strange. Whatever alpha you pass to Mallet will always
be divided by the number of topics.

cheers,
Chris.


Andrew Wan

unread,
Oct 14, 2014, 9:24:23 AM10/14/14
to gen...@googlegroups.com
Hi Chris, 
Good point...  It seems that the default setting for alpha in gensim's LDA also depends on the number of topics, although when you set the parameter manually it is not divided by the number of topics...