Dear all,
I have been trying to evaluate the LDA topic model from Gensim, and I am getting very strange results.
My method:
- I used a labeled corpus (e.g. the 20 Newsgroups dataset)
- I did some text cleaning (removing stopwords, punctuation, etc.)
- Used bag-of-words vectorization.
- Applied Gensim's Latent Dirichlet Allocation with: the number of topics equal to the number of labels (20 in this case), alpha and eta (the beta prior) set to 'auto', 2000 iterations, and chunksize equal to the number of documents in the corpus. The remaining parameters keep their default values.
- Fitted the model to the corpus.
- Constructed the contingency table (rows are labels, columns are topics, and each cell contains the number of documents that carry that label and are assigned to that topic).
- Computed the Adjusted Rand Index [1] and the Purity [2] from the contingency table.
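Concretely, the contingency-table and metric computation looks roughly like this (a minimal sketch, not my exact code; the helper names and the toy label/topic vectors are only for illustration):

```python
import numpy as np
from sklearn.metrics import adjusted_rand_score

def contingency_table(labels, topics):
    """Rows = gold labels, columns = assigned topics;
    cell [l, t] counts documents with label l assigned to topic t."""
    table = np.zeros((max(labels) + 1, max(topics) + 1), dtype=int)
    for l, t in zip(labels, topics):
        table[l, t] += 1
    return table

def purity(table):
    """For each topic (column) take its majority-label count,
    then divide the total by the number of documents."""
    return table.max(axis=0).sum() / table.sum()

# Toy example: 6 documents, 2 gold labels, topics assigned by some model.
labels = [0, 0, 0, 1, 1, 1]
topics = [2, 2, 1, 1, 1, 1]
table = contingency_table(labels, topics)
print(purity(table))                        # 5/6 ~ 0.833
print(adjusted_rand_score(labels, topics))  # ARI works directly on the two labelings
```

ARI can be computed directly from the two labelings, so the contingency table is only strictly needed for Purity.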
I did the same thing using the MALLET and scikit-learn implementations of Latent Dirichlet Allocation with the same parameters.
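For the scikit-learn run, the hard assignment of each document to its most probable topic is a step where implementations can diverge, so to be explicit I take the argmax of the document-topic distribution. A minimal sketch with a toy corpus (parameter names follow scikit-learn, where alpha and beta correspond to doc_topic_prior and topic_word_prior):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# Toy corpus standing in for the cleaned labeled documents.
docs = ["apple banana fruit", "banana fruit salad",
        "goal match soccer", "soccer match team"]
X = CountVectorizer().fit_transform(docs)

# alpha/beta are doc_topic_prior/topic_word_prior in scikit-learn;
# leaving them as None uses the default 1 / n_components.
lda = LatentDirichletAllocation(n_components=2, max_iter=50, random_state=0)
doc_topic = lda.fit_transform(X)    # each row is P(topic | document), sums to 1
topics = doc_topic.argmax(axis=1)   # hard assignment: most probable topic
```

The resulting `topics` vector is what goes into the contingency table as the column index for each document.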
These are my results (Package = AVG +/- STD):
GENSIM LDA ARI = 0.023 +/- 0.007
MALLET LDA ARI = 0.483 +/- 0.018
SKLEARN LDA ARI = 0.419 +/- 0.036
GENSIM LDA PURITY = 0.297 +/- 0.022
MALLET LDA PURITY = 0.709 +/- 0.016
SKLEARN LDA PURITY = 0.701 +/- 0.021
As these results show, the MALLET and scikit-learn LDA implementations give similar results, but the results I get with Gensim's LDA are far off.
Am I doing something wrong? Why are the Gensim results so different?
Thank you very much,
Ciprian Truică
[1] Wagner, S., & Wagner, D. (2007). Comparing clusterings: an overview. Karlsruhe: Universität Karlsruhe, Fakultät für Informatik.
[2] Zhao, Y., & Karypis, G. (2001). Criterion functions for document clustering: Experiments and analysis (Vol. 1, p. 40). Technical report.