Dear all,
I have been trying to evaluate the LDA topic model from Gensim, and I am getting very strange results.
My method:
- I used a labeled corpus (e.g. the 20 Newsgroups dataset)
- I did some text cleaning (removing stopwords, punctuation, etc.)
- Used bag-of-words vectorization.
- Applied Gensim's Latent Dirichlet Allocation with: the number of topics equal to the number of labels (20 in this case), alpha and eta (the beta prior) set to 'auto', 2000 iterations, and chunksize equal to the number of documents in the corpus. The remaining parameters keep their default values.
- Fitted the model to the corpus.
- Constructed the contingency table (rows are labels, columns are topics, and each cell contains the number of documents that carry that label and are assigned to that topic).
- Computed the Adjusted Rand Index [1] and the Purity [2] from the contingency table.
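Concretely, the contingency-table and metric computation looks roughly like this (a minimal sketch, not my exact code; the helper names and the toy label/topic vectors are only for illustration):

```python
import numpy as np
from sklearn.metrics import adjusted_rand_score

def contingency_table(labels, topics):
    """Rows = gold labels, columns = assigned topics;
    cell [l, t] counts documents with label l assigned to topic t."""
    table = np.zeros((max(labels) + 1, max(topics) + 1), dtype=int)
    for l, t in zip(labels, topics):
        table[l, t] += 1
    return table

def purity(table):
    """For each topic (column) take its majority-label count,
    then divide the total by the number of documents."""
    return table.max(axis=0).sum() / table.sum()

# Toy example: 6 documents, 2 gold labels, topics assigned by some model.
labels = [0, 0, 0, 1, 1, 1]
topics = [2, 2, 1, 1, 1, 1]
table = contingency_table(labels, topics)
print(purity(table))                        # 5/6 ~ 0.833
print(adjusted_rand_score(labels, topics))  # ARI works directly on the two labelings
```

ARI can be computed directly from the two labelings, so the contingency table is only strictly needed for Purity.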
I did the same thing using the MALLET and scikit-learn implementations of Latent Dirichlet Allocation with the same parameters.
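For the scikit-learn run, the hard assignment of each document to its most probable topic is a step where implementations can diverge, so to be explicit I take the argmax of the document-topic distribution. A minimal sketch with a toy corpus (parameter names follow scikit-learn, where alpha and beta correspond to doc_topic_prior and topic_word_prior):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# Toy corpus standing in for the cleaned labeled documents.
docs = ["apple banana fruit", "banana fruit salad",
        "goal match soccer", "soccer match team"]
X = CountVectorizer().fit_transform(docs)

# alpha/beta are doc_topic_prior/topic_word_prior in scikit-learn;
# leaving them as None uses the default 1 / n_components.
lda = LatentDirichletAllocation(n_components=2, max_iter=50, random_state=0)
doc_topic = lda.fit_transform(X)    # each row is P(topic | document), sums to 1
topics = doc_topic.argmax(axis=1)   # hard assignment: most probable topic
```

The resulting `topics` vector is what goes into the contingency table as the column index for each document.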
These are my results (Package = AVG +/- STD):
GENSIM LDA ARI = 0.023 +/- 0.007
MALLET LDA ARI = 0.483 +/- 0.018
SKLEARN LDA ARI = 0.419 +/- 0.036
GENSIM LDA PURITY = 0.297 +/- 0.022
MALLET LDA PURITY = 0.709 +/- 0.016
SKLEARN LDA PURITY = 0.701 +/- 0.021
As these results show, the MALLET and scikit-learn LDA implementations give similar results, but the results I get with Gensim's LDA are far off.
Am I doing something wrong? Why are the Gensim results so different?
Thank you very much,
Ciprian Truică
[1] Wagner, S., & Wagner, D. (2007). Comparing clusterings: an overview. Karlsruhe: Universität Karlsruhe, Fakultät für Informatik.
[2] Zhao, Y., & Karypis, G. (2001). Criterion functions for document clustering: Experiments and analysis (Vol. 1, p. 40). Technical report.