Evaluation of LDA model on Lee data set


Amir H. Jadidinejad

Feb 22, 2014, 2:24:15 AM
to gen...@googlegroups.com
I'm using "test_lee.py" and edited the following line to evaluate LDA instead of LSA:
lsi = models.LdaModel(bg_corpus_ent, id2word=dictionary, num_topics=200)
The output correlation is:
[[ 1.          0.20640225]
 [ 0.20640225  1.        ]]
C:\Users\PSI\AppData\Local\Enthought\Canopy\User\lib\site-packages\gensim-0.8.9-py2.7.egg\gensim\models\ldamodel.py:453: RuntimeWarning: overflow encountered in exp2
  (perwordbound, numpy.exp2(-perwordbound), len(chunk), corpus_words))

Is this the real performance of LDA (r ≈ 0.2), or is something wrong?
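
For reference, the RuntimeWarning in the output above looks unrelated to the correlation itself: judging from the quoted source line, `numpy.exp2(-perwordbound)` is evaluated only to format a log message. A float64 cannot hold values of about 2**1024 or more, so a sufficiently negative per-word bound overflows to infinity. A minimal sketch, where the bound value is a made-up assumption and not taken from this run:

```python
import warnings

import numpy as np

# Hypothetical per-word bound; the real value from the run is not shown in the thread.
perwordbound = -1100.0

with warnings.catch_warnings():
    # Silence the same "overflow encountered in exp2" warning for this demo.
    warnings.simplefilter("ignore", RuntimeWarning)
    perplexity = np.exp2(-perwordbound)  # 2**1100 does not fit in a float64

print(perplexity)          # infinite perplexity estimate
print(np.exp2(10.0))       # a moderate exponent stays finite
```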

Radim Řehůřek

Mar 9, 2014, 5:54:47 PM
to gen...@googlegroups.com, amir....@yahoo.com
Hello Amir,

the "Lee corpus" contains only 300 documents. It's quite possible the LDA results are not good, esp. when asking for 200 topics.

Also, you're doing only a single online training pass (try more passes, like 50) and not removing common words (try dictionary.filter_extremes). You could also use plain bag-of-words vectors (drop the log-entropy transformation), use fewer topics, etc.

I tried LDA on Lee and got a correlation to human judgements of around 0.4 (which is still way worse than LSI's 0.6).
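
For context, a correlation figure like this is the off-diagonal entry of a 2x2 matrix such as `numpy.corrcoef` returns. A rough sketch of the computation follows, assuming two square, symmetric document-similarity matrices (one from the model, one from human ratings); the function name and the toy numbers are hypothetical.

```python
import numpy as np


def similarity_correlation(model_sim, human_sim):
    """Pearson r between the upper triangles of two similarity matrices."""
    model_sim = np.asarray(model_sim, dtype=float)
    human_sim = np.asarray(human_sim, dtype=float)
    # Each document pair is counted once; the diagonal (self-similarity) is skipped.
    upper = np.triu_indices(model_sim.shape[0], k=1)
    corr_matrix = np.corrcoef(model_sim[upper], human_sim[upper])
    # corr_matrix is 2x2 with ones on the diagonal; r is the off-diagonal entry.
    return corr_matrix[0, 1]


# Toy 3-document example (made-up values).
human = np.array([[1.0, 0.8, 0.1],
                  [0.8, 1.0, 0.3],
                  [0.1, 0.3, 1.0]])
model = np.array([[1.0, 0.6, 0.2],
                  [0.6, 1.0, 0.1],
                  [0.2, 0.1, 1.0]])

r = similarity_correlation(model, human)
print(r)
```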

You could also try LDA based on Gibbs sampling (gensim's LdaModel is based on variational inference), which is a different method of training the model.

Incidentally, I just pushed a wrapper for Mallet's LDA to gensim. Mallet is a great Java implementation that uses Gibbs sampling, hyperparameter optimization, etc. You can find the wrapper under `gensim.models.ldamallet`, using the develop branch of gensim from GitHub.

Best,
Radim
--
Radim Řehůřek, Ph.D.
consulting @ machine learning, natural language processing, big data
 

Yuanhan Mo

Mar 13, 2014, 6:47:04 PM
to gen...@googlegroups.com, amir....@yahoo.com
Can you show an example of how to use this ldamallet() interface?

Thanks

Max

Radim Řehůřek

Mar 13, 2014, 7:05:00 PM
to gen...@googlegroups.com, amir....@yahoo.com

On Thursday, March 13, 2014 11:47:04 PM UTC+1, Yuanhan Mo wrote:
Can you show an example of how to use this ldamallet() interface?


One example (link to code) is in the recent thread on Lee corpus: https://groups.google.com/forum/#!topic/gensim/rp1Vs4H_0A4

There's nothing special about it, though: it's used exactly like any other model class, except that you give it one extra parameter, the path to Mallet.
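
As a rough sketch of what that means in practice (not a tested recipe): the helper name below is hypothetical, the import location depends on the gensim version, and actually calling the helper requires a local Mallet installation whose path you supply.

```python
def train_mallet_lda(mallet_path, bow_corpus, dictionary, num_topics=20):
    """Train LDA through gensim's Mallet wrapper (hypothetical helper).

    mallet_path is the extra parameter: the path to the Mallet executable.
    The remaining arguments mirror what LdaModel takes.
    """
    try:
        # Module location named in this thread (develop branch at the time).
        from gensim.models.ldamallet import LdaMallet
    except ImportError:
        # Assumed location in later releases; some versions may ship no wrapper at all.
        from gensim.models.wrappers import LdaMallet
    return LdaMallet(mallet_path, corpus=bow_corpus,
                     num_topics=num_topics, id2word=dictionary)
```

The imports sit inside the function so that defining the helper does not fail on installations without the wrapper.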

HTH,
Radim

Yuanhan Mo

Mar 13, 2014, 7:30:26 PM
to gen...@googlegroups.com, amir....@yahoo.com
Yeah, I just checked your source code on GitHub, and it's clear to me how it works.

Yuanhan Mo

Mar 13, 2014, 7:32:59 PM
to gen...@googlegroups.com
Forgot to say: thank you very much for this excellent library.