Hello Amir,
the "Lee corpus" contains only 300 documents, so it's quite possible the LDA results are poor, especially when asking for 200 topics.
Also, you're doing only a single online training pass and not removing common words. Try more passes (e.g. 50), filter out common words with dictionary.filter_extremes, use plain bag-of-words vectors (drop the log-entropy transformation), use fewer topics, etc.
I tried LDA on Lee and got a correlation to human judgements of around 0.4 (still way worse than LSI's 0.6).
You could also try LDA based on Gibbs sampling (gensim's LdaModel uses variational inference), which is a different method of training the model.
Incidentally, I just pushed a wrapper for Mallet's LDA to gensim. Mallet is a great Java implementation that uses Gibbs sampling, hyperparameter optimization, etc. You can find the wrapper under `gensim.models.ldamallet`, in the develop branch of gensim on GitHub.
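Usage looks roughly like the sketch below. The Mallet install path is an assumption you'd replace with your own; the import is guarded so the snippet degrades gracefully when Mallet isn't installed:

```python
import os

# Assumed location of the Mallet binary -- adjust to your installation.
mallet_path = "/path/to/mallet-2.0.7/bin/mallet"

if os.path.exists(mallet_path):
    # Wrapper available in gensim's develop branch at the time of writing.
    from gensim.models.ldamallet import LdaMallet
    from gensim import corpora

    docs = [["graph", "trees"], ["graph", "minors", "trees"]]  # toy data
    dictionary = corpora.Dictionary(docs)
    corpus = [dictionary.doc2bow(doc) for doc in docs]

    model = LdaMallet(mallet_path, corpus=corpus, num_topics=2,
                      id2word=dictionary)
    print(model.show_topics())
else:
    print("Mallet binary not found; download and install Mallet first")
```

Under the hood the wrapper shells out to the Mallet command-line tool, so the Java implementation does the actual Gibbs sampling.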
Best,
Radim