LDA model strange behaviour


Vladimir Zaytsev

Dec 17, 2013, 5:46:06 PM
to gen...@googlegroups.com, Ekaterina Ovchinnikova

I'm trying to apply the gensim package for topic modelling in one of my tasks. I followed the tutorial on your website, but I got strange results. I'm not sure; maybe I'm making some obvious mistake.

To compute the LDA model, I used 200 as the number of topics and left the other parameters at their defaults. My dataset is English Gigaword with 5.5M documents. It was lemmatised, then I built a dictionary and filtered stop words using filter_extremes(no_below=5, no_above=0.5, keep_n=None), which produced a new dictionary with 1,427,924 unique tokens. Then I vectorized my corpus and got a ~20GB matrix in MM format.
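
Roughly, the whole pipeline looks like this (a simplified sketch; the file name and the iter_documents() helper are placeholders, not my actual code):

from gensim import corpora, models

d = corpora.Dictionary(iter_documents())                    # iter_documents() yields lists of lemmatised tokens
d.filter_extremes(no_below=5, no_above=0.5, keep_n=None)    # leaves ~1.4M unique tokens
corpora.MmCorpus.serialize('gigaword_bow.mm',
                           (d.doc2bow(doc) for doc in iter_documents()))
mm = corpora.MmCorpus('gigaword_bow.mm')                    # ~20GB matrix in MM format
lda = models.LdaModel(mm, id2word=d, num_topics=200)        # everything else at defaults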

There are a couple of strange things that I noticed:

After the LDA model is computed, I tried to get the topic distribution for the fixed token "work", but each time gensim returned a different distribution:

>>> lda.__getitem__(d.doc2bow(["work"]), 0.1)

[(25, 0.50250000000000461)]

>>> lda.__getitem__(d.doc2bow(["work"]), 0.1)

[(121, 0.50250000000000195)]

>>> lda.__getitem__(d.doc2bow(["work"]), 0.1)

[(136, 0.5025000000000015)]

>>> lda.__getitem__(d.doc2bow(["work"]), 0.1)

[(25, 0.50250000000000461)]

>>> lda.__getitem__(d.doc2bow(["work"]), 0.1)

[(109, 0.50250000000000228)]

>>> lda.__getitem__(d.doc2bow(["work"]), 0.1)

[(136, 0.5025000000000015)]

>>> lda.__getitem__(d.doc2bow(["work"]), 0.1)

[(121, 0.50250000000000195)]

I'm not sure why this happens. Is it because something is wrong with my model?


Also, if we look at the top K words for different topics, some words (for example "his") dominate with pretty high probability in many of them:

>>> lda.print_topic(1, 10)

'0.008*his + 0.006*_ + 0.005*who + 0.005*we + 0.005*they + 0.004*had + 0.004*were + 0.004*after + 0.004*: + 0.004*--'

>>> lda.print_topic(2, 10)

'0.010*; + 0.009*his + 0.007*new + 0.005*i + 0.005*who + 0.004*had + 0.003*more + 0.003*they + 0.003*after + 0.003*were'

>>> lda.print_topic(20, 10)

'0.012*his + 0.006*who + 0.006*i + 0.005*we + 0.005*they + 0.005*had + 0.004*about + 0.004*been + 0.004*more + 0.004*their'

>>> lda.print_topic(21, 10)

'0.010*i + 0.006*his + 0.005*they + 0.005*; + 0.005*: + 0.004*who + 0.004*were + 0.004*had + 0.003*new + 0.003*we'

>>> lda.print_topic(23, 10)

"0.005*who + 0.005*they + 0.004*had + 0.004*: + 0.004*i + 0.004*n't + 0.004*his + 0.004*we + 0.004*or + 0.004*one"

This also looks suspicious to me.

Maybe it is caused by the large number of stop words and the noise they produce? My hypothesis is that documents in Gigaword are relatively short (5-15 sentences), so no_below=5, no_above=0.5 were not enough to filter the stop words out, and an additional stoplist may help.

Also, on the website you claim that on your machine it took about 5 hours to compute a model for Wikipedia; what was the corpus size? On our machine (16 cores & 32GB RAM, though I noticed that only 1 core was used) it took about 3-4 days to process the 20GB feature matrix. Again, maybe setting some parameters to values other than the defaults would help.

Thank you in advance,

-vladimir

Radim Řehůřek

Dec 23, 2013, 5:59:59 PM
to gen...@googlegroups.com, Ekaterina Ovchinnikova
Hello Vladimir,


On Tuesday, December 17, 2013 11:46:06 PM UTC+1, Vladimir Zaytsev wrote:

To compute the LDA model, I used 200 as the number of topics and left the other parameters at their defaults. My dataset is English Gigaword with 5.5M documents. It was lemmatised, then I built a dictionary and filtered stop words using filter_extremes(no_below=5, no_above=0.5, keep_n=None), which produced a new dictionary with 1,427,924 unique tokens. Then I vectorized my corpus and got a ~20GB matrix in MM format.


So you're using a dictionary of 1.5M tokens? I assume that's on purpose... just wondering why so many.


There are a couple of strange things that I noticed:

After the LDA model is computed, I tried to get the topic distribution for the fixed token "work", but each time gensim returned a different distribution:

I'm not sure why this happens. Is it because something is wrong with my model?


Can you post `lda.state.get_lambda()[:, d.token2id['work']]`? It should be a 200-dimensional vector.
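
Something along these lines (just a sketch, assuming `lda` is your trained model and `d` your dictionary):

import numpy as np

lam = lda.state.get_lambda()          # shape (num_topics, num_terms): variational pseudo-counts
col = lam[:, d.token2id['work']]      # pseudo-counts of "work" under each of the 200 topics
probs = col / lam.sum(axis=1)         # rough per-topic probability of "work"
print(np.argsort(probs)[::-1][:10])   # topics where "work" carries the most weight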

Also, if we look at the top K words for different topics, some words (for example "his") dominate with pretty high probability in many of them:

This also looks suspicious to me.

Maybe it is caused by the large number of stop words and the noise they produce? My hypothesis is that documents in Gigaword are relatively short (5-15 sentences), so no_below=5, no_above=0.5 were not enough to filter the stop words out, and an additional stoplist may help.


Yes, very possible. LDA is picky about corpus preprocessing. The recent patch by Ben (automatically tuned asymmetric alpha) may help here; see http://radimrehurek.com/2013/12/python-lda-in-gensim-christmas-edition/ .
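
If you want to try it, it should only need the extra parameter (a sketch, assuming a gensim version that already includes the patch):

lda = models.LdaModel(mm, id2word=d, num_topics=200, alpha='auto')   # learn an asymmetric alpha from the data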

Or, like you say, tweak filter_extremes / use stopwords. An English stopword set is in `from gensim.parsing import STOPWORDS`.
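
For example, something like this should remove them from the dictionary before you vectorize (a sketch, assuming `d` is your Dictionary):

from gensim.parsing import STOPWORDS

stop_ids = [d.token2id[w] for w in STOPWORDS if w in d.token2id]
d.filter_tokens(bad_ids=stop_ids)   # drop the stopword ids
d.compactify()                      # reassign ids to fill the gaps
# note: the corpus then needs to be re-built (doc2bow) with the updated dictionary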



Also, on the website you claim that on your machine it took about 5 hours to compute a model for Wikipedia; what was the corpus size? On our machine (16 cores & 32GB RAM, though I noticed that only 1 core was used) it took about 3-4 days to process the 20GB feature matrix. Again, maybe setting some parameters to values other than the defaults would help.


I suspect this is due to the 1.5M dictionary. The bigger the dictionary, the longer training takes (and the more memory it needs). I think the tutorial keeps the 50k most common words in filter_extremes, whereas you use all 1.5M.
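
Something like (a sketch):

d.filter_extremes(no_below=5, no_above=0.5, keep_n=50000)   # keep only the 50k most frequent surviving tokens
print(d)                                                    # should now report 50,000 unique tokens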

Quality of the BLAS library that NumPy/SciPy link against may also affect performance, by as much as an order of magnitude.
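
You can check which BLAS your NumPy is linked against with:

import numpy as np
np.show_config()   # look for an optimised BLAS (ATLAS/OpenBLAS/MKL) rather than the reference implementation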

Best,
Radim

Vladimir Zaytsev

Dec 27, 2013, 8:44:20 PM
to gen...@googlegroups.com, Ekaterina Ovchinnikova
Radim,

It looks like the huge dictionary size was the issue: setting keep_n=50000 and using a stop list fixed both the computation time and the strange model behaviour.
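
For reference, what I ended up with is roughly this (a sketch; iter_documents() stands in for my actual corpus reader):

from gensim import corpora, models
from gensim.parsing import STOPWORDS

d = corpora.Dictionary(iter_documents())
d.filter_tokens(bad_ids=[d.token2id[w] for w in STOPWORDS if w in d.token2id])
d.filter_extremes(no_below=5, no_above=0.5, keep_n=50000)
d.compactify()
corpora.MmCorpus.serialize('gigaword_bow_50k.mm',
                           (d.doc2bow(doc) for doc in iter_documents()))
lda = models.LdaModel(corpora.MmCorpus('gigaword_bow_50k.mm'), id2word=d, num_topics=200)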

Thanks a lot!