I'm trying to apply the gensim package for topic modelling in one of my tasks. I followed the tutorial on your website, but I got strange results. Not sure, maybe I'm making some obvious mistake.
To compute the LDA model, I used 200 as the number of topics and left the other parameters at their default values. My dataset is English Gigaword with 5.5M documents; it was lemmatised, then I built a dictionary and filtered stop words using filter_extremes(no_below=5, no_above=0.5, keep_n=None), which produced a new dictionary with 1,427,924 unique tokens. Then I vectorized my corpus and got a ~20GB matrix in MM format.
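For reference, here is roughly how I built the dictionary and the bag-of-words corpus (a simplified sketch; lemmatized_docs stands in for my restartable iterable over lemmatized documents, each a list of tokens):

from gensim import corpora

# build the dictionary and prune rare / overly frequent tokens
d = corpora.Dictionary(lemmatized_docs)
d.filter_extremes(no_below=5, no_above=0.5, keep_n=None)

# vectorize and store the corpus in Matrix Market format (~20GB on disk)
corpora.MmCorpus.serialize('gigaword_bow.mm',
                           (d.doc2bow(doc) for doc in lemmatized_docs))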
There are a couple of strange things that I noticed:
After the LDA model is computed, I tried to get the topic distribution for a fixed token, "work"; each time, gensim returned a different distribution:
>>> lda.__getitem__(d.doc2bow(["work"]), 0.1)
[(25, 0.50250000000000461)]
>>> lda.__getitem__(d.doc2bow(["work"]), 0.1)
[(121, 0.50250000000000195)]
>>> lda.__getitem__(d.doc2bow(["work"]), 0.1)
[(136, 0.5025000000000015)]
>>> lda.__getitem__(d.doc2bow(["work"]), 0.1)
[(25, 0.50250000000000461)]
>>> lda.__getitem__(d.doc2bow(["work"]), 0.1)
[(109, 0.50250000000000228)]
>>> lda.__getitem__(d.doc2bow(["work"]), 0.1)
[(136, 0.5025000000000015)]
>>> lda.__getitem__(d.doc2bow(["work"]), 0.1)
[(121, 0.50250000000000195)]
I'm not sure why this happens; is it because something is wrong with my model?
Also, if we look at the top K words for different topics, some words (for example "his") dominate many of them with pretty high probability:
>>> lda.print_topic(1, 10)
'0.008*his + 0.006*_ + 0.005*who + 0.005*we + 0.005*they + 0.004*had + 0.004*were + 0.004*after + 0.004*: + 0.004*--'
>>> lda.print_topic(2, 10)
'0.010*; + 0.009*his + 0.007*new + 0.005*i + 0.005*who + 0.004*had + 0.003*more + 0.003*they + 0.003*after + 0.003*were'
>>> lda.print_topic(20, 10)
'0.012*his + 0.006*who + 0.006*i + 0.005*we + 0.005*they + 0.005*had + 0.004*about + 0.004*been + 0.004*more + 0.004*their'
>>> lda.print_topic(21, 10)
'0.010*i + 0.006*his + 0.005*they + 0.005*; + 0.005*: + 0.004*who + 0.004*were + 0.004*had + 0.003*new + 0.003*we'
>>> lda.print_topic(23, 10)
"0.005*who + 0.005*they + 0.004*had + 0.004*: + 0.004*i + 0.004*n't + 0.004*his + 0.004*we + 0.004*or + 0.004*one"
This also looks suspicious to me.
Maybe it is caused by the large number of stop words and the noise they produce? My hypothesis is that documents in Gigaword are relatively short (5-15 sentences) and no_below=5, no_above=0.5 were not enough to filter the stop words out, so an additional stoplist may help.
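If that hypothesis is right, something like the following on top of filter_extremes might be the fix (a sketch; extra_stopwords is a hypothetical hand-made stoplist, not anything shipped with gensim):

extra_stopwords = ["his", "who", "they", "had", "were"]   # would be much longer in practice
stop_ids = [d.token2id[w] for w in extra_stopwords if w in d.token2id]
d.filter_tokens(bad_ids=stop_ids)   # drop the listed tokens from the dictionary
d.compactify()                      # reassign ids to remove the gaps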
Also, on the website you claim that it took about 5 hours on your machine to compute a model for Wikipedia; what was the corpus size? On our machine (16 cores & 32GB RAM, though I noticed that only 1 core was used) it took about 3-4 days to process the 20GB feature matrix. Again, maybe setting some parameters to non-default values would help.
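For concreteness, these are the kinds of non-default settings I had in mind (a sketch only; the values are guesses, mm and d are my loaded MmCorpus and Dictionary, and I'm assuming the parameters behave as described in the documentation):

from gensim.models.ldamodel import LdaModel

lda = LdaModel(corpus=mm, id2word=d, num_topics=200,
               chunksize=10000,    # more documents per online update
               update_every=1,
               passes=1,
               distributed=True)   # assumes a distributed setup as described in the docs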
Thank you in advance,
-vladimir