Interpretation of coherence plot for choosing number of topics


Amar DS

Apr 25, 2017, 6:05:41 AM
to gensim

I'm trying to choose the number of topics by plotting topic coherence against the number of topics. I followed the implementation in this tutorial: https://markroxor.github.io/gensim/static/notebooks/topic_coherence_tutorial.html
The tutorial says that a good LDA model is one with a high coherence value.
However, in this gist https://gist.github.com/dsquareindia/ac9d3bf57579d02302f9655db8dfdd55 , it's mentioned that the ideal number of topics is 2, even though the topic coherence keeps increasing after 3 topics.

How should I interpret the coherence plot, and how do I choose the number of topics based on it? Below is the plot obtained for my use case. Does this plot make sense, and what do you think is the right number of topics for my case? Thank you.


[Inline image: plot of topic coherence vs. number of topics]

Devashish Deshpande

Apr 25, 2017, 6:40:23 AM
to gen...@googlegroups.com
Hi Amar,

A certain amount of intuition still comes into play even when using coherence to choose the optimal number of topics. This is reflected in the range you iterate over for n_topics. In the plot in my gist, I took 2 as the number of topics because the coherence value looked highest there on the plot (I should probably have used np.argmax, my bad) and because it also made logical sense given the nature of the corpus.

In your data we can see one peak between 0 and 100 topics and another between 400 and 500. The question I would ask in this case is: "does ~480 topics make sense for the kind of data I have?" If not, you can just do an np.argmax over the 0-100 range and trade some coherence score for a model that is simpler to understand. Otherwise, just do an np.argmax over the full set.
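
A minimal sketch of that selection rule, assuming the coherence scores have already been computed for each candidate number of topics. The function name `pick_num_topics`, the `max_topics` parameter, and the numbers below are all made up for illustration; they are not the values from the plot in this thread.

```python
import numpy as np


def pick_num_topics(topic_counts, coherences, max_topics=None):
    """Return (num_topics, coherence) for the highest coherence score.

    If max_topics is given, only candidates with at most that many
    topics are considered (the "trade coherence for simplicity" option).
    """
    topic_counts = np.asarray(topic_counts)
    coherences = np.asarray(coherences, dtype=float)

    if max_topics is not None:
        mask = topic_counts <= max_topics
        if not mask.any():
            raise ValueError("no candidate has <= max_topics topics")
        topic_counts = topic_counts[mask]
        coherences = coherences[mask]

    best = int(np.argmax(coherences))
    return int(topic_counts[best]), float(coherences[best])


# Hypothetical scores with one peak below 100 and another near 480.
topic_counts = [10, 50, 100, 200, 300, 400, 480, 500]
coherences = [0.41, 0.52, 0.47, 0.44, 0.46, 0.51, 0.55, 0.53]

best_overall = pick_num_topics(topic_counts, coherences)
best_small = pick_num_topics(topic_counts, coherences, max_topics=100)
print(best_overall, best_small)
```

With these invented numbers, the unrestricted argmax lands on the second peak, while restricting the range to 100 topics picks the first one. Which of the two to use is still the judgment call described above.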

Please note that this is just my way of doing it. There might be a better way to do this too.

Hope that helps,
Devashish

--
You received this message because you are subscribed to the Google Groups "gensim" group.
To unsubscribe from this group and stop receiving emails from it, send an email to gensim+unsubscribe@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

john.hard...@gmail.com

Apr 25, 2017, 9:24:03 AM
to gensim
Hey Devashish,

Am I correct in interpreting that in the "Making LDA Behave Like LSA" section you are showing which topics are the most coherent? That is, feeding the words of an individual topic into the CoherenceModel shows how coherent that individual topic is?


Devashish Deshpande

Apr 25, 2017, 10:08:56 AM
to gen...@googlegroups.com
Hey John,

Yes, that is right. That piece of code just makes a list of all the words in each topic and calculates the coherence for it.
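
To illustrate the idea of scoring one topic at a time, here is a small sketch that computes a simplified UMass-style coherence from document co-occurrence counts. It is a stand-in written for this example, not gensim's CoherenceModel and not the code from the gist; the function name `umass_like_coherence`, the toy documents, and the topic word lists are all hypothetical.

```python
import math
from itertools import combinations


def umass_like_coherence(topic_words, documents):
    """Average of log((D(w_m, w_l) + 1) / D(w_l)) over word pairs.

    topic_words: the topic's top words, highest ranked first.
    documents:   tokenized reference documents (lists of words).
    D(...) counts the documents that contain all of the given words.
    """
    doc_sets = [set(doc) for doc in documents]

    def doc_freq(*words):
        return sum(1 for d in doc_sets if all(w in d for w in words))

    scores = []
    for l, m in combinations(range(len(topic_words)), 2):
        w_l, w_m = topic_words[l], topic_words[m]  # w_l is ranked higher
        d_l = doc_freq(w_l)
        if d_l == 0:
            continue  # skip words that never occur in the reference corpus
        scores.append(math.log((doc_freq(w_m, w_l) + 1) / d_l))

    return sum(scores) / len(scores) if scores else float("nan")


# Toy reference corpus and two hand-written "topics" (illustrative only).
documents = [
    ["cat", "dog", "pet"],
    ["dog", "pet", "food"],
    ["cat", "pet", "vet"],
    ["stock", "market", "trade"],
    ["market", "trade", "price"],
]
topics = {
    "pets": ["pet", "dog", "cat"],
    "mixed": ["pet", "market", "vet"],
}

scores = {name: umass_like_coherence(words, documents)
          for name, words in topics.items()}
for name, score in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
    print(name, round(score, 3))
```

Scores closer to zero mean the topic's words tend to appear in the same documents, so the "pets" list ranks above the "mixed" one on this toy corpus.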

Regards,
Devashish


john.hard...@gmail.com

Apr 25, 2017, 10:45:51 AM
to gensim
Ok, thank you.

One more question. When I read the original paper, it seemed to suggest that using an external corpus as the reference text for coherence gave better results than using the text the LDA model was trained on. How much worse is it to use the training text?

Devashish Deshpande

Apr 26, 2017, 5:13:42 AM
to gen...@googlegroups.com
I haven't tried to measure numerically how much better an external text is than the original text. The paper does say that calculating probabilities on the English Wikipedia is probably the best alternative, but I can't really comment on how much better it is.





Ivan Menshikh

May 9, 2017, 6:58:26 AM
to gensim

Hello John,
It is not possible to say how much worse it is to use the training set to evaluate coherence, but from a methodological point of view, using an external corpus is preferable.


On Tuesday, April 25, 2017 at 19:45:51 UTC+5, john.hard...@gmail.com wrote: