topic distribution over corpus


Teemu Kanstrén

May 15, 2017, 7:36:01 AM
to gensim
Hello,

 I have successfully trained an LDA model on my corpus using gensim. Works great. Now I would like to see the topic/word distribution over the corpus.

 With lda.show_topics() I am able to get the word distribution for a topic. My understanding is that these are the probabilities of each word appearing in a given topic, so for each topic the word probabilities would sum up to 1.

 But my understanding is also that different numbers of words (tokens?) overall may be assigned to topics. 

 Just as a made up example, maybe I have 2 topics, dictionary of 50 words, and a corpus with documents having total of 1100 words. Topic 1 has 1000 words assigned to it, and Topic 2 has 100 words assigned to it. Now, I guess, the values from lda.show_topics() would still sum up to 1 over the dictionary of 50 words for both topics? 

 How do I get comparable values to see how "big" each topic is within the corpus? 

 Or how common the words within topics are across the corpus?

 For example, topic1.hello=550, topic2.hello=20. Don't really need exact numbers (don't know if gensim keeps them around), just something comparable would be nice.

Cheers,
Teemu

Ivan Menshikh

May 15, 2017, 11:02:45 AM
to gensim
Hello Teemu,

 Just as a made up example, maybe I have 2 topics, dictionary of 50 words, and a corpus with documents having total of 1100 words. Topic 1 has 1000 words assigned to it, and Topic 2 has 100 words assigned to it. Now, I guess, the values from lda.show_topics() would still sum up to 1 over the dictionary of 50 words for both topics? 

Yes, you are right. Don't forget that the same word can belong to several topics.


 How do I get comparable values to see how "big" each topic is within the corpus? 

You can fit an LdaModel and apply it to your corpus:
topic_dists = [ldamodel[doc] for doc in corpus]

 After this, you can sum by columns (i.e. over documents) to get the most relevant topics for the corpus.

 Or how common the words within topics are across the corpus?

If I understand you correctly, you can simply count words over the corpus (term frequency or document frequency).

Teemu Kanstrén

May 15, 2017, 1:55:58 PM
to gensim
Thanks for the help Ivan,


On Monday, 15 May 2017 18:02:45 UTC+3, Ivan Menshikh wrote:
Hello Teemu,

 Just as a made up example, maybe I have 2 topics, dictionary of 50 words, and a corpus with documents having total of 1100 words. Topic 1 has 1000 words assigned to it, and Topic 2 has 100 words assigned to it. Now, I guess, the values from lda.show_topics() would still sum up to 1 over the dictionary of 50 words for both topics? 

Yes, you are right. Don't forget that the same word can belong to several topics.

 How do I get comparable values to see how "big" each topic is within the corpus? 

You can fit an LdaModel and apply it to your corpus:
topic_dists = [ldamodel[doc] for doc in corpus]

OK, I tried to do this. Is this about right?
 
from collections import defaultdict

topic_sizes = defaultdict(float)
for doc in docs:  # assume doc is a text string
    # `dictionary` is the gensim Dictionary (renamed to avoid shadowing the built-in `dict`)
    doc_bow = dictionary.doc2bow(doc.split())
    dist = lda[doc_bow]
    for topic_id, percent in dist:
        topic_sizes[topic_id] += percent

And the result is a set of comparable numbers for sorting to get the most "relevant" topics?


 After this, you can sum by columns (i.e. over documents) to get the most relevant topics for the corpus.

 Or how common the words within topics are across the corpus?

If I understand you correctly, you can simply count words over the corpus (term frequency or document frequency).


Sorry, bad wording from me.

 Let's say I take the top 10 topics from the results of the above code, and wish to count how "relevant" the word "hello" is in those top 10 topics. Maybe "hello" appears in two of those topics, and I want to weight its occurrence by the "relevance" of the topic itself. So I guess I would take the topic relevance measure from the above code (again, assuming it is correct), and weight "hello" for each topic with that topic's "relevance" weight..?

Ivan Menshikh

May 16, 2017, 12:29:33 AM
to gensim
And the result is a set of comparable numbers for sorting to get the most "relevant" topics?
 
Yes, you get the most "popular" / "relevant" topics over the corpus.

Let's say I take the top 10 topics from the results of the above code, and wish to count how "relevant" the word "hello" is in those top 10 topics. Maybe "hello" appears in two of those topics, and I want to weight its occurrence by the "relevance" of the topic itself. So I guess I would take the topic relevance measure from the above code (again, assuming it is correct), and weight "hello" for each topic with that topic's "relevance" weight..?

You can try this, but I've never tried it this way.
What do you mean by the relevance of the word? 

Teemu Kanstrén

May 16, 2017, 5:53:02 AM
to gensim

On Tuesday, 16 May 2017 07:29:33 UTC+3, Ivan Menshikh wrote:
And the result is a set of comparable numbers for sorting to get the most "relevant" topics?
 
Yes, you get the most "popular" / "relevant" topics over the corpus.

Let's say I take the top 10 topics from the results of the above code, and wish to count how "relevant" the word "hello" is in those top 10 topics. Maybe "hello" appears in two of those topics, and I want to weight its occurrence by the "relevance" of the topic itself. So I guess I would take the topic relevance measure from the above code (again, assuming it is correct), and weight "hello" for each topic with that topic's "relevance" weight..?

You can try this, but I've never tried it this way.
What do you mean by the relevance of the word? 


I guess it is just the top words (their weights) from the top topics, summed up and weighted by topic relevance. Not exactly "relevance" perhaps, just experimenting with things.

I believe I have this working now. Thanks!

Sana Talha

Oct 19, 2018, 6:40:32 AM
to Gensim

Hello,
How can I categorize the topics extracted from LDA? How can I say that a given topic makes up A% of a document or of the corpus?
How do I find the topic distribution for each document? Please help.

Regards,