Just as a made-up example: maybe I have 2 topics, a dictionary of 50 words, and a corpus whose documents total 1100 words. Topic 1 has 1000 words assigned to it, and Topic 2 has 100 words assigned to it. Now, I guess, the values from lda.show_topics() would still sum up to 1 over the dictionary of 50 words for both topics?

How do I get comparable values to see how "big" each topic is within the corpus? Or how common the words within topics are across the corpus?
Hello Teemu,

> Just as a made up example, maybe I have 2 topics, dictionary of 50 words, and a corpus with documents having total of 1100 words. [...] Now, I guess, the values from lda.show_topics() would still sum up to 1 over the dictionary of 50 words for both topics?

Yes, you are right. Don't forget that the same word can belong to several topics.

> How do I get comparable values to see how "big" each topic is within the corpus?

You can fit an LdaModel and apply it to your corpus:

topic_dists = [ldamodel[doc] for doc in corpus]
from collections import defaultdict

topic_sizes = defaultdict(float)
for doc in docs:  # assume doc is a text string
    doc_bow = dictionary.doc2bow(doc.split())  # dictionary: the gensim Dictionary
    for topic_id, percent in lda[doc_bow]:  # (topic_id, probability) pairs
        topic_sizes[topic_id] += percent
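To make concrete what the loop above produces, here is a self-contained sketch that fakes the model output (the `doc_topic_dists` list below stands in for the per-document `lda[doc_bow]` results, which are lists of `(topic_id, probability)` pairs) and then normalizes the sums so the topic sizes are directly comparable:

```python
from collections import defaultdict

# Stand-in for [lda[doc_bow] for doc in docs]: invented per-document
# topic distributions, each a list of (topic_id, probability) pairs.
doc_topic_dists = [
    [(0, 0.9), (1, 0.1)],
    [(0, 0.8), (1, 0.2)],
    [(1, 1.0)],
]

# Sum each topic's probability mass over all documents.
topic_sizes = defaultdict(float)
for dist in doc_topic_dists:
    for topic_id, percent in dist:
        topic_sizes[topic_id] += percent

# Normalize so the sizes sum to 1 over the corpus; now the numbers
# are comparable "shares" of the corpus per topic.
total = sum(topic_sizes.values())
topic_shares = {t: s / total for t, s in topic_sizes.items()}
print(sorted(topic_shares.items(), key=lambda kv: -kv[1]))
```

With the invented numbers above, topic 0 ends up with 1.7/3.0 of the corpus and topic 1 with 1.3/3.0, so sorting by share gives the "biggest" topics first.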
And after this, you can calculate the sum by columns and get the most relevant topics for this corpus.

> Or how common the words within topics are across the corpus?

If I understand you correctly, you can simply count words over the corpus (term frequency or document frequency).
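Counting term frequency and document frequency over a corpus needs nothing beyond collections.Counter; a minimal sketch (the `docs` list is made up):

```python
from collections import Counter

docs = ["hello world", "hello there", "world of topics"]

# Term frequency: total occurrences of each word across all documents.
tf = Counter(word for doc in docs for word in doc.split())

# Document frequency: number of documents each word appears in
# (set() deduplicates within a document).
df = Counter(word for doc in docs for word in set(doc.split()))

print(tf["hello"], df["world"])  # 2 2
```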
And the result is a set of comparable numbers for sorting to get the most "relevant" topics?

Let's say I take the top 10 topics from the results of the above code and want to count how "relevant" the word "hello" is in those top 10 topics. Maybe "hello" is in two of those topics, and I want to weight its occurrence by the "relevance" of the topic itself. So I guess I would take the topic relevance measure from the above code (again, assuming it is correct) and weight "hello" for each topic with that topic's "relevance" weight..?
> And the result is a set of comparable numbers for sorting to get the most "relevant" topics?

Yes, you get the most "popular" / "relevant" topics over the corpus.

> So I guess I would take the topic relevance measure from the above code (again, assuming it is correct) and weight "hello" for each topic with that topic's "relevance" weight..?

You can try this, but I've never tried it this way. What do you mean by the relevance of the word?
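One way to make the proposed weighting concrete (just a sketch of the scheme discussed above, not something gensim provides; all numbers are invented): take P(word | topic) from the model's topic-word distribution and weight it by the corpus-level topic share computed earlier, summing over the topics the word appears in:

```python
# Invented corpus-level topic shares (from summing per-document distributions).
topic_shares = {0: 0.6, 1: 0.3, 2: 0.1}

# Invented P(word | topic) values; "hello" appears in topics 0 and 2.
p_word_given_topic = {
    "hello": {0: 0.05, 2: 0.20},
}

def weighted_relevance(word):
    """Corpus-weighted score: sum over topics of share(t) * P(word | t)."""
    return sum(topic_shares.get(t, 0.0) * p
               for t, p in p_word_given_topic.get(word, {}).items())

print(weighted_relevance("hello"))  # 0.6*0.05 + 0.1*0.20 = 0.05
```

Words that are probable only in small topics score low even if P(word | topic) is high, which matches the intent of weighting by how "big" the topic is in the corpus.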