Determine the probability p(topic|word) in LDA


Jonas Wacker

Nov 7, 2015, 5:40:47 PM
to gensim
As far as I understand, LdaModel.print_topic(n) shows the distribution of words given a topic t, i.e. p(w|t) for every word, in descending order of probability.

As I am trying to reproduce a research paper's results, I need the opposite: p(t|w). Given a word w, I want to get the distribution over topics.

I know that I can apply Bayes' theorem here and use the following equation:

p(t|w) = p(w|t) * p(t) / p(w)

But then I would need to compute p(t), the probability of topic t appearing in the corpus overall.

Therefore, my question is: is there a simple way to get either p(t) or p(t|w) using gensim's LDA model?
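
For reference, once an estimate of p(t) is available, the Bayes inversion above is only a few lines of numpy. A minimal sketch, assuming a trained model, its gensim dictionary, and a topic_prior array holding p(t); get_topics() returns the row-normalized p(w|t) matrix in recent gensim releases:

import numpy as np

# p(w|t) for every topic/word pair: shape (num_topics, vocab_size), rows sum to 1
topic_word = model.get_topics()

def p_topic_given_word(word, topic_prior):
    # Bayes' theorem: p(t|w) = p(w|t) * p(t) / p(w)
    word_id = dictionary.token2id[word]           # dictionary maps tokens to ids
    joint = topic_word[:, word_id] * topic_prior  # p(w|t) * p(t) for every topic t
    return joint / joint.sum()                    # p(w) = sum over t of p(w|t) * p(t)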


I found a suggested solution elsewhere, but I did not really understand it. If I just need to call LdaModel.inference(), what do I need to pass as chunks?
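
For reference, inference() expects its chunk argument to be a plain list of documents in bag-of-words format, and it returns the unnormalized variational topic weights gamma, one row per document. A small sketch, assuming a trained model and an in-memory bow corpus:

gamma, _ = model.inference(corpus[:10])           # chunk = a list of bow documents
theta = gamma / gamma.sum(axis=1, keepdims=True)  # normalize each row to get p(t|doc)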

Jonas Wacker

Nov 8, 2015, 5:10:06 AM
to gensim
For now, I came up with the following solution to estimate p(t) for each of my 100 topics:

import os

# model, corpus and data_directory() are defined earlier in the script

# Create an array with 100 zero values, one accumulator per topic
topic_prob_dist = [0.0] * 100

dist_file = open(os.path.join(data_directory(), 'topic_prob_dist.txt'), 'w')

for index, document in enumerate(corpus):
    # infer the topic distribution for this document
    for topic_id, prob in model.get_document_topics(document, minimum_probability=0.0):
        topic_prob_dist[topic_id] += prob

    if index % 1000 == 0:
        print('Document:', index)

# divide the accumulated mass by the corpus size to get a probability
for index, prob in enumerate(topic_prob_dist):
    dist_file.write(str(index) + ' ' + str(prob / len(corpus)) + '\n')

dist_file.close()

It goes through each document of the corpus and gets its topic distribution. The probabilities are accumulated in the 100-element array topic_prob_dist (one entry per topic).
At the end, each entry is divided by the number of documents in the corpus, so the result is a proper probability distribution over topics.

I suppose this gives me the correct topic distribution?

If there is an easier way of doing this, please let me know ;)

Cheers,
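
For reference, the same estimate can be computed in one shot with inference(), as asked about above. A sketch, assuming the corpus fits in memory as a list of bag-of-words documents (for a large corpus, process it chunk by chunk instead):

gamma, _ = model.inference(corpus)                # infer all documents at once
theta = gamma / gamma.sum(axis=1, keepdims=True)  # per-document topic distributions p(t|d)
topic_prob_dist = theta.mean(axis=0)              # average over documents -> estimate of p(t)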