Obtaining topic-document probabilities matrix.


Matias

Apr 6, 2017, 11:10:58 AM
to gensim
Hello, 

I need to calculate the distribution of a topic t across a corpus of documents.
For that I need a K x D matrix M (K = number of topics, D = number of documents), where each element M[k,d] is the probability of topic k in document d.
I need this matrix to compute the entropy of a topic, as done in: Pierre F. Baldi, Cristina V. Lopes, Erik J. Linstead, and Sushil K. Bajracharya. 2008. A theory of aspects as latent topics.

The method ldamodel.get_document_topics() returns the probabilities of a document belonging to each topic, but I don't think this output can help me calculate the topic entropy.

Thanks a lot for your help
best regards
Matias

Ivan Menshikh

May 10, 2017, 10:58:12 AM
to gensim
Hello Matias,

You can easily compute the matrix M with the get_document_topics method. Here is an example:

from gensim.models import LdaModel
from gensim.corpora import Dictionary
import numpy as np

# Toy corpus: three tokenised documents
docs = [["a", "a", "b"], ["a", "c", "g"], ["c"]]
dct = Dictionary(docs)
corpus = [dct.doc2bow(doc) for doc in docs]
K = 10            # number of topics
D = len(corpus)   # number of documents

ldamodel = LdaModel(corpus=corpus, num_topics=K, id2word=dct)

M = np.zeros((K, D))  # matrix of topics x documents

for doc_id, bow in enumerate(corpus):
    # A tiny minimum_probability keeps low-probability topics in the output
    for topic_id, prob in ldamodel.get_document_topics(bow, minimum_probability=1e-8):
        M[topic_id, doc_id] = prob
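
Once you have M, one possible way to get an entropy per topic is sketched below. This is only an assumption about the definition: it normalises each row of M so that the topic's weights over documents sum to 1, and then takes the Shannon entropy of that distribution. Please check it against the formula in the paper before relying on it. The function name topic_entropies and the small matrix M_example are made up for illustration, they are not part of gensim.

```python
import numpy as np

def topic_entropies(M, eps=1e-12):
    """Shannon entropy (in nats) of each topic's distribution over documents.

    M is a K x D array with M[k, d] = probability of topic k in document d.
    NOTE: row normalisation is an assumption, not necessarily the paper's definition.
    """
    M = np.asarray(M, dtype=float)
    # Normalise each row so a topic's weights over the documents sum to 1
    row_sums = M.sum(axis=1, keepdims=True)
    P = M / np.maximum(row_sums, eps)
    # Replace zeros by 1 inside the log so that 0 * log(0) contributes 0
    log_P = np.log(np.where(P > 0, P, 1.0))
    return -(P * log_P).sum(axis=1)

# Hypothetical matrix with 2 topics and 3 documents (made-up values)
M_example = np.array([[0.5, 0.5, 0.0],
                      [0.5, 0.5, 1.0]])
H = topic_entropies(M_example)
print(H)
```

A topic that is spread evenly over many documents gets a high entropy, while a topic concentrated in a few documents gets a low one. With the matrix from the example above you would call topic_entropies(M).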





On Thursday, April 6, 2017 at 20:10:58 UTC+5, Matias wrote: