LDA topic distribution for new document and similarity


alex

Feb 19, 2014, 5:24:02 AM
to gen...@googlegroups.com
Hello all. I made an LDA model with 86000+ tokens. When inferring a new document's topic distribution, does the sum of the topic probabilities have to equal 1?


lda_model = models.LdaModel.load('/myldamodel.lda_model')

corpus_bow = [[(32439, 1), (73079, 2), (73150, 1)],
              [(22949, 1), (73079, 1)],
              [(73150, 2)]]

doc_lda = lda_model[corpus_bow]

for i in doc_lda:
    print i


the results are:
[(9, 0.40200000000000152), (26, 0.22707850347701716), (57, 0.17692149652298442)]
[(9, 0.33666666666666822), (35, 0.33666666666666817)]
[(26, 0.67000000000000337)]

0.40200000000000152 + 0.22707850347701716 + 0.17692149652298442 != 1
0.33666666666666822 + 0.33666666666666817 != 1
0.67000000000000337 != 1

I don't understand how I should interpret LDA inference results on new documents.

Radim Řehůřek

Feb 20, 2014, 12:57:21 PM
to gen...@googlegroups.com
Hello Alex,

small topic weights are not returned in the sparse format. By default, "small" means < 0.01.

My guess is you're using a lot of topics, and most of them end up with only the tiny prior weight (~alpha). Collectively, all the big + small weights add up to 1, but you "see" only the large ones in the sparse format.

You can verify whether this is the case by changing the "small" threshold to something even smaller:

print lda_model.__getitem__(bow, eps=0.00001)  # default is 0.01
# should add up to 1.0
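
To illustrate the effect without gensim, here is a minimal sketch with a hypothetical 100-topic distribution (the topic ids and the three large weights are taken from Alex's first result row; the even spread of the residual mass is an assumption for illustration). Dropping everything below the threshold is what makes the visible sum fall short of 1:

```python
# Hypothetical full topic distribution for one document over 100 topics.
# Three topics carry most of the mass; the rest sit near the prior (~alpha).
num_topics = 100
large = {9: 0.402, 26: 0.227, 57: 0.177}

# Spread the leftover probability mass evenly over the remaining topics
# (an assumption; real LDA posteriors won't be exactly uniform there).
residual = 1.0 - sum(large.values())
full = {t: large.get(t, residual / (num_topics - len(large)))
        for t in range(num_topics)}

def sparse(dist, eps=0.01):
    """Mimic the sparse output: drop any topic whose weight is below eps."""
    return [(t, w) for t, w in sorted(dist.items()) if w >= eps]

visible = sparse(full, eps=0.01)
print(visible)                          # only the three large topics survive
print(sum(w for _, w in visible))       # ~0.806, short of 1
print(sum(full.values()))               # ~1.0: the full distribution normalizes
```

Lowering `eps` (as in the `__getitem__` call above) simply moves the cutoff, so more of the small per-topic weights become visible and the printed sum approaches 1.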

HTH,
Radim

alex

Feb 20, 2014, 9:13:44 PM
to gen...@googlegroups.com
Got it. Thanks a lot for explaining :)