LDA: inferring topic distributions on a new document


Artem Yankov

unread,
Sep 21, 2013, 12:43:09 AM9/21/13
to gen...@googlegroups.com

I trained LDA on a bunch of text and am now trying to infer topics for a new document.
I'm doing it as in the tutorial:

doc_lda = lda[doc_bow]

and it returns the following list:

[(0, 0.060771759757132261), (1, 0.11843179910545466), (2, 0.34692926700963628), (3, 0.19344052553420571), (4, 0.28042664859357103)]

I would think those tuples are word ids mapped to their probabilities, but I always get only the first 5 ids
when the size of the dictionary is about 44,000 words. Am I misunderstanding something?

Another question: is there a simple way to map this result to the actual list of inferred topics?

Thanks.

Artem Yankov

unread,
Sep 21, 2013, 7:44:50 PM9/21/13
to gen...@googlegroups.com
Answering my own question in case someone runs into the same problem.

It returned probabilities for only the first 5 topics because I trained the LDA model with num_topics=5.
So apparently, for a new document, it just estimates probabilities over those 5 topics.

As for converting inference results back to words, it looks like there's no built-in solution,
but a simple method would do:

def get_topics(dictionary, topics, prob=0.5):
    return [dictionary.id2token[topic[0]] for topic in topics if topic[1] > prob]

where topics is a list of (id, probability) tuples

Radim Řehůřek

unread,
Sep 22, 2013, 8:23:51 AM9/22/13
to gen...@googlegroups.com
Hello Artem,

no, that is not correct. The list returned from `lda[doc_bow]` contains 2-tuples of (topic, probability). Words don't come into it anymore; you cannot convert a topic into a word...

Each topic is itself a probability distribution over words; have a look at `lda.print_topics()` and http://radimrehurek.com/gensim/tut2.html#transforming-vectors
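To make the distinction concrete, here is a plain-Python sketch (toy vocabulary and made-up numbers, not real model output; in gensim itself, `lda.show_topic(topic_id)` reports a topic's top words for you):

```python
# Toy illustration: each topic is a probability distribution over the
# vocabulary, and inference on a document yields a distribution over topics.
# All ids, words and probabilities below are invented for this example.

id2token = {0: "cat", 1: "dog", 2: "fish", 3: "bird"}

# topic_word[t] is topic t's probability distribution over word ids.
topic_word = [
    [0.70, 0.20, 0.05, 0.05],  # topic 0: mostly "cat"/"dog"
    [0.05, 0.05, 0.60, 0.30],  # topic 1: mostly "fish"/"bird"
]

def top_words(topic_id, topn=2):
    """Most probable words for a topic, roughly what lda.show_topic() reports."""
    dist = topic_word[topic_id]
    ranked = sorted(range(len(dist)), key=lambda w: dist[w], reverse=True)
    return [(id2token[w], dist[w]) for w in ranked[:topn]]

# doc_lda-style result: (topic_id, probability) pairs for one document.
doc_lda = [(0, 0.8), (1, 0.2)]
for topic_id, prob in doc_lda:
    print(topic_id, prob, top_words(topic_id))
```

So a (topic, probability) pair from `lda[doc_bow]` tells you how much of the document belongs to that topic; the topic's word distribution is a separate, model-wide object.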

Best,
Radim

Artem Yankov

unread,
Sep 22, 2013, 11:14:29 PM9/22/13
to gen...@googlegroups.com

Oh, so the generated topics don't really have a word representation; they are just ids that represent probability distributions
over a specific set of words? It all makes sense now, thanks. I thought it chose topics from the sets of words in the document.

Do you think LDA could be an appropriate tool for text categorization (event descriptions) in order to later build recommendations based
on what a user liked? Or are there better ways to categorize text? Users do not have to see the generated categories.

Shubham

unread,
Dec 19, 2014, 7:45:07 AM12/19/14
to gen...@googlegroups.com

Hi Radim

I trained LDA with n_topics = 10 as follows:

lda = gensim.models.LdaModel(corpus, id2word=dictionary, num_topics=n_topics)

Now when I apply lda[doc2bow], I get a probability distribution over
fewer than 10 topics for some input cases, and the sum of the individual
topics' probabilities is < 1.

What can be the reason for this behaviour?

Thanks
Shubham





Christopher S. Corley

unread,
Dec 19, 2014, 9:15:14 AM12/19/14
to gensim
Excerpts from Shubham's message of 2014-12-19 05:53:38 -0600:
Shubham,

By default, gensim filters out topics that aren't very related to your query
and returns the resulting sparse vector. To get them all, you can directly
call:

>>> lda.__getitem__(doc2bow, eps=0)

The eps param is the probability threshold that topics are filtered by. Setting it to 0 means
no topics are removed.

FWIW, instead of that, I tend to get good results by using sparse2full,

>>> topics = gensim.matutils.sparse2full(model[doc], model.num_topics)

which will set all of those filtered values to 0, giving a full vector.
Doesn't solve the sum(topics), however. :)
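For reference, sparse2full's effect can be sketched in a few lines (a toy reimplementation for illustration, not gensim's code):

```python
def sparse2full_sketch(sparse_vec, length):
    """Expand a list of (index, value) pairs into a dense list of the
    given length, with all missing (filtered-out) entries set to 0.0."""
    dense = [0.0] * length
    for idx, val in sparse_vec:
        dense[idx] = val
    return dense

# A filtered result over 10 topics (low-probability topics already dropped):
doc_topics = [(2, 0.55), (7, 0.35)]
print(sparse2full_sketch(doc_topics, 10))
```

The dense vector is often more convenient for downstream similarity or classification code, since every document gets the same-length representation.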

Chris.