LDA: distribution of word over topics


Miroslav Batchkarov

Apr 22, 2015, 7:18:49 AM
to gen...@googlegroups.com
Hi, 

I am getting funny results when I try to get a word's distribution over topics. I trained an LDA model on several MB of newswire (11k documents) and then asked for the distribution over topics of a document containing a single word. The output is something like:

2015-04-22 12:01:15,398 : INFO : LDA vector for rachel is [(35, 0.50499999999999989)]
2015-04-22 12:01:15,399 : INFO : LDA vector for badenhorst is [(70, 0.50499999596117906)]
2015-04-22 12:01:15,399 : INFO : LDA vector for cracking is [(13, 0.50499999999999989)]

However, the topics I get look reasonable. The only thing that is out of the ordinary is the topic weights, which are all the same.

2015-04-22 12:14:49,789 : INFO : topic #56 (0.010): 0.013*i + 0.010*race + 0.009*mansell + 0.007*car + 0.006*stage + 0.006*km + 0.006*indurain + 0.005*indy + 0.005*tour + 0.004*people
2015-04-22 12:14:49,818 : INFO : topic #55 (0.010): 0.010*north + 0.006*carter + 0.005*pyongyang + 0.005*prix + 0.005*we + 0.005*korea + 0.005*ford + 0.005*grand + 0.005*team + 0.005*france
2015-04-22 12:14:49,839 : INFO : topic #12 (0.010): 0.006*against + 0.006*soviet + 0.006*all + 0.005*former + 0.005*south + 0.005*world + 0.004*trial + 0.004*cup + 0.004*korea + 0.003*coup
2015-04-22 12:14:49,858 : INFO : topic #28 (0.010): 0.008*second + 0.008*south + 0.007*minutes + 0.007*th + 0.007*against + 0.007*half + 0.007*world + 0.007*only + 0.006*minute + 0.006*off
2015-04-22 12:14:49,867 : INFO : topic #4 (0.010): 0.025*north + 0.022*nuclear + 0.021*korea + 0.008*korean + 0.008*international + 0.008*states + 0.008*united + 0.007*agency + 0.007*iaea + 0.007*pyongyang


Looking inside LdaModel.inference, the resulting gamma is a uniform distribution. Any idea what I am doing wrong?

Disclaimer: a similar question was asked recently by Manikandan Tv, but they did not get a reply.

Code: (effectively lifted from the tutorials)

import logging
import os
from random import sample
from gensim.corpora import Dictionary
from gensim.models import LdaMulticore
from gensim.utils import tokenize


def iter_documents(top_directory):
    """Iterate over all documents, yielding a document (=list of utf8 tokens) at a time."""
    for root, dirs, files in os.walk(top_directory):
        for fname in files:
            if fname.startswith('.'):  # skip hidden files
                continue
            with open(os.path.join(root, fname)) as document:
                for line in document:
                    yield tokenize(line, lower=True)  # or whatever tokenization suits you


class MyCorpus(object):
    def __init__(self, top_dir):
        self.top_dir = top_dir
        self.dictionary = Dictionary(iter_documents(top_dir))
        self.dictionary.filter_extremes(no_below=1, no_above=0.25, keep_n=30000)

    def __iter__(self):
        for tokens in iter_documents(self.top_dir):
            yield self.dictionary.doc2bow(tokens)


logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)
corpus = MyCorpus('/tmp/gigaword')

dictionary = corpus.dictionary
dictionary.save_as_text('radim_dict.txt')
logging.info(dictionary)

lda = LdaMulticore(corpus=corpus,
                   id2word=dictionary,  # this MUST be there, can't be set automatically from corpus. WTF?
                   num_topics=100, workers=4, passes=5) 
lda.save('radim_lda.pkl')

lda = LdaMulticore.load('radim_lda.pkl')
dictionary = Dictionary.load_from_text('radim_dict.txt')
lda.print_topics(10)
for word in sample(dictionary.token2id.keys(), 50):
    doc_bow = dictionary.doc2bow([word])
    logging.info('LDA vector for %s is %r', word, lda[doc_bow])

Data sample:

At least four people were injured and seven were arrested during a protest Wednesday by disgruntled former employees of the Independent Electoral Commission (IEC), which managed South Africa's first all-race election last month.    About ...
An opposition Senate candidate, Osias Arciniegas, said that the elections were called off in two southern areas.    Local authorities scrapped the poll...
An estimated 400 international observers, including a team from the Organization of American States (OAS), were on hand for Monday's elections...


Radim Řehůřek

Apr 22, 2015, 10:49:25 AM
to gen...@googlegroups.com
Hello Miro,


On Wednesday, April 22, 2015 at 1:18:49 PM UTC+2, Miroslav Batchkarov wrote:
Hi, 

I am getting funny results when I try to get a word's distribution over topics. I trained an LDA model over several MB of newswire (11k documents), and then ask for the distribution over topics of a document containing a single word. The output is something like:

2015-04-22 12:01:15,398 : INFO : LDA vector for rachel is [(35, 0.50499999999999989)]
2015-04-22 12:01:15,399 : INFO : LDA vector for badenhorst is [(70, 0.50499999596117906)]
2015-04-22 12:01:15,399 : INFO : LDA vector for cracking is [(13, 0.50499999999999989)]

However, the topics I get look reasonable. The only thing that is out of the ordinary is the topic weights, which are all the same.

2015-04-22 12:14:49,789 : INFO : topic #56 (0.010): 0.013*i + 0.010*race + 0.009*mansell + 0.007*car + 0.006*stage + 0.006*km + 0.006*indurain + 0.005*indy + 0.005*tour + 0.004*people
2015-04-22 12:14:49,818 : INFO : topic #55 (0.010): 0.010*north + 0.006*carter + 0.005*pyongyang + 0.005*prix + 0.005*we + 0.005*korea + 0.005*ford + 0.005*grand + 0.005*team + 0.005*france
2015-04-22 12:14:49,839 : INFO : topic #12 (0.010): 0.006*against + 0.006*soviet + 0.006*all + 0.005*former + 0.005*south + 0.005*world + 0.004*trial + 0.004*cup + 0.004*korea + 0.003*coup
2015-04-22 12:14:49,858 : INFO : topic #28 (0.010): 0.008*second + 0.008*south + 0.007*minutes + 0.007*th + 0.007*against + 0.007*half + 0.007*world + 0.007*only + 0.006*minute + 0.006*off
2015-04-22 12:14:49,867 : INFO : topic #4 (0.010): 0.025*north + 0.022*nuclear + 0.021*korea + 0.008*korean + 0.008*international + 0.008*states + 0.008*united + 0.007*agency + 0.007*iaea + 0.007*pyongyang

it looks like your topics do have different weights for different words.

Or what exactly is the same? Maybe I misunderstood.

If you mean the 0.01 in "topic #4 (0.010)", then that number is the topic's prior (the hyperparameter alpha). There's only one per topic (not one for each word in each topic). Unless you specify an asymmetric alpha, all topics have the same alpha by default (1.0 / num_topics).

You can get the matrix of word-topic weights from `lda_model.state.get_lambda()`.
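To make that concrete, here is a toy sketch in plain Python (not the gensim API; the matrix values are made up): lambda is a topics x vocab matrix of word weights, and normalizing one word's column gives that word's relative propensity toward each topic.

```python
# Toy lambda: 3 topics x 3 words (invented numbers, for illustration only)
lam = [
    [5.0, 0.1, 0.2],   # topic 0 weights for words w0, w1, w2
    [0.5, 4.0, 0.3],   # topic 1
    [0.5, 0.9, 9.5],   # topic 2
]

def word_topic_propensity(word_id):
    """Normalize one word's column of lambda into a distribution over topics."""
    col = [row[word_id] for row in lam]
    total = sum(col)
    return [w / total for w in col]

print(word_topic_propensity(0))  # word w0 leans heavily toward topic 0
```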

Hope that helps,
Radim

Miroslav Batchkarov

Apr 22, 2015, 11:17:04 AM
to gen...@googlegroups.com
Hi Radim,

thanks for the quick response, that is very helpful! Looks like I can get the information I need directly from the lambda matrix.

I’ve got a follow-up question regarding API consistency. 

The docstring for LsiModel.__getitem__ says "Return latent representation, as a list of (topic_id, topic_value) 2-tuples."

The docstring for LdaModel.__getitem__ says "Return topic distribution for the given document `bow`, as a list of (topic_id, topic_probability) 2-tuples."

To me that sounds like both are doing the same thing. The tutorial (https://radimrehurek.com/gensim/tut2.html) also says __getitem__ is how you get the distribution of a document over “topics” for a LsiModel:
lsi_vec = model[tfidf_vec] # convert some new document into the LSI space, without affecting the model

I tried my example code with LsiModel instead of LdaModel and got a well-formed distribution over all 100 "topics", as I expected. I assumed the same would work for LdaModel, which it didn't. Is this a bug in LdaModel, or is the API inconsistent? Why would one need to call lda.state.get_lambda() instead of __getitem__?



---
Miroslav Batchkarov
PhD Student,
Text Analysis Group,
Department of Informatics,
University of Sussex



-- 
You received this message because you are subscribed to a topic in the Google Groups "gensim" group.
To unsubscribe from this topic, visit https://groups.google.com/d/topic/gensim/5WrdTuA3IL8/unsubscribe.
To unsubscribe from this group and all its topics, send an email to gensim+un...@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Radim Řehůřek

Apr 22, 2015, 2:51:02 PM
to gen...@googlegroups.com

On Wednesday, April 22, 2015 at 5:17:04 PM UTC+2, Miroslav Batchkarov wrote:
I’ve got a follow-up question regarding API consistency. 

The docstring for LsiModel.__getitem__ says "Return latent representation, as a list of (topic_id, topic_value) 2-tuples.” 

The docstring for LdaModel.__getitem__ says “Return topic distribution for the given document `bow`, as a list of (topic_id, topic_probability) 2-tuples."

To me that sounds like both are doing the same thing. The tutorial (https://radimrehurek.com/gensim/tut2.html) also says __getitem__ is how you get the distribution of a document over “topics” for a LsiModel:
lsi_vec = model[tfidf_vec] # convert some new document into the LSI space, without affecting the model

I tried my example code with LsiModel instead of LdaModel and got a well-formed distribution over all 100 “topics”, as I expected. I assumed the same would work for LdaModel, which it didn’t.

Both return a list of `(topic_id, topic_weight_in_input_doc)` 2-tuples. Looks consistent to me :)

Or what is the inconsistency?

 
Is this a bug in LdaModel or the API inconsistent? Why would one need to call lda.state.getLambda() instead of __getitem__?

But these do different things! Lambda is a "vocab x topics" matrix, which tells you the prominence of each word in each topic. LDA has something similar -- the matrix of left singular values, U. These matrices are derived from your training corpus, and they don't depend on any particular "query" document.

__getitem__ gives you the topic distribution for an input document = query.

Best,
Radim

Miroslav Batchkarov

Apr 22, 2015, 5:09:31 PM
to gen...@googlegroups.com
On 22 Apr 2015, at 19:51, Radim Řehůřek <m...@radimrehurek.com> wrote:



Both return a list of `(topic_id, topic_weight_in_input_doc)` 2-tuples. Looks consistent to me :)

Or what is the inconsistency?

So we are back to my original question :) Why is there just one very prominent topic in all single-word queries in my example, given the topics look reasonable?


 
Is this a bug in LdaModel or the API inconsistent? Why would one need to call lda.state.getLambda() instead of __getitem__?

But these do different things! Lambda is a "vocab x topics" matrix, which tells you the prominence of each word in each topic. LDA has something similar -- the matrix of left singular values, U. These matrices are derived from your training corpus, and they don't depend on any particular "query" document.

Did you mean LSI?


__getitem__ gives you the topic distribution for an input document = query.

If __getitem__ is the right way to get the topic distribution for an input document, why did you point me towards get_lambda? Isn't the topic distribution for an input document the same as the prominence of the word in each topic when the query document consists of a single word (or at least proportional to it)?

If I wanted to compare two words based on their topic distributions, would I use their corresponding vectors in the lambda matrix?

PS I’m not trying to be difficult here, believe it or not :)

Radim Řehůřek

Apr 22, 2015, 6:48:24 PM
to gen...@googlegroups.com


On Wednesday, April 22, 2015 at 11:09:31 PM UTC+2, Miroslav Batchkarov wrote:


So we are back to my original question :) Why is there just one very prominent topic in all single-word queries in my example, given the topics look reasonable?

 
Is this a bug in LdaModel or the API inconsistent? Why would one need to call lda.state.getLambda() instead of __getitem__?

But these do different things! Lambda is a "vocab x topics" matrix, which tells you the prominence of each word in each topic. LDA has something similar -- the matrix of left singular values, U. These matrices are derived from your training corpus, and they don't depend on any particular "query" document.

Did you mean LSI?

Yes, LSI, sorry.
 



__getitem__ gives you the topic distribution for an input document = query.

If __getitem__ is the right way to get the topic distribution for an input document, why did you point me towards get_lambda? Isn't the topic distribution for an input document the same as the prominence of the word in each topic when the query document consists of a single word (or at least proportional to it)?

No, it isn't. The LDA algorithm assigns one topic to one document word (usually called `z` in the LDA math). This is not the same as knowing the word has different propensity toward different topics (lambda).

The process of determining what topic is picked for what word is called "inference", and its outcome depends on other words in the document too. It's not the case that "the topic where this word is most probable always wins". You can get finer statistics on this using the collect_sstats parameter in inference: `LdaModel.inference([doc], collect_sstats=True)`.
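As an illustration of that point, here is a toy sketch in plain Python (not gensim's inference code; phi and the theta values are invented): the topic picked for a word depends on the document's own topic mix as well as the word's global propensities.

```python
# p(word | topic) for 2 made-up topics (0 = "politics", 1 = "sports")
phi = {"korea": [0.6, 0.4], "race": [0.1, 0.9]}

def topic_posterior(word, theta):
    """p(z = k | word, doc) is proportional to phi[word][k] * theta[k]."""
    scores = [p * t for p, t in zip(phi[word], theta)]
    total = sum(scores)
    return [s / total for s in scores]

# The same word favours different topics depending on the document's mix:
print(topic_posterior("korea", [0.9, 0.1]))  # politics-heavy doc: topic 0 wins
print(topic_posterior("korea", [0.1, 0.9]))  # sports-heavy doc: topic 1 wins
```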

The reason why your "document = single word" query doesn't put probability 1.0 on one topic is that a little mass is assigned to every other topic too, via the prior (the hyperparameter alpha).
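Incidentally, that prior also explains the exact 0.505 in the logs above. A back-of-envelope sketch (assuming the defaults from the original post: 100 topics, symmetric alpha = 1/num_topics): for a one-word document, the variational gamma is roughly alpha + 1 on the winning topic and alpha everywhere else, so normalizing gives:

```python
num_topics = 100
alpha = 1.0 / num_topics           # 0.01 -- the "(0.010)" in the topic printout
gamma_winner = alpha + 1.0         # winning topic: prior + the one word's count
gamma_total = num_topics * alpha + 1.0
print(gamma_winner / gamma_total)  # ~0.505, matching the logged values
```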

Hopefully that makes sense and I didn't confuse you further :)

Radim

 

Miroslav Batchkarov

Apr 23, 2015, 5:29:29 AM
to gen...@googlegroups.com
On 22 Apr 2015, at 23:48, Radim Řehůřek <m...@radimrehurek.com> wrote:




If __getitem__ is the right way to get the topic distribution for an input document, why did you point me towards get_lambda? Isn't the topic distribution for an input document the same as the prominence of the word in each topic when the query document consists of a single word (or at least proportional to it)?

No, it isn't. The LDA algorithm assigns one topic to one document word (usually called `z` in the LDA math). This is not the same as knowing the word has different propensity toward different topics (lambda).

Ah yes, a word is a distribution over topics, but in a particular document each word is sampled from a single topic. That answers my question. Thanks a lot for your help.
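For later readers, that sentence is the LDA generative story, and it can be sketched in a few lines of plain Python (all distributions below are made up for illustration): each token first draws a topic z from the document's theta, then draws the word from that single topic's word distribution.

```python
import random

random.seed(0)

theta = [0.7, 0.2, 0.1]            # one document's distribution over 3 topics
phi = [                            # per-topic word distributions (toy values)
    {"korea": 0.5, "nuclear": 0.5},
    {"race": 0.6, "car": 0.4},
    {"trial": 1.0},
]

def sample_doc(num_tokens):
    """Generate a toy document: topic per token, then word from that topic."""
    doc = []
    for _ in range(num_tokens):
        z = random.choices(range(len(theta)), weights=theta)[0]
        words, probs = zip(*phi[z].items())
        doc.append(random.choices(words, weights=probs)[0])
    return doc

print(sample_doc(5))
```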
