LDA document-topic assignment


Bhaskar Khaneja

Jul 17, 2016, 3:09:10 AM
to gensim
Hello, everyone! I am using LDA to assign topics to documents. The topics themselves are coming out pretty good, but the document-topic assignments are really bad. Here's the code I am using:

corpus_bow = BOWCorpus()

tfidf = tfidfmodel.TfidfModel(corpus_bow)
corpus_tfidf = tfidf[corpus_bow]

lda = ldamulticore.LdaModel(corpus_tfidf, id2word=dictionary, num_topics=50, alpha=0.1, eta=0.1)
corpus_lda = lda[corpus_tfidf]

lda.print_topics(50)

for doc, vector in zip(DocCorpus(), corpus_lda):
    print(str(doc) + str(sorted(lda[vector], key=lambda x: x[1], reverse=True)[:3]))

Is there anything I am doing wrong? lda[vector] seems to be giving me the wrong topics - or so I feel. For example, there's a topic that's clearly about cars (0.939*cars + 0.001..) - 93.9% "cars" - and yet a document that is basically just the word "cars" gets assigned to an unrelated topic. What's going on here? And how is lda[vector] different from lda.get_document_topics(vector)?

BHARGAV SRINIVASA DESIKAN

Jul 18, 2016, 12:26:00 AM
to gensim
Hello Bhaskar,

Have you tried giving the 'get_document_topics' method a shot?

Bhaskar Khaneja

Jul 18, 2016, 3:51:38 PM
to gensim
Hi Bhargav, 

I have tried both but am still getting the same bad assignments. To debug the problem, I took a very small corpus (5 docs where 3 were about cars and 2 were about food) and ran LDA with num_topics = 2. While the topics were good, one thing I noticed (with both lda[vector] and lda.get_document_topics(vector)) was that every doc was getting the same document-topic probability distribution. For some reason, I was getting "[(0, 0.083333335604447431), (1, 0.91666666439555256)]" for every single document. 

I checked the corresponding corpus_tfidf and corpus_bow vectors for all the documents and they were all different (as they should be), but that was not true for the vectors in corpus_lda. I do get different document-topic distributions with a huge corpus, though, so I am beginning to wonder whether there's something fundamentally wrong with what I am doing or with how gensim's LDA works. Any ideas?

Any help will be appreciated. Thank you so much! 

Bhaskar Khaneja

Jul 18, 2016, 5:32:27 PM
to gensim
UPDATE: The document-topic probability distributions are actually not identical, just very similar to each other (e.g. "(0, 0.9166666630474718), (1, 0.08333333695252812)" vs "(0, 0.91666666304798194), (1, 0.083333336952017958)").

BHARGAV SRINIVASA DESIKAN

Jul 19, 2016, 12:54:19 AM
to gensim
I'll have a closer look and let you know... Lev, Radim, any thoughts on this?

BHARGAV SRINIVASA DESIKAN

Jul 19, 2016, 10:24:15 AM
to gensim
It could also be your alpha value of 0.1, which is rather high.
Could you try running the same code again with the default values for alpha and eta, and maybe also increase the number of iterations?

Bhaskar Khaneja

Jul 19, 2016, 3:51:35 PM
to gensim
Turns out the problem was that in the last line I was doing "lda[vector]", where vector is already a vector in corpus_lda - that was causing the assignments to be all messed up. When I changed it to just "vector", I got more sensible results. The probabilities are still very black and white, though: for any given document and topic it's either 0.024999999999999991 or 0.21999983554996505 (no other number ever appears). Is that correct?

adalex

Aug 31, 2016, 3:39:13 AM
to gensim
Hi Bhaskar,

I have a similar problem: the topic assignments look bad for my data. It seems you found the problem in your case, but I don't understand how you fixed it, i.e. what you mean by 'Turns out I was doing "lda[vector]" (where vector is a vector in corpus_lda) in the last line that was causing the assignments to be all messed up. When I changed it to just "vector", I got more sensible results.' Could you please clarify? Thanks, Andrea

Lev Konstantinovskiy

Sep 4, 2016, 12:02:08 PM
to gensim
Hi Adalex,

Could you please post the code so I can better answer your question?

Bhaskar was referring to the fact that he was erroneously applying the model twice, like this: lda[lda[vector]]. The correct way is lda[vector].

Regards
Lev