Please help a newbie :) --- LDA model: No topics for words, only 1 topic for document?


David Rudel

Oct 10, 2017, 7:05:09 PM
to gensim
I trained an LDA model on most of the Gutenberg corpus (16 GB of text, 47 million documents). Note: because of the very large number of documents, I used a much larger chunksize than normal.

I went to play with the trained model and got some unexpected results.

I wrote a script to find topics for "I love squash as well!" and only 1 topic was returned by get_document_topics(). I expected multiple topics to have non-zero weight.

I then looked up the topics for each term and found that 3 of the 4 terms produced no output from get_term_topics(). ("as" occurs too frequently to have survived dictionary filtering.)

The term that did have topics associated with it had several, but none of them was the one returned by get_document_topics() for the whole document.

I must be missing something rather fundamental, because none of this seems right.

Does this sound like expected behavior?

I'm a newbie to NLP, but I would have expected to get multiple topics from get_document_topics().
I would also expect each word to provide output from get_term_topics().
Perhaps strangest of all, the one topic I get as output from get_document_topics() does not show up in the output of get_term_topics() for any of the individual words.

Code Below:

import gensim.corpora.dictionary as gsdict
from gensim import corpora, models, similarities
import gensim.parsing.preprocessing as gspp
import logging

logging.basicConfig(
    format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)

my_dictionary = gsdict.Dictionary.load(
    'compact_dictionary.pickle')  # dictionary for the corpus
my_tfidf_transformer = models.TfidfModel(dictionary=my_dictionary)
my_lda_model = models.LdaMulticore.load(
    'lda_model.pickle')  # LDA model trained on the Gutenberg corpus

# transform_text converts a document to a tf-idf bag of words
# using the same dictionary as the LDA model.
def transform_text(text):
    p = gspp.strip_non_alphanum(text)
    p = gspp.strip_punctuation(p)
    p = gspp.strip_multiple_whitespaces(p)
    stemmed = gspp.stem(p)
    tokenized = stemmed.lower().split()
    integer_tokens = my_dictionary.doc2bow(tokenized)
    normalized = my_tfidf_transformer[integer_tokens]
    return normalized

test_vector = transform_text(
    "I love squash as well!")

print(test_vector)

test_vector_topics = my_lda_model.get_document_topics(test_vector)

print(test_vector_topics)

for x in [41300, 63961, 295047, 314360]:
    print(my_lda_model.get_term_topics(x))


Returns:

(test_vector = ) [(41300, 0.27637812721145755), (63961, 0.8884855785595154), (295047, 0.1308802162540492), (314360, 0.34216790685881704)]

(output from my_lda_model.get_document_topics(test_vector) = ) [(30, 0.62470314998402232)]

(topics for each term separately = )
[]
[]
[(1, 0.011547986188927634), (17, 0.027571190976285399), (28, 0.010000921614138609), (36, 0.010903172119846512), (59, 0.010494616992062562), (67, 0.033608278750364595), (71, 0.010858201676055121), (75, 0.010440402498124578), (76, 0.017004753096027021), (90, 0.025057497313024479), (91, 0.010945368645774396), (96, 0.012055742897041141), (98, 0.013998287893132269)]
[]

Ivan Menshikh

Oct 11, 2017, 12:28:24 AM
to gensim
Hi David,

1. LDA works badly with short texts (yours is really short), so you are unlikely to get a good result. Try replacing "I love squash as well!" with a longer text about squash.
2. The get_term_topics method uses a `minimum_probability` parameter. By default it is None, in which case the 0.01 default from the constructor is used; you can set the threshold more carefully yourself, e.g. minimum_probability=1e-4.
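A toy illustration of that thresholding, with invented (topic_id, probability) pairs in the shape get_term_topics() returns (not output from the real model):

```python
# Hypothetical (topic_id, probability) pairs; the numbers are made up.
term_topics = [(1, 0.0115), (28, 0.0021), (67, 0.0336)]

def apply_threshold(pairs, minimum_probability=0.01):
    # Topics below the threshold are simply dropped, which is why a
    # term whose topic weights all sit under 0.01 comes back as [].
    return [(topic, p) for topic, p in pairs if p >= minimum_probability]

print(apply_threshold(term_topics))                            # topic 28 is dropped
print(apply_threshold(term_topics, minimum_probability=1e-4))  # all three survive
```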

David Rudel

Oct 11, 2017, 12:43:27 AM
to gensim
Ah, thanks! I was led astray by the API guide, where a default of 1e-8 was indicated.

Is there a rule of thumb for how long the documents need to be for LDA?  I plan on using the model on snippets 1-3 sentences in length---probably 7-20 terms typically.

Are there other models you would suggest for documents in the range of 7-20 terms?

Thanks for all your help!

Ivan Menshikh

Oct 11, 2017, 12:55:36 AM
to gensim
Short answer: the more terms the better, but 20 terms should be enough for normal work. The problem is that LDA is based on statistics, and with small documents there isn't enough information for the model.
To avoid this problem, I propose the WNTM model. The idea is very simple: you re-slice your corpus to create new "pseudo-documents" (in the manner of a co-occurrence matrix), so that each pseudo-document corresponds to one word from the dictionary. Next, pass this new corpus to LDA and fit it; when you need to retrieve the topic vector for an original document (not a pseudo-document), use the simple inference formula from page 6 of the paper.
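A minimal sketch of the pseudo-document construction described above (the helper name and window size are my own choices, not WNTM's exact recipe):

```python
from collections import defaultdict

def build_pseudo_documents(docs, window=1):
    """For each word, gather the words that co-occur with it within
    +/- `window` positions across the corpus; that bag of neighbours
    becomes the word's pseudo-document."""
    pseudo = defaultdict(list)
    for doc in docs:
        for i, word in enumerate(doc):
            lo, hi = max(0, i - window), min(len(doc), i + window + 1)
            pseudo[word].extend(doc[lo:i] + doc[i + 1:hi])
    return dict(pseudo)

docs = [["i", "love", "squash"], ["squash", "is", "great"]]
pseudo = build_pseudo_documents(docs)
print(pseudo["squash"])  # neighbours of "squash" across both documents
```

Each of these pseudo-documents would then be fed to LDA in place of the (too short) original documents.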

David Rudel

Oct 11, 2017, 10:06:13 AM
to gensim
Thanks so much, Ivan!

David Rudel

Oct 11, 2017, 10:55:42 AM
to gensim
Ivan,
Follow-up question: for these short documents (7-20 terms) I actually have topic labels attached, and I want to use them to predict topics for other short documents.

Would Yoon Kim's CNN algorithm [https://datawarrior.wordpress.com/2016/10/12/short-text-categorization-using-deep-neural-networks-and-word-embedding-models/] work okay for text of this length, or would there be performance issues using 20-term documents in that convolution model?

It looks like exactly what I need!

Ivan Menshikh

Oct 12, 2017, 1:19:47 AM
to gensim
In the blog post I see word2vec and a CNN, but these are not topic models (embeddings aren't interpretable, unlike LDA). As I understood from your first message, you want to use a topic model.

But if you want to solve a classification/clustering task, you are welcome to use the approaches from the blog post. I see little difference there between the sum of word vectors and the CNN model, so you could try the simpler solution (word2vec):

1. Train a Word2Vec model on your dataset.
2. For each document:
    - collect all its word vectors
    - combine them (the simplest variant is a normalized sum or an average of the word vectors); I also suggest a more advanced method (see the presentation with highlights and the original paper)
3. Use this combination as the new document vector for any downstream task.
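Step 2 can be sketched in plain Python; here `word_vectors` stands in for a trained model's word-to-vector mapping (e.g. a gensim KeyedVectors), and the tiny 2-d vectors are invented for illustration:

```python
def average_vector(tokens, word_vectors):
    """Average the vectors of the in-vocabulary tokens;
    out-of-vocabulary tokens (here "i") are skipped."""
    vecs = [word_vectors[t] for t in tokens if t in word_vectors]
    if not vecs:
        return None
    dim = len(vecs[0])
    return [sum(v[i] for v in vecs) / len(vecs) for i in range(dim)]

word_vectors = {"love": [1.0, 0.0], "squash": [0.0, 1.0]}
doc_vector = average_vector(["i", "love", "squash"], word_vectors)
print(doc_vector)  # [0.5, 0.5]
```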

David Rudel

Oct 12, 2017, 3:26:54 AM
to gensim
Hi Ivan, I think I'm confused [or perhaps am using the wrong nomenclature.]

The blog post seems to refer to topic classification to me. I'm thinking of each subject as a topic. So the training data is (document -> topic) pairs:

"linear algebra" -> mathematics
"topology" -> mathematics
"algebra" -> mathematics
...
"eucharist" -> theology

Am I misinterpreting this? Or perhaps "topic modeling" means something more specific than I realize?

Maybe "modeling" is too strong of a word... I mean "topic classification." I.e., take a document and estimate which of several pre-set topics it refers to.

Ivan Menshikh

Oct 12, 2017, 7:12:50 AM
to gensim
If you already have topic labels (a labeled dataset), you can of course formulate it as a classification task. Topic modeling, by default, is a fully unsupervised technique.
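As a toy illustration of that framing, here is a deliberately naive token-overlap classifier built from the (document -> topic) pairs David listed; this is just to show the supervised setup, not a substitute for the word2vec/CNN approaches discussed above:

```python
from collections import Counter

# The (document -> topic) pairs from the earlier message, as training data.
train = [("linear algebra", "mathematics"),
         ("topology", "mathematics"),
         ("algebra", "mathematics"),
         ("eucharist", "theology")]

# Build one bag of words per label.
label_bags = {}
for text, label in train:
    label_bags.setdefault(label, Counter()).update(text.split())

def classify(text):
    """Pick the label whose training bag overlaps most with the snippet."""
    tokens = text.split()
    return max(label_bags,
               key=lambda lab: sum(label_bags[lab][t] for t in tokens))

print(classify("abstract algebra"))  # "mathematics"
```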

David Rudel

Oct 12, 2017, 10:48:00 AM
to gensim
Thanks for the correction in terminology! Now I understand what you are saying.

Ivan Menshikh

Oct 12, 2017, 11:25:20 AM
to gensim
Small addition: in topic models we typically have 2 matrices, document x topic and topic x word. These matrices are stochastic, i.e. their rows are really discrete distributions (unlike any2vec models, where we have no similar constraint).
Thanks to this property, the topics are interpretable.
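A toy illustration of that constraint (the matrix and vocabulary are invented): each row of the topic x word matrix is a probability distribution, so a topic can be read off as its highest-probability words.

```python
# Invented topic-by-word matrix over a 3-word vocabulary.
topic_word = [
    [0.5, 0.3, 0.2],  # topic 0
    [0.1, 0.1, 0.8],  # topic 1
]
for row in topic_word:
    assert abs(sum(row) - 1.0) < 1e-12  # rows are stochastic

# A topic's "top word" is just its highest-probability entry:
vocab = ["squash", "love", "theology"]
top_word = max(range(len(vocab)), key=lambda j: topic_word[0][j])
print(vocab[top_word])  # "squash"
```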