I trained an LDA model on most of the Gutenberg corpus (16 GB of text, 47 million documents). Note: because of the very large number of documents, I used a much larger chunksize than usual.
I went to play with the trained model and got some unexpected results.
I wrote a script to find topics for "I love squash as well!", but get_document_topics() returned only one topic. I expected multiple topics to have non-zero weight.
I then looked up the topics for each term individually and found that 3 of the 4 terms produced no output from get_term_topics(). ("as" occurs too frequently to have made it into the dictionary, so it has no term id.)
The one term that did return topics listed several, but none of them matched the topic returned by get_document_topics() for the whole document.
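As an aside, my understanding of why "as" contributes nothing is that doc2bow() silently drops tokens that are not in the dictionary rather than raising an error. A toy mimic of that behavior (the token-to-id mapping here is hypothetical, not the real one from my dictionary):

```python
# Toy mimic of Dictionary.doc2bow(): tokens absent from the dictionary are
# silently skipped, so "as" never shows up in the bag of words.
from collections import Counter

# Hypothetical ids -- I don't actually know which id maps to which token.
token2id = {"love": 41300, "squash": 63961, "well": 295047, "i": 314360}

def toy_doc2bow(tokens):
    counts = Counter(t for t in tokens if t in token2id)  # unknown tokens dropped
    return sorted((token2id[t], n) for t, n in counts.items())

print(toy_doc2bow(["i", "love", "squash", "as", "well"]))
# "as" has no id, so only four (id, count) pairs come back
```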
I must be missing something rather fundamental, because none of this seems right.
Does this sound like expected behavior?
I'm a newbie to NLP, but I would have expected to get multiple topics from get_document_topics().
I would also expect each word to provide output from get_term_topics().
Perhaps strangest of all, the one topic I get from get_document_topics() does not show up in the get_term_topics() output for any of the individual words.
Code Below:
import gensim.corpora.dictionary as gsdict
from gensim import corpora, models, similarities
import gensim.parsing.preprocessing as gspp
import logging
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)
my_dictionary = gsdict.Dictionary.load('compact_dictionary.pickle')  # dictionary for the corpus
my_tfidf_transformer = models.TfidfModel(dictionary=my_dictionary)
my_lda_model = models.LdaMulticore.load('lda_model.pickle')  # LDA model trained on the Gutenberg corpus
#transform_text converts a document to tfidf bag of words using the same pairing as lda_model
def transform_text(text):
    p = gspp.strip_non_alphanum(text)
    p = gspp.strip_punctuation(p)
    p = gspp.strip_multiple_whitespaces(p)
    stemmed = gspp.stem(p)
    tokenized = stemmed.lower().split()
    integer_tokens = my_dictionary.doc2bow(tokenized)
    normalized = my_tfidf_transformer[integer_tokens]
    return normalized
test_vector = transform_text("I love squash as well!")
print(test_vector)
test_vector_topics = my_lda_model.get_document_topics(test_vector)
print(test_vector_topics)
for x in [41300, 63961, 295047, 314360]:
    print(my_lda_model.get_term_topics(x))
Returns:

test_vector:
[(41300, 0.27637812721145755), (63961, 0.8884855785595154), (295047, 0.1308802162540492), (314360, 0.34216790685881704)]

my_lda_model.get_document_topics(test_vector):
[(30, 0.62470314998402232)]

my_lda_model.get_term_topics(x) for each term:
[]
[]
[(1, 0.011547986188927634), (17, 0.027571190976285399), (28, 0.010000921614138609), (36, 0.010903172119846512), (59, 0.010494616992062562), (67, 0.033608278750364595), (71, 0.010858201676055121), (75, 0.010440402498124578), (76, 0.017004753096027021), (90, 0.025057497313024479), (91, 0.010945368645774396), (96, 0.012055742897041141), (98, 0.013998287893132269)]
[]
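One thing I wondered about: if get_term_topics() filters out topics whose probability falls below a minimum_probability cutoff (which is how I read the gensim source, though I may be wrong), that alone could explain the empty lists for low-weight terms. A stand-alone sketch of that kind of thresholding, with made-up probabilities:

```python
# Pure-Python sketch of the thresholding I *think* get_term_topics() applies.
# This is my reading of the library, not its actual code, and the per-topic
# probabilities below are invented for illustration.
def filter_term_topics(topic_probs, minimum_probability=None, model_default=0.01):
    if minimum_probability is None:  # fall back to a model-level default threshold
        minimum_probability = model_default
    minimum_probability = max(minimum_probability, 1e-8)  # never allow exactly 0
    return [(topic, p) for topic, p in topic_probs if p >= minimum_probability]

probs = [(0, 0.004), (1, 0.012), (2, 0.0005)]
print(filter_term_topics(probs))       # only topic 1 clears the default cutoff
print(filter_term_topics(probs, 0.0))  # all three topics survive the 1e-8 floor
```

If that reading is right, re-running my loop with get_term_topics(x, minimum_probability=0) should show whether those terms truly have zero weight in every topic or are just being filtered.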