Applying Previous Model to New Document Corpus

Brady Krien

Feb 22, 2021, 10:03:11 AM
to Gensim
I'm running into a confusing issue when attempting to apply an existing LDA model to the documents of a new corpus (the ultimate goal being to get the topic scores for each document in the new corpus).

I'm able to load the LDA model fine, but the output has two dimensions for each document (two lists of numbers, with the correct number of items in each list, corresponding to the number of topics) and the numbers don't look right (though I'm not 100% sure about this). 

For reference, I'm attempting to follow the documentation at https://radimrehurek.com/gensim/models/ldamodel.html, specifically

>>> other_corpus = [common_dictionary.doc2bow(text) for text in other_texts] 
>>> unseen_doc = other_corpus[0] 
>>> vector = lda[unseen_doc] # get topic probability distribution for a document

After loading the documents into the corpus and dictionary and loading the previous model, I attempt to run:

scoring =[ ]
for document in corpus:
    score = lda_model[document]
    scoring =[ ]
    scoring.append(score)
np.savetxt('new_document_scoring.csv', scoring, delimiter=',' )

This generates an error that says the data is 3D rather than 2D. To get a closer look at the output, I ran

newdoc = corpus[0]
scoring = lda_model[newdoc]
np.savetxt('new_document_scoring.csv', scoring, delimiter=',' )

and was then able to see the two columns with the scores. To give a sense of the output, the first five values in each of the columns are: 

0.000000000000000000e+00,1.117508625611662865e-03
1.000000000000000000e+00,1.117508625611662865e-03
2.000000000000000000e+00,1.117508625611662865e-03
3.000000000000000000e+00,1.117508625611662865e-03
4.000000000000000000e+00,1.117508625611662865e-03

If anyone has any suggestions for how to revise this to get the topic scores for the individual documents, I would appreciate it!

Radim Řehůřek

Feb 23, 2021, 10:19:59 AM
to Gensim
That sounds right – lda_model[document] returns the document's topics, as a list of pairs (topic_id, topic_weight).
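
So the first column in the CSV above is the topic id and the second is the weight. To save one row of weights per document, one option is to expand each list of pairs into a fixed-length row before stacking. The following is only a minimal sketch: the helper name topics_to_dense and the sample pairs are made up for illustration, and with a real model the pairs would come from lda_model[document] and the row length from lda_model.num_topics. Topics below the model's minimum probability threshold can be left out of the returned list, so the lists may differ in length from document to document; the helper fills missing topics with 0.

```python
import numpy as np

def topics_to_dense(doc_topics, num_topics):
    """Turn [(topic_id, weight), ...] into a fixed-length row of weights."""
    row = np.zeros(num_topics)
    for topic_id, weight in doc_topics:
        row[int(topic_id)] = weight
    return row

# Hypothetical stand-ins for what lda_model[document] returns for two documents.
# With a real model: per_document_topics = [lda_model[doc] for doc in corpus]
per_document_topics = [
    [(0, 0.7), (2, 0.3)],
    [(1, 0.9), (3, 0.1)],
]
num_topics = 4  # with a real model: lda_model.num_topics

# Initialise the list once, outside the loop, and stack into a 2D array
# with one row per document and one column per topic.
scoring = np.vstack(
    [topics_to_dense(topics, num_topics) for topics in per_document_topics]
)
np.savetxt('new_document_scoring.csv', scoring, delimiter=',')
```

Note that the loop in the original post resets scoring = [ ] on every iteration, so only the last document would survive; the sketch above builds the list once.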

See also the tutorial on transformation in Gensim:

HTH,
Radim



