I'm running into a confusing issue when attempting to apply an existing LDA model to the documents of a new corpus (the ultimate goal being to get the topic scores for each document in the new corpus).
I'm able to load the LDA model fine, but the output has two dimensions for each document (two lists of numbers, with the correct number of items in each list, corresponding to the number of topics) and the numbers don't look right (though I'm not 100% sure about this).
>>> other_corpus = [common_dictionary.doc2bow(text) for text in other_texts]
>>> unseen_doc = other_corpus[0]
>>> vector = lda[unseen_doc] # get topic probability distribution for a document
After loading the documents into the corpus and dictionary and loading the previous model, I tried running:
scoring = []
for document in corpus:
    score = lda_model[document]
    scoring = []
    scoring.append(score)
np.savetxt('new_document_scoring.csv', scoring, delimiter=',')
This generates an error that says the data is 3D rather than 2D. To get a closer look at the output, I ran:
newdoc = corpus[0]
scoring = lda_model[newdoc]
np.savetxt('new_document_scoring.csv', scoring, delimiter=',' )
and was then able to see the two columns with the scores. To give a sense of the output, the first five values in each of the columns are:
0.000000000000000000e+00,1.117508625611662865e-03
1.000000000000000000e+00,1.117508625611662865e-03
2.000000000000000000e+00,1.117508625611662865e-03
3.000000000000000000e+00,1.117508625611662865e-03
4.000000000000000000e+00,1.117508625611662865e-03
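My current reading, which may be wrong, is that the first column is just the topic index and the second is the score, i.e. each document comes back as a list of (topic_id, probability) pairs. If so, a list of such lists would be 3D, which would explain the error. Below is a minimal sketch of flattening that into one row per document. It does not call gensim at all: `fake_scores` is made-up stand-in data for `[lda_model[doc] for doc in corpus]`, and `densify` and `num_topics` are names I invented for illustration. I believe gensim can leave out low-probability topics, which is why the sketch fills missing topics with 0.0.

```python
import csv
import io


def densify(sparse_scores, num_topics):
    """Turn [(topic_id, prob), ...] into a flat list of length num_topics."""
    row = [0.0] * num_topics
    for topic_id, prob in sparse_scores:
        row[int(topic_id)] = float(prob)
    return row


# Hypothetical stand-in for [lda_model[doc] for doc in corpus];
# the second document is missing topics 0 and 2 on purpose.
fake_scores = [
    [(0, 0.1), (1, 0.7), (2, 0.2)],
    [(1, 0.9)],
]
num_topics = 3  # assumed; would be the model's topic count

# One row per document, one column per topic: a plain 2D table.
matrix = [densify(scores, num_topics) for scores in fake_scores]

# Write the 2D table as CSV (to an in-memory buffer here, not a real file).
buffer = io.StringIO()
csv.writer(buffer).writerows(matrix)
```

If this reading is right, the same `matrix` could presumably be passed to `np.savetxt` without the dimensionality error, and `lda_model.get_document_topics(document, minimum_probability=0)` might be a way to get every topic back, but I have not confirmed either.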
If anyone has suggestions for how to revise this to get the topic scores for the individual documents, I would appreciate it!