Using existing model for inference on new corpus

frederi...@gmail.com

Dec 6, 2016, 5:27:15 AM
to gensim
Hi everyone,

I'm trying to use a model trained on one corpus to get topic assignments on new data.
Here is what I have so far:

import gensim

# Model is LdaMulticore and was trained on another corpus
model = gensim.models.LdaMulticore.load(model_file)
corpus_new = gensim.corpora.MmCorpus(corpus_file)

topic_assignments = model[corpus_new]

for i in topic_assignments:
    print(i)
   
# Results in error:
Traceback (most recent call last):

  File "<ipython-input-17-be690e265876>", line 1, in <module>
    for i in t_doc:

  File "/home/frederic/anaconda/lib/python2.7/site-packages/gensim/interfaces.py", line 122, in __iter__
    yield self.obj[doc]

  File "/home/frederic/anaconda/lib/python2.7/site-packages/gensim/models/ldamodel.py", line 921, in __getitem__
    return self.get_document_topics(bow, eps)

  File "/home/frederic/anaconda/lib/python2.7/site-packages/gensim/models/ldamodel.py", line 908, in get_document_topics
    gamma, _ = self.inference([bow])

  File "/home/frederic/anaconda/lib/python2.7/site-packages/gensim/models/ldamodel.py", line 432, in inference
    expElogbetad = self.expElogbeta[:, ids]

IndexError: index 8088 is out of bounds for axis 1 with size 7477

I suspect that this is caused by the mismatch between the word IDs used by the model and those used by the new corpus.
I would think that I somehow have to use the dictionary of the old corpus on the new one, but I am unclear on how to do this.

Any hints are appreciated.
Frederic
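
A quick way to test this suspicion is to look for word IDs in the new corpus that fall outside the model's vocabulary. The sketch below is plain Python on toy data, not gensim code: `out_of_range_ids` is a hypothetical helper, and with the real objects the vocabulary size would come from `model.num_terms` and the documents from iterating over `corpus_new`.

```python
def out_of_range_ids(corpus, num_terms):
    """Return the sorted word ids in `corpus` that a model with `num_terms` terms cannot index."""
    bad = set()
    for bow in corpus:
        for word_id, _count in bow:
            # valid ids run from 0 to num_terms - 1
            if word_id >= num_terms:
                bad.add(word_id)
    return sorted(bad)


# Toy stand-ins (assumed values): a model with 7477 terms and two bag-of-words documents.
toy_corpus = [[(0, 1), (12, 3)], [(7476, 1), (8088, 2)]]
bad_ids = out_of_range_ids(toy_corpus, num_terms=7477)
print(bad_ids)  # prints [8088]
```

An empty result does not prove that the vocabularies match: IDs can be in range and still refer to different words, in which case inference runs without error but the topic assignments are meaningless.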

Kenneth Orton

Dec 7, 2016, 1:54:12 AM
to gensim
Hello,
I'm just another Gensim user, but I thought I'd try to help. You might try remapping the new corpus to the dictionary that was used to train the model before you query the model for topics.
import gensim
from gensim.models import VocabTransform

model = gensim.models.LdaMulticore.load(model_file)
# dictionary the model was trained with
old_dict = gensim.corpora.Dictionary.load('dictionary_used_to_train_model')

corpus = gensim.corpora.MmCorpus(corpus_file)
# dictionary that was used to build the new corpus; load the saved one, because
# Dictionary.from_corpus(corpus) on its own cannot recover the actual tokens
new_dict = gensim.corpora.Dictionary.load('dict.dict')

# transform the corpus: map each word id of the new corpus to the id the model
# uses for the same token; tokens the model has never seen are dropped
new2old = {new_id: old_dict.token2id[token] for new_id, token in new_dict.iteritems() if token in old_dict.token2id}
vt = VocabTransform(new2old)
gensim.corpora.MmCorpus.serialize('transformed_corpus.mm', vt[corpus], id2word=old_dict, progress_cnt=10000)

corpus_new = gensim.corpora.MmCorpus('transformed_corpus.mm')

topic_assignments = model[corpus_new]

for i in topic_assignments:
    print(i)
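
The remapping that `VocabTransform` applies can be illustrated without gensim. The sketch below uses plain dicts as stand-ins for the two dictionaries; `remap_bow` and the toy vocabularies are hypothetical and only show the idea: look up the token behind each new ID, then the model's ID for that token, and drop tokens the model has never seen.

```python
def remap_bow(bow, new_id2token, old_token2id):
    """Translate (new_id, count) pairs into the model's id space, dropping unknown tokens."""
    remapped = []
    for new_id, count in bow:
        token = new_id2token[new_id]
        old_id = old_token2id.get(token)
        if old_id is not None:
            remapped.append((old_id, count))
    # keep the pairs ordered by id, as bag-of-words vectors usually are
    return sorted(remapped)


# Toy vocabularies (assumed): the one the model was trained with, and the new corpus's.
old_token2id = {"topic": 0, "model": 1, "corpus": 2}
new_id2token = {0: "corpus", 1: "inference", 2: "model"}

doc = [(0, 2), (1, 1), (2, 4)]  # counts for "corpus", "inference", "model"
remapped_doc = remap_bow(doc, new_id2token, old_token2id)
print(remapped_doc)  # prints [(1, 4), (2, 2)]; "inference" is unknown to the model
```

If the raw tokenized text of the new documents is still available, calling `old_dict.doc2bow(tokens)` for each document yields vectors in the model's ID space directly, since `doc2bow` ignores tokens that are not in the dictionary by default.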