Using existing model for inference on new corpus

frederi...@gmail.com

Dec 6, 2016, 5:27:15 AM

to gensim

Hi everyone,

I'm trying to use a model trained on one corpus to get topic assignments on new data.
Here is what I have so far:

#Model is LdaMulticore and was trained on another corpus
model = gensim.models.LdaMulticore.load(model_file)
corpus_new = gensim.corpora.MmCorpus(corpus_file)


topic_assignments = model[corpus_new]

for i in topic_assignments: 
    print i
    
# Results in error:
Traceback (most recent call last):

  File "<ipython-input-17-be690e265876>", line 1, in <module>
    for i in t_doc:

  File "/home/frederic/anaconda/lib/python2.7/site-packages/gensim/interfaces.py", line 122, in __iter__
    yield self.obj[doc]

  File "/home/frederic/anaconda/lib/python2.7/site-packages/gensim/models/ldamodel.py", line 921, in __getitem__
    return self.get_document_topics(bow, eps)

  File "/home/frederic/anaconda/lib/python2.7/site-packages/gensim/models/ldamodel.py", line 908, in get_document_topics
    gamma, _ = self.inference([bow])

  File "/home/frederic/anaconda/lib/python2.7/site-packages/gensim/models/ldamodel.py", line 432, in inference
    expElogbetad = self.expElogbeta[:, ids]

IndexError: index 8088 is out of bounds for axis 1 with size 7477

I suspect that this is caused by the mismatch between the word IDs used by the model and those used by the new corpus.
I would think that I somehow have to use the dictionary of the old corpus on the new one, but I am unclear on how to do this.

Any hints are appreciated.
Frederic

Kenneth Orton

Dec 7, 2016, 1:54:12 AM

to gensim

Hello,

I'm just another gensim user, but I thought I'd try to help. You could try transforming the new corpus with the dictionary that was used to train the model before you query the model for topics.

import gensim
from gensim.models import VocabTransform

model = gensim.models.LdaMulticore.load(model_file)
# dictionary that was used to train the model
old_dict = gensim.corpora.Dictionary.load('dictionary_used_to_train_model')

corpus = gensim.corpora.MmCorpus(corpus_file)
# dictionary that was used to build the new corpus; it has to hold the real tokens.
# Dictionary.from_corpus(corpus) without id2word only yields the ids as strings,
# so it cannot be matched against old_dict.
new_dict = gensim.corpora.Dictionary.load('dict.dict')

# map the ids of the new corpus to the ids the model was trained with;
# tokens the model has never seen are dropped
new2old = {new_id: old_dict.token2id[token]
           for token, new_id in new_dict.token2id.items()
           if token in old_dict.token2id}
vt = VocabTransform(new2old)
gensim.corpora.MmCorpus.serialize('transformed_corpus.mm', vt[corpus], id2word=old_dict, progress_cnt=10000)

corpus_new = gensim.corpora.MmCorpus('transformed_corpus.mm')


topic_assignments = model[corpus_new]

for i in topic_assignments:
    print(i)
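
To make the id remapping concrete, here is a minimal pure-Python sketch of the idea, without gensim. The function name `remap_bow` and the toy vocabularies are made up for illustration; the sketch assumes you have an id-to-token mapping for the new corpus and a token-to-id mapping for the dictionary the model was trained with. Words the model has never seen are simply dropped, so every remaining id is one the model knows.

```python
def remap_bow(bow, new_id2token, old_token2id):
    """Translate one bag-of-words document into the training-time id space.

    bow          -- list of (word_id, count) pairs using the new corpus ids
    new_id2token -- dict: id in the new corpus -> token string
    old_token2id -- dict: token string -> id the model was trained with
    """
    remapped = []
    for word_id, count in bow:
        token = new_id2token[word_id]
        if token in old_token2id:  # words the model never saw are dropped
            remapped.append((old_token2id[token], count))
    return sorted(remapped)


# Hypothetical toy vocabularies, for illustration only.
old_token2id = {"cat": 0, "dog": 1, "fish": 2}
new_id2token = {0: "dog", 1: "zebra", 2: "cat"}

doc = [(0, 2), (1, 1), (2, 3)]  # 2x "dog", 1x "zebra", 3x "cat"
remapped_doc = remap_bow(doc, new_id2token, old_token2id)
print(remapped_doc)
```

In this toy case "zebra" is dropped because it is not in the training vocabulary, and "dog" and "cat" end up with the ids the model expects.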
