Index Error, using an already trained LDA model


Jonathan Klemetz

May 10, 2016, 9:08:39 AM
to gensim
Hello!

I am trying to map topics to each document in a corpus of ca. 2,000 academic papers.

So, following Blei's (2003) recommendation to use a training set of ca. 10% of the corpus to train an LDA model, I used the following code:
************************************************************************************************************************
import os
import logging
import gensim
from gensim import corpora

def iter_documents(top_directory):
    """Iterate over all documents, yielding a document (= list of utf8 tokens) at a time."""
    for root, dirs, files in os.walk(top_directory):
        for file in filter(lambda file: file.endswith('.txt'), files):
            document = open(os.path.join(root, file)).read()  # read the entire document, as one big string
            yield gensim.utils.tokenize(document, lower=True)  # or whatever tokenization suits you

class MyCorpus(object):
    def __init__(self, top_dir):
        self.top_dir = top_dir
        self.dictionary = gensim.corpora.Dictionary(iter_documents(top_dir))
        self.dictionary.save('/tmp/deerwester.dict')

    def __iter__(self):
        for tokens in iter_documents(self.top_dir):
            yield self.dictionary.doc2bow(tokens)

def main():
    top_directory = 'path to my training set'
    logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)
    corpus = MyCorpus(top_directory)  # creates a MyCorpus object containing all the documents
    dictionary = corpora.Dictionary.load('/tmp/deerwester.dict')

    # Not entirely sure what this is doing, but it was necessary to create a proper corpus object.
    # I think it is because gensim's algorithms require the corpora to be stored on disk.
    corpora.MmCorpus.serialize('/tmp/corpus.mm', corpus)
    corpus = corpora.MmCorpus('/tmp/corpus.mm')
    lda = gensim.models.ldamulticore.LdaMulticore(corpus=corpus, id2word=dictionary, num_topics=100,
                                                  passes=50, batch=True, workers=2, chunksize=3000, iterations=1000)
    lda.save('model directory')
    print("Ending!")

main()
************************************************************************************************************************

Then I thought it would be totally fine to reuse this saved model on the entire set of papers, so I use a different directory but almost the same code. Only the last lines in main are of importance here, I guess.

************************************************************************************************************************
import os
import logging
import gensim
from gensim import corpora

def iter_documents(top_directory):
    """Iterate over all documents, yielding a document (= list of utf8 tokens) at a time."""
    for root, dirs, files in os.walk(top_directory):
        for file in filter(lambda file: file.endswith('.txt'), files):
            document = open(os.path.join(root, file)).read()  # read the entire document, as one big string
            yield gensim.utils.tokenize(document, lower=True)  # or whatever tokenization suits you

class MyCorpus(object):
    def __init__(self, top_dir):
        self.top_dir = top_dir
        self.dictionary = gensim.corpora.Dictionary(iter_documents(top_dir))

    def __iter__(self):
        for tokens in iter_documents(self.top_dir):
            yield self.dictionary.doc2bow(tokens)

def main():
    top_directory = 'my already lemmatized and pre-processed documents'
    logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)
    dictionary = corpora.Dictionary.load('/tmp/deerwester.dict')
    corpus = MyCorpus(top_directory)  # creates a MyCorpus object containing all the documents
    # Not entirely sure what this is doing, but it was necessary to create a proper corpus object.
    # I think it is because gensim's algorithms require the corpora to be stored on disk.
    corpora.MmCorpus.serialize('/tmp/runCorpus.mm', corpus)
    corpus = corpora.MmCorpus('/tmp/runCorpus.mm')
    lda = gensim.models.LdaModel.load('my already trained model')
    #lda.print_topics(num_topics=100, num_words=7)
    print(corpus)

    # A counter so I can try to find where it breaks
    x = 1
    for i in corpus:
        print(x)
        print(lda[i])
        x = x + 1

main()
************************************************************************************************************************

Then it all goes down the drain. I have tried a lot of different built-in gensim functions to at least get a topic extraction working, but I think I am doing something wrong with the dictionary.

The error I receive is:

Traceback (most recent call last):
  File "runModel.py", line 55, in <module>
    main()
  File "runModel.py", line 49, in main
    print(lda[i])
  File "/usr/local/lib/python2.7/dist-packages/gensim-0.12.4-py2.7-linux-x86_64.egg/gensim/models/ldamodel.py", line 921, in __getitem__
    return self.get_document_topics(bow, eps)
  File "/usr/local/lib/python2.7/dist-packages/gensim-0.12.4-py2.7-linux-x86_64.egg/gensim/models/ldamodel.py", line 908, in get_document_topics
    gamma, _ = self.inference([bow])
  File "/usr/local/lib/python2.7/dist-packages/gensim-0.12.4-py2.7-linux-x86_64.egg/gensim/models/ldamodel.py", line 432, in inference
    expElogbetad = self.expElogbeta[:, ids]
IndexError: index 30213 is out of bounds for axis 1 with size 30213

In other similar error threads, Radim suggests it is something to do with the dictionary being used incorrectly. But maybe I am just being slow, because I can't seem to figure it out.

Anyone who can help?

Lev Konstantinovskiy

May 16, 2016, 8:02:42 AM
to gensim
Hi Jonathan,

The training dictionary is different from the test dictionary.

Try using a HashDictionary instead.
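
For reference, a minimal sketch of this suggestion; `train_docs` and `test_docs` are hypothetical lists of tokenized documents, and the `id_range` value is just an assumption:

from gensim.corpora import HashDictionary

# HashDictionary maps each token to hash(token) % id_range, so the id=>word
# mapping is deterministic and identical for training and test documents.
hash_dict = HashDictionary(id_range=32000)  # pick id_range with your vocabulary size in mind
train_bow = [hash_dict.doc2bow(tokens) for tokens in train_docs]  # tokenized training documents
test_bow = [hash_dict.doc2bow(tokens) for tokens in test_docs]    # same ids come out, with no dictionary file to ship around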

Radim Řehůřek

May 17, 2016, 10:41:12 PM
to gensim
... or just re-use the same dictionary, inside your first MyCorpus.dictionary.

Basically, your model was trained using a specific id=>word mapping (dictionary), stored in MyCorpus.dictionary. Now when you want to apply the model to new documents, you must "process" those documents using the same procedure (same dictionary), otherwise you'll get an id/word mismatch.

Simply store the already-created dictionary like you're already doing -- though perhaps don't call it "deerwester.dict" :) then load it like you're already doing, then assign `corpus.dictionary = dictionary`. That means both your MyCorpus objects will be using the same id=>word mapping.
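
A minimal sketch of that fix (not literal code from this thread), assuming the MyCorpus class and paths from your scripts:

# In the second script, after MyCorpus has built its own (wrong) dictionary:
dictionary = corpora.Dictionary.load('/tmp/deerwester.dict')  # the dictionary saved during training
corpus = MyCorpus(top_directory)   # corpus over the full document set
corpus.dictionary = dictionary     # overwrite the freshly built dictionary with the training one

corpora.MmCorpus.serialize('/tmp/runCorpus.mm', corpus)  # now serialized with training ids
corpus = corpora.MmCorpus('/tmp/runCorpus.mm')
lda = gensim.models.LdaModel.load('my already trained model')
for bow in corpus:
    print(lda[bow])  # ids now match the model's vocabulary, so no IndexError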

Hope that helps,
Radim

Adam Cavendish

Jun 30, 2016, 9:59:33 AM
to gensim
Could you please be more specific about the assignment?

`corpus.dictionary = dictionary`

Isn't corpus a list of lists?

Lev Konstantinovskiy

Jun 30, 2016, 7:08:57 PM
to gensim
Hi Adam,

In general, anything iterable is a corpus in gensim, but in Jonathan's example the corpus object is an instance of the MyCorpus class, so the attribute assignment works.
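
For illustration (assumed names, not code from the thread), both of these are valid corpora, but only the object can carry the dictionary attribute Radim mentions:

corpus_as_list = [[(0, 1), (1, 2)], [(1, 1)]]  # a plain list of bag-of-words vectors is a corpus
corpus_as_object = MyCorpus('/path/to/docs')   # so is a streaming object with __iter__
corpus_as_object.dictionary = dictionary       # ...and, being an object, it can hold the shared dictionary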

Regards
Lev

Carlos Peña

Jul 14, 2016, 4:26:06 AM
to gensim
Hi, I have the same problem, but in my case I have the corpus as a list of lists:

from gensim import corpora
from gensim.models import ldamodel

texts = []  # list with texts to introduce in lda
documents = []
for text in texts:
    documents.append(get_separate_words(text))  # function splits a text into a list of words
dictionary = corpora.Dictionary(documents)
stop_list = getLines('Backend/NLP/spanish_stopwords.txt')  # list with stopwords
stop_ids = [dictionary.token2id[stop_word] for stop_word in stop_list if stop_word in dictionary.token2id]  # ids of stop words
dictionary.filter_tokens(stop_ids)

corpus = []
for doc in documents:
    corpus.append(dictionary.doc2bow(doc, allow_update=True))

lda = ldamodel.LdaModel(corpus=corpus, id2word=dictionary, num_topics=200, update_every=0, passes=20)

texts2 = []  # list with other texts to introduce in lda
documents2 = []
for text in texts2:
    documents2.append(get_separate_words(text))  # function splits a text into a list of words
dictionary2 = corpora.Dictionary(documents2)
stop_ids2 = [dictionary2.token2id[stop_word] for stop_word in stop_list if stop_word in dictionary2.token2id]  # ids of stop words
dictionary.filter_tokens(stop_ids2)

corpus2 = []
for doc in documents2:
    corpus2.append(dictionary.doc2bow(doc, allow_update=True))

lda.update(corpus2)

How can I solve this?

Thank you in advance.

Lev Konstantinovskiy

Aug 4, 2016, 5:42:45 PM
to gensim
Hi Carlos,

Please use `dictionary` instead of `dictionary2` to update the model. See this message for more information.
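
A minimal sketch of that fix, reusing the names from your code. Note that dropping `allow_update=True` here is an extra suggestion of mine: without it, words unseen at training time are simply ignored instead of being given new ids the trained model does not know about.

documents2 = [get_separate_words(text) for text in texts2]
corpus2 = [dictionary.doc2bow(doc) for doc in documents2]  # the ORIGINAL training dictionary, not dictionary2
lda.update(corpus2)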

Regards
Lev

Rabia Mehmood

Jan 11, 2018, 3:02:07 AM
to gensim
Hi Radim,
Can you please elaborate on your answer? What do you mean by "re-using the same dictionary inside MyCorpus.dictionary"?

Ivan Menshikh

Jan 11, 2018, 11:24:04 PM
to gensim
Hello Rabia,

Main "target" of dictionary - create mapping (id <-> word), for this reason, two different mappings shouldn't work for one model.
For this reason, you should use always only one dictionary for all stuff.

Rabia Mehmood

Jan 12, 2018, 1:17:01 PM
to gen...@googlegroups.com
Thanks for the reply, Ivan.
By using the same dictionary, you mean we should use the same dictionary for both training and testing? What I did in my training is:

    dictionary = corpora.HashDictionary(training_text)
    train_corpus = [dictionary.doc2bow(text) for text in training_text]
    topics = int(math.log2(len(trainTokens)))  # num_topics must be an integer
    trained_ldamodel = gensim.models.ldamodel.LdaModel(train_corpus, num_topics=topics, id2word=dictionary, iterations=50)
    trained_ldamodel.save(savedModel + ".model")

And in testing:
    testDic = corpora.Dictionary(testing_text)
    testCorpus = [testDic.doc2bow(text) for text in testing_text]
    trained_ldamodel = models.LdaModel.load(savedModel + ".model")
    testDoc = trained_ldamodel[testCorpus]

So, at which step can the same dictionary be used? I'm not sure whether I wrote the right code or not, but according to my understanding we need to prepare our dataset in doc2bow form to pass it to LDA, and creating a dictionary is the prerequisite step for creating the doc2bow representation.


Rabia Mehmood

Jan 12, 2018, 1:17:01 PM
to gen...@googlegroups.com
What I guessed from your reply is that I should save the dictionary from the training phase, and in the testing step load that saved dictionary into testDic instead of doing "testDic = corpora.Dictionary(testing_text)". Is that right?

Ivan Menshikh

Jan 14, 2018, 11:38:16 PM
to gensim
Yes, exactly.
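
Concretely, a minimal sketch using the names from your code (the ".dict" path is just an assumed name):

# Training: save the dictionary alongside the model.
dictionary.save(savedModel + ".dict")
trained_ldamodel.save(savedModel + ".model")

# Testing: load the SAME dictionary instead of building a new one from testing_text.
testDic = corpora.HashDictionary.load(savedModel + ".dict")  # same class that was saved above
testCorpus = [testDic.doc2bow(text) for text in testing_text]
trained_ldamodel = models.LdaModel.load(savedModel + ".model")
for bow in testCorpus:
    print(trained_ldamodel[bow])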

Rabia Mehmood

Jan 15, 2018, 2:52:45 AM
to gen...@googlegroups.com
Thanks.


Nitheen Rao T

Jan 23, 2018, 6:29:28 PM
to gensim
Hi Lev,

Could you please help with this question?

Vinc420

May 3, 2018, 5:44:08 AM
to gensim
Hi everyone,

I am facing the same error as raised above (IndexError: index 6614 is out of bounds for axis 1 with size 6614), but I am using the same dictionary from beginning to end.

As I have a big dataset, I am loading it chunk by chunk (using pickle.load). I am building the dictionary iteratively with this piece of code:
import pickle
from time import time
from gensim.corpora import Dictionary

fr_documents_lda = open("documents_lda_40_rails_30_ruby_full.dat", 'rb')
dictionary = Dictionary()
chunk_no = 0
while 1:
    try:
        t0 = time()
        documents_lda = pickle.load(fr_documents_lda)
        chunk_no += 1
        dictionary.add_documents(documents_lda)
        t1 = time()
        print("Chunk number {0} took {1:.2f}s".format(chunk_no, t1 - t0))
    except EOFError:
        print("Finished going through pickle")
        break

Once the dictionary is built for the whole dataset, I train the model in the same fashion, iteratively:

from gensim.models import LdaModel

fr_documents_lda = open("documents_lda_40_rails_30_ruby_full.dat", 'rb')
first_iter = True
chunk_no = 0
lda_gensim = None
while 1:
    try:
        t0 = time()
        documents_lda = pickle.load(fr_documents_lda)
        chunk_no += 1
        corpus = [dictionary.doc2bow(text) for text in documents_lda]
        if first_iter:
            first_iter = False
            lda_gensim = LdaModel(corpus, num_topics=no_topics, iterations=100, offset=50., random_state=0, alpha='auto')
        else:
            lda_gensim.update(corpus)
        t1 = time()
        print("Chunk number {0} took {1:.2f}s".format(chunk_no, t1 - t0))
    except EOFError:
        print("Finished going through pickle")
        break

I also tried updating the dictionary at every chunk, i.e. having

dictionary.add_documents(documents_lda)

right before

corpus = [dictionary.doc2bow(text) for text in documents_lda]

in the last piece of code. Finally, I tried setting the allow_update argument of doc2bow to True. Nothing works.

FYI, the size of my final dictionary is 85k, while the size of the dictionary built from only the first chunk is 10k. The error occurs on the second iteration, when execution reaches the else branch and calls the update method.

Does anyone have an idea how to fix this?

Thank you in advance.