Stuck getting topic to document matrix from trained model


Matthias Eickhoff

Nov 18, 2016, 6:17:55 AM
to gensim
Hi,

I'm trying to get the topic assignments for all documents in my corpus.
However, I get stuck at "random" documents without any error.

I'm using this function to get the topic assignments (which works fine for some of my corpora, but not all):

  
def get_doc_topic(corpus, model):
    doc_topic = list()
    for doc in tqdm(corpus):
        doc_topic.append(model.get_document_topics(doc))
    doc_topic = [dict(i) for i in doc_topic]
    doc_topic = pd.DataFrame(doc_topic)
    doc_topic.fillna(value=0, inplace=True)
    return doc_topic

Note: This is the same corpus that was used to train the model (LdaMulticore).
The corpus contains ~250,000 documents. I have tried removing the tqdm(), which just adds a progress bar to the loop, no changes.



I have two questions about this:

1. Is looping over the corpus the preferred way to do this or is there a better way?

2. What might cause this to get stuck?
I would assume it can't depend on the content of a document (only works with the word IDs?) and any document that was fine during training should just work here?
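For context on question 1, the dict-to-DataFrame step in the function above can be illustrated with toy data standing in for the output of model.get_document_topics(); this is only a sketch of the pattern, not gensim API:

```python
import pandas as pd

# Toy stand-in for the sparse output of model.get_document_topics(doc):
# one list of (topic_id, probability) pairs per document.
doc_topics = [
    [(0, 0.9), (2, 0.1)],
    [(1, 0.5), (2, 0.5)],
    [(0, 0.3), (1, 0.7)],
]

# Same pattern as get_doc_topic() above: dicts -> DataFrame -> zero-fill
# the topics a document has no probability mass on.
doc_topic = pd.DataFrame([dict(pairs) for pairs in doc_topics])
doc_topic.fillna(value=0, inplace=True)
```

Each row is a document, each column a topic id, and topics absent from a document end up as 0 instead of NaN.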



Best regards
Matthias

Lev Konstantinovskiy

Nov 24, 2016, 6:27:56 AM
to gensim
Hi Matthias,

It shouldn't hang without any output - you should see at least an error. It is probably just taking longer to infer than usual. What is the iterations parameter that you use? The default is 50.

If you change the logging level to debug, you will see at which step it hangs.

Looking forward to the debug log,
Lev

Matthias Eickhoff

Nov 24, 2016, 7:34:51 AM
to gensim
Hi Lev,

I am using 800 iterations and training the models using this function (for now; I plan on making the passes a function of the document count to make this more dynamic):

def train_lda_model(corpus, ntops):
    no_docs = len(corpus)
    if no_docs > 10000:
        print "Training model, > 10k docs"
        lda = gensim.models.ldamulticore.LdaMulticore(corpus=corpus,
                                                      num_topics=ntops,  # use the ntops argument, not a hard-coded count
                                                      id2word=corpus.dictionary,
                                                      workers=3,
                                                      iterations=800,
                                                      chunksize=1000)
    else:
        print "Training model, less than 10k docs"
        lda = gensim.models.ldamulticore.LdaMulticore(corpus=corpus,
                                                      num_topics=ntops,
                                                      id2word=corpus.dictionary,
                                                      workers=3,
                                                      iterations=800,
                                                      chunksize=1000,
                                                      passes=10)
    return lda



The loop over the corpus normally runs at 130 iterations (documents) per second. For the documents it gets "stuck" at (you are probably right, there was still CPU utilization on one core) I had it running all night. I suppose an easy fix would be to limit the execution of each loop iteration to something like 100x its average and return an NA result if this condition is violated. Of course, actually fixing it would be ideal.
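The per-iteration time limit described here could be sketched with a SIGALRM-based guard; this is a generic illustration (the name with_timeout is made up, nothing gensim-specific), and it only works on Unix in the main thread:

```python
import signal
import time

class LoopTimeout(Exception):
    pass

def with_timeout(func, seconds, default=None):
    """Run func(); return `default` if it runs longer than `seconds` seconds.
    Uses SIGALRM, so this is Unix-only and main-thread-only."""
    def handler(signum, frame):
        raise LoopTimeout()
    old_handler = signal.signal(signal.SIGALRM, handler)
    signal.alarm(seconds)
    try:
        return func()
    except LoopTimeout:
        return default  # stand-in for the "NA result" fallback
    finally:
        signal.alarm(0)  # cancel any pending alarm
        signal.signal(signal.SIGALRM, old_handler)

fast = with_timeout(lambda: 42, seconds=2)
slow = with_timeout(lambda: time.sleep(5), seconds=1, default="timed out")
```

In the loop over the corpus, model.get_document_topics(doc) would take the place of the lambda, with the per-document average timing used to set the limit.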


I will try to recreate the problem with a debug log over the weekend (the workstation I use for this is currently in use).

Thank you for your response
Matthias

Lev Konstantinovskiy

Nov 24, 2016, 12:50:38 PM
to gensim
Hi Matthias,

The code stops iterating if there is no convergence after 800 iterations.

Regards
Lev

Matthias Eickhoff

Nov 25, 2016, 9:49:47 AM
to gensim
Hi Lev,

You mean it stops iterating and returns whatever (non-converged) gamma it currently has, right?
This would be my reading of the inference method.

Also, I have done some more testing and this problem does not occur each time I run my code on the documents.
For example, I have just tried this on a model for ~260k documents and it ran fine the second time I tried it. However, the first try stalled. 

My understanding now would be that this is either just really "bad luck" in a single e-step iteration, or that the full 800 iterations take several orders of magnitude longer for one document than for all others?


Regarding logs: 
I don't think my code generates any debug-level logging, because the logging.debug() calls only happen when chunks are submitted, but I loop over individual documents?

Best regards
Matthias 

Lev Konstantinovskiy

Nov 28, 2016, 10:02:16 AM
to gensim
Hi Matthias,

It does return whatever gamma it has once the number of iterations is exceeded.

The parameter chunksize=1000 splits the documents into chunks, so you should see debug messages printed. Is the logging configured to be at debug level?
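For reference, splitting a document stream into chunks of chunksize can be sketched with the standard library alone (iter_chunks is a hypothetical helper for illustration, not the gensim internal):

```python
from itertools import islice

def iter_chunks(stream, chunksize):
    """Yield successive lists of up to `chunksize` items from any iterable."""
    it = iter(stream)
    while True:
        chunk = list(islice(it, chunksize))
        if not chunk:
            return
        yield chunk

# Ten "documents" grouped into chunks of four:
chunks = list(iter_chunks(range(10), 4))
# chunks == [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9]]
```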


Regards
Lev

Matthias Eickhoff

Nov 28, 2016, 10:22:40 AM
to gensim
Hi Lev,

I was under the impression that the chunksize is only relevant when training the model, not when using the trained model's get_document_topics method?
Like I wrote, training works just fine on the same documents for which I would like to get the topic-to-document assignments afterwards.

I do get debug messages during training, but not when looping over a corpus using the get_document_topics method.
This seems to be intended, because I submit individual documents, not chunks?

Here is my complete script, which reads in models and corpora, and is supposed to get the topic to document assignments for each corpus and convert it to a pd.DataFrame (lines 34 to 45 should be most relevant):


#!/usr/bin/env python2
# -*- coding: utf-8 -*-
"""
Created on Thu Nov 24 13:04:30 2016

@author: matthias
"""

import gensim 
import os
import pandas as pd
#import glob
from tqdm import tqdm
import logging

# Set up logging
logging.basicConfig(level=logging.DEBUG)
logFormatter = logging.Formatter("%(asctime)s [%(threadName)-12.12s] [%(levelname)-5.5s]  %(message)s")
rootLogger = logging.getLogger()

# Add console handler for printing
consoleHandler = logging.StreamHandler()
consoleHandler.setFormatter(logFormatter)
rootLogger.addHandler(consoleHandler)


def listdir_fullpath(d):
    return [os.path.join(d, f) for f in os.listdir(d)]    

#-----------------------------------
# This should be the relevant part.
#-----------------------------------
def get_doc_topic(corpus, model):
    doc_topic = list()
    logging.info("Getting topic to document matrix")
    logging.info("Applying model to documents...")
    for doc in tqdm(corpus): # Sometimes gets stuck on a single doc in this loop
        doc_topic.append(model.get_document_topics(doc))
        
    doc_topic = [dict(i) for i in doc_topic]
                 
    doc_topic = pd.DataFrame(doc_topic)
    doc_topic.fillna(value=0, inplace=True)
    return doc_topic 
#-----------------------------------
# End of relevant part
#-----------------------------------
    



def get_doc_topic_one_model_dir(comp_model_dir):
    
    fileHandler = logging.FileHandler("{0}/{1}.log".format(comp_model_dir, 'Topic_Doc_All.log'))
    fileHandler.setFormatter(logFormatter)
    rootLogger.addHandler(fileHandler)
    
    
    logging.info("Running on: \n {}".format(comp_model_dir))
    event_corpus = comp_model_dir + "/Corpus_Event.mm"
    event_model  = comp_model_dir + "/Model_Event.gensim"
    pre_corpus   = comp_model_dir + "/Corpus_Pre.mm"
    pre_model    = comp_model_dir + "/Model_Pre.gensim"
    
    logging.info("Reading Corpus Event")
    event_corpus = gensim.corpora.MmCorpus(event_corpus)
    logging.info("Reading Model Event")
    event_model = gensim.models.LdaMulticore.load(event_model)
    doc_topic_event = get_doc_topic(event_corpus, event_model)
    doc_topic_event.to_excel(comp_model_dir + "/Doc_Top_Event.xlsx", index=False)
    
    logging.info("Reading Corpus Pre")
    pre_corpus = gensim.corpora.MmCorpus(pre_corpus)
    logging.info("Reading Model Pre")
    pre_model = gensim.models.LdaMulticore.load(pre_model)
    doc_topic_pre = get_doc_topic(pre_corpus, pre_model)
    doc_topic_pre.to_excel(comp_model_dir + "/Doc_Top_Pre.xlsx", index=False)

if __name__ == "__main__":
    models_main = "/path/to/my/project/Models"
    model_dirs = listdir_fullpath(models_main)
    model_dirs = [i for i in model_dirs if os.path.isdir(i)]
    for m_dir in model_dirs:
        get_doc_topic_one_model_dir(m_dir)


Lev Konstantinovskiy

Nov 28, 2016, 6:01:37 PM
to gensim
Hi Matthias,

The same inference code is called during training and estimation, so you should see the same messages. What is the last message you see before the hang?

Regards
Lev

Kenneth Orton

Nov 29, 2016, 1:13:36 AM
to gensim
I think there is a mismatch between the input corpus you are using to infer topics and the LDA model.

You need to use the LDA model that was trained on the same corpus if you want to infer topics from that corpus, because the corpus is in vector format, and in order to infer topics the LDA model needs to know what those vectors map to.

Here is what I get when I use a Wikipedia corpus and a model trained with a corpus from Tweets:
2016-11-28 22:40:31,150 : INFO : loaded corpus index from data/wiki_corpus.mm.index
2016-11-28 22:40:31,150 : INFO : initializing corpus reader from data/wiki_corpus.mm
2016-11-28 22:40:31,151 : INFO : accepted corpus with 1319343 documents, 170000 features, 435221157 non-zero entries
2016-11-28 22:40:31,151 : INFO : loading LdaModel object from data/tweets_25_lem_5_pass.model
2016-11-28 22:40:31,572 : INFO : loading id2word recursively from data/tweets_25_lem_5_pass.model.id2word.* with mmap=None
2016-11-28 22:40:31,572 : INFO : setting ignored attribute state to None
2016-11-28 22:40:31,572 : INFO : setting ignored attribute dispatcher to None
2016-11-28 22:40:31,573 : INFO : loading LdaModel object from data/tweets_100_lem_5_pass.model.state
0it [00:00, ?it/s]Traceback (most recent call last):
  File "test.py", line 28, in <module>
    topics = [item for item in tqdm(get_doc_topics())]
  File "/usr/local/lib/python2.7/dist-packages/tqdm/_tqdm.py", line 816, in __iter__
    for obj in iterable:
  File "test.py", line 14, in get_doc_topics
    yield dict(lda.get_document_topics(doc))
  File "/usr/local/lib/python2.7/dist-packages/gensim/models/ldamodel.py", line 910, in get_document_topics
    gamma, phis = self.inference([bow], collect_sstats=True)
  File "/usr/local/lib/python2.7/dist-packages/gensim/models/ldamodel.py", line 433, in inference
    expElogbetad = self.expElogbeta[:, ids]
IndexError: index 100225 is out of bounds for axis 1 with size 100000


When I use the corpus of Tweets it is successful: 
2016-11-28 22:54:35,954 : INFO : initializing corpus reader from data/tweets.mm
2016-11-28 22:54:35,954 : INFO : accepted corpus with 40537 documents, 100000 features, 74069118 non-zero entries
2016-11-28 22:54:35,954 : INFO : loading LdaModel object from data/tweets_100_lem_5_pass.model
2016-11-28 22:54:36,092 : INFO : loading id2word recursively from data/tweets_100_lem_5_pass.model.id2word.* with mmap=None
2016-11-28 22:54:36,092 : INFO : setting ignored attribute state to None
2016-11-28 22:54:36,092 : INFO : setting ignored attribute dispatcher to None
2016-11-28 22:54:36,093 : INFO : loading LdaModel object from data/tweets_100_lem_5_pass.model.state
12325it [05:02, 12.53it/s]

When I use the correct model trained with Wikipedia corpus it is also successful:
2016-11-28 22:35:30,597 : INFO : loaded corpus index from data/wiki_corpus.mm.index
2016-11-28 22:35:30,597 : INFO : initializing corpus reader from data/wiki_corpus.mm
2016-11-28 22:35:30,597 : INFO : accepted corpus with 1319343 documents, 170000 features, 435221157 non-zero entries
2016-11-28 22:35:30,597 : INFO : loading LdaModel object from data/lda_100_lem_5_pass.model
2016-11-28 22:35:30,964 : INFO : loading id2word recursively from data/lda_100_lem_5_pass.model.id2word.* with mmap=None
2016-11-28 22:35:30,964 : INFO : loading expElogbeta from data/lda_100_lem_5_pass.model.expElogbeta.npy with mmap=None
2016-11-28 22:35:31,373 : INFO : setting ignored attribute state to None
2016-11-28 22:35:31,373 : INFO : setting ignored attribute dispatcher to None
2016-11-28 22:35:31,373 : INFO : loading LdaModel object from data/lda_100_lem_5_pass.model.state
2016-11-28 22:35:31,374 : INFO : loading sstats from data/lda_100_lem_5_pass.model.state.sstats.npy with mmap=None
25it [00:05,  5.13it/s]

Matthias Eickhoff

Dec 6, 2016, 6:06:20 AM
to gensim
Hi Lev,


Sorry for letting this slip, deadlines.... 

I am not getting any debug output during this stage; I think it is because all the debug messages are wrapped as follows:

# From self.inference (line 433):
if len(chunk) > 1:
    logger.debug("performing inference on a chunk of %i documents", len(chunk))

Since I am looping over the corpus bow for bow, this does not get called. I may well be wrong though.
That aside, the last output I get is the progress bar of the loop over the corpus (I have tried to get rid of the tqdm(), no change). 


I have since run my code on all my corpora successfully.
While the error persists, it is so rare (it happens in less than 1 in 1M documents) that I can circumvent it by simply rerunning the code.

I would be happy to continue to debug this though, but I am at a loss how to do it. To reiterate:

1. It seems to get "stuck" randomly.
2. Re-running the same code "fixes" it, though it might get stuck somewhere else.

Maybe this could be an upstream problem? Some freak error in the BLAS or numpy?
I have also tried testing my RAM, seems intact. 

@ Kenneth: 
I am using the same corpus the model was trained on. As noted, it ...mostly... works.
Also, if there was a problem with the word mappings I think I should get the out of bounds error you posted.
I had the same problem without saving and loading the model and corpus, so I don't think it could be related to that. 

Best regards
Matthias

Radim Řehůřek

Dec 29, 2016, 1:24:00 AM
to gensim
Hi Matthias,

sorry for the delay -- did you figure this out?

Intermittent hanging seems weird indeed.

My first debugging step would be to try to make the hangs fully reproducible. From the gensim POV, this means setting a fixed seed for the numpy + random modules, using the exact same training data, and using only a single process (worker).

If it still hangs in different places, the problem is elsewhere (your input iterator, or even hardware...).
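Pinning the seeds Radim mentions can be sketched like this (module-level seeds only; whether this gensim version's LDA constructor also accepts a seed parameter is not assumed here):

```python
import random
import numpy as np

def seed_everything(seed):
    # Pin both RNGs that the inference code may draw from.
    random.seed(seed)
    np.random.seed(seed)

seed_everything(0)
first = np.random.rand(3)

seed_everything(0)
second = np.random.rand(3)
# identical draws from the same seed -> the run is reproducible
```

With the seeds fixed, the same data, and a single worker, a genuine gensim hang should reproduce at the same document every run.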

By the way, when you interrupt the process with CTRL-C, what is the printed stack trace? What line exactly does it hang on?

Radim