ldamodel divide by zero error


Kevin

Apr 8, 2015, 11:47:31 PM4/8/15
to gen...@googlegroups.com
I am not familiar with Python, and was hoping I might get a nudge in the right direction tracking down this error I am getting while trying to generate and evaluate an LdaModel.

warning (from warnings module):
  File "/Library/Python/2.7/site-packages/gensim-0.10.3-py2.7-macosx-10.10-intel.egg/gensim/models/ldamodel.py", line 474
    perwordbound = self.bound(chunk, subsample_ratio=subsample_ratio) / (subsample_ratio * corpus_words)
RuntimeWarning: divide by zero encountered in double_scalars


I am using the default parameters to LdaModel (chunksize=2000, etc.), so looking at the code it seems subsample_ratio can't be 0, based on this definition:
subsample_ratio = 1.0 * total_docs / len(chunk)

I assume this means "corpus_words" is somehow evaluating as 0, but with my near-zero knowledge of Python I haven't yet deciphered how this line could do that:

corpus_words = sum(cnt for document in chunk for _, cnt in document)

I am not used to the format of the looping statements, but my best guess is that this is the count of words across all docs in the current chunk. If that's the case, I would think it would have to encounter 2000 empty documents to cause this, which doesn't seem likely.
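For anyone else puzzled by the nested generator expression: it does exactly that, summing the term counts across every document in the chunk. A toy example (the sample `chunk` data is made up, but the summing expression is the one from ldamodel.py):

```python
# A gensim bag-of-words chunk: each document is a list of (term_id, count) pairs.
chunk = [
    [(0, 2), (3, 1)],   # doc 1: 3 words total
    [(1, 4)],           # doc 2: 4 words total
    [],                 # an empty document contributes nothing
]

# Same expression as in ldamodel.py: sum the counts over all docs in the chunk.
corpus_words = sum(cnt for document in chunk for _, cnt in document)
print(corpus_words)  # 7
```

So corpus_words is 0 only when every document in the chunk is empty (or the chunk itself is empty).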

Thanks

Kevin

Radim Řehůřek

Apr 9, 2015, 10:38:08 AM4/9/15
to gen...@googlegroups.com
Hello Kevin,

Please send your log at DEBUG level (instructions for turning on logging here: http://radimrehurek.com/gensim/tut1.html).

It should be clearer what's happening then.

Best,
Radim

Kevin

Apr 12, 2015, 7:44:55 PM4/12/15
to gen...@googlegroups.com
Sorry, that was an obvious thing I should have collected.  The logging is quite helpful.

I am thinking it is related to my use of tfidf weights with a high level of precision, while gensim uses 3 digits of precision, so the small values were coming out as 0 in the LDA model. Regardless, the strange thing is that when I ran with logging on, it did not error with the same data.

I did see messages like the following, which make me think the small values were somewhere getting summed up to 0:

INFO : -inf per-word bound, inf perplexity estimate based on a held-out corpus of 2000 documents with 0 words
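That log line matches the arithmetic in ldamodel.py: with zero words in the chunk, the per-word bound divides by zero and comes out as -inf. A toy reconstruction (the bound value here is made up; only the division mirrors the gensim line):

```python
import numpy as np

bound = np.float64(-1234.5)   # some finite bound value (hypothetical)
corpus_words = 0              # "2000 documents with 0 words"

# Mirrors: perwordbound = bound / (subsample_ratio * corpus_words)
with np.errstate(divide='ignore'):
    perwordbound = bound / (1.0 * corpus_words)

print(perwordbound)  # -inf, hence the "inf perplexity estimate"
```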

However, since it didn't error, I am able to focus on seeing how the models come out instead of figuring out how I was causing the error. Thanks for the tip!

Kevin

David Cambronero

Jun 15, 2015, 4:39:34 PM6/15/15
to gen...@googlegroups.com
Hey Kevin, 

I am having this same error. How do I solve it?

Thank you,
David.

Radim Řehůřek

Jun 16, 2015, 6:30:53 AM6/16/15
to gen...@googlegroups.com, blais...@gmail.com
On Monday, April 13, 2015 at 1:44:55 AM UTC+2, Kevin wrote:
Sorry, that was an obvious thing I should have collected.  The logging is quite helpful.

I am thinking it is related to my use of tfidf weights with a high level of precision, while gensim uses 3 digits of precision, so the small values were coming out as 0 in the LDA model. Regardless, the strange thing is that when I ran with logging on, it did not error with the same data.

I did see messages like the following, which make me think the small values were somewhere getting summed up to 0:

INFO : -inf per-word bound, inf perplexity estimate based on a held-out corpus of 2000 documents with 0 words

^^ the "2000 documents with 0 words" is a good clue here.

For some reason, your corpus is empty (zero words in a given chunk of 2000 documents), which leads to some pathological behaviour later on.

I'd suggest checking your input -- print it / log it, even without gensim. Seeing what's going through your data pipeline, as a sort of "human eyeballing sanity check", is good practice anyway. And maybe filter out empty documents -- they can't affect LDA training anyway.
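Filtering out the empty documents is a one-liner. A minimal sketch, assuming your corpus is already in bag-of-words form (lists of (term_id, count) pairs, e.g. from Dictionary.doc2bow); the sample data is made up:

```python
# bow_corpus: an iterable of bag-of-words documents.
bow_corpus = [
    [(0, 1), (2, 3)],
    [],                # empty document -- contributes nothing to LDA
    [(1, 2)],
]

# Keep only documents that contain at least one term.
filtered = [doc for doc in bow_corpus if len(doc) > 0]
print(len(filtered))  # 2
```

If the corpus is streamed rather than held in memory, the same condition works inside a generator.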

HTH,
Radim