Bug report: rounding error in gensim.models.word2vec.py


Matej P

Dec 9, 2018, 5:22:39 PM
to Gensim
Lines 464-468:

for word_index in range(vocab_size):
    cumulative += self.vocab[self.index2word[word_index]].count**power / train_words_pow
    self.cum_table[word_index] = round(cumulative * domain)
if len(self.cum_table) > 0:
    assert self.cum_table[-1] == domain


After the loop, cumulative is supposed to be exactly 1. However, if vocab_size is large enough, the rounding error accumulated by the repeated floating-point additions
can leave self.cum_table[-1] != domain, which triggers the AssertionError.

In my case, self.cum_table[-1] equals 2147483646 and domain equals 2147483647.
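
To illustrate the mechanism outside of gensim, here is a minimal standalone sketch. The function names and the plain list of counts are hypothetical, not gensim API. The first function mirrors the loop quoted above. The second is one possible way to make the last entry exact, by accumulating the raw powers and dividing only when an entry is stored; I am not claiming it is the fix applied in gensim.

```python
def cum_table_divide_each_step(counts, power=0.75, domain=2**31 - 1):
    """Mirrors the quoted loop: every term is normalized before it is added."""
    train_words_pow = 0.0
    for c in counts:
        train_words_pow += c ** power
    cumulative = 0.0
    table = []
    for c in counts:
        # Each division rounds, so the running sum may end slightly off 1.0.
        cumulative += c ** power / train_words_pow
        table.append(int(round(cumulative * domain)))
    return table


def cum_table_divide_at_end(counts, power=0.75, domain=2**31 - 1):
    """Hypothetical variant: accumulate raw powers, normalize when storing."""
    train_words_pow = 0.0
    for c in counts:
        train_words_pow += c ** power
    cumulative = 0.0
    table = []
    for c in counts:
        # Same additions in the same order as the loop above, so after the
        # last word cumulative is bit-for-bit equal to train_words_pow.
        cumulative += c ** power
        table.append(int(round(cumulative / train_words_pow * domain)))
    return table
```

In the second variant the final ratio is exactly 1.0, so the last entry equals domain regardless of vocabulary size. In the first, whether the last entry lands on domain depends on how the rounding errors happen to add up, which is why the problem tends to show only with very large vocabularies.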

Best regards, Matej



Gordon Mohr

Dec 10, 2018, 1:12:11 PM
to Gensim
Thanks for the report! This was observed before, and a fix was applied that we thought had resolved the issue (it resolved the one available test case). See: https://github.com/RaRe-Technologies/gensim/issues/865

Can you say more about the OS and the versions of Python & gensim in which you saw this? How large is your vocabulary, and, if requested, could you share the raw sequence of word tallies so that someone could reproduce the error elsewhere?

- Gordon

Matej P

Dec 11, 2018, 9:33:38 AM
to Gensim
Hi,

I am on Windows 10 and was forced to use gensim 0.12.4 with Python 2.7. With the suggested fix applied, the program gets past the assert line. I apologize for the inconvenience. Nevertheless, here are the count statistics, copied from the standard output:

collected 5837510 word types and 63995 unique tags from a corpus of 63995 examples and 280167305 words

Thanks for the solution and best regards,
Matej
