Integer too large error while indexing 4 million documents

Alon Jacovi

Dec 27, 2018, 5:00:02 AM

to Whoosh

Hi, I'm following up on this StackOverflow question: https://stackoverflow.com/questions/53937669/integer-too-large-error-with-vectoring-during-whoosh-indexing

I was advised to ask here because some people on this forum have managed to index a large number of documents.

This is the error:

Traceback (most recent call last):
  File "...", line 256, in <module>
    ...
  File "/home/nlp/*/anaconda3/envs/riken/lib/python3.6/site-packages/whoosh/writing.py", line 771, in add_document
    perdocwriter.add_vector_items(fieldname, field, vitems)
  File "/home/nlp/*/anaconda3/envs/riken/lib/python3.6/site-packages/whoosh/codec/whoosh3.py", line 244, in add_vector_items
    self.add_column_value(vecfield, VECTOR_COLUMN, offset)
  File "/home/nlp/*/anaconda3/envs/riken/lib/python3.6/site-packages/whoosh/codec/base.py", line 821, in add_column_value
    self._get_column(fieldname).add(self._docnum, value)
  File "/home/nlp/*/anaconda3/envs/riken/lib/python3.6/site-packages/whoosh/columns.py", line 678, in add
    self._dbfile.write(self._pack(v))
struct.error: 'I' format requires 0 <= number <= 4294967295
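
For anyone trying to reproduce this, the exception itself can be triggered outside Whoosh. The following is a minimal sketch, not Whoosh's own code: the helper name `fits_uint32` is hypothetical, and the `<` byte-order prefix is used only to force the standard 4-byte size for `I`.

```python
import struct

# Largest value an unsigned 32-bit field can hold (matches the bound in the error).
UINT32_MAX = 2**32 - 1


def fits_uint32(value):
    """Return True if value can be packed as an unsigned 32-bit integer."""
    try:
        # "<I" = little-endian, standard-size (4-byte) unsigned int.
        struct.pack("<I", value)
        return True
    except struct.error:
        # Raised for any value outside 0..4294967295.
        return False


print(fits_uint32(UINT32_MAX))      # True
print(fits_uint32(UINT32_MAX + 1))  # False
```

Whatever value is being written at the failing call has grown past 4294967295, which is all the traceback establishes on its own.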

Relevant git issue: https://bitbucket.org/mchaput/whoosh/issues/460/overflowerror-when-writer-adds-document

From the StackOverflow question:

"It looks like the field that is used as a document index is only designed to be a 32-bit unsigned int, which gives you a limit of roughly 4M documents."

If you have a way of circumventing or fixing this problem, help is appreciated.

clach04

Dec 27, 2018, 2:01:27 PM

to Whoosh

On Thursday, December 27, 2018 at 2:00:02 AM UTC-8, Alon Jacovi wrote:

...struct.error: 'I' format requires 0 <= number <= 4294967295

Relevant git issue: https://bitbucket.org/mchaput/whoosh/issues/460/overflowerror-when-writer-adds-document

From the StackOverflow question:

"It looks like the field that is used as a document index is only designed to be a 32-bit unsigned int, which gives you a limit of roughly 4M documents."

The maximum unsigned 32-bit value is about 4 billion, not 4 million. Are you seeing a limit of 4 million documents or 4 billion?

Chris

Alon Jacovi

Dec 27, 2018, 5:38:37 PM

to Whoosh

My script crashed at around the 4 million document mark; I just pasted the quote from the StackOverflow question. Note that the error comes from the vector method, so it is probably related to the fact that each document has a text field (about one paragraph long) with vector=True.
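
A rough back-of-the-envelope check of that guess, as a sketch only. It assumes the `offset` passed to `add_column_value` in the traceback is a byte position in the stored vector data, which this thread does not confirm, and it uses the approximate 4 million figure from above.

```python
# First value that no longer fits in an unsigned 32-bit field.
UINT32_LIMIT = 2**32

# Approximate number of documents indexed when the crash occurred.
DOCS_AT_CRASH = 4_000_000

# Average vector bytes per document that would make a running byte
# offset reach the 32-bit limit at that document count.
avg_bytes_per_doc = UINT32_LIMIT / DOCS_AT_CRASH

print(round(avg_bytes_per_doc))  # 1074
```

Roughly 1 KB of vector data per document is plausible for a paragraph-length field, so under that assumption an overflowing offset would explain a crash at about 4 million documents rather than 4 billion.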
