Get "IOError: 1186398300 requested and 649239532 written" error

Fatemeh Lashkari

Apr 3, 2017, 8:31:13 AM

to gensim

I tried to build a model for a 7.8 GB corpus, and I got this error when I tried to save the model.

I create the model this way:

model = Doc2Vec(allDocs, size = 300, window = 5, min_count = 2, workers = 4)
model.save(mPath+'/my_model.doc2vec')

Traceback (most recent call last):
  File "buildModelIterator.py", line 71, in <module>
    model.save(mPath+'/my_model.doc2vec')
  File "/usr/local/lib/python2.7/dist-packages/gensim-0.13.3-py2.7-linux-x86_64.egg/gensim/models/word2vec.py", line 1756, in save
    super(Word2Vec, self).save(*args, **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/gensim-0.13.3-py2.7-linux-x86_64.egg/gensim/utils.py", line 479, in save
    pickle_protocol=pickle_protocol)
  File "/usr/local/lib/python2.7/dist-packages/gensim-0.13.3-py2.7-linux-x86_64.egg/gensim/utils.py", line 349, in _smart_save
    compress, subname)
  File "/usr/local/lib/python2.7/dist-packages/gensim-0.13.3-py2.7-linux-x86_64.egg/gensim/utils.py", line 406, in _save_specials
    numpy.save(subname(fname, attrib), numpy.ascontiguousarray(val))
  File "/usr/local/lib/python2.7/dist-packages/numpy/lib/npyio.py", line 491, in save
    pickle_kwargs=pickle_kwargs)
  File "/usr/local/lib/python2.7/dist-packages/numpy/lib/format.py", line 584, in write_array
    array.tofile(fp)
IOError: 1186398300 requested and 649239532 written

How can I solve it?

Gordon Mohr

Apr 3, 2017, 4:47:00 PM

to gensim

Are you sure there is enough free space on the target disk volume?
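
A quick way to check is to compare the free space on the target volume with the size you expect to write. This is only a sketch: `has_room` is a hypothetical helper (not part of gensim), the 5 GB figure is a made-up example, and `shutil.disk_usage` needs Python 3.3 or later, so it will not run on the Python 2.7 shown in the traceback.

```python
import shutil

def has_room(directory, needed_bytes):
    """Return True if the volume holding `directory` has at least `needed_bytes` free."""
    free_bytes = shutil.disk_usage(directory).free
    return free_bytes >= needed_bytes

# Hypothetical example: is there room for about 5 GB in the current directory?
print(has_room(".", 5 * 1000**3))
```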

- Gordon

Fatemeh Lashkari

Apr 3, 2017, 5:10:52 PM

to gen...@googlegroups.com

How can I estimate how much free space I need to save my model?

Best Regards,

Fatemeh

--
You received this message because you are subscribed to a topic in the Google Groups "gensim" group.
To unsubscribe from this topic, visit https://groups.google.com/d/topic/gensim/Zczd7yvwKQM/unsubscribe.
To unsubscribe from this group and all its topics, send an email to gensim+unsubscribe@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Gordon Mohr

Apr 3, 2017, 6:00:24 PM

to gensim

If you've enabled INFO level logging, during model setup there will be a rough estimate of needed RAM, and storage-on-disk is roughly similar in size.
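
INFO logging can be switched on with the standard `logging` module before the model is built. A minimal sketch; the format string is just one common choice, not something gensim requires:

```python
import logging

# Configure the root logger to print timestamped messages.
logging.basicConfig(
    format="%(asctime)s : %(levelname)s : %(message)s",
    level=logging.INFO,
)
# basicConfig does nothing if logging was already configured elsewhere,
# so set the level explicitly as well.
logging.getLogger().setLevel(logging.INFO)
```

Run this before constructing the model so that the setup and save messages are printed.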

The dominant sources of model size are:

word-vectors: (vocabulary-count) x (dimensions) x (4 bytes/float) # aka `model.wv.syn0`

out-weight-vectors: (vocabulary-count) x (dimensions) x (4 bytes/float) # aka `model.syn1` or `model.syn1neg`

doc-vectors: (doctag-vector-count) x (dimensions) x (4 bytes/float) # aka `model.docvecs.doctag_syn0`

word-dictionary: pickle-size of model.wv.vocab
doctag-dictionary: pickle-size of model.docvecs.doctags (nothing if using plain-int doctags; large if using many string doctags)
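
As a rough worked example of the three array terms above (the two pickled dictionaries are left out), here is a small sketch. The function name and the counts are hypothetical, chosen only to illustrate the arithmetic:

```python
BYTES_PER_FLOAT32 = 4

def estimate_array_bytes(vocab_count, doctag_count, dimensions):
    """Sum the sizes of the word-vector, out-weight and doc-vector arrays."""
    word_vectors = vocab_count * dimensions * BYTES_PER_FLOAT32
    out_weights = vocab_count * dimensions * BYTES_PER_FLOAT32
    doc_vectors = doctag_count * dimensions * BYTES_PER_FLOAT32
    return word_vectors + out_weights + doc_vectors

# Hypothetical counts: 1,000,000 words, 500,000 doctags, 300 dimensions.
total = estimate_array_bytes(1000000, 500000, 300)
print("%.2f GB" % (total / 1e9))  # prints "3.00 GB"
```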

- Gordon

Fatemeh Lashkari

Apr 4, 2017, 9:04:01 AM

to gen...@googlegroups.com

Thanks Gordon. When I build the model with my code, I only get the files `my_model.doc2vec.syn0.npy`, `my_model.doc2vec.syn1.npy` and `my_model.doc2vec`. How can I build the model so that the other files are created too?

Best Regards,

Fatemeh

Gordon Mohr

Apr 4, 2017, 2:11:09 PM

to gensim

A single call to save should save exactly the files it needs to; there's no need for more files (or other steps) unless there's an error. (The files ending `.npy` are arrays that are saved aside from the model-object itself, and should be kept alongside the main file for later reloading.)
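
To see which files belong together, something like the following can list the main file plus any `.npy` companions. `model_files` is a hypothetical helper, not a gensim function, and it assumes the companions are named `<main file>.<attribute>.npy`, as in the filenames mentioned in this thread. It uses `glob.escape`, which needs Python 3.4 or later.

```python
import glob
import os

def model_files(model_path):
    """Return the main model file plus any sibling .npy arrays saved next to it."""
    pattern = glob.escape(model_path) + ".*.npy"
    companions = sorted(glob.glob(pattern))
    main = [model_path] if os.path.exists(model_path) else []
    return main + companions

# Print each file with its size in bytes (prints nothing if the model is absent).
for path in model_files("my_model.doc2vec"):
    print(path, os.path.getsize(path))
```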

Is there still an error? How much space is available? What does INFO logging show as the estimated model size, and then progress/success/failure at the time of `save()`?

Also, are you using a recent gensim version? (The filename `my_model.doc2vec.syn1.npy` implies your model is using hierarchical-softmax mode, `hs=1, negative=0`, whereas the default for quite a while has instead been `negative=5, hs=0`.)

- Gordon

Fatemeh Lashkari

Apr 5, 2017, 12:17:31 PM

to gensim

I have not tried building that model again, because I first want to estimate the required space correctly.

This is part of the INFO log from building a model on a smaller input:

2017-04-04 23:53:09,388 : INFO : collecting all words and their counts
2017-04-04 23:53:09,388 : INFO : PROGRESS: at example #0, processed 0 words (0/s), 0 word types, 0 tags
2017-04-04 23:53:10,637 : INFO : PROGRESS: at example #10000, processed 4658741 words (3731303/s), 172162 word types, 10001 tags
2017-04-04 23:53:12,010 : INFO : PROGRESS: at example #20000, processed 9705612 words (3675846/s), 304250 word types, 20001 tags
2017-04-04 23:53:13,383 : INFO : PROGRESS: at example #30000, processed 14721694 words (3654145/s), 374764 word types, 30000 tags
2017-04-04 23:53:14,881 : INFO : PROGRESS: at example #40000, processed 20261251 words (3700028/s), 480911 word types, 40000 tags
2017-04-04 23:53:16,137 : INFO : PROGRESS: at example #50000, processed 24833526 words (3639378/s), 538055 word types, 49999 tags
2017-04-04 23:53:17,601 : INFO : PROGRESS: at example #60000, processed 30236474 words (3692793/s), 624805 word types, 59999 tags

....

.....

2017-04-05 06:22:54,712 : INFO : PROGRESS: at 99.98% examples, 251324 words/s, in_qsize 8, out_qsize 0
2017-04-05 06:22:55,779 : INFO : PROGRESS: at 99.98% examples, 251323 words/s, in_qsize 8, out_qsize 0
2017-04-05 06:22:56,823 : INFO : PROGRESS: at 99.99% examples, 251323 words/s, in_qsize 7, out_qsize 0
2017-04-05 06:22:57,831 : INFO : PROGRESS: at 99.99% examples, 251323 words/s, in_qsize 8, out_qsize 0
2017-04-05 06:22:58,858 : INFO : PROGRESS: at 100.00% examples, 251322 words/s, in_qsize 8, out_qsize 0
2017-04-05 06:22:59,629 : INFO : worker thread finished; awaiting finish of 3 more threads
2017-04-05 06:22:59,641 : INFO : worker thread finished; awaiting finish of 2 more threads
2017-04-05 06:22:59,664 : INFO : worker thread finished; awaiting finish of 1 more threads
2017-04-05 06:22:59,665 : INFO : worker thread finished; awaiting finish of 0 more threads
2017-04-05 06:22:59,665 : INFO : training on 5692755195 raw words (5678892465 effective words) took 22596.0s, 251323 effective words/s
2017-04-05 06:22:59,665 : INFO : saving Doc2Vec object under /home/fatemeh/Step2/input-output/finalDocs/model/my_model2.doc2vec, separately None
2017-04-05 06:22:59,665 : INFO : storing numpy array 'doctag_syn0' to /home/fatemeh/Step2/input-output/finalDocs/model/my_model2.doc2vec.docvecs.doctag_syn0.npy
2017-04-05 06:23:02,117 : INFO : not storing attribute syn0norm
2017-04-05 06:23:02,117 : INFO : not storing attribute cum_table
2017-04-05 06:23:02,117 : INFO : storing numpy array 'syn0' to /home/fatemeh/Step2/input-output/finalDocs/model/my_model2.doc2vec.syn0.npy
2017-04-05 06:23:07,322 : INFO : storing numpy array 'syn1' to /home/fatemeh/Step2/input-output/finalDocs/model/my_model2.doc2vec.syn1.npy
2017-04-05 06:25:29,245 : INFO : saved /home/fatemeh/Step2/input-output/finalDocs/model/my_model2.doc2vec

I do not know which gensim version is installed on my computer. How can I find that out?

Gordon Mohr

Apr 5, 2017, 12:47:45 PM

to gensim

The model's logging of an estimated size should have happened in the range clipped out with "..." – between the initial "collecting all words" and beginning of training.

If you're using a virtual environment (which is highly recommended), and `pip` or similar to install gensim, then the command `pip freeze` from the command-line will print all installed packages and their versions, including gensim.
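
The version can also be read from inside Python. A small sketch, assuming Python 3.8 or later for `importlib.metadata` (so not the Python 2.7 shown in the traceback); `installed_version` is a hypothetical helper name:

```python
from importlib import metadata  # available in Python 3.8+

def installed_version(package_name):
    """Return the installed version string of a package, or None if it is not installed."""
    try:
        return metadata.version(package_name)
    except metadata.PackageNotFoundError:
        return None

print(installed_version("gensim"))
```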

- Gordon

Fatemeh Lashkari

Apr 5, 2017, 1:00:13 PM

to gen...@googlegroups.com

Thanks Gordon.

To answer your question:

> Are you using a recent gensim version? (The filename `my_model.doc2vec.syn1.npy` implies your model is using hierarchical-softmax mode, `hs=1, negative=0`, whereas the default for quite a while has instead been `negative=5, hs=0`.)

My gensim version is 0.13.3.

You mean my model was not built correctly?

> The model's logging of an estimated size should have happened in the range clipped out with "..." – between the initial "collecting all words" and beginning of training.

Best Regards,

Fatemeh

Radim Řehůřek

Apr 5, 2017, 9:46:31 PM

to gensim

Hello Fatemeh,

> You mean my model was not built correctly?
> The model's logging of an estimated size should have happened in the range clipped out with "..." – between the initial "collecting all words" and beginning of training.

It means you replaced the interesting part of the log with "...".

You can also calculate the approximate space needed by following Gordon's "sources of model size" formulas above.

HTH,

Radim

Fatemeh Lashkari

Apr 6, 2017, 4:32:35 PM

to gensim

Thanks. Just one more question.

I do not have `model.wv.syn0`, `model.wv.vocab` and `model.docvecs.doctags`. How can I change the parameters of Doc2Vec so that these files are also created when I build the model?

model = Doc2Vec(allDocs, size = 300, window = 5, min_count = 2, workers = 4)

Gordon Mohr

Apr 6, 2017, 5:30:11 PM

to gensim

What makes you think you "do not have" them? They will be automatically created as needed by training. If your model is training, or individual word/doc vectors can be accessed after training, they exist. (If `model.wv` does not exist, that could be because you're using an older gensim. You could uninstall & reinstall to be up-to-date.)

Some of them might be saved as separate files by `save()`, but only if they exceed certain sizes. If they're small enough to fit in the main model save file, that's where they'll go – and then you have fewer files to keep together. So it's not really beneficial to force/expect them to be separate files.

- Gordon
