Get "IOError: 1186398300 requested and 649239532 written" error


Fatemeh Lashkari

Apr 3, 2017, 8:31:13 AM
to gensim
I tried to build a model for a 7.8 GB corpus, and I got the error below when saving the model.

I create the model this way:

model = Doc2Vec(allDocs, size = 300, window = 5, min_count = 2, workers = 4)
model.save(mPath+'/my_model.doc2vec')

Traceback (most recent call last):
  File "buildModelIterator.py", line 71, in <module>
    model.save(mPath+'/my_model.doc2vec')
  File "/usr/local/lib/python2.7/dist-packages/gensim-0.13.3-py2.7-linux-x86_64.egg/gensim/models/word2vec.py", line 1756, in save
    super(Word2Vec, self).save(*args, **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/gensim-0.13.3-py2.7-linux-x86_64.egg/gensim/utils.py", line 479, in save
    pickle_protocol=pickle_protocol)
  File "/usr/local/lib/python2.7/dist-packages/gensim-0.13.3-py2.7-linux-x86_64.egg/gensim/utils.py", line 349, in _smart_save
    compress, subname)
  File "/usr/local/lib/python2.7/dist-packages/gensim-0.13.3-py2.7-linux-x86_64.egg/gensim/utils.py", line 406, in _save_specials
    numpy.save(subname(fname, attrib), numpy.ascontiguousarray(val))
  File "/usr/local/lib/python2.7/dist-packages/numpy/lib/npyio.py", line 491, in save
    pickle_kwargs=pickle_kwargs)
  File "/usr/local/lib/python2.7/dist-packages/numpy/lib/format.py", line 584, in write_array
    array.tofile(fp)
IOError: 1186398300 requested and 649239532 written


How can I solve it?
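
For reference, the two numbers in the `IOError` can be unpacked with a little arithmetic. This is only a sketch under assumptions the thread does not confirm: that the numbers count array elements rather than bytes, and that the array holds 4-byte floats with the 300 dimensions from the `Doc2Vec` call above. `parse_tofile_error` is a hypothetical helper name.

```python
import re


def parse_tofile_error(message):
    """Return (requested, written) from an 'N requested and M written' message."""
    match = re.search(r"(\d+) requested and (\d+) written", message)
    if match is None:
        raise ValueError("unrecognized message: %r" % message)
    return int(match.group(1)), int(match.group(2))


requested, written = parse_tofile_error(
    "IOError: 1186398300 requested and 649239532 written"
)

DIMENSIONS = 300      # `size = 300` in the Doc2Vec call above
BYTES_PER_FLOAT = 4   # assumption: float32 vectors

# Assumption: the counts are array elements, so dividing by the
# dimensionality gives the number of vectors in the array.
rows = requested // DIMENSIONS
requested_bytes = requested * BYTES_PER_FLOAT  # size of the complete array
written_bytes = written * BYTES_PER_FLOAT      # amount written before the failure

print(rows, requested_bytes, written_bytes)
```

If those assumptions hold, this one array needs roughly 4.7 GB on disk and the write stopped after roughly 2.6 GB, which would be consistent with the target volume filling up.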
 

Gordon Mohr

Apr 3, 2017, 4:47:00 PM
to gensim
Are you sure there is enough free space on the target disk volume?
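
One way to check this from Python is the standard library's `shutil.disk_usage`. The sketch below is only illustrative: `free_bytes` and `has_room` are hypothetical helper names, `"."` stands in for the directory the model is saved to, and it needs Python 3 (the traceback above is from Python 2.7).

```python
import shutil


def free_bytes(path):
    """Return the number of free bytes on the volume that contains `path`."""
    return shutil.disk_usage(path).free


def has_room(path, needed_bytes):
    """Return True if the volume containing `path` has at least `needed_bytes` free."""
    return free_bytes(path) >= needed_bytes


# "." is a placeholder for the directory the model will be saved to.
print(free_bytes("."))
```

Comparing `free_bytes(...)` with an estimate of the model size before calling `save()` avoids discovering the shortfall only after hours of training.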

- Gordon

Fatemeh Lashkari

Apr 3, 2017, 5:10:52 PM
to gen...@googlegroups.com
How can I estimate how much free space is needed to save my model?

Best Regards,
Fatemeh

--
You received this message because you are subscribed to a topic in the Google Groups "gensim" group.
To unsubscribe from this topic, visit https://groups.google.com/d/topic/gensim/Zczd7yvwKQM/unsubscribe.
To unsubscribe from this group and all its topics, send an email to gensim+unsubscribe@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Gordon Mohr

Apr 3, 2017, 6:00:24 PM
to gensim
If you've enabled INFO level logging, during model setup there will be a rough estimate of needed RAM, and storage-on-disk is roughly similar in size. 
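
A minimal way to turn on INFO logging with Python's standard `logging` module is sketched below. The format string is an assumption, chosen to match the `timestamp : INFO : message` shape of gensim log lines. Run it before building the model.

```python
import logging

# Attach a handler with a "timestamp : LEVEL : message" layout.
logging.basicConfig(format="%(asctime)s : %(levelname)s : %(message)s")

# Set the level explicitly: basicConfig() does nothing if the root logger
# already has handlers (for example inside a notebook).
logging.getLogger().setLevel(logging.INFO)

logging.info("INFO logging is enabled")
```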

The dominant sources of model size are:

word-vectors: (vocabulary-count) x (dimensions) x (4 bytes/float)     # aka `model.wv.syn0`
out-weight-vectors: (vocabulary-count) x (dimensions) x (4 bytes/float)   # aka `model.syn1` or `model.syn1neg`
doc-vectors: (doctag-vector-count) x (dimensions) x (4 bytes/float)   # aka `model.docvecs.doctag_syn0`
word-dictionary: pickle-size of model.wv.vocab
doctag-dictionary: pickle-size of model.docvecs.doctags (nothing if using plain-int doctags; large if using many string doctags)
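
The three array terms above can be turned into a quick back-of-the-envelope calculation. This is a sketch, not gensim's own estimator: `estimate_array_bytes` is a hypothetical helper, the counts in the example are made up for illustration, and the two pickled dictionaries are left out because their size depends on the actual vocabulary and doctags.

```python
def estimate_array_bytes(vocab_count, doctag_count, dimensions, bytes_per_float=4):
    """Estimate the size of the three large arrays, following the formulas above."""
    word_vectors = vocab_count * dimensions * bytes_per_float   # syn0
    out_weights = vocab_count * dimensions * bytes_per_float    # syn1 or syn1neg
    doc_vectors = doctag_count * dimensions * bytes_per_float   # doctag_syn0
    return {
        "word_vectors": word_vectors,
        "out_weights": out_weights,
        "doc_vectors": doc_vectors,
        "total": word_vectors + out_weights + doc_vectors,
    }


# Hypothetical counts, for illustration only.
estimate = estimate_array_bytes(vocab_count=1000000, doctag_count=100000, dimensions=300)
for name, size in estimate.items():
    print("%-13s %.2f GB" % (name, size / 1e9))
```

The real vocabulary and doctag counts are reported in the INFO log once the vocabulary scan finishes.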

- Gordon


Fatemeh Lashkari

Apr 4, 2017, 9:04:01 AM
to gen...@googlegroups.com
Thanks, Gordon. When I build the model with my code, I only get the files `my_model.doc2vec.syn0.npy`, `my_model.doc2vec.syn1.npy` and `my_model.doc2vec`. How can I build the model so that the other files are created as well?

Best Regards,
Fatemeh


Gordon Mohr

Apr 4, 2017, 2:11:09 PM
to gensim
A single call to `save()` should save exactly the files it needs to; there's no need for more files (or other steps) unless there's an error. (The files ending in `.npy` are arrays that are saved aside from the model-object itself, and should be kept alongside the main file for later reloading.)
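
To see which files belong together before moving or copying a saved model, something like the following can help. It is a sketch that assumes the side files share the main file's full path as a prefix, as the `.npy` filenames in this thread do. `saved_model_files` is a hypothetical helper, and `glob.escape` requires Python 3.4 or later.

```python
import glob
import os


def saved_model_files(model_path):
    """List the main model file plus any side files saved next to it."""
    # Escape the path so characters such as "[" are matched literally.
    pattern = glob.escape(model_path) + "*"
    return sorted(p for p in glob.glob(pattern) if os.path.isfile(p))


# "my_model.doc2vec" is the filename used earlier in this thread.
for path in saved_model_files("my_model.doc2vec"):
    print(path, os.path.getsize(path))
```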

Is there still an error? How much space is available? What does INFO logging show as the estimated model size, and then progress/success/failure at the time of `save()`?

Also, are you using a recent gensim version? (The filename `my_model.doc2vec.syn1.npy` implies your model is using hierarchical-softmax mode, `hs=1, negative=0`, whereas the default for quite a while has instead been `negative=5, hs=0`.)
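
The filename-based inference can be written down as a small check. This is only a sketch: `output_layers_from_files` is a hypothetical helper, and it assumes the side files are named after the attributes `syn1` and `syn1neg`. An empty result means the files do not settle the question, since small arrays may be stored inside the main file.

```python
def output_layers_from_files(filenames):
    """Guess the training mode(s) from the names of separately saved arrays."""
    names = [str(name) for name in filenames]
    modes = []
    if any(name.endswith(".syn1.npy") for name in names):
        modes.append("hierarchical softmax")   # hs=1
    if any(name.endswith(".syn1neg.npy") for name in names):
        modes.append("negative sampling")      # negative > 0
    return modes


files = [
    "my_model.doc2vec",
    "my_model.doc2vec.syn0.npy",
    "my_model.doc2vec.syn1.npy",
]
print(output_layers_from_files(files))
```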

- Gordon

Fatemeh Lashkari

Apr 5, 2017, 12:17:31 PM
to gensim
I have not tried to rebuild that model yet, because I first want to estimate the required space correctly.
This is part of the INFO log from building a model on a smaller input:
 

2017-04-04 23:53:09,388 : INFO : collecting all words and their counts
2017-04-04 23:53:09,388 : INFO : PROGRESS: at example #0, processed 0 words (0/s), 0 word types, 0 tags
2017-04-04 23:53:10,637 : INFO : PROGRESS: at example #10000, processed 4658741 words (3731303/s), 172162 word types, 10001 tags
2017-04-04 23:53:12,010 : INFO : PROGRESS: at example #20000, processed 9705612 words (3675846/s), 304250 word types, 20001 tags
2017-04-04 23:53:13,383 : INFO : PROGRESS: at example #30000, processed 14721694 words (3654145/s), 374764 word types, 30000 tags
2017-04-04 23:53:14,881 : INFO : PROGRESS: at example #40000, processed 20261251 words (3700028/s), 480911 word types, 40000 tags
2017-04-04 23:53:16,137 : INFO : PROGRESS: at example #50000, processed 24833526 words (3639378/s), 538055 word types, 49999 tags
2017-04-04 23:53:17,601 : INFO : PROGRESS: at example #60000, processed 30236474 words (3692793/s), 624805 word types, 59999 tags 
          ....
          ....
         ..... 
2017-04-05 06:22:54,712 : INFO : PROGRESS: at 99.98% examples, 251324 words/s, in_qsize 8, out_qsize 0
2017-04-05 06:22:55,779 : INFO : PROGRESS: at 99.98% examples, 251323 words/s, in_qsize 8, out_qsize 0
2017-04-05 06:22:56,823 : INFO : PROGRESS: at 99.99% examples, 251323 words/s, in_qsize 7, out_qsize 0
2017-04-05 06:22:57,831 : INFO : PROGRESS: at 99.99% examples, 251323 words/s, in_qsize 8, out_qsize 0
2017-04-05 06:22:58,858 : INFO : PROGRESS: at 100.00% examples, 251322 words/s, in_qsize 8, out_qsize 0
2017-04-05 06:22:59,629 : INFO : worker thread finished; awaiting finish of 3 more threads
2017-04-05 06:22:59,641 : INFO : worker thread finished; awaiting finish of 2 more threads
2017-04-05 06:22:59,664 : INFO : worker thread finished; awaiting finish of 1 more threads
2017-04-05 06:22:59,665 : INFO : worker thread finished; awaiting finish of 0 more threads
2017-04-05 06:22:59,665 : INFO : training on 5692755195 raw words (5678892465 effective words) took 22596.0s, 251323 effective words/s
2017-04-05 06:22:59,665 : INFO : saving Doc2Vec object under /home/fatemeh/Step2/input-output/finalDocs/model/my_model2.doc2vec, separately None
2017-04-05 06:22:59,665 : INFO : storing numpy array 'doctag_syn0' to /home/fatemeh/Step2/input-output/finalDocs/model/my_model2.doc2vec.docvecs.doctag_syn0.npy
2017-04-05 06:23:02,117 : INFO : not storing attribute syn0norm
2017-04-05 06:23:02,117 : INFO : not storing attribute cum_table
2017-04-05 06:23:02,117 : INFO : storing numpy array 'syn0' to /home/fatemeh/Step2/input-output/finalDocs/model/my_model2.doc2vec.syn0.npy
2017-04-05 06:23:07,322 : INFO : storing numpy array 'syn1' to /home/fatemeh/Step2/input-output/finalDocs/model/my_model2.doc2vec.syn1.npy
2017-04-05 06:25:29,245 : INFO : saved /home/fatemeh/Step2/input-output/finalDocs/model/my_model2.doc2vec 

I do not know which gensim version is installed on my computer. How can I find out?

Gordon Mohr

Apr 5, 2017, 12:47:45 PM
to gensim
The model's logging of an estimated size should have happened in the range clipped out with "..." – between the initial "collecting all words" and beginning of training. 

If you're using a virtual environment (which is highly recommended), and `pip` or similar to install gensim, then the command `pip freeze` from the command-line will print all installed packages and their versions, including gensim. 
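
The version can also be read from inside Python. The sketch below uses the standard library's `importlib.metadata`, which requires Python 3.8 or later and is therefore not available on the Python 2.7 setup shown in the traceback. `installed_version` is a hypothetical helper name.

```python
from importlib import metadata


def installed_version(package):
    """Return the installed version of `package`, or None if it is not installed."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


print(installed_version("gensim"))
```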

- Gordon

Fatemeh Lashkari

Apr 5, 2017, 1:00:13 PM
to gen...@googlegroups.com
Thanks Gordon.

To answer your question:
 are you using a recent gensim version? (The filename `my_model.doc2vec.syn1.npy` implies your model is using hierarchical-softmax mode, `hs=1, negative=0`, whereas the default for quite a while has instead been `negative=5, hs=0`.)

My gensim version is 0.13.3.


Do you mean my model was not built correctly?
The model's logging of an estimated size should have happened in the range clipped out with "..." – between the initial "collecting all words" and beginning of training. 

Best Regards,
Fatemeh


Radim Řehůřek

Apr 5, 2017, 9:46:31 PM
to gensim
Hello Fatemeh,

Do you mean my model was not built correctly?
The model's logging of an estimated size should have happened in the range clipped out with "..." – between the initial "collecting all words" and beginning of training. 

It means you replaced the interesting part of the log with "...".

You can also estimate the space needed by following Gordon's "sources of model size" formulas above.

HTH,
Radim

Fatemeh Lashkari

Apr 6, 2017, 4:32:35 PM
to gensim
Thanks. Just one more question.
I do not have `model.wv.syn0`, `model.wv.vocab` and `model.docvecs.doctags`. How can I change the parameters of `Doc2Vec` so that these files are also created when I build the model?

model = Doc2Vec(allDocs, size = 300, window = 5, min_count = 2, workers = 4)



Gordon Mohr

Apr 6, 2017, 5:30:11 PM
to gensim
What makes you think you "do not have" them? They will be automatically created as needed by training. If your model is training, or individual word/doc vectors can be accessed after training, they exist. (If `model.wv` does not exist, that could be because you're using an older gensim. You could uninstall & reinstall to be up-to-date.)

Some of them might be saved as separate files by `save()`, but only if they exceed certain sizes. If they're small enough to fit in the main model save file, that's where they'll go – and then you have fewer files to keep together. So it's not really beneficial to force/expect them to be separate files. 

- Gordon 