npy files in Doc2Vec

Robert Smith

Apr 6, 2016, 12:30:15 AM
to gensim
Quick question: I noticed Doc2Vec stores .npy files after training on a "large" dataset (~120,000 documents) but not after training on ~5,000 documents. Is this correct? If so, what threshold does Doc2Vec use to decide when to create these files? Apparently, the .npy files are required to load the model.

Another, vaguely related question (assuming you have experience with Amazon EC2 instances): for training this model I chose a c4.xlarge instance, but I'm not sure I chose wisely. Do you have any suggestions regarding an appropriate instance type for Doc2Vec workloads?

Regards

Gordon Mohr

Apr 6, 2016, 1:09:25 AM
to gensim
Word2Vec/Doc2Vec `save()` reuses the `save()` from `gensim.utils.SaveLoad`, which has an optional `sep_limit` parameter specifying how large a numpy array must be to be saved separately (see the `SaveLoad.save()` implementation in `gensim/utils.py`).

A numpy array whose `.size` is greater than this `sep_limit` value will be stored as a separate file. You could set this value much larger to force everything into the single python-pickled model file... but note that pickling breaks at some member size (I think 2GB), so larger models will need to use separate storage.
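A rough sketch of what that looks like (the filename and the `sep_limit` value here are illustrative, and the exact default may differ across gensim versions):

```python
from gensim.models.doc2vec import Doc2Vec

# Assume `model` is an already-trained Doc2Vec instance.
# With the default sep_limit (10 * 1024**2 elements, if I recall), any numpy
# member array at least that large is written to its own .npy file next to
# the main pickle. Raising sep_limit keeps more arrays inside the single
# pickled file; the value below is illustrative only.
model.save("my_doc2vec.model", sep_limit=100 * 1024 ** 2)

# Loading expects any separately-stored .npy files to sit alongside the
# main file, so move/copy them together.
model = Doc2Vec.load("my_doc2vec.model")
```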

The right-sized machine depends mostly on your model specifics: vocabulary size in the Word2Vec case, plus the count of training documents in the Doc2Vec case. You mainly want to be sure there's enough RAM to hold the full model (absolutely no swapping). When logging is on, you can see estimates of the memory needs printed during the `build_vocab()` step.
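For example, a minimal sketch of enabling that logging (the toy corpus and parameter values are just for illustration; newer gensim versions renamed `size` to `vector_size`):

```python
import logging

from gensim.models.doc2vec import Doc2Vec, TaggedDocument

# INFO-level logging makes gensim print progress and memory estimates.
logging.basicConfig(
    format="%(asctime)s : %(levelname)s : %(message)s",
    level=logging.INFO,
)

# Tiny stand-in corpus so the example is self-contained.
corpus = [
    TaggedDocument(words=["human", "machine", "interface"], tags=[0]),
    TaggedDocument(words=["survey", "of", "user", "opinion"], tags=[1]),
]

model = Doc2Vec(size=100, min_count=1, workers=4)
model.build_vocab(corpus)  # estimated memory needs appear in the INFO output
```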

- Gordon

Robert Smith

Apr 6, 2016, 11:39:49 AM
to gensim
Excellent answer. Thank you.