Loading big Doc2Vec model with error UnpicklingError: pickle data was truncated


Hoa Tran

Jul 25, 2019, 12:12:29 PM
to Gensim
I trained Doc2Vec on 20 GB of data and saved the model with `model.save()`.

I ran into `UnpicklingError: pickle data was truncated` when loading the model. Do you know what the issue is and how to load this model?

model.save("/User/D2v_.model")

model = Doc2Vec.load("/Users/D2v_.model")

The following files were saved by `model.save("/User/D2v_.model")`:

D2v_.model: 2.9 GB
D2v_.model.docvecs.vectors_docs.npy: 2.2 GB
D2v_.model.trainables.syn1neg.npy: 300 MB
D2v_.model.trainables.vectors_docs_lockf.npy: 400 MB
D2v_.model.wv.vectors.npy: 300 MB

Gordon Mohr

Jul 25, 2019, 1:13:44 PM
to Gensim
If you get a "pickle data was truncated" error when trying to load the model, then the portion of the save-data that is a Python object-pickling stream (the `D2v_.model` file) is probably truncated, i.e. cut off before its true end.

Maybe there was an error when the file was originally saved, or (if it used to load successfully) some corruption or truncation has happened since, such as during a later copy of the file?
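(For reference, this symptom is easy to reproduce with the stdlib `pickle` module alone; the toy payload below is purely illustrative, but it breaks in exactly the way a partially written model file would:)

```python
import pickle

# Serialize a reasonably large object, then simulate a cut-off save by
# throwing away the second half of the byte stream.
payload = pickle.dumps(list(range(10_000)), protocol=4)
truncated = payload[: len(payload) // 2]

try:
    pickle.loads(truncated)
    error_message = None
except pickle.UnpicklingError as exc:
    # Same class of error the gensim load() surfaces.
    error_message = str(exc)

print(error_message)
```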

If that file is now incomplete, the model may not be loadable, and it may be necessary to re-train it. When saving another model, pay extra close attention to any log output or errors during `save()`, and perhaps verify a successful `load()` right away, to be sure the saved model is complete.
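A cheap guard for the next training run is a save-then-reload round trip. Sketched below with stdlib `pickle` standing in for gensim's `save()`/`load()` (the `save_and_verify` helper and the toy paths are made up for illustration; with a real model you'd call `model.save(path)` followed by `Doc2Vec.load(path)`):

```python
import os
import pickle
import tempfile

def save_and_verify(obj, path):
    """Save obj, then immediately re-load it to confirm the file is complete.

    A truncated or corrupt file fails fast here, right after training,
    instead of much later when the model is next needed.
    """
    with open(path, "wb") as fh:
        pickle.dump(obj, fh, protocol=4)
    with open(path, "rb") as fh:
        return pickle.load(fh)  # raises if the stream is truncated

# Toy usage with a dict standing in for a model:
tmpdir = tempfile.mkdtemp()
model_path = os.path.join(tmpdir, "toy.model")
restored = save_and_verify({"vectors": list(range(5))}, model_path)
```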

(In a few cases, it *might* be possible to skip re-training: just repeat the original steps through the `build_vocab()` step, save that new not-yet-trained model, but then replace that new model's `.npy` files with your existing older ones. If your corpus is identical, and each doc-tag appears exactly once, and in the same order on the second `build_vocab()`, then the pickled-but-untrained model should have the correct doctag-to-doc-vector slots mapping for this "frankenstein model" to work, with regard to doc-vectors. Unfortunately its word-vectors might only be in the same positions, and thus usefully look-uppable, if you're using Python 2. There's a discussion of a similar theory for patching a partially-saved `Word2Vec` model at an issue <https://github.com/RaRe-Technologies/gensim/issues/2441#issuecomment-483101044>.)
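(The `.npy` swap described above amounts to plain file copies. A minimal stdlib sketch, where the directory names `old_save`/`new_save` and the dummy setup are purely illustrative; in reality `old_save` would hold the arrays from the original training run and `new_save` the freshly saved, not-yet-trained model:)

```python
import shutil
from pathlib import Path

MODEL = "D2v_.model"
NPY_SUFFIXES = [
    ".docvecs.vectors_docs.npy",
    ".trainables.syn1neg.npy",
    ".trainables.vectors_docs_lockf.npy",
    ".wv.vectors.npy",
]

old_save = Path("old_save")  # arrays from the original (trained) run
new_save = Path("new_save")  # freshly saved, untrained model

# Dummy setup so this sketch runs stand-alone; in practice both
# directories already exist with real gensim output.
for directory, marker in [(old_save, b"trained"), (new_save, b"untrained")]:
    directory.mkdir(exist_ok=True)
    for suffix in NPY_SUFFIXES:
        (directory / (MODEL + suffix)).write_bytes(marker)

# The actual swap: overwrite the new model's arrays with the old ones.
for suffix in NPY_SUFFIXES:
    shutil.copy2(old_save / (MODEL + suffix), new_save / (MODEL + suffix))
```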

- Gordon