Compression of wordvectors file for large embeddings


Maja-Olivia Lenz

Nov 17, 2021, 2:33:10 AM
to Gensim
Hi all,
I am storing my wordvectors as

wv_file = 'nodes.wv'
wordvectors = w2v_model.wv
wordvectors.save(wv_file)

After that, two files are stored: nodes.wv and nodes.wv.vectors.npy
The .npy file is much larger than the nodes.wv file that I specified, and I don't understand why it is created. I cannot load the vectors.npy file using the normal load method.
Is this because the embeddings are very large?
When I trained a smaller model, it used to only store the .wv file I specified.

Gordon Mohr

Nov 17, 2021, 5:14:12 AM
to Gensim
For larger models, Gensim starts storing large raw `numpy` arrays as separate files. For very-large models, this worked around some implementation limits in older Python pickling. Such separate-saving also enables the option of memory-mapping upon re-load, & might offer other slight advantages in other scenarios/analysis. 

Once it does this, you need to keep any subsidiary files starting with the same save-filename prefix alongside the main (`nodes.wv`) file, or that file can't be re-loaded later. (You'll never specify the other files, like `nodes.wv.vectors.npy`, directly; they'll be loaded automatically when you call `KeyedVectors.load('nodes.wv')`, which is the matching load for a saved `.wv` object.)

You can also change the threshold at which `.save()` stores some arrays separately by supplying an alternate, higher `sep_limit` parameter during the saving. That restores the single-save-file behavior for arbitrarily-larger models, but you'd then lose the options, & conformance with default behavior, of the usual approach, and might in some configurations risk hitting other implementation limits.

- Gordon

Maja-Olivia Lenz

Nov 17, 2021, 8:41:13 AM
to gen...@googlegroups.com
Many thanks, this was very helpful.
