word2vec api to keyedvectors

Gabriela Zuniga

Feb 11, 2023, 11:22:24 AM

to Gensim

I've used this code to download the model.

from gensim.models import KeyedVectors

import gensim.downloader as api

model = api.load('word2vec-google-news-300')

I would like to load the model with gensim.models.KeyedVectors.load_word2vec_format, or as KeyedVectors in general. Someone told me that it's faster.

Gordon Mohr

Feb 13, 2023, 1:42:38 PM

to Gensim

If you have any instance of `KeyedVectors`, you can save its full-word vectors out, into the same format as is readable by `.load_word2vec_format()`, using the companion method `.save_word2vec_format()`. For example:

model.save_word2vec_format('my_vectors.txt')

However, if you really just want that particular set of vectors in a file, I'd suggest just downloading directly from the original/quasi-official source, the link from this page: https://code.google.com/archive/p/word2vec/

You can uncompress that archive to get the exact original set-of-vectors, into local files at paths of your choosing, exactly as Google released them, in their original binary format – which is compatible with `.load_word2vec_format()` using the `binary=True` parameter. (If for some reason you prefer the text format you can re-save them per above.)

(In contrast, the `gensim.downloader` approach does a bunch of things that I consider more obscure & higher-risk.)

I'm not sure if either the original `word2vec_format` or Gensim's alternate native (Python pickle-based) `.save()`/`.load()` are faster for writing/reading vectors. The Gensim `.save()` will typically split data over more than one file – a complication – but also offers a special non-default `mmap` mode upon loading, which *may* offer some memory & avoiding-redundant-reloading benefits in *some* usage scenarios, like many processes providing a network service, all sharing the same set of word-vectors. You can read a bit more about that potential at a StackOverflow answer – https://stackoverflow.com/questions/65394022/how-can-a-word2vec-pretrained-model-be-loaded-in-gensim-faster/65400203#65400203 – but I'd recommend *against* such extra-complexity optimizations *unless and until* you're sure you need them. (That is, don't just do it because there's a vague sense it's "faster".)