Is there a way to work with gensim models without reading them entirely into memory?


Matt z

Jan 9, 2014, 1:30:13 PM
to gen...@googlegroups.com
Just wondering if there is an interface, either within gensim or elsewhere, for accessing a word2vec model for cosine-similarity lookups without reading the entire model into memory. I want to build a service that provides the functionality of the "most_similar" function in gensim (or ./distance in the word2vec C tool) for multiple models simultaneously and intersects the results, but it's difficult to fit all the models in memory.

Thanks.

Radim Řehůřek

Jan 10, 2014, 4:07:51 AM
to gen...@googlegroups.com
Hi Matt,

if you mean holding the models on disk only, and somehow doing the operations from disk, then no, gensim doesn't support that. It would be ridiculously slow too, even with top-end SSDs.

I think somebody mentioned here on the mailing list that they are optimizing memory consumption in word2vec -- some internal matrices are no longer necessary once you've trained the model and just want to use it (no more training). For example, you can do `del model.syn1` after training, to save some space.

With the latest (develop) branch of gensim from github, you can even do:
model.init_sims()
del model.syn0
del model.syn1

to save even more memory.

If you want to keep several *identical* models simultaneously, then that's better news. You could use shared memory (mmap), so that all processes share the same physical RAM, for a constant memory footprint, no matter how many models you load.

Radim

Ved Mathai

Apr 24, 2017, 10:34:27 AM
to gensim
Hi,
This seems to be a hotly requested feature: being able to use the word vectors directly from the large file on disk without needing to load it into RAM. It's been over three years since this topic was last posted on; does gensim support this now? If not, on personal computers with moderate resources it seems overkill to need ~3.5 GB of spare RAM just to get the vectors for individual words. I understand the argument that I/O is slow compared to RAM, but for people who can trade speed to save RAM, is there another reason why this shouldn't be an option in gensim, for those who want it?

Thanks,
Ved

Andrey Kutuzov

Apr 24, 2017, 10:47:41 AM
to gen...@googlegroups.com
Hi,

Why not just use models with a smaller vocabulary? For me, the 3 million words in the vocabulary of Mikolov's Google News model seem like overkill. A model with a 300K vocabulary needs 10 times less RAM, while still preserving comparable performance on all standard evaluation test sets.
You don't even have to train such a model yourself: the Google vectors seem to be sorted by word frequency, so you can simply cut off the lower n vectors.

--
Solve et coagula!
Andrey

Gordon Mohr

Apr 24, 2017, 1:34:54 PM
to gensim
The `load_word2vec_format()` method takes an optional `limit` argument to read just the given number of vectors from a supplied file, making it easy to take (for example) just the first 100,000 or 1,000,000 words from a file like the GoogleNews vectors.
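
If I understand the `limit` behavior right, it is roughly equivalent to this pure-Python sketch for the plain-text word2vec format (the function name `load_limited` and the toy file layout are mine for illustration, not gensim API):

```python
def load_limited(path, limit):
    """Read only the first `limit` vectors from a word2vec text-format file.

    The text format starts with a "<vocab_size> <dims>" header line, then
    one "<word> <v1> <v2> ..." line per vector. Because files like the
    GoogleNews vectors are sorted by frequency, the first `limit` entries
    are also the most frequent words.
    """
    vectors = {}
    with open(path, encoding='utf8') as f:
        vocab_size, dims = map(int, f.readline().split())
        for _ in range(min(limit, vocab_size)):
            parts = f.readline().rstrip().split(' ')
            vectors[parts[0]] = [float(x) for x in parts[1:1 + dims]]
    return vectors
```

The real gensim call is just `KeyedVectors.load_word2vec_format(path, binary=True, limit=100000)`.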

Alternatively, if you use gensim's native `.save()`, vector arrays over a (configurable) threshold size will be written to standalone files, and a subsequent `.load()` may use the optional `mmap='r'` argument to load the data via memory-mapping, which defers reading each range until it is accessed and lets less-accessed ranges be dropped from memory under other memory-usage pressure. For practical purposes, this is as good as or better than any other random-seeking-through-an-on-disk-format scheme that could be done.

Still, the best course for most who hit such limits will be to get more RAM, or use smaller subsets of their corpus or vectors. Some common operations in Word2Vec, like finding the nearest neighbors (most-similar) for any word, require a pairwise distance calculation against all vectors, which again brings all vectors into memory and will be frustratingly slow if relying on constant disk revisits.

- Gordon

Ved Mathai

Apr 24, 2017, 5:15:34 PM
to gensim

Using a limited set of vectors seems like a good makeshift idea, at least in the case of Mikolov's Google model, since, as you say, the words are ordered by frequency. Also, correct me if I am wrong, but to use the `.save()` function the whole model has to be loaded into RAM first, which is not really a good idea if I don't have space for it in the first place. Instead, tell me if there is something wrong with my method:
  1. Read the initial file byte by byte and recognize each word.
  2. Use each word as a key into a dictionary that remembers the seek() location of that word in the original large file.
  3. Pickle this dictionary index and use it to find the exact locations to read.
  4. As an improvement, page the words and remember the pages in another dictionary, so the first dictionary remembers the start and end locations. Maybe use some page-replacement policy too, depending on actual performance metrics.
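
For what it's worth, steps 1-3 above can be sketched in a few lines of stdlib Python, assuming the plain-text vector format (the helper names `build_offset_index` and `lookup` are mine, just for illustration):

```python
import pickle

def build_offset_index(path, index_path):
    """Scan a word2vec text-format file once, recording the byte offset
    of each word's line, then pickle the {word: offset} index to disk."""
    index = {}
    with open(path, 'rb') as f:
        f.readline()                      # skip the "<vocab> <dims>" header
        offset = f.tell()
        for line in f:
            word = line.split(b' ', 1)[0].decode('utf8')
            index[word] = offset
            offset += len(line)           # byte length, since file is binary
    with open(index_path, 'wb') as f:
        pickle.dump(index, f)
    return index

def lookup(path, index, word):
    """Fetch a single vector by seeking straight to its line on disk."""
    with open(path, 'rb') as f:
        f.seek(index[word])
        parts = f.readline().decode('utf8').split()
        return [float(x) for x in parts[1:]]
```

The pickled index holds only words and integer offsets, so it stays small even when the vectors themselves are gigabytes.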

I implemented this, and the two dictionaries come out to ~150 MB together. After the initial load of the two pickled dictionaries, individual vector extractions are quite fast (at least not frustratingly slow). Sorry, I cannot give you better metrics than this, because I don't have access to a machine that can load the entire model to compare the two speeds (though I could possibly test on a subset of the data).

This obviously doesn't answer the question of nearest-neighbour search, but I am thinking some clever indexing could solve that too; no concrete thoughts on it right now.

As a plug for this method: in virtual instances it could save money by trading RAM for larger, cheaper disk storage. It would also let us deal with larger models, like those trained on Wikipedia, when we don't want to cut off the model at some point.


-Ved

Gordon Mohr

Apr 24, 2017, 9:53:57 PM
to gensim
Yes, for the memory-mapping approach to work with the current code, you'd have to load the vectors into memory at least once. Yes, with a small amount of RAM that would force the use of virtual memory, and so it will be very slow. But the end result of that process is *also* to use virtual memory, accepting the slowness, for access. So that one-time slowness to convert things to a memory-mapped format seems a small price to pay.

(The KeyedVectors `load_word2vec_format()` code could be updated to create the `syn0` array to be backed by a mmap-ed read/write file in the first place, and so 'read' the vectors directly through memory into that file. It might complete a smidge faster.)
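A rough sketch of that idea, assuming the plain-text vector format and a plain numpy memmap rather than gensim's internals (the function name `parse_into_memmap` is hypothetical):

```python
import numpy as np

def parse_into_memmap(text_path, mmap_path):
    """Parse a word2vec text-format file straight into a disk-backed
    numpy array, so the full vector set never has to sit in RAM."""
    with open(text_path, encoding='utf8') as f:
        vocab_size, dims = map(int, f.readline().split())
        syn0 = np.memmap(mmap_path, dtype=np.float32, mode='w+',
                         shape=(vocab_size, dims))
        words = []
        for i, line in enumerate(f):
            parts = line.rstrip().split(' ')
            words.append(parts[0])
            syn0[i] = np.array(parts[1:1 + dims], dtype=np.float32)
        syn0.flush()                      # ensure the data reaches the file
    return words, syn0
```

A later session can then reopen the same file with `np.memmap(mmap_path, dtype=np.float32, mode='r', shape=(vocab_size, dims))` and let the OS page ranges in on demand.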

Your approach would work, but will always be significantly slower than RAM-resident vectors for common operations. (Nearest-neighbor search without a full scan is a challenging problem; many approaches trade off correctness of results for speed, more so than index compactness.)

An extra 4GB in your machine, or your cloud host, will usually pay for itself many times over when working in this domain, by speeding all sorts of related operations. 

- Gordon

anubhab majumdar

May 26, 2017, 12:57:33 AM
to gensim
Thanks for this "life-saver" answer!