Loading a huge bin file

ahmed dawod

Jul 31, 2016, 12:53:12 AM

to gensim

I'm trying to load GoogleNews-vectors-negative300.bin.gz, which is a 1.6 GB file, using this line:

model = Word2Vec.load_word2vec_format(pretrained, binary=True)

and it seems I cannot fit it all in memory. Is there any way I could load that file in parts?

I found this technique:
https://radimrehurek.com/gensim/tut1.html

which makes the corpus memory friendly like this:


class MyCorpus(object):
    def __iter__(self):
        for line in open(pretrained):
            # assume there's one document per line, tokens separated by whitespace
            yield dictionary.doc2bow(line.lower().split())

corpus_memory_friendly = MyCorpus()
for vector in corpus_memory_friendly:  # load one vector into memory at a time
    print(vector)

But this works only with the text format.

Thanks,

Gordon Mohr

Jul 31, 2016, 3:19:14 PM

to gensim

The streaming approach works if you have a process (like training) that only needs to look at each piece of the data briefly, in order, and then move on. GoogleNews-vectors-negative300.bin.gz, on the other hand, is the end result of a training process. Most uses will want to have the results completely in memory for common bulk or random-access operations (like finding the N most-similar vectors to a target value).

Note that the data in that file is around 3.5GB uncompressed – so you essentially need 4+ GB free to have any chance of completing a load and then doing other operations on the data.
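As a rough check on that figure, the arithmetic can be sketched as below. This is only a back-of-the-envelope estimate: it assumes 3 million tokens, 300 dimensions (as the file name suggests) and 4-byte float values, and it ignores the token strings and any Python-side overhead.

```python
# Rough size estimate for the raw vector array.
# Assumptions (not measured): 3,000,000 tokens, 300 dimensions,
# 4 bytes per value (float32). Token strings and the vocabulary
# dictionary are NOT counted, so the real footprint is larger.
VOCAB_SIZE = 3_000_000
VECTOR_SIZE = 300
BYTES_PER_FLOAT = 4


def raw_vector_bytes(vocab_size, vector_size, bytes_per_float=BYTES_PER_FLOAT):
    """Return the number of bytes needed for the vector values alone."""
    return vocab_size * vector_size * bytes_per_float


raw = raw_vector_bytes(VOCAB_SIZE, VECTOR_SIZE)
print(f"raw vectors: {raw / 1e9:.2f} GB ({raw / 2**30:.2f} GiB)")
```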

And further, to do operations like `most_similar()`, the vectors need to be unit-normalized. By default, this is done non-destructively – so it winds up creating another 3.5+ GB structure in memory, alongside the 3.5+ GB raw vectors. (You can force this to happen in-place, saving memory, by manually calling `model.init_sims(replace=True)` after loading but before any similarity-operations. But that requires the initial load to have succeeded.)
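To illustrate the in-place trade-off, here is a toy sketch in plain Python. It is not gensim's implementation; the function name and the flat-buffer layout are made up for the example. The point is only that overwriting each vector with its unit-length version needs no second full-size array, at the cost of losing the original vector lengths.

```python
import math
from array import array


def unit_normalize_in_place(flat, vector_size):
    """Scale each row of a flat float buffer to unit length, overwriting it.

    `flat` holds the vectors back to back; no second full-size buffer is
    allocated, only a temporary copy of one row at a time.
    """
    for start in range(0, len(flat), vector_size):
        row = flat[start:start + vector_size]
        norm = math.sqrt(sum(x * x for x in row))
        if norm > 0.0:  # leave all-zero rows untouched
            for i in range(vector_size):
                flat[start + i] = row[i] / norm


# Toy data: two 2-dimensional float32 vectors stored back to back.
vectors = array("f", [3.0, 4.0, 0.0, 2.0])
unit_normalize_in_place(vectors, 2)
print(list(vectors))
```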

I believe the GoogleNews-vectors-negative300.bin.gz vectors are front-loaded, with more-common tokens at the beginning, so you could consider uncompressing the data and truncating/splitting the file at reasonable points, then editing the first line of each resulting file so that it still states an accurate count. That'd allow you to work with subsets of the data with fewer than the full 3M tokens. We don't have any code to do this; you'd have to use other tools to edit the file(s) based on reading what the loading code expects.
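A minimal sketch of such a truncation, using only the Python standard library, might look like the following. The function name is hypothetical, and the file layout is an assumption based on the usual word2vec binary format: a text header line "<count> <dims>", then for each entry the token bytes, one space, and <dims> 4-byte floats. Check it against the loading code before relying on it.

```python
import io
import struct


def truncate_word2vec_bin(src, dst, keep):
    """Copy the first `keep` entries from src to dst with a corrected header.

    src and dst are binary file objects. Assumed layout: header line
    b"<count> <dims>\n", then per entry: token bytes, one space, and
    dims float32 values. Any newline between entries is copied verbatim
    as part of the following token's bytes.
    """
    count, dims = (int(x) for x in src.readline().split())
    keep = min(keep, count)
    dst.write(f"{keep} {dims}\n".encode("ascii"))
    for _ in range(keep):
        token = bytearray()
        while True:
            ch = src.read(1)
            if not ch:
                raise ValueError("unexpected end of file inside a token")
            token += ch
            if ch == b" ":
                break
        vector = src.read(4 * dims)
        if len(vector) != 4 * dims:
            raise ValueError("unexpected end of file inside a vector")
        dst.write(token)
        dst.write(vector)
    return keep, dims


# Toy round trip on an in-memory file with three 2-dimensional entries.
source = io.BytesIO()
source.write(b"3 2\n")
for word, vec in [(b"the", (1.0, 2.0)), (b"of", (3.0, 4.0)), (b"rare", (5.0, 6.0))]:
    source.write(word + b" " + struct.pack("<2f", *vec))
source.seek(0)

target = io.BytesIO()
kept = truncate_word2vec_bin(source, target, keep=2)
print(kept, len(target.getvalue()))
```

For a real file, the same function could be given objects from `open(path, "rb")` and `open(out_path, "wb")` on the uncompressed data; this reads and writes one entry at a time, so it should not need the whole file in memory.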

(With gensim's native save format, there are some memory-mapping tricks that might allow leaving most of the data in non-resident memory. But you'd have to succeed in loading the full data at least once before re-saving it to try those, and even if it works, the performance for common tasks would likely be very poor.)

A dataset of this size essentially requires more RAM to work with sensibly.

- Gordon

ahmed dawod

Jul 31, 2016, 4:45:05 PM

to gensim

Thank you for that. I was actually able to finally load it into memory after I closed all other applications and freed almost all of the RAM. I have 6 GB of RAM with a 2 GB swap partition.
