Memory problems on local machine and VM


yahya mortassim

unread,
Jul 17, 2017, 10:27:52 AM7/17/17
to gensim
Hi all,

I'm using a pre-trained FastText .vec file to find similarities between a list of words. The file is 3 GB. I tried two approaches:

1) On my local machine: I manage to load the file via `load_word2vec_format()`, but when I try to calculate word similarities, I get a memory error. So I tried calling `init_sims()` before calculating similarities, but it's taking a very long time.

2) Using Google Cloud Datalab (a Jupyter notebook with Python 2.7 + a 52 GB VM): the kernel automatically dies while loading the .vec file from Cloud Storage.

Has anyone tried to use gensim with Google Cloud services before, especially Cloud Datalab? If so, did you get this kind of error with relatively large files?

Thank you in advance.

Gordon Mohr

unread,
Jul 17, 2017, 2:45:47 PM7/17/17
to gensim
For your local machine, unless you're using a 64-bit Python and have at least 8GB RAM, it will be hard to work with a 3GB vector file. 

You could try the optional argument to `load_word2vec_format()`, `limit`, which when given a number will only read that many vectors from the front of the supplied file. (As such files are usually organized to put the more-frequent words first, the later words are usually of much less value.) Loading just the first 100,000 or 500,000 words might save a lot of memory and yet still be fine for your other purposes. 
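As a rough sketch of why `limit` helps: each loaded vector costs its dimensionality in float32 values (4 bytes each), so a call like `KeyedVectors.load_word2vec_format('wiki.en.vec', limit=500000)` (the path, the 2,000,000-word file size, and 300 dimensions below are assumptions about a typical pre-trained FastText file, not figures from this thread) shrinks the raw-vector footprint considerably:

```python
# Back-of-envelope memory estimate for word vectors stored as float32
# (4 bytes per value). The 2,000,000-word count and 300 dimensions are
# assumptions about a typical pre-trained FastText .vec file.
def vectors_bytes(n_vectors, dim):
    return n_vectors * dim * 4  # bytes of raw vector data

full_load = vectors_bytes(2000000, 300)  # full file: ~2.4e9 bytes (~2.4 GB)
limited = vectors_bytes(500000, 300)     # limit=500000: ~6.0e8 bytes (~0.6 GB)
```

(The actual process footprint is larger still, because of the vocabulary dict and per-word bookkeeping on top of the raw array.)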

Calling `init_sims()` will only help reduce memory usage if you use the `init_sims(replace=True)` option – which discards the raw-magnitude vectors in favor of the unit-normalized vectors that are used for most-similar measurements. 
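Conceptually, what `init_sims(replace=True)` keeps is the L2-normalized form of each vector, after which cosine similarity reduces to a plain dot product. A minimal pure-Python sketch of that idea (not gensim's actual implementation, which normalizes the whole numpy matrix at once):

```python
import math

def unit_normalize(vec):
    """Scale a vector to unit length (L2 norm = 1)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def dot(a, b):
    """Dot product; equals cosine similarity for unit-length vectors."""
    return sum(x * y for x, y in zip(a, b))

v = unit_normalize([3.0, 4.0])  # -> [0.6, 0.8], norm 5.0 divided out
```

With `replace=True` only this normalized copy is retained, so memory usage stays at one matrix instead of two.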

I've seen others with issues on Google Cloud but don't recall/know any specific workarounds, sorry. 

- Gordon

yahya mortassim

unread,
Jul 18, 2017, 6:08:37 AM7/18/17
to gensim
I'm using a 64-bit Python with 8GB RAM. Even with a `limit` of 300,000 words the kernel dies. I'm trying to calculate similarities between a list of 90,000 words (calculating the similarity between each word and all the others), and that's where the problem occurs.

Gordon Mohr

unread,
Jul 18, 2017, 12:06:51 PM7/18/17
to gensim
To store pairwise similarities between 90,000 vectors in memory will require at least...

    ((90,000 × 90,000) / 2) × 4 bytes per float ≈ 16 GB

...in addition to the initial load & unit-normalization (which is already likely to use most/all of 8GB RAM). So if you're trying to do this in a single operation, or simple loop, it isn't surprising that you'd either hit memory errors, or start using an immense amount of slow swap memory. 
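The arithmetic above, and one generic way around it, sketched below. Rather than materializing the full pairwise matrix, you can process one query word at a time and keep only its top-k neighbours (this streaming pattern is a common workaround, not something gensim does for you automatically; the `sim` callable is a placeholder for whatever similarity function you use):

```python
import heapq

# Memory needed to hold all pairwise float32 similarities for 90,000 words,
# storing each unordered pair once (4 bytes per float):
n = 90000
pairwise_bytes = (n * n // 2) * 4  # 16,200,000,000 bytes, i.e. ~16 GB

# Workaround: never build the full matrix. For each query word, scan the
# others and keep only the k highest-scoring neighbours; peak memory is
# then O(k) per query instead of O(n^2) overall.
def top_k_similar(query, others, k, sim):
    """Return the k (score, word) pairs with the highest sim(query, word)."""
    return heapq.nlargest(k, ((sim(query, w), w) for w in others))
```

Each of the 90,000 queries still costs a full scan, so total time is quadratic, but memory stays flat.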

- Gordon