RAM usage for word2vec


Tim Pierson

Apr 13, 2014, 7:05:37 PM
to gen...@googlegroups.com
Hi everyone,
  I was just running a large corpus through the word2vec model (5.4 million word types, 1000 dimensions), only to find that resetting the layer weights after constructing the Huffman tree tried to allocate more RAM than I have (16 GB). Is there a ratio we could use to estimate RAM requirements, along the lines of the one given here for LDA? Does increasing the number of worker threads increase the amount of RAM? (And would dropping the number of threads make training at higher dimensions possible?)
Thanks,  

Radim Řehůřek

Apr 14, 2014, 4:03:45 AM
to gen...@googlegroups.com
Hello Tim,

On Monday, April 14, 2014 1:05:37 AM UTC+2, Tim Pierson wrote:
Hi everyone,
  I was just running a large corpus through the word2vec model (5.4 million word types, 1000 dimensions), only to find that resetting the layer weights after constructing the Huffman tree tried to allocate more RAM than I have (16 GB). Is there a ratio we could use to estimate RAM requirements, along the lines of the one given here for LDA? Does increasing the number of worker threads

The requirements are pretty much identical to those of LDA/LSA. All the models in gensim need O(#dimensions * #vocabulary) memory.

So in your case, expect 2 matrices * 4 bytes per float * 1,000 dimensions * 5.4m vocab = ~43 GB of RAM. After training, run `model.init_sims(replace=True)` immediately to get rid of unnecessary objects.

The original C tool will need about the same, ~43 GB of RAM.

A 1k x 5.4m model is already fairly large: the GoogleNews word2vec model released by Google was only 300 dimensions x 3m vocab. You could just about train that one on your 16 GB machine.
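For a quick back-of-the-envelope check, a small Python helper like the following applies the same ratio (a sketch of my own, not a gensim function; the name and defaults are just for illustration):

    # Rough word2vec RAM estimate: two weight matrices (input and output
    # layers) of 4-byte floats, each #vocabulary x #dimensions in size.
    def estimate_word2vec_ram_gb(vocab_size, dimensions,
                                 bytes_per_float=4, num_matrices=2):
        return num_matrices * bytes_per_float * dimensions * vocab_size / 1e9

    print(estimate_word2vec_ram_gb(5400000, 1000))  # ~43.2 GB, the case above
    print(estimate_word2vec_ram_gb(3000000, 300))   # ~7.2 GB, a GoogleNews-sized model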
 

increase the amount of RAM? (And would dropping the number of threads make training at higher dimensions possible?)


Nope, the number of workers has no effect on memory.

HTH,
Radim
