Memory Error While Training Word2vec


Vahid Ba

Jul 24, 2019, 6:21:22 PM7/24/19
to Gensim
I'm trying to train a word2vec model with approximately 14 million unique words and a vector size of 100. Gensim crashes after using 12 GB of my 16 GB of RAM, with the following traceback:


  File "/usr/local/lib/python3.6/dist-packages/gensim/models/word2vec.py", line 783, in __init__
    fast_version=FAST_VERSION)
  File "/usr/local/lib/python3.6/dist-packages/gensim/models/base_any2vec.py", line 759, in __init__
    self.build_vocab(sentences=sentences, corpus_file=corpus_file, trim_rule=trim_rule)
  File "/usr/local/lib/python3.6/dist-packages/gensim/models/base_any2vec.py", line 943, in build_vocab
    self.trainables.prepare_weights(self.hs, self.negative, self.wv, update=update, vocabulary=self.vocabulary)
  File "/usr/local/lib/python3.6/dist-packages/gensim/models/word2vec.py", line 1876, in prepare_weights
    self.reset_weights(hs, negative, wv)
  File "/usr/local/lib/python3.6/dist-packages/gensim/models/word2vec.py", line 1897, in reset_weights
    self.syn1neg = zeros((len(wv.vocab), self.layer1_size), dtype=REAL)
MemoryError

Does anybody know how much memory I would need to avoid this error?
Is it possible to train in batches and then combine them into a single model?

Gordon Mohr

Jul 25, 2019, 2:00:13 AM7/25/19
to Gensim
14 million words * 100 dimensions * 4 bytes/dimension = 5.6GB just for the word-vectors. A model requires that much memory again for its hidden-to-output neural-network weights, then more for the dictionary mapping words to their frequencies and slots – so that's 12GB or more. If anything else is consuming memory – especially, if your corpus isn't being efficiently streamed from disk to avoid taking up much RAM – that'd potentially fill or overflow 16GB RAM. 
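(As a concrete sketch of the streaming point: rather than building one big in-memory list of sentences, pass an iterable that re-reads the corpus from disk on every pass – for example gensim's own `LineSentence`, assuming a whitespace-tokenized, one-sentence-per-line text file. The path below is just a placeholder.)

    from gensim.models.word2vec import Word2Vec, LineSentence

    # Streams the corpus from disk on each pass, so the text itself
    # never has to sit in RAM – only the model's arrays do.
    sentences = LineSentence('corpus.txt')  # placeholder path
    model = Word2Vec(sentences, size=100, workers=4)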

Your best bet is to reduce the vocabulary with a higher `min_count`, discarding less-frequent words. A smaller vocabulary most directly reduces the model's memory size. Also, rare words tend not to have enough examples-of-usage, and enough influence-on-the-model, to receive really good word-vectors anyway – so they often serve mainly as noise, making other more-common and more-important word-vectors worse. (Even the giant `GoogleNews` set of pre-trained word-vectors has only a ~3 million word vocabulary.)
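For example (a sketch, with an arbitrary cutoff you'd tune against your own corpus and memory budget):

    from gensim.models.word2vec import Word2Vec, LineSentence

    sentences = LineSentence('corpus.txt')  # placeholder path, as above
    # Ignore any word appearing fewer than 50 times; a higher min_count
    # shrinks the vocabulary, and with it both big weight arrays,
    # before any of that memory is allocated.
    model = Word2Vec(sentences, size=100, min_count=50, workers=4)
    print(len(model.wv.vocab))  # how many words survived the cutoff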

There's no good way to split the vocabulary into separate models/batches: words need to train together to be meaningfully situated in the "same coordinate space". More memory, fewer words, or fewer dimensions are your main options. (And, if you've really got enough text to justify a 14-million-word vocabulary, you might normally consider larger word-vectors too, say 300-400 dimensions – but of course those would take proportionately more addressable memory.)
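(To see how those options trade off, here's a rough back-of-the-envelope helper – my own sketch, not anything in gensim's API – covering just the two big float32 arrays; the real model needs some extra beyond this for the vocabulary dictionary.)

    def rough_word2vec_bytes(vocab_size, vector_size):
        # Two float32 arrays dominate: the word-vectors and the
        # negative-sampling output weights, each vocab_size x vector_size.
        return 2 * vocab_size * vector_size * 4

    print(rough_word2vec_bytes(14_000_000, 100) / 1e9)  # ~11.2 GB
    print(rough_word2vec_bytes(3_000_000, 100) / 1e9)   # ~2.4 GB
    print(rough_word2vec_bytes(3_000_000, 300) / 1e9)   # ~7.2 GB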

- Gordon