GoogleNews-vectors-negative300.bin.gz: loading as model killed. What to do?

Marco Ippolito

Sep 17, 2014, 3:30:50 AM
to gen...@googlegroups.com
Hi Radim and hi everybody,

this time I tried to load the entire GoogleNews-vectors-negative300.bin.gz as a model.

 python
Python 2.7.6 (default, Mar 22 2014, 22:59:56)
[GCC 4.8.2] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import gensim, logging
>>> logging.basicConfig(
...  format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)
>>> from gensim.models import Word2Vec
>>> model = Word2Vec.load_word2vec_format(
...  '/home/ubuntu/ggc/prove/DCNN/G.bin.gz', binary=True)
2014-09-17 07:27:08,671 : INFO : loading projection weights from /home/ubuntu/ggc/prove/DCNN/G.bin.gz
Killed

What do I have to do in order to correctly load GoogleNews-vectors-negative300.bin.gz as a model?

Looking forward to your helpful hints.
Kind regards.
Marco

Marco Ippolito

Sep 17, 2014, 6:58:38 AM
to gen...@googlegroups.com
Hi Radim and hi everybody,

I checked my AWS instance characteristics:
Model      vCPU   Mem (GiB)   SSD Storage (GB)
c3.large   2      3.75        2 x 16
Is this enough for loading the entire GoogleNews-vectors-negative300.bin.gz as a model? What are the minimum requirements for this task?

Looking forward to your help.
Kind regards.
Marco

Marco Ippolito

Sep 17, 2014, 12:19:09 PM
to gen...@googlegroups.com
As a last attempt to load at least part of the GoogleNews model, I split the big original file (I know I shouldn't, but it was a last try) into several pieces of 100 MB each.
I then tried to load the very first of these pieces into a model:

>>> model = Word2Vec.load_word2vec_format(
... '/home/ubuntu/ggc/prove/DCNN/G_dir/aa', binary=True)
2014-09-17 16:11:18,311 : INFO : loading projection weights from /home/ubuntu/ggc/prove/DCNN/G_dir/aa
Traceback (most recent call last):
  File "<stdin>", line 2, in <module>
  File "/usr/local/lib/python2.7/dist-packages/gensim-0.9.0-py2.7.egg/gensim/models/word2vec.py", line 445, in load_word2vec_format
    result.syn0[line_no] = fromstring(fin.read(binary_len), dtype=REAL)
ValueError: could not broadcast input array from shape (67) into shape (300)
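
The shape mismatch above (67 floats instead of 300) is what a reader sees when the file ends partway through a vector: the header still promises the full vocabulary, but the bytes run out. The following is a minimal sketch of that effect, not gensim's actual code. It assumes the word2vec binary layout is a text header "vocab_size layer_size", then for each word the word, a space, the raw float32 values and a newline; `write_w2v_binary` and `read_w2v_binary` are hypothetical helper names made up for the illustration.

```python
import io
import struct


def write_w2v_binary(vectors):
    """Serialize {word: [floats]} in the assumed word2vec binary layout."""
    dim = len(next(iter(vectors.values())))
    out = io.BytesIO()
    # Text header: vocabulary size and vector length.
    out.write(("%d %d\n" % (len(vectors), dim)).encode("utf8"))
    for word, vec in vectors.items():
        out.write(word.encode("utf8") + b" ")
        out.write(struct.pack("<%df" % dim, *vec))  # raw float32 values
        out.write(b"\n")
    return out.getvalue()


def read_w2v_binary(data):
    """Parse the layout above; fail if a vector is cut short."""
    fin = io.BytesIO(data)
    vocab_size, dim = map(int, fin.readline().split())
    binary_len = 4 * dim  # bytes per vector (float32)
    result = {}
    for _ in range(vocab_size):
        word = bytearray()
        while True:
            ch = fin.read(1)
            if ch == b" " or ch == b"":
                break
            if ch != b"\n":  # skip the newline left over from the previous entry
                word.extend(ch)
        raw = fin.read(binary_len)
        if len(raw) != binary_len:
            raise ValueError("expected %d floats, got %d" % (dim, len(raw) // 4))
        result[word.decode("utf8")] = struct.unpack("<%df" % dim, raw)
    return result


full = write_w2v_binary({"a": [1.0, 2.0, 3.0, 4.0], "b": [5.0, 6.0, 7.0, 8.0]})
print(sorted(read_w2v_binary(full)))  # the complete file parses fine

try:
    read_w2v_binary(full[:-6])  # simulate a chunk that ends mid-vector
except ValueError as err:
    print("truncated file:", err)
```

Under this assumed layout, a byte-level split cannot yield a loadable piece: only the first chunk carries the header, and that header describes the whole file rather than the chunk.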

Radim Řehůřek

Sep 18, 2014, 3:11:13 AM
to gen...@googlegroups.com
Hello Marco,

your OS is killing the Python process, probably because you ran out of RAM.

How much RAM is needed depends on the vocabulary & layer size -- the calculation is described in the word2vec tutorial here:

(the googlenews model has a vocabulary of 3 million and layer size of 300).

Note that allocating the 3,000,000 x 300 matrix of floats needs contiguous (unfragmented) memory -- you'll likely need a machine with >4GB RAM for GoogleNews... and 8GB wouldn't hurt.
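
As a rough back-of-the-envelope check of the figures above, the size of the weight matrix alone can be computed as follows. This sketch assumes 4-byte (single-precision) floats; `syn0_bytes` is a hypothetical helper name, and the vocabulary data structures add further overhead that is not counted here.

```python
def syn0_bytes(vocab_size, layer_size, bytes_per_float=4):
    """Bytes needed for a vocab_size x layer_size matrix of floats."""
    return vocab_size * layer_size * bytes_per_float


# GoogleNews: vocabulary of 3 million, layer size of 300.
needed = syn0_bytes(3000000, 300)
gib = needed / float(2 ** 30)

print("matrix alone: %d bytes, about %.2f GiB" % (needed, gib))
```

The matrix by itself comes to about 3.35 GiB, which leaves almost no headroom on a 3.75 GiB instance once the OS, the Python process and the vocabulary are accounted for.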

HTH,
Radim

Marco Ippolito

Sep 18, 2014, 5:04:34 AM
to gen...@googlegroups.com
Thanks, Radim, for answering my question.

Kind regards.
Marco