How to load the pre-trained Google News vectors into gensim?


Marco Ippolito

Sep 16, 2014, 7:46:56 AM
to gen...@googlegroups.com
Hi everybody,

in order to use the pre-trained Google News word and phrase vectors on my AWS Ubuntu C3 instance, I downloaded the whole big file to my Windows laptop and split it into smaller files (10 MB each).
I uploaded the first of these smaller files to my AWS Ubuntu C3 instance to try a trial load of the model.

python
Python 2.7.6 (default, Mar 22 2014, 22:59:56)
[GCC 4.8.2] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import gensim
>>> import logging
>>> logging.basicConfig(
... format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)

>>> from gensim.models import Word2Vec
>>> model = Word2Vec.load_word2vec_format(
... '/home/ubuntu/ggc/prove/DCNN/G.bin.001')
2014-09-16 11:33:19,068 : INFO : loading projection weights from /home/ubuntu/ggc/prove/DCNN/G.bin.001
Traceback (most recent call last):
  File "<stdin>", line 2, in <module>
  File "/usr/local/lib/python2.7/dist-packages/gensim-0.9.0-py2.7.egg/gensim/models/word2vec.py", line 422, in load_word2vec_format
    vocab_size, layer1_size = map(int, header.split())  # throws for invalid file format
ValueError: invalid literal for int() with base 10: '\x1f\x8b\x08\x08\x07I\x17T\x02'
>>> model = Word2Vec.load_word2vec_format(
... '/home/ubuntu/ggc/prove/DCNN/G.bin.001')
2014-09-16 11:44:09,924 : INFO : loading projection weights from /home/ubuntu/ggc/prove/DCNN/G.bin.001
Traceback (most recent call last):
  File "<stdin>", line 2, in <module>
  File "/usr/local/lib/python2.7/dist-packages/gensim-0.9.0-py2.7.egg/gensim/models/word2vec.py", line 422, in load_word2vec_format
    vocab_size, layer1_size = map(int, header.split())  # throws for invalid file format
ValueError: invalid literal for int() with base 10: '\x1f\x8b\x08\x08\x07I\x17T\x02'

What am I doing wrong, and what can I do to solve the problem?

Looking forward to your helpful hints.
Kind regards.
Marco

Radim Řehůřek

Sep 16, 2014, 1:02:39 PM
to gen...@googlegroups.com
Hello Marco,

binary model files cannot be split willy-nilly like that.

You'll have to either use a machine where the entire model fits into memory, or do the splitting more cleverly (+ probably do some low-level file editing).

HTH,
Radim
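
The bytes in the ValueError above ('\x1f\x8b\x08') are the gzip magic number followed by the deflate method byte, so G.bin.001 was a slice of the compressed archive, not of the vectors themselves. Below is a minimal sketch of one way to do the splitting "more cleverly": read only the first few vectors from the complete file. It assumes the usual word2vec binary layout (a text header line "vocab_size vector_size", then for each word its bytes, a space, and vector_size little-endian float32 values). The names open_maybe_gzip and read_first_vectors are hypothetical helpers, not part of gensim.

```python
import gzip
import struct


def open_maybe_gzip(path):
    """Open path for binary reading, decompressing it if it is gzipped."""
    with open(path, "rb") as f:
        magic = f.read(2)
    # 0x1f 0x8b is the gzip magic number.
    if magic == b"\x1f\x8b":
        return gzip.open(path, "rb")
    return open(path, "rb")


def read_first_vectors(path, limit):
    """Read at most `limit` (word, vector) pairs from a word2vec binary file.

    Assumed layout: header "vocab_size vector_size\n", then per word:
    word bytes, one space, vector_size float32 values.
    """
    vectors = {}
    with open_maybe_gzip(path) as f:
        vocab_size, vector_size = map(int, f.readline().split())
        for _ in range(min(limit, vocab_size)):
            chars = []
            while True:
                ch = f.read(1)
                if ch == b" " or ch == b"":
                    break
                if ch != b"\n":  # some writers emit a newline after each vector
                    chars.append(ch)
            word = b"".join(chars).decode("utf-8", errors="replace")
            raw = f.read(4 * vector_size)
            if len(raw) != 4 * vector_size:
                raise ValueError("file ended in the middle of a vector")
            vectors[word] = struct.unpack("<%df" % vector_size, raw)
    return vocab_size, vector_size, vectors
```

Because only `limit` vectors are kept, memory use stays small no matter how large the file is, and the header still reports the full vocabulary size.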

Marco Ippolito

Sep 16, 2014, 2:41:43 PM
to gen...@googlegroups.com

Thank you very much, Radim, for your kind explanation. I'm now uploading the entire .gz file.
Kind regards.
Marco

--
You received this message because you are subscribed to a topic in the Google Groups "gensim" group.
To unsubscribe from this topic, visit https://groups.google.com/d/topic/gensim/LrreKt_xFt4/unsubscribe.
To unsubscribe from this group and all its topics, send an email to gensim+un...@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Marco Ippolito

Sep 17, 2014, 2:45:31 AM
to gen...@googlegroups.com
Hi Radim and hi everybody,

this time I used the entire .gz file, previously downloaded from: https://code.google.com/p/word2vec/


python
Python 2.7.6 (default, Mar 22 2014, 22:59:56)
[GCC 4.8.2] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import gensim, logging

>>> logging.basicConfig(
...  format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)
>>> from gensim.models import Word2Vec
>>> model = Word2Vec.load_word2vec_format('/home/ubuntu/ggc/prove/G.bin.gz')
2014-09-17 06:37:07,390 : INFO : loading projection weights from /home/ubuntu/ggc/prove/G.bin.gz

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python2.7/dist-packages/gensim-0.9.0-py2.7.egg/gensim/models/word2vec.py", line 450, in load_word2vec_format
    raise ValueError("invalid vector on line %s (is this really the text format?)" % (line_no))
ValueError: invalid vector on line 0 (is this really the text format?)

What do I have to do to solve the problem and load the entire model correctly?

Looking forward to your helpful hints.
Kind regards.
Marco

Marco Ippolito

Sep 17, 2014, 3:13:56 AM
to gen...@googlegroups.com
Hi Radim and hi everybody,

A correction to my last post:

>>> model = Word2Vec.load_word2vec_format(
...  '/home/ubuntu/ggc/prove/DCNN/G.bin.gz', binary=True)
2014-09-17 07:09:47,256 : INFO : loading projection weights from /home/ubuntu/ggc/prove/DCNN/G.bin.gz
Killed


What do I have to do to solve the problem and load the entire model correctly?

Looking forward to your helpful hints.
Kind regards.
Marco
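
A bare "Killed" with no Python traceback usually means the operating system terminated the process, most often the Linux out-of-memory killer. The rough estimate below shows how much RAM the vectors alone need. It assumes the Google News file holds 3,000,000 vectors of 300 float32 dimensions (the figures given on the word2vec project page) and ignores the vocabulary dictionary and Python object overhead, so real usage is higher. estimate_matrix_bytes is a hypothetical helper, not a gensim function.

```python
def estimate_matrix_bytes(vocab_size, vector_size, bytes_per_value=4):
    """Bytes needed for a dense vocab_size x vector_size float32 matrix."""
    return vocab_size * vector_size * bytes_per_value


# Assumed figures for the Google News vectors: 3,000,000 words, 300 dimensions.
size = estimate_matrix_bytes(3000000, 300)
print("%d bytes (about %.1f GB)" % (size, size / 1e9))
```

That comes to roughly 3.6 GB for the raw matrix, so the instance needs comfortably more free memory than that before the load can finish.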

