Ever used Gensim with GloVe (in conjunction with Word2Vec)?


Marco Ippolito

unread,
Oct 24, 2014, 6:17:38 AM10/24/14
to gen...@googlegroups.com
Hi Radim and hi all,

I read with interest the paper and the results obtained with GloVe:
http://nlp.stanford.edu/projects/glove/
a new word-vector representation which, according to the Stanford researchers' results, outperforms Word2Vec and the GoogleNews vectors.

Have you ever tried to import it and use it with/in Gensim?

I downloaded the "Common Crawl" pre-trained word vectors and tried to import them as a gensim model, but failed. I guess it's because the file is a txt file, in quite a different format from the bin file of the GoogleNews vectors.

Looking forward to your thoughts about it.
Kind regards.
Marco

Radim Řehůřek

unread,
Oct 24, 2014, 2:44:34 PM10/24/14
to gen...@googlegroups.com
Hello Marco & all,

I saw that glove paper, but haven't had time to try/implement it yet. Looks simple enough though IIRC, simpler than word2vec.

If you want to try, add it as another model in gensim and let us know :)

Best,
Radim

Pedro Cardoso

unread,
Oct 25, 2014, 6:49:00 AM10/25/14
to gen...@googlegroups.com
Hi

There are some simple implementations of GloVe in Python. You can take a look at this one:
https://github.com/maciejkula/glove-python

Pedro

Mostapha Benhenda

unread,
Dec 4, 2014, 3:26:23 PM12/4/14
to gen...@googlegroups.com
I am having the same problem when trying to use gensim's load_word2vec_format function. However, it is possible to load the vectors and do some manipulations with the load_word2vec function of https://github.com/dhammack/Word2VecExample/blob/master/main.py


But I would really want to manipulate these pre-trained vectors with gensim.

With gensim, I get the error message:

vocab_size, layer1_size = map(int, header.split())  # throws for invalid file format
ValueError: invalid literal for int() with base 10: 'the'

Any help?



Travis Brady

unread,
Dec 5, 2014, 9:50:07 AM12/5/14
to gen...@googlegroups.com
Gensim understands the word2vec text format, but the GloVe vectors you're trying to load are slightly different in that they lack word2vec's header line (which contains the vocab size and vector dimensions, e.g. "68959520 100\n").

You could manually add the header and then gensim should work fine.

Mostapha Benhenda

unread,
Dec 16, 2014, 5:08:34 PM12/16/14
to gen...@googlegroups.com
How can you do this in practice?

Do you have references/keywords to recommend? (I am a beginner.)

Manas Ranjan Kar

unread,
Dec 2, 2015, 4:18:48 AM12/2/15
to gensim
Might be helpful to anyone following the topic. This link has the solution: encoding the file and adding the lines needed to make it Gensim-compatible.

Manas Ranjan Kar

unread,
Dec 7, 2015, 5:42:17 AM12/7/15
to gensim
My Python code for converting GloVe vectors into word2vec format, for easy usage in gensim.


Regards,
Manas

bope...@gmail.com

unread,
Jul 26, 2016, 11:33:03 AM7/26/16
to gensim


Hi guys,

Is it possible to use Doc2vec with glove pre-trained word vectors? I'm trying to build a semantic search and I would love to have the semantic relationships of the glove word vectors as my foundation, and then use Doc2vec to map all the document vectors into vector space. This way, when I do a query search, it will give back similar vector documents with a strong semantic foundation for their word vectors. Will this work, or am I really off?

Thanks much for the help!

Gordon Mohr

unread,
Jul 27, 2016, 12:53:28 PM7/27/16
to gensim
Note that Doc2Vec doesn't need word-vectors as an input – it will create any that are needed during model/doc-vector training. (And, pure PV-DBOW doesn't use/train word-vectors at all.) And word-vectors from your domain's data might be better than generic vectors – more representative of local word-senses & frequencies. 

That said, there's an experimental method in class Word2Vec (inherited by Doc2Vec) called `intersect_word2vec_format()`. It will scan a word-vector file in the format as output by the Google word2vec.c tool, and for any word that is *already* in the model's known vocabulary, replace the model's word-vector weights with those from the file, *and* lock those weights against further changes. The idea is that after establishing your model's vocabulary (by `build_vocab()`), you might do this to bring in known frozen vectors – then proceed with training that only adjusts the non-imported words. There's no real evidence about whether or when it might help. You can search the forum archives for the method name for more discussion. 
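The replace-and-lock idea behind that method can be illustrated with a toy sketch in plain Python (a conceptual illustration only, not gensim's implementation; all names here are mine):

```python
# Toy illustration of the "intersect" idea: for words already in the model's
# vocabulary, overwrite their vectors with pre-trained ones and mark them as
# locked so later training would skip updating them; other words are untouched.
def intersect_vectors(model_vocab, model_vectors, pretrained):
    """model_vocab: list of words; model_vectors: dict word -> list[float];
    pretrained: dict word -> list[float]. Returns the set of locked words."""
    locked = set()
    for word in model_vocab:
        if word in pretrained:
            model_vectors[word] = list(pretrained[word])
            locked.add(word)  # frozen: training updates would be skipped
    return locked
```

Note how pre-trained words *outside* the model's vocabulary are simply ignored; only the intersection is imported, which is why the vocabulary must be established (via build_vocab) before the intersect step.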

Especially in the Doc2Vec case, using a 'space' initially created from word-vectors alone might overly restrain the expressiveness of the doc-vectors, whereas joint training would have earned ranges-of-values that give the doc-vectors more 'room' in the space.

Also, the word2vec.c-format doesn't include the 'output' weights, so (whether doing a full `load_word2vec_format()` or the intersect mentioned above), the resulting model isn't fully conditioned for more compatible training/inference. (The `syn1neg` or `syn1` layer is still all zeros.) Only after more bulk training (on relevant text examples) would the model re-learn the predictiveness that gave rise to the imported vectors – and thus perhaps become useful for training new compatible doc-vecs. 

As should be clear from the above, you'd be in experimental territory with such techniques. You'd want to examine the source code and internal model state closely, and probably directly adjust it at times, to understand which improvised mash-up steps are helping your end goals and which aren't. 

- Gordon