Word2Vec and Glove missing words issue


ahmedt...@gmail.com

Oct 6, 2016, 9:05:34 PM
to GloVe: Global Vectors for Word Representation

I'm working on a research project that needs word similarity features. Initially I used Gensim (Python) for Word2Vec, with pretrained vectors from this page: https://github.com/3Top/word2vec-api

I took the Google News model, and once I started working with it I noticed that a lot of words were missing. For example, the word 'Abortion' was missing. I'm pretty sure 'abortion' appears many times in Google News articles, so I don't understand why it isn't in the model's vocabulary.

So I thought I should try the Wikipedia vectors instead. I still get a lot of very common words missing.

Then I decided to try GloVe and downloaded the pretrained Wikipedia 300d vectors from the Stanford page: http://nlp.stanford.edu/projects/glove/

I still get the same issue. For example, the word 'school' was missing. Can you imagine that? The word 'school' is missing from vectors trained on a Wikipedia 2014 dump. How is that possible?

Am I doing something wrong? Before I look a word up in the model's vocabulary (in either Word2Vec or GloVe), I remove stop words and punctuation and lowercase all the words.

Any advice?
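
One way to narrow this down is to list exactly which tokens fail the lookup, printed with repr() so stray whitespace or unexpected casing becomes visible. Below is a minimal sketch of the preprocessing described above, assuming the model's vocabulary is available as a Python set. The helper names `preprocess` and `report_missing`, the tiny stop-word list, and the sample vocabulary are all hypothetical stand-ins, not part of Gensim or GloVe.

```python
import string

# Illustrative stop-word list (assumption); a real project would use a fuller one.
STOP_WORDS = {"the", "a", "an", "of", "in", "is", "to", "and"}

def preprocess(text):
    """Lowercase, strip ASCII punctuation, and drop stop words."""
    table = str.maketrans("", "", string.punctuation)
    tokens = text.lower().translate(table).split()
    return [t for t in tokens if t not in STOP_WORDS]

def report_missing(tokens, vocab):
    """Return tokens absent from vocab, in first-seen order, without duplicates."""
    seen = set()
    missing = []
    for t in tokens:
        if t not in vocab and t not in seen:
            seen.add(t)
            missing.append(t)
    return missing

# Stand-in for the model's vocabulary (assumption).
vocab = {"school", "abortion", "news"}
tokens = preprocess("The School is in the news, and so is Abortion.")
for t in report_missing(tokens, vocab):
    # repr() exposes hidden characters such as trailing newlines or tabs.
    print(repr(t))
```

If the model's vocabulary turns out to be cased, lowercasing the query can itself cause misses, so it may be worth also trying each token in its original casing before counting it as missing.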

gnni...@gmail.com

Oct 15, 2016, 4:19:42 PM
to GloVe: Global Vectors for Word Representation, ahmedt...@gmail.com
The word 'school' exists in the GloVe word vectors dump you describe. It sounds like there is something wrong with your code.
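
A quick way to check this independently of the rest of the pipeline is to read the vocabulary straight from the vectors file. The pretrained GloVe downloads are plain text, with one word per line followed by its space-separated vector components. The sketch below assumes that layout; `load_glove_vocab` is a hypothetical helper, and the file name in the trailing comment is an assumption about the download.

```python
import io

def load_glove_vocab(lines):
    """Collect the first space-separated field of each non-empty line."""
    vocab = set()
    for line in lines:
        word = line.rstrip("\n").split(" ", 1)[0]
        if word:
            vocab.add(word)
    return vocab

# Tiny in-memory stand-in for the real file, so the sketch runs anywhere.
sample = io.StringIO("school 0.1 0.2 0.3\nthe 0.4 0.5 0.6\n")
vocab = load_glove_vocab(sample)
print("school" in vocab)

# With the real download (path is an assumption):
# with open("glove.6B.300d.txt", encoding="utf-8") as f:
#     vocab = load_glove_vocab(f)
```

If 'school' shows up here but not in the main code, the problem is likely in how the file is parsed or how the query tokens are prepared, rather than in the vectors themselves.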

PR_IYYER

Nov 9, 2016, 12:10:15 PM
to GloVe: Global Vectors for Word Representation
Hi,

I was trying to obtain word vectors with the GloVe code, but unfortunately I also found that a large number of words are missing from the vocabulary. Could someone please help me?

Praveena Ramanan

Nov 10, 2016, 12:57:41 AM
to GloVe: Global Vectors for Word Representation
Hi,

Sorry, and thanks :) I got it working myself. It was an error in the corpus I was reading.

Regards,
Praveena.R