I'm working on a research project that needs word similarity features. Initially I used Gensim in Python for Word2Vec, with pretrained vectors from this page: https://github.com/3Top/word2vec-api
I got the Google News model, and once I started working with it I noticed that a lot of words were missing. For example, the word 'Abortion' was missing. I'm pretty sure there are many instances of 'abortion' somewhere in Google News articles, so I don't understand why it isn't in the model's dictionary.
So I thought I should try the Wikipedia vectors instead. I still found a lot of very common words missing.
Then I decided to try GloVe and downloaded the pretrained Wikipedia 300d vectors from the Stanford webpage: http://nlp.stanford.edu/projects/glove/
I'm still getting the same issue. For example, the word 'school' was missing. Can you imagine that? The word 'school' is missing from vectors trained on a Wikipedia 2014 dump. How is that possible?
Am I doing something wrong? Before I look up a word in the model's dictionary (in either Word2Vec or GloVe), I remove stop words and punctuation and lowercase all the words.
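To make the lookup step concrete, here is a minimal sketch of the kind of check described above. It is not my exact code: it assumes the GloVe text format (one word per line followed by its vector components, separated by spaces), the function names load_glove_vocab and report_missing are hypothetical, and the sample vectors are made up for illustration. It also reports whether a differently cased form of a missing token is present, since lowercasing before lookup could matter if a model's vocabulary is case-sensitive.

```python
import io


def load_glove_vocab(lines):
    """Collect the vocabulary from GloVe-style text lines: 'word v1 v2 ... vN'."""
    vocab = set()
    for line in lines:
        parts = line.rstrip("\n").split(" ")
        # Skip blank or malformed lines that carry no vector components.
        if len(parts) > 1:
            vocab.add(parts[0])
    return vocab


def report_missing(tokens, vocab):
    """Map each token absent from vocab to the case variants that ARE present."""
    missing = {}
    for tok in tokens:
        if tok in vocab:
            continue
        # Try a few case variants, dropping duplicates while keeping order.
        candidates = dict.fromkeys([tok.lower(), tok.capitalize(), tok.upper()])
        missing[tok] = [v for v in candidates if v != tok and v in vocab]
    return missing


# Made-up sample data standing in for a real vector file (assumption).
# With a real file, pass open(path, encoding="utf-8") instead.
sample = io.StringIO("school 0.1 0.2 0.3\nAbortion 0.4 0.5 0.6\n")
vocab = load_glove_vocab(sample)
result = report_missing(["school", "abortion", "zzzz"], vocab)
print(result)
```

An empty list for a token means that none of the case variants tried were found either, which would point at something other than casing, such as how the file was loaded or how the tokens were cleaned.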
Any advice?