reproducing nearest neighbour sample


privatl...@gmail.com

Jan 28, 2019, 4:37:36 PM1/28/19
to GloVe: Global Vectors for Word Representation
I've been playing around with glove and decided to look at some nearest neighbour relationships.

For the glove.6B model, trained on Wikipedia and Gigaword (this one), for some sample words, here are the 10 most similar ones:
  • horse -  'horses', 'thoroughbred', 'dog', 'riding', 'breeders', 'racing', 'rode', 'derby', 'jockey', 'ride'
  • frog -  'toad', 'frogs', 'snake', 'monkey', 'toads', 'squirrel', 'species', 'rodent', 'parrot', 'spider'
  • lion - 'elephant', 'dragon', 'leopard', 'bear', 'lions', 'beast', 'golden', 'wolf', 'tiger', 'monkey'
These neighbours look more like a shared topic than genuinely similar words.
I have to admit that I picked some bad examples, for some words it does much better:
  • house - 'houses', 'senate', 'congressional', 'congress', 'republicans', 'building', 'white', 'mansion', 'capitol', 'office'
On the GloVe website, I found the amazing frog example, where the closest neighbours to 'frog' are 'frogs', 'toad', 'litoria', 'leptodactylidae', 'rana', 'lizard' and 'eleutherodactylus'.
Incredible that the model managed to pick up these similarities. But how can I actually reproduce this? Is it available as one of the pretrained models?

I tried the Wikipedia + Gigaword model (the first download) and the Common Crawl one with the 2.2M-word vocabulary, but neither gives the nice frog neighbours.

For reference, here is the code that I'm using to look at nearest neighbours:

from gensim.scripts.glove2word2vec import glove2word2vec
from gensim.models.keyedvectors import KeyedVectors

# convert the raw GloVe text file to word2vec format (adds the header line)
glove2word2vec(glove_input_file=f'{glove_path}/glove.{identifier}.txt',
               word2vec_output_file="gensim_glove_vectors.txt")

# load the converted vectors and query the nearest neighbours
glove_model = KeyedVectors.load_word2vec_format("gensim_glove_vectors.txt",
                                                binary=False)
glove_model.most_similar(positive=['frog'])
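For anyone wondering what most_similar actually computes: it unit-normalizes every vector and ranks words by cosine similarity to the query. A minimal sketch with made-up toy 4-d vectors (real GloVe vectors have 50 to 300 dimensions):

```python
import numpy as np

# toy vectors, invented purely for illustration
vectors = {
    "frog":  np.array([0.9, 0.1, 0.3, 0.0]),
    "toad":  np.array([0.8, 0.2, 0.4, 0.1]),
    "house": np.array([0.0, 0.9, 0.1, 0.7]),
}

def most_similar(query, vectors, topn=2):
    # normalize to unit length so a dot product equals cosine similarity
    unit = {w: v / np.linalg.norm(v) for w, v in vectors.items()}
    q = unit[query]
    scores = [(w, float(q @ v)) for w, v in unit.items() if w != query]
    return sorted(scores, key=lambda x: x[1], reverse=True)[:topn]

print(most_similar("frog", vectors))  # 'toad' ranks first, cosine ~0.98
```

So which word comes out "nearest" depends entirely on the geometry of the trained vectors, which is why different downloads give different neighbour lists.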


Dean Sturgess

Jan 31, 2019, 10:47:18 AM1/31/19
to GloVe: Global Vectors for Word Representation

I'm using the same dataset, the 400k-word, 50-dimension one, and I get the following results:
horse: horses, dog, bull, riding, cat, pack, rides, camel, rode, breeders
frog: snake, ape, toad, monkey, spider, lizard, tarantula, cat, spiny, fern
lion: dragon, beast, unicorn, elephant, cat, bear, golden, peacock, rabbit, mermaid

I suspect the example on the site was made with one of the larger datasets: more words mean more chance of an obscure froggy reference, and higher-dimensional vectors would mean a greater "resolution" for the search.
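To check one of the larger downloads directly, the raw GloVe text format can be parsed without the word2vec conversion step: each line is just a word followed by its vector components, with no header. A minimal sketch, using a tiny mock file in place of a real download such as glove.840B.300d.txt:

```python
import numpy as np

# a tiny stand-in for a real GloVe file: "word val val val" per line, no header
mock = "frog 0.9 0.1 0.3\ntoad 0.8 0.2 0.4\nhouse 0.0 0.9 0.1\n"
with open("mock_glove.txt", "w", encoding="utf-8") as f:
    f.write(mock)

def load_glove(path):
    # parse each line into a word and its float vector
    vectors = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            word, *vals = line.rstrip().split(" ")
            vectors[word] = np.array(vals, dtype=np.float32)
    return vectors

vecs = load_glove("mock_glove.txt")
print(sorted(vecs))        # ['frog', 'house', 'toad']
print(vecs["frog"].shape)  # (3,)
```

Pointed at a real file, this makes it easy to compare 'frog' neighbours across the different pretrained downloads without any conversion.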

Pablo Pinedo López

Feb 7, 2024, 1:20:52 PM2/7/24
to GloVe: Global Vectors for Word Representation
Hi everybody,
I'm also interested in reproducing the frog example (on Stanford's website), but I never obtain their results :(

@Dean, why are my results for glove.6B.50d (the same dataset you use in your post) different from yours?
frog6B50d.png
It doesn't make sense: orchid similar to frog??? ape similar to frog??? I don't understand...

I've also tried the glove.840B.300d dataset, but the results are completely different from Stanford's frog sample:
frogs.840B.300d.png

I don't know how the Stanford team obtained frog's neighbours. I've tried every dataset on Stanford's web page, and the results are always different from theirs!
Can anyone explain how I can obtain their results for 'frog'?

Thanks in advance,
