So when I trained a word2vec model with the default parameters (namely the
skip-gram model), the results were coherent with what is reported (in the blog and in papers).
When I loaded the pre-trained “vectors.bin” model from Tomas's C version of
word2vec into gensim, everything seemed fine as well
(note that the C version's default model is CBOW).
Then I tried to train the gensim Word2Vec with the default parameters used
in the C version (which are: size=200, workers=8, window=8, hs=0,
sample=1e-4, sg=0 (i.e. CBOW), negative=25 and iter=15), and I got a
strange “squeezed” or shrunken vector representation where most of the
computed “most_similar” words share a value of roughly 0.97!! (And for
the classic “king”, “man”, “woman” query, the most similar word is “and”
at 0.98, and “queen” is not even in the top 10…) Everything
was trained on the SAME text8 dataset.
So I wondered if you have seen such “wrong” training before, with these
atypical characteristics (all words pointing in roughly one direction in the
vector space), and if you know where the issue might be.
I am trying different parameter settings to hopefully figure out what is wrong (workers>1? iter?).
model = word2vec.Word2Vec(sentences, size=200, window=8, sg=0, hs=0,
                          negative=25, sample=1e-4, iter=15,
                          min_count=1, workers=8)
./word2vec -train text8 -output vectors.bin -cbow 1 -size 200 -window 8 -negative 25 -hs 0 -sample 1e-4 -threads 4 -binary 1 -iter 15
training on 255078105 raw words took 260.5s, 558320 trained words/s
Most similar words from the gensim-trained model:
model.most_similar(positive=['woman', 'king'], negative=['man'], topn=10)
INFO:gensim.models.word2vec:precomputing L2-norms of word weight vectors
Out[153]:
[(u'nine', 0.9694292545318604),
(u'v', 0.9688974022865295),
(u'it', 0.9687643051147461),
(u'zero', 0.9683082699775696),
(u'five', 0.9682567119598389),
(u'and', 0.9681676030158997),
(u'p', 0.9680780172348022),
(u'm', 0.9679656028747559),
(u'eight', 0.9679427146911621),
(u'them', 0.9679186344146729)]
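To quantify this “squeezed” symptom, here is a small self-contained NumPy sketch (synthetic data only, not gensim itself; the function name mean_offdiag_cosine is mine). It computes the average cosine similarity over all distinct pairs of vectors: for a healthy embedding it should be close to 0, while vectors that all share one dominant direction give a value near 1, exactly like the ~0.97 scores above.

```python
import numpy as np

def mean_offdiag_cosine(vectors):
    """Average cosine similarity over all distinct pairs of row vectors.

    A value near 1.0 means all vectors point in roughly the same
    direction (the "squeezed" symptom); a healthy embedding space
    gives a value much closer to 0.
    """
    unit = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    sims = unit @ unit.T
    n = len(vectors)
    # exclude the diagonal (self-similarity, always 1.0)
    return (sims.sum() - n) / (n * (n - 1))

rng = np.random.default_rng(0)

# healthy case: independent random directions -> mean cosine close to 0
healthy = rng.normal(size=(500, 200))

# squeezed case: every vector = one shared direction + small noise
shared = rng.normal(size=200)
squeezed = shared + 0.1 * rng.normal(size=(500, 200))

print(mean_offdiag_cosine(healthy))   # close to 0
print(mean_offdiag_cosine(squeezed))  # close to 1
```

On a real model you could run the same check on the word-vector matrix itself (model.syn0 in the gensim of that era, if I remember the attribute name correctly) to see how far from isotropic the trained space is.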
For comparison, the same query against the “vectors.bin” model trained by the C tool and loaded in gensim gives sensible results:
model.most_similar(positive=['woman', 'king'], negative=['man'], topn=10)
Out[166]:
[(u'queen', 0.5550357103347778),
(u'betrothed', 0.4963855743408203),
(u'urraca', 0.4869607090950012),
(u'marries', 0.48425954580307007),
(u'vii', 0.4788791239261627),
(u'isabella', 0.4788578748703003),
(u'throne', 0.4734063744544983),
(u'daughter', 0.4699792265892029),
(u'abdicates', 0.46685048937797546),
(u'infanta', 0.46183738112449646)]