Accuracy test of Word2Vec trained by gensim

Jianbin Lin

Dec 13, 2015, 9:22:57 PM

to gensim

I'm currently using gensim to reproduce the results of the example that Google provides.

The problem is that the accuracy test in gensim doesn't match Google's result.

For example, Google's accuracy on capital-common-countries is 82.02%, while the best result I get from gensim across different parameter sets is 64.4%. That is a big gap.

Here is the code snippet I use to train word2vec and run the accuracy test with gensim:

from gensim.models import word2vec

sentences = word2vec.Text8Corpus('./text8')
model = word2vec.Word2Vec(sentences, size=200, workers=12, min_count=5, sg=0, window=8, iter=15, sample=1e-4, negative=25)
model.accuracy("./questions-words.txt")

Code snippet for Google's demo, run without changing any parameters:

 ./demo-word-accuracy.sh

Accuracy comparison detail

Could anyone help with this?

(I've also posted this question on Stack Overflow.)

Gordon Mohr

Dec 14, 2015, 2:09:15 AM

to gensim

Try `alpha=0.05`. The word2vec.c code automatically shifts to that starting default when in CBOW mode, but gensim doesn't. To reduce sources of confusion, gensim may more closely match the word2vec.c defaults in the future: https://github.com/piskvorky/gensim/issues/534

- Gordon
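The default-selection behaviour Gordon describes can be sketched in a few lines of Python. This is only an illustration, not code from gensim or word2vec.c: the helper name `word2vec_c_style_alpha` and the two constants are hypothetical, and the values (0.025 in general, 0.05 for CBOW when no learning rate is given explicitly) are taken from the word2vec.c defaults as described above.

```python
# Starting learning rates as described for word2vec.c (assumption based on
# its documented defaults): 0.025 normally, 0.05 when CBOW is selected and
# no explicit alpha is supplied.
WORD2VEC_C_DEFAULT_ALPHA = 0.025
WORD2VEC_C_CBOW_ALPHA = 0.05


def word2vec_c_style_alpha(sg, alpha=None):
    """Hypothetical helper: choose a starting alpha the way word2vec.c does.

    sg    -- 1 for skip-gram, 0 for CBOW (same convention as gensim's `sg`)
    alpha -- an explicit learning rate; if given, it always wins
    """
    if alpha is not None:
        return alpha
    return WORD2VEC_C_DEFAULT_ALPHA if sg else WORD2VEC_C_CBOW_ALPHA


print(word2vec_c_style_alpha(sg=0))  # CBOW, no explicit alpha
print(word2vec_c_style_alpha(sg=1))  # skip-gram, no explicit alpha
```

The returned value could then be passed as the `alpha` argument when constructing the gensim model, so that a CBOW run starts from the same learning rate as the C tool.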

Jianbin Lin

Dec 14, 2015, 4:48:18 AM

to gensim

Thanks very much, Gordon. I looked through your post and solved the problem by setting cbow_mean=1 and alpha=0.05.

details:

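from gensim.models.word2vec import Text8Corpus, Word2Vec

# Raw strings keep "\t" in the Windows path from being read as a tab character
sentences = Text8Corpus(r".\text8")
model = Word2Vec(sentences, size=200, sg=0, window=8, alpha=0.05, min_count=5, workers=12, iter=15, cbow_mean=1, hs=0, negative=25)
model.accuracy(r".\questions-words.txt")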
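For comparing per-section numbers such as the capital-common-countries figure against the C tool's output, a small helper can turn the accuracy result into percentages. This is a sketch under an assumption: that `model.accuracy` returns a list of dicts with the keys `section`, `correct` and `incorrect`, as it did in gensim versions of this era. The helper name `summarize_accuracy` and the sample data are made up for illustration and are not real results.

```python
def summarize_accuracy(sections):
    """Hypothetical helper: map each section name to its percent correct.

    `sections` is assumed to be a list of dicts shaped like
    {"section": str, "correct": [...], "incorrect": [...]}.
    Sections with no evaluated questions are skipped.
    """
    summary = {}
    for section in sections:
        correct = len(section["correct"])
        total = correct + len(section["incorrect"])
        if total:
            summary[section["section"]] = 100.0 * correct / total
    return summary


# Made-up placeholder data, only to show the expected shape of the input.
sample = [
    {
        "section": "capital-common-countries",
        "correct": [("a", "b", "c", "d")] * 3,
        "incorrect": [("a", "b", "c", "e")],
    },
    {"section": "empty-section", "correct": [], "incorrect": []},
]

for name, percent in summarize_accuracy(sample).items():
    print("%s: %.2f%%" % (name, percent))
```

With a trained model, the same helper could be called as `summarize_accuracy(model.accuracy(...))`, provided the return value has the shape assumed here.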

Another question: it took me 590 seconds to train this model, while Google's version takes only 385 seconds.

I checked that the value of gensim.models.word2vec.FAST_VERSION is 1.

I'm confused about why gensim is slower than Google's version. Do I need to set some additional parameters to use the optimized code?

On Monday, December 14, 2015 at 3:09:15 PM UTC+8, Gordon Mohr wrote:
