Accuracy test of Word2Vec trained by gensim

Jianbin Lin

Dec 13, 2015, 9:22:57 PM
to gensim

I'm currently using gensim to reproduce the results of the example that Google provides here.

The problem is that gensim's accuracy test doesn't match Google's result.

For example, Google's accuracy on capital-common-countries is 82.02%, while the best result I get from gensim across different parameter sets is 64.4%. That is a big gap.

Here is the code I use to train word2vec and test its accuracy with gensim:

from gensim.models import word2vec

sentences = word2vec.Text8Corpus('./text8')
model = word2vec.Word2Vec(sentences, size=200, workers=12, min_count=5, sg=0, window=8, iter=15, sample=1e-4, negative=25)
model.accuracy("./questions-words.txt")
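
For readers unfamiliar with the per-section figures being compared, here is a minimal sketch of how such a tally can be computed. It assumes the questions-words.txt layout, where a line starting with ":" names a section and every other line holds four words. The `section_accuracy` helper and the `predict` callable are hypothetical names for illustration only; this is not gensim's or Google's implementation.

```python
def section_accuracy(lines, predict):
    """Return {section name: percent correct} for analogy questions.

    `lines` follows the questions-words.txt layout: ': name' opens a
    section, every other non-empty line is 'a b c expected'.
    `predict(a, b, c)` is a hypothetical callable returning the guessed
    fourth word.
    """
    totals = {}          # section name -> [correct, total]
    section = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith(":"):
            section = line[1:].strip()
            totals[section] = [0, 0]
            continue
        a, b, c, expected = line.split()
        counts = totals.setdefault(section, [0, 0])
        counts[1] += 1
        if predict(a, b, c) == expected:
            counts[0] += 1
    return {
        name: (100.0 * correct / total if total else 0.0)
        for name, (correct, total) in totals.items()
    }


# Toy data and a toy predictor, made up for this sketch.
sample = [
    ": capital-common-countries",
    "Athens Greece Baghdad Iraq",
    "Athens Greece Bangkok Thailand",
]
lookup = {"Baghdad": "Iraq", "Bangkok": "Laos"}
result = section_accuracy(sample, lambda a, b, c: lookup.get(c))
print(result)  # {'capital-common-countries': 50.0}
```

With a real model, `predict` would wrap whatever nearest-neighbour lookup the model offers; the tally itself stays the same.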

Google's demo, run without changing any parameters:

 ./demo-word-accuracy.sh

Accuracy comparison detail


Could anyone help with this?

(I've also posted this question on Stack Overflow.)

Gordon Mohr

Dec 14, 2015, 2:09:15 AM
to gensim
Try `alpha=0.05`. The word2vec.c code automatically shifts to that starting default when in CBOW mode, but gensim doesn't. To reduce sources of confusion, gensim may match the word2vec.c defaults more closely in the future: https://github.com/piskvorky/gensim/issues/534
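
A minimal sketch of that default-switching rule, in case it helps when setting parameters by hand. It assumes word2vec.c starts at 0.025 for skip-gram and 0.05 for CBOW when no learning rate is given; the helper names `word2vec_c_start_alpha` and `gensim_kwargs` are hypothetical, not part of gensim.

```python
def word2vec_c_start_alpha(cbow):
    """Starting learning rate assumed for word2vec.c when none is passed."""
    return 0.05 if cbow else 0.025


def gensim_kwargs(sg, **overrides):
    """Build Word2Vec keyword arguments with alpha matched to the C tool.

    `sg=0` means CBOW and `sg=1` means skip-gram, as in gensim.
    Anything in `overrides` (including an explicit alpha) wins.
    """
    params = {"sg": sg, "alpha": word2vec_c_start_alpha(cbow=(sg == 0))}
    params.update(overrides)
    return params


params = gensim_kwargs(sg=0, window=8, min_count=5)
print(params["alpha"])  # 0.05
```

The resulting dict could then be passed along as `Word2Vec(sentences, **params)`.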

- Gordon

Jianbin Lin

Dec 14, 2015, 4:48:18 AM
to gensim
Thanks very much, Gordon. I looked through your post and solved the problem by setting cbow_mean=1 and alpha=0.05.
Details:
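from gensim.models.word2vec import Text8Corpus, Word2Vec

# Raw strings keep the backslash literal ("\t" would otherwise become a tab).
sentences = Text8Corpus(r".\text8")
model = Word2Vec(sentences, size=200, sg=0, window=8, alpha=0.05, min_count=5, workers=12, iter=15, cbow_mean=1, hs=0, negative=25)
model.accuracy(r".\questions-words.txt")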
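
As a rough illustration of what cbow_mean changes: with 0 the context word vectors are summed, with 1 they are averaged. The `combine_context` helper below is a hypothetical name and uses plain Python lists; it only sketches the idea and is not gensim's internal code.

```python
def combine_context(vectors, cbow_mean):
    """Combine context word vectors into a single CBOW input vector.

    Returns the element-wise sum when cbow_mean is 0 and the
    element-wise mean when cbow_mean is 1.
    """
    summed = [sum(components) for components in zip(*vectors)]
    if cbow_mean:
        return [value / len(vectors) for value in summed]
    return summed


context = [[1.0, 2.0], [3.0, 4.0]]
print(combine_context(context, cbow_mean=0))  # [4.0, 6.0]
print(combine_context(context, cbow_mean=1))  # [2.0, 3.0]
```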


Another question: it took me 590 seconds to train this model, while Google's version takes only 385 seconds.
I checked that gensim.models.word2vec.FAST_VERSION is 1.

Why is gensim slower than Google's version? Do I need to set some additional parameters to use the optimized code?
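
A minimal sketch of how the two timings and the FAST_VERSION check might be put side by side. The `timed` and `slowdown` helpers are hypothetical names; the gensim import is guarded so the snippet still runs where gensim is missing. As far as I know, a FAST_VERSION of -1 has meant the compiled routines were not loaded, so a value of 1 suggests the optimized path is already in use.

```python
import time
from contextlib import contextmanager


@contextmanager
def timed(label, results):
    """Store the wall-clock seconds spent inside the block in results[label]."""
    start = time.perf_counter()
    try:
        yield
    finally:
        results[label] = time.perf_counter() - start


def slowdown(measured_seconds, reference_seconds):
    """How many times longer the measured run took than the reference run."""
    return measured_seconds / reference_seconds


# The two figures quoted in this thread.
ratio = slowdown(590, 385)
print(f"gensim took {ratio:.2f}x as long")  # gensim took 1.53x as long

# Guarded check of the compiled-extension flag mentioned above.
try:
    from gensim.models import word2vec
    fast_version = getattr(word2vec, "FAST_VERSION", None)
except ImportError:
    fast_version = None  # gensim is not installed here
print("FAST_VERSION:", fast_version)

# Usage: wrap any training call to time it.
timings = {}
with timed("demo", timings):
    sum(range(1000))  # stand-in for Word2Vec(...) training
```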


On Monday, December 14, 2015 at 3:09:15 PM UTC+8, Gordon Mohr wrote: