Hi all,
I've been experimenting with both implementations on a small collection of Twitter data (~5K tweets). I get fairly meaningful results when searching for similar tweets (documents) with Mikolov's code, but gensim gives me pretty much garbage. Here are the commands and options I used to run each program:
Mikolov's doc2vec:
./word2vec -train <input_data> -output vectors.txt -cbow 0 -size 100 -window 10 -negative 5 -hs 0 -sample 1e-4 -threads 4 -binary 0 -iter 20 -min-count 1 -sentence-vectors 1
gensim's doc2vec:
import gensim.models as g

documents = g.doc2vec.TaggedLineDocument(<input_data>)
m = g.Doc2Vec(size=100, window=10, min_count=1, sample=1e-4, workers=4,
              hs=0, dm=1, negative=5, dm_concat=1)
m.build_vocab(documents)
for epoch in range(20):
    print "ITERATION =", epoch
    m.train(documents)
    m.alpha -= 0.002       # decrease the learning rate
    m.min_alpha = m.alpha  # fix the learning rate, no decay within the epoch
The only real difference I can see is in how alpha decays: Mikolov's code decays alpha linearly over the words processed, while my loop above steps it down once per epoch. I've tried cuter things like making gensim's alpha schedule match (by passing the total_words argument to train()), but the results were no different. Gensim does start to work a little better when I increase the number of epochs to 100, but that doesn't make sense - why should gensim's doc2vec need more iterations to reach a result similar to Mikolov's?
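To make the comparison concrete, here is a small standalone sketch (not gensim code) of the two alpha schedules I mean; the 0.025 starting alpha and 0.0001 floor are word2vec's defaults and are assumed here:

```python
# Compare the two learning-rate schedules at the start of each epoch.
# Assumed defaults: starting alpha 0.025, floor (min alpha) 0.0001.

EPOCHS = 20
ALPHA0 = 0.025
MIN_ALPHA = 0.0001

def linear_schedule(epoch):
    """Mikolov-style: alpha shrinks linearly over all training words,
    so by the start of each epoch it has dropped an equal fraction."""
    frac = epoch / float(EPOCHS)
    return max(MIN_ALPHA, ALPHA0 * (1.0 - frac))

def step_schedule(epoch):
    """My loop above: subtract a fixed 0.002 once per epoch."""
    return max(MIN_ALPHA, ALPHA0 - 0.002 * epoch)

for epoch in range(EPOCHS):
    print("epoch %2d  linear %.5f  step %.5f"
          % (epoch, linear_schedule(epoch), step_schedule(epoch)))
```

(With these assumed defaults the stepped schedule actually bottoms out at the floor around epoch 13, so the two runs are not seeing the same learning rates late in training.)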
Any ideas what could be wrong?
Cheers,
JH