Comparing Mikolov's doc2vec to gensim's doc2vec

JH Lau

Dec 14, 2015, 7:42:54 PM

to gensim

Hi all,

Has anyone done any formal comparison between Mikolov's unofficial doc2vec (https://groups.google.com/forum/#!searchin/word2vec-toolkit/doc2vec/word2vec-toolkit/Q49FIrNOQRo/-E2339F_GRoJ) and gensim's doc2vec?

I was playing with both implementations on a small collection of Twitter data (~5K tweets). I was able to get rather meaningful results when searching for similar tweets (or documents) with Mikolov's code, but gensim is giving me pretty much garbage. Here are the commands and options I used for running both programs:

Mikolov's doc2vec:

./word2vec -train <input_data> -output vectors.txt -cbow 0 -size 100 -window 10 -negative 5 -hs 0 -sample 1e-4 -threads 4 -binary 0 -iter 20 -min-count 1 -sentence-vectors 1

gensim's doc2vec:

documents = g.doc2vec.TaggedLineDocument(<input_data>)
m = g.Doc2Vec(size=100, window=10, min_count=1, sample=1e-4, workers=4, hs=0, dm=1, negative=5, dm_concat=1)
m.build_vocab(documents)
for epoch in range(20):
    print "ITERATION =", epoch
    m.train(documents)
    m.alpha -= 0.002  # decrease the learning rate
    m.min_alpha = m.alpha  # fix the learning rate, no decay

The only difference I can really see is that the alpha decay is a little different between the two: Mikolov's alpha decays linearly over words, whereas gensim's in this case decays per epoch. I've tried doing cuter things like getting gensim's alpha to behave the same way (by passing the total_words argument to train()), but the results were no different. Gensim does start to work a little better when I increase the number of epochs to 100. But that doesn't make any sense - why should gensim's doc2vec take more iterations to get a similar result to Mikolov's?
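
A rough sketch of the two schedules in plain Python (no gensim calls), in case it helps to see the numbers. It assumes a starting alpha of 0.025, which is not stated in the post, and a floor of 0.0001 for the linear schedule; both function names are made up for illustration.

```python
def per_epoch_alphas(start_alpha, step, epochs):
    """Alpha used in each pass when `step` is subtracted after every pass."""
    return [start_alpha - step * epoch for epoch in range(epochs)]


def linear_alphas(start_alpha, min_alpha, epochs):
    """Alpha at the start of each pass under one linear decay over all passes."""
    return [start_alpha - (start_alpha - min_alpha) * epoch / epochs
            for epoch in range(epochs)]


START_ALPHA = 0.025  # assumption: the starting value is not given in the post
manual = per_epoch_alphas(START_ALPHA, 0.002, 20)
linear = linear_alphas(START_ALPHA, 0.0001, 20)

# First pass at which the manual schedule is no longer positive, if any.
first_bad_epoch = next((e for e, a in enumerate(manual) if a <= 0), None)
print("manual schedule non-positive from epoch index:", first_bad_epoch)
print("smallest alpha in linear schedule:", min(linear))
```

Under that assumed starting value, subtracting 0.002 per pass leaves the learning rate negative from the 14th pass (epoch index 13) onward, so it may be worth checking the actual starting alpha before comparing the two tools.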

Any ideas what could be wrong?

Cheers,

JH

Gordon Mohr

Dec 14, 2015, 8:06:03 PM

to gensim

The Mikolov example code doesn't implement `dm_concat=1` mode – that's a very different model with a lot more parameters... and so might require much more data/iterations or a different interpretation of results. Also, `-cbow 0` in Mikolov's code essentially means skip-gram-like, aka 'DBOW', which is `dm=0` in gensim. And if I recall correctly, that code works by treating a single (first) token as the doc-token, mixed in with all other word vectors... so word-vectors are always co-trained (which isn't a requirement of the paper or of the gensim implementation).

So some rough equivalencies from the sentence-vectors-patched word2vec, to gensim Doc2Vec, for reference:

word2vec … -cbow 0 … -sentence-vectors 1 <=> Doc2Vec(…, dm=0, dbow_words=1, …)

word2vec … -cbow 1 … -sentence-vectors 1 <=> Doc2Vec(…, dm=1, dm_mean=1, alpha=0.05, …)
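
As an illustration only, the two lines above can be written as a small translation helper. This is a sketch, not part of either tool: the function name is hypothetical, only the `-cbow` mapping comes from this thread, and the one-to-one renames simply follow the two commands quoted in the first post. The keyword names (`size`, `iter`, ...) are the ones used in this thread and may differ in other gensim releases.

```python
def doc2vec_kwargs_from_word2vec_flags(flags):
    """Translate flags of the sentence-vectors-patched word2vec into Doc2Vec kwargs.

    Hypothetical helper: `flags` maps flag names (without the leading dash)
    to values and must contain "cbow".
    """
    renames = {
        "size": "size", "window": "window", "negative": "negative",
        "hs": "hs", "sample": "sample", "threads": "workers",
        "min-count": "min_count", "iter": "iter",
    }
    kwargs = {renames[name]: value
              for name, value in flags.items() if name in renames}
    if flags["cbow"] == 0:
        # skip-gram-like training: DBOW, with word vectors co-trained
        kwargs.update(dm=0, dbow_words=1)
    else:
        # CBOW-like training: DM with averaging, using the alpha listed above
        kwargs.update(dm=1, dm_mean=1, alpha=0.05)
    return kwargs


flags = {"cbow": 0, "size": 100, "window": 10, "negative": 5, "hs": 0,
         "sample": 1e-4, "threads": 4, "iter": 20, "min-count": 1}
kwargs = doc2vec_kwargs_from_word2vec_flags(flags)
print(kwargs)
```

The resulting dictionary is meant to be passed as `Doc2Vec(**kwargs)`; the sketch itself does not import gensim.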

Another common gotcha that can trigger random-seeming results: make sure the `words` of each of your TaggedDocument examples are an already-tokenized list-of-strings. If they're strings themselves, the individual characters get interpreted as the words.
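
A minimal plain-Python illustration of that gotcha. The sample sentence, the whitespace tokenizer, and the `ensure_token_list` helper are all made up for the example; real tweets would likely need a better tokenizer.

```python
def ensure_token_list(words):
    """Raise early if `words` is a plain string rather than a list of tokens."""
    if isinstance(words, str):
        raise TypeError("words must be a list of token strings, not one string")
    return list(words)


raw = "gensim is giving me garbage"

as_characters = list(raw)   # what iterating over an untokenized string yields
as_tokens = raw.split()     # simplest possible tokenizer, for illustration only

print(as_characters[:6])
print(ensure_token_list(as_tokens))
```

Calling `ensure_token_list(raw)` raises a `TypeError`, which is one cheap way to catch the mistake before training.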

- Gordon

JH Lau

Dec 15, 2015, 12:48:22 AM

to gensim

Hi Gordon,

> The Mikolov example code doesn't implement `dm_concat=1` mode – that's a very different model with a lot more parameters

I see. But the PV-DM described in Le and Mikolov's paper does use concatenation (section 2.2, second paragraph). That is, it doesn't just sum or average the word and sentence vectors to predict the target word. Clarification on this would be good.

But many thanks for pointing out the rough equivalences between gensim and Mikolov's. I am running dm=0 now for 20 iterations and results are looking much more sensible.

Cheers,

JH

Gordon Mohr

Dec 15, 2015, 3:29:41 AM

to gensim

Yes, the 'Paragraph Vectors' paper mentions "averaged or concatenated" as options, but the Mikolov example patch to word2vec.c only implements averaging.

In my attempts to reproduce the error rates claimed by the paper, I implemented the concatenation mode in gensim – I believe gensim is the only publicly-available implementation. But, it wasn't enough: as far as I know, no one who has tried has yet matched the paper's error rate, and you can find a lot of frustrated posts in different forums from people who've tried. Something is missing or mistaken in the paper's reporting.

The concatenation mode results in a much larger (more memory-consumptive) model. And, it's slower to train for each example and slower to improve overall on downstream evaluation. It's possible that on some larger dataset, with sufficient training patience, it will achieve better results... but in my limited tests I haven't yet found such a situation.
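
A back-of-the-envelope sketch of why the concatenated model is larger. It assumes the prediction layer's input is one doc vector plus `2 * window` context word vectors laid end to end, versus a single averaged vector otherwise; the function names and the vocabulary size of 10,000 are hypothetical, and actual memory use depends on the implementation.

```python
def input_layer_width(size, window, concat, doc_tags=1):
    """Width of the vector fed to the output layer for one training example."""
    if concat:
        # assumption: doc vector(s) and 2*window word vectors are concatenated
        return (doc_tags + 2 * window) * size
    # averaging (or summing) keeps the width equal to one vector
    return size


def output_weight_count(vocab_size, size, window, concat):
    """Number of output-layer weights: one input-width row per vocabulary word."""
    return vocab_size * input_layer_width(size, window, concat)


VOCAB = 10000  # hypothetical vocabulary size
averaged = output_weight_count(VOCAB, size=100, window=10, concat=False)
concatenated = output_weight_count(VOCAB, size=100, window=10, concat=True)
print(averaged, concatenated, concatenated // averaged)
```

With the `size=100, window=10` settings from the first post, this estimate puts the concatenated output layer at 21 times the averaged one.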

- Gordon

JH Lau

Dec 15, 2015, 5:23:51 AM

to gensim

Right, I see. Indeed, I was reading through that thread and no one seemed to be able to reproduce the paper's results. And I think the 9% error rate that gensim is getting (https://github.com/piskvorky/gensim/blob/develop/docs/notebooks/doc2vec-IMDB.ipynb) is already very respectable.

Many thanks for the doc2vec implementation in gensim. It was easy to use, and of course being able to call it from Python makes everything so easy :D One thing though: it wasn't immediately obvious to lots of people that the current implementation only trains the model for 1 epoch, and that you have to manually write some code to train it for multiple iterations. Is there a reason why the "iter" option is disabled for doc2vec? I read the blog post on this (http://rare-technologies.com/doc2vec-tutorial/), but I don't think there's any problem with duplicating the data n times (n=iter) and linearly decaying alpha over the full data.

-JH

Gordon Mohr

Dec 15, 2015, 2:01:13 PM

to gensim

You're welcome! The original Doc2Vec implementation was contributed by Tim Emerick, then I and others made additions and optimizations.

The `iter` argument should work fine with Doc2Vec. (Maybe it didn't originally, and the doc comment may not be clear – but it's currently used by several of the test_doc2vec.py test cases.)

Some of the examples don't use it just to have more control – the option of shuffling documents or re-assessing vector quality mid-training.

- Gordon

JH Lau

Dec 15, 2015, 7:41:16 PM

to gensim

Ah okay. So the "iter" option works. Apologies, I thought it was disabled.

-JH
