doc2vec: Distributed Memory algorithm is forced to use cbow?

Gregory Larchev

Jan 18, 2016, 2:53:44 PM
to gensim
I noticed that when we run doc2vec with Distributed Memory training algorithm (dm=1), the underlying word2vec model is forced to use cbow. Any particular reason for that? What if we want to run distributed memory with skip-gram instead (and would we ever want to)?

Thanks

Gregory Larchev

Mar 1, 2016, 6:26:43 PM
to gensim
Bump... Does anyone have any experience with this?

Gordon Mohr

Mar 1, 2016, 9:54:38 PM
to gensim
The Paragraph Vectors paper's 'Distributed Memory' (DM) mode is defined in a way that's analogous to Word2Vec's 'Continuous Bag of Words' (CBOW). And indeed, the word-vectors that result are trained as if by CBOW, as a necessary side-effect of the doc-vector training. (The doc-vecs and word-vecs share the same error-correction after each NN example.) Definitionally, DM mode causes CBOW-like word-vector training to occur.

If you train doc-vecs in a skip-gram fashion, that's the Paragraph Vectors paper's 'Distributed Bag of Words' (DBOW) mode, `dm=0` in gensim Doc2Vec. It would no longer be "DM" mode.

(While it's not supported in the code, you could conceivably interleave both kinds of training – similar to how the original word2vec.c and gensim allow enabling *both* hierarchical-softmax and negative-sampling. But I don't know any experiments or reasoning to suggest that'd be worth the extra complication/time.)
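
For concreteness, here's a minimal sketch of choosing each mode in gensim's `Doc2Vec` (the tiny corpus and parameter values are just placeholders, and parameter names follow recent gensim versions):

```python
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

# Placeholder corpus: each document is a list of tokens plus a unique tag.
corpus = [
    TaggedDocument(words=["machine", "learning", "is", "fun"], tags=["doc0"]),
    TaggedDocument(words=["paragraph", "vectors", "extend", "word2vec"], tags=["doc1"]),
]

# dm=1: Distributed Memory -- word-vectors get CBOW-style training
# as a side-effect of the doc-vector training.
dm_model = Doc2Vec(corpus, dm=1, vector_size=50, min_count=1, epochs=20)

# dm=0: DBOW, the skip-gram-analogous mode. Plain DBOW trains only
# doc-vectors; dbow_words=1 additionally interleaves skip-gram
# word-vector training.
dbow_model = Doc2Vec(corpus, dm=0, dbow_words=1, vector_size=50, min_count=1, epochs=20)
```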

- Gordon

bope...@gmail.com

Jul 26, 2016, 11:24:07 AM
to gensim


Hi guys,

Is it possible to use Doc2Vec with GloVe pre-trained word vectors? I'm trying to build a semantic search engine, and I would love to have the semantic relationships of the GloVe word vectors as my foundation, and then use Doc2Vec to map all the documents into vector space. That way, when I run a query, it will return similar documents with a strong semantic foundation for their word vectors. Will this work, or am I way off?

Thanks much for the help!

Alberto Blanco Garcés

Jan 4, 2018, 7:40:13 AM
to gensim
I'm interested in the same thing. Bump. I've seen threads that mention the `intersect_word2vec_format` function, but nothing clear.

Gordon Mohr

Jan 5, 2018, 2:21:45 PM
to gensim
This prior message has an overview of `intersect_word2vec_format()`'s potential use: https://groups.google.com/d/msg/gensim/u7rIdZUNFY0/NnuB6ickDQAJ

But, as noted elsewhere, it's best considered an experimental feature, so you'd need to work out the further details of its applicability to your needs yourself, perhaps by also consulting the gensim source code.

- Gordon

Denis Candido

Feb 26, 2018, 11:31:17 AM
to gensim
Hello Gordon,

I don't see anything in the documentation about `intersect_word2vec_format`. Was this method removed or deprecated?

//Denis

Gordon Mohr

Feb 26, 2018, 1:10:32 PM
to gensim
It's only ever existed on `Word2Vec` - where it remains, and is still inherited and usable by `Doc2Vec` models. (It probably should have been refactored and moved to `KeyedVectors`, the newer class(es) collecting vector-set operations and implementing the `wv` property of `Word2Vec`/`Doc2Vec`/etc. models, but it hasn't been.)

So `intersect_word2vec_format()` should still work, albeit with the same limitations described in prior messages: it only replaces vectors for words that already exist in the vocabulary when it is called, and it may not offer much (or any) advantage over just using the usual `Doc2Vec` modes (which either ignore word-vectors, or co-create them as needed, simultaneously with doc-vectors).
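
As a rough sketch of that workflow (assuming a gensim version where the method is still present, and with placeholder file paths; GloVe files first need converting to word2vec text format):

```python
from gensim.models.doc2vec import Doc2Vec, TaggedDocument
from gensim.scripts.glove2word2vec import glove2word2vec

# Placeholder corpus.
corpus = [
    TaggedDocument(words=["semantic", "search", "example"], tags=["doc0"]),
    TaggedDocument(words=["another", "short", "document"], tags=["doc1"]),
]

# Convert a GloVe file to word2vec text format (paths are placeholders).
glove2word2vec("glove.6B.100d.txt", "glove.6B.100d.w2v.txt")

model = Doc2Vec(dm=1, vector_size=100, min_count=1, epochs=20)
model.build_vocab(corpus)

# Replace vectors only for words already in the model's vocabulary;
# lockf=1.0 lets the imported vectors keep adjusting during training
# (lockf=0.0 would freeze them).
model.intersect_word2vec_format("glove.6B.100d.w2v.txt", lockf=1.0, binary=False)

model.train(corpus, total_examples=model.corpus_count, epochs=model.epochs)
```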

- Gordon