doc2vec: Distributed Memory algorithm is forced to use cbow?

Gregory Larchev

Jan 18, 2016, 2:53:44 PM
to gensim
I noticed that when we run doc2vec with Distributed Memory training algorithm (dm=1), the underlying word2vec model is forced to use cbow. Any particular reason for that? What if we want to run distributed memory with skip-gram instead (and would we ever want to)?

Thanks

Gregory Larchev

Mar 1, 2016, 6:26:43 PM
to gensim
Bump... Does anyone have any experience with this?

Gordon Mohr

Mar 1, 2016, 9:54:38 PM
to gensim
The Paragraph Vectors paper's 'Distributed Memory' (DM) mode is defined in a way that's analogous to Word2Vec's 'Continuous Bag of Words' (CBOW). And indeed, the word-vectors that result are trained as if by CBOW, as a necessary side-effect of the doc-vector training. (The doc-vecs and word-vecs share the same error-correction after each NN example.) Definitionally, DM mode causes CBOW-like word-vector training to occur.

If you train doc-vecs in a skip-gram fashion, that's the Paragraph Vectors paper's 'Distributed Bag of Words' (DBOW) mode, `dm=0` in gensim Doc2Vec. It would no longer be "DM" mode.

(While it's not supported in the code, you could conceivably interleave both kinds of training – similar to how the original word2vec.c and gensim allow enabling *both* hierarchical-softmax and negative-sampling. But I don't know any experiments or reasoning to suggest that'd be worth the extra complication/time.)
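
For concreteness, here's a minimal sketch of choosing each mode in gensim's `Doc2Vec` (the tiny corpus and parameter values are just placeholders, and parameter names follow recent gensim versions):

```python
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

# Placeholder corpus: each document is a list of tokens plus a unique tag.
corpus = [
    TaggedDocument(words=["machine", "learning", "is", "fun"], tags=["doc0"]),
    TaggedDocument(words=["paragraph", "vectors", "extend", "word2vec"], tags=["doc1"]),
]

# dm=1: Distributed Memory -- word-vectors get CBOW-style training
# as a side-effect of the doc-vector training.
dm_model = Doc2Vec(corpus, dm=1, vector_size=50, min_count=1, epochs=20)

# dm=0: DBOW, the skip-gram-analogous mode. Plain DBOW trains only
# doc-vectors; dbow_words=1 additionally interleaves skip-gram
# word-vector training.
dbow_model = Doc2Vec(corpus, dm=0, dbow_words=1, vector_size=50, min_count=1, epochs=20)
```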

- Gordon

bope...@gmail.com

Jul 26, 2016, 11:24:07 AM
to gensim


Hi guys,

Is it possible to use Doc2Vec with GloVe pre-trained word vectors? I'm trying to build a semantic search engine, and I would love to have the semantic relationships of the GloVe word vectors as my foundation, and then use Doc2Vec to map all the documents into vector space. That way, when I run a query, it will return similar documents with a strong semantic foundation for their word vectors. Will this work, or am I way off?

Thanks much for the help!

Alberto Blanco Garcés

Jan 4, 2018, 7:40:13 AM
to gensim
I'm interested in the same thing. Bump. I've seen threads that mention the `intersect_word2vec_format` function, but nothing clear.

Gordon Mohr

Jan 5, 2018, 2:21:45 PM
to gensim
This prior message has an overview of `intersect_word2vec_format()`'s potential use: https://groups.google.com/d/msg/gensim/u7rIdZUNFY0/NnuB6ickDQAJ

But, as noted elsewhere, it's best considered an experimental feature, so you'd need to work out the further details of its applicability to your needs yourself, perhaps by also consulting the gensim source code.

- Gordon

Denis Candido

Feb 26, 2018, 11:31:17 AM
to gensim
Hello Gordon,

I don't see anything in the documentation about `intersect_word2vec_format`. Was this method removed or deprecated?

//Denis

Gordon Mohr

Feb 26, 2018, 1:10:32 PM
to gensim
It's only ever existed on `Word2Vec` - where it remains, and is still inherited and usable by `Doc2Vec` models. (It probably should have been refactored and moved to `KeyedVectors`, the newer class(es) collecting vector-set operations and implementing the `wv` property of `Word2Vec`/`Doc2Vec`/etc. models, but it hasn't been.)

So `intersect_word2vec_format()` should still work, albeit with the same limitations described in prior messages: it only replaces vectors for words that already exist in the vocabulary when it is called, and it may not offer much (or any) advantage over just using the usual `Doc2Vec` modes (which either ignore word-vectors, or co-create them as needed, simultaneously with doc-vectors).
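
As a rough sketch of that workflow (assuming a gensim version where the method is still present, and with placeholder file paths; GloVe files first need converting to word2vec text format):

```python
from gensim.models.doc2vec import Doc2Vec, TaggedDocument
from gensim.scripts.glove2word2vec import glove2word2vec

# Placeholder corpus.
corpus = [
    TaggedDocument(words=["semantic", "search", "example"], tags=["doc0"]),
    TaggedDocument(words=["another", "short", "document"], tags=["doc1"]),
]

# Convert a GloVe file to word2vec text format (paths are placeholders).
glove2word2vec("glove.6B.100d.txt", "glove.6B.100d.w2v.txt")

model = Doc2Vec(dm=1, vector_size=100, min_count=1, epochs=20)
model.build_vocab(corpus)

# Replace vectors only for words already in the model's vocabulary;
# lockf=1.0 lets the imported vectors keep adjusting during training
# (lockf=0.0 would freeze them).
model.intersect_word2vec_format("glove.6B.100d.w2v.txt", lockf=1.0, binary=False)

model.train(corpus, total_examples=model.corpus_count, epochs=model.epochs)
```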

- Gordon