Doc2Vec and pre-trained vectors


Stella Tryfona

Sep 5, 2016, 12:09:32 PM
to gensim
Hello, I am currently trying to build a DBOW model with Doc2Vec, using pre-trained word embeddings, for my master's thesis. If I understand correctly, in order to use the pre-trained word embeddings I must use the method "intersect_word2vec_format". My question is whether the dbow_words parameter needs to be set to 1, because if I set it to 0 there is no change in the paragraph vectors whether I use the pre-trained word embeddings or not.

Thank you very much in advance,

Stella

bkj...@gmail.com

Sep 6, 2016, 12:50:17 PM
to gensim
Stella --

Could you post a copy of the `Doc2Vec` parameters that you're using?  

I have roughly the same question as well -- I want to use word embeddings trained on a large number of sentences (100M) to train a Doc2Vec model on a smaller dataset.  I was wondering both a) whether this makes any sense and b) how exactly to go about doing it.  

~ Ben

Stella Tryfona

Sep 7, 2016, 7:09:43 AM
to gensim
Hi Ben, 

This is part of my code:

# PV-DBOW (with negative sampling)
model_DBOW = Doc2Vec(dm=0, size=600, window=15, hs=0, negative=5,
                     min_count=5, dbow_words=1, sample=1e-5)

# build the vocabulary
model_DBOW.build_vocab(biomedical_docs)

# use pre-trained word vectors
model_DBOW.intersect_word2vec_format('biomedical-vectors-600.bin', binary=True)

Some papers I've read report that they got better results when they used pre-trained word embeddings. 

Ben Johnson

Sep 7, 2016, 10:31:06 AM
to gen...@googlegroups.com
I'm going to try something similar on my data today; I'll report back if it seems to work. 

I will say that I haven't had much luck with hs=0 -- I did an experiment where I fed each document through twice, expecting the duplicates to have identical or very similar embeddings, but that wasn't the case, which suggests the embeddings weren't converging. So that's something to consider/take a look at. (Btw, my docs are tweets, so they're very short, which may have something to do with the non-convergence.)

Gordon Mohr

Sep 7, 2016, 3:06:56 PM
to gensim
Note that with pure DBOW, as described in the original "Paragraph Vectors" paper, (traditional input) word-vectors aren't learned or used at all. So you could import vectors from elsewhere, or just set every dimension of every word vector to `sys.float_info.max` or `0` or `float('inf')` or `float('nan')` and it won't affect your DBOW vectors at all. (Also, the `window` parameter has no effect in pure DBOW - each doc-vector is trained with respect to every target word in the corresponding doc.)
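
For illustration, a minimal sketch of that point, assuming a circa-2017 gensim API (`size`/`iter` constructor parameters, `model.wv.syn0`, and an explicit `train()` call - names differ across releases). With a fixed seed and a single worker, scrambling the word-vectors before pure-DBOW training leaves the learned doc-vectors unchanged:

import numpy as np
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

docs = [TaggedDocument(words=text.split(), tags=[i])
        for i, text in enumerate(["the quick brown fox", "jumps over the lazy dog"])]

def train_pure_dbow(scramble_word_vectors):
    model = Doc2Vec(dm=0, dbow_words=0, size=50, min_count=1, seed=1, workers=1, iter=5)
    model.build_vocab(docs)
    if scramble_word_vectors:
        model.wv.syn0[:] = 12345.0   # overwrite every word-vector with nonsense
    model.train(docs, total_examples=model.corpus_count, epochs=model.iter)
    return model.docvecs[0]

# expected: True - the (unused) word-vectors had no effect on the doc-vectors
print(np.allclose(train_pure_dbow(False), train_pure_dbow(True)))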

DBOW is so much like skip-gram word2vec that it's natural to interleave the two in the same training-pass, with shared hidden-to-output NN weights, which has among other effects:

* longer training times, proportionate to the skip-gram `window` size

* creation of word-vectors in the same 'space' as (and thus comparable to) the doc-vectors

* doc-vectors which are (within the narrow context of the NN word-prediction training) not as good at predicting document words. After all, the training is now spending more effort on making the word-vectors predictive within sliding windows, which is not likely to have the same optimal solution as making doc-vectors predictive over whole documents. But...

* doc-vectors which *might* be more useful for some downstream tasks, as the extra competition within the word-vector training may have resulted in more generalizability (or interpretability in terms of similarity-to-words). In a way, the word-vectors *are* doc-vectors for synthetic documents made up of surrounding context-window words, and to the extent this may be an effective corpus-expansion technique (especially for small corpuses) it might help the doc-vectors. (However, if word-vectors aren't needed, it might be fairest to compare the effect of adding word-training with a window of N against pure DBOW with N times as many iterations - see the sketch below. That'd take roughly the same amount of training time, but all effort would be spent on the doc-vectors.)
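
A rough sketch of that comparison (hypothetical parameter values; same era-specific gensim API as elsewhere in this thread, and `docs` stands for your own TaggedDocument corpus):

from gensim.models.doc2vec import Doc2Vec

window_n = 5

# DBOW plus interleaved skip-gram word-training over a window of N
mixed = Doc2Vec(dm=0, dbow_words=1, window=window_n, size=300,
                hs=0, negative=5, min_count=5, iter=10)

# pure DBOW given roughly the same total training effort,
# all of it spent on the doc-vectors
pure = Doc2Vec(dm=0, dbow_words=0, size=300,
               hs=0, negative=5, min_count=5, iter=10 * window_n)

for model in (mixed, pure):
    model.build_vocab(docs)
    model.train(docs, total_examples=model.corpus_count, epochs=model.iter)

Evaluating both sets of doc-vectors on the same downstream task is then the fairer head-to-head test.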

Note that even with `dbow_words=1`, each individual doc-vector training example fed through the NN does not involve word-vectors. The effect of the word-vectors is indirect, because of the interleaved joint training. (After each doc-vector update, some word-vector updates will occur, and vice-versa, and they're sharing hidden-to-output weights. So only adjustments with some joint utility will tend to persist/continue.)

Finally, note that `intersect_word2vec_format()` by default locks all loaded word-vectors against further changes, by setting `model.syn0_lockf` to 0.0 for each loaded-word slot. If you absolutely love your pretrained vectors, they're perfect for your domain and trained on more data than you'll be using, and you are certain you want them to be the "fixed map" against which your doc-vectors should slot themselves, this may make sense. But in other cases pre-trained vectors may be no better than what you could create yourself, from domain-specific data, during local training. In that case, you may not want to use them, or you may want to use them simply to give the model a biased "head-start" (better than the usual random-initialized values), while still letting word-vectors drift to new positions best fit to your current training data. In that case, after doing the `intersect_word2vec_format()` load, you'd want to make sure all `syn0_lockf` values are again 1.0, allowing continued training. 
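
A minimal sketch of that 'head-start' recipe, using this era's `syn0_lockf` attribute (the attribute's name and location vary across gensim versions, and some versions also accept a `lockf=1.0` argument to `intersect_word2vec_format()` for the same effect); the vector dimensionality must match the pre-trained file (600 here):

model = Doc2Vec(dm=0, dbow_words=1, size=600, window=15,
                hs=0, negative=5, min_count=5, sample=1e-5)
model.build_vocab(biomedical_docs)

# import pre-trained vectors for words shared with the new vocabulary;
# by default this also locks those words against further training
model.intersect_word2vec_format('biomedical-vectors-600.bin', binary=True)

# keep the head-start, but let the imported word-vectors continue training
model.syn0_lockf[:] = 1.0

model.train(biomedical_docs, total_examples=model.corpus_count, epochs=model.iter)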

- Gordon 

Kiva S

Oct 27, 2017, 10:49:58 AM
to gensim
@Stella 

Were you able to solve this? I'm a bit late to the party, but I seem to be following the path that you took.

Marc

Sheer

Oct 30, 2017, 4:18:44 PM
to gensim
Hi All,

This issue is discussed at length in this paper: https://arxiv.org/abs/1607.05368.

The authors forked gensim to implement a variant of Doc2Vec with pre-trained embeddings (https://github.com/jhlau/gensim).

But a much simpler and less invasive solution is to just subclass Doc2Vec.  I'm attaching my code for doing so.  I tested running with DBOW and setting dbow_words=0 and it seemed to work.  I used the pretrained Google vectors (and confirmed they did not change) and tested on some Wikipedia pages; the results seemed quite good.

Sheer 
doc2vec.py

Gordon Mohr

Oct 30, 2017, 6:20:31 PM
to gensim
If you're running DBOW without interleaved skip-gram word-training (`dm=0, dbow_words=0`), any word-vectors in the model will be neither trained nor consulted in any way during the training. So in that mode, loading pre-trained vectors is superfluous. 

- Gordon

Sheer El Showk

Oct 30, 2017, 7:10:47 PM
to gen...@googlegroups.com
Hi Gordon,

Thanks for your response.  I do see that in the original paper they claim they just use a softmax over the vocab (and they give very few details).  But looking at the gensim implementation (the Python rather than Cython code, for readability), I see that train_document_dbow calls train_sg_pair for each word in doc_words (but uses the doc-vector rather than syn0 as the context).  I'm using hs=0, and the code for that case uses the word-vector in syn1 (line 309 of word2vec.py).  So I see what you mean that setting syn0 is not the correct thing to do, but presumably setting syn1 would work - though I would also have to set learn_hidden=False in train_document_dbow?

Thanks for your help!



Gordon Mohr

Oct 31, 2017, 12:24:55 AM
to gensim
When speaking of pre-trained word-vectors, such as those in the `model.wv` vectors of Word2Vec or Doc2Vec, or found in some export file like the `GoogleNews` vectors, they're usually equivalent to what's in the `syn0` array. It wouldn't make sense to load those into `syn1` (for HS mode) or `syn1neg` (for negative-sampling mode). 

Whether and when it might make any sense to load pre-trained word-vectors before Doc2Vec training at all remains, to me, a murky question. Doc2Vec does not require word-vectors as an input, or a 1st stage - and the modes that use word-vectors at all will make them simultaneously with doc-vectors from the training corpus. 

In pure-DBOW mode (`dm=0, dbow_words=0`), loading word-vectors into `syn0` can't have any effect either way – the `syn0` values aren't used at all.

If you added skip-gram training to DBOW (`dm=0, dbow_words=1`), having prior word-vectors in `syn0`, either locked against changes or free to change with training, would have an indirect effect on the doc-vectors, because of all the skip-gram training affecting the hidden->output weights, with respect to which the doc-vectors would also then have their own adjustments calculated.

In DM mode, any pre-loaded word-vectors would be averaged with doc-vectors for every training-example, and so have the most-direct influence. 

In cases where such word-vectors might have influence, my hunch would be they might speed or improve results marginally, if the new training data is thin and the prior word-vectors come from a compatible domain. But for larger datasets or datasets from a different language domain than the pre-trained vectors, my hunch would be the impact could be negligible or negative. 

Separately, it might make sense to try to retain the `syn1` (in HS mode) or `syn1neg` (in negative-sampling mode) weights from a prior training session, either Word2Vec or Doc2Vec, to see if it speeds/improves a later session. But such weights are not typically saved as 'word-vectors'. (They are interpretable as one-vector-per-word in negative-sampling mode, but not cleanly in HS mode.) And they'd only be meaningful in a follow-up session with careful attention to ensuring vocabulary-correspondence. (In HS mode, perhaps retaining an identical vocabulary & encoding-tree. In negative-sampling mode, you could possibly synthesize a new `syn1neg` with a mix of imported vectors, for shared words, and new vectors, for novel words.) Whether you'd want this layer frozen (as in Doc2Vec inference, and possible by calling `train_document_MODE(..., learn_hidden=False, ...)`) or updated with new training examples would be an open question. 

There's no existing facility for saving/loading/correlating-vocabularies with respect to the `syn1` or `syn1neg` weights – you'd have to code that up, and ensure the model remains in a self-consistent, usable state. 
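
A very rough sketch of what that coding-up might look like for the negative-sampling case (attribute names as in this era's gensim - `model.wv.vocab`, `model.syn1neg` - which have moved around in later releases; `old_model` is a previously-trained, compatible model with the same vector size and negative-sampling mode, and `new_docs` stands for the new training corpus):

new_model = Doc2Vec(dm=0, dbow_words=1, size=600, hs=0, negative=5, min_count=5)
new_model.build_vocab(new_docs)

shared = 0
for word, old_entry in old_model.wv.vocab.items():
    new_entry = new_model.wv.vocab.get(word)
    if new_entry is not None:
        # reuse the prior output-layer weights for words both vocabularies share;
        # novel words keep their fresh (zero) initialization
        new_model.syn1neg[new_entry.index] = old_model.syn1neg[old_entry.index]
        shared += 1
print('reused syn1neg rows for %d shared words' % shared)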

My hunches for that kind of re-use would be similar to those for re-use of traditional 'input' (syn0) vectors: possibly helpful for small datasets in well-matched domains, possibly wasteful or harmful if the dataset is larger or from a contrasting domain. 

- Gordon

Idriss Brahimi

Jul 2, 2018, 5:00:59 AM
to gensim