In DBOW, how does simultaneously training word vectors influence the resulting document vectors?


Runze Wang

Nov 15, 2017, 8:00:31 PM
to gensim
Hi y'all,

Big greetings :-)

I used to think that, when training a DBOW model, it doesn't matter whether we also simultaneously train the word vectors (i.e., dm=0 and dbow_words=1) or not, because this structure (from Le and Mikolov's original paper) doesn't utilize word vectors at all:

[figure: the PV-DBOW architecture diagram from the paper]

However, I came across this paper (Lau and Baldwin) that experimented with enabling and disabling word-embedding training, and found that enabling it greatly improves performance:


Even though dbow can in theory work with randomised word embeddings, we found that performance degrades severely under this setting. An intuitive explanation can be traced back to its objective function, which is to maximise the dot product between the document embedding and its constituent word embeddings: if word embeddings are randomly distributed, it becomes more difficult to optimise the document embedding to be close to its more critical content words.


It seems to me that this is suggesting the loss in the final layer is computed as a dot product between the document embedding and the context words' embeddings - is that the case? Looking at the structure above, I had always thought that the final layer is simply a softmax producing the predictions, and the loss is simply a log-loss against the one-hot-encoded context words.
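
In case it helps pin down what I mean by that second picture, here's a tiny numpy toy of how I currently imagine the final layer (purely illustrative - made-up names, not gensim's actual code):

    import numpy as np

    rng = np.random.default_rng(0)
    dim, vocab_size = 5, 4

    doc_vec = rng.normal(size=dim)                    # the PV-DBOW document vector (the "input")
    out_weights = rng.normal(size=(vocab_size, dim))  # hidden-to-output weights, one row per vocab word

    logits = out_weights @ doc_vec                    # note: each logit is a dot product
    probs = np.exp(logits) / np.exp(logits).sum()     # softmax over the vocabulary
    target = 2                                        # index of the observed context word
    loss = -np.log(probs[target])                     # log-loss against the one-hot target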

Please let me know what you think.

Thanks,
Runze

Gordon Mohr

Nov 30, 2017, 12:08:59 PM
to gensim
If you're mixing PV-DBOW with simultaneous skip-gram word-training (as with gensim `Doc2Vec` mode `dm=0, dbow_words=1`), it's true that the (traditional input) word-vectors don't *directly* affect the doc-vectors. Each document example presented to the model during training triggers a forward-propagation through the NN that doesn't use any word-vectors. Then, the back-propagation only updates the hidden-to-output weights and the PV-DBOW doc-vector itself.

But there's an indirect influence, because interleaved between the document-examples will be many word-to-word examples, which share the same hidden-to-output weights. The doc-to-word predictions are tried & improved against a model that's also doing word-to-word predictions, and vice-versa.
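
As a rough sketch of that weight-sharing - a toy negative-sampling example in numpy, not gensim's actual code, with purely illustrative names:

    import numpy as np

    rng = np.random.default_rng(0)
    dim, vocab_size = 8, 5

    doc_vecs = rng.normal(scale=0.1, size=(1, dim))            # PV-DBOW doc-vector(s)
    word_vecs = rng.normal(scale=0.1, size=(vocab_size, dim))  # skip-gram input word-vectors
    out_weights = np.zeros((vocab_size, dim))                  # hidden-to-output weights, shared by both tasks

    def ns_update(in_vec, target, negatives, lr=0.025):
        """One negative-sampling step: nudge in_vec toward the target word's
        output row & away from the negatives'; returns the gradient for in_vec."""
        grad_in = np.zeros_like(in_vec)
        for idx, label in [(target, 1.0)] + [(neg, 0.0) for neg in negatives]:
            score = 1.0 / (1.0 + np.exp(-in_vec @ out_weights[idx]))
            g = lr * (label - score)
            grad_in += g * out_weights[idx]
            out_weights[idx] += g * in_vec   # both kinds of examples update the SAME rows
        return grad_in

    # A PV-DBOW micro-example: the doc-vector predicts a word; no word-vectors involved.
    doc_vecs[0] += ns_update(doc_vecs[0], target=2, negatives=[0, 4])

    # A skip-gram micro-example: a word-vector predicts a nearby word, against the
    # very same out_weights the doc-vector is being trained against.
    word_vecs[1] += ns_update(word_vecs[1], target=2, negatives=[3, 4])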

The Lau & Baldwin paper is a valuable attempt to elucidate & benchmark a bunch of options, but I find its analysis confused on a few issues, such as the implication that the "randomized embeddings" have any influence on pure PV-DBOW training (with default `dbow_words=0`). The word embeddings are ignored entirely in that case – the gensim code which allocates & randomly-initializes the word-vectors only runs because of shared initialization code-paths with other modes that haven't been conditionally skipped (as they theoretically could be). 
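
If you want to convince yourself of that, a quick sanity-check along these lines should do it (just a sketch - 'my_corpus.txt' is a stand-in for your own data, and the parameter/attribute names are as in current gensim releases; other versions may spell them differently):

    import numpy as np
    from gensim.models.doc2vec import Doc2Vec, TaggedDocument

    # any small corpus of TaggedDocuments will do
    corpus = [TaggedDocument(words=line.split(), tags=[i])
              for i, line in enumerate(open('my_corpus.txt'))]

    model = Doc2Vec(dm=0, dbow_words=0, size=50, min_count=2)
    model.build_vocab(corpus)
    before = model.wv.syn0.copy()   # the randomly-initialized word-vectors
    model.train(corpus, total_examples=model.corpus_count, epochs=10)

    # pure PV-DBOW never trains the word-vectors, so they're unchanged
    assert np.allclose(before, model.wv.syn0)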

It is definitely the case that adding interleaved skip-gram training works as a sort of corpus-expansion trick: every sliding-window of word-to-word predictions is like another mini-document, carved out of the existing documents. So adding simultaneous word-training winds up presenting the NN model with many more individual training-examples – by a factor of the `window` parameter, which also multiplies training-time – and perhaps more varied shades-of-usage via the word-to-word frequencies.

So for example, comparing...

    dm=0, iter=10

...vs...

    dm=0, dbow_words=1, window=10, iter=10

...the second set of parameters is going to result in 11x more micro-examples (individual input-to-desired-output patterns) being presented to the NN over the 10 corpus iterations, and thus about 11x more training time. It might thus result in a stronger model, but if your main interest is the quality of the doc-vectors, and you're willing to spend the same ~11x training time that `dbow_words=1, window=10` would cost you, you might also want to test pure-DBOW as...

    dm=0, iter=110

...so that the model innards are getting the same number of micro-examples as in the simultaneous word-training case, but now fully focused on the doc-vectors. 
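
In gensim code, that comparison is roughly (a sketch only, with a throwaway two-document corpus standing in for real data, and parameter names as in current gensim releases):

    from gensim.models.doc2vec import Doc2Vec, TaggedDocument

    # toy stand-in corpus; substitute your own TaggedDocuments
    corpus = [TaggedDocument(words=text.split(), tags=[i]) for i, text in enumerate([
        "the quick brown fox jumps over the lazy dog",
        "never jump over a lazy dog too quickly",
    ])]

    common = dict(documents=corpus, size=100, min_count=1, workers=4)

    m_pure    = Doc2Vec(dm=0, iter=10, **common)                           # plain PV-DBOW
    m_words   = Doc2Vec(dm=0, dbow_words=1, window=10, iter=10, **common)  # ~11x the micro-examples
    m_pure11x = Doc2Vec(dm=0, iter=110, **common)                          # same micro-example budget, all on doc-vectors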
    
- Gordon

Runze Wang

Nov 30, 2017, 2:00:33 PM
to gensim
Thanks for the detailed explanation! Very helpful!