objective function in doc2vec


Sascha

Sep 13, 2016, 1:14:38 PM
to gensim

Hi! In Mikolov's doc2vec paper, the objective function (for word2vec) for one text is written as the averaged log-probability, i.e. the log-probability divided by the number of words (https://cs.stanford.edu/~quocle/paragraph_vector.pdf).
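Written out, the averaged objective from the paper, for a word sequence w_1, ..., w_T and window size k, is

    \frac{1}{T} \sum_{t=k}^{T-k} \log p(w_t \mid w_{t-k}, \ldots, w_{t+k})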



In the Python code of word2vec or doc2vec I can't see any averaging. When taking the derivatives in doc2vec and performing the gradient-descent update for the document vectors, a division by the number of words in the document should be executed. Is there a reason why it was left out?

I haven't found an objective function for doc2vec, but when taking averaging into account it should look like equation 2, where J is the number of documents and N_j is the size of document j.
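Roughly, what I have in mind (my own attempt, so it might be off) is

    \frac{1}{J} \sum_{j=1}^{J} \frac{1}{N_j} \sum_{t=k}^{N_j - k} \log p\left(w_t^{(j)} \mid w_{t-k}^{(j)}, \ldots, w_{t+k}^{(j)}, d_j\right)

where d_j is the paragraph vector of document j.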
Thanks for any hints or corrections!
   

jayant jain

Sep 13, 2016, 5:08:50 PM
to gensim
This is because of the nature of the training process in gensim: instead of a batched update, the weight matrices are updated after evaluating every individual skip-gram pair. If you wish to understand the process better, I'd recommend looking at the pure Python implementation first instead of the Cython implementation. The relevant functions are train_batch_sg and train_sg_pair for skip-gram, and train_batch_cbow and train_cbow_pair for CBOW.

Also, a minor correction: for computational efficiency, since the aim is only to obtain high-quality word vectors, and not to actually maximize the log-probability, the negative sampling objective function is usually used in practice. You could have a look at this paper for more details: https://papers.nips.cc/paper/5021-distributed-representations-of-words-and-phrases-and-their-compositionality.pdf
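To sketch what a single update looks like (schematic pseudocode only, not the actual gensim code), a skip-gram pair trained with negative sampling does roughly this:

import numpy as np

def train_sg_pair_sketch(syn0, syn1neg, target_idx, context_idx, noise_indices, alpha):
    # Schematic SGD step for one (context -> target) skip-gram pair with
    # negative sampling. syn0 holds the input vectors, syn1neg the output vectors.
    # Note there is no division by sentence or corpus length anywhere:
    # every pair triggers its own small update, scaled only by alpha.
    l1 = syn0[context_idx]                 # input vector of the context word
    neu1e = np.zeros_like(l1)              # accumulated error for the input vector
    # the true target gets label 1, the sampled noise words get label 0
    for idx, label in [(target_idx, 1.0)] + [(n, 0.0) for n in noise_indices]:
        l2 = syn1neg[idx]
        f = 1.0 / (1.0 + np.exp(-np.dot(l1, l2)))  # predicted probability
        g = (label - f) * alpha                    # gradient times learning rate
        neu1e += g * l2                            # error to propagate back to the input vector
        syn1neg[idx] += g * l1                     # update the output vector immediately
    syn0[context_idx] += neu1e                     # update the input vector immediately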

Thanks
Jayant

Sascha

Sep 14, 2016, 6:25:37 AM
to gensim

Thanks, I had already taken a look at the Python code and I'm aware of the paper.
Even if you optimize the objective function with single updates, the code should look different when the objective function is the average log-probability.
The gradient-descent update for the averaged objective I wrote down in my first post (which could be wrong?) would look like the equation above (for hierarchical softmax).
When there is no batch update, one sum is missing, but the division by J (the number of documents) and N_j (the number of words in the document) would still be performed.
So are my calculations wrong, or is the implicit objective function in gensim the log-probability rather than the average log-probability as in the paper (https://cs.stanford.edu/~quocle/paragraph_vector.pdf)?
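To make the point explicit (my derivation, which may well be wrong): with the averaged objective, a single update of the document vector d_j for one word w_t would carry the averaging factors, i.e. roughly

    d_j \leftarrow d_j + \frac{\alpha}{J \, N_j} \, \frac{\partial}{\partial d_j} \log p(w_t \mid \text{context}, d_j)

whereas the code performs the same step without the 1/(J N_j) factor.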
I wasn't aware that negative sampling outperforms hierarchical softmax significantly; I thought they produced comparable results.

Lev Konstantinovskiy

Sep 16, 2016, 12:25:26 PM
to gensim
Hi Sascha,


Are you suggesting that the averaging should change in doc2vec compared to word2vec? Unfortunately the very terse paper says "More formally, the only change in this model compared to the word vector framework is in equation 1, where h is constructed from W and D." 

It would be great to discuss your derivations if you could share them, preferably using the notation from Rong.

Also, feel free to compare with the Deeplearning4j doc2vec implementation.

Regards
Lev

Sascha

Sep 17, 2016, 2:39:33 PM
to gensim

Hi Lev!

Actually, I'm wondering why there is no averaging by the number of words in the gensim word2vec implementation, while averaging is assumed in the paper you cited.
But as far as I can see, there is no average likelihood calculated in Mikolov's word2vec implementation either, so the gensim implementation seems consistent with that code.
In the mathematical explanation of word2vec by Rong, the averaging over words is ignored too.

I agree that your citation seems to imply that no additional averaging is necessary for doc2vec. But under the assumption that we want to calculate an average probability, it makes sense to divide by the number of documents too. Since the factor by which we divide (the number of all documents) is the same for all document-vector updates, it should make no difference for practical purposes, because we can always use another learning rate. The equation would just imply that a different number of documents needs a different learning rate.
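In other words (if my reading is right), dividing by the constant J is equivalent to training with a rescaled learning rate

    \alpha' = \frac{\alpha}{J}

so tuning the learning rate already absorbs that factor.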
For me personally it's important because doc2vec is a part of my master's thesis and I wanted to use the gensim implementation for the programming part. That's why I wanted to double check that the code is in accordance with the theory from the papers.

I've attached my version of the averaged doc2vec; the introduction and most of the notation are taken from Rong.
Any feedback is welcome, especially if someone knows why the averaging can be ignored in the word2vec code. Is there a theoretically founded reason,
or has it simply worked as well as averaging in practice?
I will take a look at the Deeplearning4j doc2vec implementation too.

Best,
Sascha
averaged_hierarchical_softmax_doc2vec.pdf

Lev Konstantinovskiy

Sep 18, 2016, 12:19:07 PM
to gensim
Hi Sascha,

Thanks for the reply. A master's thesis on doc2vec sounds very interesting. We provide academic support to such projects via regular Skype meetings. I'd appreciate an email to l...@rare-technologies.com to arrange it.

Regards
Lev

Gordon Mohr

Sep 18, 2016, 5:35:06 PM
to gensim
Can you highlight what passage of the word2vec paper makes you think an "averaging by the number of words" is a necessary step during training?

As you note, gensim matches the behavior of the original word2vec.c code, and the behavior as implemented performs well. (If you have a proposed code change where an additional averaging would occur, that could be tested for practical benefits.)

I suspect the source of the discrepancy-in-understandings may be the distinction between the overall goal ("minimize average error") and the concrete, incremental steps used to efficiently approximate that goal. Those steps only ever operate on subsets of the training examples (and, for the most part in the current code, one text example at a time), and so the total number of words is not a direct part of any individual-example training loop.

- Gordon

Sascha

Sep 19, 2016, 6:08:06 AM
to gensim
Thanks for the offer, Lev. I will send you an email later.

Sascha

Sep 19, 2016, 6:55:56 AM
to gensim
The formulation of the objective function implied to me that the log-probability has to be divided by the number of words in each document/sentence:
"the objective of the word vector model is to maximize the average log probability" in "Distributed Representations of Sentences and Documents".
So the objective that is actually maximized is just the log-probability, for efficiency reasons and because it performs well.

But wouldn't it be a better approximation of the original objective function if the length of each sentence used for the updates were also used for averaging?
Or is this omitted because of efficiency considerations too?

Best,
Sascha

Gordon Mohr

Sep 19, 2016, 2:29:41 PM
to gensim
In practice, individual sentences are never presented to the neural network as training-examples. Rather, only (context -> target_word) training-examples are presented, as extracted from the sentences. These have no necessary connection to their source sentences – they could hypothetically be further shuffled before presentation to the neural network, to interleave examples from different sentences. (The code doesn't do this, but my guess would be it might offer some slight quality advantage but also a slight slowdown from the extra shuffling and lesser cache-locality.)
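(As a toy illustration only, not the real code, enumerating the skip-gram training pairs from one tokenized sentence looks like this; real implementations also randomly shrink the window per target word:)

def skipgram_pairs(sentence, window=5):
    # Toy illustration: list the (context_word -> target_word) training examples
    # taken from one tokenized sentence. Each pair becomes its own update;
    # the sentence length itself never enters any individual update.
    pairs = []
    for pos, target in enumerate(sentence):
        start = max(0, pos - window)
        for pos2, context in enumerate(sentence[start:pos + window + 1], start):
            if pos2 != pos:
                pairs.append((context, target))
    return pairs

# skipgram_pairs(["the", "quick", "brown", "fox"], window=1)
# -> [('quick', 'the'), ('the', 'quick'), ('brown', 'quick'), ('quick', 'brown'), ...]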

Of course, the contexts differ in each mode. In Word2Vec skip-gram, 'context' is a single nearby word. In Word2Vec CBOW, 'context' is an average of nearby words. In Doc2Vec DBOW, 'context' is a single vector for a full paragraph. In Doc2Vec DBOW, 'context' is an average of nearby words and the full-paragraph vector. But in none of these cases does the value of 'context' vary based on either the count of words of the originating text-example ('sentence') or count of words in the full corpus. 

I can think of a vaguely-related issue in Doc2Vec where the length of each text-example could be relevant (but isn't addressed by the 'Paragraph Vectors' paper or current implementing code). 

In Doc2Vec, since the usual practice is to give each text-example a unique doc-vector, there is the issue that text-examples with wildly different lengths create different amounts of training for their corresponding doc-vectors. For example, if you have text-example A of 10 words, and text-example B of 1000 words, then the doc-A vector will get 10 training-cycles for every 1000 training-cycles that the doc-B vector gets. 

In the logic of the unsupervised neural-network training, this makes sense: it's trying to predict words from contexts, there are 100x more words to predict in B, so that model is far more tuned for the 1000 words than the 10. But since downstream tasks may consider the 'A' and 'B' documents of equal importance, this might not be ideal for those other tasks.   

There's no current code to tune for such imbalances, but it might plausibly make sense to either over-sample the small documents (artificially repeat them), or perhaps scale the learning-rate for individual training-examples based on the word-length of the text-example from which they originated. In those cases, a scaling would be happening based on word-lengths. But not in 'Paragraph Vectors' as described in the original paper or currently implemented. 
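Purely as a hypothetical sketch of that second idea (nothing like this exists in gensim), the per-example learning-rate scaling might look like:

def scaled_alpha(base_alpha, doc_word_count, pivot_length=100):
    # Hypothetical: shrink the effective learning rate for training examples that
    # come from long documents, so short and long documents receive a more
    # comparable total amount of doc-vector adjustment. pivot_length is an
    # arbitrary illustrative parameter, not anything from gensim.
    return base_alpha * min(1.0, float(pivot_length) / max(doc_word_count, 1))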

- Gordon

Sascha

Sep 20, 2016, 7:13:23 AM
to gensim
You have accidentally written DBOW twice instead of DM: "In Doc2Vec DM, 'context' is an average of nearby words and the full-paragraph vector."
But thanks, your explanations helped me! When I have time I will try to scale the doc2vec learning rates for different word-lengths and check whether the results differ from the current implementation.