In practice, individual sentences are never presented to the neural network as training-examples. Rather, only (context -> target_word) training-examples, as extracted from the sentences, are presented. These have no necessary connection to their source sentences – they could hypothetically be further shuffled before presentation to the neural network, to interleave examples from different sentences. (The code doesn't do this, but my guess is that it might offer some slight quality advantage, at the cost of a slight slowdown from the extra shuffling and lesser cache-locality.)
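For concreteness, here's a minimal sketch of that extraction plus a cross-sentence shuffle. (This is not the actual gensim code; extract_examples() is a hypothetical helper showing CBOW-style contexts.)

    import random

    def extract_examples(sentences, window=2):
        # hypothetical helper, not gensim's implementation: collect
        # (context_words, target_word) pairs from every sentence
        examples = []
        for sentence in sentences:
            for i, target in enumerate(sentence):
                context = (sentence[max(0, i - window):i]
                           + sentence[i + 1:i + 1 + window])
                if context:
                    examples.append((context, target))
        return examples

    sentences = [["the", "cat", "sat"],
                 ["dogs", "bark", "loudly", "at", "night"]]
    examples = extract_examples(sentences)
    random.shuffle(examples)  # the hypothetical cross-sentence shuffle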
Of course, the contexts differ in each mode. In Word2Vec skip-gram, 'context' is a single nearby word. In Word2Vec CBOW, 'context' is an average of nearby words. In Doc2Vec DBOW, 'context' is a single vector for the full paragraph. In Doc2Vec DM, 'context' is an average of nearby words and the full-paragraph vector. But in none of these cases does the value of 'context' vary based on either the count of words in the originating text-example ('sentence') or the count of words in the full corpus.
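As a rough illustration with toy vectors (not gensim internals – in the real models these lookups are rows of learned weight matrices), the four 'context' constructions differ like this:

    import numpy as np

    word_vecs = {w: np.random.rand(5) for w in ["the", "cat", "sat"]}
    doc_vec = np.random.rand(5)   # the paragraph's own vector (PV modes)
    nearby = ["the", "sat"]       # window words around a target like "cat"

    # skip-gram: one nearby word at a time
    sg_context = word_vecs["the"]
    # CBOW: average of the nearby words
    cbow_context = np.mean([word_vecs[w] for w in nearby], axis=0)
    # PV-DBOW: just the paragraph's vector
    dbow_context = doc_vec
    # PV-DM: average of nearby words plus the paragraph's vector
    dm_context = np.mean([word_vecs[w] for w in nearby] + [doc_vec], axis=0)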
I can think of a vaguely-related issue in Doc2Vec where the length of each text-example could be relevant (but isn't addressed by the 'Paragraph Vectors' paper or the current implementation).
In Doc2Vec, since the usual practice is to give each text-example a unique doc-vector, there is the issue that text-examples with wildly different lengths create different amounts of training for their corresponding doc-vectors. For example, if you have text-example A of 10 words, and text-example B of 1000 words, then the doc-A vector will get 10 training-cycles for every 1000 training-cycles that the doc-B vector gets.
In the logic of the unsupervised neural-network training, this makes sense: the model is trying to predict words from contexts, there are 100x more words to predict in B, so it becomes far more tuned for B's 1000 words than A's 10. But since downstream tasks may consider the 'A' and 'B' documents of equal importance, this imbalance might not be ideal for those other tasks.
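Since each word of a text-example serves as a prediction target, the per-epoch update counts for the doc-vectors fall straight out of the word counts:

    docs = {"doc_A": ["w"] * 10, "doc_B": ["w"] * 1000}

    # each doc-vector's per-epoch count of training-examples is
    # just its doc's word count
    per_epoch = {tag: len(words) for tag, words in docs.items()}
    # {'doc_A': 10, 'doc_B': 1000} -> doc_B's vector gets 100x more updates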
There's no current code to tune for such imbalances, but it might plausibly make sense to either over-sample the small documents (artificially repeat them), or perhaps scale the learning-rate for individual training-examples based on the word-length of the text-example from which they originated. Either approach would apply a scaling based on word-lengths – but no such scaling appears in 'Paragraph Vectors' as described in the original paper or as currently implemented.
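Neither fix exists in the current code; as a sketch only (assuming gensim-style TaggedDocument objects with a .words list), the two options might look like:

    import random

    def oversample(tagged_docs, target_len=1000):
        # hypothetical rebalancing, not in gensim: repeat short docs so
        # each doc-vector sees roughly equal training-examples per epoch
        rebalanced = []
        for doc in tagged_docs:
            repeats = max(1, target_len // max(1, len(doc.words)))
            rebalanced.extend([doc] * repeats)
        random.shuffle(rebalanced)
        return rebalanced

    def scaled_alpha(base_alpha, doc_len, ref_len=100):
        # the alternative (also hypothetical): scale each example's
        # learning-rate inversely with its originating doc's word-length
        return base_alpha * (ref_len / max(1, doc_len))

Over-sampling has the advantage of leaving the existing training loop untouched, at the cost of inflating each epoch's size; per-example learning-rate scaling keeps epochs the same size but would require changes inside the inner training loop.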
- Gordon