Since the gensim Doc2Vec class implements the 'Paragraph Vector' algorithm, the most formal and authoritative explanations are in the original 'Paragraph Vector' paper (http://arxiv.org/abs/1405.4053) and the Word2Vec papers on which it builds (http://arxiv.org/abs/1301.3781 and http://arxiv.org/abs/1310.4546).
Some key aspects to note:
* the NN involved is 'shallow', with a single hidden layer
* for each training example, the NN 'inputs' are determined by the choice of training context: in Doc2Vec, 'DM' or 'DBOW' (following that paper's definitions); in Word2Vec, 'CBOW' or 'Skip-Gram' (likewise)
* the target NN 'outputs' (and thus the errors to be backpropagated) are determined by the choice of sparse training strategy: hierarchical softmax or negative sampling – both this choice and the context mode are exposed as constructor parameters, as in the sketch below
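
As a rough illustration of how those two choices map onto the gensim Doc2Vec constructor, here is a minimal sketch, assuming a recent gensim (4.x) and a hypothetical two-document toy corpus (the tags, corpus, and parameter values are just placeholders):

    from gensim.models.doc2vec import Doc2Vec, TaggedDocument

    # Hypothetical toy corpus; each document gets a unique tag.
    corpus = [
        TaggedDocument(words=["the", "quick", "brown", "fox"], tags=["doc0"]),
        TaggedDocument(words=["jumped", "over", "the", "lazy", "dog"], tags=["doc1"]),
    ]

    model = Doc2Vec(
        documents=corpus,
        dm=1,            # context mode: 1 = 'DM', 0 = 'DBOW'
        hs=0,            # disable hierarchical softmax...
        negative=5,      # ...and instead draw 5 negative samples per example
        vector_size=50,  # dimensionality of the single hidden layer / learned vectors
        min_count=1,     # keep every word in this tiny toy corpus
        epochs=20,
    )

    # The learned 'paragraph vector' for a training document
    # (model.docvecs["doc0"] in older gensim releases):
    vec = model.dv["doc0"]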
- Gordon