Ideally, with these kinds of stochastic-gradient-descent-optimized models, we'd want to train the model until it is 'converged' – as good as it can get, given the limitations of its state/parameters, at its internal training goal (predicting words from surrounding word-vectors/doc-vectors). Formally, that would be when a full epoch's training loss – the tallied error magnitudes over all examples – can't be improved by further training. (Any nudge to the internal weights that further improves the loss on some examples worsens it on others.) After that point, the model has done its job – and we could score its usefulness on downstream tasks, perhaps compared against other models of different parameterization, but ideally always against models whose training has 'completed' by reaching that no-further-improvement state.
However, so far the gensim `Doc2Vec` model doesn't offer loss-reporting (and the loss-reporting that does exist on `Word2Vec` is a bit buggy & incomplete). And in practice, many implementations (including the original word2vec.c code) don't implement any dynamic stopping choice based on a formal test of loss plateauing; they just let users pick a fixed number of epochs, assuming they'll be wise enough to choose "enough" (perhaps by separately watching whatever loss is reported, or using other rules of thumb to avoid severely-undertrained runs).
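For what does exist, here's a minimal sketch of watching `Word2Vec`'s (cumulative, and per the above somewhat unreliable) loss tally after each epoch, using gensim's `compute_loss` option and its `CallbackAny2Vec` hook – nothing equivalent is available for `Doc2Vec`; the tiny corpus is just to make the snippet runnable:

```python
from gensim.models import Word2Vec
from gensim.models.callbacks import CallbackAny2Vec

class EpochLossLogger(CallbackAny2Vec):
    """Print the change in Word2Vec's running loss tally after each epoch."""
    def __init__(self):
        self.epoch, self.previous_total = 0, 0.0

    def on_epoch_end(self, model):
        total = model.get_latest_training_loss()  # cumulative since training began
        print(f"epoch {self.epoch}: loss delta {total - self.previous_total:.1f}")
        self.previous_total = total
        self.epoch += 1

corpus = [["human", "interface", "computer"],
          ["graph", "minors", "survey"]]  # toy corpus, illustration only
model = Word2Vec(corpus, vector_size=32, min_count=1, epochs=10,
                 compute_loss=True, callbacks=[EpochLossLogger()])
```

If such per-epoch deltas were trustworthy, you'd stop adding epochs once they flattened out – but given the caveats above, treat any such numbers skeptically.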
Still, more epochs (with the standard learning-rate decay to a minuscule final value) should never *hurt* convergence. (They'll just be a little wasteful, making tweaks with no net improvement.) And if more epochs ever actually harm downstream performance, as might be seen in cases of severe overfitting, it's *not* the epochs that should be reduced, in some hope that early-stopping the optimization will land in a better place. Rather, if the full optimization zooms past some interim state that momentarily seemed better for another downstream use, that's evidence some other aspect of the model is over-provisioned, creating nooks-and-crannies in which idiosyncrasies of the training data are being memorized instead of the more-generalizable patterns we'd prefer.
So if you see that pattern, *don't* treat `epochs` as a parameter to be meta-optimized up or down as a remedy for overfitting. Slim the model in other ways – usually by shrinking the dimensionality or the effective vocabulary – so that, again, any number of epochs only ever reaches a usefulness plateau. Without a loss number to watch, this becomes roughly: increase epochs until it stops helping downstream evaluations, as in the sketch below. If more epochs ever start actually hurting, that's an indication of *other* problems, best addressed by tightening other parameters. But once you're sure more epochs can't hurt – that the model's convergence target is in fact a useful endpoint for downstream utility – dialing back the epochs to get essentially the same result in less time makes sense.
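That rule-of-thumb might look something like the following sketch – assuming you have some downstream scoring function of your own (the `evaluate_downstream` here is a hypothetical stand-in, as is the toy corpus):

```python
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

raw_texts = ["some toy document text", "another short document", "and one more"]
docs = [TaggedDocument(words=text.split(), tags=[i])
        for i, text in enumerate(raw_texts)]

def evaluate_downstream(model):
    # Hypothetical stand-in: score the model on *your* real task,
    # e.g. accuracy of a classifier trained on the doc-vectors.
    return 0.0

prev_score = float("-inf")
for epochs in (5, 10, 20, 40, 80):  # keep increasing until it stops helping
    model = Doc2Vec(docs, vector_size=100, min_count=2, epochs=epochs)
    score = evaluate_downstream(model)
    print(f"epochs={epochs}: score={score}")
    if score < prev_score:
        # More epochs *hurting* signals overfitting: shrink vector_size or
        # raise min_count, rather than early-stopping via fewer epochs.
        break
    prev_score = score
```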
- Gordon