Avoiding over-fitting in Doc2Vec

Alafut

Feb 18, 2019, 4:01:50 PM
to Gensim
I have been training a Doc2Vec model on around 1 million text documents, and I'm having a hard time figuring out the optimal number of epochs to train for in order to avoid over-fitting.

The task I want to do is Semantic Textual Similarity (STS).

The length of the documents varies from about 100 to 1,500 words, and they vary greatly in subject matter.

Based on the paper "An Empirical Evaluation of doc2vec with Practical Insights into Document Embedding Generation", Jey Han Lau and Timothy Baldwin argue that the empirically optimal hyperparameters for this task, using the dmpv method, are:

vector size: 300
window size: 5
min count: 1
sub-sampling: 10**-6
negative samples: 5
epochs: 1000

My main issue is the number of epochs: 1000 seems extremely high. So, in short, I would like to know whether those hyperparameters make sense for the task I want to do, and also whether there is a way, using a callback function, to test if the model is starting to over-train itself.
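
For reference, here's roughly how I'm setting those hyperparameters up in gensim (a simplified sketch: `train_corpus` stands in for my real iterable of `TaggedDocument`s, and `workers=4` is arbitrary):

```
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

# train_corpus: an iterable of TaggedDocument(words=[...], tags=[...])
# built from my ~1 million documents
model = Doc2Vec(
    dm=1,            # dmpv (PV-DM) mode
    vector_size=300,
    window=5,
    min_count=1,
    sample=1e-6,     # sub-sampling threshold
    negative=5,
    epochs=1000,     # the value I'm unsure about
    workers=4,
)
model.build_vocab(train_corpus)
model.train(train_corpus, total_examples=model.corpus_count, epochs=model.epochs)
```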

Any guidance on this would be greatly appreciated.

Gordon Mohr

Feb 18, 2019, 11:39:04 PM
to Gensim
The optimal parameters will vary based on your data and goals: there are no universally best settings.

The most common epoch counts in published work are 10-20, though especially with very small documents (which you don't seem to have) or very small corpora (a million docs is pretty generous), more epochs may help.

Have you seen results that make you think you're over-fitting? (If so, what are those results?) Are you sure you're not making other mistakes in training? (In particular, if you're calling `train()` more than once, there's a good chance you're doing things wrong.)

A `min_count=1` usually means a much larger model, and notably worse word-vector quality in Word2Vec models. (Rare words can't obtain good vectors of their own, but still serve as somewhat-random 'noise' interfering with the improvement of vectors for neighboring words, so the default of throwing out rare words usually helps.) It may also hurt Doc2Vec quality, especially in PV-DM mode, because a word that appears in only a single context is then competing with the (also unique) doc-vector for explanatory power. (The two together, for describing nearby words: perhaps OK. But the doc-vector alone is then worse, because it's been pulling against another unique, randomly-initialized token's influence.)
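
As a rough illustration of the model-size effect, you can compare how many words survive different `min_count` thresholds before committing to a full training run (a sketch, assuming the same kind of `train_corpus` iterable of `TaggedDocument`s as above; `len(m.wv)` is the gensim 4.x spelling, older versions use `len(m.wv.vocab)`):

```
from gensim.models.doc2vec import Doc2Vec

# Compare surviving vocabulary sizes at different min_count thresholds (no training needed)
for mc in (1, 2, 5):
    m = Doc2Vec(vector_size=300, window=5, min_count=mc)
    m.build_vocab(train_corpus)
    print(f"min_count={mc}: {len(m.wv)} surviving words")
```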

- Gordon

Zeya LT

Aug 21, 2020, 12:59:51 AM
to Gensim
Hi Gordon, 

I'd like to follow up on this thread. Is there a systematic way of determining the optimal number of epochs for doc2vec? In supervised learning, we can split the data into training and validation sets and monitor the validation loss versus the training loss as the model trains. But given that doc2vec is unsupervised, should we still do a training/validation split? How can we determine the optimal number of epochs in this case?

Based on your reply above, I get the impression that it requires our subjective judgement to determine the optimal amount of training. Is that correct?

I look forward to your advice. Thanks.

Regards,
Zeya 

Gordon Mohr

Aug 21, 2020, 3:24:27 AM
to Gensim
Ideally with these kinds of stochastic-gradient-descent optimized models, we'd want to train the model until it is 'converged' - as good as it can get, given the limitations of its state/parameters, at its internal training goal (predicting words from surrounding word-vectors/doc-vectors). Formally, that would be when a full epoch's training loss – tallied error magnitudes over all examples – can't be improved by further training. (Any nudging to internal weights that further improves it on some examples worsens it on others.) After that point, this model has done its job – and we could score its usefulness on downstream tasks, perhaps compared against other models of different parameterization, but ideally, always models on which training has 'completed' via reaching that no-further-improvement state. 

However, so far the gensim `Doc2Vec` model doesn't have loss-reporting (and the loss-reporting that does exist on Word2Vec is a bit buggy & incomplete). And in practice, many implementations (including the original word2vec.c code) don't implement any dynamic stopping-choices (based on any formal test of loss plateauing), but just let users pick a fixed number of epochs, assuming they'll be wise enough to choose "enough" (perhaps by separately watching what loss might be reported, or using other rules-of-thumb to ensure severely-undertrained choices aren't used).
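
(For what it's worth, the `Word2Vec` loss-reporting that does exist looks roughly like the sketch below; treat the printed numbers as a rough trend only, given the known issues. The running tally accumulates across epochs within one `train()` call, hence the subtraction, and `sentences` stands in for your own iterable of tokenized texts.)

```
from gensim.models import Word2Vec
from gensim.models.callbacks import CallbackAny2Vec

class EpochLossLogger(CallbackAny2Vec):
    """Print the per-epoch change in the running training-loss tally."""
    def __init__(self):
        self.previous = 0.0

    def on_epoch_end(self, model):
        latest = model.get_latest_training_loss()  # cumulative within this train() call
        print(f"loss this epoch: {latest - self.previous:.1f}")
        self.previous = latest

# sentences: your own iterable of lists of tokens
model = Word2Vec(sentences, vector_size=300, compute_loss=True,
                 callbacks=[EpochLossLogger()])
```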

Still, more epochs (with the standard decaying-to-minuscule learning-rate) should never *hurt* convergence. (They'll just be a little wasteful, making no-net-improvement tweaks.) And if more epochs ever actually harm downstream performance, as might be seen in the case of severe overfitting, it's *not* the epochs that should be reduced, in some hope that early-stopping the optimization will land in a better place. Rather, if the full optimization zooms past some interim state that momentarily seemed better for another downstream use, that's evidence some other aspect of the model is over-provisioned, creating nooks and crannies in which idiosyncrasies of the training data are being memorized instead of the more-generalizable patterns we'd prefer.

So if you see that pattern, *don't* treat `epochs` as a parameter to be meta-optimized up or down, or as a remedy for overfitting. Slim the model in other ways, usually by shrinking the dimensionality or the effective vocabulary, so that again any number of epochs only ever reaches a usefulness plateau. Without a loss number to watch, this becomes roughly: increase epochs until it stops helping downstream evaluations. If more epochs ever start actually hurting, that's an indication of *other* problems, best addressed by tightening other parameters. But once you're sure more epochs can't hurt, that is, once the convergence target for the model is in fact a useful endpoint for downstream utility, dialing back the epochs to get essentially the same result in an efficient amount of time makes sense.
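
In practice, that "increase until it stops helping" loop is just something like the following sketch, where `downstream_score()` stands in for whatever held-out evaluation you actually trust (e.g. agreement with human similarity judgements on an STS set), and the other parameter values are placeholders:

```
from gensim.models.doc2vec import Doc2Vec

def downstream_score(model):
    # Placeholder: your own held-out evaluation, e.g. correlation of
    # inferred-vector cosine-similarities with human similarity ratings.
    raise NotImplementedError

for n_epochs in (10, 20, 40, 80, 160):
    m = Doc2Vec(train_corpus, dm=1, vector_size=300, window=5,
                min_count=5, epochs=n_epochs, workers=4)
    print(n_epochs, downstream_score(m))
```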

- Gordon

Tedo Vrbanec

Aug 22, 2020, 3:34:32 PM
to Gensim
"Still, more epochs (with the standard decaying-to-miniscule learning-rate) should never *hurt* convergence. (They'll just be a little wasteful making no-net-improvement tweaks.) "

I see a different reality. Take some corpus of documents whose similarities and differences you already know. Change the number of epochs (say, for word2vec) and you will see big changes in the results. E.g., in one of my simple cases, if I do not set the number of epochs to 70 but instead leave it undefined, then instead of normal results I get that the cosine similarities of all documents to all others are almost 1.

Gordon Mohr

Aug 23, 2020, 2:00:45 PM
to Gensim
If you do not specify a count of epochs, 5 epochs will be used. If your results have changed in any salient way from a 5-epoch run, compared to a 70-epoch run, that's suggestive that the 5-epoch run had not reached model convergence. (Similarly, if the results stay of comparable overall usefulness whether you use a 70-epoch run, or a 140-epoch run, that's suggestive the model had already converged with 70 epochs, and the extra 70 epochs in the 140-epoch run were superfluous.)

(Without seeing more details about your code & data, I'm not sure what could be happening in your "simple case". As word-vectors in `Word2Vec`, and doc-vectors in `Doc2Vec`, start training at random positions, there's not really any point in an effective training run, between those random starting positions and the converged final positions, at which I'd expect 'all' word-vector-to-word-vector cosine-similarities, or 'all' doc-vector-to-doc-vector cosine-similarities, to be near 1.0. But if you're plugging those word-vectors into some other particular doc-similarity calculation you haven't yet mentioned, or if there are other atypical things about your data or setup, who knows? And, I'm not sure exactly what "cosine similarities of all to all are almost 1" means in your setup.)
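
(If it helps with debugging, one quick way to check that "all-to-all" impression on a trained model is to sample some doc-vectors and look at their pairwise cosine-similarities directly. A sketch below, where `model` is your trained `Doc2Vec`; `model.dv` is the gensim 4.x name for the doc-vectors, older versions use `model.docvecs`.)

```
import numpy as np

# Sample trained doc-vectors and summarize their pairwise cosine-similarities
vecs = model.dv.vectors                                   # shape: (num_docs, vector_size)
idx = np.random.choice(len(vecs), size=min(200, len(vecs)), replace=False)
unit = vecs[idx] / np.linalg.norm(vecs[idx], axis=1, keepdims=True)
sims = unit @ unit.T
off_diag = sims[~np.eye(len(sims), dtype=bool)]
print("mean:", off_diag.mean(), "max:", off_diag.max())
```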

- Gordon

Zeya LT

Aug 29, 2020, 8:35:24 PM
to Gensim
Hi Gordon,

Thank you for the detailed explanation. Your advice has been really helpful for my work. 

Cheers!