Optimal epoch in Doc2Vec


Abu

Jun 28, 2017, 5:10:41 PM
to gensim
Hi,
I am training a Doc2Vec model on datasets with around 800-1200 documents. I am explicitly controlling the epochs with a loop to train the model. I get different results for different instances of the model trained with different numbers of iterations. It also seems that training time increases with the number of epochs. I want to choose an optimal value for the number of epochs. How can I do that?

Ivan Menshikh

Jun 29, 2017, 1:57:59 AM
to gensim
Hi Abu,

It depends on what you want to use the vectors for. Usually, a larger number of epochs gives better results. If you want to choose an 'optimal' number of epochs, you can, every 5/10/20/... epochs, build a model on the doc-vectors and evaluate it. From the metrics you obtain you can then select the required number of epochs.
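For example, something along these lines (a rough sketch only, using the old-style `size`/`iter` parameters seen elsewhere in this thread, and a hypothetical `evaluate_docvecs()` scorer you would write for your own task):

    import gensim

    def pick_epochs(documents, candidate_epochs=(5, 10, 20, 40)):
        scores = {}
        for n in candidate_epochs:
            # passing `documents` to the constructor builds the vocab and trains in one step
            model = gensim.models.Doc2Vec(documents, dm=1, size=100, min_count=5,
                                          iter=n, workers=4)
            scores[n] = evaluate_docvecs(model)  # hypothetical downstream metric
        return max(scores, key=scores.get)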

Gordon Mohr

Jun 29, 2017, 7:37:36 AM
to gensim
That's a very small dataset for Doc2Vec; the examples in the original paper used tens of thousands of documents, and follow-up papers used millions. 

It is rare to need to explicitly loop and call `train()` multiple times; I recommend against it unless you're absolutely certain you need to and know exactly what you're doing with explicitly-managed `alpha`/`min_alpha` values. 

Because of inherent randomness in the algorithm, which persists as multithreaded ordering jitter even if you try to force deterministic seeding of the pseudorandom number generator, results will vary from run to run, even with the exact same metaparameters. 
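(For reference, a minimal sketch of forcing as much determinism as gensim allows: a fixed `seed` plus `workers=1` removes the multithreaded ordering jitter, at the cost of much slower training; on Python 3 you may also need to set the `PYTHONHASHSEED` environment variable before launching.)

    import gensim

    # single worker + fixed seed: much slower, but repeatable run-to-run
    model = gensim.models.Doc2Vec(documents, dm=1, size=100, min_count=5,
                                  iter=20, seed=42, workers=1)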

Typical iterations in published work are 10 or 20. 

Training time is a linear function of epochs: twice as many will take twice as long. 

More iterations should generally help until a point of diminishing returns and then negligible added effect. (At least, they're helping the model on its internal word-prediction task.) If at some point more iterations seem to hurt on your own evaluation of doc-vec quality for a downstream task, you may be suffering from overfitting: a model large enough relative to your data that, as training goes deeper, it is essentially memorizing idiosyncrasies of the training set rather than doing generalizable-to-new-examples modeling. 

This is especially a risk with small datasets. Getting more data or shrinking the model (such as using a smaller vector `size`) may help restore the desired correlation between the model's internal goal and your external aims, and make it so more iterations never hurt (but maybe aren't helping enough to be worth the time). 

- Gordon

Abu

Jun 29, 2017, 3:20:09 PM
to gensim
Hi Ivan,

Thanks for your answer. I want to calculate the cosine similarity between those documents using the vectors from the Doc2Vec model and compare them. I have a way of evaluating the results. I see that after a specific number of epochs the results go down. But the strange thing is that if I keep increasing the epochs, at some point the results get better again. Is it because the model is over-fitting?
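(A simplified sketch, not Abu's actual code, of that kind of comparison, assuming `model` is the trained Doc2Vec model and the documents were tagged 'doc_0', 'doc_1', and so on:)

    # built-in cosine similarity between two trained doc-vectors, by tag
    sim = model.docvecs.similarity('doc_0', 'doc_1')

    # or equivalently, computed by hand from the raw vectors
    import numpy
    from gensim import matutils
    sim = numpy.dot(matutils.unitvec(model.docvecs['doc_0']),
                    matutils.unitvec(model.docvecs['doc_1']))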

Gordon Mohr

Jun 29, 2017, 4:16:05 PM
to gensim
I would not expect your evaluation score, plotted against iterations, to improve as iterations increase from 1 to N1, then get worse from N1 to N2, then get better again from N2 to N3. (Validation curves with respect to `iter` don't usually get that fancy; maybe in tiny run-to-run jitter, but not in long trends across ranges of `iter`.)

I would thus suspect some other bug in your setup – for example, if calling `train()` multiple times, it's really easy to self-manage the `alpha` wrong. (Some people mis-customizing code they've seen in online tutorials wind up running training passes with *negative* `alpha`, which means the model is nonsensically trying to increase its own word-prediction error after each training example.) 

If you can share your (pseudo-)code and representative numbers it might shed more light on what's happening. What training mode and other parameters are you using?

- Gordon

Abu

Jun 29, 2017, 5:25:15 PM
to gensim
Hi Gordon,

Thanks for your response. Here is the sample code I am using.

    import random
    import gensim

    model = gensim.models.Doc2Vec(dm=1, window=5, alpha=0.025, size=300,
                                  min_alpha=0.001, iter=1, min_count=2, workers=1)

    total_epoch = 30
    model.build_vocab(documents)
    model.iter = total_epoch
    total_docs = len(documents)

    # start training
    for epoch in range(total_epoch):
        random.shuffle(documents)
        model.train(documents, total_examples=total_docs, epochs=model.iter)
        model.alpha -= 0.001
        model.min_alpha = model.alpha


Am I choosing the right parameters for the model?

-Abu

Gordon Mohr

Jun 30, 2017, 4:49:22 AM
to gensim
Other than the handling of `iter` and multiple-`train()` calls, the parameters are reasonable starting points. (But see some extra comments at bottom.)

The main issue is that your manual looping results in nonsense `alpha` treatment. The first loop lets the alpha learning rate start at 0.025 and descend gradually to 0.001 over the course of that `train()` call. Then the next loop does a full pass with the alpha learning rate starting at 0.024 and ending at 0.024. (We've added a logged warning to the code for when it sees alpha jump back up from a lower to a higher value like this; did you see that warning in your logs?)

Then 0.023 to 0.023, and so on, until the 26th of the 30 loops, which manages alpha from 0.000 to 0.000: a training pass that makes no training updates to the model at all. And then alpha goes negative, so the last 4 loops use negative alphas, ending at -0.004. In those loops, the model checks its own word-prediction error, then updates itself to be *worse* at that prediction after each example. That's not at all what you want. 

If you want 30 passes, just use `iter=30` and call `train()` once – it will descend the alpha learning rate from 0.025 to 0.001, once, smoothly, as is appropriate for stochastic gradient descent. (Any theoretical tiny benefits from re-shuffling aren't worth the complications. Just make sure the dataset isn't arranged with all similar examples clumped together. If there's a risk of that, a single shuffle at the start would be sufficient.)
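In code, that recommended pattern looks roughly like this (a sketch keeping the other parameters from the snippet above; gensim handles the alpha decay from 0.025 down to 0.001 across all 30 passes internally):

    import gensim

    model = gensim.models.Doc2Vec(dm=1, window=5, alpha=0.025, size=300,
                                  min_alpha=0.001, iter=30, min_count=2, workers=1)
    model.build_vocab(documents)
    # a single train() call: model.iter (30) passes, one smooth alpha decay
    model.train(documents, total_examples=model.corpus_count, epochs=model.iter)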

Other thoughts:

* People often think that by using a smaller `min_count`, they're preserving more original information, and thus will get better results. But words with only a few examples can't be learned well, and whatever is learned may be idiosyncrasies of those occurrences (and thus less useful if/when those words are later interpreted in new examples). Also, retaining such less-learnable words winds up further spacing-out the other more-learnable words, and interleaving the 'hopeless cases' (with few examples) with the 'good cases', serving as a sort of interference. So quality of vectors often goes *up* when using a higher `min_count` – even as training takes less time. So be sure to try values higher than 2.

* With only ~1200 documents (of unstated average size), and trying to learn both doc- and word-vectors (as per `dm=1` mode) at a full 300 dimensions, you may be trying to train an overlarge model from underpowered data. It will be worth trying a smaller vector `size`. Also, if the doc-vector quality is the main goal and you don't need word-vectors, you could try the PV-DBOW mode (`dm=0`). By not trying to train word-vectors, all of the model's state/update effort will be directed into the doc-vectors. 

* If by chance your documents are larger than 10,000 tokens each, that could be helpful for getting better vectors (especially word-vectors), as you then have more data/contexts. But because of an internal implementation limit in the optimized gensim code, you'd need to split the overlong documents into multiple docs that are each smaller than 10,000 tokens – otherwise the tokens beyond 10,000 are ignored. 
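(A rough sketch of that splitting, assuming each document is a plain list of tokens and every chunk should carry the same tag; the 10,000 chunk size matches the internal limit mentioned above:)

    from gensim.models.doc2vec import TaggedDocument

    def split_long_doc(tokens, tag, max_len=10000):
        # sub-documents of at most max_len tokens, all sharing one tag, so no
        # tokens beyond the 10,000-token internal limit get silently ignored
        return [TaggedDocument(words=tokens[i:i + max_len], tags=[tag])
                for i in range(0, len(tokens), max_len)]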

- Gordon