Overtraining the doc2vec model?


Gregory Larchev

Dec 17, 2015, 6:22:16 PM
to gensim
I'm fairly new to gensim and doc2vec, so, as a first step, I wanted to do a very small toy problem in order to convince myself that everything works as expected. In my toy problem, I have 10 input documents, each about a sentence long. Most of the terms in each sentence are fairly unique. I'd like to train the model on these inputs, then query against a term or a phrase (contained in one or more of the sentences) via the most_similar method, and have the correct sentence be the top result.

Here's my code for the model training portion:


#Create doc2vec model
model = Doc2Vec(size=1000, min_count=1, dbow_words=0, workers=cores*2, alpha=0.025, min_alpha=0.025, seed=2)
model.build_vocab(get_sentences())

for x in range(NUM_CYCLES):
    model.train(get_sentences())
    if ((x % NUM_CYCLES/10) == 0):
        model.alpha -= 0.002
        model.min_alpha = model.alpha
        logger.warn("Trained " + str(x) + " times")


The key parameter here is NUM_CYCLES, the number of training passes we do. Setting it to different values produces different results:


- For NUM_CYCLES = 10, the results are pretty much random (they change when a different training seed is selected). This makes sense, since, with 10 input documents, we haven't had a chance to train the neural network yet, so the weights are still random

- For NUM_CYCLES = 1000, the results are better, but not great

- For NUM_CYCLES = 10000, the results are pretty good -- more or less what I'd expect, as far as word/phrase matching goes

- However, for NUM_CYCLES = 100000, the results start degrading again.


Why do the results start degrading for a large number of passes? It seems like some sort of overtraining is happening, but I'm not sure how that's possible with this particular example.


Thanks.

Gordon Mohr

Dec 17, 2015, 8:21:24 PM
to gensim
Thoughts:

(1) Overfitting is definitely a risk with your parameters. 

10 sentences * ~12 words/sentence in english * ~12 bits of info per word (a common ballpark estimate) 
= 1440 bits = ~180 bytes of training info

plus:
10 doc vectors * 1000 dimensions * 4 bytes/dimension-float = 40,000 bytes of doc vectors (!)

~120-word vocabulary, huffman-encoded to log2(120) ≈ 7-bit codes, i.e. ~7 hierarchical-softmax output nodes per prediction
7 output nodes * 1000 hidden weights/node * 4 bytes/dimension-float
= 28,000 bytes of hidden/output weights (!)

Can a model with 68KB of internal state overfit/memorize ~180 bytes of training data? I would definitely think so.  
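If it helps to see that arithmetic in one place, here's a rough sketch in plain Python (the figures are ballpark guesses, not exact gensim internals, and a later reply in this thread revises the output-layer part upward):

    import math

    # training information: 10 short sentences
    docs, words_per_doc, bits_per_word = 10, 12, 12
    training_bytes = docs * words_per_doc * bits_per_word / 8       # ~180 bytes

    # model state at size=1000
    dims, bytes_per_float = 1000, 4
    doc_vector_bytes = docs * dims * bytes_per_float                # 40,000 bytes of doc vectors
    hs_code_bits = int(math.ceil(math.log(120, 2)))                 # ~7-bit huffman codes
    output_weight_bytes = hs_code_bits * dims * bytes_per_float     # 28,000 bytes of output weights

    print(training_bytes, doc_vector_bytes + output_weight_bytes)   # ~180 vs ~68,000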

For your simple tests – probing for exact words/phrases – memorization might not be the worst thing! Traditional full-text-search reverse-indexes are practically a memorization of input, but reorganized for easy keyword-querying. But you wouldn't really be testing or enjoying the benefits of dense/continuous vector models, in such a case. 

BUT...

(2) At NUM_CYCLES=130, you'll have decreased the alpha by 13*0.002=0.026, so it will be -0.001, that is, negative! It only goes more negative with more cycles. A negative learning-rate essentially means the model tries to *increase* its prediction-error after seeing each example. 

So I'm surprised that anything much over that cycle-count is doing anything useful. (And, I don't know for sure what you mean by "query against a term or phrase" – are you inferring a new vector for the term/short-phrase token lists?) But maybe for a certain range, since the model is still a funhouse mirror product of the training data, and the training data is small with certain patterns, some 'queries' would still resemble certain (anti-)learned doc vectors. 

Toy-sized exercises don't tend to illustrate word2vec/doc2vec algorithms very well; their strengths are the subtle/continuous relationships that emerge from larger sets – at least hundreds and even better thousands or millions of examples. (I'm not even sure the internal thread-batching and alpha-update code does reasonable things for tiny, 10-example training sets.)

But relevant to the key issues, you might start to get more sensible results if you...

* use models smaller than the training data, to achieve some test of the model's true ability to abstract/generalize/'compress' patterns in the data

* use `iter=NUM_CYCLES` in the Doc2Vec constructor, and a real/default `min_alpha` value, to let the code handle the alpha-decay for you over however many passes you want to test. Explicitly managing alpha/multiple-calls-to-`train()` is probably only justified if you're reshuffling the data, or otherwise want to take extra steps (like evaluating interim performance), between training passes.
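For example, something along these lines (just a sketch, reusing your own `get_sentences()` helper; `iter` is the constructor parameter for the number of training passes, and the library default `min_alpha` is already a small value like 0.0001):

    model = Doc2Vec(size=10, min_count=1, iter=NUM_CYCLES,
                    alpha=0.025, min_alpha=0.0001, seed=2)   # much smaller model than size=1000
    model.build_vocab(get_sentences())
    model.train(get_sentences())   # one call; with `iter` set, the passes and alpha-decay are handled internally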

Other observations regarding your chosen parameters:

* Use of `dbow_words=0` is moot because the code hasn't actually set DBOW mode (`dm=0`).

* I'm suspicious of using `min_count=1` for any real situations. Nothing general can be learned from unique occurrences – such tokens wind up essentially being another arbitrary identifier for the surrounding context/example, and thus 'noise' that takes time to process but can only dilute the predictiveness of other features. (Maybe if future to-be-inferred examples will repeat those tokens, they'd have some model value. Not sure.)

* Most results with significant datasets seem to prefer using negative-sampling (`hs=0, negative=N` with 2<=N<=10) rather than the default hierarchical-softmax (`hs=1, negative=0`). 

* Python thread locking issues make it hard for even the optimized code to fully use all cores; I've usually seen optimal throughput at worker counts less-than-or-equal-to the core count. (More-workers-than-cores only tends to help workloads where there are few locking bottlenecks, or where some threads could be blocked on IO lags... and neither apply here.)
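Pulling those observations together, a hypothetical constructor might look like the following (a sketch only, not a tuned recommendation for this data; size/iter/alpha as in the earlier sketch):

    model = Doc2Vec(dm=0,              # dbow_words only has an effect once DBOW mode (dm=0) is actually chosen
                    hs=0, negative=5,  # negative sampling instead of hierarchical softmax
                    min_count=2,       # drop words that appear only once
                    workers=cores,     # no more workers than cores
                    size=10, iter=NUM_CYCLES, seed=2)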

- Gordon

Gregory Larchev

Dec 18, 2015, 5:43:14 PM
to gensim
Thanks, Gordon, for a very extensive answer! A few thoughts:

- I don't think "anti-learning" is the issue in this case, as I'm careful to not let alpha go negative (in my training loop, I decrease the alpha every 1/10th of NUM_CYCLES -- for example, if NUM_CYCLES is 1000, then I decrease alpha every 100 passes.) But, it's good to know that setting iter=NUM_CYCLES and min_alpha in the constructor will accomplish the same thing. Will it decrease alpha monotonically for each training pass? Does it mean that I only have to call model.train once?

- I see that with this toy setup (10 doc vectors, 1000 dimensions) the internal state is much larger than the size of the training data. However, I'm still trying to understand how exactly the overtraining process happens. Ideally, I'd hope that in this scenario, the model will simply contain a large amount of redundancy (such that a lot of the weights for each node will essentially do the same thing).

I understand that Doc2Vec models are best suited for extracting more general relationships from large datasets, and that toy problems do not illustrate that well. The reason I'm doing a toy problem is to gain some insight into what the model is doing.

Here's what I'm doing, illustrated by an example. Let's consider only 2 sentences instead of 10:

Sentence 1: The quick brown fox jumps over the lazy dog.
Sentence 2: More generally speaking, dual-use can also refer to any technology which can satisfy more than one goal at any given time.

Let's say my query is "brown fox" -- when I run the most_similar method on the vector inferred from the tokenized phrase "brown fox", I expect that S1 will have a higher score than S2. This typically happens with a "medium-trained" model (NUM_CYCLES = 10000), but not so much with an "overtrained" model (NUM_CYCLES = 100000).
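For concreteness, the query step looks roughly like this (a sketch; it assumes the sentences were given tags like 'S1', 'S2' when building the TaggedDocuments, and the exact most_similar call may vary by gensim version):

    query_vec = model.infer_vector(['brown', 'fox'])
    print(model.docvecs.most_similar(positive=[query_vec], topn=2))
    # hoping 'S1' ranks above 'S2'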

My intuition is that the presence of certain words ("brown", "fox") contributes to the positive score, while the absence of other words ("lazy", "dog") contributes to the negative score. For an "overtrained" model, this effect is more pronounced -- that is, the model "punishes" the query more because the query doesn't contain the words "lazy" and "dog". In that case, though, I would expect a query "brown fox jumps" to generate a higher positive score than just "brown fox" (even for an overtrained model). However, I've found this not to be the case, especially when the word order is shuffled. I'd like to hear your thoughts on this.

I'll go ahead and train the model with a smaller number of dimensions (like maybe 10 or so), and also try negative-sampling, to see what happens.

Thanks again for your help!
Gregory

Gordon Mohr

Dec 18, 2015, 6:57:55 PM
to gensim
On Friday, December 18, 2015 at 2:43:14 PM UTC-8, Gregory Larchev wrote:
> Thanks, Gordon, for a very extensive answer! A few thoughts:
>
> - I don't think "anti-learning" is the issue in this case, as I'm careful to not let alpha go negative (in my training loop, I decrease the alpha every 1/10th of NUM_CYCLES -- for example, if NUM_CYCLES is 1000, then I decrease alpha every 100 passes.) But, it's good to know that setting iter=NUM_CYCLES and min_alpha in the constructor will accomplish the same thing. Will it decrease alpha monotonically for each training pass? Does it mean that I only have to call model.train once?

Aha; I see I was misreading the code – and alpha won't go negative. But, I believe your interpretation is also wrong: the `%` modulus operator has the same precedence as `/` division, and they evaluate left-to-right. So your test is essentially (regrouped):

   if (((x % NUM_CYCLES) / 10) == 0):

And further, the interpretation of `/` will vary between python 2 (where it's integer/floor division) and python 3 (where it's 'true' division). So in python 2, this code decreases alpha the first 10 times through the loop... then no more. In python 3, only the very first time through – x=0 – will trigger the decrease. 
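Presumably the intended test, decreasing alpha every NUM_CYCLES/10 passes, would need explicit parentheses and floor division (so it behaves the same under python 2 and 3), something like:

    if x % (NUM_CYCLES // 10) == 0:
        model.alpha -= 0.002
        model.min_alpha = model.alpha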

Yes, using `iter` and a `min_alpha` is intended to give a gradual, linear decrease of the learning rate across the whole of training. BUT, the alpha is only updated after each batch-of-examples passed to the worker threads, and the default batch size is way larger than your tiny dataset... so in practice I don't think the smooth decrease will be achieved without more training data, or perhaps a massive number of iterations. 

> - I see that with this toy setup (10 doc vectors, 1000 dimensions) the internal state is much larger than the size of the training data. However, I'm still trying to understand how exactly the overtraining process happens. Ideally, I'd hope that in this scenario, the model will simply contain a large amount of redundancy (such that a lot of the weights for each node will essentially do the same thing).
>
> I understand that Doc2Vec models are best suited for extracting more general relationships from large datasets, and that toy problems do not illustrate that well. The reason I'm doing a toy problem is to gain some insight into what the model is doing.
>
> Here's what I'm doing, illustrated by an example. Let's consider only 2 sentences instead of 10:
>
> Sentence 1: The quick brown fox jumps over the lazy dog.
> Sentence 2: More generally speaking, dual-use can also refer to any technology which can satisfy more than one goal at any given time.
>
> Let's say my query is "brown fox" -- when I run the most_similar method on the vector inferred from the tokenized phrase "brown fox", I expect that S1 will have a higher score than S2. This typically happens with a "medium-trained" model (NUM_CYCLES = 10000), but not so much with an "overtrained" model (NUM_CYCLES = 100000).
>
> My intuition is that the presence of certain words ("brown", "fox") contributes to the positive score, while the absence of other words ("lazy", "dog") contributes to the negative score. For an "overtrained" model, this effect is more pronounced -- that is, the model "punishes" the query more because the query doesn't contain the words "lazy" and "dog". In that case, though, I would expect a query "brown fox jumps" to generate a higher positive score than just "brown fox" (even for an overtrained model). However, I've found this not to be the case, especially when the word order is shuffled. I'd like to hear your thoughts on this.

Until the alpha is well-scheduled and the training data expanded to be larger than the model, the situation may be so different from real scenarios that any lessons learned are of limited use (except in understanding the boundaries of when the algorithm can't usefully be applied). 

Because you're using DM mode, the window value (default 8) is relevant: the neural-network is learning to predict a target word from the (sum of the) doc-vector and the vectors of up to 16 surrounding words. So for your short sentence, it's learning to predict each word, given the presence of every other word. And for the larger sentence, each word from *almost* every other word. 

Given how much larger the model is than the training data, even with the alpha issues, the NN probably becomes very good at predicting exactly the target words from exactly the contexts in which they appear. It only ever sees `fox` as the output of exactly `sum(vectors(['S1', 'The', 'quick', 'brown', 'jumps', 'over', 'the', 'lazy', 'dog']))`. When you later ask it to infer a vector for ['brown', 'fox'], it makes the best S3 vector it can for the new prediction-goals: (1) predict 'brown' from ['S3', 'fox']; and (2) predict 'fox' from ['S3', 'brown']. 

But there are so many free parameters, and such minimal/discrete text data, that all the vector-weights that made the model so good at the full-sentence prediction may be very idiosyncratic, no longer having the smooth/intuitive 'nearnesses' we were hoping to achieve. This method essentially needs gradations of co-occurrences for its benefits to arise... but discrete toy examples against large models don't provide that. 

(You mention an effect of word order... but the only effect of word-order in DM training is whether words appear in the same window, and with `window=8` the only reason an inferred vector for ['brown', 'dog', 'jumps'] would be different from one for ['jumps', 'brown', 'dog'] would be the various ways randomness is used to seed the algorithm.)

(Also upon further thought, my calculation of the hierarchical-softmax output layer size of '7' was an underestimate; the codes for predicted-words will be 7 bits long, but those bits select output nodes from a tree with about as many nodes as vocabulary words. So that part of the model is ~120 HS nodes * 1000 dimensions * 4 bytes/dimension = 480KB. And using DM means every input-context word gets a 1000-dimensional vector, too – another 480KB. So my estimate is now that your model was about 1MB in size, and being trained with ~180 bytes worth of training info. So just by general proportions, a definite recipe for severe overfitting.)

- Gordon

Gregory Larchev

Dec 21, 2015, 10:59:56 AM
to gensim
Ok, thanks for the explanation!