# Create the doc2vec model
import logging
import multiprocessing
from gensim.models.doc2vec import Doc2Vec

logger = logging.getLogger(__name__)
cores = multiprocessing.cpu_count()
NUM_CYCLES = 10000  # number of training passes; the different values I tried are discussed below
# get_sentences() returns my 10 tagged training documents (definition omitted here)
model = Doc2Vec(size=1000, min_count=1, dbow_words=0, workers=cores*2, alpha=0.025, min_alpha=0.025, seed=2)
model.build_vocab(get_sentences())
for x in range(NUM_CYCLES):
    model.train(get_sentences())
    if x % (NUM_CYCLES // 10) == 0:  # decay alpha every 1/10th of the passes
        model.alpha -= 0.002
        model.min_alpha = model.alpha
        logger.warning("Trained " + str(x) + " times")
The key parameter here is NUM_CYCLES, the number of training passes over the corpus. Setting it to different values produces quite different results (a sketch of how I compare the settings follows the list):
- For NUM_CYCLES = 10, the results are essentially random (they change if I pick a different training seed). This makes sense: with only 10 input documents, the network has barely been trained, so the weights are still close to their random initialization.
- For NUM_CYCLES = 1000, the results are better, but not great
- For NUM_CYCLES = 10000, the results are pretty good -- more or less what I'd expect, as far as word/phrase matching goes
- However, for NUM_CYCLES = 100000, the results start degrading again.
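To make "results" concrete: for each value of NUM_CYCLES I retrain from scratch (same seed) and run the same most_similar query against the vector inferred for a test phrase. The snippet below is a simplified sketch of that comparison, not my literal script -- train_model() here is just a hypothetical wrapper around the build_vocab/train loop above, and the test phrase is only an example:

# Simplified comparison sketch (train_model is a hypothetical wrapper around
# the build_vocab/train loop shown above; same seed=2 each time)
query_tokens = "quick brown fox".lower().split()
for num_cycles in (10, 1000, 10000, 100000):
    m = train_model(num_cycles)                # fresh model for each setting
    inferred = m.infer_vector(query_tokens)    # vector for the test phrase
    print(num_cycles, m.docvecs.most_similar([inferred], topn=3))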
Why do the results start degrading for a large number of passes? It seems like some sort of overtraining is happening, but I'm not sure how that's possible with this particular example.
Thanks.
Thanks, Gordon, for a very extensive answer! A few thoughts:
- I don't think "anti-learning" is the issue in this case, as I'm careful not to let alpha go negative (in my training loop, I decrease alpha every 1/10th of NUM_CYCLES -- for example, if NUM_CYCLES is 1000, then I decrease alpha every 100 passes). But it's good to know that setting iter=NUM_CYCLES and min_alpha in the constructor accomplishes the same thing. Will that decrease alpha monotonically on every training pass? And does it mean I only have to call model.train once? I've sketched my understanding of that setup at the end of this message.
- I see that with this toy setup (10 doc vectors, 1000 dimensions) the internal state is much larger than the training data. However, I'm still trying to understand how exactly the overtraining happens. Ideally, I'd hope that in this scenario the model would simply contain a large amount of redundancy (such that many of the weights for each node essentially do the same thing).

I understand that Doc2Vec models are best suited for extracting general relationships from large datasets, and that toy problems don't illustrate that well. The reason I'm using a toy problem is to hopefully gain some insight into what the model is doing.

Here's what I'm doing, illustrated by an example. Let's consider only 2 sentences instead of 10:

Sentence 1: The quick brown fox jumps over the lazy dog.
Sentence 2: More generally speaking, dual-use can also refer to any technology which can satisfy more than one goal at any given time.

Let's say my query is "brown fox" -- when I run the most_similar method on the vector inferred from the tokenized phrase "brown fox", I expect S1 to get a higher score than S2. This typically happens with a "medium-trained" model (NUM_CYCLES = 10000), but not so much with an "overtrained" model (NUM_CYCLES = 100000). (A rough sketch of how I run this query is at the end of this message.)

My intuition is that the presence of certain words ("brown", "fox") contributes to a positive score, while the absence of other words ("lazy", "dog") contributes to a negative score. For an "overtrained" model this effect is more pronounced -- that is, the model "punishes" the query more because it doesn't contain the words "lazy" and "dog". In that case, though, I would expect the query "brown fox jumps" to score higher than just "brown fox" (even for an overtrained model), but I've found that not to be the case. This is especially true when the word order is shuffled. I'd like to hear your thoughts on this.
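For reference, here is roughly how I run the "brown fox" query above (simplified from my actual code):

# Rough sketch of the query against the trained model (simplified)
query = "brown fox".lower().split()          # tokenized query phrase
inferred = model.infer_vector(query)         # infer a doc vector for the phrase
# rank the sentence vectors by similarity to the inferred vector;
# I expect the tag for Sentence 1 to come out on top
print(model.docvecs.most_similar([inferred], topn=2))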
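And regarding the iter/min_alpha suggestion from my first point: my understanding of that setup is something like the snippet below -- this is just my reading of your suggestion, so please correct me if the parameters or the single train() call aren't quite right:

# My understanding of the suggested setup: let gensim decay alpha internally
# from alpha down to min_alpha over iter passes, with a single train() call
model = Doc2Vec(size=1000, min_count=1, dbow_words=0, workers=cores*2,
                alpha=0.025, min_alpha=0.0001, iter=NUM_CYCLES, seed=2)
model.build_vocab(get_sentences())
model.train(get_sentences())   # one call; iter passes with decaying alpha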