Other than the handling of `iter` and the multiple `train()` calls, the parameters are reasonable starting points. (But see some extra comments at the bottom.)
The main issue is that your manual looping results in a nonsensical `alpha` treatment. The 1st loop allows the alpha learning rate to start at 0.025 and descend gradually, over the course of one pass over the dataset, to 0.001. Then the next loop does a full pass with alpha starting at 0.024 and ending at 0.024. (We've added a logged warning to the code for when it sees alpha jump back up from a lower to a higher value like this – did you see that warning in your logs?)
Then 0.023 to 0.023, and so on, until the 26th of the 30 loops, which runs alpha from 0.000 to 0.000 – a training pass that makes no updates to the model at all. Then alpha goes negative, so the last 4 loops use negative alphas, ending at -0.004. In those loops, the model checks its own word-prediction error on each example, then updates itself to be *worse* at that prediction. That's not at all what you want.
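(For illustration, the kind of loop being described usually looks something like the sketch below – an assumed reconstruction of the common tutorial pattern, not your exact code; note also that newer gensim versions require extra `total_examples`/`epochs` arguments to `train()`.)

```python
# Anti-pattern: manually managing alpha across repeated train() calls.
for epoch in range(30):
    model.train(documents)          # 1st pass decays alpha internally as usual
    model.alpha -= 0.001            # then 0.024, 0.023, ... 0.000, then negative
    model.min_alpha = model.alpha   # so every later pass runs at one fixed alpha
```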
If you want 30 passes, just use `iter=30` and call `train()` once – it will descend the alpha learning rate from 0.025 to 0.001 once, smoothly, as is appropriate for stochastic gradient descent. (Any tiny theoretical benefits from re-shuffling between passes aren't worth the complications. Just make sure the dataset isn't arranged with all similar examples clumped together; if there's a risk of that, a single shuffle at the start is sufficient.)
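For example, a minimal sketch of that setup – `tokenized_texts` is a stand-in for your own pre-tokenized documents, `workers=4` is arbitrary, and the parameter names (`size`, `iter`) follow the older gensim API in use here (newer releases rename them to `vector_size` and `epochs`):

```python
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

# Wrap each tokenized document with a tag (here, its index as a string).
documents = [TaggedDocument(words=tokens, tags=[str(i)])
             for i, tokens in enumerate(tokenized_texts)]

# Passing the corpus to the constructor builds the vocabulary and performs a
# single training run of 30 internal passes, with alpha decaying smoothly
# from 0.025 down to 0.001 over the whole run – no manual alpha management.
# (Equivalent to build_vocab() followed by exactly one train() call.)
model = Doc2Vec(documents, dm=1, size=300, min_count=2,
                iter=30, alpha=0.025, min_alpha=0.001, workers=4)
```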
Other thoughts:
* People often think that by using a smaller `min_count`, they're preserving more of the original information, and thus will get better results. But words with only a few usage examples can't be learned well, and whatever is learned may just reflect the idiosyncrasies of those few occurrences (and thus be less useful if/when those words are later interpreted in new examples). Also, retaining such less-learnable words winds up spacing-out the other, more-learnable words, and interleaving the 'hopeless cases' (with few examples) among the 'good cases' serves as a sort of interference. So the quality of the vectors often goes *up* with a higher `min_count` – even as training takes less time. So be sure to try values higher than 2. (See the `min_count` sweep sketched after this list.)
* With only ~1200 documents (of unstated average size), and trying to learn both doc- and word-vectors (as per `dm=1` mode) at a full 300 dimensions, you may be trying to train an overlarge model from underpowered data. It will be worth trying a smaller vector `size`. Also, if doc-vector quality is the main goal and you don't need word-vectors, you could try the PV-DBOW mode (`dm=0`). By not trying to train word-vectors, all of the model's state and update-effort is directed into the doc-vectors. (See the PV-DBOW sketch after this list.)
* If by chance your documents are larger than 10,000 tokens each, that extra text could help you get better vectors (especially word-vectors), since you'd have more data/contexts. But because of an internal implementation limit in the optimized gensim code, tokens beyond the first 10,000 of any document are silently ignored – so you'd need to split each overlong document into multiple docs of no more than 10,000 tokens. (See the splitting sketch after this list.)
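On the `min_count` point, a quick sweep like the following makes it easy to compare higher thresholds – a sketch only, re-using the hypothetical `documents` list from above, with the candidate values chosen arbitrarily:

```python
# Try several min_count thresholds; pruning rarer words often *improves*
# doc-vector quality while shrinking the model and speeding training.
for mc in (2, 5, 10, 20):
    m = Doc2Vec(documents, dm=1, size=300, min_count=mc,
                iter=30, alpha=0.025, min_alpha=0.001, workers=4)
    print(mc, len(m.wv.vocab))  # surviving vocab size (attribute name varies by gensim version)
    # ...then score m's doc-vectors on your own held-out task
```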
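On the model-size point, a leaner PV-DBOW configuration might look like this sketch – the `size=100` and `min_count=5` values are just placeholders to illustrate "smaller than 300" and "higher than 2", not specific recommendations:

```python
# PV-DBOW (dm=0): all training effort goes into the doc-vectors;
# no word-vectors are trained by default.
model_dbow = Doc2Vec(documents, dm=0, size=100, min_count=5,
                     iter=30, alpha=0.025, min_alpha=0.001, workers=4)
vec = model_dbow.docvecs['0']  # doc-vector lookup by tag (`.docvecs` in older gensim, `.dv` in 4.x)
```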
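And on the 10,000-token limit, a simple way to split over-long documents is sketched below (again using the hypothetical `tokenized_texts`); re-using the same tag for every chunk of a document means all the chunks still train that one shared doc-vector:

```python
MAX_TOKENS = 10000  # internal per-document limit in the optimized gensim code

split_docs = []
for doc_id, tokens in enumerate(tokenized_texts):
    # Emit one TaggedDocument per 10,000-token chunk, all sharing the same tag.
    for start in range(0, len(tokens), MAX_TOKENS):
        split_docs.append(TaggedDocument(words=tokens[start:start + MAX_TOKENS],
                                         tags=[str(doc_id)]))
```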
- Gordon