Regarding reasonable performance expectations:
Essentially, no adequate amount of training data for `Doc2Vec` will result in training completing in 2 minutes. Only tiny, toy-sized demo datasets will complete in that amount of time. Even a partial replication of the original `Paragraph Vector` paper's IMDB movie-review results (~100,000 short texts of no more than a few hundred words each) takes at least tens of minutes on common systems. It's been a while since I did a bulk training on a Wikipedia dump with millions of documents, some with thousands of words, but I recall that taking 14h+.
If there's some part of the official docs that sets an unrealistic expectation, please provide a link so it can be corrected.
Specific to what you've shown:
Your data setup & general steps seem proper; however, there is rarely any reason for users to call `.train()` multiple times inside their own loop, and manage the `alpha` value decay themselves. It's unnecessary, overcomplicated, & error-prone – but for some reason this anti-pattern has been widely copied by poor-quality online tutorials that barely seem to understand what they're doing (much less explain it to their learning readers).
Specifically in your case, 40 of your own loops outside a `.train()` that is itself doing 40 internal epochs means you're doing 1600 total passes over your data – overkill that surely wasn't intended. That's likely the biggest reason for the surprisingly lengthy training time.
Further, manually decrementing the starting `alpha` from its normal default of 0.025 by 0.0002, 40 times, only brings it down to 0.017 – whereas normal SGD would decay it fully to a value near 0.0. And because this anti-pattern code only tampers with `min_alpha` *after* the 1st call, you'll actually get one proper full decay, using the proper defaults automatically (on the 1st loop), then 39 more passes whose learning-rate schedule is nonsense. This won't lengthen run-time, but it will make results unrepresentative (& probably bad).
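To make that arithmetic concrete, here's a trivial sketch (plain Python, using the start value, decrement, & loop count from the commonly-copied anti-pattern):

```python
# Manual decay as done by the copied anti-pattern: 40 decrements of 0.0002
alpha = 0.025
for _ in range(40):
    alpha -= 0.0002

# ends at ~0.017 -- nowhere near the near-0.0 that a proper full SGD decay reaches
print(round(alpha, 4))
```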
Call `.train()` exactly once, unless you're an expert with a clear idea of why you're doing something very non-standard & error-prone.
If you want to see incremental progress, enabling logging at the INFO level is a good idea. (If you need to log some interim evaluation, it's also possible to set up end-of-epoch callbacks. But keep in mind that an evaluation after, say, 20 epochs of a planned 40-epoch run will be measuring something quite different from an evaluation of a true 20-epoch run at its end.)
- Gordon