You could follow the example of the doc2vec-wikipedia notebook (
https://github.com/RaRe-Technologies/gensim/blob/develop/docs/notebooks/doc2vec-wikipedia.ipynb) up to the point of getting the Wikipedia data, but then write the items you get back from `get_texts()` to an interim file of titles & tokens – discarding tokens in excess of some threshold before writing. (This one-time process could also discard too-small articles.)
Then, read that file back via a new corpus-iterator to do your training. On the downside, you'd still have to download and scan the full dump once. On the upside, the truncated file should be much faster to re-iterate over for multiple training passes – as it's now just titles & plain text, rather than the original XML dump.
Alternatively, look into the abstracts-download or per-article summary downloading I'd mentioned in the previous message.
I wouldn't recommend using `api.load()` for anything you could reasonably do yourself - it hides steps/details in unhelpful ways.
- Gordon