The Gensim project source code (
https://github.com/RaRe-Technologies/gensim/) contains in its `docs/notebooks` directory a bunch of example notebooks for common uses, including one applying the simplest & fastest `Doc2Vec` mode (`pv_dbow`) to a recent Wikipedia dump.
It streams the articles from the dump copy – avoiding loading the giant corpus into memory as a single `list` – & also starts with a one-time conversion to a plain-text dump better suited to most Gensim uses. Its parameters are a reasonable starting point, but shouldn't be considered highly-optimized or best-practice – once you have evaluations that fit your particular use (or custom data), & if you have the time to explore other settings, you can probably get additional improvements via more parameter tweaking.
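The streaming idea is just a restartable iterable that re-reads the file on each pass, so `Doc2Vec`'s multiple training epochs never need the whole corpus in RAM. Here's a minimal sketch, assuming a hypothetical plain-text file with one pre-tokenizable article per line (in real Gensim use you'd yield `gensim.models.doc2vec.TaggedDocument` objects rather than plain tuples):

```python
class StreamingCorpus:
    """Memory-frugal corpus: re-reads the file on every iteration pass.

    Sketch only – the one-article-per-line plain-text format is an
    assumption, & real Doc2Vec training expects TaggedDocument items.
    """

    def __init__(self, path):
        self.path = path

    def __iter__(self):
        # Re-opening the file each time makes the corpus restartable,
        # which Doc2Vec needs for its vocabulary scan + training epochs.
        with open(self.path, encoding="utf-8") as f:
            for doc_id, line in enumerate(f):
                tokens = line.split()  # naive whitespace tokenization
                yield (tokens, [doc_id])
```

Because iteration re-opens the file, you can pass one instance to both the vocabulary-building and training steps without materializing anything.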
I personally recommend against using the `api.load()` auto-downloading facility. As used here, you're getting a ~6-year-old dump – instead of the dump from *yesterday* that you could download from official Wikipedia sources. You're also not really seeing where it lands as a file on your system, what its format/size is, or what sort of processing happens to return that single resultant object from the `load()` call.
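If you do use the downloader & at least want to see what landed where, note that it keeps its datasets under a `~/gensim-data` directory by default, & `api.load(name, return_path=True)` returns the on-disk path instead of a processed object. A small sketch (computing the path needs no gensim; the commented dataset name is just an example from the gensim-data catalog):

```python
import os

# gensim's downloader stores everything under ~/gensim-data by default.
gensim_data_dir = os.path.expanduser(os.path.join("~", "gensim-data"))

# With gensim installed, you can inspect rather than auto-load:
#   import gensim.downloader as api
#   path = api.load("wiki-english-20171001", return_path=True)
#   print(path)  # the downloaded file's location under ~/gensim-data
```

Even then, you're still trusting the gensim-data processing code – downloading from Wikipedia directly sidesteps that entirely.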
Further, the code that does the dataset-specific processing is neither source-controlled in a git project nor version-released like normal project code – not great from the perspectives of code maintainability/transparency or security hygiene.
If you instead download the file from its canonical source yourself, you decide where to put it & can see its size/format. As the example notebook shows, streaming it into other explicit formats is usually just a few lines of boilerplate code, which are still worth seeing & understanding as prep for later customization with alternate data sources.
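For a flavor of that boilerplate, here's a hedged sketch that streams a bz2-compressed text file into a plain-text file line by line, never holding the corpus in memory. (Real Wikipedia dumps are XML inside the bz2 & need actual parsing – Gensim's `WikiCorpus` handles that – so the filenames & flat-text assumption here are illustrative only.)

```python
import bz2

def bz2_to_plaintext(src_path, dst_path):
    """Stream a .bz2 text file out to plain text, one line at a time.

    Sketch only: a real Wikipedia dump is XML & needs parsing
    (e.g. gensim.corpora.WikiCorpus) before it's useful plain text.
    """
    with bz2.open(src_path, "rt", encoding="utf-8") as src, \
         open(dst_path, "w", encoding="utf-8") as dst:
        for line in src:
            dst.write(line)  # constant memory: one line in flight at a time
```

Swapping in a different data source later usually just means replacing this small conversion step, which is exactly why it's worth understanding.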
- Gordon