Noob question - how to train a doc2vec model using a built-in corpus?

Felix Goldberg
Aug 22, 2023, 10:54:47 AM
to Gensim
Hi,

I would like to train a Doc2Vec model using the "wiki-english-20171001" corpus shipped with Gensim.

After getting an iterable pointer to the corpus with

import gensim.downloader as api
corpus = api.load('wiki-english-20171001')

I am a bit stuck, because the Doc2Vec model doesn't seem to accept such an object as input. I suppose some sort of pre-processing or casting is required, but I don't quite understand what. The tutorial contains a read_corpus function, but it doesn't seem right to apply it to the huge corpus.
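
For reference, the tutorial's read_corpus is roughly the following (my paraphrase of the tutorial, not its exact code): it streams one small file line-by-line into TaggedDocuments, which is why applying it to the full wiki corpus seems off.

import gensim
import smart_open

def read_corpus(fname, tokens_only=False):
    # Stream a file with one document per line, as in the Gensim Doc2Vec tutorial.
    with smart_open.open(fname, encoding="iso-8859-1") as f:
        for i, line in enumerate(f):
            tokens = gensim.utils.simple_preprocess(line)
            if tokens_only:
                yield tokens
            else:
                # For training, each document needs a tag; here, its line number.
                yield gensim.models.doc2vec.TaggedDocument(tokens, [i])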

I am sure I am missing something basic here, please advise :)

Also, I would greatly appreciate any suggestions for good hyperparameter values.

Thanks,
FG

Gordon Mohr
Aug 22, 2023, 1:25:42 PM
to Gensim
The Gensim project source code (https://github.com/RaRe-Technologies/gensim/) contains in its `docs/notebooks` directory a bunch of example notebooks for common uses, including applying the simplest & fastest `Doc2Vec` mode (PV-DBOW, i.e. `dm=0`) to a recent Wikipedia dump.

It streams the articles from the dump copy – avoiding loading the giant corpus into memory as a single `list` – & also starts with a one-time conversion to a plain-text dump that's better suited to most Gensim uses. Its parameters are a reasonable starting point, but shouldn't be considered highly-optimized best practice: once you have evaluations that fit your particular use (or custom data), & if you have the time to explore other settings, you can probably get additional improvements via more parameter tweaking.
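
In rough outline, the notebook's approach has this shape. (This is a simplified sketch, not the notebook verbatim; the filename, tagging scheme, & parameter values are just illustrative starting points, & it assumes the dump has already been converted to a one-article-per-line plain-text file, as sketched further below.)

import smart_open
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

class TaggedWikiCorpus:
    # Stream TaggedDocuments from a plain-text file with one
    # pre-tokenized article per line; restartable for multiple passes.
    def __init__(self, fname):
        self.fname = fname
    def __iter__(self):
        with smart_open.open(self.fname, encoding='utf-8') as f:
            for i, line in enumerate(f):
                # Plain-int tags let Gensim use compact int indexing.
                yield TaggedDocument(words=line.split(), tags=[i])

corpus = TaggedWikiCorpus('wiki-articles.txt')  # hypothetical filename

model = Doc2Vec(
    dm=0,            # PV-DBOW, the simplest & fastest mode
    vector_size=200,
    window=8,
    min_count=20,    # a wiki-sized corpus can afford a high frequency floor
    epochs=10,
    workers=8,
)
model.build_vocab(corpus)
model.train(corpus, total_examples=model.corpus_count, epochs=model.epochs)

After training, `model.dv` holds the per-article vectors, & `model.infer_vector(tokens)` gives a vector for any new list of tokens.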

I personally recommend against using the `api.load()` auto-downloading facility. As here, you're getting a ~6-year-old dump – instead of the dump from *yesterday* that you could download from official Wikipedia sources. You're also not really seeing where it lands as a file on your system, what its format/size is, or what sort of processing happens to return that single resultant object from the `load()` call.

Further, the code which does the dataset-specific processing is neither source-controlled in a git project nor version-released like normal project code – not great from the perspectives of code maintainability/transparency or security hygiene.

If you instead download the file from its canonical source yourself, you decide where to put it & can see its size/format. As per the example notebook, streaming it into other explicit formats is usually just a few lines of boilerplate code, which are good to see & understand as prep for later customization with alternate data sources.
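
For example, a one-time conversion of an official dump to the plain-text format assumed above might look roughly like this. (A sketch: the input filename is whatever you downloaded from https://dumps.wikimedia.org/, & passing `dictionary={}` just skips an unneeded vocabulary-building pass.)

import smart_open
from gensim.corpora.wikicorpus import WikiCorpus

# Parse the compressed XML dump, stripping wiki markup, yielding token lists.
wiki = WikiCorpus('enwiki-latest-pages-articles.xml.bz2', dictionary={})

with smart_open.open('wiki-articles.txt', 'w', encoding='utf-8') as out:
    for tokens in wiki.get_texts():
        out.write(' '.join(tokens) + '\n')  # one article per line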

- Gordon 