Noob question - how to train a doc2vec model using a built-in corpus?

Felix Goldberg
Aug 22, 2023, 10:54:47 AM
to Gensim
Hi,

I would like to train a Doc2Vec model using the "wiki-english-20171001" corpus shipped with Gensim.

After getting an iterable pointer to the corpus with

import gensim.downloader as api
corpus = api.load('wiki-english-20171001')

I am a bit stuck, because the Doc2Vec model doesn't seem to accept such an object as input. I suppose some sort of pre-processing or casting is required, but I don't quite understand what. The tutorial contains a read_corpus function, but it doesn't seem right to apply it to the huge corpus.
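
For reference, the tutorial's read_corpus is roughly the following (my paraphrase of the tutorial, not its exact code): it streams one small file line-by-line into TaggedDocuments, which is why applying it to the full wiki corpus seems off.

import gensim
import smart_open

def read_corpus(fname, tokens_only=False):
    # Stream a file with one document per line, as in the Gensim Doc2Vec tutorial.
    with smart_open.open(fname, encoding="iso-8859-1") as f:
        for i, line in enumerate(f):
            tokens = gensim.utils.simple_preprocess(line)
            if tokens_only:
                yield tokens
            else:
                # For training, each document needs a tag; here, its line number.
                yield gensim.models.doc2vec.TaggedDocument(tokens, [i])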

I am sure I am missing something basic here, please advise :)

Also, I would greatly appreciate any suggestions for good hyperparameter values.

Thanks,
FG

Gordon Mohr
Aug 22, 2023, 1:25:42 PM
to Gensim
The Gensim project source code (https://github.com/RaRe-Technologies/gensim/) contains in its `docs/notebooks` directory a bunch of example notebooks for common uses, including applying the simplest & fastest `Doc2Vec` mode (PV-DBOW, i.e. `dm=0`) to a recent Wikipedia dump.

It streams the articles from the dump copy – avoiding loading the giant corpus into memory as a single `list` – & also starts with a one-time conversion to a plain-text dump that's better suited to most Gensim uses. Its parameters are a reasonable starting point, but shouldn't be considered highly-optimized best practice: once you have evaluations that fit your particular use (or custom data), & if you have the time to explore other settings, you can probably get additional improvements via more parameter tweaking.
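
In rough outline, the notebook's approach has this shape. (This is a simplified sketch, not the notebook verbatim; the filename, tagging scheme, & parameter values are just illustrative starting points, & it assumes the dump has already been converted to a one-article-per-line plain-text file, as sketched further below.)

import smart_open
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

class TaggedWikiCorpus:
    # Stream TaggedDocuments from a plain-text file with one
    # pre-tokenized article per line; restartable for multiple passes.
    def __init__(self, fname):
        self.fname = fname
    def __iter__(self):
        with smart_open.open(self.fname, encoding='utf-8') as f:
            for i, line in enumerate(f):
                # Plain-int tags let Gensim use compact int indexing.
                yield TaggedDocument(words=line.split(), tags=[i])

corpus = TaggedWikiCorpus('wiki-articles.txt')  # hypothetical filename

model = Doc2Vec(
    dm=0,            # PV-DBOW, the simplest & fastest mode
    vector_size=200,
    window=8,
    min_count=20,    # a wiki-sized corpus can afford a high frequency floor
    epochs=10,
    workers=8,
)
model.build_vocab(corpus)
model.train(corpus, total_examples=model.corpus_count, epochs=model.epochs)

After training, `model.dv` holds the per-article vectors, & `model.infer_vector(tokens)` gives a vector for any new list of tokens.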

I personally recommend against using the `api.load()` auto-downloading facility. As here, you're getting a ~6-year-old dump – instead of the dump from *yesterday* that you could download from official Wikipedia sources. You're also not really seeing where it lands as a file on your system, what its format/size is, or what sort of processing happens to return that single resultant object from the `load()` call.

Further, the code which does the dataset-specific processing is neither source-controlled in a git project nor version-released like normal project code – not great from the perspectives of code maintainability/transparency or security hygiene.

If you instead download the file from its canonical source yourself, you decide where to put it & can see its size/format. As per the example notebook, streaming it into other explicit formats is usually just a few lines of boilerplate code, which are good to see & understand as prep for later customization with alternate data sources.
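
For example, a one-time conversion of an official dump to the plain-text format assumed above might look roughly like this. (A sketch: the input filename is whatever you downloaded from https://dumps.wikimedia.org/, & passing `dictionary={}` just skips an unneeded vocabulary-building pass.)

import smart_open
from gensim.corpora.wikicorpus import WikiCorpus

# Parse the compressed XML dump, stripping wiki markup, yielding token lists.
wiki = WikiCorpus('enwiki-latest-pages-articles.xml.bz2', dictionary={})

with smart_open.open('wiki-articles.txt', 'w', encoding='utf-8') as out:
    for tokens in wiki.get_texts():
        out.write(' '.join(tokens) + '\n')  # one article per line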

- Gordon 