Training doc2vec on Wikipedia dump AND user-supplied list of TaggedDocuments


Nicolaj Mühlbach

Apr 15, 2021, 10:49:13 AM
to Gensim
Hi there!

I've followed the online tutorial to train a doc2vec model on the latest Wikipedia dump (https://github.com/RaRe-Technologies/gensim/blob/develop/docs/notebooks/doc2vec-wikipedia.ipynb).

However, I have ~25k tagged documents of my own that I would additionally like to include in the training.

I have tried something like: model.build_vocab(documents_tagged+[wiki])

but this throws an error: AttributeError: 'TaggedWikiDocument' object has no attribute 'words' 

Is there any way to combine the Wikipedia dump and other documents for training?

Best,
Nicolaj.

Gordon Mohr

Apr 16, 2021, 7:55:53 PM
to Gensim
Let's assume your `documents_tagged` is a list of exactly 25,000 `TaggedDocument` instances, while `wiki` is a single instance of `TaggedWikiDocument`, which (despite its misleading singular name) is actually an iterable which can re-iterate over exactly 5,000,000 wiki articles.

You can't usefully concatenate (`+`) your list with a one-element list whose single element is itself an iterable sequence of wiki articles. You'll get a list with 25,001 items: 25,000 instances of `TaggedDocument`, then one instance of the iterable `TaggedWikiDocument` (which is not itself a document with the needed `tags` and `words` properties).

You want instead a re-iterable sequence of 5,025,000 items, all of which are `TaggedDocument` instances. 
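To make the mismatch concrete, here's a tiny self-contained sketch. The names here are toy stand-ins, purely for illustration: a namedtuple in place of gensim's `TaggedDocument` (which is itself a namedtuple with `words` and `tags` fields), and a small class playing the role of `TaggedWikiDocument`:

```python
from collections import namedtuple

# Stand-in for gensim's TaggedDocument, which is itself a namedtuple
TaggedDocument = namedtuple('TaggedDocument', ['words', 'tags'])

class ToyWikiIterable:  # plays the role of TaggedWikiDocument
    def __iter__(self):
        yield TaggedDocument(words=['some', 'wiki', 'text'], tags=['wiki_0'])

my_docs = [TaggedDocument(words=['my', 'doc'], tags=['doc_0'])]

bad = my_docs + [ToyWikiIterable()]
# bad[-1] is the wrapping iterable itself, not a document: it has no
# `words` attribute, which is exactly what build_vocab() complained about.
print(hasattr(bad[0], 'words'))   # True
print(hasattr(bad[-1], 'words'))  # False
```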

If your system has enough RAM to bring the whole wiki dataset into memory – not as rare as it used to be – you could simply create one big list:

    docs_for_training = documents_tagged + list(wiki)  # `list()` iterates all 5M articles into an in-memory list

(By doing the parsing behind `wiki.get_texts()` only once, this also avoids some time-consuming repeated effort across multiple re-iterations.)

Under more common constraints, you'd simply want to create a wrapping iterable class that ensures that each time it is iterated over, both your docs, and the other articles, are included. Something like this should do the trick:

    import itertools as it

    class ChainIterable:
        """Re-iterable wrapper chaining several iterables of documents."""
        def __init__(self, *list_of_iterables):
            self.list_of_iterables = list_of_iterables
        def __iter__(self):
            # Build a fresh chain on every call, so the combined corpus
            # can be iterated repeatedly (unlike a plain generator)
            return it.chain(*self.list_of_iterables)

Then, use this to chain the two separate doc iterables together:

    docs_for_training = ChainIterable(documents_tagged, wiki)
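For example, here's a self-contained sketch (with toy stand-ins for `TaggedDocument` and `TaggedWikiDocument`, since the real corpus is huge) showing the key property: the chained object can be iterated more than once, which matters because `Doc2Vec` needs one full pass for `build_vocab()` and further passes for `train()`:

```python
import itertools as it
from collections import namedtuple

# Toy stand-in for gensim's TaggedDocument (itself a namedtuple)
TaggedDocument = namedtuple('TaggedDocument', ['words', 'tags'])

class ChainIterable:
    def __init__(self, *list_of_iterables):
        self.list_of_iterables = list_of_iterables
    def __iter__(self):
        return it.chain(*self.list_of_iterables)

class ToyWiki:  # plays the role of the re-iterable TaggedWikiDocument
    def __iter__(self):
        for i in range(3):
            yield TaggedDocument(words=['wiki', str(i)], tags=['wiki_%d' % i])

my_docs = [TaggedDocument(words=['my', 'doc'], tags=['doc_0'])]
docs_for_training = ChainIterable(my_docs, ToyWiki())

# Two complete passes, as build_vocab() and then train() would perform:
print(len(list(docs_for_training)))  # 4
print(len(list(docs_for_training)))  # 4 again - re-iterable, not exhausted
```

Every item yielded is a proper `TaggedDocument`, so the earlier `AttributeError` goes away.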

- Gordon