Training doc2vec on Wikipedia dump AND user-supplied list of TaggedDocuments


Nicolaj Mühlbach

Apr 15, 2021, 10:49:13 AM
to Gensim
Hi there!

I've followed the online tutorial to train a doc2vec model on the latest Wikipedia dump (https://github.com/RaRe-Technologies/gensim/blob/develop/docs/notebooks/doc2vec-wikipedia.ipynb).

However, I have ~25k tagged documents of my own that I would additionally like to include in the training.

I have tried something like: model.build_vocab(documents_tagged+[wiki])

but this throws an error: AttributeError: 'TaggedWikiDocument' object has no attribute 'words' 

Is there any way to combine the Wikipedia dump and other documents for training?

Best,
Nicolaj.

Gordon Mohr

Apr 16, 2021, 7:55:53 PM
to Gensim
Let's assume your `documents_tagged` is a list of exactly 25,000 `TaggedDocument` instances, while `wiki` is a single instance of `TaggedWikiDocument`, which (despite its misleading singular name) is actually an iterable which can re-iterate over exactly 5,000,000 wiki articles.

You can't usefully concatenate (`+`) your list with a one-element list whose single element is itself an iterable sequence of wiki articles. You'll get a list with 25,001 items: 25,000 instances of `TaggedDocument`, then one instance of the iterable `TaggedWikiDocument` (which is not itself a document with the needed `tags` and `words` properties).

You want instead a re-iterable sequence of 5,025,000 items, all of which are `TaggedDocument` instances. 
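To make the mismatch concrete, here's a tiny self-contained sketch. The names here are toy stand-ins, purely for illustration: a namedtuple in place of gensim's `TaggedDocument` (which is itself a namedtuple with `words` and `tags` fields), and a small class playing the role of `TaggedWikiDocument`:

```python
from collections import namedtuple

# Stand-in for gensim's TaggedDocument, which is itself a namedtuple
TaggedDocument = namedtuple('TaggedDocument', ['words', 'tags'])

class ToyWikiIterable:  # plays the role of TaggedWikiDocument
    def __iter__(self):
        yield TaggedDocument(words=['some', 'wiki', 'text'], tags=['wiki_0'])

my_docs = [TaggedDocument(words=['my', 'doc'], tags=['doc_0'])]

bad = my_docs + [ToyWikiIterable()]
# bad[-1] is the wrapping iterable itself, not a document: it has no
# `words` attribute, which is exactly what build_vocab() complained about.
print(hasattr(bad[0], 'words'))   # True
print(hasattr(bad[-1], 'words'))  # False
```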

If your system has enough RAM to bring the whole wiki dataset into memory – not as rare as it used to be – you could simply create one big list:

    docs_for_training = documents_tagged + list(wiki)  # `list()` iterates all 5M articles into an in-memory list

(By doing the parsing behind `wiki.get_texts()` only once, this also avoids some time-consuming repeated effort across multiple re-iterations.)

Under more common constraints, you'd simply want to create a wrapping iterable class that ensures that each time it is iterated over, both your docs, and the other articles, are included. Something like this should do the trick:

    import itertools as it

    class ChainIterable:
        """Re-iterable wrapper chaining several iterables of documents."""
        def __init__(self, *list_of_iterables):
            self.list_of_iterables = list_of_iterables
        def __iter__(self):
            # Build a fresh chain on every call, so the combined corpus
            # can be iterated repeatedly (unlike a plain generator)
            return it.chain(*self.list_of_iterables)

Then, use this to chain the two separate doc iterables together:

    docs_for_training = ChainIterable(documents_tagged, wiki)
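For example, here's a self-contained sketch (with toy stand-ins for `TaggedDocument` and `TaggedWikiDocument`, since the real corpus is huge) showing the key property: the chained object can be iterated more than once, which matters because `Doc2Vec` needs one full pass for `build_vocab()` and further passes for `train()`:

```python
import itertools as it
from collections import namedtuple

# Toy stand-in for gensim's TaggedDocument (itself a namedtuple)
TaggedDocument = namedtuple('TaggedDocument', ['words', 'tags'])

class ChainIterable:
    def __init__(self, *list_of_iterables):
        self.list_of_iterables = list_of_iterables
    def __iter__(self):
        return it.chain(*self.list_of_iterables)

class ToyWiki:  # plays the role of the re-iterable TaggedWikiDocument
    def __iter__(self):
        for i in range(3):
            yield TaggedDocument(words=['wiki', str(i)], tags=['wiki_%d' % i])

my_docs = [TaggedDocument(words=['my', 'doc'], tags=['doc_0'])]
docs_for_training = ChainIterable(my_docs, ToyWiki())

# Two complete passes, as build_vocab() and then train() would perform:
print(len(list(docs_for_training)))  # 4
print(len(list(docs_for_training)))  # 4 again - re-iterable, not exhausted
```

Every item yielded is a proper `TaggedDocument`, so the earlier `AttributeError` goes away.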

- Gordon