Let's assume your `documents_tagged` is a list of exactly 25,000 `TaggedDocument` instances, while `wiki` is a single instance of `TaggedWikiDocument`, which (despite its misleading singular name) is actually an iterable that can be re-iterated to yield exactly 5,000,000 wiki articles.
You can't usefully concatenate (`+`) your list with another list whose single element is itself an iterable sequence of wiki articles. You'd end up with a list of 25,001 items: 25,000 instances of `TaggedDocument`, then one instance of the iterable `TaggedWikiDocument` (which is not itself a document with the needed `tags` and `words` properties).
You want instead a re-iterable sequence of 5,025,000 items, all of which are `TaggedDocument` instances.
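To make the item-count problem concrete, here is a minimal sketch using tiny stand-ins. `FakeTaggedDocument` and `FakeWikiCorpus` are hypothetical names invented for this illustration, not the real gensim classes, and the sizes (3 and 5) are scaled down from 25,000 and 5,000,000.

```python
from collections import namedtuple

# Hypothetical stand-in for TaggedDocument (illustration only).
FakeTaggedDocument = namedtuple("FakeTaggedDocument", ["words", "tags"])


class FakeWikiCorpus:
    """Hypothetical re-iterable stand-in for `wiki`: each pass yields fresh documents."""

    def __init__(self, n_articles):
        self.n_articles = n_articles

    def __iter__(self):
        for i in range(self.n_articles):
            yield FakeTaggedDocument(words=["article", str(i)], tags=[f"wiki_{i}"])


my_docs = [FakeTaggedDocument(words=["doc", str(i)], tags=[f"mine_{i}"]) for i in range(3)]
fake_wiki = FakeWikiCorpus(5)

# Concatenating with a one-element list adds the corpus object itself, not its documents.
wrong = my_docs + [fake_wiki]

# The number of items a trainer should actually see on each pass.
wanted_count = len(my_docs) + sum(1 for _ in fake_wiki)

print(len(wrong), wanted_count)                 # 4 8
print(hasattr(wrong[-1], "tags"))               # False: the last item is the corpus, not a document
```

The last element of `wrong` is the corpus object, so any code expecting `words` and `tags` on every item will fail when it reaches it.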
If your system has enough RAM to bring the whole wiki dataset into memory (not as rare as it used to be), you could simply create one big list:
docs_for_training = documents_tagged + list(wiki)  # `list()` iterates all 5M articles into an in-memory list
(Because `list(wiki)` runs the parsing behind `wiki.get_texts()` only once, this also avoids repeating that time-consuming work on every later pass over the corpus.)
Under more common constraints, where the full dataset won't fit in memory, you'd instead create a wrapping iterable class that ensures both your docs and the wiki articles are included each time it is iterated over. Something like this should do the trick:
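import itertools as it

class ChainIterable:
    """Re-iterable wrapper that yields every item of each wrapped iterable, in order."""
    def __init__(self, *list_of_iterables):
        self.list_of_iterables = list_of_iterables
    def __iter__(self):
        # Build a fresh chain on every call, so each new pass starts from the beginning
        return it.chain(*self.list_of_iterables)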
Then, use this to chain the two separate doc iterables together:
docs_for_training = ChainIterable(documents_tagged, wiki)
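The reason for the wrapper class, rather than a bare `itertools.chain(...)`, is that a chain object is a one-shot iterator, while training typically makes several passes over the corpus (one to build the vocabulary, then one per epoch). The sketch below illustrates the difference with plain strings; `CountingCorpus` is a hypothetical stand-in for `wiki`, and the sizes are tiny for illustration.

```python
import itertools as it


class ChainIterable:
    """Same wrapper as above: builds a fresh chain on each iteration."""

    def __init__(self, *list_of_iterables):
        self.list_of_iterables = list_of_iterables

    def __iter__(self):
        return it.chain(*self.list_of_iterables)


class CountingCorpus:
    """Hypothetical re-iterable stand-in for `wiki` (illustration only)."""

    def __init__(self, n):
        self.n = n

    def __iter__(self):
        return (f"wiki_{i}" for i in range(self.n))


my_docs = ["mine_0", "mine_1"]
fake_wiki = CountingCorpus(3)

# A bare chain is exhausted after a single pass.
one_shot = it.chain(my_docs, fake_wiki)
first_pass = list(one_shot)
second_pass = list(one_shot)
print(len(first_pass), len(second_pass))   # 5 0

# The wrapper restarts from the beginning on every pass.
wrapped = ChainIterable(my_docs, fake_wiki)
passes = [list(wrapped) for _ in range(3)]
print([len(p) for p in passes])            # [5, 5, 5]
```

Note that this only works if every wrapped object is itself re-iterable (a list, or a class with its own `__iter__`); wrapping a generator would still be exhausted after the first pass.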
- Gordon