Adding a little more context, as I don't have a way forward yet. The serialized model was instantiated by following the Doc2Vec tutorial
(only the first few lines). So I ran:
from gensim.corpora.wikicorpus import WikiCorpus, MmCorpus
from gensim.models.doc2vec import Doc2Vec, TaggedDocument
from pprint import pprint
# Define a TaggedWikiDocument class to convert a WikiCorpus into a form suitable for Doc2Vec.
class TaggedWikiDocument(object):
    def __init__(self, wiki):
        self.wiki = wiki
        self.wiki.metadata = True  # get_texts() then also yields (page_id, title)
    def __iter__(self):
        for content, (page_id, title) in self.wiki.get_texts():
            yield TaggedDocument([c.decode("utf-8") for c in content], [title])
# Convert the articles to a WikiCorpus. WikiCorpus constructs a corpus from a Wikipedia (or other MediaWiki-based) database dump.
wiki = WikiCorpus("enwiki-latest-pages-articles.xml.bz2")
documents = TaggedWikiDocument(wiki)
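As an aside on why the tutorial wraps the corpus in a class with `__iter__` rather than a bare generator: as far as I understand, Doc2Vec walks the documents more than once (a vocabulary scan, then the training passes), so the input has to be restartable. Below is a gensim-free sketch of that difference. `FakeTagged`, `RestartableDocs`, `one_shot` and the toy `articles` list are stand-ins I made up for illustration, not gensim names.

```python
from collections import namedtuple

# Stand-in for gensim's TaggedDocument (hypothetical name, illustration only).
FakeTagged = namedtuple("FakeTagged", ["words", "tags"])

# Toy data in place of the (tokens, title) pairs a wiki dump would produce.
articles = [
    (["anarchism", "is", "a", "philosophy"], "Anarchism"),
    (["autism", "is", "a", "disorder"], "Autism"),
]

class RestartableDocs(object):
    """Each call to __iter__ starts a fresh pass over the source."""
    def __init__(self, source):
        self.source = source
    def __iter__(self):
        for tokens, title in self.source:
            yield FakeTagged(tokens, [title])

def one_shot(source):
    """A bare generator: exhausted after a single pass."""
    for tokens, title in source:
        yield FakeTagged(tokens, [title])

docs = RestartableDocs(articles)
gen = one_shot(articles)

first_pass = [d.tags[0] for d in docs]
second_pass = [d.tags[0] for d in docs]  # same titles again
gen_first = [d.tags[0] for d in gen]
gen_second = [d.tags[0] for d in gen]    # empty: the generator is used up
```

The class-based wrapper can be iterated any number of times, whereas the generator yields nothing on its second pass.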
Building the WikiCorpus took ~8 hours, so before proceeding with the rest of the tutorial I saved the model by calling MmCorpus.serialize(wiki). I recall reading in an official tutorial somewhere that this would also take ~8 hours, so I let it run overnight. By morning the serialization file was complete, but my laptop was in truly bad shape* so I rebooted. After realizing my earlier syntax error, I successfully loaded the serialized object with:
wiki = MmCorpus("/Users/kz/wikibrain.mm")
What remains unclear to me is how to reinitialize the model, since the MmCorpus object returned by MmCorpus(<path_to_serialized_model>) is not a WikiCorpus object, nor does it follow the same protocol as WikiCorpus, so the following doesn't work:
In [8]: documents = TaggedWikiDocument(wiki)
In [9]: pre = Doc2Vec(min_count=0)
In [10]: pre.scan_vocab(documents)
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-10-4a5ff44deb2d> in <module>()
----> 1 pre.scan_vocab(documents)

/Users/kz/anaconda/lib/python3.6/site-packages/gensim/models/doc2vec.py in scan_vocab(self, documents, progress_per, trim_rule, update)
    676         checked_string_types = 0
    677         vocab = defaultdict(int)
--> 678         for document_no, document in enumerate(documents):
    679             if not checked_string_types:
    680                 if isinstance(document.words, string_types):

<ipython-input-7-668238c08397> in __iter__(self)
      4         self.wiki.metadata = True
----> 6         for content, (page_id, title) in self.wiki.get_texts():
      7             yield TaggedDocument([c.decode("utf-8") for c in content], [title])

AttributeError: 'MmCorpus' object has no attribute 'get_texts'
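For what it's worth, my current understanding of the mismatch: a Matrix Market file holds each article as a sparse bag-of-words vector, i.e. (token id, count) pairs, whereas get_texts() yields the token sequence itself. Here is a small gensim-free sketch of why the first can't be turned back into the second. `to_bow` and the toy `vocab` are hypothetical helpers of mine, not gensim functions.

```python
from collections import Counter

# Toy vocabulary mapping tokens to integer ids (made up for illustration).
vocab = {"dog": 0, "bites": 1, "man": 2}

def to_bow(tokens):
    """Collapse a token list into sorted (token_id, count) pairs."""
    counts = Counter(vocab[t] for t in tokens)
    return sorted(counts.items())

a = ["dog", "bites", "man"]
b = ["man", "bites", "dog"]

bow_a = to_bow(a)
bow_b = to_bow(b)

# Two different sentences collapse to the same bag-of-words,
# so the original word order cannot be recovered from it.
print(a != b, bow_a == bow_b)
```

If that is right, the .mm file is probably fine as input for bag-of-words models but does not carry the ordered tokens that TaggedDocument expects.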