How to load serialized WikiCorpus?

Kevin Zeidler

Jul 12, 2017, 5:50:29 PM

to gensim

I serialized the model with the MmCorpus class, per the WikiCorpus object's warning that its "save" method only saves documents, not models. (I'm glad it warned me about that.) I let the memory-mapped corpus finish saving overnight and restarted the computer this morning. Now, when I attempt to load the serialized model, I get:

In [12]: corpora.mmcorpus.MmCorpus.load("/Users/kz/wikibrain.mm", mmap='r')
---------------------------------------------------------------------------
UnpicklingError Traceback (most recent call last)
<ipython-input-12-3eccfa1e0cbd> in <module>()
----> 1 corpora.mmcorpus.MmCorpus.load("/Users/kz/wikibrain.mm", mmap='r')
/Users/kz/anaconda/lib/python3.6/site-packages/gensim/utils.py in load(cls, fname, mmap)
269 compress, subname = SaveLoad._adapt_by_suffix(fname)
270
--> 271 obj = unpickle(fname)
272 obj._load_specials(fname, mmap, compress, subname)
273 logger.info("loaded %s", fname)
/Users/kz/anaconda/lib/python3.6/site-packages/gensim/utils.py in unpickle(fname)
931 # Because of loading from S3 load can't be used (missing readline in smart_open)
932 if sys.version_info > (3, 0):
--> 933 return _pickle.load(f, encoding='latin1')
934 else:
935 return _pickle.loads(f.read())
UnpicklingError: invalid load key, '%'.

WikiCorpus gives the same error:

In [10]: WikiCorpus.load("/Users/kz/wikibrain.mm", mmap='r')
---------------------------------------------------------------------------
UnpicklingError Traceback (most recent call last)
<ipython-input-10-3ce93d99690a> in <module>()
----> 1 WikiCorpus.load("/Users/kz/wikibrain.mm", mmap='r')
/Users/kz/anaconda/lib/python3.6/site-packages/gensim/utils.py in load(cls, fname, mmap)
269 compress, subname = SaveLoad._adapt_by_suffix(fname)
270
--> 271 obj = unpickle(fname)
272 obj._load_specials(fname, mmap, compress, subname)
273 logger.info("loaded %s", fname)
/Users/kz/anaconda/lib/python3.6/site-packages/gensim/utils.py in unpickle(fname)
931 # Because of loading from S3 load can't be used (missing readline in smart_open)
932 if sys.version_info > (3, 0):
--> 933 return _pickle.load(f, encoding='latin1')
934 else:
935 return _pickle.loads(f.read())
UnpicklingError: invalid load key, '%'.

What is the appropriate load method to use on this object?

Kevin Zeidler

Jul 12, 2017, 5:58:56 PM

to gensim

It looks like the culprit was the mmap='r' flag.

In [14]: corpora.mmcorpus.MmCorpus("/Users/kz/wikibrain.mm")

Out[14]: <gensim.corpora.mmcorpus.MmCorpus at 0x121781da0>

¯\_(ツ)_/¯

It's certainly memory-mapped, so I'm not sure why that would cause an error. But I'm happy it works!
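
A possible explanation, offered as a sketch rather than a definitive diagnosis: the call that works differs from the failing one not only in dropping mmap='r' but also in using the MmCorpus constructor instead of .load(). As the traceback shows, .load() unpickles the file, whereas MmCorpus.serialize writes plain-text Matrix Market, whose header line begins with "%%MatrixMarket". That would account for "invalid load key, '%'". The snippet below reproduces the error with the standard library alone; the toy file and its name are made up for illustration.

```python
import os
import pickle
import tempfile

# A tiny hand-written Matrix Market file: 2 documents, 3 terms, 3 non-zero entries.
MM_TEXT = (
    "%%MatrixMarket matrix coordinate real general\n"
    "2 3 3\n"
    "1 1 1.0\n"
    "1 3 2.0\n"
    "2 2 1.0\n"
)


def try_unpickle(path):
    """Return the message of the UnpicklingError raised for `path`, or None on success."""
    try:
        with open(path, "rb") as f:
            pickle.load(f, encoding="latin1")
    except pickle.UnpicklingError as err:
        return str(err)
    return None


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "toy.mm")  # hypothetical file name
    with open(path, "w") as f:
        f.write(MM_TEXT)
    message = try_unpickle(path)

print(message)
```

If that reading is right, mmap='r' is incidental: any .load() call on a .mm file would fail the same way, and passing the path to the constructor is the way to read it back.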

Kevin Zeidler

Jul 12, 2017, 7:37:04 PM

to gensim

I'm still having trouble with this. How do you load the mm.index file created by MmCorpus.serialize? I've tried MmCorpus(<index>), MmCorpus.load(<index>), and WikiCorpus.load(<index>). From reading about WikiCorpus and MmCorpus, I can find no indication of how to reinitialize the full BoW model (including the id -> word mappings).

Kevin Zeidler

Jul 12, 2017, 9:49:41 PM

to gensim

Adding a little more context, as I don't have a way forward yet. The serialized model was instantiated with the Doc2Vec method described in this blog post (only the first few lines). So I ran:

from gensim.corpora.wikicorpus import WikiCorpus, MmCorpus
from gensim.models.doc2vec import Doc2Vec, TaggedDocument
from pprint import pprint
import multiprocessing

# Define a TaggedWikiDocument class to convert WikiCorpus into a form suitable for Doc2Vec.
class TaggedWikiDocument(object):
    def __init__(self, wiki):
        self.wiki = wiki
        self.wiki.metadata = True

    def __iter__(self):
        for content, (page_id, title) in self.wiki.get_texts():
            yield TaggedDocument([c.decode("utf-8") for c in content], [title])

# Convert the articles to a WikiCorpus. WikiCorpus constructs a corpus from a Wikipedia (or other MediaWiki-based) database dump.
wiki = WikiCorpus("enwiki-latest-pages-articles.xml.bz2")
documents = TaggedWikiDocument(wiki)


Building the WikiCorpus took ~8 hours, so before proceeding with the rest of the tutorial I saved the model by calling MmCorpus.serialize(wiki). I recall reading in an official tutorial somewhere that this would also take ~8 hours, so I let it run overnight. By morning the serialization file was complete, but my laptop was in truly bad shape, so I rebooted. After realizing my earlier syntax error, I successfully loaded the serialized object with

  wiki = MmCorpus("/Users/kz/wikibrain.mm")

What remains unclear to me is how to reinitialize the model. The MmCorpus object returned by MmCorpus(<path_to_serialized_model>) is not a WikiCorpus object, nor does it follow the same protocol as WikiCorpus, so the following doesn't work:

In [8]: documents = TaggedWikiDocument(wiki)

In [9]: pre = Doc2Vec(min_count=0)

In [10]: pre.scan_vocab(documents)
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-10-4a5ff44deb2d> in <module>()
----> 1 pre.scan_vocab(documents)

/Users/kz/anaconda/lib/python3.6/site-packages/gensim/models/doc2vec.py in scan_vocab(self, documents, progress_per, trim_rule, update)
    676         checked_string_types = 0
    677         vocab = defaultdict(int)
--> 678         for document_no, document in enumerate(documents):
    679             if not checked_string_types:
    680                 if isinstance(document.words, string_types):

<ipython-input-7-668238c08397> in __iter__(self)
      4         self.wiki.metadata = True
      5     def __iter__(self):
----> 6         for content, (page_id, title) in self.wiki.get_texts():
      7             yield TaggedDocument([c.decode("utf-8") for c in content], [title])

AttributeError: 'MmCorpus' object has no attribute 'get_texts'

As WikiCorpus and MmCorpus do not share a common protocol, I would expect there to be a way to initialize a WikiCorpus from an MmCorpus object (otherwise the docs wouldn't say to use MmCorpus's serialize method). But the documentation provides few clues here: the WikiCorpus page only addresses how to serialize the model. It would be a great help if you would also include a note on that page explaining how to deserialize the model. Serializing a previously trained model is a step that is likely to be omitted from typical gensim tutorials, and the output of MmCorpus.serialize() is a peculiarly formatted (Objective-C++?) document that is not interoperable with other gensim corpus-based IO classes.
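
One possible way around the missing get_texts(), sketched under assumptions rather than taken from the gensim docs: the .mm file stores each article only as (word id, count) pairs, so token order and titles are not in it, and a Doc2Vec input stream cannot be rebuilt from it. A workaround would be to cache the token lists themselves during the one expensive pass over the dump. The helper names, the JSON-lines layout, and the file name below are all hypothetical, and only the standard library is used.

```python
import gzip
import json
import os
import tempfile


def save_token_stream(path, titled_docs):
    """Write (title, tokens) pairs as one JSON object per line, gzip-compressed."""
    with gzip.open(path, "wt", encoding="utf-8") as f:
        for title, tokens in titled_docs:
            f.write(json.dumps({"title": title, "tokens": tokens}) + "\n")


class CachedDocs:
    """Re-iterable stream of (tokens, [title]) pairs read back from the cache."""

    def __init__(self, path):
        self.path = path

    def __iter__(self):
        with gzip.open(self.path, "rt", encoding="utf-8") as f:
            for line in f:
                record = json.loads(line)
                yield record["tokens"], [record["title"]]


# Toy data; in the real workflow each pair would come from wiki.get_texts()
# with wiki.metadata = True, which yields (tokens, (page_id, title)).
toy_docs = [
    ("Anarchism", ["anarchism", "is", "a", "philosophy"]),
    ("Autism", ["autism", "is", "a", "condition"]),
]

with tempfile.TemporaryDirectory() as tmp:
    cache_path = os.path.join(tmp, "wiki_tokens.jsonl.gz")  # hypothetical file name
    save_token_stream(cache_path, toy_docs)
    roundtrip = list(CachedDocs(cache_path))

print(roundtrip)
```

Each pair read back could then be wrapped as TaggedDocument(words, tags) inside an __iter__ like the one in TaggedWikiDocument, so a restart would cost a file read instead of another pass over the dump.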
