How to load serialized WikiCorpus?


Kevin Zeidler

Jul 12, 2017, 5:50:29 PM
to gensim
I serialized the model with the MmCorpus class, per the WikiCorpus object's warning that its "save" method only saves documents and not models. (I'm glad it warned me about that.) I let the memory-mapped corpus finish saving overnight and restarted the computer this morning. Now, when I attempt to load the serialized model, I get:

In [12]: corpora.mmcorpus.MmCorpus.load("/Users/kz/wikibrain.mm", mmap='r')
---------------------------------------------------------------------------
UnpicklingError                           Traceback (most recent call last)
<ipython-input-12-3eccfa1e0cbd> in <module>()
----> 1 corpora.mmcorpus.MmCorpus.load("/Users/kz/wikibrain.mm", mmap='r')
/Users/kz/anaconda/lib/python3.6/site-packages/gensim/utils.py in load(cls, fname, mmap)
    269         compress, subname = SaveLoad._adapt_by_suffix(fname)
    270
--> 271         obj = unpickle(fname)
    272         obj._load_specials(fname, mmap, compress, subname)
    273         logger.info("loaded %s", fname)
/Users/kz/anaconda/lib/python3.6/site-packages/gensim/utils.py in unpickle(fname)
    931         # Because of loading from S3 load can't be used (missing readline in smart_open)
    932         if sys.version_info > (3, 0):
--> 933             return _pickle.load(f, encoding='latin1')
    934         else:
    935             return _pickle.loads(f.read())
UnpicklingError: invalid load key, '%'.
 

WikiCorpus gives the same error: 

In [10]: WikiCorpus.load("/Users/kz/wikibrain.mm", mmap='r')
---------------------------------------------------------------------------
UnpicklingError                           Traceback (most recent call last)
<ipython-input-10-3ce93d99690a> in <module>()
----> 1 WikiCorpus.load("/Users/kz/wikibrain.mm", mmap='r')
/Users/kz/anaconda/lib/python3.6/site-packages/gensim/utils.py in load(cls, fname, mmap)
    269         compress, subname = SaveLoad._adapt_by_suffix(fname)
    270
--> 271         obj = unpickle(fname)
    272         obj._load_specials(fname, mmap, compress, subname)
    273         logger.info("loaded %s", fname)
/Users/kz/anaconda/lib/python3.6/site-packages/gensim/utils.py in unpickle(fname)
    931         # Because of loading from S3 load can't be used (missing readline in smart_open)
    932         if sys.version_info > (3, 0):
--> 933             return _pickle.load(f, encoding='latin1')
    934         else:
    935             return _pickle.loads(f.read())
UnpicklingError: invalid load key, '%'.

What is the appropriate load method to use on this object?

Kevin Zeidler

Jul 12, 2017, 5:58:56 PM
to gensim
It looks like the culprit was the mmap='r' flag. 

In [14]: corpora.mmcorpus.MmCorpus("/Users/kz/wikibrain.mm")
Out[14]: <gensim.corpora.mmcorpus.MmCorpus at 0x121781da0>

 ¯\_(ツ)_/¯  
It's certainly memory-mapped, so I'm not sure why that would cause an error. But I'm happy it works!
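
A likely explanation for the traceback, sketched with the standard library only: MmCorpus.serialize writes a plain-text Matrix Market file whose first line begins with "%%MatrixMarket", while load() goes through the unpickler (as the traceback shows), which rejects the very first byte. The file name and matrix entries below are made up for illustration.

```python
import os
import pickle
import tempfile

# A tiny hand-written file in Matrix Market coordinate format (illustrative data).
mm_text = (
    "%%MatrixMarket matrix coordinate real general\n"
    "2 3 3\n"   # 2 documents, 3 terms, 3 non-zero entries
    "1 1 1\n"
    "1 2 2\n"
    "2 3 1\n"
)

path = os.path.join(tempfile.mkdtemp(), "tiny.mm")
with open(path, "w") as f:
    f.write(mm_text)

# Unpickling a text file fails on its first byte, which here is '%'.
try:
    with open(path, "rb") as f:
        pickle.load(f, encoding="latin1")
    error_message = None
except pickle.UnpicklingError as exc:
    error_message = str(exc)

print(error_message)
```

Under this reading, the mmap='r' argument is incidental: the constructor call works because it parses the text format instead of unpickling it.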

Kevin Zeidler

Jul 12, 2017, 7:37:04 PM
to gensim
I'm still having trouble with this. How do you load the mm.index file created by MmCorpus.serialize? I've tried MmCorpus(<index>), MmCorpus.load(<index>), and WikiCorpus.load(<index>). I can find no indication from reading about WikiCorpus and MmCorpus of how to reinitialize the full BoW model (including the id -> word mappings).



Kevin Zeidler

Jul 12, 2017, 9:49:41 PM
to gensim
Adding a little more context, as I don't have a way forward yet. The serialized model was instantiated with the Doc2Vec method described in this blog post (only the first few lines). So I ran:

from gensim.corpora import WikiCorpus, MmCorpus
from gensim.models.doc2vec import Doc2Vec, TaggedDocument
from pprint import pprint
import multiprocessing

# Define a TaggedWikiDocument class to convert a WikiCorpus into a form suitable for Doc2Vec.
class TaggedWikiDocument(object):
    def __init__(self, wiki):
        self.wiki = wiki
        self.wiki.metadata = True
    def __iter__(self):
        for content, (page_id, title) in self.wiki.get_texts():
            yield TaggedDocument([c.decode("utf-8") for c in content], [title])

# Convert the articles to a WikiCorpus, which constructs a corpus from a Wikipedia (or other MediaWiki-based) database dump.
wiki = WikiCorpus("enwiki-latest-pages-articles.xml.bz2")
documents = TaggedWikiDocument(wiki)

Building the WikiCorpus took ~8 hours, so before proceeding with the rest of the tutorial I saved the model by calling MmCorpus.serialize("/Users/kz/wikibrain.mm", wiki). I recall reading in an official tutorial somewhere that this would also take ~8 hours, so I let it run overnight. By morning the serialization file was complete, but my laptop was in truly bad shape, so I rebooted. After realizing my earlier syntactic error, I successfully loaded the serialized object with:

  wiki = MmCorpus("/Users/kz/wikibrain.mm")

What remains unclear to me is how to reinitialize the model. The MmCorpus object returned by MmCorpus(<path_to_serialized_model>) is not a WikiCorpus object and does not follow the same protocol as WikiCorpus, so the following doesn't work:

In [8]: documents = TaggedWikiDocument(wiki)

In [9]: pre = Doc2Vec(min_count=0)

In [10]: pre.scan_vocab(documents)
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-10-4a5ff44deb2d> in <module>()
----> 1 pre.scan_vocab(documents)

/Users/kz/anaconda/lib/python3.6/site-packages/gensim/models/doc2vec.py in scan_vocab(self, documents, progress_per, trim_rule, update)
    676         checked_string_types = 0
    677         vocab = defaultdict(int)
--> 678         for document_no, document in enumerate(documents):
    679             if not checked_string_types:
    680                 if isinstance(document.words, string_types):

<ipython-input-7-668238c08397> in __iter__(self)
      4         self.wiki.metadata = True
      5     def __iter__(self):
----> 6         for content, (page_id, title) in self.wiki.get_texts():
      7             yield TaggedDocument([c.decode("utf-8") for c in content], [title])
      8

AttributeError: 'MmCorpus' object has no attribute 'get_texts'



As WikiCorpus and MmCorpus do not share a common protocol, I would expect there to be a way to initialize a WikiCorpus from an MmCorpus object (otherwise the docs wouldn't say to use MmCorpus's serialize method). But the documentation provides few clues here: the WikiCorpus page only addresses how to serialize the model. It would be a great help if you would include a note on that page explaining how to deserialize the model as well, particularly since serializing a previously trained model is a step that is likely to be omitted from typical gensim tutorials, and the output of MmCorpus.serialize() is a peculiarly formatted (Objective-C++?) document that is not interoperable with other gensim corpus-based IO classes.
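
Because an MmCorpus stores only bag-of-words counts, token order and article titles cannot be recovered from it, and Doc2Vec needs the token sequences. One possible workaround, sketched below with the standard library only, is to write the tokenized articles to a line-per-document file once and stream them back afterwards. The file layout, the names dump_texts and TaggedLines, the sample articles, and the TaggedDocument stand-in (a namedtuple with the same words and tags fields as gensim's) are all assumptions made for illustration, not part of gensim.

```python
import gzip
import os
import tempfile
from collections import namedtuple

# Stand-in for gensim.models.doc2vec.TaggedDocument (same field names).
TaggedDocument = namedtuple("TaggedDocument", "words tags")


def dump_texts(articles, path):
    """Write (title, tokens) pairs as 'title<TAB>tok tok ...', one article per line."""
    with gzip.open(path, "wt", encoding="utf-8") as out:
        for title, tokens in articles:
            out.write(title.replace("\t", " ") + "\t" + " ".join(tokens) + "\n")


class TaggedLines(object):
    """Re-iterable stream of TaggedDocument objects read back from the dump."""

    def __init__(self, path):
        self.path = path

    def __iter__(self):
        with gzip.open(self.path, "rt", encoding="utf-8") as lines:
            for line in lines:
                title, _, text = line.rstrip("\n").partition("\t")
                yield TaggedDocument(text.split(), [title])


# Hypothetical sample data; with a real WikiCorpus the pairs would come from
# wiki.get_texts() with wiki.metadata = True, as in TaggedWikiDocument.
articles = [
    ("Anarchism", ["anarchism", "is", "a", "philosophy"]),
    ("Autism", ["autism", "is", "a", "disorder"]),
]
path = os.path.join(tempfile.mkdtemp(), "texts.tsv.gz")
dump_texts(articles, path)

documents = TaggedLines(path)
print(list(documents)[0])
```

Because TaggedLines reopens the file on every pass, it can be iterated repeatedly, which multi-pass training requires; a plain generator would be exhausted after the first pass.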
