Corpus Setup and Storing for Doc2Vec

James

Aug 5, 2016, 5:37:48 PM
to gensim
Hello everyone. I have gone through some tutorials and keep getting stuck after importing my wiki corpus. I will outline the steps and the information I have here, in the hope that someone can help and that this might help others in the future:


Importing the corpus, per Corpus from a Wiki Dump:
wiki_corpus = gensim.corpora.WikiCorpus("enwiki-latest-pages-articles.xml.bz2")  # create corpus
MmCorpus.serialize('wiki_en_vocab400k.mm', wiki_corpus)  # edited filename for personal clarity
wiki_mm = MmCorpus("wiki_en_vocab400k.mm")  # load back in a memory-friendly format


My problem is that after loading the corpus back with MmCorpus, I can no longer access WikiCorpus.get_texts:

class TaggedWikiDocument(object):
    def __init__(self, wiki_corpus):
        self.wiki_corpus = wiki_corpus
        self.wiki_corpus.metadata = True  # have get_texts() also yield (page_id, title)

    def __iter__(self):
        for content, (page_id, title) in self.wiki_corpus.get_texts():
            yield models.doc2vec.TaggedDocument([c.decode("utf-8") for c in content], [title])


documents = TaggedWikiDocument(wiki_mm)

dv2 = models.doc2vec.Doc2Vec(dm=0, size=512, window=5, min_count=50, iter=10, workers=8)
dv2.build_vocab(documents)

>>> AttributeError: 'MmCorpus' object has no attribute 'get_texts'


I do understand that MmCorpus does not have 'get_texts', but I am not sure how to load the Wiki corpus in a memory-friendly manner. Attempting to use only
wiki_corpus = gensim.corpora.WikiCorpus("enwiki-latest-pages-articles.xml.bz2")
results in Wikipedia eating my laptop.

I found a work-in-progress Doc2Vec Wikipedia tutorial on gensim's GitHub. I am unsure if it is correct, but I am re-importing Wikipedia now to find out. It recommends:

#wiki = WikiCorpus("enwiki-latest-pages-articles.xml.bz2")
#wiki.save("enwikicorpus")
wiki = WikiCorpus.load("enwikicorpus")

If this is the correct way to save and load WikiCorpus for Doc2Vec, it might be useful to add it to the "Corpus From a Wiki Dump" tutorial. If this is incorrect, I look forward to hearing the correct way to import this Corpus. I'm very excited to try out Doc2Vec!

Gordon Mohr

Aug 5, 2016, 7:59:54 PM
to gensim
There's no need or benefit to introducing the MmCorpus format – Doc2Vec works on text-as-tokens, doing its own vocabulary-to-int-mapping, and can't directly use the compact representation created by MmCorpus. So better to follow the demo-notebook-in-progress. 

And, there's no real need to muck with `WikiCorpus.save()`/`load()`. WikiCorpus can stream from the compressed dump. A save/load cycle has negligible (or negative) benefit compared to just instantiating a new WikiCorpus with the path to the underlying compressed dump.

(If you were concerned about the wikitext-decoding overhead being repeated on multiple passes, you might consider, as an optimization, iterating over the WikiCorpus once, and writing the resulting tokenized text into a 1-line-per-article plain-text format. Then, run Doc2Vec's passes on that file, as if it were any other one-example-per-line plain-text corpus. But that isn't strictly necessary.)
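For instance, a minimal sketch of that optional one-time preprocessing pass might look like the following (the filenames are just placeholders; depending on your gensim/Python version the tokens yielded by get_texts() may be bytes rather than str, and newer gensim releases rename size/iter to vector_size/epochs):

    from gensim.corpora import WikiCorpus
    from gensim.models.doc2vec import Doc2Vec, TaggedLineDocument

    # One-time pass: decode the wiki dump and write one tokenized article per line.
    # dictionary={} skips WikiCorpus's own vocabulary scan (see the later messages in this thread).
    wiki = WikiCorpus("enwiki-latest-pages-articles.xml.bz2", dictionary={})
    with open("wiki_tokenized.txt", "w") as fout:
        for tokens in wiki.get_texts():
            # Tokens may be bytes on older gensim/Python 2; decode defensively.
            words = [t.decode("utf-8") if isinstance(t, bytes) else t for t in tokens]
            fout.write(" ".join(words) + "\n")

    # Later passes stream from the plain-text file; TaggedLineDocument tags each
    # line (article) with its line number, so Doc2Vec can make repeated passes
    # without re-parsing the compressed dump.
    documents = TaggedLineDocument("wiki_tokenized.txt")
    model = Doc2Vec(documents, dm=0, size=512, window=5, min_count=50, iter=10, workers=8)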

- Gordon 

James

Aug 6, 2016, 9:07:05 AM
to gensim
Thanks, Gordon. For my project this worked perfectly:

wiki_corpus=WikiCorpus("/home/user/python/wiki/enwiki-latest-pages-articles.xml.bz2")

If I may add a question:

Regarding your comment: "save/load cycle has negligible (or negative) benefit compared to just instantiating a new WikiCorpus with the path to the underlying compressed dump."

I may have misunderstood, but isn't the save/load cycle extremely useful as a means of saving progress? I am interested in saving and loading because the above command, WikiCorpus("enwiki-latest-pages-articles.xml.bz2"), takes five to six hours to process. While I am learning, it's likely I will not be able to complete my project on the first try, so I would like to "save" and "load" my corpus to avoid the six hours of processing when I start my Python session for subsequent attempts.

In fact, I had assumed that by doing this once I could simply save the result, and all near-future Doc2Vec projects could use WikiCorpus.load("enwikicorpus").

Your comment gave me cause to think that perhaps there was a faster way of "just instantiating a new WikiCorpus with the path to the underlying" Wikipedia dump.

Thanks Gordon for your help. I'm really enjoying working with gensim!

Gordon Mohr

Aug 6, 2016, 12:19:47 PM
to gensim
I've taken a look at the WikiCorpus source and realize what's happening now: WikiCorpus itself builds an internal vocabulary dictionary via a full scan of the texts, in its constructor, whenever no dictionary argument is supplied.

That's what's taking hours. But note: Word2Vec/Doc2Vec never use that dictionary; they just work on the tokenized text (which isn't saved by WikiCorpus) and create their own separate vocabulary survey, subject to their own needs.

A save/load will allow the re-use of that dictionary, so it will be faster than creating a new WikiCorpus – but only because of work that it'd be better to skip entirely. So unless you need that dictionary for some other processing, try instead:

    wiki_corpus = WikiCorpus("/home/user/python/wiki/enwiki-latest-pages-articles.xml.bz2", dictionary={})

The resulting WikiCorpus will still be able to iterate over the compressed dump as needed for Word2Vec/Doc2Vec. 
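As a rough end-to-end sketch (untested, reusing the wrapper class and the Doc2Vec parameters from the earlier posts; newer gensim releases rename size/iter to vector_size/epochs), that could fit together like this:

    from gensim.corpora import WikiCorpus
    from gensim import models

    class TaggedWikiDocument(object):
        """Wrapper from the first post: stream articles as TaggedDocuments tagged by title."""
        def __init__(self, wiki_corpus):
            self.wiki_corpus = wiki_corpus
            self.wiki_corpus.metadata = True  # have get_texts() also yield (page_id, title)

        def __iter__(self):
            for content, (page_id, title) in self.wiki_corpus.get_texts():
                yield models.doc2vec.TaggedDocument(
                    [c.decode("utf-8") if isinstance(c, bytes) else c for c in content], [title])

    # dictionary={} skips the hours-long internal vocabulary scan.
    wiki_corpus = WikiCorpus("/home/user/python/wiki/enwiki-latest-pages-articles.xml.bz2",
                             dictionary={})
    documents = TaggedWikiDocument(wiki_corpus)

    # Passing the iterable to the constructor builds the vocabulary and trains in one go.
    model = models.doc2vec.Doc2Vec(documents, dm=0, size=512, window=5,
                                   min_count=50, iter=10, workers=8)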

- Gordon

James

Aug 6, 2016, 9:19:05 PM
to gensim
Wow, that is pretty interesting, Gordon, as well as a great workaround. I am rather a novice, and a little unsure whether I will or won't need the dictionary in later steps, but it's great that we're able to share that here.

I think there are probably cases where people do and don't need the dictionary, so knowing that option is great.

Thanks again for your explanation.