What is Doc2Vec build_vocab speed?

James
Aug 6, 2016, 11:17:43 PM
to gensim
Running Doc2Vec, my build_vocab looks like it will take 30 hours, and a third of the way through it has maxed out my 16GB of RAM and eaten 10GB of swap. Some of this memory was already taken up by the first few parts of my code, though I still had around 8GB of free RAM when I started build_vocab last night. This is based on the following Doc2Vec code:

import gensim
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

# Stream articles out of the compressed Wikipedia dump.
wiki_corpus = gensim.corpora.WikiCorpus("enwiki-latest-pages-articles.xml.bz2")


class TaggedWikiDocument(object):
    def __init__(self, wiki_corpus):
        self.wiki_corpus = wiki_corpus
        self.wiki_corpus.metadata = True  # have get_texts() also yield (page_id, title)

    def __iter__(self):
        for content, (page_id, title) in self.wiki_corpus.get_texts():
            yield TaggedDocument([c.decode("utf-8") for c in content], [title])


documents = TaggedWikiDocument(wiki_corpus)

d2v = Doc2Vec(dm=0, size=512, window=5, min_count=50, iter=10, workers=8)
d2v.build_vocab(documents)


and some current output of the terminal for d2v.build_vocab(documents):

INFO: PROGRESS: at 33.35% examples, 185329 words/s, in_qsize 0, out_qsize 0
DEBUG: queueing job #854154 (8147 words, 12 sentences) at alpha 0.01670
DEBUG: prepared another chunk of 70 documents (qsize=0)


I was surprised by how slow and memory-hungry this step is. Is this progress/speed normal?

Gordon Mohr
Aug 7, 2016, 2:36:20 AM
to gensim
The vocabulary-building scan can be time-consuming. And the decoding of the XML/Wikitext by WikiCorpus will be more expensive than handling plain text. But your main problem is likely the swapping. With Python data structures/objects, you essentially never want to see any swapping, or otherwise-quick operations will take forever. (When you see it, you'll usually want to adjust the code to use less memory, or get more RAM, rather than wait it out.)

So you'd want to do anything possible to reduce memory use to avoid swapping, perhaps including shrinking your model (smaller `size`, larger `min_count`). 
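For example, a configuration along these lines (the particular values are just an illustration of the trade-off, not a recommendation) halves the per-vector memory and prunes more of the vocabulary:

d2v = Doc2Vec(dm=0, size=256, window=5, min_count=100, iter=10, workers=8)
d2v.build_vocab(documents)

Every halving of `size` halves all the vector arrays; raising `min_count` shrinks only the vocabulary-dependent parts.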

The logging output you show actually comes from `train()`, not `build_vocab()`. That suggests that at least the `build_vocab()` scan finished, and execution continued to a `train()` invocation (not shown), which would iterate over the corpus another 10 times. 

Looking more at `WikiCorpus.get_texts()`, I'm a bit suspicious of its use of multiprocessing & thus multiple forked processes. It might be OK in a Python process that was just doing one pass over the WikiCorpus, but blow up addressable memory during multiple passes in a Python process that was already using a lot of memory (the large Doc2Vec model in training).  

The optimization I'd mentioned in the other thread might help avoid issues. That is: use WikiCorpus only once, to extract the titles/text and write those to a plainer-text format locally. Then, read and feed that text to Doc2Vec, without using WikiCorpus. 
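A rough sketch of that one-time extraction (the file name and the tab-separated title/text layout here are just illustrative choices, not a required format):

import gensim

# Single pass over the dump: write one "title<TAB>space-joined tokens" line per article.
wiki_corpus = gensim.corpora.WikiCorpus("enwiki-latest-pages-articles.xml.bz2")
wiki_corpus.metadata = True
with open("wiki_plaintext.txt", "w", encoding="utf-8") as out:
    for content, (page_id, title) in wiki_corpus.get_texts():
        tokens = [c.decode("utf-8") if isinstance(c, bytes) else c for c in content]
        out.write(title + "\t" + " ".join(tokens) + "\n")

Later passes then only need a trivial iterator over that file, with no WikiCorpus (and none of its forked helper processes) involved:

from gensim.models.doc2vec import TaggedDocument

class TaggedPlainTextDocument(object):
    def __init__(self, path):
        self.path = path

    def __iter__(self):
        with open(self.path, encoding="utf-8") as f:
            for line in f:
                title, text = line.rstrip("\n").split("\t", 1)
                yield TaggedDocument(text.split(), [title])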

- Gordon

James
Aug 7, 2016, 8:42:35 AM
to gensim
Correct, the above logging was from train(). I apparently woke up last night and, seeing build_vocab() was completed, looked up the next step, ran it, and fell back asleep. In the morning I totally forgot, though, haha. I was sleep-gensiming.

I saw another post of yours, Gordon, with a nice formula for estimating the potential RAM size for Doc2Vec. I took the liberty of reproducing its general content below: 

[document vectors] * [dimensions] * 4 bytes-per-float = Size in Bytes

Using your estimate on my setup: 4.3 million document vectors * 512 dimensions * 4 bytes = 8.8GB, which is fairly close to the 8.4GB that my saved model with vocab, "d2v_model.doctag__syn0.npy", takes up on disk, along with three other "d2v_model" files for a total of ~10.5GB.
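In other words, as a quick back-of-the-envelope check in a Python shell (using decimal GB):

>>> 4.3e6 * 512 * 4 / 1e9   # doc vectors * dimensions * bytes per float32
8.8064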

Per your suggestion, I am now reproducing my corpus as simipletext.txt to build my model & vocab. If you think it will have a substantial effect on its size, I can try build_vocab() with my original choices of size=512, min_count=50, and iter=10 to see if it reduces the memory costs enough to continue on to training.

Finally, if you have time, could we take your estimation one step further, to training? If I am iterating 10 times, do I take the "Size in Bytes" (e.g. 10.5GB) and double it to estimate the RAM needed to complete the training? Or is it a multiple of the number of iterations? 

This will help me in deciding the best possible Doc2Vec(size, min_count, iter) settings.

Thank you for your help.

Gordon Mohr
Aug 7, 2016, 2:35:04 PM
to gensim
In fact, training just updates arrays that are already allocated, so the Doc2Vec model won't use any more memory, no matter the number of chosen iterations. Essentially, Word2Vec/Doc2Vec memory usage peaks during `build_vocab()`, when the model is fully initialized. 

The forking of multiple wikitext-interpreting processes inside WikiCorpus on each new pass could change that, via the additional processes, which is why it'd be good to do that work once up front and thus remove it from consideration during the Doc2Vec training iterations.

In addition to the [#docs * dimensions * 4 bytes] size of just the doc-vectors, there's also the model's word representations and internal neural-network weights, whose memory use is a function of the surviving vocabulary size (and thus reducible by increasing `min_count`). That's the extra few GB beyond what you see related to `doctag_syn0`. 
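As a rough illustration of where those extra gigabytes come from (the surviving-vocabulary count below is a made-up placeholder; the real figure is reported in the `build_vocab()` log output):

dims = 512
doc_count = 4300000     # ~4.3M Wikipedia articles / doc-tags
vocab_size = 2000000    # placeholder: surviving words after min_count pruning

doc_vectors  = doc_count * dims * 4     # doctag_syn0
word_vectors = vocab_size * dims * 4    # syn0
nn_weights   = vocab_size * dims * 4    # syn1 (hierarchical softmax) or syn1neg
print((doc_vectors + word_vectors + nn_weights) / 1e9, "GB, plus vocabulary-dict overhead")

Each of those arrays is float32, so every term is just a count times the dimensionality times 4 bytes.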

There should be a relatively good estimate of the model's memory needs printed in log lines during `build_vocab()` as well. If saved uncompressed, the total size of the files created by `save()` (after `build_vocab()` has completed) is also roughly the same as the loaded RAM usage.

- Gordon