Best strategy to train doc2vec on a huge corpus


Dasha

Dec 7, 2015, 6:35:57 AM
to gensim
Could you please share your opinion on the general strategy for training doc2vec on a corpus that does not fit into RAM?

I am currently training a DBOW model on the Gigaword corpus merged with another, relatively small corpus. In total this is about 30GB, with about 4 billion tokens and about 118 million paragraphs for which I need to learn representations.
In the future I might need to use even bigger datasets.

I have a max of 256GB RAM and 12 CPUs available, but I would prefer to be able to use less.

I was following this tutorial, which loads everything into memory, and after several hours of training I realized it was trying to use far more than 100% of the available memory.

Now I have shuffled the training data and implemented an iterator that does not load everything into memory. I also have the dataset randomly split into about one thousand files. It has already been running for about 12 hours and is consuming about 70GB of RAM.

Could you please give me general advice on how to make the training more efficient and less memory-hungry?

Thank you.




Gordon Mohr

Dec 8, 2015, 10:34:52 PM
to gensim
The largest consumer of addressable memory will be the vectors-in-training for the individual documents, which need to be held and improved on each of multiple passes. You don't say your chosen dimensionality, but for the moment let's assume 100. (That's the default, but probably too small for a fine-grained model of large domains – published results often use 300-1000 dimensions.)

Assuming you're training a vector per paragraph, the vectors-in-training will require:

118 million vectors * 100 dimensions * 4 bytes-per-float = 47.2 GB

(Assuming a culled word-vocabulary of only 100K-1M items, other parts of the model will need just a tiny increment more than that.)
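
If it helps to experiment, here is the same arithmetic as a tiny helper (a sketch that ignores the word-vocabulary and output layers):

    def docvecs_bytes(n_docs, dims, bytes_per_float=4):
        # size of just the doc-vectors-in-training array
        return n_docs * dims * bytes_per_float

    print(docvecs_bytes(118000000, 100) / 1e9)  # ~47.2 GB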

But, you don't have to have all that in RAM at once. If you supply a file path to the Doc2Vec init parameter `docvecs_mapfile`, the model will use a memory-mapped file for those vectors. If your training iterations just go through all examples in order each time (which can be good enough on a large dataset), only the 'hot' area of that file will be mapped into memory during each pass, so the docvecs array can be larger than RAM. 
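
For example, roughly (a minimal sketch; `corpus` stands for your own restartable iterator of tagged documents, and the path is just a placeholder):

    from gensim.models.doc2vec import Doc2Vec

    # back the per-document vectors with a memory-mapped file on a big local disk
    model = Doc2Vec(documents=corpus, dm=0, size=300, min_count=20,
                    workers=12, docvecs_mapfile='/big_local_disk/docvecs.mmap')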

Additionally, it saves memory to use plain Python ints (the indexes into that big array) as your document keys (aka 'doctags'), avoiding a big dictionary mapping string keys to int indexes. (But as of the latest gensim, you can mix in some string-keyed doctags as well; they'll get indexes beyond the end of the last plain-int-indexed doctag.)
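
For example, a sketch (using `TaggedDocument`, gensim's (words, tags) wrapper, with the running integer index as the only tag):

    from gensim.models.doc2vec import TaggedDocument

    def as_tagged_docs(tokenized_paragraphs):
        # plain int tags index straight into the big docvecs array,
        # avoiding a large string-key -> index dictionary
        for i, words in enumerate(tokenized_paragraphs):
            yield TaggedDocument(words=words, tags=[i])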

Other than that, you may already be doing things alright; it just takes a while. The achieved parallelism across all cores still isn't what would be ideal, due to Python's GIL bottleneck, but it gets incrementally better when using (1) longer examples; (2) more dimensions; (3) more negative examples, in negative-sampling mode; (4) larger windows (in the modes that use a window – not DBOW unless also training words). So in some cases, where core utilization wasn't maxed during training, upping these parameters doesn't cost as much in runtime as you would otherwise expect, because training is mainly able to make more use of otherwise-idle time.

Other thoughts for speed:

* always be sure there's no swapping
* try to make sure neither IO (from some slow volume) nor redundant text preprocessing (such as regex-replacements or other character transformations) is a bottleneck in the corpus iterator (a simple streaming pattern is sketched below)
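
On that last point, one pattern that keeps the iterator cheap is to do tokenization and cleanup once, write the results as space-separated tokens (one document per line, across your many files), and then stream them back with minimal per-pass work. A rough sketch, assuming such pre-tokenized files (paths and layout are placeholders):

    import glob
    from gensim.models.doc2vec import TaggedDocument

    class PreTokenizedCorpus(object):
        """Streams TaggedDocuments from pre-tokenized files, one document per line."""
        def __init__(self, path_pattern):
            self.paths = sorted(glob.glob(path_pattern))

        def __iter__(self):
            doc_id = 0
            for path in self.paths:
                with open(path) as f:
                    for line in f:
                        yield TaggedDocument(words=line.split(), tags=[doc_id])
                        doc_id += 1

    corpus = PreTokenizedCorpus('/data/shards/*.txt')  # restartable, so multiple passes work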

- Gordon

Daniele Evangelista

Jan 30, 2016, 11:32:59 AM
to gensim
Hello Gordon,
I am facing the same problem, training a doc2vec model on a very huge corpus, the Wikipedia one, all contained in one huge text file (almost 12GB). Could you please share the part of the code related to setting the `docvecs_mapfile` parameter in the configuration of the model? I cannot manage to make it work well :( .. I am almost new to Python and to Gensim too.
I hope you can help me. Have a nice day, D.

Gordon Mohr

Feb 1, 2016, 7:55:23 PM
to gensim
What have you tried, and has the problem been errors or just poor final results?

Are you splitting Wikipedia into 'documents' by sentences, paragraphs, sections, or full-articles? 

- Gordon

Daniele Evangelista

Feb 2, 2016, 4:19:17 AM
to gen...@googlegroups.com
Hi Gordon,
thanks for the reply. What I am actually doing is the following: take the Wikipedia corpus, split it into full articles, disambiguate all the text in each article, reassemble it into one big document (one article per line), and train a Doc2Vec model on this new corpus.
The first part is done; now I am doing the training. As I said, I tried to set up the model like the following:

model = Doc2Vec(min_count=1, window=10, size=100, sample=1e-4, negative=5, workers=7, docvecs_mapfile='../datasets/mapfile.txt')

But my system kills the process because (I suppose) it requires too much RAM, and no mapfile.txt is being created. One other important thing is that the vocabulary of this new corpus is extremely huge, even bigger than the one you could build from the entire Wikipedia corpus (because mine is disambiguated, so "bank" the "financial institution" is a different word from "bank" the "river bank"). What do you suggest? I cannot split the corpus, since the model must be trained on all of it.

thanks again, 
D.

Gordon Mohr

Feb 2, 2016, 10:50:34 PM
to gensim
That's interesting! If you enable logging, what are the last messages before the process dies? How much RAM do you have?

Given your unique vocabulary-expansion, it might be the words rather than the doc-vectors that are the issue. How many disambiguated-words are we talking about?

The 5 million Wikipedia articles * 100 dimensions * 4 bytes/float = 2GB of doc-vectors is the part the memory-mapping can help with. But memory-mapping isn't implemented for the word vocabulary – and even if it were available, it's far less likely to help if your word-vector array is larger than RAM. (As the words occur randomly throughout the corpus, there'd be constant paging in and out that would likely kill performance.)

The only obvious suggestion would be to use a larger `min_count` – words with fewer than a few dozen occurrences may not be of much importance, and likely won't be very well-defined anyway.

- Gordon

Daniele Evangelista

Feb 3, 2016, 6:29:38 AM
to gensim
I was thinking of the same suggestion you gave. And I agree that the vocabulary size is the issue. I actually don't know how many unique words we are talking about, but the process stops before starting the training, so in the vocabulary-building step.
Do you think it could be useful (and possible) to build the vocabulary externally and load it into the gensim doc2vec model? But I think the memory issue would remain, even when trying to load a pre-built vocabulary.
Anyway, I want to ask you another thing: is there a possibility to build the representation of new documents? Let's say, if you are doing document similarity, it could be interesting to understand whether or not a new document is similar to one inside my trained model.

Gordon Mohr

Feb 3, 2016, 2:05:28 PM
to gensim
To decide on options that are likely to work, it'd be best to figure out exactly how many unique words are being dealt with, and how much RAM you have. 

If you've done the expansion of original words to disambiguated words as a prior step that resulted in a "one article per line" text file, with your changed words as space-separated tokens, counting them is an easy one-liner:

    cat corpus.txt | tr " " "\n" | sort | uniq | wc -l

The log output that shows exactly the progress before it stops would indicate how many words were discovered before the error, and also whether a full scan had completed (and then the failure happened as the words were arranged and vector arrays pre-allocated before training) or not. Each of those needs to be known to determine what sorts of optimizations (like doing the vocab-building as a separate step) might help the process complete. 

Yes, after a model is trained, the `infer_vector()` method can be used to infer a vector representation for new documents. (It takes a pre-tokenized list of words, and you might need to tinker with its default parameters, especially `steps`, to get good results.)
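
A minimal usage sketch for a trained `model` (the tokens are placeholders, and the parameter values are only starting points to tune):

    tokens = "some new pre-tokenized document text".split()
    # more inference passes ('steps') and a tuned starting alpha often help
    new_vec = model.infer_vector(tokens, alpha=0.025, steps=20)
    # then, for example, find the most similar trained documents
    print(model.docvecs.most_similar([new_vec], topn=10))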

- Gordon

Daniele Evangelista

Feb 9, 2016, 5:51:50 AM
to gensim
Hello Gordon,
thanks again for the reply. I will give you further information on the size of the corpus and the vocabulary as soon as possible.
Anyway, I would also like to know what exactly happens in the infer_vector() function. I mean, does gensim perform what is written in Mikolov's paper? So it should take all the words of the sentence (already tokenized and passed to the function as a list of tokens), find the related word vectors and average them, is that right?
If I am wrong, could you please point me to the maths behind this procedure?

Thanks again,
D.

Gordon Mohr

Feb 9, 2016, 4:39:58 PM
to gensim
The most precise reference for what it's doing is the source code itself, starting at:


By reusing the same training routines, but with some additional non-default parameters to indicate parts of the model which should remain unchanged, it's trying to exactly match the paper's description of the process. 

Note that this is *not* just an averaging of word vectors. (In DM mode, 'dm=1, dm_mean=1', an average of the word vectors in a context-window is combined with a doc-vector-in-training for the prediction task – but the resulting doc vector is *not* an average of all the word vectors. In DBOW mode, 'dm=0', no averaging of word vectors is any part of the process.)

- Gordon

Sumeet Sandhu

Jan 8, 2017, 2:53:31 PM
to gensim
We are facing the same problem - we have 114K documents in biology topics (peptides etc.) that have a lot of esoteric formulas, sequences, and symbols. By default, these technical constructs parse into many more words than documents in other topics like computer science do. Our vocabulary size and model size blow up: for 114K documents we get 9 million unique 'words', which require 20GB of model memory with size=200 - see the logfile excerpt below. This size doesn't work for our 50GB RAM cloud instance.

What puzzles me is that we chose not to model words (dbow_words = 0, to save memory) - but even just document vectors (125K) seem to require this huge model size in RAM. Is there any way to get around this, besides using more aggressive min_count and downsampling parameters?


2017-01-07 23:45:07,297 : INFO : collected 13787264 word types and 125399 unique tags from a corpus of 370193 examples and 3143033547 words

2017-01-07 23:45:07,313 : INFO : Loading a fresh vocabulary

2017-01-07 23:47:17,549 : INFO : min_count=2 retains 9027742 unique words (65% of original 13787264, drops 4759522)

2017-01-07 23:47:17,549 : INFO : min_count=2 leaves 3138274025 word corpus (99% of original 3143033547, drops 4759522)

2017-01-07 23:47:44,615 : INFO : deleting the raw counts dictionary of 13787264 items

2017-01-07 23:48:24,811 : INFO : sample=5e-05 downsamples 774 most-common words

2017-01-07 23:48:24,814 : INFO : downsampling leaves estimated 1453784781 word corpus (46.3% of prior 3138274025)

2017-01-07 23:48:24,816 : INFO : estimated required memory for 9027742 words and 200 dimensions: 20889205600 bytes

2017-01-07 23:49:03,750 : INFO : constructing a huffman tree from 9027742 words

2017-01-07 23:58:13,338 : INFO : built huffman tree with maximum node depth 31

2017-01-07 23:59:02,340 : INFO : resetting layer weights

doc2vec : build vocab = 5979.63341904


best regards,

Sumeet

Gordon Mohr

Jan 8, 2017, 6:29:37 PM
to gensim
`dbow_words=0` is the default, but the word 'input' array `syn0` is initialized in any case (even when it's not used, as in `dm=0, dbow_words=0` pure DBOW). So `dm=0, dbow_words=1` doesn't actually wind up using any more memory. (It should be possible to skip that allocation/initialization in pure DBOW, but that efficiency just hasn't been implemented.)

With the numbers you've given, skipping that allocation could save around 7GB. But even without it, a 20GB model on a machine with 50GB RAM should be no problem – certainly no failure/crashing. 

Since training involves predicting words, the 'output' layer in any mode will be a function of the vocabulary size. 

You're probably not getting much generalizable training value from words where there are only 2 examples – so you could try a much larger `min_count`. You might also be able to canonicalize/collapse many of the idiosyncratic tokens into more general tokens which add value rather than noise. 
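
For the canonicalizing idea, something in this spirit could be a starting point (purely illustrative patterns – the right rules depend entirely on your own texts):

    import re

    NUMBER = re.compile(r'^\d+(\.\d+)?$')
    RESIDUES = re.compile(r'^[ACDEFGHIKLMNPQRSTVWY]{10,}$')  # long peptide-like strings

    def canonicalize(token):
        # collapse rare surface forms into broader shared tokens
        if NUMBER.match(token):
            return '<NUM>'
        if RESIDUES.match(token):
            return '<SEQ>'
        return token.lower()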

Frequent-word downsampling won't affect RAM use - it just speeds (and sometimes improves) training, by skipping/shrinking redundant contexts. 

Also while it won't save RAM, it seems projects with larger vocabularies tend to lean more towards negative-sampling than hierarchical-softmax – its speed is less sensitive to vocabulary size. 

- Gordon

Sumeet Sandhu

Jan 11, 2017, 6:39:56 PM
to gen...@googlegroups.com
Thanks Gordon.

That's a good tip - document modeling is a function of vocabulary size whether words are modeled or not.

Is there a way to specify lists of words to cull at the top and bottom frequencies, i.e. a list of 'frequent' words to downsample, and a list of 'infrequent' words to remove?

trim_rule seems to allow some of this, but only at the bottom end, and only with term frequency (unless I use a global variable to bring in document frequency, etc.).

I tried frequent-word downsampling with sample=5e-5 (to avoid downsampling technical/useful words as much as stopwords). most_similar word quality was OK on 50K documents, but on 100K documents with the same parameters it made stop-word modeling 'worse': stop words started showing up as most_similar to frequent technical words. Stop words have almost never shown up as nearest neighbors in my many past experiments with no downsampling, on different document-set sizes.

regards,
Sumeet


Gordon Mohr

Jan 11, 2017, 8:41:20 PM
to gensim
I believe `trim_rule` can implement any arbitrary policy for discarding whichever words you'd like. (You can also unpack the steps of `build_vocab()`, and before the `finalize_vocab()` step is called, do anything you want to the raw dictionary to slim the vocabulary.)
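
For example, a `trim_rule` is just a callable of (word, count, min_count) returning one of the `RULE_*` constants. A sketch, where the word lists are hypothetical ones you'd supply and `corpus` is your iterable of tagged documents:

    from gensim import utils
    from gensim.models.doc2vec import Doc2Vec

    frequent_to_drop = {'the', 'of', 'and'}            # hypothetical hand-picked lists
    rare_to_drop = {'misc_token_a', 'misc_token_b'}

    def my_trim_rule(word, count, min_count):
        if word in frequent_to_drop or word in rare_to_drop:
            return utils.RULE_DISCARD
        return utils.RULE_DEFAULT  # fall back to the usual min_count handling

    model = Doc2Vec(documents=corpus, trim_rule=my_trim_rule, min_count=10, workers=7)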

But I doubt any specific 'lists' or other fancy policies are worth the trouble compared to just using a larger `min_count`, which can shrink the vocabulary significantly (given usual Zipf's-Law-like token distributions). It also throws out exactly those least-frequent tokens that, due to individual rareness, were unlikely to have good learned representations anyway. 

(I don't know of any work strongly suggesting that completely discarding more frequent words helps, compared to just downsampling. It won't shrink the model size much – a word takes the same model space whether it appears once or a million times – and many word2vec/doc2vec projects don't even bother removing 'stop' words.)

Note that a smaller `sample` parameter – a more negative exponent – results in more-aggressive downsampling. So `sample=5e-05` makes a larger set of words eligible for down-sampling, and samples fewer of the most-frequent words, than the default value of `sample=1e-03`. I have no idea what value might work best for your corpus/token-distribution/end-goals. If quality rather than short training time is important, you may want to balance more aggressive downsampling (which saves training time) with more training iterations (which spends more time), as a way of making the training spend more effort on the most interesting middle-frequency words.

- Gordon

w007...@umail.usq.edu.au

Apr 9, 2017, 8:12:49 PM
to gensim
May I know whether you have solved the problem? I am also facing a system-hang problem with a corpus of only 72M using word2vec. The log is:
o.n.l.f.Nd4jBackend - Loaded [CpuBackend] backend
o.n.n.NativeOpsHolder - Number of threads used for NativeOps: 4
o.n.n.Nd4jBlas - Number of threads used for BLAS: 4
o.d.m.s.SequenceVectors - Starting vocabulary building...
o.d.m.w.w.VocabConstructor - Sequences checked: [19137], Current vocabulary size: [17160]; Sequences/sec: [1439.52];
o.d.m.e.i.InMemoryLookupTable - Initializing syn1...
o.d.m.s.SequenceVectors - Building learning algorithms:
o.d.m.s.SequenceVectors -           building ElementsLearningAlgorithm: [SkipGram]
o.d.m.s.SequenceVectors - Starting learning process...
o.d.m.s.SequenceVectors - Epoch: [1]; Words vectorized so far: [7070575];  Lines vectorized so far: [100000]; learningRate: [0.02423167274288105]

The system just stays frozen at this point.

Any idea?

Gordon Mohr

Apr 10, 2017, 1:36:50 AM
to gensim
That looks like Java, not Python gensim, logging output. So this question appears misdirected.

- Gordon

Ritu Sharma

Nov 27, 2017, 8:47:41 AM
to gensim
Hello, 

I am facing the same issue of Memory Error.

I have a dataset of 322 GB. I was reading JSON files to prepare the training data. Two lists were created: in one the file name is appended (DocLabels), and in the other the text is appended (Data), to supply at training time. When processing the 322 GB of data at once, I got a memory error. Then I divided the dataset into 2 sets. While processing the first set of 110 GB, the first part of preparing the dataset completed successfully, but I got a memory error during doc2vec training. Here is the attached output.

2017-11-27 10:54:18,459 : INFO : collecting all words and their counts
2017-11-27 10:54:18,474 : INFO : PROGRESS: at example #0, processed 0 words (0/s), 0 word types, 0 tags
2017-11-27 10:55:03,105 : INFO : PROGRESS: at example #10000, processed 152663650 words (3419935/s), 5569472 word types, 10000 tags
2017-11-27 10:55:57,802 : INFO : PROGRESS: at example #20000, processed 331818820 words (3275295/s), 10035481 word types, 20000 tags
2017-11-27 10:56:52,864 : INFO : PROGRESS: at example #30000, processed 504357226 words (3133729/s), 14415155 word types, 30000 tags
2017-11-27 10:58:57,177 : INFO : PROGRESS: at example #40000, processed 889851113 words (3101062/s), 22031283 word types, 40000 tags
2017-11-27 11:00:04,242 : INFO : PROGRESS: at example #50000, processed 1072569770 words (2724355/s), 25509947 word types, 50000 tags
2017-11-27 11:01:02,655 : INFO : PROGRESS: at example #60000, processed 1241225166 words (2887482/s), 28479318 word types, 60000 tags
2017-11-27 11:02:02,529 : INFO : PROGRESS: at example #70000, processed 1409582269 words (2811964/s), 31192772 word types, 70000 tags
2017-11-27 11:02:42,173 : INFO : PROGRESS: at example #80000, processed 1508806632 words (2502602/s), 33227334 word types, 80000 tags
2017-11-27 11:04:32,243 : INFO : PROGRESS: at example #90000, processed 1809654284 words (2733423/s), 39412168 word types, 90000 tags
2017-11-27 11:07:20,119 : INFO : PROGRESS: at example #100000, processed 2234156699 words (2528533/s), 47147221 word types, 100000 tags
2017-11-27 11:09:52,196 : INFO : PROGRESS: at example #110000, processed 2607896686 words (2457627/s), 53897383 word types, 110000 tags
2017-11-27 11:10:44,446 : INFO : PROGRESS: at example #120000, processed 2745264867 words (2629450/s), 56001822 word types, 120000 tags
2017-11-27 11:11:16,137 : INFO : PROGRESS: at example #130000, processed 2827117610 words (2582841/s), 58366521 word types, 130000 tags
2017-11-27 11:11:50,825 : INFO : PROGRESS: at example #140000, processed 2912509508 words (2461196/s), 60757543 word types, 140000 tags
2017-11-27 11:12:21,437 : INFO : PROGRESS: at example #150000, processed 2995331490 words (2705918/s), 62875375 word types, 150000 tags
2017-11-27 11:12:52,543 : INFO : PROGRESS: at example #160000, processed 3078531060 words (2674653/s), 64960484 word types, 160000 tags
2017-11-27 11:13:37,401 : INFO : PROGRESS: at example #170000, processed 3160857061 words (1835253/s), 67044892 word types, 170000 tags
2017-11-27 11:14:11,191 : INFO : PROGRESS: at example #180000, processed 3245049235 words (2491577/s), 69110558 word types, 180000 tags
2017-11-27 11:14:43,832 : INFO : PROGRESS: at example #190000, processed 3330286435 words (2611326/s), 71095891 word types, 190000 tags
2017-11-27 11:15:15,394 : INFO : PROGRESS: at example #200000, processed 3412482675 words (2604299/s), 73032591 word types, 200000 tags
2017-11-27 11:15:41,117 : INFO : PROGRESS: at example #210000, processed 3484747568 words (2809144/s), 73978394 word types, 210000 tags
2017-11-27 11:16:03,545 : INFO : PROGRESS: at example #220000, processed 3539712618 words (2450890/s), 74930744 word types, 220000 tags
2017-11-27 11:16:24,595 : INFO : PROGRESS: at example #230000, processed 3594204897 words (2588179/s), 75823469 word types, 230000 tags
2017-11-27 11:16:47,161 : INFO : PROGRESS: at example #240000, processed 3651254919 words (2529385/s), 76733627 word types, 240000 tags
2017-11-27 11:17:09,582 : INFO : PROGRESS: at example #250000, processed 3708906038 words (2571235/s), 77635553 word types, 250000 tags
2017-11-27 11:17:46,730 : INFO : PROGRESS: at example #260000, processed 3766507618 words (1550176/s), 78535135 word types, 260000 tags
2017-11-27 11:18:09,684 : INFO : PROGRESS: at example #270000, processed 3826063201 words (2594230/s), 79430195 word types, 270000 tags
2017-11-27 11:18:34,592 : INFO : PROGRESS: at example #280000, processed 3885800529 words (2398620/s), 80409526 word types, 280000 tags
2017-11-27 11:19:08,168 : INFO : PROGRESS: at example #290000, processed 3968037506 words (2449717/s), 81997016 word types, 290000 tags
2017-11-27 11:19:35,923 : INFO : PROGRESS: at example #300000, processed 4036921076 words (2482351/s), 83125495 word types, 300000 tags
2017-11-27 11:19:58,625 : INFO : PROGRESS: at example #310000, processed 4097091596 words (2648966/s), 84031816 word types, 310000 tags
2017-11-27 11:20:23,447 : INFO : PROGRESS: at example #320000, processed 4159877051 words (2529729/s), 85001748 word types, 320000 tags
2017-11-27 11:20:50,767 : INFO : PROGRESS: at example #330000, processed 4227457905 words (2474381/s), 86012206 word types, 330000 tags
2017-11-27 11:21:19,819 : INFO : PROGRESS: at example #340000, processed 4295950182 words (2357029/s), 87081250 word types, 340000 tags
2017-11-27 11:21:52,331 : INFO : PROGRESS: at example #350000, processed 4371362975 words (2319315/s), 88480879 word types, 350000 tags
2017-11-27 11:23:00,957 : INFO : PROGRESS: at example #360000, processed 4461660292 words (1315870/s), 90085061 word types, 360000 tags
2017-11-27 11:23:31,667 : INFO : PROGRESS: at example #370000, processed 4532568324 words (2308694/s), 91532531 word types, 370000 tags
2017-11-27 11:24:04,960 : INFO : PROGRESS: at example #380000, processed 4608419619 words (2278579/s), 92988541 word types, 380000 tags
2017-11-27 11:24:45,655 : INFO : PROGRESS: at example #390000, processed 4701794826 words (2293378/s), 95603736 word types, 390000 tags
2017-11-27 11:25:27,592 : INFO : PROGRESS: at example #400000, processed 4798265510 words (2301310/s), 98167804 word types, 400000 tags
2017-11-27 11:26:11,569 : INFO : PROGRESS: at example #410000, processed 4895266540 words (2205785/s), 100786108 word types, 410000 tags
2017-11-27 11:26:55,226 : INFO : PROGRESS: at example #420000, processed 4993322826 words (2246206/s), 103370226 word types, 420000 tags
2017-11-27 11:27:37,148 : INFO : PROGRESS: at example #430000, processed 5088389983 words (2267482/s), 105677374 word types, 430000 tags
2017-11-27 11:28:03,908 : INFO : PROGRESS: at example #440000, processed 5147896915 words (2224477/s), 106601742 word types, 440000 tags
2017-11-27 11:28:29,671 : INFO : PROGRESS: at example #450000, processed 5208146581 words (2338848/s), 107402834 word types, 450000 tags
2017-11-27 11:29:19,598 : INFO : PROGRESS: at example #460000, processed 5265476538 words (1148079/s), 108331428 word types, 460000 tags
2017-11-27 11:29:45,993 : INFO : PROGRESS: at example #470000, processed 5325687199 words (2280751/s), 109300165 word types, 470000 tags
2017-11-27 11:30:12,549 : INFO : PROGRESS: at example #480000, processed 5385844852 words (2265788/s), 110205462 word types, 480000 tags
2017-11-27 11:30:51,648 : INFO : PROGRESS: at example #490000, processed 5466415924 words (2060612/s), 111889347 word types, 490000 tags
2017-11-27 11:31:34,576 : INFO : PROGRESS: at example #500000, processed 5555596021 words (2077843/s), 113791549 word types, 500000 tags
2017-11-27 11:32:16,072 : INFO : PROGRESS: at example #510000, processed 5644147903 words (2133446/s), 115671742 word types, 510000 tags
2017-11-27 11:32:57,910 : INFO : PROGRESS: at example #520000, processed 5728904460 words (2026359/s), 117484160 word types, 520000 tags
2017-11-27 11:33:41,608 : INFO : PROGRESS: at example #530000, processed 5818595331 words (2051948/s), 119334571 word types, 530000 tags
2017-11-27 11:34:25,165 : INFO : PROGRESS: at example #540000, processed 5908503009 words (2064164/s), 121185709 word types, 540000 tags
2017-11-27 11:35:09,950 : INFO : PROGRESS: at example #550000, processed 5998600120 words (2011960/s), 123056834 word types, 550000 tags
2017-11-27 11:35:54,576 : INFO : PROGRESS: at example #560000, processed 6085824310 words (1954382/s), 124871802 word types, 560000 tags
2017-11-27 11:36:38,542 : INFO : PROGRESS: at example #570000, processed 6174440471 words (2016116/s), 126700160 word types, 570000 tags
2017-11-27 11:37:51,989 : INFO : PROGRESS: at example #580000, processed 6259424856 words (1157025/s), 128397926 word types, 580000 tags
2017-11-27 11:38:32,204 : INFO : PROGRESS: at example #590000, processed 6342094693 words (2055857/s), 129951745 word types, 590000 tags
2017-11-27 11:39:12,381 : INFO : PROGRESS: at example #600000, processed 6424854207 words (2059776/s), 131514222 word types, 600000 tags
2017-11-27 11:39:56,421 : INFO : PROGRESS: at example #610000, processed 6508111128 words (1890202/s), 133098502 word types, 610000 tags
2017-11-27 11:40:40,098 : INFO : PROGRESS: at example #620000, processed 6591677968 words (1913836/s), 134647880 word types, 620000 tags
2017-11-27 11:41:21,648 : INFO : PROGRESS: at example #630000, processed 6672233996 words (1938550/s), 136188929 word types, 630000 tags
2017-11-27 11:42:08,092 : INFO : PROGRESS: at example #640000, processed 6762071533 words (1934173/s), 137963800 word types, 640000 tags
2017-11-27 11:42:56,043 : INFO : PROGRESS: at example #650000, processed 6852436879 words (1884525/s), 139744193 word types, 650000 tags
2017-11-27 11:43:40,808 : INFO : PROGRESS: at example #660000, processed 6938789040 words (1928968/s), 141359878 word types, 660000 tags
2017-11-27 11:44:22,221 : INFO : PROGRESS: at example #670000, processed 7019344331 words (1945305/s), 142766760 word types, 670000 tags
2017-11-27 11:45:04,053 : INFO : PROGRESS: at example #680000, processed 7098128924 words (1883938/s), 144166899 word types, 680000 tags
2017-11-27 11:46:00,631 : INFO : PROGRESS: at example #690000, processed 7205858371 words (1904226/s), 145908318 word types, 690000 tags
2017-11-27 11:47:30,378 : INFO : PROGRESS: at example #700000, processed 7374395882 words (1877839/s), 148396626 word types, 700000 tags
2017-11-27 11:48:20,276 : INFO : PROGRESS: at example #710000, processed 7465636870 words (1828456/s), 149581811 word types, 710000 tags
2017-11-27 11:49:24,095 : INFO : PROGRESS: at example #720000, processed 7578445224 words (1767523/s), 151437817 word types, 720000 tags
2017-11-27 11:50:28,154 : INFO : PROGRESS: at example #730000, processed 7689860414 words (1739420/s), 153993381 word types, 730000 tags
2017-11-27 11:52:13,573 : INFO : PROGRESS: at example #740000, processed 7804638995 words (1088868/s), 156147343 word types, 740000 tags
2017-11-27 11:53:14,137 : INFO : PROGRESS: at example #750000, processed 7914193226 words (1808665/s), 157850041 word types, 750000 tags
2017-11-27 11:54:09,155 : INFO : PROGRESS: at example #760000, processed 8014233229 words (1818562/s), 159461712 word types, 760000 tags
2017-11-27 11:55:11,368 : INFO : PROGRESS: at example #770000, processed 8123415169 words (1754935/s), 160976003 word types, 770000 tags
2017-11-27 11:56:17,418 : INFO : PROGRESS: at example #780000, processed 8239969464 words (1764421/s), 162868161 word types, 780000 tags
2017-11-27 11:57:28,443 : INFO : PROGRESS: at example #790000, processed 8357823905 words (1659331/s), 164571859 word types, 790000 tags
2017-11-27 11:58:31,115 : INFO : PROGRESS: at example #800000, processed 8465856989 words (1723943/s), 165855863 word types, 800000 tags
2017-11-27 11:59:38,081 : INFO : PROGRESS: at example #810000, processed 8575907418 words (1643363/s), 167675625 word types, 810000 tags
2017-11-27 12:00:45,845 : INFO : PROGRESS: at example #820000, processed 8689399163 words (1674742/s), 169385297 word types, 820000 tags
2017-11-27 12:01:57,246 : INFO : PROGRESS: at example #830000, processed 8804099667 words (1606546/s), 170984319 word types, 830000 tags
2017-11-27 12:02:59,476 : INFO : PROGRESS: at example #840000, processed 8907297453 words (1658177/s), 172494784 word types, 840000 tags
2017-11-27 12:04:03,980 : INFO : PROGRESS: at example #850000, processed 9012025720 words (1623877/s), 174406058 word types, 850000 tags
2017-11-27 12:05:12,709 : INFO : PROGRESS: at example #860000, processed 9125655859 words (1653227/s), 176375161 word types, 860000 tags
2017-11-27 12:06:23,759 : INFO : PROGRESS: at example #870000, processed 9238572341 words (1589284/s), 178354187 word types, 870000 tags
2017-11-27 12:07:56,710 : INFO : PROGRESS: at example #880000, processed 9346242729 words (1158270/s), 180768166 word types, 880000 tags
2017-11-27 12:09:09,290 : INFO : PROGRESS: at example #890000, processed 9462554828 words (1602516/s), 182803395 word types, 890000 tags
2017-11-27 12:10:15,506 : INFO : PROGRESS: at example #900000, processed 9572241994 words (1656738/s), 184671040 word types, 900000 tags
2017-11-27 12:11:25,417 : INFO : PROGRESS: at example #910000, processed 9685914858 words (1626053/s), 186261093 word types, 910000 tags
2017-11-27 12:12:23,052 : INFO : PROGRESS: at example #920000, processed 9783252247 words (1688488/s), 187368580 word types, 920000 tags
2017-11-27 12:13:37,819 : INFO : PROGRESS: at example #930000, processed 9908719074 words (1678122/s), 189112546 word types, 930000 tags
2017-11-27 12:14:43,667 : INFO : PROGRESS: at example #940000, processed 10014129862 words (1600898/s), 190420710 word types, 940000 tags
2017-11-27 12:16:47,460 : INFO : PROGRESS: at example #950000, processed 10125139689 words (896738/s), 191705137 word types, 950000 tags
2017-11-27 12:18:02,244 : INFO : PROGRESS: at example #960000, processed 10247521765 words (1636399/s), 192857787 word types, 960000 tags
2017-11-27 12:19:19,960 : INFO : PROGRESS: at example #970000, processed 10377655281 words (1674546/s), 194106565 word types, 970000 tags
2017-11-27 12:20:33,180 : INFO : PROGRESS: at example #980000, processed 10496949512 words (1629473/s), 195385814 word types, 980000 tags
2017-11-27 12:21:48,766 : INFO : PROGRESS: at example #990000, processed 10612291662 words (1525873/s), 196747237 word types, 990000 tags
2017-11-27 12:23:07,130 : INFO : PROGRESS: at example #1000000, processed 10732101821 words (1528907/s), 197874949 word types, 1000000 tags
2017-11-27 12:24:45,052 : INFO : PROGRESS: at example #1010000, processed 10890197142 words (1614528/s), 198769761 word types, 1010000 tags
2017-11-27 12:26:21,289 : INFO : PROGRESS: at example #1020000, processed 11047017190 words (1629567/s), 200228568 word types, 1020000 tags
2017-11-27 12:28:16,026 : INFO : PROGRESS: at example #1030000, processed 11229081122 words (1586725/s), 202459220 word types, 1030000 tags
2017-11-27 12:30:29,013 : INFO : PROGRESS: at example #1040000, processed 11424319613 words (1468068/s), 208260001 word types, 1040000 tags
2017-11-27 12:31:37,200 : INFO : PROGRESS: at example #1050000, processed 11532405079 words (1585337/s), 209843180 word types, 1050000 tags
2017-11-27 12:34:04,003 : INFO : PROGRESS: at example #1060000, processed 11755580678 words (1520111/s), 213984687 word types, 1060000 tags
2017-11-27 12:36:00,303 : INFO : PROGRESS: at example #1070000, processed 11934190372 words (1535802/s), 216992114 word types, 1070000 tags
2017-11-27 12:37:16,950 : INFO : PROGRESS: at example #1080000, processed 12050889393 words (1522698/s), 219795582 word types, 1080000 tags
2017-11-27 12:39:37,346 : INFO : PROGRESS: at example #1090000, processed 12303608363 words (1799926/s), 225353138 word types, 1090000 tags
2017-11-27 12:42:13,420 : INFO : PROGRESS: at example #1100000, processed 12542671389 words (1531697/s), 229901537 word types, 1100000 tags
2017-11-27 12:44:27,612 : INFO : PROGRESS: at example #1110000, processed 12744399342 words (1503340/s), 232543628 word types, 1110000 tags
2017-11-27 12:45:36,236 : INFO : PROGRESS: at example #1120000, processed 12845213636 words (1469749/s), 233897416 word types, 1120000 tags
2017-11-27 12:46:57,911 : INFO : PROGRESS: at example #1130000, processed 12958009078 words (1380796/s), 236809426 word types, 1130000 tags
2017-11-27 12:51:02,836 : INFO : PROGRESS: at example #1140000, processed 13292693659 words (1366412/s), 242258680 word types, 1140000 tags
2017-11-27 12:52:40,566 : INFO : PROGRESS: at example #1150000, processed 13427173349 words (1376080/s), 244299228 word types, 1150000 tags
2017-11-27 12:55:38,203 : INFO : PROGRESS: at example #1160000, processed 13683147891 words (1440981/s), 248662959 word types, 1160000 tags
2017-11-27 13:01:31,595 : INFO : PROGRESS: at example #1170000, processed 14248810893 words (1600715/s), 257698537 word types, 1170000 tags
2017-11-27 13:06:43,517 : INFO : PROGRESS: at example #1180000, processed 14730700025 words (1544859/s), 265550753 word types, 1180000 tags
2017-11-27 13:08:36,461 : INFO : PROGRESS: at example #1190000, processed 14882238992 words (1341595/s), 267362897 word types, 1190000 tags
2017-11-27 13:11:09,104 : INFO : PROGRESS: at example #1200000, processed 14994557079 words (735842/s), 268526014 word types, 1200000 tags
2017-11-27 13:13:35,226 : INFO : PROGRESS: at example #1210000, processed 15207374894 words (1456539/s), 271348701 word types, 1210000 tags
2017-11-27 13:17:06,999 : INFO : PROGRESS: at example #1220000, processed 15495935602 words (1362558/s), 275731241 word types, 1220000 tags
2017-11-27 13:21:30,391 : INFO : PROGRESS: at example #1230000, processed 15860509580 words (1384158/s), 280004206 word types, 1230000 tags
2017-11-27 13:22:54,453 : INFO : PROGRESS: at example #1240000, processed 15972466459 words (1331792/s), 281398502 word types, 1240000 tags
2017-11-27 13:24:50,621 : INFO : PROGRESS: at example #1250000, processed 16125513992 words (1317438/s), 283066579 word types, 1250000 tags
2017-11-27 13:30:00,828 : INFO : PROGRESS: at example #1260000, processed 16568002145 words (1426481/s), 288482853 word types, 1260000 tags
2017-11-27 13:31:07,852 : INFO : PROGRESS: at example #1270000, processed 16650673396 words (1233381/s), 289273909 word types, 1270000 tags
2017-11-27 13:32:14,180 : INFO : PROGRESS: at example #1280000, processed 16736681722 words (1296570/s), 290444309 word types, 1280000 tags
2017-11-27 13:34:00,434 : INFO : PROGRESS: at example #1290000, processed 16883054135 words (1377598/s), 292448959 word types, 1290000 tags
2017-11-27 13:35:24,700 : INFO : PROGRESS: at example #1300000, processed 16988089315 words (1246432/s), 294098789 word types, 1300000 tags
2017-11-27 13:38:51,888 : INFO : PROGRESS: at example #1310000, processed 17257650497 words (1301093/s), 298892240 word types, 1310000 tags
2017-11-27 13:43:37,499 : INFO : PROGRESS: at example #1320000, processed 17619644259 words (1267448/s), 303380082 word types, 1320000 tags
2017-11-27 13:46:17,831 : INFO : PROGRESS: at example #1330000, processed 17826979573 words (1293163/s), 305716169 word types, 1330000 tags
2017-11-27 13:49:40,045 : INFO : PROGRESS: at example #1340000, processed 18093105136 words (1316016/s), 308753076 word types, 1340000 tags
2017-11-27 13:51:11,434 : INFO : PROGRESS: at example #1350000, processed 18208800931 words (1266091/s), 309884698 word types, 1350000 tags
2017-11-27 13:53:05,289 : INFO : PROGRESS: at example #1360000, processed 18355434145 words (1287764/s), 311215118 word types, 1360000 tags
2017-11-27 13:53:08,221 : INFO : collected 311250382 word types and 1360633 unique tags from a corpus of 1360633 examples and 18358933358 words
2017-11-27 13:53:08,221 : INFO : Loading a fresh vocabulary
2017-11-27 14:20:14,513 : INFO : min_count=5 retains 31820897 unique words (10% of original 311250382, drops 279429485)
2017-11-27 14:20:14,513 : INFO : min_count=5 leaves 17982644368 word corpus (97% of original 18358933358, drops 376288990)
2017-11-27 14:22:06,467 : INFO : deleting the raw counts dictionary of 311250382 items
2017-11-27 14:23:23,805 : INFO : sample=0.001 downsamples 30 most-common words
2017-11-27 14:23:23,805 : INFO : downsampling leaves estimated 15316549016 word corpus (85.2% of prior 17982644368)
2017-11-27 14:23:23,805 : INFO : estimated required memory for 31820897 words and 300 dimensions: 94185487500 bytes
2017-11-27 14:26:48,005 : INFO : resetting layer weights
Traceback (most recent call last):
  File "C:/Users/Administrator/PycharmProjects/Concept_Hierarchical_Model/d2v_core.py", line 112, in <module>
    model.build_vocab(it)
  File "C:\Users\Administrator\Anaconda2\lib\site-packages\gensim\models\word2vec.py", line 546, in build_vocab
    self.finalize_vocab(update=update)  # build tables & arrays
  File "C:\Users\Administrator\Anaconda2\lib\site-packages\gensim\models\word2vec.py", line 717, in finalize_vocab
    self.reset_weights()
  File "C:\Users\Administrator\Anaconda2\lib\site-packages\gensim\models\doc2vec.py", line 655, in reset_weights
    super(Doc2Vec, self).reset_weights()
  File "C:\Users\Administrator\Anaconda2\lib\site-packages\gensim\models\word2vec.py", line 1109, in reset_weights
    self.syn1neg = zeros((len(self.wv.vocab), self.layer1_size), dtype=REAL)
MemoryError

I have 256 GB of RAM; information regarding resource usage at the time of the error is attached as a screenshot.
Can you please help?
ResourceUsage.png

Gordon Mohr

Nov 27, 2017, 11:38:28 AM
to gensim
Memory usage of the gensim model is *not* a function of the size of the text corpus, but of the number of unique word-tokens and tags. So splitting the corpus won't necessarily help from the gensim side, unless it significantly reduces the number of word-tokens or tags in the model.

If you're holding the whole corpus in-memory, splitting it might help, by not using as much memory outside of the gensim model. But ideally you'd be streaming the corpus from somewhere else, so its size would be irrelevant. So if you are in fact holding all source text, and then the full corpus as fed to Doc2Vec, in main memory (Python objects), changing that to stream-from-disk will free up a lot of memory for the model. 
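
As a rough sketch of that stream-from-disk approach (field names, file layout and tokenization are placeholders to adapt to your actual JSON files), instead of building the two big DocLabels/Data lists you could yield each example on demand:

    import glob
    import json
    from gensim.models.doc2vec import TaggedDocument

    class JsonCorpus(object):
        """Re-reads the JSON files on every pass instead of holding big lists in RAM."""
        def __init__(self, path_pattern):
            self.paths = sorted(glob.glob(path_pattern))

        def __iter__(self):
            for path in self.paths:
                with open(path) as f:
                    for line in f:
                        record = json.loads(line)
                        # 'text' and 'name' stand in for whatever fields your files use
                        yield TaggedDocument(words=record['text'].split(),
                                             tags=[record['name']])

    # e.g. model.build_vocab(JsonCorpus('/data/part1/*.json')), then train as before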

Within gensim, the number of tags (1360633) is relatively small compared to unique word-tokens (31820897), so the main way you could make the model use less memory is to reduce the vocabulary, for example by using a larger `min_count`. With such a large corpus, it's unlikely words with only 5 or even 50 occurrences are going to get strong representations, or contribute much to the model, so you could be more aggressive here (and the quality of the remaining vectors may also improve). 

Other than that: a given model configuration takes the memory it takes. Only by (a) giving it more memory (less memory use in other steps, or a bigger machine) or (b) changing parameters like the retained vocabulary can a process that would otherwise hit memory limits be made to succeed.

- Gordon