The parallelism of parts of the implementation is still limited by:
(1) the Python "global interpreter lock" (GIL)
(2) a single corpus-iteration thread, which batches text examples to the worker threads
You've already got a pretty simple corpus-iterator, so that may not be a bottleneck. (Still, if the data is coming from a spinning disk, moving it to a faster volume or bringing it all into RAM might help a bit.)
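If you want to rule out I/O, a minimal sketch of both options, assuming a whitespace-tokenized file at a hypothetical path `corpus.txt` with one document per line:

```python
from gensim.models.doc2vec import TaggedDocument

class StreamCorpus:
    """Streams from disk, re-reading the file on every pass (every epoch)."""
    def __init__(self, path):
        self.path = path

    def __iter__(self):
        with open(self.path, encoding='utf-8') as f:
            for i, line in enumerate(f):
                yield TaggedDocument(words=line.split(), tags=[i])

# RAM-cached alternative: pay the read cost once, then iterate from memory
in_memory_corpus = list(StreamCorpus('corpus.txt'))
```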
Contention around that single distributing thread (and the GIL) is likely the larger issue, and it often means peak throughput is achieved with just 3-8 worker threads, even if more physical cores are available.
The exact optimal number of threads can vary with the system and libraries, the meta-parameters, and the vocabulary size. For example, larger values for `window`, the `negative` count, or the vector `size` each cause the highly-optimized, multi-threadable code blocks to take longer, thus *lessening* contention and potentially getting more work done 'for free' from cores that would otherwise sit idle. Of course, increasing those values may not give optimal results for your other project goals.
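If you want to find that sweet spot empirically, you can time a single training pass at several `workers` values. A rough sketch, reusing the RAM-cached `in_memory_corpus` from the earlier snippet and assuming a recent gensim (which takes `vector_size` where older releases took `size`); the other parameter values here are placeholders, not recommendations:

```python
import time
from gensim.models.doc2vec import Doc2Vec

for workers in (2, 3, 4, 6, 8):
    model = Doc2Vec(dm=1, vector_size=100, window=5, negative=5,
                    min_count=5, workers=workers)
    model.build_vocab(in_memory_corpus)
    start = time.time()
    # One epoch is enough to compare relative throughput between settings
    model.train(in_memory_corpus, total_examples=model.corpus_count, epochs=1)
    print(f'{workers} workers: {time.time() - start:.1f}s per epoch')
```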
A few other tangential comments on your set-up:
* using Intel's MKL (as installed, for example, by the `conda` environment-manager) may outperform OpenBLAS; the sketch after this list shows one way to check which library your NumPy is linked against
* it's rare to use PV-DM with summing rather than averaging (your `dm=1, dm_mean=0`)
* keeping words that appear only once (`min_count=1`), or even just a few times, usually serves more to interfere with, or dilute, other vectors than to help.
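To illustrate those last points, a sketch combining the BLAS check with the more conventional parameter choices (again reusing the hypothetical `in_memory_corpus`; the `workers` value is just a placeholder):

```python
import numpy as np
from gensim.models.doc2vec import Doc2Vec

# Look for 'mkl' vs 'openblas' in the printed build configuration
np.show_config()

model = Doc2Vec(
    in_memory_corpus,
    dm=1, dm_mean=1,   # PV-DM with context-vector *averaging*, the usual choice
    min_count=5,       # discard rare words rather than keeping everything
    workers=4,         # tune per the thread-count discussion above
)
```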
Hope this helps,
- Gordon