ThreadPoolExecutor is not consistently reducing execution time


B.J.D. Jacobs

Apr 18, 2017, 6:31:52 PM
to Numba Public Discussion - Public
I am in the process of parallelizing a large for-loop in my optimization code by jitting the inner function of this loop with Numba in nopython mode with the GIL released.
The jitted function relies on a few basic NumPy operations on a handful of relatively small NumPy arrays; everything happens in memory.
In addition, I have restricted the number of threads available to NumPy to 1 via the environment variables MKL_NUM_THREADS, NUMEXPR_NUM_THREADS, and OMP_NUM_THREADS.
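
Roughly, the jitted kernel looks like this (a simplified sketch with placeholder names and arguments, not my actual code):

import numpy as np
from numba import njit

@njit(nogil=True, cache=True)
def inner_function(x, weights):
    # A few basic NumPy operations on small in-memory arrays,
    # compiled in nopython mode with the GIL released.
    return np.dot(weights, x) + x.sum()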

The parallelization is done using a ThreadPoolExecutor from the concurrent.futures module.
I use the map method of this ThreadPoolExecutor to call my jitted function several thousand times in parallel.
Each individual call to the jitted function is cheap, taking just a couple of milliseconds to evaluate.
However, as I need to evaluate this function thousands of times per iteration, and I have thousands of iterations, the total time spent in the for-loop adds up.
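
The dispatching is essentially this (again a simplified sketch with placeholder names; inner_function is the kernel sketched above):

from concurrent.futures import ThreadPoolExecutor

def evaluate_all(tasks, weights, n_workers):
    # tasks is a list of small NumPy arrays; one jitted call per task.
    with ThreadPoolExecutor(max_workers=n_workers) as executor:
        return list(executor.map(lambda x: inner_function(x, weights), tasks))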


Unfortunately, the parallelization results are not as expected: Increasing the number of workers in the thread pool from 1 to 2 or 4 does not decrease the execution time.
If I monitor my CPU usage with the top command in the terminal, I don't see the CPU under full load either.
It seems that no more than one thread is active, even though more threads are available.
I have not yet been able to produce an isolated minimal working example of this behavior outside of my optimization code.

To test this further, I artificially increased the execution time of my jitted function by taking the inverse of a large dummy matrix inside the function.
In that case I do see a decrease in execution time when I increase the number of workers, and the CPU is under full load as well.
The larger this dummy matrix, the larger the (relative) performance gap between a single thread and multiple threads.
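
Concretely, the artificial workload was something like this (a sketch, not my exact code; size is the knob I vary):

import numpy as np
from numba import njit

@njit(nogil=True)
def heavy_inner_function(x, weights, size):
    # Throwaway work purely to lengthen each call.
    dummy = np.eye(size) + 0.01 * np.ones((size, size))
    np.linalg.inv(dummy)
    return np.dot(weights, x) + x.sum()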


I am not able to explain this behavior. It almost seems as if the extra threads available to the ThreadPoolExecutor object are simply not used to execute tasks when those tasks are very light. I am not very familiar with parallel processing, so I am not sure whether that explanation actually makes sense, or how to verify that all threads are being used.

To summarize, I have two questions:
1. Is there an intuitive explanation for why I only observe a speed-up when I increase the number of threads if the individual function calls are heavy? Why is there no speed-up for light function calls? Can I somehow work around this restriction?

Currently I am considering pushing the outer for-loop into Numba as well, and (manually) dividing the thousands of function calls across the available threads, as sketched below. That way each thread is assured of 'sufficient work' and will hopefully get activated. However, this requires some serious rewriting on my part, because the outer loop currently relies on lists of NumPy arrays that cannot be used directly in nopython mode.
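
The chunked version I have in mind would look roughly like this (a sketch with placeholder names, assuming I first pack the data into a single 2-D array and reuse the inner_function sketched earlier):

import numpy as np
from numba import njit
from concurrent.futures import ThreadPoolExecutor

@njit(nogil=True)
def run_chunk(data, idx, weights):
    # Jitted outer loop over one block of rows; calls the jitted
    # inner_function for every row in the block.
    out = np.empty(idx.shape[0])
    for j in range(idx.shape[0]):
        out[j] = inner_function(data[idx[j]], weights)
    return out

def evaluate_chunked(data, weights, n_workers):
    # One contiguous block of row indices per worker, so each thread
    # receives a substantial amount of work instead of many tiny tasks.
    chunks = np.array_split(np.arange(data.shape[0]), n_workers)
    with ThreadPoolExecutor(max_workers=n_workers) as executor:
        parts = list(executor.map(lambda idx: run_chunk(data, idx, weights), chunks))
    return np.concatenate(parts)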

2. Is it possible to print the current thread ID within nopython mode? This would help me verify whether my code is actually using all of the workers available to the ThreadPoolExecutor object.
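
For now, the closest I can get is recording thread IDs from a plain-Python wrapper around the jitted call (sketch below, with placeholder names), but I would prefer to confirm this from inside the nopython code itself:

import threading

seen_threads = set()

def traced_call(x, weights):
    # Runs in the worker thread, but outside nopython mode.
    seen_threads.add(threading.get_ident())
    return inner_function(x, weights)

# After executor.map(...) over traced_call has finished, len(seen_threads)
# tells me how many distinct worker threads actually picked up tasks.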