How does prange() work?


Peter Kucirek

Oct 2, 2017, 6:05:59 PM
to Numba Public Discussion - Public
Hello, 

I'm writing a very involved piece of parallel code where the time it takes to process each element varies wildly. Usually, I parallelize code by splitting NumPy arrays into equal-sized chunks using np.array_split(). However, given how uneven the per-element runtime is, I assume that uniform chunking is not an optimal way to distribute the load across multiple cores.
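
For reference, np.array_split() just divides an array into contiguous, nearly equal-sized pieces:

import numpy as np

np.array_split(np.arange(10), 3)
# -> [array([0, 1, 2, 3]), array([4, 5, 6]), array([7, 8, 9])]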

All that said, I was hoping that the prange() function might be smarter than I am at balancing this load, but it appears not to be. Here's my test:

import numba as nb
import numpy as np
from threading import Thread

n = 500_000
# explicit int32 dtypes so the arrays match the int32 signatures used below
draws = np.random.randint(low=1, high=10, size=n, dtype=np.int32)
seeds = np.arange(n, dtype=np.int32)

@nb.jit([nb.float64(nb.int32, nb.int32)], nogil=True, nopython=True)
def kernel(seed, n_draws):
    np.random.seed(seed)
    r = np.random.uniform(0, 1, 1)[0]
    for _ in range(n_draws**2):
        r = np.random.uniform(0, 1, 1)[0]
    return r

The kernel function is set up to (hopefully) take longer the more draws are provided. Both of my options for balancing the load call this same function.

@nb.jit(nb.void(nb.int32[:], nb.int32[:], nb.float64[:]), nopython=True, nogil=True)
def worker1(seeds, draws, out):
    assert len(seeds) == len(draws)
    assert len(seeds) == len(out)
    
    for i in range(len(out)):
        out[i] = kernel(seeds[i], draws[i])

def manager1(seeds, draws, n_threads=8):
    assert len(seeds) == len(draws)
    n = len(seeds)
    out = np.zeros(n, dtype='f8')
    seed_chunks = np.array_split(seeds, n_threads, axis=0)
    draws_chunks = np.array_split(draws, n_threads, axis=0)
    out_chunks = np.array_split(out, n_threads, axis=0)
    
    threads = [Thread(target=worker1, args=[seed_chunks[i], draws_chunks[i], out_chunks[i]]) for i in range(n_threads)]
    for t in threads: t.start()
    for t in threads: t.join()
    
    return out

This first manager uses uniform chunks.

@nb.jit(nb.float64[:](nb.int32[:], nb.int32[:]), nopython=True, nogil=True, parallel=True)
def manager2(seeds, draws):
    assert len(seeds) == len(draws)
    n = len(seeds)
    out = np.zeros(n, dtype=np.float64)
    
    for i in nb.prange(n):
        out[i] = kernel(seeds[i], draws[i])
    
    return out

This second manager uses prange().

%%timeit
test1 = manager1(seeds, draws)
>> 1.77 s ± 14.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

%%timeit
test2 = manager2(seeds, draws)
>> 1.77 s ± 13.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

The two options have virtually identical performance, even though I was expecting the second manager to balance the load better. So how does prange() actually work? Does it create worker threads that eagerly grab the next available task, or does it split the data uniformly, the way I was doing manually?

Stanley Seibert

Oct 3, 2017, 10:17:35 AM
to Numba Public Discussion - Public
You are correct. The current implementation chunks the work up front and assigns a contiguous iteration range to each thread, with no option for work stealing. This is something we could improve. I've opened a GitHub issue to remind us:
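
In the meantime, if the static chunking is actually hurting you, one workaround is to do the scheduling yourself on top of your nogil-compiled worker: have each thread repeatedly claim the next small slice from a shared cursor. A rough, untested sketch reusing worker1 from above (the chunk size of 1024 and the thread count are arbitrary):

import numpy as np
from threading import Thread, Lock

def manager_dynamic(seeds, draws, n_threads=8, chunk=1024):
    # Hand-rolled dynamic scheduling: threads pull the next chunk-sized
    # slice from a shared cursor instead of getting one fixed range up front.
    n = len(seeds)
    out = np.zeros(n, dtype='f8')
    lock = Lock()
    cursor = [0]  # next unclaimed index, shared between threads

    def run():
        while True:
            with lock:
                start = cursor[0]
                cursor[0] = start + chunk
            if start >= n:
                return
            stop = min(start + chunk, n)
            # worker1 releases the GIL (nogil=True), so the slices run concurrently;
            # the slices are views, so writes to out[start:stop] land in out
            worker1(seeds[start:stop], draws[start:stop], out[start:stop])

    threads = [Thread(target=run) for _ in range(n_threads)]
    for t in threads: t.start()
    for t in threads: t.join()
    return out

Smaller chunks give better balance at the cost of more locking overhead.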


