How to access BLAS without GIL?


Jacob Schreiber

unread,
Nov 27, 2017, 9:05:37 AM11/27/17
to Numba Public Discussion - Public
Howdy

I'd like to access BLAS, specifically dgemm, in numba-wrapped functions while also releasing the GIL. When I try a simple function like...

import numpy
from numba import jit


def dot(X):
    return numpy.dot(X.T, X)


@jit('double[:,:](double[:,:])', nopython=True, nogil=True)
def jit_dot(X):
    return numpy.dot(X.T, X)

I find that the plain `dot` function is significantly faster than `jit_dot`, and that `jit_dot` is slowed down by multithreading rather than sped up.

Is there a way to use numba to release the GIL while also accessing BLAS?

Thanks

Stanley Seibert

unread,
Nov 27, 2017, 1:22:19 PM11/27/17
to Numba Public Discussion - Public
This difference doesn't make sense to me, unless the versions of NumPy and SciPy you are using were compiled against different BLAS implementations. For some obscure technical reasons, Numba uses the SciPy BLAS when JIT compiling functions that call numpy.dot. In all the cases we've encountered, both are compiled with the same implementation, but in principle they could differ.

Where did your numpy and scipy packages come from?
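
One quick way to check is to print the build configuration of each package; numpy.__config__.show() and scipy.__config__.show() report the BLAS/LAPACK libraries each was linked against:

import numpy
import scipy

# Print the BLAS/LAPACK build info for each package; entries such as
# blas_opt_info should name the same implementation in both outputs.
numpy.__config__.show()
scipy.__config__.show()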


Jacob Schreiber

unread,
Nov 28, 2017, 11:29:11 PM11/28/17
to Numba Public Discussion - Public
I started off with the default ones from Anaconda for a 64-bit Ubuntu machine. Here is a screenshot of my notebook.

[notebook screenshot not preserved]
If I can get multi-threading to work using numba, then I'd go ahead and convert pomegranate entirely over to numba. This is currently the blocking issue.

Thanks for any help you can provide!

Jacob



Kevin Sheppard

unread,
Nov 29, 2017, 12:27:36 PM11/29/17
to Numba Public Discussion - Public
I believe you need to use the `out` argument to benefit substantially from releasing the GIL when using `dot`. Without `out`, NumPy must construct a Python object and allocate memory for each result, and it has to hold the GIL to do so.

import numpy as np
from joblib import Parallel, delayed
import datetime as dt

N = 4

X = np.random.randn(50000, 1000)
Y = np.random.randn(1000, 1000)
Z = np.empty((50000, 1000))
Zs = [np.empty((50000, 1000)) for _ in range(N)]

# Serial baseline: reuse a single preallocated output buffer.
tic = dt.datetime.now()
for i in range(N):
    np.dot(X, Y, out=Z)
toc = dt.datetime.now() - tic
print(toc.total_seconds())

# Threaded version: one preallocated output buffer per job, so no
# allocation (and no GIL acquisition for it) occurs in the workers.
tic = dt.datetime.now()
with Parallel(n_jobs=N, backend='threading') as P:
    P(delayed(np.dot, check_pickle=False)(X, Y, Zs[i]) for i in range(N))
toc = dt.datetime.now() - tic
print(toc.total_seconds())

I use `set MKL_NUM_THREADS=1` in the shell to ensure MKL's own multithreading doesn't clobber the performance gains.
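
The same thing can be done from inside Python, as a sketch, by setting the environment variable before NumPy (and hence MKL) is first loaded:

import os

# Must be set before numpy is imported, since MKL reads the variable
# when the library is first loaded.
os.environ['MKL_NUM_THREADS'] = '1'

import numpy as np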

When I run this I get times of ~9s for the explicitly serial version and ~3s for the multithreaded version, which shows that NumPy releases the GIL when calling `dot`.

It is possible I'm not fully understanding your issue. 

Kevin

Siu Kwan Lam

unread,
Nov 29, 2017, 12:59:20 PM11/29/17
to numba...@continuum.io
Can you try removing the type signature, or changing it to double[:,::1](double[:,::1]) instead?

The type signature you provided restricts the function to arrays of unspecified layout. If you skip the type signature, numba will infer from the actual arguments that the arrays have C layout, so the implementation can optimize for that layout. Alternatively, spelling the 2d array as double[:, ::1] forces C layout as well.
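
For example, a minimal sketch of the contiguous signature, applied to the same single-argument function from your original post:

import numpy
from numba import jit

# ::1 in the innermost dimension declares the array C-contiguous,
# which lets numba dispatch to the gemm path optimized for that layout.
@jit('double[:,::1](double[:,::1])', nopython=True, nogil=True)
def jit_dot(X):
    return numpy.dot(X.T, X)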

I tried your notebook and numba is slightly faster. But beware that there may be oversubscription of threads if the underlying gemm implementation is MKL and you are also launching your own threads that each run multithreaded MKL gemm in parallel.

--
Siu Kwan Lam
Software Engineer
Continuum Analytics

Jacob Schreiber

unread,
Nov 30, 2017, 6:25:07 PM11/30/17
to Numba Public Discussion - Public
The oversubscription of threads was the issue. Thanks for the help!