calling numpy.dot reacquires the GIL?

Stephen Tu

Mar 31, 2016, 6:32:15 PM3/31/16
to numba...@continuum.io
I have the following piece of code that I am running on multiple threads (over independent inputs):

import numpy as np
from numba import jit

@jit('void(complex128[:,:], complex128[:,:,:], int32[:], int32[:])', nopython=True, nogil=True)
def _evaluate1(res, cubeFactor, m_rows, m_cols):
    for idx0 in range(len(m_rows)):
        i0, j0 = m_rows[idx0], m_cols[idx0]
        for idx1 in range(len(m_rows)):
            i1, j1 = m_rows[idx1], m_cols[idx1]
            i_diff, j_diff = i1 - i0, j1 - j0
            # WARNING: calling np.dot() seems to reacquire the GIL!
            #res[i_diff, j_diff] += np.dot(cubeFactor[i0, j0, :], np.conj(cubeFactor[i1, j1, :]))
            acc = 0.0
            for k in range(cubeFactor.shape[2]):
                acc += cubeFactor[i0, j0, k] * np.conj(cubeFactor[i1, j1, k])
            res[i_diff, j_diff] += acc

I noticed that when I use np.dot() instead of the hand-rolled dot product, multi-threaded performance tanks, and it appears that all the time is spent fighting over locks. With the hand-rolled dot product the issue disappears. My guess is that calling np.dot() re-acquires the GIL? This is surprising to me, and it would be great if nogil=True would either (a) throw an exception or (b) at the very least issue a warning when this happens.
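
For reference, here's roughly how I'm driving the threads (a sketch; the shapes, sizes, and worker function are made up for illustration, not my actual harness):

import numpy as np
from concurrent.futures import ThreadPoolExecutor

n, depth = 64, 32
cubeFactor = (np.random.rand(n, n, depth)
              + 1j * np.random.rand(n, n, depth))  # complex128 cube

def worker(seed):
    # Each thread gets its own independent inputs and output buffer.
    rng = np.random.RandomState(seed)
    m_rows = rng.randint(0, n, size=100).astype(np.int32)
    m_cols = rng.randint(0, n, size=100).astype(np.int32)
    res = np.zeros((n, n), dtype=np.complex128)
    _evaluate1(res, cubeFactor, m_rows, m_cols)  # nogil=True: GIL released here
    return res

with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(worker, range(8)))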

Thanks!

Stanley Seibert

Mar 31, 2016, 6:43:45 PM3/31/16
to Numba Public Discussion - Public
Our np.dot implementation delegates to the SciPy BLAS interface exported for Cython, which in turn calls the BLAS library that SciPy was linked against. There shouldn't be any GIL involved.

However, you might be seeing contention from a multithreaded BLAS implementation (like OpenBLAS or MKL) attempting to use N threads from each of your Python threads. Can you try setting this environment variable before launching your program:

OMP_NUM_THREADS=1

I believe both OpenBLAS and MKL will respect the OpenMP environment variables.
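
For example, doing it from inside the script looks roughly like this (a sketch; the key point is setting the variables before NumPy/SciPy first load the BLAS library, and the vendor-specific variables are an extra precaution):

import os

# Must be set before NumPy/SciPy load the BLAS library, or it may be ignored.
os.environ['OMP_NUM_THREADS'] = '1'
# OpenBLAS and MKL also read their own vendor-specific variables:
os.environ['OPENBLAS_NUM_THREADS'] = '1'
os.environ['MKL_NUM_THREADS'] = '1'

import numpy as np  # import only after the variables are set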


Stephen Tu

Apr 1, 2016, 12:20:43 AM4/1/16
to numba...@continuum.io
Hi Stanley,

I tried your suggestion, and it did not improve the situation. I'm OK with my workaround for now, but happy to debug this further. Any tips for figuring out which lock is being contended?
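
For reference, here's roughly how I'm measuring the scaling (a sketch; work() is just a stand-in for one independent _evaluate1() call, and being pure Python it will itself serialize on the GIL):

import time
from concurrent.futures import ThreadPoolExecutor

def bench(fn, n_threads, n_items=8):
    # Time n_items independent calls spread across n_threads.
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=n_threads) as pool:
        list(pool.map(fn, range(n_items)))
    return time.perf_counter() - start

def work(_):
    # Stand-in for one independent _evaluate1() call.
    return sum(i * i for i in range(10**6))

for n in (1, 2, 4):
    print(n, 'threads:', round(bench(work, n), 3))

If wall time doesn't drop as threads are added, something is serializing (GIL, BLAS locks, allocator, ...).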

Stephen Tu

Apr 2, 2016, 1:52:26 PM4/2/16
to numba...@continuum.io
Stan,

This may actually be related to this bug: https://github.com/numba/numba/issues/1796 (which, incidentally, was reported by my collaborator). The slowness I was seeing may be due to contention from lots of unnecessary memory allocation.

ajasja....@gmail.com

Jun 3, 2016, 6:08:23 PM6/3/16
to Numba Public Discussion - Public, tu.st...@gmail.com
Hi!

I also noticed that hand-rolling `np.dot` increased performance by 20x. An example is here: <http://stackoverflow.com/a/37612696/952600>.

The arrays being dotted only had 3 elements, however.

Best,
Ajasja

Antoine Pitrou

Jun 6, 2016, 9:34:37 AM6/6/16
to numba...@continuum.io
On Fri, 3 Jun 2016 13:40:38 -0700 (PDT)
ajasja....@gmail.com wrote:
> Hi!
>
> I also noticed that hand-rolling `np.dot` increased performance by 20x.
> Example is here <http://stackoverflow.com/a/37612696/952600>.
>
> The arrays being dotted did only have 3 elements however.

Indeed, np.dot() calls an optimized implementation from the underlying BLAS library (e.g. MKL), but those optimized implementations have a significantly larger setup cost than a trivial product-and-summation loop. On tiny arrays the setup cost outweighs the potential performance gain from the optimized implementation.
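
To make the trade-off concrete, something like this (a sketch) will usually beat np.dot() on 3-element vectors and lose to it on large ones:

import numpy as np
from numba import njit

@njit
def handrolled_dot(a, b):
    # Trivial product-and-summation loop: no BLAS dispatch, no setup cost.
    acc = 0.0
    for i in range(a.shape[0]):
        acc += a[i] * b[i]
    return acc

a = np.random.rand(3)
b = np.random.rand(3)
assert np.isclose(handrolled_dot(a, b), np.dot(a, b))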

Regards

Antoine.
