I have the following piece of code that I am running on multiple threads (over independent inputs):
@jit('void(complex128[:,:], complex128[:,:,:], int32[:], int32[:])', nopython=True, nogil=True)
def _evaluate1(res, cubeFactor, m_rows, m_cols):
    for idx0 in range(len(m_rows)):
        i0, j0 = m_rows[idx0], m_cols[idx0]
        for idx1 in range(len(m_rows)):
            i1, j1 = m_rows[idx1], m_cols[idx1]
            i_diff, j_diff = i1 - i0, j1 - j0
            # WARNING: calling np.dot() seems to reacquire the GIL!
            # res[i_diff, j_diff] += np.dot(cubeFactor[i0, j0, :], np.conj(cubeFactor[i1, j1, :]))
            acc = 0.0 + 0.0j  # complex accumulator for the hand-rolled dot product
            for k in range(cubeFactor.shape[2]):
                acc += cubeFactor[i0, j0, k] * np.conj(cubeFactor[i1, j1, k])
            res[i_diff, j_diff] += acc
I noticed that when I use np.dot() instead of the hand-rolled dot product, multi-threaded performance tanks, and all the time appears to be spent contending on locks. With the hand-rolled dot product the issue disappears. My guess is that calling np.dot() re-acquires the GIL? This is surprising to me, and it would be great if nogil=True would either (a) raise an exception or (b) at the very least issue a warning when the compiled function can still take the GIL.
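For reference, a minimal sketch of the multi-threaded driver I have in mind (the `run_threaded` helper, pool size, and input shapes are illustrative, not my exact harness; the import fallback only exists so the sketch still runs without numba installed):

```python
import numpy as np
from concurrent.futures import ThreadPoolExecutor

try:
    from numba import jit  # nogil=True only takes effect when numba is installed
except ImportError:        # fallback so the sketch remains runnable without numba
    def jit(*args, **kwargs):
        return lambda f: f

@jit('void(complex128[:,:], complex128[:,:,:], int32[:], int32[:])',
     nopython=True, nogil=True)
def _evaluate1(res, cubeFactor, m_rows, m_cols):
    for idx0 in range(len(m_rows)):
        i0, j0 = m_rows[idx0], m_cols[idx0]
        for idx1 in range(len(m_rows)):
            i1, j1 = m_rows[idx1], m_cols[idx1]
            acc = 0.0 + 0.0j
            for k in range(cubeFactor.shape[2]):
                acc += cubeFactor[i0, j0, k] * np.conj(cubeFactor[i1, j1, k])
            res[i1 - i0, j1 - j0] += acc

def run_threaded(cubes, m_rows, m_cols, n_threads=4):
    n = cubes[0].shape[0]
    # one independent output buffer per input cube; since the buffers are
    # disjoint, the threads need no locking once the kernel drops the GIL
    results = [np.zeros((n, n), dtype=np.complex128) for _ in cubes]
    with ThreadPoolExecutor(max_workers=n_threads) as pool:
        futures = [pool.submit(_evaluate1, r, c, m_rows, m_cols)
                   for r, c in zip(results, cubes)]
        for f in futures:
            f.result()  # re-raise any exception from a worker
    return results
```

With nogil=True and the hand-rolled loop this scales across cores as expected; swapping the inner loop back to np.dot() is what triggers the lock contention described above.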