Debugging vectorize parallel


Matthieu Dartiailh

Jul 21, 2016, 5:12:16 AM
to Numba Public Discussion - Public
Hi,

I am facing some troubles with vectorize when trying to use the parallel target.

When compiling my function for the cpu target it works without any trouble, but when compiling for the parallel target the call never completes and, even more surprisingly, the CPU activity is very low (lower than for the cpu target). In both cases the signature of the function is provided explicitly.

The code is quite complex and I have not been able to reproduce my issue in smaller tests. The rough outline is as follows (all functions are compiled in no-python mode):
- the vectorized function calls a jitted function taking the same arguments
- this function creates a jitclass instance and passes it to a third function (which computes an integral); evaluating the integrand is quite involved and requires inverting a matrix (see the sketch after this list)
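For reference, here is a minimal sketch of that structure (all names are hypothetical, this is not my actual code):

```python
import numpy as np
from numba import njit, vectorize, float64
# On current Numba jitclass lives in numba.experimental;
# on 2016-era Numba it was `from numba import jitclass`.
from numba.experimental import jitclass

@jitclass([('scale', float64)])
class Integrand(object):          # hypothetical integrand wrapper
    def __init__(self, scale):
        self.scale = scale

    def evaluate(self, t):
        # The real integrand is more involved; the key point is that
        # it inverts a small matrix, which goes through LAPACK/MKL.
        m = np.empty((2, 2))
        m[0, 0] = self.scale + t
        m[0, 1] = 0.5
        m[1, 0] = 0.25
        m[1, 1] = self.scale + 2.0 * t
        return np.linalg.inv(m)[0, 0]

@njit
def integrate(integrand, a, b, n):
    # Stand-in for the quad-style integrator: plain midpoint rule.
    h = (b - a) / n
    acc = 0.0
    for i in range(n):
        acc += integrand.evaluate(a + (i + 0.5) * h)
    return acc * h

@vectorize([float64(float64)], target='parallel')  # hangs; target='cpu' works
def f(x):
    return integrate(Integrand(x), 0.0, 1.0, 100)
```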

So far I have tested that:
- I can vectorize a function creating and inverting a matrix (MKL is set to use a single thread as the matrices are small)
- I can vectorize a function performing a trivial integration (using the same technique as above)
- I can vectorize a function integrating a function doing a matrix inversion

I am running the latest master of Numba on Windows 10.

I know that this is likely insufficient to really provide any insight into what the problem is, but I am wondering if there are any other tools or options in Numba that I could use to dig deeper.

Thanks

Best regards

Matthieu Dartiailh
 

stuart

Jul 22, 2016, 10:37:56 AM
to Numba Public Discussion - Public
Hi,

Sorry to hear you are having problems with this target.

A few Qs.
  1. Does this replicate on Linux, or does the problem occur only on Windows?
  2. Any chance you could run the program and attach gdb/a debugger to it? Poking around in the threads might reveal what is going on (it could be an accidental spin lock or something). A backtrace showing where the threads are stuck, or what they are actually doing if they are not stuck, would be helpful.
  3. Could you please dump the environment variables you use? We'd be especially interested in anything involving threads (MKL_NUM_THREADS, OMP_NUM_THREADS, NUMBA_NUM_THREADS, etc.). See the snippet after this list.
  4. How is the integrand being computed? Is it via a @cfunc callback? Is it just `np.linalg.inv()` and perhaps some `np.dot()`, which would hit MKL?
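For 2 and 3, something like this at the top of your script would do (a rough helper; the variable list is just the usual suspects):

```python
import os

# Attach with: gdb -p <pid>, then `thread apply all bt` for all backtraces.
print('PID for gdb attach:', os.getpid())
for var in ('MKL_NUM_THREADS', 'OMP_NUM_THREADS', 'NUMBA_NUM_THREADS'):
    print(var, '=', os.environ.get(var, '<unset>'))
```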
Many thanks,

--
stuart

Matthieu Dartiailh

Jul 22, 2016, 11:19:38 AM
to numba...@continuum.io

Thanks for your quick answer. I will try the steps you suggest.

However, I am going on holiday for two weeks, so this will probably have to wait at least that long.

For now I can answer points 3 and 4:

- MKL_NUM_THREADS is set to 1 at the beginning of my script (before any other import, as in the sketch below), and the OMP and NUMBA thread counts are not set explicitly.
- For the integration, I basically rewrote the Fortran code behind scipy's quad using Numba, and the function to integrate is wrapped in a jitclass. Evaluating the integrand is just a call to a jitted function that calls np.linalg.inv and np.dot, so there may be some weird issue with MKL.
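Concretely, the top of the script looks roughly like this (a sketch, not the exact code):

```python
import os
# Must be set before numpy/numba are imported so MKL reads it at load time.
os.environ['MKL_NUM_THREADS'] = '1'

import numpy as np
import numba
```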

Thanks

Matthieu

Siu Kwan Lam

Jul 22, 2016, 12:15:01 PM
to numba...@continuum.io
Matthieu,

Does your code have any `print()` calls or functions that reacquire the GIL (i.e. that call back into CPython)?  I just discovered that having `print()` in a parallel-vectorized function causes a deadlock.
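A minimal sketch of what I am seeing (with the pre-fix behaviour):

```python
import numpy as np
from numba import vectorize, float64

@vectorize([float64(float64)], target='parallel')
def noisy(x):
    print(x)        # needs the GIL from a worker thread
    return 2.0 * x

noisy(np.arange(4.0))   # never returns: the launcher is holding the GIL
```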

--
Siu Kwan Lam
Software Engineer
Continuum Analytics

Siu Kwan Lam

Jul 22, 2016, 12:40:38 PM
to numba...@continuum.io
I think I have found the problem.  The parallel ufunc launcher is not releasing the GIL.  My current minimal patch has fixed the issue with `print()` and ctypes callbacks.  I will clean it up and open a PR soon.

Siu Kwan Lam

Jul 22, 2016, 12:43:45 PM
to numba...@continuum.io
Here's the issue tracking this problem: https://github.com/numba/numba/issues/1998

Matthieu Dartiailh

Jul 22, 2016, 12:52:46 PM
to numba...@continuum.io
That's interesting. I don't think there is any print, but I will check. What other functions are likely to re-acquire the GIL?

Thanks again

Matthieu



Matthieu Dartiailh

Jul 22, 2016, 3:13:39 PM
to numba...@continuum.io
Is there an easy way to check whether a function reacquires the GIL? I am asking because, looking at my code, I do not see anything obvious.

Thanks

Matthieu




Siu Kwan Lam

Jul 22, 2016, 3:55:03 PM
to numba...@continuum.io

When I debug this, I set the environment variable NUMBA_DEBUG_JIT=1.  This tells Numba to put a C-level print call between every instruction it emits.  The print displays the Numba IR instruction being executed, so you will probably be able to see the instruction that is triggering the deadlock.
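For example (it must be set before anything is compiled):

```python
import os
# Must be set before numba is imported / anything is JIT-compiled.
os.environ['NUMBA_DEBUG_JIT'] = '1'

import numba  # subsequent compilations interleave IR-tracing prints
```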

Another option is to try my patch https://github.com/numba/numba/pull/1999

Otherwise I can't think of anything GIL-specific to look for, since the GIL can be acquired and released outside of Numba's control.

Matthieu Dartiailh

Aug 9, 2016, 9:22:33 AM
to Numba Public Discussion - Public
Hi,

I tested your patch and it does solve the deadlock. However, I am still concerned, as I have not managed to pinpoint where the GIL is being acquired. I suspect this is also why, even though I am working with 8 threads (matching the processor cores), I get only a four-times speed-up instead of the 8x I expected, even though the vectorized function is pretty slow (2 ms per evaluation).

Thanks a lot for your help. As I don't have much time to spend on this at the moment, I will consider myself happy with the four-times improvement.

Matthieu

Edison Gustavo Muenz

Aug 9, 2016, 10:02:52 AM
to numba...@continuum.io

> 8 threads (matching the processor cores)

Are those real cores or virtual (hyperthreaded) cores? Note that if they are virtual cores, it is better to set the number of threads to the number of real cores.

Hyperthreading is helpful for operations that are not CPU-intensive (like I/O), but when running a CPU-intensive algorithm you should aim for the number of real cores.
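For example (NUMBA_NUM_THREADS must be set before Numba is imported; the '4' here assumes 4 physical cores):

```python
import os
os.environ['NUMBA_NUM_THREADS'] = '4'   # number of physical cores (assumption)

import numba
print(numba.config.NUMBA_NUM_THREADS)   # confirm the pool size Numba will use
```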

Stanley Seibert

Aug 9, 2016, 10:10:09 AM
to Numba Public Discussion - Public
My experience with hyperthreading has been that you will see somewhere between no improvement and a 50% improvement when using the hyperthreaded virtual CPU cores (but never a 100% improvement).  A lot depends on how much cache and memory bandwidth each thread needs, since hyperthreading effectively cuts your bandwidth and cache per core in half.

We have seen compute-intensive jobs reach that 50% improvement level, but those calculations didn't put much pressure on memory I/O.
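If you want to measure this on your own workload, a rough sketch along these lines would do (numba.set_num_threads needs a modern Numba, 0.49+; on older versions you would rerun with different NUMBA_NUM_THREADS values instead):

```python
import time
import numpy as np
from numba import vectorize, float64, set_num_threads

@vectorize([float64(float64)], target='parallel')
def work(x):
    acc = 0.0
    for i in range(20000):            # synthetic CPU-bound loop
        acc += (x + i) * 1e-6
    return acc

data = np.random.rand(100_000)
work(data)                            # warm-up / compile
for n in (4, 8):                      # physical vs. logical core counts
    set_num_threads(n)
    t0 = time.perf_counter()
    work(data)
    print(n, 'threads:', round(time.perf_counter() - t0, 3), 's')
```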



Matthieu Dartiailh

Aug 9, 2016, 10:11:26 AM
to numba...@continuum.io
My processor is an i7-4720HQ, so 4 physical cores (and hence 8 logical processors). Given your remark, I guess a four-times speed-up is the best I can hope for.

Thanks for the tip.

Matthieu

Edison Gustavo Muenz

Aug 9, 2016, 10:13:35 AM
to numba...@continuum.io
I want to mention this podcast: http://www.rce-cast.com/Podcast/rce-52-atlas-automatically-tuned-linear-algebra-software.html

The guest is the author of ATLAS (a BLAS implementation); they discuss many subjects, one of which is hyperthreading.

Best
