Python overheads when launching CUDA kernels

Rémi Lehe

Apr 13, 2018, 10:40:53 AM
to Numba Public Discussion - Public
In a GPU simulation code based on Numba (https://github.com/fbpic/fbpic.git), I recently encountered a case where the run time seems to be dominated by the Python overhead of launching CUDA kernels. I was wondering whether there are any guidelines for reducing this overhead.

More precisely, this happens in a case where the problem size is very small (and hence the CUDA compute kernels themselves run very fast). The typical timeline in nvvp looks like this:

[nvvp timeline screenshot: short compute kernels separated by long idle gaps]

i.e. the compute kernels are fast, but separated by a lot of "idle" time.

Analyzing the CPU side (e.g. with cProfile) reveals that 80% of the run time is spent in the kernel-launching function `numba/cuda/compiler.py:line 697:__call__` (including time spent in the functions it calls), with an average of 0.25 ms per call. Looking down the call stack, most of this time seems to be spent in Python code (e.g. checking the types of the arguments).
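
For reference, here is a minimal sketch of the kind of setup where this shows up (the kernel, grid configuration, and array sizes are made up for illustration, not taken from fbpic):

```python
import cProfile

import numpy as np
from numba import cuda

# A deliberately tiny kernel, so that launch overhead dominates
@cuda.jit
def add_one(x):
    i = cuda.grid(1)
    if i < x.shape[0]:
        x[i] += 1.0

d_x = cuda.to_device(np.zeros(1024))
add_one[4, 256](d_x)  # warm-up call: triggers the JIT compilation

# In my case, most of the profiled time shows up under the __call__
# method in numba/cuda/compiler.py, not in the kernel itself
cProfile.run('for _ in range(1000): add_one[4, 256](d_x)', sort='cumtime')
```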

I was wondering if there are any guidelines for reducing the time spent there. For instance, I guess that fusing several CUDA kernels into a single one (where possible) would help, since it reduces the number of kernel launches.
Also, we typically use `cuda.jit` without explicitly specifying the types of the input arguments. Would explicitly specifying these types reduce the Python overhead? (A sketch of both ideas follows below.)
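
To make the question concrete, here is a sketch of both ideas together: a signature passed explicitly to `cuda.jit` (eager compilation) and two small per-element updates fused into one kernel. The kernel body and names are invented for illustration:

```python
from numba import cuda

# Explicit signature: the kernel is compiled once at definition time,
# instead of being specialized lazily at the first call
@cuda.jit('void(float64[:], float64[:])')
def push_and_damp(x, v):
    i = cuda.grid(1)
    if i < x.shape[0]:
        # Two small updates fused into one kernel,
        # so that one launch replaces two separate launches
        x[i] += v[i]
        v[i] *= 0.99
```

The part I am unsure about is whether the explicit signature also shortens the per-call argument checking, or only moves the compilation out of the first call.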

Thanks in advance for your help!