C callbacks for GPU functions


Antonio Augusto Alves Junior

Jun 9, 2017, 6:30:51 AM
to Numba Public Discussion - Public
Hi all,

Congratulations, this is a very interesting and useful project.

I am an absolute beginner with Numba... so I apologise for asking this very silly question.

I am designing Python bindings for Hydra, a library for data analysis in high energy physics using multi-core CPUs and GPUs.
I would like to be able to define host/device functions in the Python interpreter, compile them using @jit decorators, then get their addresses and
pass them to the foreign C++ interface of Hydra.

Currently, it is possible to do this for the CPU target using @cfunc, as documented here: http://numba.pydata.org/numba-doc/latest/user/cfunc.html. I would like to have similar functionality for the GPU.
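
For reference, the CPU-side pattern I mean looks roughly like this (address and ctypes are documented attributes of the @cfunc result):

from numba import cfunc

@cfunc("float64(float64, float64)")
def add(a, b):
    return a + b

print(hex(add.address))      # raw function pointer, can be handed to a C/C++ API
print(add.ctypes(1.0, 2.0))  # ctypes wrapper, convenient for testing from Python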

Does anyone have a suggestion?


Cheers
A.A.

Antonio Augusto Alves Junior

Jun 10, 2017, 3:47:18 AM
to Numba Public Discussion - Public
Hi folks, any suggestions?

Cheers
A.A.

Jim Pivarski

Jun 10, 2017, 9:29:59 AM
to numba...@continuum.io
If I understand your question correctly, you want to use Numba to compile a Python function to PTX and get a pointer to it so that a CUDA-C++ function can use it as a callback. That way, you can have a CUDA-C++ function minimizer (for example) and pass it user-defined functions to minimize, which are written in Python for user convenience.

Looking at the documentation (http://numba.pydata.org/numba-doc/0.13/CUDAJit.html), numba.cuda.jit has a "device" argument. When "True," the Python function is compiled as a __device__ function, rather than __global__. So that's how you can get your callback.
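
For example, something like this minimal sketch (standard numba.cuda usage; it needs a CUDA-capable GPU to actually run):

import numpy as np
from numba import cuda

@cuda.jit(device=True)
def add(a, b):
    # analogue of a __device__ function: callable only from kernels or other device functions
    return a + b

@cuda.jit
def kernel(out, a, b):
    # analogue of a __global__ function: launchable from the host
    i = cuda.grid(1)
    if i < out.size:
        out[i] = add(a, b)

out = np.zeros(16)
kernel[1, 16](out, 1.0, 2.0)
print(out)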

As for its pointer, there must be a way using ctypes. It will probably be in some llvmlite attribute, rather than a Numba attribute. I might have an example for a CPU function, and it would be similar for GPU functions.

But... wait a minute... I thought that functions running in GPUs couldn't run functions from pointers. That would make recursion possible, something else I thought was impossible on GPUs because the stack depth needs to be known at compile time.

Jim






Antonio Augusto Alves Junior

Jun 11, 2017, 1:57:24 AM
to Numba Public Discussion - Public
Hi Jim, sorry... I could not reply before.


On Saturday, June 10, 2017 at 3:29:59 PM UTC+2, Jim Pivarski wrote:
If I understand your question correctly, you want to use Numba to compile a Python function to PTX and get a pointer to it so that a CUDA-C++ function can use it as a callback. That way, you can have a CUDA-C++ function minimizer (for example) and pass it user-defined functions to minimize, which are written in Python for user convenience.


 Perfect! Yes you did understand correctly.

Looking at the documentation (http://numba.pydata.org/numba-doc/0.13/CUDAJit.html), numba.cuda.jit has a "device" argument. When "True," the Python function is compiled as a __device__ function, rather than __global__. So that's how you can get your callback.

 
I think yes, that could be one way. Another possibility is to extend the functionality of numba.cfunc... something like the "pseudo-code" below:

>>> from numba import cfunc

>>> def add(a, b):
...     return a + b
...
>>> c_callback = cfunc("float64(float64, float64)", gpu=True, cpu=True)(add)

>>> from hydra import generic_algo

>>> gpu_result = generic_algo.gpu_calculation(c_callback.gpu_address)

>>> cpu_result = generic_algo.cpu_calculation(c_callback.cpu_address)


where generic_algo is the interface to the C++ boilerplate and can be any of the Hydra algorithms (e.g. minimization, MC generation, integration, etc.).
 
As for its pointer, there must be a way using ctypes. It will probably be in some llvmlite attribute, rather than a Numba attribute. I might have an example for a CPU function, and it would be similar for GPU functions.

But... wait a minute... I thought that functions running in GPUs couldn't run functions from pointers. That would make recursion possible, something else I thought was impossible on GPUs because the stack depth needs to be known at compile time.


The CUDA Programming Guide states:
 

E.3.9.5. Function Pointers

The address of a __global__ function taken in host code cannot be used in device code (e.g. to launch the kernel). Similarly, the address of a __global__ function taken in device code cannot be used in host code.

It is not allowed to take the address of a __device__ function in host code.

E.3.9.6. Function Recursion

__global__ functions do not support recursion.


So although I think even using recursion in device functions is not a very smart thing to do, there is no formal constraint forbidding it.


Do you have any ideas on how I could implement this?


A.A.



Jim Pivarski

Jun 11, 2017, 12:52:21 PM
to numba...@continuum.io
The part of the documentation you quoted didn't say that you can't execute a __device__ function pointer on the device, but it didn't say you can, either.

I was under the impression that PTX bytecode can't have a function call stack that grows without bound at runtime, which is fundamentally unlike x86 bytecode. This is what I have heard— I'm not an expert. Do you have evidence, independent of Numba, that you can pass and execute __device__ function pointers on the device? Have you seen a working example in pure CUDA code— no Numba? Because if you haven't, then what you're trying to do in Numba might be impossible for the hardware, and it's not really a Numba problem.

Now, assuming that the above is true, that you can't pass __device__ function pointers on the device, JIT-compilation provides another way. You can take users' Python code and compile an entry point around it, rather than passing it as a function. The effect would be the same: users get to pass user-defined functions to a functor. The difference is that the user's function actually gets inlined into the framework, compiled fresh every time. I'll give an example if it's not clear what I mean.

Jim






Antonio Augusto Alves Junior

Jun 12, 2017, 2:29:04 AM
to Numba Public Discussion - Public
Hi Jim,

On Sunday, June 11, 2017 at 6:52:21 PM UTC+2, Jim Pivarski wrote:
The part of the documentation you quoted didn't say that you can't execute a __device__ function pointer on the device, but it didn't say you can, either.

I can confirm we can use recursion in __device__ functions. Usually one will get a warning like this:

nvlink warning : Stack size for entry function 'blablabla...' cannot be statically determined.

and if someone is not careful deploying recursion, the stack can explode. The stack size per thread can be resized manually at run time.
In Hydra, I completely avoided using recursive functions because, in my opinion, the performance penalty and the unpredictability they introduce are a sign of
poor design in the context of GPU programming.
 



I was under the impression that PTX bytecode can't have a function call stack that grows without bound at runtime, which is fundamentally unlike x86 bytecode. This is what I have heard— I'm not an expert. Do you have evidence, independent of Numba, that you can pass and execute __device__ function pointers on the device? Have you seen a working example in pure CUDA code— no Numba? Because if you haven't, then what you're trying to do in Numba might be impossible for the hardware, and it's not really a Numba problem.

Actually, studying the NVIDIA programming guide/model, I got the impression that we cannot dynamically link a __device__ function the same way we do with ordinary functions implemented in shared libraries.
We can load a __global__ function at run time by launching a kernel, though.
I still need to understand whether it is possible to somehow merge/link PTX files, so that I could compile Hydra to PTX and edit or merge the PTX files provided by Numba/llvmlite to get the user's Python typed functions.
From observing the way Numba works, I have the impression this is already being done somewhere, because we can JIT-compile device functions and call them from kernels compiled later.
It is also possible to inline PTX in a CUDA program, as explained here: http://docs.nvidia.com/cuda/inline-ptx-assembly/index.html
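
For example, a sketch of getting the PTX text for a device function out of Numba (assuming a Numba version that provides numba.cuda.compile_ptx):

from numba import cuda, float64

def add(a, b):
    return a + b

# compile the Python function as a device function; returns the PTX source
# and the inferred return type, which an external toolchain could then
# merge/link with Hydra's own PTX
ptx, resty = cuda.compile_ptx(add, (float64, float64), device=True)
print(ptx)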


Now, assuming that the above is true, that you can't pass __device__ function pointers on the device, JIT-compilation provides another way. You can take users' Python code and compile an entry point around it, rather than passing it as a function. The effect would be the same: users get to pass user-defined functions to a functor. The difference is that the user's function actually gets inlined into the framework, compiled fresh every time. I'll give an example if it's not clear what I mean.

I think I understand what you are suggesting, but having an example would be great. Could you please include in the example some interaction with a C/C++-coded routine?

Thank you very much Jim!
A.A.







Jim Pivarski

Jun 12, 2017, 3:13:06 AM
to numba...@continuum.io
On Mon, Jun 12, 2017 at 1:29 AM, Antonio Augusto Alves Junior <aalv...@gmail.com> wrote:
I think I understand what you are suggesting, but having an example would be great. Could you please include in the example some interaction with a C/C++-coded routine?

Here's pseudocode of what I mean. Let's say you're implementing a GPU-based function minimizer that takes a user-defined function called "objective" and finds a local (hopefully global!) minimum of it. It does this by running "objective" at several points in parameter space, computing a Hessian, predicting a new minimum, and iterating until converged.

In CUDA, we define all of the utility functions, such as "computeHessian", "predictMinimum," and "isConverged."

We let the user write "objective," and then generate the following code around it:

@numba.jit
def doEverything(startingParams):
    params = startingParams
    while not isConverged():
        results = {}
        for p in params:
            results[p] = objective(p)     # since we're auto-generating this, might as well include objective inline
        h = computeHessian(results)
        params = predictMinimum(h)
    return params

When the user writes another objective function, you wrap that in another doEverything loop, making no attempt to reuse the old one. The most expensive parts to compile are the "computeHessian," "predictMinimum," and "isConverged," which are compiled once in your library— you just link them here. But you see that we've inverted the linking relationship: instead of the library code having to link against user code that will exist in the future, the user code (including the auto-generated wrapper "doEverything" you build around it) has to link against your library, which is the normal direction for dependencies.

(I didn't say how Numba finds the C++ functions. For the CPU, it automatically recognizes ctypes functions and does the right thing; for the GPU, I don't know, but there must be a way. PyCUDA?)
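
For the CPU side, the ctypes route looks roughly like this (the library and function names are made-up placeholders for whatever Hydra would expose):

import ctypes
from numba import njit

# hypothetical shared library exposing: double compute_hessian_entry(double)
lib = ctypes.CDLL("./libhydra_utils.so")
lib.compute_hessian_entry.restype = ctypes.c_double
lib.compute_hessian_entry.argtypes = [ctypes.c_double]

@njit
def call_native(x):
    # Numba recognizes ctypes function objects and calls them directly from compiled code
    return lib.compute_hessian_entry(x)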

As far as the user is concerned, the interface is exactly the same: they provide a function and get its minimum. The implementation is completely different, however: auto-generated code (metaprogramming) is an alternative to function pointers. In the Numba world, JIT/generated code/metaprogramming is the normal way to do things. Looking for function pointers is going against the grain.
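
As a toy sketch of that inversion on the CPU target (the "minimizer" below is a deliberately naive fixed-step search; the point is only that the user's objective gets jitted and closed over by a freshly generated driver, so it is effectively inlined rather than passed through a pointer):

import numba

def make_minimizer(objective):
    # compile the user's Python objective; the driver closes over it
    compiled_objective = numba.njit(objective)

    @numba.njit
    def drive(x, steps):
        best = x
        for _ in range(steps):
            # the call to compiled_objective is resolved at compile time,
            # so the user code is baked into this generated driver
            if compiled_objective(best - 0.1) < compiled_objective(best):
                best = best - 0.1
        return best

    return drive

minimize = make_minimizer(lambda x: (x - 3.0) ** 2)
print(minimize(10.0, 100))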

-- Jim
