ANN: interpolation.py (numba-accelerated interpolation library)


Pablo Winant

Jan 18, 2016, 6:05:23 PM
to Numba Public Discussion - Public

Hi everybody,

I’m happy to announce the first release of the interpolation.py library under the BSD license. It aims to become a fully JIT-compiled library for several interpolation methods that can be used inside Numba-accelerated loops. I reckon some people on this mailing list may find it useful. It is also, in a way, a tribute to the Numba developers and the amazing work they have done so far.

Interpolation.py currently supports Smolyak products of polynomials as well as more traditional splines on regularly spaced cartesian grids (multilinear and cubic). Features include linear extrapolation, computation of derivatives, and the ability to deal with vector-valued functions. In principle, everything works in any number of dimensions.
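
To make the scope concrete, here is the kind of operation the library targets, shown with scipy's (non-jittable) RegularGridInterpolator purely as a reference point:

    import numpy as np
    from scipy.interpolate import RegularGridInterpolator

    # Reference point (scipy, not interpolation.py): multilinear interpolation
    # of a function tabulated on a regular 2-d cartesian grid, evaluated at
    # many points at once.
    x = np.linspace(0.0, 1.0, 50)
    y = np.linspace(0.0, 1.0, 50)
    values = np.sin(x)[:, None] * np.cos(y)[None, :]   # f tabulated on the grid

    f = RegularGridInterpolator((x, y), values, method='linear')
    points = np.random.rand(10**6, 2)   # one million query points
    out = f(points)                     # shape (10**6,)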

The spline routines are produced by automatically generated code (tempita templates), which is then JIT-compiled by Numba. The resulting performance when evaluating splines at a very large number of points is very close to a former Cython implementation, and quite respectable compared with the competition. I had been keeping score across successive Numba versions, and dropped the Cython implementation once I saw that.
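
As a minimal sketch of that approach (illustrative, not the library's actual templates), one can render dimension-specific source with tempita and compile the result with Numba:

    import numpy as np
    import tempita
    import numba

    # Render source specialized for dimension d, then JIT the result.
    lines = [
        "def vector_sum(x):",
        "    acc = 0.0",
        "{{for i in range(d)}}",
        "    acc += x[{{i}}]",
        "{{endfor}}",
        "    return acc",
    ]
    src = tempita.Template("\n".join(lines)).substitute(d=3)
    ns = {}
    exec(src, ns)
    vector_sum_3 = numba.njit(ns['vector_sum'])

    vector_sum_3(np.array([1.0, 2.0, 3.0]))   # -> 6.0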

Several new developments of the library are closely related to recent features of Numba, and may actually be good showcases for them:

  • guvectorize: when evaluating splines at many points, it produces very fast evaluation, and the parallel target reduces execution time quite a bit. Judging from a simple proof of concept, this seems to work quite well (see the sketch after this list).
  • jitclasses: they would allow us to expose simple interpolator objects with no (or little) performance tradeoff. In principle this works, but I haven't yet found how to use them properly (so as to accommodate, for instance, attributes that can have a varying number of dimensions).
  • cuda.jit: spline interpolation sounds like a perfect fit for a GPU. Currently, the library generates Numba CUDA code, which works in principle but is painfully slow. The reasons are still under investigation (to be frank, I have no clue yet, and I currently don't have a GPU other than EC2/g2).
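
Here is a sketch of the guvectorize idea (illustrative, not the library's actual code): a 1-d multilinear interpolation kernel written for a single point, broadcast over an arbitrary array of points, with target='parallel' spreading the work across cores.

    import numpy as np
    from numba import guvectorize, float64

    # The '()' core dims let Numba broadcast over any array of points.
    @guvectorize([(float64[:], float64[:], float64, float64[:])],
                 '(k),(n),()->()', target='parallel')
    def eval_linear(bounds, values, point, out):
        a, b = bounds[0], bounds[1]
        n = values.shape[0]
        dx = (b - a) / (n - 1)
        s = (point - a) / dx
        i = min(max(int(s), 0), n - 2)   # clamping yields linear extrapolation
        t = s - i
        out[0] = (1 - t) * values[i] + t * values[i + 1]

    xgrid = np.linspace(0.0, 1.0, 101)
    vals = np.sin(xgrid)
    points = np.random.rand(10**7)
    result = eval_linear(np.array([0.0, 1.0]), vals, points)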

There is one missing feature which would be extremely useful: some robust way to generate function code and JIT it just in time (à la Julia's @generated). The optimized interpolation routines have many possible options (extrapolation type, interpolation type, …), and producing all combinations for all dimensions in advance is costly and inelegant.

There is still a lot of progress to be made, especially when it comes to documentation and testing. Any feedback is more than welcome.

Best regards,

Pablo

Stanley Seibert

Jan 18, 2016, 6:12:18 PM
to Numba Public Discussion - Public
Hi Pablo,

I'm very excited to see this! We're definitely interested in helping to figure out the issues you are seeing.

Regarding code generation: we've been trying to figure out good ways to support this kind of metaprogramming. Fortunately, the problem is orthogonal to Numba in some respects: however the function is generated, once it is compiled to Python bytecode, you can feed it to Numba. Internally, we use string substitution to create some functions on the fly, but this has always felt kind of crude. If people have suggestions for ways to do this sort of thing in Python, please let us know.
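
A minimal sketch of that string-substitution pattern (an assumed illustration, not Numba's internal code): build the source for the wanted specialization, exec it, then hand the function to Numba.

    import numba

    def make_prod(ndim):
        # Generate a product function of ndim arguments as source text.
        args = ", ".join("x%d" % k for k in range(ndim))
        body = " * ".join("x%d" % k for k in range(ndim))
        src = "def prod(%s):\n    return %s\n" % (args, body)
        ns = {}
        exec(src, ns)
        return numba.njit(ns["prod"])

    prod3 = make_prod(3)
    prod3(2.0, 3.0, 4.0)   # -> 24.0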
 



Pablo Winant

Jan 18, 2016, 6:44:18 PM
to Numba Public Discussion - Public
Cool!

Luckily, there aren't that many issues. I'll try to describe them on GitHub and get back to you when that's done.

As for code generation, what I had in mind was some kind of CPUOverload object. At the time input types become known, it would generate a function and JIT it, instead of just JITting it. The generation could be done by a user-specified function which would take the primary function and the types as arguments, and return an AST (or a string). Unfortunately, I don't know the current implementation well enough to suggest anything more concrete. But I'm willing to test whatever you come up with.

Pablo

Stanley Seibert

Jan 18, 2016, 6:47:43 PM
to Numba Public Discussion - Public
Yeah, I think the Julia approach makes a lot of sense. I think we will try making something like @generated (probably calling it @generate_jit to avoid confusion with Python generators), which would solve this problem for you and for several other users we've heard from recently.


Pablo Winant

Jan 18, 2016, 6:51:44 PM
to numba...@continuum.io
I'm glad to hear that!

jcr...@continuum.io

Jan 18, 2016, 9:23:33 PM
to Numba Public Discussion - Public
Here's a hacky implementation of `@generated` that I wrote up: https://gist.github.com/jcrist/fddcbb12f3c74748aea5. I'm sure there's a cleaner way to do this, but it works.
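
The idea behind such a decorator can be sketched like this (a simplified illustration, not the gist itself; the real feature would dispatch on Numba types rather than Python values):

    import numba

    # On the first call for a given combination of argument types, ask a
    # user-supplied generator for an implementation, JIT it, and cache it.
    def generated_jit_sketch(generator):
        cache = {}
        def wrapper(*args):
            key = tuple(type(a) for a in args)
            if key not in cache:
                cache[key] = numba.njit(generator(*args))
            return cache[key](*args)
        return wrapper

    @generated_jit_sketch
    def double(x):
        # pick an implementation based on the type of x
        if isinstance(x, complex):
            return lambda x: complex(2.0 * x.real, 2.0 * x.imag)
        return lambda x: 2 * x

    double(3.0)      # compiles and caches a float specialization
    double(1 + 2j)   # compiles and caches a complex specialization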

- Jim

Denis Akhiyarov

Jan 19, 2016, 1:25:40 AM
to Numba Public Discussion - Public
This is super cool! Are you planning something like the LinearNDInterpolator or Rbf interpolators available in the scipy library?

My impression was that a Smolyak grid requires points pre-generated on a special grid, which is sometimes not feasible.

Pablo Winant

Jan 19, 2016, 5:27:51 AM
to numba...@continuum.io
Thanks Denis. We haven't made precise plans for future developments, but we all have some personal interest in various kinds of algorithms (adaptive sparse grids and shape-preserving splines are two examples).
From what I remember, the LinearNDInterpolator in scipy is a Cython wrapper around the qhull library. So, while it is certainly possible to do the same in Numba, there isn't much added value (other than being able to call it from Numba-compiled code). At the same time, it should not be very hard, so it may still be a worthwhile effort.
As for RBF, I have no idea. Pairwise distances are a good example for testing the performance of compiled loops.
Do you want to open GitHub issues to discuss both?

Best,

Pablo

Pablo Winant

Jan 19, 2016, 7:25:17 AM
to numba...@continuum.io
Jim, I just tried your piece of code. This is brilliant! It does precisely what I was trying to do. Even the caching mechanism still works. I can't wait to have this as an official feature.

Best,

Pablo

Stanley Seibert

Jan 19, 2016, 9:33:35 AM
to Numba Public Discussion - Public
To follow up on another of your bullet points: can you point me toward the Numba CUDA code (and a suitable example of calling it) that is running too slowly?


Pablo Winant

Jan 19, 2016, 10:54:14 AM
to Numba Public Discussion - Public
Here is a gist: https://gist.github.com/albop/eaf622ca88847af46d8a. The kernel evaluating the spline is in `eval_cubic_cuda` (the generated one). It is imported and compiled in `speed_comparison_cuda.py`. The GPU part can be run independently of the library.
A few remarks:
- when I run the comparison on EC2/g2 (NVIDIA GRID), the GPU version is approximately two times slower than the CPU one
- there are conditional branches, which can be removed by assuming the points stay inside the grid; this didn't help. Note that Phi_x is set to 0 before being conditionally assigned, so that the type of Phi_x is known to the compiler (see the sketch after these remarks)
- I did run similar code with the conditional branches successfully on a GTX Titan (native float64), but that was two years ago. It may all be a float32 vs float64 problem, but I find that hard to accept
- the evaluation method takes two constant arrays as arguments. In regular Numba this is not necessary: one can leave the constants in the parent module when the function is compiled. This also works with the simulator, but I got an error when trying it on a real graphics card. That's a small bug I meant to report.
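
The initialization trick mentioned above looks roughly like this (a schematic fragment using plain njit; the same pattern applies in the CUDA kernel, and the coefficients here are illustrative, not the generated code):

    from numba import njit

    # Assigning Phi a float before the branches gives the compiler one
    # concrete type for it on every control-flow path.
    @njit
    def cubic_basis(t, A, B, C, D):
        Phi = 0.0
        if t < 0.0:
            # linear extrapolation below the grid: value and slope at t = 0
            Phi = C * t + D
        elif t > 1.0:
            # linear extrapolation above the grid: value and slope at t = 1
            Phi = (3*A + 2*B + C) * (t - 1.0) + (A + B + C + D)
        else:
            # cubic polynomial inside the grid
            Phi = ((A * t + B) * t + C) * t + D
        return Phi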

Pablo

Denis Akhiyarov

Jan 19, 2016, 8:09:30 PM
to Numba Public Discussion - Public
Actually, Rbf in scipy is not that memory-efficient, because it does not use a KDTree to look up nearest neighbors. This is an area that could be improved significantly.

https://github.com/EconForge/interpolation.py/issues/7

kirchen...@googlemail.com

Jan 21, 2016, 7:37:43 AM
to Numba Public Discussion - Public
Try this (see the modifications in the attached files).

CUDA: 0.0146556854248
Cubic (CPU): 0.102976298332

(K20Xm)

I also attached the nvprof results.

The main problem was that the block size was set to 1: jitted[(N, 1)]().
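
For anyone hitting the same issue, the generic shape of the fix looks like this (an illustrative kernel, not the attached files): launch with a sensible block size, since one thread per block leaves the GPU almost idle.

    import numpy as np
    from numba import cuda

    @cuda.jit
    def scale(out, x):
        # one thread per element, guarded against overrun
        i = cuda.grid(1)
        if i < x.shape[0]:
            out[i] = 2.0 * x[i]

    N = 10**6
    x = np.random.rand(N)
    out = np.empty_like(x)
    threads_per_block = 256
    blocks = (N + threads_per_block - 1) // threads_per_block
    scale[blocks, threads_per_block](out, x)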

Cheers,
Manuel
Attachments: test.py, eval_cubic_cuda.py, nvprof.txt

Pablo Winant

Jan 21, 2016, 10:26:11 AM
to numba...@continuum.io
Thank you very much Manuel. I tried your suggestions on EC2/g2 and got similar results (GPU four times faster than CPU, with single or double precision).
Out of curiosity: in this case, what information do you extract from nvprof.txt? I can read that 95% of the time is spent in the kernel, which is called 11 times, but that doesn't tell me whether it is memory-bound (which I suspect) or something else. How would you get that kind of information?

Pablo




Stanley Seibert

Jan 21, 2016, 2:55:44 PM
to Numba Public Discussion - Public
nvprof can collect a large number of interesting counters during the execution of GPU code, but it is easiest to use the GUI profiler interface to figure out whether a kernel is compute-bound or memory-bound. The Visual Profiler should work with Numba applications just like regular CUDA applications (although you'll notice the function names are heavily mangled, as Numba needs to generate a unique name for each combination of argument data types).


kirchen...@googlemail.com

Jan 21, 2016, 3:09:13 PM
to Numba Public Discussion - Public
Hey Pablo,

An easy way to profile everything (this can take some time) would be:

"nvprof --analysis-metrics --metrics all --output-profile output_file python python_file.py"

This will create an output file that you can open with the NVIDIA Visual Profiler ("nvvp").

From my experience using CUDA with Numba, the causes of low performance can be grouped, in increasing order of importance, like this:

kernel performance < device/host data transfer < API call latencies

i.e. API-related latencies (sometimes maybe even caused by Numba) are the main cause of poor performance, provided you already manage device/host data transfers reasonably well.

Poor kernel performance is mainly caused by the kernel being memory-bound and (in my experience) by using too many registers.

Cheers,
Manuel