Hey everyone,
thought I'd share some things I've been experimenting with. I wanted to try numba (one of those magic Python packages that promise to turn your slow Python code into something C-fast), so as a test case I rewrote Andrew's pynbody SPH renderer in pure Python. You can see what it looks like in this gist, along with some sample timings:
https://gist.github.com/rokroskar/bdcf6c6b210ff0efc738
Numba is an open-source project, but I've got it installed as part of Continuum Analytics' Python distribution Anaconda (
https://store.continuum.io/cshop/anaconda/). NumbaPro, the "accelerated" version of numba, promises GPU support via CUDA-like code as well as OpenMP-like loop parallelism via a prange for loop. It's free for academics, but I could never get these bits to work, and the documentation for them is even worse than numba's.
You'll see that it uses mostly pythonic syntax, but there are some cases where numba blows up if you try to do too much with list comprehensions etc. It also doesn't know how to do abstract loops (i.e. for bla in some_list_of_stuff). It can occasionally be a pain to get working, because the errors it gives during the compilation stage are sometimes less than informative and the documentation basically boils down to a set of (short) examples. But when it works, it's pretty cool. As you can see from the timings, it's sometimes faster than the pynbody renderer. In case it isn't clear -- the numba part of the code is just the @autojit, @jit, and @vectorize decorators... yes, it does seem too good to be true.
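To make the "just add a decorator" point concrete, here's a toy example in the style numba handles well: explicit index loops and scalar math, no list comprehensions or abstract iteration. The try/except fallback is only so the snippet runs even where numba isn't installed (the function itself is my own illustration, not code from the gist):

```python
try:
    from numba import jit  # e.g. as shipped with Anaconda
except ImportError:
    # no numba? fall back to a do-nothing decorator -- same results, just slow
    def jit(func=None, **kwargs):
        if func is None:
            return lambda f: f
        return func

@jit
def kernel_weight_sum(x, h):
    """Brute-force sum of cubic-spline weights over all 1D particle pairs.

    Written with plain index loops over a flat sequence -- exactly the
    style numba compiles without complaint.
    """
    total = 0.0
    n = len(x)
    for i in range(n):
        for j in range(n):
            q = abs(x[i] - x[j]) / h
            if q < 1.0:
                total += 1.0 - 1.5 * q * q + 0.75 * q * q * q
            elif q < 2.0:
                total += 0.25 * (2.0 - q) ** 3
    return total
```

Removing the decorator changes nothing except speed, which means you can debug in plain Python first and only then turn the JIT back on.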
The second goal was to try and put this onto a GPU. After talking with Joachim Stadel about it, he pointed out that I should rewrite the algorithm to expose as much parallelism as possible by forcing the threads to do as much identical work as possible. So we came up with a scheme where the particles are sorted by how many pixels they need to draw on the image (basically by smoothing length).
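My reading of that scheme, as a plain-Python sketch (the function name and the exact footprint formula are my guesses, not what's in the gist): sort the particles by their pixel footprint, then group runs with identical footprints so every thread in a batch executes the same "template" of work.

```python
import math

def batch_by_footprint(h, pixel_size):
    """Group particle indices into batches of identical pixel footprint.

    h is a sequence of smoothing lengths; each particle covers roughly
    2*h across, so its footprint in pixels is taken as ceil(2*h/pixel_size)
    (an assumed formula -- the real code may round differently).
    Returns a list of (footprint_in_pixels, particle_indices) batches.
    """
    npix = [max(1, math.ceil(2.0 * hi / pixel_size)) for hi in h]
    order = sorted(range(len(h)), key=lambda i: npix[i])
    batches = []
    start = 0
    for k in range(1, len(order) + 1):
        # close a batch whenever the footprint changes (or at the end)
        if k == len(order) or npix[order[k]] != npix[order[start]]:
            batches.append((npix[order[start]], order[start:k]))
            start = k
    return batches
```

On a GPU, each batch then maps naturally onto a grid of threads that all paint the same number of pixels, which avoids the divergence you'd get from mixing tiny and huge particles in one launch.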
The resulting code is the second file under the same gist link. You can see some sample timings in the docstring at the top of the file -- for the Eris simulation (~40M dark matter particles), the template render code does ~6x better than the pynbody renderer, even including the very slow np.where and sort. I tried to put this on the GPU using the Anaconda Accelerate CUDA JIT, but it was very painful -- no debugger, few useful error messages. I gave up on the NumbaPro CUDA route and implemented it in CUDA C, but getting it to work correctly and efficiently still needs some more work... it's close, though. :)
Take-away points: numba is great when it works -- you can write fairly naive Python code and turn it into something relatively fast with little effort. It can take a bit of work reorganizing the code, but come on, with a single decorator you get fabulous speedups! NumbaPro, on the other hand, might seriously test your nerves; I don't think it's really ready for anything but experimentation, though maybe things will get better. They appear to have Nvidia behind the development now, so perhaps that will help things along.
The new renderer is pretty fast, and I'll try to find the time to write it in C and wrap it in Cython to make it more portable than a numba-only version. It also needs some quality control, since the results don't quite agree with pynbody at the moment; I think it's just a normalization somewhere that I missed. It should also be made multi-threaded, which will obviously speed things up even more. Would be cool if people tried it and helped find bugs...
Let me know if you end up trying any of this out!