[Numpy-discussion] Looking for people interested in helping with Python compiler to LLVM


Travis Oliphant

Mar 11, 2012, 1:35:36 AM
to Discussion of Numerical Python
Hey all, 

I gave a lightning talk this morning on numba, which is the start of a Python compiler to machine code through the LLVM tool-chain.   It is at the proof-of-concept stage only (use it only if you are interested in helping develop the code).   The only thing that works is a fast-vectorize capability on a few functions (without for-loops).   But it shows how functions created in Python can be used by the NumPy runtime in various ways.   Several NEPs that will be discussed in the coming months will use this concept.

Right now there is very little design documentation, but I will be adding some in the days ahead, especially if I get people who are interested in collaborating on the project.   I did talk to Fijal and Alex of the PyPy project at PyCon, and they both graciously suggested that, for inspiration, I look at some of the PyPy code which walks the byte-code and translates it to their intermediate representation.

Again, the code is not ready for use; it is only a proof of concept. But I would like to get feedback and help, especially from people who might have written compilers before.  The code lives at:   https://github.com/ContinuumIO/numba

-Travis



Mic

Mar 11, 2012, 3:59:00 AM
to Discussion of Numerical Python
What is the difference between this and http://www.python.org/dev/peps/pep-3146/ ?



xavier...@gmail.com

Mar 11, 2012, 10:12:22 AM
to Discussion of Numerical Python
It says: "Numba is a NumPy aware optimizing compiler for Python. It
uses the remarkable LLVM compiler infrastructure to compile Python
byte-code to machine code especially for use in the NumPy run-time and
SciPy modules."

If this description is correct, Numba is an additional pass that runs
once the bytecode has been produced by CPython.
Is that correct?
Is Python bytecode a good intermediate representation for performing
numpy-related optimizations?

One current big issue with numpy is that C = A + B + D produces temporaries.
numexpr addresses this issue, and it would be great to get the same
result by default in numpy.
numexpr also optimizes polynomials using Horner's method. It is hard to
do that at the bytecode level, isn't it?
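
For illustration, here is the kind of rewrite meant, in plain numpy (a
sketch of the transformation itself, not of how numexpr implements it):

    import numpy as np

    x = np.linspace(0.0, 1.0, 1000000)

    # naive form: each power and product allocates temporary arrays
    y1 = 3*x**3 + 2*x**2 + 5*x + 1

    # Horner form: same polynomial, fewer operations and temporaries
    y2 = ((3*x + 2)*x + 5)*x + 1

    assert np.allclose(y1, y2)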

Unladen Swallow wanted to replace the full CPython implementation with a
JIT compiler built using LLVM... but Unladen Swallow is dead.

Xavier


Pauli Virtanen

Mar 11, 2012, 10:52:40 AM
to numpy-di...@scipy.org
On 11.03.2012 15:12, xavier...@gmail.com wrote:
[clip]

> If this description is correct, Numba is an additional pass that runs
> once the bytecode has been produced by CPython.
> Is that correct?
> Is Python bytecode a good intermediate representation for performing
> numpy-related optimizations?
>
> One current big issue with numpy is that C = A + B + D produces temporaries.
> numexpr addresses this issue, and it would be great to get the same
> result by default in numpy.
> numexpr also optimizes polynomials using Horner's method. It is hard to
> do that at the bytecode level, isn't it?

My impression is that dealing with Python's bytecode is not necessarily
significantly harder than dealing with the AST.

Your example reads

1 0 LOAD_NAME 0 (A)
3 LOAD_NAME 1 (B)
6 BINARY_ADD
7 LOAD_NAME 2 (D)
10 BINARY_ADD
11 STORE_NAME 3 (C)

For instance, interpreting the bytecode (e.g. loop body) once with dummy
objects lets you know what the final compound expression is.
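
A minimal sketch of that idea in plain Python (using exec to "run" the
statement once over recording objects):

    class Sym:
        # a dummy object that records the expression built from it
        def __init__(self, expr):
            self.expr = expr
        def __add__(self, other):
            return Sym("(%s + %s)" % (self.expr, other.expr))

    # interpret the statement once with dummy operands to recover
    # the compound expression that was requested
    env = {name: Sym(name) for name in ("A", "B", "D")}
    exec("C = A + B + D", env)
    print(env["C"].expr)   # ((A + B) + D)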

> Unladen Swallow wanted to replace the full CPython implementation with a
> JIT compiler built using LLVM... but Unladen Swallow is dead.

To get speed gains, you need to optimize not only the bytecode
interpreter side, but also the object space --- Python classes, strings
and all that. Keeping in mind Python's dynamism, there are potential
side effects everywhere. I guess this is what sunk the swallow.

Just speeding up effectively statically typed code dealing with arrays
and basic types, on the other hand, sounds much easier.

The PyPy guys have a much more ambitious approach, and are getting nice
results. Also with arrays --- as I understand, the fact that they want
to be able to do this sort of optimization is the main reason why they
want to reimplement the core parts of Numpy in RPython.

The second issue is that their emulation of CPython's C-API
unfortunately seems to have quite large overheads at the moment. Porting
Numpy on top of that is possible --- I managed to get basic things
(apart from string/unicode arrays) to work, but things took very large
speed hits (of the order of ~100x for things like `arange(10000).sum()`).
This pushes the array sizes at which Numpy has a speed advantage up
quite high. The reason is probably that Numpy makes heavy internal use of
PyObjects, which accumulates the cost of passing objects through the
emulation layer.

--
Pauli Virtanen

xavier...@gmail.com

Mar 11, 2012, 12:22:17 PM
to Discussion of Numerical Python, Pauli Virtanen
On 03/11/2012 03:52 PM, Pauli Virtanen wrote:
> On 11.03.2012 15:12, xavier...@gmail.com wrote:
> [clip]
>> If this description is correct, Numba is an additional pass that runs
>> once the bytecode has been produced by CPython.
>> Is that correct?
>> Is Python bytecode a good intermediate representation for performing
>> numpy-related optimizations?
>>
>> One current big issue with numpy is that C = A + B + D produces temporaries.
>> numexpr addresses this issue, and it would be great to get the same
>> result by default in numpy.
>> numexpr also optimizes polynomials using Horner's method. It is hard to
>> do that at the bytecode level, isn't it?
> My impression is that dealing with Python's bytecode is not necessarily
> significantly harder than dealing with the AST.
>
> Your example reads
>
> 1 0 LOAD_NAME 0 (A)
> 3 LOAD_NAME 1 (B)
> 6 BINARY_ADD
> 7 LOAD_NAME 2 (D)
> 10 BINARY_ADD
> 11 STORE_NAME 3 (C)
>
> For instance, interpreting the bytecode (e.g. loop body) once with dummy
> objects lets you know what the final compound expression is.

ok.


>
>> Unladen Swallow wanted to replace the full CPython implementation with a
>> JIT compiler built using LLVM... but Unladen Swallow is dead.
> To get speed gains, you need to optimize not only the bytecode
> interpreter side, but also the object space --- Python classes, strings
> and all that. Keeping in mind Python's dynamism, there are potential
> side effects everywhere. I guess this is what sunk the swallow.

Correct: the impact of dynamism was underestimated, plus unexpected bugs
in LLVM --> dead swallow.

> Just speeding up effectively statically typed code dealing with arrays
> and basic types, on the other hand, sounds much easier.

it is :)

> The PyPy guys have a much more ambitious approach, and are getting nice
> results. Also with arrays --- as I understand, the fact that they want
> to be able to do this sort of optimization is the main reason why they
> want to reimplement the core parts of Numpy in RPython.

It sounds OK for numpy, but it makes no sense, because numpy is only the
beginning of the story.
Numpy is not *that* useful without scipy, matplotlib and so on.

> The second issue is that their emulation of CPython's C-API
> unfortunately seems to have quite large overheads at the moment. Porting
> Numpy on top of that is possible --- I managed to get basic things
> (apart from string/unicode arrays) to work, but things took very large
> speed hits (of the order of ~100x for things like `arange(10000).sum()`).
> This pushes the array sizes at which Numpy has a speed advantage up
> quite high. The reason is probably that Numpy makes heavy internal use of
> PyObjects, which accumulates the cost of passing objects through the
> emulation layer.

Yes. We have two types of users:
1) People using Python without numpy. They want the Python core
language to go faster (with the full complexity of a dynamic language).
2) Science users. They are fine with Python's speed because their
codes spend 99% of the time in
numpy/scipy/any_other_science_module_using_c-api. Who cares if it takes
1.1 s (in Python) instead of 1.0 s (in C) to read the input data file if it
takes hours (or even minutes) to process it (using the C-API, with
temporaries for the basics and unoptimized loops)?

Sturla Molden

Mar 11, 2012, 1:46:28 PM
to Discussion of Numerical Python, numpy-di...@scipy.org

On 11 March 2012 at 15:52, Pauli Virtanen <p...@iki.fi> wrote:

> To get speed gains, you need to optimize not only the bytecode
> interpreter side, but also the object space --- Python classes, strings
> and all that. Keeping in mind Python's dynamism, there are potential
> side effects everywhere. I guess this is what sunk the swallow.
>
> Just speeding up effectively statically typed code dealing with arrays
> and basic types, on the other hand, sounds much easier.

Yes, it would be like Psyco, except that it uses LLVM as the compiler and knows about NumPy.

Sturla

Travis Oliphant

Mar 11, 2012, 6:11:31 PM
to Discussion of Numerical Python
On Mar 11, 2012, at 1:59 AM, Mic wrote:

What is the difference between this and http://www.python.org/dev/peps/pep-3146/ ?


To fully expound the difference would take a lot of discussion.   But, summarizing: 

* Numba is not nearly as ambitious as Unladen Swallow (Numba will be fine with some user-directed information and with a subset of the language that gets compiled)
* Numba will focus on compilation rather than JITing --- in other words, it won't be trying to detect hot-spots and compile segments (LLVM is actually not a good candidate for that sort of thing, as it is not a very fast or memory-efficient compiler for a lot of reasons; I believe this is why PyPy no longer uses it, for example).
* Numba will be much closer to Cython in spirit than Unladen Swallow (or PyPy) --- people who just use Cython for a loop or two will be able to use Numba instead
* Numba will borrow the idea of call-sites from IronPython, wherein a function is replaced by a generic function that dispatches based on input types either to cached compiled code for the types specified or to the generic function (see the sketch after this list).
* Numba will be mainly about trying to leverage the code-generation of LLVM, which multiple hardware manufacturers are using (Intel's OpenCL support, Nvidia's PTX backend, Apple's Clang, etc.), for NumPy arrays
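
As a rough sketch of the call-site idea (hypothetical names, not numba's
actual API):

    def call_site(compile_for_types):
        # compile_for_types(func, argtypes) is a hypothetical hook that
        # returns compiled code for the given signature, or None
        def decorator(pyfunc):
            cache = {}
            def dispatcher(*args):
                key = tuple(type(a) for a in args)
                fast = cache.get(key)
                if fast is None:
                    # compile on first call with this signature; fall back
                    # to the generic Python function if compilation fails
                    fast = compile_for_types(pyfunc, key) or pyfunc
                    cache[key] = fast
                return fast(*args)
            return dispatcher
        return decorator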


-Travis





Sturla Molden

Mar 11, 2012, 7:01:26 PM
to numpy-di...@scipy.org
On 11.03.2012 15:52, Pauli Virtanen wrote:
> To get speed gains, you need to optimize not only the bytecode
> interpreter side, but also the object space --- Python classes,
> strings and all that. Keeping in mind Python's dynamism, there are
> potential side effects everywhere. I guess this is what sunk the swallow.

I'm not sure what scared off or killed the bird. Maybe they just
approached it from the wrong side.

Psyco has proven that some "algorithmic" Python code can be accelerated
by an order of magnitude or two. But Psyco did not use an optimizing
compiler like LLVM; it just generated unoptimized x86 binary code from
Python.

Also, a JIT for a dynamic language is possible. One example is the
Strongtalk JIT compiler for Smalltalk. It was purchased by Sun and
renamed "Java VM Hotspot", because Sun could not produce a JIT for its
static language Java. The static nature of Java does not have anything
to do with the performance of its JIT compiler.

Maybe there are differences between Python and Smalltalk that make the
latter easier to JIT compile? Or perhaps there are differences
between Hotspot/Strongtalk and LLVM that make the latter less efficient
for dynamic languages? I don't know. But making a performant JIT for all
of Python is obviously not easy.

A third example of a fast JIT for a dynamic language is LuaJIT. It can
often make "interpreted" and duck-typed Lua run faster than statically
compiled C. Like Psyco, LuaJIT just focuses on making algorithmic code
with a few elementary types run fast.

Another approach to speeding up dynamic languages is optional static
typing and static compilation. Cython, Bigloo (et al.), and CMUCL/SBCL
are excellent examples of that. Maybe one could use type annotations like
Cython's in pure Python mode? Python 3 even has type annotations in
the syntax.
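
For example (a sketch -- plain CPython just stores these strings in
__annotations__ and ignores them, but a compiler could read them):

    def dot(a: "double[:]", b: "double[:]") -> "double":
        # a compiler could map the annotations to C types;
        # uncompiled, this is an ordinary slow Python loop
        s = 0.0
        for i in range(len(a)):
            s += a[i] * b[i]
        return s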

I think (but I am not 100% sure) that the main problem with JIT'ing
Python is its dynamic attributes. So maybe some magic with __slots__ or
__metaclass__ could be used to turn that dynamicity off? Then a JIT like
Numba would only have to work with builtin types (int, float, list, dict,
set, tuple), NumPy arrays, and some less-dynamic classes. It would not
speed up all of Python, but a sufficient subset to make scientists happy.
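
For example, __slots__ already gives the kind of guarantee a JIT could
exploit:

    class Point(object):
        __slots__ = ("x", "y")   # fixed attributes, no per-instance __dict__
        def __init__(self, x, y):
            self.x = x
            self.y = y

    p = Point(1.0, 2.0)
    # p.z = 3.0 would raise AttributeError: the attribute set is closed,
    # so a compiler could lay instances out like a C struct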

Sturla

Sturla Molden

Mar 11, 2012, 7:21:27 PM
to Discussion of Numerical Python
On 11.03.2012 23:11, Travis Oliphant wrote:

* Numba will be much closer to Cython in spirit than Unladen Swallow (or PyPy) --- people who just use Cython for a loop or two will be able to use Numba instead

This is perhaps the most important issue for scientific and algorithmic codes.

Not having to resort to Cython, C, C++ or Fortran to get decent performance for algorithmic code would be great.

It could also put Python/Numba high up on the Debian shootout ;-)

Or one could benchmark a pure Python version of timsort against the standard library.

Sturla

Till Stensitzki

Mar 12, 2012, 12:57:33 PM
to numpy-di...@scipy.org
Doesn't Theano do the same, only via GCC compilation?

Olivier Delalleau

Mar 12, 2012, 1:29:10 PM
to Discussion of Numerical Python
One major difference is that Theano doesn't attempt to parse existing Python (byte)code: you need to explicitly code with the Theano syntax (which tries to be close to Numpy, but can end up looking quite different, especially if you want to control the program flow with loops and ifs for instance).
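
For instance, the C = A + B + D example from earlier in the thread would
look roughly like this in Theano (a sketch from memory -- see the Theano
docs for the details):

    import theano
    import theano.tensor as T

    a, b, d = T.dvectors("a", "b", "d")  # symbolic variables, not arrays
    c = a + b + d                        # builds a graph; nothing runs yet
    f = theano.function([a, b, d], c)    # compiles the graph, and can fuse
                                         # the adds to avoid temporaries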

A potentially interesting avenue would be to parse Python (byte)code to generate a Theano graph. It'd be nice if numba could output some intermediate information that would represent the computational graph being compiled, so that Theano could re-use it directly :) (probably much easier said than done though)

-=- Olivier

Pierre Haessig

Mar 12, 2012, 4:25:04 PM
to numpy-di...@scipy.org
Hi,

On 12/03/2012 00:21, Sturla Molden wrote:
>
> It could also put Python/Numba high up on the Debian shootout ;-)
Can you tell us a bit more about it? (I just didn't understand the whole
sentence, in fact ;-))

Thanks !
--
Pierre


Dag Sverre Seljebotn

Mar 13, 2012, 1:58:49 AM
to numpy-di...@scipy.org

Hi Travis,

Mark F. and I have been talking today about whether some numba and
Cython development could overlap -- not right away, but in the sense
that if Cython gets some features for optimization of numerical code,
it should be easy for numba to reuse that functionality.

This may be sort of off-topic re: the above-- but part of the goal of
this post is to figure out numba's intended scope. If there isn't an
overlap, that's good to know in itself.

Question 1: Did you look at Clyther and/or Copperhead? Though similar,
they target GPUs...but at first glance they look as though they may be
parsing Python bytecode to get their ASTs... (didn't check though)

Question 2: What kind of performance are you targeting -- in the short
term, and in the long term? Is competing with "Fortran-level"
performance a goal at all?

E.g., for ufunc computations with different iteration orders such
as "a + b.T" (a and b in C-order), one must do blocking to get good
performance. And when dealing with strided arrays, copying small chunks
at a time will sometimes help performance (and sometimes not).

These are optimization strategies which (as I understand it) are quite
beyond what NumPy iterators etc. can provide. And the LLVM level could
be too low -- one has quite a lot of information when generating the
ufunc/reduction/etc. that would be thrown away when generating LLVM
code. Vectorizing compilers do their best to reconstruct this
information; I know nothing about what actually exists here for
LLVM. They are certainly a lot more complicated to implement and work
with than making use of the higher-level information available before
code generation.

The idea we've been playing with is for Cython to define a limited
subset of its syntax tree (essentially the "GIL-less" subset) separate
from the rest of Cython, with a more well-defined API for optimization
passes etc., and targeted for a numerical optimization pipeline.

This subset would actually be pretty close to what numba needs to
compile, even if the overlap isn't perfect. So such a pipeline could
possibly be shared between Cython and numba, even if Cython would use
it at compile-time and numba at runtime, and even if the code
generation backend is different (the code generation backend is
probably not the hard part...). To be concrete, the idea is:

(Cython|numba) -> high-level numerical compiler and
loop-structure/blocking optimizer (by us on a shared parse tree
representation) -> (LLVM/C/OpenCL) -> low-level optimization (by the
respective compilers)

Some algorithms that could be shareable are iteration strategies
(already in NumPy though), blocking strategies, etc.

Even if this may be beyond numba's (and perhaps Cython's) current
ambition, it may be worth thinking about, if nothing else then just
for how Cython's code should be structured.

(Mark F., how does the above match how you feel about this?)

Dag

Travis Oliphant

Mar 13, 2012, 4:19:47 AM
to Discussion of Numerical Python
That would be very, very interesting.  


This may be sort of off-topic re: the above-- but part of the goal of
this post is to figure out numba's intended scope. If there isn't an
overlap, that's good to know in itself.

Question 1: Did you look at Clyther and/or Copperhead? Though similar,
they target GPUs...but at first glance they look as though they may be
parsing Python bytecode to get their ASTs... (didn't check though)

I have looked at both projects, although Clyther more in depth.    Clyther parses bytecode to get the AST (through a sub-project by the same author called Meta:  http://srossross.github.com/Meta/html/index.html).


Question 2: What kind of performance are you targeting -- in the short
term, and in the long term? Is competing with "Fortran-level"
performance a goal at all?

In the short-term, I'm targeting C-equivalent performance (like weave).   In the long-term, I'm targeting optimized high-level expressions (i.e. Fortran-level) with GPU and multi-core.


E.g., for ufunc computations with different iteration orders such
as "a + b.T" (a and b in C-order), one must do blocking to get good
performance. And when dealing with strided arrays, copying small chunks
at a time will sometimes help performance (and sometimes not).

These are optimization strategies which (as I understand it) are quite
beyond what NumPy iterators etc. can provide.



And the LLVM level could
be too low -- one has quite a lot of information when generating the
ufunc/reduction/etc. that would be thrown away when generating LLVM
code.

It doesn't need to be thrown away at all.   It could be used to generate appropriate code for the arrays being used.   The long-term idea is to actually be aware of NumPy arrays and encourage expression of high-level constructs which generate optimized code using chunking, blocking, AVX instructions, multiple threads, etc.  

To do this, it may make more sense to actually emit OpenMP (unless LLVM grows standard threading intrinsics).   This is not out of the question. 

Vectorizing compilers do their best to reconstruct this
information; I know nothing about what actually exists here for
LLVM. They are certainly a lot more complicated to implement and work
with than making use of the higher-level information available before
code generation.

The idea we've been playing with is for Cython to define a limited
subset of its syntax tree (essentially the "GIL-less" subset) separate
from the rest of Cython, with a more well-defined API for optimization
passes etc., and targeted for a numerical optimization pipeline.

This subset would actually be pretty close to what numba needs to
compile, even if the overlap isn't perfect. So such a pipeline could
possibly be shared between Cython and numba, even if Cython would use
it at compile-time and numba at runtime, and even if the code
generation backend is different (the code generation backend is
probably not the hard part...). To be concrete, the idea is:


(Cython|numba) -> high-level numerical compiler and
loop-structure/blocking optimizer (by us on a shared parse tree
representation) -> (LLVM/C/OpenCL) -> low-level optimization (by the
respective compilers)

Some algorithms that could be shareable are iteration strategies
(already in NumPy though), blocking strategies, etc.

Even if this may be beyond numba's (and perhaps Cython's) current
ambition, it may be worth thinking about, if nothing else then just
for how Cython's code should be structured.

This kind of collaboration would be very nice.  I agree, there might be some kind of intermediate representation that would be good for both projects.  

-Travis

mark florisson

Mar 13, 2012, 12:28:26 PM
to Discussion of Numerical Python

As for blocking, this could be done by the numpy iterators themselves,
by simply introducing more dimensions with appropriate shape and
strides (not saying that's a solution :).
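
Something along these lines with numpy's stride tricks (a sketch; no
data is copied):

    import numpy as np
    from numpy.lib.stride_tricks import as_strided

    a = np.arange(36.0).reshape(6, 6)
    b = 3  # block edge length (assumed here to divide the shape evenly)
    # view the (6, 6) array as a (2, 2) grid of (3, 3) blocks by adding
    # two dimensions with block-sized strides
    blocks = as_strided(a,
                        shape=(a.shape[0] // b, a.shape[1] // b, b, b),
                        strides=(a.strides[0] * b, a.strides[1] * b,
                                 a.strides[0], a.strides[1]))
    assert blocks[1, 1, 0, 0] == a[3, 3]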

>
>
> And the LLVM level could
> be too low -- one has quite a lot of information when generating the
> ufunc/reduction/etc. that would be thrown away when generating LLVM
> code.
>
>
> It doesn't need to be thrown away at all.   It could be used to generate
> appropriate code for the arrays being used.   The long-term idea is to
> actually be aware of NumPy arrays and encourage expression of high-level
> constructs which generate optimized code using chunking, blocking, AVX
> instructions, multiple threads, etc.
>
> To do this, it may make more sense to actually emit OpenMP (unless LLVM
> grows standard threading intrinsics).   This is not out of the question.

That would be interesting. My experience with OpenMP is that the
standard (ironically enough) doesn't define the use of OpenMP in the
context of threading, and indeed, trying to use OpenMP outside of the
main thread simply segfaults your program. If llvm were to get such
features, one must be prepared to make the OpenMP runtime thread-safe
as well (hopefully it will be in the first place, as I believe
Intel's implementation is).

I would like collaboration, but from a technical perspective I think
this would be much more involved than just dumping the AST to an IR
and generating some code from there. For vector expressions I think
sharing code would be more feasible than arbitrary (parallel) loops,
etc. Cython as a compiler can make many decisions that a Python
(bytecode) compiler can't make (at least without annotations and a
well-defined subset of the language (not so much the syntax as the
semantics)). I think in numba, if parallelism is to be supported, you
will want a prange-like construct, as proving independence between
iterations can be very hard to near impossible for a compiler.

As for code generation, I'm not sure how llvm would do things like
slicing arrays, reshaping, resizing etc (for vector expressions you
can first evaluate all slicing and indexing operations and then
compile the remaining vector expression), but for loops and array
reassignment within loops this would have to invoke the actual slicing
code from the llvm code (I presume). There are many other things, like
bounds checking, wraparound, etc, that are all supported in both numpy
and Cython, but going through an llvm layer would, as far as I can see,
require re-implementing those, at least if you want top-notch
performance. Personally, I think for non-trivial performance-critical
code (for loops with indexing, slicing, function calls, etc) Cython is
a better target.

So for vector expressions I think Cython and Numba could work together
by specifying AST transformations that operate on vector expressions.
For the purposes of Cython it would go from the Cython AST to the IR
and after transformations either back to the Cython AST, or directly
to llvm. For Cython, going from that code to llvm is not necessarily
more useful than C and OpenCL, as you will know the types anyway at
compile time and you can immediately exploit multicore as well as SIMD
parallelism. In the face of blocking and chunking etc, certain
specializations may be created in advance for Cython, or it could even
generate a C version (+ openmp + auto-vectorization appeasing
pragmas), an OpenCL version for the CPU and possibly a different one
for the GPU, and a numba + numba IR version, i.e. feed the IR at
runtime to numba and have it compile to llvm. If the compiler
additionally fuses vector expressions together, this will be even more
powerful.

Finally, as for non-vector-expression code, I really believe Cython is
a better target. cython.inline can have high overhead (at least the
first time it has to compile), but with better (numpy-aware) type
inference or profile guided optimizations (see recent threads on the
cython-dev mailing list), in addition to things like prange, I
personally believe Cython targets most of the use cases where numba
would be able to generate performing code.

Travis Oliphant

Mar 13, 2012, 1:18:19 PM
to Discussion of Numerical Python
>>
>> (Mark F., how does the above match how you feel about this?)
>
> I would like collaboration, but from a technical perspective I think
> this would be much more involved than just dumping the AST to an IR
> and generating some code from there. For vector expressions I think
> sharing code would be more feasible than arbitrary (parallel) loops,
> etc. Cython as a compiler can make many decisions that a Python
> (bytecode) compiler can't make (at least without annotations and a
> well-defined subset of the language (not so much the syntax as the
> semantics)). I think in numba, if parallelism is to be supported, you
> will want a prange-like construct, as proving independence between
> iterations can be very hard to near impossible for a compiler.

I completely agree that you have to define some kind of syntax to get parallelism. But, a prange construct would not be out of the question, of course.

>
> As for code generation, I'm not sure how llvm would do things like
> slicing arrays, reshaping, resizing etc (for vector expressions you
> can first evaluate all slicing and indexing operations and then
> compile the remaining vector expression), but for loops and array
> reassignment within loops this would have to invoke the actual slicing
> code from the llvm code (I presume).

There could be some analysis on the byte-code, prior to emitting the llvm code, in order to handle lots of things. Basically, you have to "play" the byte-code on a simple machine anyway in order to emit the correct code. The big thing about Cython is that you have to typedef too many things that are really quite knowable from the code. If Cython could improve its type inference, then it would be a more suitable target.

> There are many other things, like
> bounds checking, wraparound, etc, that are all supported in both numpy
> and Cython, but going through an llvm layer would, as far as I can see,
> require re-implementing those, at least if you want top-notch
> performance. Personally, I think for non-trivial performance-critical
> code (for loops with indexing, slicing, function calls, etc) Cython is
> a better target.

With libclang it is really quite possible to imagine a cython -> C target that itself compiles to llvm so that you can do everything at that intermediate layer. However, LLVM is a much better layer for optimization than C now that there are a lot of people collaborating on that layer. I actually think it would be great if Cython targeted LLVM instead of C.

>
> Finally, as for non-vector-expression code, I really believe Cython is
> a better target. cython.inline can have high overhead (at least the
> first time it has to compile), but with better (numpy-aware) type
> inference or profile guided optimizations (see recent threads on the
> cython-dev mailing list), in addition to things like prange, I
> personally believe Cython targets most of the use cases where numba
> would be able to generate performing code.

Cython and Numba certainly overlap. However, Cython requires:

1) learning another language
2) creating an extension module --- loading bit-code files and dynamically executing (even on a different machine from the one that initially created them) can be a powerful alternative for run-time compilation and distribution of code.

These aren't show-stoppers obviously. But, I think some users would prefer an even simpler approach to getting fast code than Cython (which currently doesn't do enough type inference and requires building a dlopen extension module).

-Travis

Nathaniel Smith

Mar 13, 2012, 2:39:01 PM
to Discussion of Numerical Python
On Tue, Mar 13, 2012 at 5:18 PM, Travis Oliphant <tra...@continuum.io> wrote:
> Cython and Numba certainly overlap.  However, Cython requires:
>
>        1) learning another language

So is the goal for numba to actually handle arbitrary Python code with
correct semantics, i.e., it's actually a compiled implementation of
Python-the-language? (I feel like most of where Cython-the-language
differs from Python-the-language is that Cython adds extensions for
stuff where getting speed out of Python-the-language would just be too
hard. Dynamic type inference for numpy arrays is definitely a good
start, but you can't even, say, promote a Python integer to a C
integer without changing semantics...)
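
For example, with numpy's int64 standing in for a C integer (the
promotion is observable):

    import numpy as np

    x = 2 ** 62
    print(x * 4)             # 18446744073709551616: Python ints just grow

    y = np.int64(2) ** 62
    print(y * np.int64(4))   # 0 -- wraps modulo 2**64 (numpy may warn):
                             # C-style fixed-width semantics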

>        2) creating an extension module --- loading bit-code files and dynamically executing (even on a different machine from the one that initially created them) can be a powerful alternative for run-time compilation and distribution of code.

Totally agreed on this point, the workflow for Cython could definitely
be smoother.

-- Nathaniel

Ilan Schnell

Mar 13, 2012, 4:47:10 PM
to Discussion of Numerical Python
As far as I understand, the goal is not to handle arbitrary
Python code, because this would become too difficult and is
not necessary when you have a simple math-oriented function
which you want to speed up. The idea is to create something
similar to http://www.enthought.com/~ischnell/paper.html which
only handles some restricted Python code, but using LLVM
(instead of C) as the target language.

- Ilan

mark florisson

Mar 20, 2012, 1:49:09 PM
to Discussion of Numerical Python, Dag Sverre Seljebotn, Travis Oliphant, Francesc Alted

Dag and I have been discussing this at PyCon, and here is my take on
it (at this moment :).

Definitely, if you can avoid Cython then that is easier and more
desirable in many ways. So perhaps we can create a third project
called X (I'm not very creative, maybe ArrayExprOpt), that takes an
abstract syntax tree in a rather simple form, performs code
optimizations such as rewriting loops with array accesses to vector
expressions, fusing vector expressions and loops, etc, and spits out a
transformed AST containing these optimizations. If runtime information
such as actual shape and stride information is given, the
transformations could figure out there and then whether to do things
like collapsing, axes swapping, blocking (as in, introducing more axes
or loops to retain discontiguous blocks in the cache), blocked memory
copies to contiguous chunks, etc. The AST could then also say whether
the final expressions are vectorizable. Part of this functionality is
already in numpy's nditer, except that this would be implicit and do
more (and hopefully with minimal overhead).

So numba, Cython and maybe numexpr could use the functionality, simply
by building the AST from Python and converting back (if necessary) to
its own AST. As such, the AST optimizer would be only part of any
(runtime) compiler's pipeline, and it should be very flexible to
retain any information (metadata regarding actual types, control flow
information, etc) provided by the original AST. It would not do
control flow analysis, type inference or promotion, etc, but only deal
with abstract types like integers, reals and arrays (C, Fortran or
partly contiguous or strided). It would not deal with objects, but
would allow to insert nodes like UnreorderableNode and SideEffectNode
wrapping parts of the original AST. In short, it should be as easy as
possible to convert from an original AST to this project's AST and
back again afterwards.

As the project matures many optimizations may be added that deal with
all sorts of loop restructuring and ways to efficiently utilize the
cache as well as enable vectorization and possibly parallelism.
Perhaps it could even generate a different AST depending on whether
execution targets the CPU or the GPU (with optionally available
information such as cache sizes, GPU shared/local memory sizes, etc).

Seeing that this would be a part of my master dissertation, my
supervisor would require me to write the code, so at least until
August I think I would have to write (at least the bulk of) this.
Otherwise I can also make other parts of my dissertation's project
more prominent to make up for it. Anyway, my question is, is there
interest from at least the numba and numexpr projects (if code can be
transformed into vector operations, it makes sense to use numexpr for
that, I'm not sure what numba's interest is in that).

Olivier Delalleau

Mar 20, 2012, 1:58:44 PM
to Discussion of Numerical Python
This sounds a lot like Theano; did you look into it?

-=- Olivier

Francesc Alted

Mar 20, 2012, 2:24:26 PM
to mark florisson, Discussion of Numerical Python

I think this is a very interesting project, and certainly projects like numba can benefit from it. So, in order for us to have an idea of what you are after, can we assume that your project (call it X) would be kind of a compiler optimizer, and that the produced, optimized code could be fed into numba for optimized LLVM code generation (which, in turn, can run on top of CPUs or GPUs or a combination)? Is that correct?

Given that my interpretation above is correct, it is a bit more difficult for me to see how your X project could be of benefit for numexpr. In fact, I actually see this the other way round: once the optimizer has discovered the vectorizable parts, go one step further and generate code that uses numexpr automatically (call this vectorization through numexpr). Is this what you mean, or am I missing something?

> As the project matures many optimizations may be added that deal with
> all sorts of loop restructuring and ways to efficiently utilize the
> cache as well as enable vectorization and possibly parallelism.
> Perhaps it could even generate a different AST depending on whether
> execution targets the CPU or the GPU (with optionally available
> information such as cache sizes, GPU shared/local memory sizes, etc).
>
> Seeing that this would be a part of my master dissertation, my
> supervisor would require me to write the code, so at least until
> August I think I would have to write (at least the bulk of) this.
> Otherwise I can also make other parts of my dissertation's project
> more prominent to make up for it. Anyway, my question is, is there
> interest from at least the numba and numexpr projects (if code can be
> transformed into vector operations, it makes sense to use numexpr for
> that, I'm not sure what numba's interest is in that).

I'm definitely interested for the numexpr part. It is just that I'm still struggling to see the big picture on this. But the general idea is really appealing.

Thanks,

-- Francesc Alted

Dag Sverre Seljebotn

Mar 20, 2012, 3:40:58 PM
to Olivier Delalleau, Discussion of Numerical Python
We talked some about Theano. There are some differences in project goals which mean that it makes sense to make this a separate project: Cython wants to use this to generate C code up front from the Cython AST at compilation time; numba also has a different frontend (parsing of python bytecode) and a different backend (LLVM).

However, it may very well be possible that Theano could be refactored so that the more essential algorithms working on the syntax tree could be pulled out and shared with cython and numba. Then the question is whether the core of Theano is smart enough to compete with Fortran compilers and support arbitrarily strided inputs optimally. Otherwise one might as well start from scratch. I'll leave that for Mark to figure out...

Dag
--
Sent from my Android phone with K-9 Mail. Please excuse my brevity.

Dag Sverre Seljebotn

Mar 20, 2012, 3:44:17 PM
to numpy-di...@scipy.org
Sorry, forgot to CC the list on this. Lines starting with a single greater-than are mine.

--
Sent from my Android phone with K-9 Mail. Please excuse my brevity.

Dag Sverre Seljebotn <d.s.se...@astro.uio.no> wrote:


Francesc Alted <fran...@continuum.io> wrote:

>On Mar 20, 2012, at 12:49 PM, mark florisson wrote:
>I think this is a very interesting project, and certainly projects like
>numba can benefit from it. So, in order for us to have an idea of what you
>are after, can we assume that your project (call it X) would be kind of
>a compiler optimizer, and that the produced, optimized code could be
>fed into numba for optimized LLVM code generation (which, in turn,
>can run on top of CPUs or GPUs or a combination)? Is that correct?

I think so. Another way of thinking about it is that it is a reimplementation of the logic in the (good and closed source) Fortran 90 compilers, in a reusable component for inclusion in various compilers.

Various C++ metaprogramming libraries (like Blitz++) are similar too.


>
>Given that my interpretation above is correct, it is a bit more
>difficult for me to see how your X project could be of benefit for
>numexpr. In fact, I actually see this the other way round: once the
>optimizer has discovered the vectorizable parts, go one step
>further and generate code that uses numexpr automatically (call this
>vectorization through numexpr). Is this what you mean, or am I missing
>something?

No. I think in some ways this is a competitor to numexpr -- you would gut out the middle of numexpr and keep the frontend and backend, but use this to optimize iteration order and blocking strategies.

I think the goal is for higher performance than what I understand numexpr can provide (in some cases, not all!). For instance, can numexpr deal well with

a + a.T

where a is a c-contiguous array? Any numpy-like iteration order will not work well; one needs to use higher-dimensional (e.g. 2D) blocking, not 1D blocking.
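
For concreteness, the kind of 2D-blocked evaluation meant, sketched by
hand (block size picked arbitrarily):

    import numpy as np

    def add_transpose_blocked(a, bs=64):
        # evaluate a + a.T one (bs, bs) tile at a time, so that both the
        # row-major reads of a and the transposed reads stay in cache
        n = a.shape[0]
        out = np.empty_like(a)
        for i in range(0, n, bs):
            for j in range(0, n, bs):
                out[i:i+bs, j:j+bs] = a[i:i+bs, j:j+bs] + a[j:j+bs, i:i+bs].T
        return out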

(if numexpr can do this then great, the task might then reduce to refactoring numexpr so that cython and numba can use the same logic)

Dag


>
>> As the project matures many optimizations may be added that deal with
>> all sorts of loop restructuring and ways to efficiently utilize the
>> cache as well as enable vectorization and possibly parallelism.
>> Perhaps it could even generate a different AST depending on whether
>> execution targets the CPU or the GPU (with optionally available
>> information such as cache sizes, GPU shared/local memory sizes, etc).
>>
>> Seeing that this would be a part of my master dissertation, my
>> supervisor would require me to write the code, so at least until
>> August I think I would have to write (at least the bulk of) this.
>> Otherwise I can also make other parts of my dissertation's project
>> more prominent to make up for it. Anyway, my question is, is there
>> interest from at least the numba and numexpr projects (if code can be
>> transformed into vector operations, it makes sense to use numexpr for
>> that, I'm not sure what numba's interest is in that).
>
>I'm definitely interested for the numexpr part.  It is just that I'm
>still struggling to see the big picture on this. But the general idea
>is really appealing.
>
>Thanks,
>
>-- Francesc Alted

Olivier Delalleau

Mar 20, 2012, 3:49:18 PM
to Dag Sverre Seljebotn, Discussion of Numerical Python
I doubt Theano is already as smart as you'd want it to be right now; however, the core mechanisms are there to perform graph optimizations and move computations to the GPU. It may save time to start from there instead of starting all over from scratch. I'm not sure though, but it looks like it would be worth considering at least.

-=- Olivier

Francesc Alted

Mar 20, 2012, 3:56:04 PM
to Dag Sverre Seljebotn, Discussion of Numerical Python
On Mar 20, 2012, at 2:29 PM, Dag Sverre Seljebotn wrote:
> Francesc Alted <fran...@continuum.io> wrote:
>
> I think so. Another way of thinking about it is that it is a reimplementation of the logic in the (good and closed source) Fortran 90 compilers, in a reusable component for inclusion in various compilers.
>
> Various C++ metaprogramming libraries (like Blitz++) are similar too.

Aha, thanks.

>> Given that my interpretation above is correct, it is a bit more
>> difficult for me to see how your X project could be of benefit for
>> numexpr. In fact, I actually see this the other way round: once the
>> optimizer has discovered the vectorizable parts, go one step
>> further and generate code that uses numexpr automatically (call this
>> vectorization through numexpr). Is this what you mean, or am I missing
>> something?
>

> No. I think in some ways this is a competitor to numexpr -- you would gut out the middle of numexpr and keep the frontend and backend, but use this to optimize iteration order and blocking strategies.

I see. Yes, I can easily see Mark's project X + numba more as a competitor (killer?) to numexpr too.

>
> I think the goal is for higher performance than what I understand numexpr can provide (in some cases, not all!). For instance, can numexpr deal well with


>
> a + a.T
>
> where a is a c-contiguous array? Any numpy-like iteration order will not work well; one needs to use higher-dimensional (e.g. 2D) blocking, not 1D blocking.

No. numexpr cannot deal with the above problem efficiently. numexpr is about 1d blocking, so its approach is pretty naive (but effective for these 1d blocking tasks).

> (if numexpr can do this then great, the task might then reduce to refactoring numexpr so that cython and numba can use the same logic)

Well, now that I think about it, the virtual machine in the latest numexpr (2.0) is based on the new nditer iterator included in NumPy 1.6, so I'm wondering if, with a little more effort, the 2d blocking could be implemented. While I'm pretty sure this is possible, I don't know if it would really be worth the effort. Perhaps concentrating on things like numba + projectX, or just Theano, would be better targets. Perhaps Mark Wiebe (who contributed the new nditer-aware VM) could say a bit more about this.

Hmm, some good food for brains :)

Dag Sverre Seljebotn

Mar 21, 2012, 12:20:52 AM
to Francesc Alted, Discussion of Numerical Python

My guess is that the answer is that nditer works well in many
situations, but can be sub-optimal for some array shapes, particularly
if the arrays fit in cache.

Here's a contrived bad case (not sure if it is the worst): Consider

arr[::2, :]

where arr is C-contiguous with shape (n, 2), and let n be such that the
array fits in L2 cache :-)

The ::2 ensures that the array can't be flattened. Any nditer approach,
as I understand it, would need to execute the inner subroutine for each
row of 4 elements, and then invoke the iteration machinery, which I
imagine is quite slow compared to LLVM code generated specifically for
this array, which could even unroll the inner 4-element loop.
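
Something like the following shows the effect (a sketch; the chunks are
what nditer's external loop hands to the inner routine):

    import numpy as np

    n = 10000
    arr = np.empty((n, 2))
    view = arr[::2, :]   # doubled row stride, so it can't be flattened
    it = np.nditer(view, flags=["external_loop"])
    print({chunk.size for chunk in it})   # {2}: one tiny chunk per row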

If one really wants to compete with Fortran, one must take into account
that the scientific end-user may already be structuring the computation
in a cache-efficient way. It doesn't always help with good performance
"as long as arrays are >10 MB".

If the array is large, the memory bottleneck should make a lot of the
CPU overhead of nditer irrelevant even for worst-case arrays; though I'd
be curious to see benchmarks of how much overhead is left, and my
(rather unfounded) suspicion is that hard-wired compiled LLVM code for a
specific array should play better with the CPU's cache predictor and
better mask memory access latencies (?)

(Mark F., apropos benchmarks, let me know what kind of different
hardware you have access to; I could see if I can give you access to a
couple of boxes with different memory bus characteristics, i.e., Intel
vs. AMD and so on.)

Dag

Dag Sverre Seljebotn

Mar 21, 2012, 12:22:07 AM
to numpy-di...@scipy.org

Sorry: "2 elements", "2-element loop".

Dag

mark florisson

Mar 21, 2012, 9:12:04 AM
to Discussion of Numerical Python

Definitely. Another issue is that nditer, although it could perform
blocking if implemented, cannot look at the larger structure of the
memory access pattern. Often the elementwise match-up of elements is
not at the outermost level, and you may want to copy to contiguous
memory both to avoid cache conflicts within your blocks and to make
multiple blocks match up (consider for instance LU factorization or
matrix multiplication, although no one in their right mind would write
those by hand :). Unless you reuse the same data enough in your
expressions to warrant copying to gain cache reuse and possibly
vectorization, you're probably not going to perform copies.

mark florisson

Mar 21, 2012, 9:14:39 AM
to Discussion of Numerical Python
On 20 March 2012 20:49, Olivier Delalleau <sh...@keba.be> wrote:
> I doubt Theano is already as smart as you'd want it to be right now; however,
> the core mechanisms are there to perform graph optimizations and move
> computations to the GPU. It may save time to start from there instead of
> starting all over from scratch. I'm not sure though, but it looks like it
> would be worth considering at least.

Thanks for the suggestion, Olivier. As Dag said, we discussed it, and
indeed we (or I) should look a lot deeper into it, see what
components are reusable there, and discuss with the Theano community if
and how we can collaborate.

Mic

Mar 25, 2012, 11:44:41 PM
to Discussion of Numerical Python
How about:
* http://www.hotpy.org/
* http://pypy.org/numpydonate.html

Dag Sverre Seljebotn

Mar 25, 2012, 11:53:31 PM
to numpy-di...@scipy.org
On 03/25/2012 08:44 PM, Mic wrote:
> How about:
> * http://www.hotpy.org/

The front page says a 10x speedup. That's a bit short of the almost
1000x speedup required for numerical code (that is, for some examples
Python is thousands of times slower than C or Fortran).

Well -- I'm sure hotpy could get beyond 10x as well -- it's just to say
that they have probably not looked much at the case of numerical
computation.

> * http://pypy.org/numpydonate.html

Well, you can start with these...:

http://technicaldiscovery.blogspot.com/2011/10/thoughts-on-porting-numpy-to-pypy.html

http://blog.streamitive.com/2011/10/19/more-thoughts-on-arrays-in-pypy/
