[Common IR] Introductions


Andy Ray Terrel

Jul 2, 2013, 8:20:13 AM
to numf...@googlegroups.com, SciPy Users List, cython...@googlegroups.com, Andreas Kloeckner, Alex Rubinsteyn, James Bergstra, Siu Kwan Lam, mark florisson, Frédéric Bastien, Jon Riehl, Stephen Diehl, Travis Oliphant, Matthew Rocklin, Kurt Smith, numba...@continuum.io, Florian Rathgeber, Aaron Meurer, Ondřej Čertík, Jason Moore, Anthony Scopatz, Mark Wiebe, John Wiggins, Bryan Catanzaro
Hello all,

I'm emailing everyone who may be interested in this topic. I would
love to keep everything on one archival list, and since NumFOCUS is a
common place for many projects, I'm sending things there. Further
emails should drop all recipients except numf...@googlegroups.com; to
join the list, email numfocus+...@googlegroups.com with the word
Subscribe as the subject.

Subject
-----------

At SciPy2013, we had a discussion about creating a common intermediate
representation to support the wide array of code generation activities
going on in the Python ecosystem. My summary can be found at:

https://github.com/IgnitionProject/ignition/wiki/CodeGenComposability_Scipy2013

It is a wiki, feel free to edit.

Further
----------

I will be sending a number of emails to the NumFOCUS list containing
the various responses collected around the event. Please forward this
email to folks you think would be interested.

-- Andy

Andy Ray Terrel

Jul 2, 2013, 8:22:50 AM
to numf...@googlegroups.com
Mail on topic from Mark Florisson:

On Jun 30, 2013, at 2:18 PM, mark florisson <markflo...@gmail.com> wrote:
>
> Hi,
>
> In this mail I hope to get a discussion started for a common compiler
> component for the many different projects that we have that all need
> to perform some form of compilation. I've CCed a number of people I
> think will be interested, feel free to CC other interested parties.
>
> The project is called pykit (http://markflorisson88.github.io/pykit/)
> which hopes to address an ever growing range of compilation engines,
> all of which have oddities in what they optimize and support. Pykit is
> similar in some ways to VMKit, except it provides a higher-level layer
> than LLVM IR for compilation. It hopes to be reusable for:
>
> - theano: http://deeplearning.net/software/theano/
> - parakeet:
> https://www.usenix.org/system/files/conference/hotpar12/hotpar12-final37.pdf
> - pythran: http://pythonhosted.org/pythran/
> - blaze: http://blaze.pydata.org/
> - numbapro: http://docs.continuum.io/numbapro/
> - numba: numba.pydata.org
> - (and probably others)
>
> The idea with pykit is that it deals only with typed code in a
> representation amenable to optimization. It does not try to solve the
> typing problem for python code or enforce any programming model. All
> passes, analyses, optimizations and provided runtime functions are
> optional and can be overridden or ignored. By default it should
> support:
>
> - high-level data parallel operations like in parakeet or copperhead
> - arrays and loops
> - containers and memory allocation
> - threads
> - strings and unicode
> - partial functions, Python objects and sum types
> - dynamic variable binding (through cells to support closures as
> first-class values)
> - tasks/green threads
> - exceptions
>
> There's a lot to convey, so instead I've been working on
> documentation, which will hopefully answer most of the initial
> questions:
>
> Website: http://markflorisson88.github.io/pykit/ (this may move)
> Documentation: http://markflorisson88.github.io/pykit-doc/dev/
> Example optimizations:
> http://markflorisson88.github.io/pykit-doc/dev/opt_examples.html
>
> I recommend everyone who's interested to have a look, and see what you
> like or dislike, whether it addresses the needs of your respective
> project and if not, how it should be different. Feedback, discussion,
> criticism or outright contempt are all very welcome.
>
> I'm at Europython this week, so I apologize if my responses are
> delayed.
>
> Cheers,
>
> Mark
>

Andy Ray Terrel

Jul 2, 2013, 8:23:47 AM
to numf...@googlegroups.com
Response to Mark from Alex Rubinsteyn and James Bergstra:


(Adding Andy and Andreas)

I like the idea of using translation from Theano as a way to ground
the design decisions for both PyKit and AIR (the common array
intermediate representation we discussed at SciPy).

Questions for Mark: how would I encode array expressions such as
"A+B+C"? At what level is the data representation fixed -- does the
frontend have to manually distinguish between objects in "nopython"
mode (unboxed efficient representations) and PyObjects?

--Alex

On Mon, Jul 1, 2013 at 12:19 PM, James Bergstra
<james.b...@gmail.com> wrote:
(Adding Matthew Rocklin)

Thanks for the heads up Mark. I've read through the documentation, and
I think PyKit is indeed providing the kind of program representation
that I was hoping at least a few of our numerous projects could
standardize on.

My experience with Theano left me with the sense that maintaining a
front-end (e.g. Theano) is plenty of work to keep a typical
open-source team busy. The burden of developing and maintaining a
suite of optimizations and code generators for a variety of platforms
and architectures is quite a bit larger. Although standards sometimes
get in the way of creativity and innovation, I think in this case a
standard optimization/transformation path is necessary to get enough
hands on deck, to actually build something good enough to use.

I'm willing to put in a little time to experiment with PyKit as a
means of running e.g. Theano graphs; how would you recommend
approaching that?

One idea is that I could write some simple tests that do not
technically depend on Theano, but which replicate the core Theano data
structures in question, build expression graphs such as Theano would
build, and assert that a corresponding PyKit-compiled function does
the correct thing. With such tests written, we could have a more
grounded conversation about what it would take to make said tests
pass, and how a new compilation process might work.

- James

Andy Ray Terrel

Jul 2, 2013, 8:24:33 AM
to numf...@googlegroups.com
Response from Andreas Kloeckner:

Hi all,

Alex Rubinsteyn <alex.ru...@gmail.com> writes:
> (Adding Andy and Andreas)
>
> I like the idea of using translation from Theano as a way to ground the
> design decisions for both PyKit and AIR (the common array intermediate
> representation we discussed at SciPy).
>
> Questions for Mark: how would I encode array expressions such as "A+B+C"?
> At what level is the data representation fixed -- does the frontend have
> to manually distinguish between objects in "nopython" mode (unboxed
> efficient representations) and PyObjects?

Forgive me for barging in late, and thanks for bringing me into the
discussion.

I've got a few questions that I hope will clarify things that are
unclear to me and, perhaps, even help sharpen the focus of the
discussion:

- In my opinion, perhaps the most pressing problem is data format across
multiple devices. On the host, we have such a format in numpy
arrays. On GPUs, things are less clear. As long as everybody agrees on
how data is stored, there is no problem having a large diversity of
execution engines, and we can then let 'market forces' decide on a set
of winners. If there is no agreement, users are left in a world of
pain, at worst copying data around.

The good news is that this will get easier as the world moves towards
'unified' memory models where a host pointer is all you need--but
we're not quite there, and it'll take a few years to get there. Still,
this is worth keeping in mind as the goal scenario.

Has there been a discussion along these lines?

- If you want to store data, you need a type system. Numpy defines one,
PyKit defines one, Blaze defines another one...

- The word IR gets thrown around a lot, but that could mean a lot of
things:

- scalar/array-based?
- typed/untyped?
- how strong a program order?
- what model of parallelism?

Options here include

- PyKit input (array-based, typed?, strict program order, par?)
- C (scalar, typed, strict program order, par?)
- LLVM (scalar, typed, strict program order, par?)
- OpenCL SPIR (scalar, typed, strict program order, SMT parallel)
- Loopy (*) (scalar, loosely typed, loose program order, SMT parallel)

Maybe I'm missing additional characteristics, and I'm sure I'm missing options here.

- Another aspect is the runtime, aka the bit that takes some low-level
IR and actually executes it. Breadth here is good.

Options include:

- Ye olde weave-like compile-and-link-into-interpreter method
- LLVM JIT infrastructure
- OpenCL
- CUDA driver interface

Again, I've probably missed some here--feel free to fill in missing bits.

- In addition to the runtime, there's the decision of what Python
interface to use for the runtime. Disagreement here can once again
create friction for users.

I've also got some personal opinions, which I'll put up for comment here:

- Numpy's type system is fairly limited, but I feel like that might be a
good thing. By supporting full-blown pointers, for example, PyKit
already locks itself out of working with many aspects of OpenCL.
(buffer+offset is roughly equivalent to a pointer, but this quirk
needs to be reflected.)

- I tend to be suspicious of IRs that want to make decisions/optimize on
my behalf--such as by not explicitly representing parallelism. I
don't believe automatic parallelization is solved in a satisfactory
manner, or will be in the near future.

- I like OpenCL as a runtime standard. Why?

- Broad device support *now*, including CPUs, GPUs, MICs, and your
dog.

- JIT built into the standard.

- Free implementations available that are rapidly getting close to
being "good enough".

Sorry about the wall of text. :) I'd be happy to hear your comments.

Andreas

(*) I've been working on "loopy", which I'm hoping to
release pretty damn soon. Current (still crappy) docs:
http://iliketoast:withma...@documen.tician.de/loopy

PS: I'd be happy to host a mailing list for this discussion if you guys
think that's a good idea. (for archival, inclusiveness and so that we
don't have to drag around 15-long recipient lists)

Andy Ray Terrel

Jul 2, 2013, 8:25:17 AM
to numf...@googlegroups.com
Response from Serge Guelton:

Hi all and thanks Andreas for this Wall of text!

Before we dive into all the details, it seems to me we have to agree on
the goals. Why do we need a common representation, what would be the
benefits for all of us? If we don't have an answer to these questions,
we will likely fail to cooperate.

I've been working on the PIPS compiler infrastructure for 3+ years and
have kept working on compilers + parallelism since then, and from what
I have learnt, choosing a generic IR, if not a common IR, is a
difficult problem, as each IR tends to reflect our specific needs...

So what are we trying to share? I am interested in sharing high-level
transformations, including (but not limited to) loop unrolling, list
comprehension -> generator conversion, interprocedural constant folding,
and various analyses such as points-to, argument effects and the like.
Polyhedral transformations would be nice too (e.g. tiling, skewing,
interchange and the like). If someone can come up with an analysis
that would type the IR, that would be nice too.

What am I not interested in? Accelerator-specific IR does not appeal to
me at all, even if I understand the need of other projects.

A possibility, which is found in many compilers, is to have several IRs,
e.g.

                                           > C / C++ / whatever
                                          /
AST -> [stuff_0] -> PyKit -> [stuff_1] -> LLVM
                                          \
                                           > OpenCL / CUDA / whatever


It seems to me some projects are likely to benefit from sharing
stuff_0, some are likely to share stuff_1, and some both.

I expect most projects to take Python's AST as input (2.7 version?
3.x version?), so transformations that work directly on the AST are
beneficial for all of us. That's where I'd like to contribute, and
Pythran already has a pass management system to do this. Having more
passes available at this level would be nice.

That's it for the high-level stuff; any thoughts about this?

Andy Ray Terrel

Jul 2, 2013, 8:48:04 AM
to numf...@googlegroups.com
Hello Everyone,

Thanks for sending input on this problem. I'm still looking at the
internals of PyKit and other projects, but I thought I would take a
stab at a few of the questions asked.

Why share?
------------------

To me it is important to be able to compose different language
constructs. I build out domain-specific languages, and there is a lot
that is common between them. For example, linear algebra, of which I
have an implementation in Ignition and Matt Rocklin has one in his
various matrix algebra libraries. Sharing is a key step to being able
to compose different tools.

Another reason for this is to give folks a place to play with ideas
and take the best ones. For example the Theano folks might want to
just output a graph that Numba can use rather than being stuck in only
one project.


What to share?
-----------------------

I agree with having several IRs. My expression trees are always above
the Python AST. I have used SymPy and might switch to Pymbolic to
represent these trees since I can define domain specific optimizations
at that level and then I'm left with a fairly generic symbolic math
representation when I'm done. Having a nice symbolic mathematical
representation is essentially what my project Ignition was originally
designed to do, but hasn't had enough time to mature.

Another aspect of my generators is that they generate not general
programs but code targeting specific libraries. For example, I have
code that hooks into FEniCS, Proteus, and CLAWPack (all large validated
CFD codes). Code validation is huge in our field, and being able to
generate directly to their interfaces has been a must in my work.


-- Andy

mark florisson

Jul 2, 2013, 1:08:30 PM
to numf...@googlegroups.com
On 2 July 2013 13:23, Andy Ray Terrel <andy....@gmail.com> wrote:
> Response to Mark from Alex Rubinsteyn and James Bergstra:
>
>
> (Adding Andy and Andreas)
>
> I like the idea of using translation from Theano as a way to ground
> the design decisions for both PyKit and AIR (the common array
> intermediate representation we discussed at SciPy).
>
> Questions for Mark: how would I encode array expressions such as
> "A+B+C"? At what level is the data representation fixed -- does the
> frontend have to manually distinguish between objects in "nopython"
> mode (unboxed efficient representations) and PyObjects?

I was thinking that, since this is so common, we could use the
polymorphic add/mul etc. instructions, with a pass that rewrites them
into data-parallel map operations.

The representation is defined however you want it to be. There is no
"nopython" mode, such semantics are left for your front-end of choice.
The idea is to provide some support for objects, but it's minimal, and
objects are only used when you type something as an object. An array
is not considered an object, nor a list or dict. A provided runtime or
compiler pass could however represent an array as an object (e.g. an
ndarray) if so desired.

So let's say what comes in is A+B+C, which is expanded into two map
operators with two scalar functions. After fusion we get a map with
one scalar function (a+b+c) and three array operands (A, B, C). Now we
could have a scalarization pass that kicks in, which expands the map()
to a loop nest, with instructions like array_data, array_strides and
such (specialized to the data order if desired).
Now if we wanted, we could map our array data structures to "opaque"
pointers and operations like array_data to runtime calls.
Alternatively, and more likely, we would write a simple pass that
expands arrays into (pointers to) structures and loads the data
pointer and strides from there. Or maybe we avoid scalarization
entirely because some awesome library can do this much better already,
so we map to a simple runtime call.
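
To make the first two steps concrete, here is a toy sketch in Python.
The tuple-based IR and helper names are hypothetical, for illustration
only; this is not pykit's actual API:

    # Pass 1: rewrite polymorphic 'add' nodes into data-parallel maps.
    def add_to_map(expr):
        if isinstance(expr, tuple) and expr[0] == 'add':
            lhs, rhs = add_to_map(expr[1]), add_to_map(expr[2])
            return ('map', lambda a, b: a + b, [lhs, rhs])
        return expr

    # Pass 2: fuse a map whose operands are themselves maps into one
    # map with a combined scalar kernel over the union of the operands.
    def fuse(expr):
        if not (isinstance(expr, tuple) and expr[0] == 'map'):
            return expr
        _, kernel, operands = expr
        operands = [fuse(op) for op in operands]
        new_ops, getters = [], []
        for op in operands:
            if isinstance(op, tuple) and op[0] == 'map':
                _, k, inner = op
                s, n = len(new_ops), len(inner)
                new_ops.extend(inner)
                getters.append(lambda args, k=k, s=s, n=n: k(*args[s:s + n]))
            else:
                s = len(new_ops)
                new_ops.append(op)
                getters.append(lambda args, s=s: args[s])
        fused = lambda *args: kernel(*(g(args) for g in getters))
        return ('map', fused, new_ops)

    ir = fuse(add_to_map(('add', ('add', 'A', 'B'), 'C')))
    print(ir[2])                  # ['A', 'B', 'C']
    print(ir[1](1.0, 2.0, 3.0))   # 6.0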

There's a lot of specialization front-ends may want to do; to keep
things simple it may be easiest to write these as separate modules (in
Python or pykit IR, which is basically like specifying pykit assembly
in C with some syntax sugar on top) taking opaque pointers and marking
the functions inline.

Great, I think that's very useful. Right now pykit doesn't really do
anything, it's mostly specced out to some extent. The first pass will
probably not have any optimizations, just a working end-to-end
infrastructure. But this is very useful in determining the validity of
the approach.


mark florisson

Jul 2, 2013, 1:11:41 PM
to numf...@googlegroups.com
On 2 July 2013 13:24, Andy Ray Terrel <andy....@gmail.com> wrote:
> Response from Andreas Kloeckner:
>
> Hi all,
>
> Alex Rubinsteyn <alex.ru...@gmail.com> writes:
>> (Adding Andy and Andreas)
>>
>> I like the idea of using translation from Theano as a way to ground the
>> design decisions for both PyKit and AIR (the common array intermediate
>> representation we discussed at SciPy).
>>
>> Questions for Mark: how would I encode array expressions such as "A+B+C"?
>> At what level is the data representation fixed -- does the frontend have
>> to manually distinguish between objects in "nopython" mode (unboxed
>> efficient representations) and PyObjects?
>
> Forgive me for barging in late, and thanks for bringing me into the
> discussion.
>
> I've got a few questions I hope that will clarify things that are
> unclear to me, and, perhaps, even help sharpen the focus of the
> discussion:
>
> - In my opinion, perhaps the most pressing problem is data format across
> multiple devices. On the host, we have such a format in numpy
> arrays. On GPUs, things are less clear. As long as everybody agrees on
> how data is stored, there is no problem having a large diversity of
> execution engines, and we can then let 'market forces' decide on a set
> of winners. If there is no agreement, users are left in a world of
> pain, at worst copying data around.

I think there was one effort in that direction in compyte:
https://github.com/inducer/compyte/wiki . I don't know where that's
at.

> The good news is that this will get easier as the world moves towards
> 'unified' memory models where a host pointer is all you need--but
> we're not quite there, and it'll take a few years to get there. Still,
> this is worth keeping in mind as the goal scenario.
>
> Has there been a discussion along these lines?
>
> - If you want to store data, you need a type system. Numpy defines one,
> PyKit defines one, Blaze defines another one...
>
> - The word IR gets thrown around a lot, but that could mean a lot of
> things:
>
> - scalar/array-based?
> - typed/untyped?
> - how strong a program order?
> - what model of parallelism?
>
> Options here include
>
> - PyKit input (array-based, typed?, strict program order, par?)
> - C (scalar, typed, strict program order, par?)
> - LLVM (scalar, typed, strict program order, par?)
> - OpenCL SPIR (scalar, typed, strict program order, SMT parallel)
> - Loopy (*) (scalar, loosely typed, loose program order, SMT parallel)
>
> Maybe I'm missing additional characteristics, and I'm sure I'm missing options here.

The way I see it, there are several things that need to be solved. One
is the typing problem for Python code, which applies to only a subset
of the projects. Another is to compile array operations and scalar
code. For the latter I think you will want no ambiguity, i.e. you want
arrays, scalars, parallelism and explicit types (and optionally
inferred parallelism). Coercions are left to front-end semantics.

> - Another aspect is the runtime, aka the bit that takes some low-level
> IR and actually executes it. Breadth here is good.
>
> Options include:
>
> - Ye olde weave-like compile-and-link-into-interpreter method
> - LLVM JIT infrastructure
> - OpenCL
> - CUDA driver interface
>
> Again, I've probably missed some here--feel free to fill in missing bits.
>
> - In addition to the runtime, there's the decision of what Python
> interface to use for the runtime. Disagreement here can once again
> create friction for users.
>
> I've also got some personal opinions, which I'll put up for comment here:
>
> - Numpy's type system is fairly limited, but I feel like that might be a
> good thing. By supporting full-blown pointers, for example, PyKit
> already locks itself out of working with many aspects of OpenCL.
> (buffer+offset is roughly equivalent to a pointer, but this quirk
> needs to be reflected.)

I'm not sure that's true, the pointer support is there for front-ends
that want to use pointers. If your front-end doesn't, then you won't
use them. Lowering passes could introduce pointers (e.g. to pass
things around by reference), but such things can hopefully be
controlled in a clear way. I imagine it will have a "CPU" and "GPU"
set of pass configurations, with the GPU set having restrictions on
what you can do.

> - I tend to be suspicious of IRs that want to make decisions/optimize on
> my behalf--such as by not explicitly representing parallelism. I
> don't believe automatic parallelization is solved in a satisfactory
> manner, or will be in the near future.

I agree, parallelism should be explicit. The way this is
conceptualized in pykit is through metadata, which may be introduced
explicitly by a front-end, or through a compiler analysis pass.

> - I like OpenCL as a runtime standard. Why?
>
> - Broad device support *now*, including CPUs, GPUs, MICs, and your
> dog.
>
> - JIT built into the standard.
>
> - Free implementations available that are rapidly getting close to
> being "good enough".
>
> Sorry about the wall of text. :) I'd be happy to hear your comments.
>
> Andreas
>
> (*) I've been working on "loopy", which I'm hoping to
> release pretty damn soon. Current (still crappy) docs:
> http://iliketoast:withma...@documen.tician.de/loopy
>
> PS: I'd be happy to host a mailing list for this discussion if you guys
> think that's a good idea. (for archival, inclusiveness and so that we
> don't have to drag around 15-long recipient lists)
>

mark florisson

Jul 2, 2013, 1:19:15 PM
to numf...@googlegroups.com
On 2 July 2013 13:25, Andy Ray Terrel <andy....@gmail.com> wrote:
> Response from Serge Guelton:
>
> Hi all and thanks Andreas for this Wall of text!
>
> Before we dive into all the details, it seems to me we have to agree on
> the goals. Why do we need a common representation, what would be the
> benefits for all of us? If we don't have an answer to these questions,
> we will likely fail to cooperate.
>
> I've been working on the PIPS compiler infrastructure for 3+ years and
> keep working on the compiler + parallelism since then, and from what I
> have learnt, choosing a generic IR, if not a common IR, is a difficult
> problem, as each IR tends to reflect our specific needs...
>
> So what are we trying to share? I am interested in sharing high-level
> transformations, including (but not limited to) loop unrolling, list
> comprehension -> generator conversion, interprocedural constant unfolding
> and various analysis such has points-to, argument effects and the like.
> Polyhedral transformations would be nice too (e.g tiling, skewing,
> interchange and the likes) If someone can come up with an analysis that
> would type the IR, that would be nice too.

Yes, I agree. I think most if not all optimizations and analyses work
on typed code. So typing Python code only needs some control flow and
data flow analysis. I think that problem should very much be separated
from any optimization (and most definitely lowering) passes. This is a
problem that Python compilers need to solve, but not projects like
Theano, Sympy or Blaze.

> What am I not interested in? Accelerator-specific IR does not appeal to
> me at all, even if I understand the need of other projects.
>
> A possibility, which is found in many compilers, is to have several IRs,
> e.g.
>
>                                            > C / C++ / whatever
>                                           /
> AST -> [stuff_0] -> PyKit -> [stuff_1] -> LLVM
>                                           \
>                                            > OpenCL / CUDA / whatever
>
>
> It seems to me some project are likely to benefit from sharing stuff_0
> and some are likely to share stuff_1, and some both.
>
> I expect most project to take the python's AST as input (2.7 version ?
> 3.x version ?) so transformations that would work directly on the AST
> are beneficial for all of us. That's where I'd like to contribute, and
> Pythran already has a pass management system to do this. Having more
> passes available at this level would be nice.
>
> That's for the high-level stuff, any thought about this ?

I think that's a valid approach. However, I have found Python ASTs
tedious to work with, and as my colleagues have pointed out in the
past, the bytecode is a simpler format with control flow "expanded
out", making it easier to focus on the actual control flow rather than
the abstract syntax. But the code for both approaches exists and only
has to be written once :)

serge Guelton

Jul 3, 2013, 8:51:20 AM
to numf...@googlegroups.com
> > - I tend to be suspicious of IRs that want to make decisions/optimize on
> > my behalf--such as by not explicitly representing parallelism. I
> > don't believe automatic parallelization is solved in a satisfactory
> > manner, or will be in the near future.
>
> I agree, parallelism should be explicit. The way this is
> conceptualized in pykit is through metadata, which may be introduced
> explicitly by a front-end, or through a compiler analysis pass.

Beware that this kind of metadata actually changes the semantics of
the nodes it is attached to, and each transformation has to be aware
of it.

serge Guelton

Jul 3, 2013, 8:52:17 AM
to numf...@googlegroups.com
> What to share?
> -----------------------
>
> I agree with having several IRs. My expression trees are always above
> the Python AST. I have used SymPy and might switch to Pymbolic to
> represent these trees since I can define domain specific optimizations
> at that level and then I'm left with a fairly generic symbolic math
> representation when I'm done. Having a nice symbolic mathematical
> representation is essentially what my project Ignition was originally
> designed to do, but hasn't had enough time to mature.

Hi Andy,

Sympy is indeed a great example of the kind of tool collaboration we
could have. If I understand correctly, I can turn a Python expression
from AST into Sympy's IR, optimize it at the Sympy level, and turn it
back into an AST.

Just a note about this:

turning an expression into Sympy is only valid given a context:

cos(a+pi/2)

What do the identifiers cos and pi refer to? Points-to analysis (or a
language restriction) is needed there...

I am not familiar with SymPy, but from their codebase, it seems that
hooking into `sympy.sympify(...)' makes it possible to use an AST as
an expression, and this[0] thread seems to provide an actual
sympy-expression-to-ast function. Anyone with experience on this?

[0] http://comments.gmane.org/gmane.comp.python.sympy/10601
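
As a minimal illustration of the context issue (assuming SymPy's
default sympify namespace, where cos and pi resolve to sympy.cos and
sympy.pi):

    import sympy

    a = sympy.Symbol('a')
    # The string is parsed in SymPy's own namespace, so `cos` and `pi`
    # are bound to sympy.cos and sympy.pi; that is the "context".
    expr = sympy.sympify("cos(a + pi/2)")
    print(sympy.simplify(expr))   # prints: -sin(a)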

serge Guelton

Jul 3, 2013, 8:53:13 AM
to numf...@googlegroups.com
On Tue, Jul 02, 2013 at 06:19:15PM +0100, mark florisson wrote:
> On 2 July 2013 13:25, Andy Ray Terrel <andy....@gmail.com> wrote:
> > Response from Serge Guelton:
> >
> > Hi all and thanks Andreas for this Wall of text!
> >
> > Before we dive into all the details, it seems to me we have to agree on
> > the goals. Why do we need a common representation, what would be the
> > benefits for all of us? If we don't have an answer to these questions,
> > we will likely fail to cooperate.
> >
> > I've been working on the PIPS compiler infrastructure for 3+ years and
> > keep working on the compiler + parallelism since then, and from what I
> > have learnt, choosing a generic IR, if not a common IR, is a difficult
> > problem, as each IR tends to reflect our specific needs...
> >
> > So what are we trying to share? I am interested in sharing high-level
> > transformations, including (but not limited to) loop unrolling, list
> > comprehension -> generator conversion, interprocedural constant unfolding
> > and various analysis such has points-to, argument effects and the like.
> > Polyhedral transformations would be nice too (e.g tiling, skewing,
> > interchange and the likes) If someone can come up with an analysis that
> > would type the IR, that would be nice too.
>
> Yes, I agree. I think most if not all optimizations and analyses work
> on typed code. So typing Python code only needs some control flow and

No analyses or transformations in Pythran work on typed code. Many
transformations are type-agnostic...

Andy Ray Terrel

Jul 3, 2013, 10:08:31 PM
to numf...@googlegroups.com
On Wed, Jul 3, 2013 at 7:52 AM, serge Guelton
<serge....@quiet-oceans.com> wrote:
>> What to share?
>> -----------------------
>>
>> I agree with having several IRs. My expression trees are always above
>> the Python AST. I have used SymPy and might switch to Pymbolic to
>> represent these trees since I can define domain specific optimizations
>> at that level and then I'm left with a fairly generic symbolic math
>> representation when I'm done. Having a nice symbolic mathematical
>> representation is essentially what my project Ignition was originally
>> designed to do, but hasn't had enough time to mature.
>
> Hi Andy,
>
> Sympy is indeed a great example of the kind of tool collaboration we
> could make. If I got it well, I can turn a Python expression from AST to
> Sympy's IR, optimize it a the Sympy level, and turn it back to AST.

Well, I wouldn't go from Python AST to SymPy; SymPy is really an
embedded DSL for symbolic math and does quite a bit of manipulation to
the expression tree as it is built.

>
> Just a note about this:
>
> turning an expression to Sympy is only valid given a context:
>
> cos(a+pi/2)
>
> what do the identifier cos and pi refer to? Points-to analysis (or
> language restriction) are needed there...

If they come from SymPy, they refer to objects in its expression tree.
Really it is just a language restriction; if you tried to pass a for
loop it would just bottom out.

>
> I am not familiar with SymPy, but from their codebase, it seems that
> hooking into `sympy.sympify(...)' makes it possible to use an AST as
> expression, and this[0] thread seems to provide an actual
> sympy-expression-to-ast function. Anyone with experience on this?
>
> [0] http://comments.gmane.org/gmane.comp.python.sympy/10601

Looks like this never got merged in.

-- Andy

Frédéric Bastien

Jul 4, 2013, 11:24:47 AM
to numf...@googlegroups.com
Hi,


Some news: we have something working for OpenCL and CUDA on Windows,
Mac OS X and Linux! It has a C interface and a Python interface. The C
interface is independent from Python.

There aren't many operations there yet, but we have started the work
to use it in Theano. At the same time, we will move there some
operations that are currently in Theano. That way, it will work. I
don't know where we should discuss this; I think I'll start a new
email thread on numfocus about this so this thread doesn't grow too
much.

I see it that way too.

> - Another aspect is the runtime, aka the bit that takes some low-level
>   IR and actually executes it. Breadth here is good.
>
>   Options include:
>
>   - Ye olde weave-like compile-and-link-into-interpreter method
>   - LLVM JIT infrastructure
>   - OpenCL
>   - CUDA driver interface
>
>   Again, I've probably missed some here--feel free to fill in missing bits.
>
> - In addition to the runtime, there's the decision of what Python
>   interface to use for the runtime. Disagreement here can once again
>   create friction for users.
>
> I've also got some personal opinions, which I'll put up for comment here:
>
> - Numpy's type system is fairly limited, but I feel like that might be a
>   good thing. By supporting full-blown pointers, for example, PyKit
>   already locks itself out of working with many aspects of OpenCL.
>   (buffer+offset is roughly equivalent to a pointer, but this quirk
>   needs to be reflected.)

> I'm not sure that's true, the pointer support is there for front-ends
> that want to use pointers. If your front-end doesn't, then you won't
> use them. Lowering passes could introduce pointers (e.g. to pass
> things around by reference), but such things can hopefully be
> controlled in a clear way. I imagine it will have a "CPU" and "GPU"
> set of pass configurations, with the GPU set having restrictions on
> what you can do.


I also think there are CPU and GPU/Accelerator passes, but also a common pass before.

On the questions someone else asked, what I would like this project to provide is:

1) an IR that can represent a Theano graph. This includes a way to represent ndarrays with the number of dimensions they have and their dtype. Ideally, it would be possible to hardcode the shape of some of the dimensions; this is useful in code generation, especially on the GPU. So the metadata needs to be passed/kept between each optimization/transformation pass. (A sketch of such an array type follows this list.)

2) optimizations that are back-end agnostic (CPU, GPU, Phi, ...) and optimizations that are back-end specific

3) a code generator. I would prefer to be able to get the generated C/C++ code and reuse it in our framework; that would make the integration easier, I think. Otherwise, we will need to be able to use C code compiled with different back ends.
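
A minimal sketch of an array type along these lines; the design is
hypothetical, not from any existing project:

    # None means the dimension size is unknown at compile time.
    from dataclasses import dataclass
    from typing import Optional, Tuple

    @dataclass(frozen=True)
    class ArrayType:
        dtype: str                          # e.g. "float32"
        shape: Tuple[Optional[int], ...]    # per-dimension size or None

        @property
        def ndim(self) -> int:
            return len(self.shape)

    # A rank-2 float32 array whose second dimension is known to be 3;
    # passes would have to preserve this metadata across transformations.
    t = ArrayType("float32", (None, 3))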

thanks for pushing this.

Frédéric

Rahul Garg

Jul 5, 2013, 1:09:57 PM
to numf...@googlegroups.com, SciPy Users List, cython...@googlegroups.com, Andreas Kloeckner, Alex Rubinsteyn, James Bergstra, Siu Kwan Lam, mark florisson, Frédéric Bastien, Jon Riehl, Stephen Diehl, Travis Oliphant, Matthew Rocklin, Kurt Smith, numba...@continuum.io, Florian Rathgeber, Aaron Meurer, Ondřej Čertík, Jason Moore, Anthony Scopatz, Mark Wiebe, John Wiggins, Bryan Catanzaro
Hi everyone.

Just wanted to say I am also watching this topic with interest. I have been building a compiler toolkit (with emphasis on toolkit: it is completely reusable by anyone) for CPUs and GPUs myself, and it has many of the ideas discussed here. I am hoping to make the framework public soon.

rahul
PhD student
McGill University

Rahul Garg

Jul 5, 2013, 1:19:50 PM
to numf...@googlegroups.com, SciPy Users List, cython...@googlegroups.com, Andreas Kloeckner, Alex Rubinsteyn, James Bergstra, Siu Kwan Lam, mark florisson, Frédéric Bastien, Jon Riehl, Stephen Diehl, Travis Oliphant, Matthew Rocklin, Kurt Smith, numba...@continuum.io, Florian Rathgeber, Aaron Meurer, Ondřej Čertík, Jason Moore, Anthony Scopatz, Mark Wiebe, John Wiggins, Bryan Catanzaro
Also, I wanted to suggest that we keep some use cases (i.e. real applications) in mind. One of the issues I have is that while I may think up any number of fancy constructs for GPUs, I don't often find that many applications that can benefit from GPUs, due to issues such as data transfer overhead or insufficient parallelism. Having some real applications, written in Python (not a simple wrapper around a C library), as expected use cases for the IR will be very helpful.

rahul

James Bergstra

Jul 5, 2013, 1:44:45 PM
to numf...@googlegroups.com
On Tue, Jul 2, 2013 at 1:08 PM, mark florisson
<markflo...@gmail.com> wrote:
>> One idea is that I could write some simple tests that do not
>> technically depend on Theano, but which replicate the core Theano data
>> structures in question, build expression graphs such as Theano would
>> build, and assert that a corresponding PyKit-compiled function does
>> the correct thing. With such tests written, we could have a more
>> grounded conversation about what it would take to make said tests
>> pass, and how a new compilation process might work.
>>
>> - James
>
> Great, I think that's very useful. Right now pykit doesn't really do
> anything, it's mostly specced out to some extent. The first pass will
> probably not have any optimizations, just a working end-to-end
> infrastructure. But this is very useful in determining the validity of
> the approach.
>

If anyone wants to have a crack at using pykit to compile a toy
Theano-like graph, they can start from these simple test cases:
https://github.com/ContinuumIO/pykit/pull/1

mark florisson

Jul 5, 2013, 2:23:36 PM
to numf...@googlegroups.com

Thanks James, I merged your PR and gave it a crack. Here's what it
looks like: https://github.com/ContinuumIO/pykit/commit/d704487d657bea59023d0082559d09effd7ccd73

mark florisson

Jul 5, 2013, 2:27:29 PM
to Rahul Garg, numf...@googlegroups.com
Hey Rahul,

That sounds great! Can you tell us a little more about it? What does
the IR look like, what constructs does it handle, what optimizations
does it do, how flexible are the analyses and lowering passes, what
backends does it currently have?

Anthony Scopatz

Jul 5, 2013, 2:34:50 PM
to Rahul Garg, numf...@googlegroups.com, SciPy Users List, cython...@googlegroups.com, Andreas Kloeckner, Alex Rubinsteyn, James Bergstra, Siu Kwan Lam, mark florisson, Frédéric Bastien, Jon Riehl, Stephen Diehl, Travis Oliphant, Matthew Rocklin, Kurt Smith, numba...@continuum.io, Florian Rathgeber, Aaron Meurer, Ondřej Čertík, Jason Moore, Mark Wiebe, John Wiggins, Bryan Catanzaro
> not a simple wrapper around a C library

Damn those simple wrappers! :)

Be Well
Anthony

mark florisson

Jul 5, 2013, 2:36:38 PM
to numf...@googlegroups.com
Understandable, it really depends on the transformation and IR in
question. I imagine if you're emitting C++ you don't need too many
lowering passes. I think if you want to do things like fusion or
detecting whether an operation can be parallel, you'll want to know
that for instance an operation is pure and does not raise an exception
(excluding e.g. objects). If we want to make this generate code - as
opposed to just simplify symbols - we'll definitely need types in the
IR.

Rahul Garg

Jul 5, 2013, 3:35:34 PM
to numf...@googlegroups.com, Rahul Garg
The IR is a typed AST with built-in support for high-level array
operators; it is much higher level than LLVM IR, for example.
Analysis and lowering passes are very flexible: it is a proper
multi-pass compiler. We also have a very robust runtime support
infrastructure to handle GPU resource management.
The toolkit is not for Python specifically, but I am building a
compiler for an annotated subset of Python on top of my toolkit to
demonstrate that it works. We also have a similar effort for other
languages.

However, there are some limitations. The toolkit is not tied to any
one language, which means that we cannot easily support some
constructs, like strings, that vary a lot across languages. I am
thinking of exposing at least a buffer of chars or something to handle
strings, but it is not there yet and frankly not my top priority right
now. Consequently you are essentially limited to numerics right now.

The toolkit is a dynamic compiler and does
compile-to-Python-module-and-link at runtime. Backends supported are
LLVM and OpenCL. Both of these backends are mature, tested on multiple
types of machines, and can handle many samples, and we are working on
more backends. Another student is working on a static backend to
C/C++, but that is very preliminary.

The toolkit is not yet public, but hopefully it will be in July or
August. I am avoiding a specific commitment right now, as I have
overpromised and under-delivered in the past, so I am a little
hesitant to announce dates for when I expect to make it public.

rahul

James Bergstra

Jul 5, 2013, 3:40:47 PM
to numf...@googlegroups.com
On Fri, Jul 5, 2013 at 2:23 PM, mark florisson <markflo...@gmail.com> wrote:
Nice!

Level 2: the dot -> gemm transformation pathway:
https://github.com/ContinuumIO/pykit/pull/2
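
For readers following along, the essence of such a pass, sketched on a
toy tuple IR (hypothetical, not the PR's actual code): a dot feeding
an add can be emitted as a single BLAS gemm node, since gemm computes
alpha*A@B + beta*C in one call:

    # Hypothetical rewrite: ('add', ('dot', A, B), C)
    #   -> ('gemm', 1.0, A, B, 1.0, C), i.e. 1.0*A@B + 1.0*C.
    def rewrite_dot_add(expr):
        if (isinstance(expr, tuple) and expr[0] == 'add'
                and isinstance(expr[1], tuple) and expr[1][0] == 'dot'):
            _, (_, A, B), C = expr
            return ('gemm', 1.0, A, B, 1.0, C)
        return expr

    print(rewrite_dot_add(('add', ('dot', 'A', 'B'), 'C')))
    # ('gemm', 1.0, 'A', 'B', 1.0, 'C')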

James Bergstra

Jul 5, 2013, 3:45:19 PM
to numf...@googlegroups.com, Rahul Garg
Very cool, that sounds exactly like what the folks at Continuum are up
to, and basically the sort of thing that this consolidation effort is
aimed at.

If you plan to make it public in a little while, why don't you make it
public now?

- James

Rahul Garg

Jul 5, 2013, 3:56:10 PM
to numf...@googlegroups.com, Rahul Garg

Well, I am holding back the release because I am currently making changes to the runtime to improve performance. Also, I want to write a few specification docs before release. There is almost no documentation currently, so it would be very difficult to get started with it.

rahul


Travis Oliphant

Jul 5, 2013, 3:58:54 PM
to numf...@googlegroups.com, Rahul Garg
Hey Rahul, 

Your project does sound interesting and very connected to what we are
doing at Continuum as well. I would highly encourage you to make the
code public as soon as possible, as it would be great to collaborate
instead of doing essentially the same thing.

You can make the project public before an official release.
Documentation is of course essential for widespread use. But don't
underestimate the ability of people to read code. Many people are
working on similar tool chains, and being able to understand what you
are doing and create common / shared tools would be very helpful.

Best,

-Travis

--

Travis Oliphant
Continuum Analytics, Inc.

Rahul Garg

Jul 5, 2013, 4:08:46 PM
to numf...@googlegroups.com, Rahul Garg
Ok sure. I will target end of July for release then. License will be Apache 2. I will keep everyone posted.

Btw, relatedly, I am building an auto-tuning library for some GPU
operations like matrix multiplication. Think of it as a small subset
of ATLAS for GPUs in OpenCL. Initial results are good, and I come very
close to CUBLAS and AMD's OpenCL BLAS on the respective hardware for
GEMM. Some initial code is here: http://www.raijincl.org
The code is very close to (but not 100%) usable, but at least people
can look at the logic if they want.

rahul

Frédéric Bastien

Jul 5, 2013, 11:14:36 PM
to numf...@googlegroups.com, Rahul Garg
Hi,

What about the license? I'm used to using BSD 3-clause, and the
current NumPy stack uses it heavily. I have heard good things about
Apache, but changing the license of the NumPy stack isn't trivial.
Personally, I won't bother discussing this if people stay with BSD
3-clause.

But do we agree on this, or should we discuss it? Rahul, would it be
fine for you to change the license to BSD 3-clause?

Fred

Aaron Meurer

Jul 5, 2013, 11:22:07 PM
to numf...@googlegroups.com, Rahul Garg
At the risk of starting a huge bikeshed flamewar, what's so great about Apache? As far as I can tell, it's slightly less free than BSD, and much harder to read.

Just use BSD. It's well-established throughout this community as the *right* license to use. 

Aaron Meurer

Travis Oliphant

Jul 5, 2013, 11:42:55 PM
to numf...@googlegroups.com

In the XDATA world,  the Apache license is preferred because of some promises it makes about not having patent time bombs in the software.

I could see it either way.
Travis

Frédéric Bastien

Jul 5, 2013, 11:45:21 PM
to numf...@googlegroups.com, Rahul Garg
From memory, it includes protection from patent lawsuits against the
developer: if you use the software, you can't bring a patent or
copyright lawsuit against the author; i.e. the user grants the author
a free license to any of the user's patents that the software could
infringe upon... Asking this of users could drive many of them away
due to company restrictions... but it could protect us from patents...

Disclaimer: I'm not a lawyer, and the only time I read it was more
than a year ago, so I could be completely wrong.

Also, as said, I'm not pushing for this; it is just that Rahul's code
was under Apache 2.0... so we can't reuse parts of it in the new GPU
ndarray without him changing the license.

Fred

Aaron Meurer

Jul 6, 2013, 12:36:51 AM
to numf...@googlegroups.com
I see. I don't know much about these patent issues. I'm just not a fan of the bookkeeping clause in the Apache license.

Aaron Meurer

Rahul Garg

Jul 6, 2013, 2:00:14 PM
to Frédéric Bastien, numf...@googlegroups.com
Ah, well, actually we had this discussion in our lab, and it went on
for quite a while, as all license discussions do. Finally we just
agreed that our lab's policy is to release everything under Apache 2.
I can look into it, but it will be difficult to convince my supervisor
and other potential contributors in my lab to change it to BSD3.

Do note that Apache 2 is not viral, so you should still be able to use
any license for all your other files. Apache 2 also permits commercial
usage. Anyway, I think for now I will just focus on actually getting
it released. License discussions are a lot less important currently.

rahul

mark florisson

Jul 6, 2013, 6:38:13 PM
to numf...@googlegroups.com, Rahul Garg
On 5 July 2013 20:35, Rahul Garg <rahul...@gmail.com> wrote:
> IR is a typed AST with built-in support for high level array operators. It
> is much higher level than LLVM IR for example.
> Analysis and lowering passes are very flexible, it is a proper multi-pass
> compiler. We also have a very robust runtime support infrastructure to
> handle GPU resource management.
> The toolkit is not for Python specifically, but I am building a compiler for
> annotated subset of Python on top of my toolkit to demonstrate that it
> works. We also have a similar effort for other languages.

Sounds good. So does it use SSA? I think a three-address code flow
graph with SSA is more amenable to optimization than an AST.

Rahul Garg

Jul 8, 2013, 5:26:29 PM
to numf...@googlegroups.com, Rahul Garg
It does not use SSA, but the framework does provide some common analyses, like live variable analysis. SSA would have been nice, but I also wanted it to be easy to generate, which SSA is not. SSA is also not that advantageous when it comes to dealing with arrays. In any case, I think if a compiler wants to use SSA, it should be possible to convert from the AST to SSA.

Some other details about the IR:

a) Parallelism is explicit. The compiler does not do any auto parallelization.

b) In general, you have to explicitly mark which code should be executed on the GPU

c) Generally you do not mark the data to be transferred to the GPU. The programmer essentially sees a shared memory paradigm. The compiler + runtime infer and transfer the relevant data. Is this the optimal choice? Not sure yet, but this is the choice we made for now.

d) The IR is actually split into multiple levels. The frontend accepts the high level IR, which is easy to generate, and the compiler lowers it during various transformations and the backends see the lowest level IR. The lowest level IR is still an AST but easier for code generators.

e) The compiler can deal with the strides in NumPy, though there is some limitation in terms of alignment (which I am trying to formalize) that may be an issue for some packed struct arrays but should be fine for most use cases.

These are all choices that we made. I think other choices can be made, and some other systems do make other choices. I am ok with providing more flexibility in the future, but this is the starting point I am currently working on. Anyway,  I will keep working on making the release.

thanks,
Rahul

mark florisson

Jul 9, 2013, 4:39:17 AM
to numf...@googlegroups.com
On 8 July 2013 22:26, Rahul Garg <rahul...@gmail.com> wrote:
> It does not use SSA but the framework does provide some common analysis like
> live variable analysis. SSA would have been nice but I also wanted it to be
> easy to generate, which SSA is not.

Yes, but producers don't have to generate SSA. They can use
alloca/load/store, and use a pre-existing SSA pass to convert to SSA.
> SSA is also not that advantageous when
> it comes to dealing with arrays.

Yes. What is advantageous, however, is to assign each sub-expression
to a temporary, essentially converting to a three-address-like
representation. I've found things are a lot easier this way, as is
representing control flow uniformly through jumps and conditional
jumps, making all analyses easier (with a completely flat structure of
blocks). This is basically the representation LLVM uses, except you
should be able to extend it with your own instructions and types more
easily, while still having those extensions participate in existing
optimizations.
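
As a concrete illustration, a toy lowering on a made-up tuple IR (not
pykit's real instructions) that flattens nested expressions into
temporaries:

    import itertools

    counter = itertools.count()

    def lower(expr, code):
        # Lower a nested tuple like ('add', a, b) to three-address
        # instructions, returning the name that holds the result.
        if not isinstance(expr, tuple):        # leaf: a variable name
            return expr
        op, *args = expr
        args = [lower(a, code) for a in args]  # flatten sub-expressions
        tmp = f"%t{next(counter)}"
        code.append((tmp, op, args))
        return tmp

    code = []
    lower(('add', ('add', 'a', 'b'), 'c'), code)
    for tmp, op, args in code:
        print(f"{tmp} = {op} {', '.join(args)}")
    # %t0 = add a, b
    # %t1 = add %t0, c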

Frédéric Bastien

Jul 11, 2013, 7:35:09 PM
to numf...@googlegroups.com
Hi,



On Tue, Jul 9, 2013 at 4:39 AM, mark florisson <markflo...@gmail.com> wrote:
> On 8 July 2013 22:26, Rahul Garg <rahul...@gmail.com> wrote:
>> It does not use SSA, but the framework does provide some common
>> analyses, like live variable analysis. SSA would have been nice, but
>> I also wanted it to be easy to generate, which SSA is not.
>
> Yes, but producers don't have to generate SSA. They can use
> alloca/load/store, and use a pre-existing SSA pass to convert to SSA.
>
>> SSA is also not that advantageous when
>> it comes to dealing with arrays.
>
> Yes. What is advantageous, however, is to assign each sub-expression
> to a temporary, essentially converting to a three-address-like
> representation. I've found things are a lot easier this way, as is
> representing control flow uniformly through jumps and conditional
> jumps, making all analyses easier (with a completely flat structure of
> blocks). This is basically the representation LLVM uses, except you
> should be able to extend it with your own instructions and types more
> easily, while still having those extensions participate in existing
> optimizations.

Why restrict yourself to 2 inputs per op? How do you implement fusion
of elementwise operations in that case?

Also, how can we represent in-place operations (where the result is
written into the input's memory) in SSA? Or do we do this in a later
phase that breaks SSA?

Fred

mark florisson

Jul 12, 2013, 4:35:01 AM
to numf...@googlegroups.com
Restricting to the most elemental operations makes analyses and passes
simpler. A first pass transforms add(array1, array2) into
map(add_func, array1, array2), and a subsequent pass fuses map/reduce
etc. operations (or uses BLAS where appropriate). There's no limit to
the number of arguments an operation can take, but the binary
operations are binary.

> Also, how can represent inplace operation(operation where the result is
> written in the input memory) in SSA? Or we do this in a later phase that
> break SSA?

You don't really "break SSA", but you don't have to use it directly
either. There are alloca/load/store instructions, which you can use to
manipulate stack variables. There is a pass to promote these to
virtual registers where possible, which can be useful to aid certain
optimizations. If you're mutating an array, that can only happen
through setitem/setslice, memory operations (pointers), or via a call
to some function.
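
A toy single-block sketch of that promotion pass (hypothetical
instruction tuples, not pykit's actual ones): stores record the value
held by a stack slot, loads become aliases for it, and the stack
traffic disappears:

    def promote(block):
        current = {}   # name -> the value it currently holds
        out = []
        for instr in block:
            op = instr[0]
            if op == 'alloca':                  # ('alloca', slot)
                continue                        # slot no longer needed
            elif op == 'store':                 # ('store', value, slot)
                _, value, slot = instr
                current[slot] = current.get(value, value)
            elif op == 'load':                  # ('load', dst, slot)
                _, dst, slot = instr
                current[dst] = current[slot]    # dst aliases the value
            else:                               # rewrite operand names
                out.append(tuple(current.get(x, x) for x in instr))
        return out

    block = [
        ('alloca', '%x'),
        ('store', 'a', '%x'),
        ('load', '%t0', '%x'),
        ('add', '%t1', '%t0', 'b'),
    ]
    print(promote(block))   # [('add', '%t1', 'a', 'b')]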

> Fred

Rahul Garg

Aug 27, 2013, 5:06:11 PM
to numf...@googlegroups.com
Hi Everyone.

A belated update from my side. Unfortunately, for the past month I got tied up with another project (an unexpected paper deadline due to the rejection of a previous paper), and thus things are delayed at my end. I will give further updates when I have things to show.

rahul