High performance computation: Python and Julia

Dahua

unread,

Sep 6, 2013, 7:59:13 AM9/6/13

to juli...@googlegroups.com

In recent traveling, I met a lot of people working on machine learning and computer vision. I found that Python has become increasingly popular lately -- people are talking about new packages in Python for developing high performance learning algorithms, notably Theano and NumbaPro.

Both Theano and NumbaPro provides "secondary" compilation facilities that can compile high-level python codes into highly optimized CUDA/LLVM codes for lage-scale computation. They did this in different ways. Particularly, Theano focuses on optimizing vectorized computation, while Numba/NumbaPro places more efforts on accelerating loops.

With Julia, one can write codes with performance comparable to C codes. However, with the rapid advancement in computational technologies, people seem to be more aggressive -- "comparable to C/Fortran" is no longer enough. People are more interested in languages/libraries that can unlock the full capabilities of their GPUs or multi-core CPUs (with AVX instructions) without writing architecture-dependent low-level codes.

Mathworks' recent acquisition of Jacket also seems to suggest that Mathworks is also taking this seriously.

It is time for us to start thinking about how Julia should respond to this trend.

One approach is to start exploring this through packages. For example, we can have a CUDA.jl package to provide lower-level common infrastructure for GPU computing, and a package similar to Theano for delayed expression construction and compilation.

John Myles White

unread,

Sep 6, 2013, 8:29:36 AM9/6/13

to juli...@googlegroups.com

I'd be really interested in seeing some of Theano's approach brought to Julia, but that's a huge body of work.

-- John

Tim Holy

unread,

Sep 6, 2013, 9:38:03 AM9/6/13

to juli...@googlegroups.com

Agreed. For me a fairly high priority is threads---I've got applications that
are dirt slow on one CPU (I'm talking days here...) and for which the DArray
approach is a nonstarter (due to the overhead of data transport, these are
"big data" as well as "big computation" problems). SMP is the obvious answer.
I don't think we're actually that far away from supporting low-level
multithreading, the main lack being a mutex around codegen. (Obviously all of
the threaded algorithms have to avoid touching gc, but that's getting easier
all the time.) I've spent a little time playing with this but nothing
sufficiently finished for general consumption (or even, my own consumption). But
my own need is getting sufficiently dire that I am contemplating picking these
efforts up again, although in the near term other more immediate commitments
will almost certainly win the competition for my time.

A "nice" threading interface (i.e., something fancier than a wrapper around
pthreads) would be more work, of course. And of course GPUs are very
attractive, too, but again more work.

Finally, don't forget Krys' very nice work on delayed execution, which
implements something along exactly the lines you propose. That work has never
gotten the attention it deserves (and here I view myself as the #1 guilty
party in not picking that up and running with it, since it would in principle
be quite useful to me).

--Tim

On Friday, September 06, 2013 04:59:13 AM Dahua wrote:
> In recent traveling, I met a lot of people working on machine learning and
> computer vision. I found that Python has become increasingly popular lately
> -- people are talking about new packages in Python for developing high
> performance learning algorithms, notably

> Theano<http://deeplearning.net/software/theano/>and NumbaPro
> <http://docs.continuum.io/numbapro/>.

Stefan Karpinski

unread,

Sep 6, 2013, 12:21:17 PM9/6/13

to juli...@googlegroups.com

Making it possible to use multiple cores without DArrays needs to be a top priority after 0.2 and I don't think it can be explored through packages. GPU work may be possible through packages. Krys' work definitely deserves more attention there.

Jeff Bezanson

unread,

Sep 6, 2013, 2:21:17 PM9/6/13

to juli...@googlegroups.com

The first thing to explore is using shared memory for communication.
We could have arrays stored in shared memory segments and eliminate
communication overhead without tangling with most of the problems of
threading. We can also use shmem for message passing, speeding up
distributed-memory code as well.

Setting up a shared array and having workers compute on it can be done
in a package, though we would want that to be in Base eventually of
course. Recall Amit's example:
https://gist.github.com/amitmurthy/5477462

Kevin Squire

unread,

Sep 6, 2013, 2:37:06 PM9/6/13

to juli...@googlegroups.com

Amit's Ptools.jl package implements some shared memory support, though not on Windows.

Kevin

Stefan Karpinski

unread,

Sep 6, 2013, 2:44:43 PM9/6/13

to Julia Dev

I really think that something very simple and limited like making comprehensions automatically parallel when the comprehension expression does no memory allocation would get us a rather long way.