High performance computation: Python and Julia


Dahua

Sep 6, 2013, 7:59:13 AM
to juli...@googlegroups.com
While traveling recently, I met a lot of people working on machine learning and computer vision. I found that Python has become increasingly popular lately -- people are talking about new packages in Python for developing high-performance learning algorithms, notably Theano and NumbaPro.

Both Theano and NumbaPro provide "secondary" compilation facilities that can compile high-level Python code into highly optimized CUDA/LLVM code for large-scale computation. They do this in different ways: Theano focuses on optimizing vectorized computation, while Numba/NumbaPro puts more effort into accelerating loops.

With Julia, one can write code with performance comparable to C. However, with the rapid advancement of computational technologies, people seem to be more aggressive -- "comparable to C/Fortran" is no longer enough. People are more interested in languages/libraries that can unlock the full capabilities of their GPUs or multi-core CPUs (with AVX instructions) without writing architecture-dependent low-level code.

MathWorks' recent acquisition of Jacket also suggests that they are taking this trend seriously.

It is time for us to start thinking about how Julia should respond to this trend. 

One approach is to start exploring this through packages. For example, we can have a CUDA.jl package to provide lower-level common infrastructure for GPU computing, and a package similar to Theano for delayed expression construction and compilation.
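To make the delayed-expression idea concrete, here is a toy sketch (in Python, since Theano is the reference point; the classes below are invented for illustration and are not any package's actual API). Expressions build a symbolic graph first and are evaluated later -- a real system would lower such a graph to CUDA/LLVM instead of walking it:

```python
# Toy delayed-expression graph: nothing computes until ev() is called.
class Var:
    def __init__(self, name): self.name = name
    def __add__(self, other): return Node("+", self, other)
    def __mul__(self, other): return Node("*", self, other)
    def ev(self, env): return env[self.name]

class Node:
    def __init__(self, op, a, b): self.op, self.a, self.b = op, a, b
    __add__ = Var.__add__   # reuse the graph-building operators
    __mul__ = Var.__mul__
    def ev(self, env):
        x, y = self.a.ev(env), self.b.ev(env)
        return x + y if self.op == "+" else x * y

x, y = Var("x"), Var("y")
expr = x * y + x                    # just builds a graph, no arithmetic yet
print(expr.ev({"x": 3, "y": 4}))    # 15
```

The separation between graph construction and evaluation is the hook a compiler needs: the `ev` step is where a Theano-like package would instead emit optimized kernels.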


John Myles White

Sep 6, 2013, 8:29:36 AM
to juli...@googlegroups.com
I'd be really interested in seeing some of Theano's approach brought to Julia, but that's a huge body of work.

 -- John

Tim Holy

Sep 6, 2013, 9:38:03 AM
to juli...@googlegroups.com
Agreed. For me a fairly high priority is threads---I've got applications that
are dirt slow on one CPU (I'm talking days here...) and for which the DArray
approach is a nonstarter (due to the overhead of data transport, these are
"big data" as well as "big computation" problems). SMP is the obvious answer.
I don't think we're actually that far away from supporting low-level
multithreading, the main lack being a mutex around codegen. (Obviously all of
the threaded algorithms have to avoid touching gc, but that's getting easier
all the time.) I've spent a little time playing with this, but nothing is
sufficiently finished for general consumption (or even my own consumption). But
my own need is getting sufficiently dire that I am contemplating picking these
efforts up again, although in the near term other more immediate commitments
will almost certainly win the competition for my time.

A "nice" threading interface (i.e., something fancier than a wrapper around
pthreads) would be more work, of course. And of course GPUs are very
attractive, too, but again more work.

Finally, don't forget Krys' very nice work on delayed execution, which
implements something along exactly the lines you propose. That work has never
gotten the attention it deserves (and here I view myself as the #1 guilty
party in not picking that up and running with it, since it would in principle
be quite useful to me).

--Tim

On Friday, September 06, 2013 04:59:13 AM Dahua wrote:
> While traveling recently, I met a lot of people working on machine learning
> and computer vision. I found that Python has become increasingly popular
> lately -- people are talking about new packages in Python for developing
> high-performance learning algorithms, notably Theano
> <http://deeplearning.net/software/theano/> and NumbaPro
> <http://docs.continuum.io/numbapro/>.

Stefan Karpinski

Sep 6, 2013, 12:21:17 PM
to juli...@googlegroups.com
Making it possible to use multiple cores without DArrays needs to be a top priority after 0.2, and I don't think it can be explored through packages. GPU work may be possible through packages. Krys' work definitely deserves more attention there.

Jeff Bezanson

Sep 6, 2013, 2:21:17 PM
to juli...@googlegroups.com
The first thing to explore is using shared memory for communication.
We could have arrays stored in shared memory segments and eliminate
communication overhead without tangling with most of the problems of
threading. We can also use shmem for message passing, speeding up
distributed-memory code as well.

Setting up a shared array and having workers compute on it can be done
in a package, though we would want that to be in Base eventually of
course. Recall Amit's example:
https://gist.github.com/amitmurthy/5477462
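The bare mechanism is easy to sketch. Here is a toy illustration (in Python rather than Julia, purely because the facility demonstrated is the OS's anonymous shared mapping, not anything language-specific): an array lives in shared memory, the iteration space is divided, and a forked worker fills its half in place with no data copied between processes.

```python
import mmap, os, struct

N = 8
# Anonymous shared mapping (MAP_SHARED | MAP_ANONYMOUS under the hood);
# the mapping is inherited across fork, so both processes see one array.
buf = mmap.mmap(-1, N * 8)

pid = os.fork()
if pid == 0:
    # Worker: compute the second half of the iteration space in place.
    for i in range(N // 2, N):
        struct.pack_into("d", buf, i * 8, float(i * i))
    os._exit(0)

# Parent: handle the first half, then wait for the worker.
for i in range(N // 2):
    struct.pack_into("d", buf, i * 8, float(i * i))
os.waitpid(pid, 0)

result = [struct.unpack_from("d", buf, i * 8)[0] for i in range(N)]
print(result)  # squares of 0..7, computed cooperatively by two processes
```

Since only pointers into the mapping (here, byte offsets) cross the process boundary, the communication overhead Jeff mentions disappears for the data itself.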

Kevin Squire

Sep 6, 2013, 2:37:06 PM
to juli...@googlegroups.com
Amit's PTools.jl package implements some shared-memory support, though not on Windows.

Kevin

Stefan Karpinski

Sep 6, 2013, 2:44:43 PM
to Julia Dev
I really think that something very simple and limited, like making comprehensions automatically parallel when the comprehension expression does no memory allocation, would get us a rather long way.

Tim Holy

Sep 6, 2013, 5:04:54 PM
to juli...@googlegroups.com
On Friday, September 06, 2013 02:21:17 PM Jeff Bezanson wrote:
> Setting up a shared array and having workers compute on it can be done
> in a package, though we would want that to be in Base eventually of
> course. Recall Amit's example:
> https://gist.github.com/amitmurthy/5477462

Thanks for the reminder. That's indeed a good approach, particularly since the
data are already mmapped. Kevin, thanks for pointing out PTools, and
especially Amit for writing it! I'll definitely explore it.

And I agree with Stefan that it's desirable to consider yet more approaches to
introducing parallelism without exposing the guts of threading. Clearly the
bottom line here is that "expose a raw threading model" might not be the
wisest first step. That is, unless the goal is to find a way to increase the
number of issues reporting segfaults :-).

--Tim

Jeff Bezanson

Sep 6, 2013, 6:17:28 PM
to juli...@googlegroups.com
We could write a parallel comprehension that uses shmem right away.
Just like there is an implementation of `@parallel [ ... for ... ]`
that returns a DArray, there could be one that allocates the result in
shmem, divides the iteration space, sends out pointers, and returns
the result to processor 1. That can be done now, without any limits on
allocation.

Of course, the real problem there is handling access to existing data.
In a simple implementation everything the comprehension accesses would
have to be broadcast.

Personally I'm against limited approaches. It's hard to tell whether
an expression allocates memory, and in any case I want to be able to
allocate memory. A comprehension where each element requires a big
computation featuring lots of allocation is a great candidate for
parallelization. And allocation is not the only issue; you also have
to consider I/O and heap writes.

Kevin Squire

Sep 6, 2013, 9:10:59 PM
to juli...@googlegroups.com
Unix shared-memory approaches are a little dangerous.  One problem is that the shared memory segment will stick around if the program crashes (or even on a clean exit, if care isn't taken to clean it up properly).

I think pthreads would be a better approach (or any other implementation of lightweight threads in the same memory space), although Windows support is unclear to me.

Kevin

Stefan Karpinski

Sep 7, 2013, 12:10:09 AM
to Julia Dev
I also suspect that the shared-memory approach is not the right way to go. I think we need to use threads and take a more "consenting adults" approach to avoiding race conditions. Instead of trying to make it impossible, we should provide the necessary mechanisms to avoid it easily.
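As a minimal illustration of the "consenting adults" approach (sketched with Python threads, just to show the shape of the mechanism; nothing here is Julia API): the language does not prevent threads from sharing state, it simply makes guarding that state a one-line affair.

```python
import threading

counter = 0
lock = threading.Lock()   # the opt-in mechanism; omit it and you race

def bump(n):
    global counter
    for _ in range(n):
        with lock:        # guarded read-modify-write
            counter += 1

threads = [threading.Thread(target=bump, args=(10_000,)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(counter)  # 40000 with the lock held; unpredictable without it
```

The point is not that this is hard to write, but that the burden of writing it falls on the programmer, which is exactly the trade-off being debated below.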

Jeff Bezanson

Sep 7, 2013, 2:12:22 AM
to juli...@googlegroups.com
Something that only happens when your program crashes doesn't strike
me as that dangerous --- your program has already crashed, after all.
Are there any other problems?

I really don't think the "consenting adults" attitude does anything
for race conditions and deadlocks. You can have a julia library that
is not thread safe, and use it (or maybe not even be aware you're
using it) with threads, and everything falls apart. The problem is
that you didn't violate any abstractions, so it is hard to be aware
that you're doing anything that requires "consent". The system is only
composable if everybody is paranoid about threading issues all the
time. And many native libraries aren't.

Outside the sweet spot of OpenMP-style loops over arrays, threading
doesn't look that good. You want your processes mostly isolated from
each other most of the time anyway. Threading allows lots of
questionable but reasonable-seeming approaches, e.g. lots of threads
updating the same dictionary protected by a lock. Everybody has to do
lots of work to make sure that's supported, and in the end it probably
doesn't perform very well. Distributed memory also has performance
pitfalls, probably even worse ones, but at least you can generally
avoid worrying about data corruption or deadlocks.

Kevin Squire

Sep 7, 2013, 3:54:11 AM
to juli...@googlegroups.com
On Fri, Sep 6, 2013 at 11:12 PM, Jeff Bezanson <jeff.b...@gmail.com> wrote:
Something that only happens when your program crashes doesn't strike
me as that dangerous --- your program has already crashed, after all.
Are there any other problems?

Really?  Your program crashing and leaving a potentially large amount of shared memory allocated after the crash is not a big deal?  It can usually be cleaned up, but you have to know to look for it, and most users just won't know.  If you go this route, it would be a good idea to have Julia look for orphaned shared-memory segments on startup, which will clean things up whenever a user restarts Julia right away (and not when she doesn't).  But what does that say about Julia?  ("We expect to randomly lose chunks of memory, but don't worry, we'll take care of them when we wake up... ")

If Julia were bulletproof, it would be less of an issue, but crashing Julia is still pretty easy to do -- try hitting Ctrl-C during any long Pkg activity, or anything of reasonable length that calls an outside library.  For this reason alone, this isn't the path I would choose to go down... (Not that I have an opinion on this or anything!)

Other issues... well, it's somewhat cumbersome to use.  Each process has the memory mapped into its own address space, and usually each is mapped at a different memory location, so you can't pass pointers, just offsets.  Creating lots of small segments is very inefficient, so if you want to share a lot of variables, it's best to create one large segment and manage it yourself.  You can only delete a segment if no one is attached to it -- it won't go away on its own unless you reboot the computer.  Etc.  And hey, the shared segments have Unix file permissions, so you can share them with a friend if you set the permissions right (or wrong). ;-)

It's been a while since I've worked with this in any detail, but in my experience, while it can be made to work, it tends to suck the life out of you.
 
doesn't perform very well. Distributed memory also has performance
pitfalls, probably even worse ones, but at least you can generally
avoid worrying about data corruption or deadlocks.

No argument--I'm definitely a fan of limiting or minimizing locking when possible, and distributed memory processing can be extremely efficient in some domains--Erlang demonstrates both of these nicely.

I think what would be most useful here would be a description of the kinds of problems people are working on, and a focus on providing tools that make solving those problems easier.  I don't think there will be a one-size-fits-all solution.  My own work often fits well in a map-reduce framework, whereas other problems may require much more communication among different threads (updating a shared data structure, etc.) throughout the processing.

Kevin

Tim Holy

Sep 7, 2013, 10:48:45 AM
to juli...@googlegroups.com
On Saturday, September 07, 2013 12:54:11 AM Kevin Squire wrote:
> I think what would be most useful here would be a description of the kinds
> of problems that people are working on, and focus on providing tools to
> make solving those problems easier.

Here's my contribution to the "list of descriptions," which sounds a lot like
your use cases. For me the main category of problems tends to be
embarrassingly parallel operations on arrays, either 1:1 (meaning one result
value per input array value) or a reduction along a single dimension. They're
usually local operations, but typically involve looking at neighbors. For such
problems, breaking inputs into chunks is bad (it introduces edges, and that's
bad for looking at neighbors), but outputs can be broken into chunks as in
DArray just fine. The inputs tend to be big (often the outputs too), so copying
is to be avoided.
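That access pattern is easy to sketch (Python stands in for the eventual Julia code; the point is the sharing scheme, not the language): every worker reads the whole shared input, so neighbor lookups never cross a chunk edge, while each writes only its own contiguous slice of the shared output.

```python
import mmap, os, struct

N = 10
inp = list(range(N))        # stand-in for a big (mmapped, read-only) input
out = mmap.mmap(-1, N * 8)  # shared output, partitioned per worker

def smooth(i):
    # Local operation that looks at neighbors -- chunking the input
    # would create artificial edges here; sharing it avoids that.
    lo, hi = max(i - 1, 0), min(i + 1, N - 1)
    return (inp[lo] + inp[i] + inp[hi]) / 3.0

pid = os.fork()
lo, hi = (N // 2, N) if pid == 0 else (0, N // 2)  # output chunk per process
for i in range(lo, hi):
    struct.pack_into("d", out, i * 8, smooth(i))
if pid == 0:
    os._exit(0)
os.waitpid(pid, 0)

result = [struct.unpack_from("d", out, i * 8)[0] for i in range(N)]
print(result)
```

No copying of the (potentially huge) input happens at any point; only the output is divided, exactly as in the DArray output-chunking described above.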

However, I do occasionally have need to update shared data structures, so it's
not the case that everything I need fits into the above category. A rather
generic example would be processing a sequence of images, and when you find
something "interesting" you put it on the "interesting pile". That pile would
be shared among threads devoted to individual images, and used to make future
decisions about whether something is interesting.

--Tim

Jason Riedy

Sep 8, 2013, 12:31:34 PM
to juli...@googlegroups.com
And Kevin Squire writes:
> Unix shared-memory approaches are a little dangerous. One
> problem is that the shared memory segment will stick around if
> the program crashes (or even on a clean exit if care isn't
> taken to clean them up properly).

So mmap a backing file (user can place it on a RAM FS), then
unlink the file. Send the fd to partners through a domain
socket. No mess on crashes. SysV IPC is horrible, but mmap's
quite useful.
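The trick can be sketched like so (Python used only as executable pseudocode for the POSIX calls involved):

```python
import mmap, os, tempfile

# Create a backing file (the user could place it on a RAM FS), size it,
# then unlink it immediately. The mapping stays valid while any fd or
# mapping is open, but if the program crashes the kernel reclaims
# everything -- no orphaned segment to clean up.
fd, path = tempfile.mkstemp()
os.ftruncate(fd, 4096)
os.unlink(path)                 # the name is gone; no mess on crash

buf = mmap.mmap(fd, 4096)       # mapping survives the unlink
buf[:5] = b"hello"

# The fd (not the now-nonexistent path) is what would be handed to
# partner processes, e.g. over a Unix domain socket with SCM_RIGHTS.
print(bytes(buf[:5]))
print(os.path.exists(path))    # the file no longer has a name
```

This sidesteps exactly the orphaned-segment problem raised earlier in the thread, at the cost of needing fd-passing to set up the workers.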

There is another possibility for non-transparent parallelism:
Supporting a PGAS abstraction similar to Co-Array Fortran.
Either SHMEM (http://openshmem.org) or Global Arrays
(http://hpc.pnl.gov/globalarrays/) could work with much of Julia,
possibly just as a package.
--
Jason

Benjamin Silbaugh

Sep 8, 2013, 7:22:45 PM
to juli...@googlegroups.com
I think coprocessor support (GPUs, Xeon Phis, etc.) would be a great addition to Julia, as there are certain classes of problems that fit nicely into the GPGPU paradigm. I know that many are interested in using analysis methodologies like FEM/FVM earlier in the engineering design stage, and are looking to GPUs as a way to bring high-fidelity analysis to the workstation. Also, grid/cloud computing has a lot of potential for academia, but some in industry and government are avoiding these technologies for security reasons. However, if more work can be done locally on a workstation or small GPU cluster, then they can tackle the big problems while minimizing the security risk.

Some of my classmates (fellow graduate students) at the University of Maryland have been investigating the use of CUDA for Computational Fluid Dynamics (CFD) applications. The real challenge in getting good performance out of GPU code is managing the memory transfer between main memory and GPU memory. A person in our lab showed that if you can fit the entire problem on a single GPU, you can get a 70x speedup for structured RANS CFD (with the SA turbulence model). However, if you need to do frequent I/O to disk (e.g. writing transient solution data), or your problem is simply too big to fit on the GPU, then your performance gain really drops off (I forget the exact values). Unfortunately I cannot find the reference, but I seem to recall that some folks out at NASA did some preliminary studies on using GPUs to accelerate their in-house overset CFD solver by simply replacing a few key Fortran subroutine calls with calls to a CUDA version, and the performance gains were rather disappointing (~10x speedup). In this case, the hypothesis was that the lackluster performance was due to the frequent data transfers between main memory and GPU memory. This makes me wonder how efficient Theano and NumbaPro really are for non-trivial problems; that is, problems that do not lend themselves to simple vectorized expressions (or at least not without incurring significant data transfers between main memory and the GPU). I have a suspicion that these Python/CUDA solutions are somewhat limited to small problems with special/simple structure, but I could be wrong.

Perhaps, if Julia can combine its capacity for efficient handling of loops with GPU support (or coprocessor support in general), then Julia would be applicable to a much larger class of problems than these Python/CUDA solutions. That is, supporting simple vectorized expressions may not be enough for Julia to really distinguish itself from the rest of the pack. (Though being able to offload vectorized expressions to the GPU would be better than nothing.)

Kevin Squire

Sep 9, 2013, 3:09:19 AM
to juli...@googlegroups.com
On Sun, Sep 8, 2013 at 9:31 AM, Jason Riedy <ja...@lovesgoodfood.com> wrote:
And Kevin Squire writes:
> Unix shared-memory approaches are a little dangerous. One
> problem is that the shared memory segment will stick around if
> the program crashes (or even on a clean exit if care isn't
> taken to clean them up properly).

So mmap a backing file (user can place it on a RAM FS), then
unlink the file.  Send the fd to partners through a domain
socket.  No mess on crashes.  SysV IPC is horrible, but mmap's
quite useful.

Hmm.  That might work.  I think that Windows shared memory is like that by default.
 

There is another possibility for non-transparent parallelism:
Supporting a PGAS abstraction similar to Co-Array Fortran.
Either SHMEM (http://openshmem.org) or Global Arrays
(http://hpc.pnl.gov/globalarrays/) could work with much of Julia,
possibly just as a package.

Those both look interesting.  Unfortunately, SHMEM seems to be Linux only, and Global Arrays only claims to support cygwin on Windows, which Julia doesn't support.  But perhaps it could be made to work.

Kevin

Kevin Squire

Sep 9, 2013, 3:34:40 AM
to juli...@googlegroups.com
LLVM has more than one project focusing on producing code for GPUs.  I'm wondering if that is the right level to hook into these things, or if that is too low and focus should be on packages to provide the necessary functionality?

I think it would be nice if one could simply swap in a module and have (certain) matrix multiplies start using a GPU or coprocessor transparently (or with minimal setup/code change), but I'm not sure how easy that would be (or even if it's a goal).

Kevin

Jason Riedy

Sep 9, 2013, 7:33:09 AM
to juli...@googlegroups.com
And Kevin Squire writes:
> Hmm.  That might work.  I think that Windows shared memory is
> like that by default.

I quite happily know nothing about Windows.

One interesting caveat for mmap() is that, without additional work, the
master process won't know if the clients die. A purely threaded approach
will just die when one crashes.

> Those both look interesting.  Unfortunately, SHMEM seems to be
> Linux only, and Global Arrays only claims to support cygwin on
> Windows, which Julia doesn't support.  But perhaps it could be
> made to work.

SHMEM is a standard implemented by a few vendors; there may be
implementations for Windows.

But... This is non-transparent parallelization. I do think a
PGAS system could work via a package, but I don't have the time or
manpower to send that way. Lower-level integration emulating the
old T3E instruction set via LLVM would be very interesting, but would
likely not map as well onto current hardware. A few platforms do
support sending registers around, but very few (Tilera, maybe?).
--
Jason

Viral Shah

Sep 10, 2013, 12:35:58 AM
to juli...@googlegroups.com
While it should be easy enough to call CUDA BLAS routines and other such library routines through ccall, I doubt that it will give good enough performance, since data will have to be moved back and forth between GPU memory and main memory.

Once LLVM gets better at GPU codegen, we can try using that and at least allow some subset of Julia to work on GPUs. We could have a CuArray type, for example. I can imagine doing something like:

@gpu (vars...) begin


end

-viral

Viral Shah

Sep 10, 2013, 12:37:19 AM
to juli...@googlegroups.com
That got posted abruptly. I meant to say that the @gpu macro can automatically move stuff from main memory to GPU memory and back, and execute a whole kernel on the GPU.

Until then, the best bet may be to write the GPU kernels in C and ccall them from Julia.
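For illustration, the ccall route has this shape -- here sketched with Python's ctypes standing in for ccall, and libm's cos standing in for a compiled kernel (a real setup would load something like a hypothetical libkernels.so built with nvcc instead):

```python
import ctypes, ctypes.util

# Load a shared library and bind a C symbol directly, the same pattern as
# ccall((:kernel, "libkernels"), ...) in Julia. libm is used here only
# because it exists everywhere; a CUDA kernel wrapper would look identical.
libm = ctypes.CDLL(ctypes.util.find_library("m") or None)
libm.cos.restype = ctypes.c_double
libm.cos.argtypes = [ctypes.c_double]

print(libm.cos(0.0))  # 1.0
```

The appeal is that no compiler support is needed on the Julia side; the cost, as noted above, is that every call boundary is also a potential host/device copy.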

-viral

Dahua

Sep 10, 2013, 9:47:33 AM
to juli...@googlegroups.com
In GPU programming, it is important to plan ahead which data reside where, so as to reduce round-trip traffic between the CPU and the device.

In my mind, a GPU program in Julia might look like:

ga = GPUArray( ... )
gb = GPUArray( ... )

@gpu begin

  ... some complicated computation on ga and gb ...

end

... some other codes (not on GPU) ...

@gpu begin
 
   ... further computation on ga and gb ...

end

a = fetch(ga)  # copy from device to CPU memory
b = fetch(gb)

free(ga)
free(gb)

The section enclosed within @gpu would be compiled to the CUDA VM.

I am currently working on a CUDA package (slowly), which is nothing more than a wrapper around the CUDA driver API -- with this, people may at least start using GPUs in Julia.

- Dahua

Viral Shah

Sep 10, 2013, 3:06:34 PM
to juli...@googlegroups.com
Thanks Dahua. This is how I imagined it too, but you gave a much better description. A CUDA package would be a great starting point to have, and I suspect it will find other contributors too.

-viral

Vamsi Parasa

Nov 25, 2013, 3:49:12 PM
to juli...@googlegroups.com
Thanks to all for the great discussion!
It would be great to be able to write GPU kernel code in Julia without resorting to CUDA or OpenCL. So far, to the best of my knowledge, there is only one language like this: Harlan (a language for general-purpose GPU computing, https://github.com/eholk/harlan). It would be a great distinguishing feature if Julia could have this.

Stefan Karpinski

Nov 25, 2013, 4:03:11 PM
to Julia Dev
The README for Harlan says "Harlan requires an OpenCL implementation" and then lists some implementations that it should work with. So it seems Harlan does resort to OpenCL after all.

Vamsi Parasa

Nov 25, 2013, 4:31:56 PM
to juli...@googlegroups.com
Thanks Stefan for pointing it out. Didn't realize that it was using OpenCL underneath.

Al Rahimi

Nov 29, 2013, 7:37:04 PM
to juli...@googlegroups.com
Is there any plan to support something like the actor model (as in Erlang or Scala) for concurrent programming?

Reid Atcheson

Dec 16, 2013, 12:35:48 AM
to juli...@googlegroups.com


On Monday, September 9, 2013 2:34:40 AM UTC-5, Kevin Squire wrote:
LLVM has more than one project focusing on producing code for GPUs.  I'm wondering if that is the right level to hook into these things, or if that is too low and focus should be on packages to provide the necessary functionality?

I think it would be nice if one could simply swap in a module and have (certain) matrix multiplies start using a GPU or coprocessor transparently (or with minimal setup/code change), but I'm not sure how easy that would be (or even if it's a goal).

Kevin



My experience with these attempts to get BLAS-like coprocessor library functionality is that they fall short of the potential gains suggested by peak throughput; it's the wrong level at which to target GPU code generation. For the most part, algorithms need to be redesigned for good performance on GPUs, so swapping out a CPU BLAS for a GPU BLAS doesn't quite do this. The level at which real gains are seen is when one backs further away from the matrix or stencil operations, looks at the entire loop they sit in, and then makes decisions about loop tiling and other things.