Matrix-free operations in Julia


flex

May 23, 2012, 8:05:31 PM
to juli...@googlegroups.com
Hi,

First of all, I really like what I see in what Julia has to offer so far, and I will keep a keen eye on new developments. What I am particularly interested in is how Julia handles truly large-scale linear-algebra problems. My research focuses on PDE-constrained optimization for seismic inversion problems, and we use adjoint-state methods that allow us to define the Jacobian and its adjoint matrix-free. In this way, we can deal with the size of our problems by never forming the matrix, only its action on a vector.

To support my research program, we developed an extensive environment based on parallel Matlab in combination with SPOT, a linear-operator toolbox; its parallel extension pSPOT; and a seismic data container. These tools allow us to develop concrete, parallel implementations of physics-based optimization for a forward-modeling and inversion framework, including gradients, Jacobians, and their adjoints. The operator overloading of SPOT gives us a key advantage, and my question is whether there are plans to support matrix-free matrix-vector operations in Julia.

More specifically, I am interested in the thoughts among the Julia development community regarding the Rice Vector Library, as outlined in the paper "Software framework for abstract expression of coordinate-free linear algebra and optimization algorithms" by Anthony D. Padula, Shannon D. Scott, and William W. Symes. Incorporating certain key aspects of this library's design would give Julia a unique advantage over other languages.

I am looking forward to hearing your response, and thanks again for all the effort.

Viral Shah

May 23, 2012, 10:40:05 PM
to juli...@googlegroups.com
The type framework and multiple dispatch should make it straightforward for you to have matrix-free operations for PDEs. 

You can easily have a type for PDEs, Jacobians, and all sorts of things. Then you define the operators that you want on those types. You could also implement the type promotion rules and convert() so that an explicit matrix can be formed in cases where efficient operations are not provided.
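
In hedged, illustrative terms (Python here rather than 2012-era Julia syntax, which is long obsolete; the class name is made up), the pattern Viral describes, a type whose multiplication applies the operator plus a convert()-style escape hatch to an explicit matrix, looks like this:

```python
import numpy as np

class Laplacian1D:
    """Matrix-free 1-D Laplacian: applies the stencil [-1, 2, -1]
    to a vector without ever forming the n-by-n matrix."""
    def __init__(self, n):
        self.shape = (n, n)

    def __matmul__(self, v):          # the "A * v" action
        w = 2.0 * v
        w[:-1] -= v[1:]
        w[1:] -= v[:-1]
        return w

    def toarray(self):
        # convert()-style fallback: form the explicit matrix for cases
        # where an efficient matrix-free operation is not provided.
        n = self.shape[0]
        return 2 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)

A = Laplacian1D(5)
v = np.arange(5.0)
assert np.allclose(A @ v, A.toarray() @ v)   # same action, no matrix formed
```

The idea carries over directly: define the multiplication operator for the new type, and generic code that only ever multiplies never needs to know the matrix was never assembled.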

Currently, we do not have support for large-scale linear algebra, except for experiments with an implementation of the LINPACK benchmark. It is one of the traditional interests of our research group, but we have been quite busy with just putting the basic julia infrastructure together for a while now. One could integrate ScaLAPACK at a pinch, but I prefer to avoid it, and MPI in general - focussing on more modern cloud stacks and some basic fault tolerance for parallelism.

Can you say a little more about the types and sizes of problems you want to solve, number of processors, etc.?

-viral

flex

May 23, 2012, 11:00:37 PM
to juli...@googlegroups.com
Hi Viral,

Thanks for your input. My group develops PDE-constrained optimization algorithms for which even "postage-stamp" problems involve n=1000^3 variables and N=1000^5 data points. The companies we develop algorithms for are interested in scaling these algorithms to 64k-200k processors. At these numbers, resilience to connection timeouts becomes a real issue, which explains why the industry works with relatively simple client-server solutions; that severely limits the sophistication of the algorithms. I hope this answers your question.

Felix

Viral Shah

May 23, 2012, 11:39:22 PM
to juli...@googlegroups.com
This is certainly the kind of stuff that we would like to address. Let's keep the dialogue going.

What kind of capabilities would you like to see in julia for solving problems at this scale? We already have a whole bunch of general-purpose low-level parallelization infrastructure, but it has yet to be tuned to work at scale. Perhaps this can push our thinking in that direction.

-viral

Jonathan Dursi

May 24, 2012, 12:07:53 AM
to juli...@googlegroups.com
On 23 May 10:40PM, Viral Shah wrote:
> I prefer to
> avoid it, and MPI in general - focussing on more modern cloud stacks and
> some basic fault tolerance for parallelism.

Sorry, this is off topic for this thread, but this interested me; what
exactly is the alternative "modern cloud stack" for a high-speed
communications middleware for technical computing which handles the
various types of network fabrics one sees in the wild?

I'm slowly coming up to speed in Julia and the parallelism stuff is
something I'd like to get involved with in the coming months; is there a
roadmap for extending/tuning what currently exists?

- Jonathan

--
Jonathan Dursi <ljd...@gmail.com>

Viral Shah

May 24, 2012, 12:33:10 AM
to juli...@googlegroups.com
On 24-May-2012, at 9:37 AM, Jonathan Dursi wrote:

> On 23 May 10:40PM, Viral Shah wrote:
>> I prefer to
>> avoid it, and MPI in general - focussing on more modern cloud stacks and
>> some basic fault tolerance for parallelism.
>
> Sorry, this is off topic for this thread, but this interested me; what exactly is the alternative "modern cloud stack" for a high-speed communications middleware for technical computing which handles the various types of network fabrics one sees in the wild?

My take-off point is that computation is getting cheaper and becoming a utility. The kind of thing I have in mind is:

1. Virtual Machines that can be booted up at will for computation (Eucalyptus, EC2, VMWare, etc.)
2. Regular good old TCP/IP, which is the only thing I know how to use reliably on heterogeneous clusters of unreliable machines
3. Some basic fault-tolerance for simple parallelism, where work can simply be reallocated
4. A Hadoop FS for more fancy fault-tolerance (may not even be automatic, but user-driven through hints and such)

In my opinion, a lot of scientific workloads do not need high-speed communication, so long as you can write your code in a latency-tolerant manner, overlapping computation with communication. See for example:

Multithreading and One-Sided Communication in Parallel LU Factorization
http://cscads.rice.edu/publications/paper16

This is usually quite difficult to do except for skilled programmers, and even then it is fraught with errors - but with good language-level support it could become easier.

Of course, it should be straightforward to use a faster communication layer, if available.

> I'm slowly coming up to speed in Julia and the parallelism stuff is something I'd like to get involved with in the coming months; is there a roadmap for extending/tuning what currently exists?

There is no real roadmap yet. It would be a good idea to put a quick roadmap together on the wiki or in an issue, after some discussion on the list.

-viral

Viral Shah

May 24, 2012, 12:49:58 AM
to juli...@googlegroups.com
When we implement the iterative solvers - PCG, GMRES, BICGSTAB etc., we'll most likely have a similar interface for matrix-free operations. Just provide a type that has * defined for matrix vector product and transpose matrix vector product, and everything works.
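
As a sketch of that contract (illustrative Python/NumPy, not the eventual Julia API): a conjugate-gradient loop that touches the operator only through its action on a vector.

```python
import numpy as np

def cg(apply_A, b, tol=1e-10, maxiter=500):
    """Conjugate gradients written against the minimal interface:
    all it ever asks of the operator is the action v -> A*v."""
    x = np.zeros_like(b)
    r = b - apply_A(x)
    p = r.copy()
    rs = r @ r
    for _ in range(maxiter):
        Ap = apply_A(p)
        alpha = rs / (p @ Ap)
        x += alpha * p
        r -= alpha * Ap
        rs_new = r @ r
        if np.sqrt(rs_new) < tol:
            break
        p = r + (rs_new / rs) * p
        rs = rs_new
    return x

# A symmetric positive-definite operator supplied matrix-free:
# a shifted 1-D Laplacian stencil, never assembled.
def apply_A(v):
    w = 3.0 * v
    w[:-1] -= v[1:]
    w[1:] -= v[:-1]
    return w

b = np.ones(100)
x = cg(apply_A, b)
assert np.linalg.norm(apply_A(x) - b) < 1e-8
```

Solvers such as BiCG additionally need the transpose action, which is exactly the second method Viral mentions.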

-viral

Jonathan Dursi

May 24, 2012, 12:52:52 AM
to juli...@googlegroups.com
On 24 May 12:33AM, Viral Shah wrote:
> My take-off point is that computation is getting cheaper and becoming a utility.

Absolutely, but communications between those cheap computation units
_isn't_ getting cheaper at even remotely the same rate.

Not that that matters for everything - you're completely right of course
that some tasks don't need to be tightly coupled. Ssh to a bunch of
nodes, launch some processes, etc. Anything at all will work for that.

But not everything can be done in that SETI-at-home type parallelism;
some stuff is just really tightly coupled. PDE solvers, big
eigenproblems, etc. (LU decomposition is fairly easy, which is why
people often treat HPL scores unfairly dismissively). And TCP/IP is a
huge waste of expensive network fabrics; people buy those fast
interconnects for a reason.

That's not to say that all, or even most, of parallelism in Julia or any
other language/environment should be geared expressly to the very
high-performance end of things, but one wouldn't want to be excluded
from that community, either.

In terms of high-performance middleware in the technical computing
space, there's basically MPI, GASNet, or Charm++. GASNet is nice (and
would plug very nicely into say the distributed array stuff) but if one
or two key people's funding got cut, development would stagnate quickly
- and it's no more fault tolerant than MPI-2. Charm++ is great for a
lot of things, but, well, no one has re-implemented scalapack in
Charm++, for instance.

We should know by next week if MPI-3 will implement fail-stop fault
tolerance, or if it'll be delayed to 3.1; but the real issue is writing
algorithms for these sorts of problems that are failure tolerant.

Tim Holy

May 24, 2012, 5:17:40 AM
to juli...@googlegroups.com
Hi Felix,

Some of the ongoing work in extras/grid.jl may also be relevant to you. This
got started because my lab does large-scale imaging, 2500^2 x 200 x 3000 data.
(I'm impressed that you're already doing petabytes!) We treat images a bit
like solving PDEs, and are interested in developing restriction/prolongation
operators, interpolation of various orders, differencing schemes, etc.

For solving PDEs across thousands of processors, presumably you want to have
"localized chunks" on each machine. Julia's current DArray allows you to
distribute arrays along one dimension, but I'd guess you'll want localization
along all spatial/temporal directions simultaneously. I'm working towards
tiled, or "bricked," array representations. Right now my main efforts are
devoted to single-machine (but GPU-backed) computations, in which the disk
serves to cache the individual tiles, which are accessed through
memory-mapping. I suspect that this may not suffice for your needs---you may really
need a generalized DArray---and so I'm very pleased you've joined the list; it
seems reasonable to develop this in a way that works for as wide an audience
as possible.
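
For concreteness, the "bricked" layout Tim describes comes down to simple index arithmetic; a hedged sketch (Python for illustration, and the function name and process-grid shapes are hypothetical):

```python
import numpy as np

# Map a global 3-D index to its owning process in a "bricked"
# decomposition (blocks along every axis), as opposed to slicing the
# array along a single dimension only.
def owner_bricked(idx, grid_shape, proc_shape):
    # Block size per axis (ceiling division), then which block holds idx.
    blocks = [-(-g // p) for g, p in zip(grid_shape, proc_shape)]
    coords = tuple(i // b for i, b in zip(idx, blocks))
    return int(np.ravel_multi_index(coords, proc_shape))

grid = (512, 512, 512)
procs = (4, 2, 2)          # 16 processes in a 4x2x2 brick layout
assert owner_bricked((0, 0, 0), grid, procs) == 0
assert owner_bricked((511, 511, 511), grid, procs) == 15
```

The payoff of bricking is locality: a stencil update touches only face-neighbor bricks, instead of the whole slabs a one-dimensional split would require.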

I should point out that the work in extras/grid.jl is currently stalled, due
mostly to non-Julia duties and the fact that I'm in the middle of a detour to
improve the performance of high-dimensional array operations in Julia. Big
parts of grid.jl may need rewriting, and the ability to transparently
memory-map operations means that we may not need all the types I "advertise" in the
comments in that file. Finally, the version in Julia's master is probably not
the latest; https://github.com/timholy/julia/blob/imagelib/extras/grid.jl is a
better place to look. I may have a few other changes I need to push, but this
should be relatively current.

Best,
--Tim
> >> SPOT (http://www.cs.ubc.ca/labs/scl/spot/), a linear operator toolbox, its
> >> parallel extension pSPOT (https://github.com/slimgroup/pSPOT), and a
> >> seismic data container (https://github.com/slimgroup/SeisDataContainer).
> >> These tools allow us to develop a concrete implementation of
> >> physics-based optimization with concrete parallel implementations for a
> >> forward modeling (https://www.slim.eos.ubc.ca/node/6675) and
> >> inversion (https://www.slim.eos.ubc.ca/fwi) framework, including
> >> gradients, Jacobians, and their adjoints. The operator overloading of
> >> SPOT gives us a key advantage and my question is whether there are plans
> >> to support matrix-free matrix-vector operations in Julia.
> >>
> >> More specifically, I am interested in what the thoughts are amongst the
> >> Julia development community regarding the Rice Vector Library
> >> (http://www.trip.caam.rice.edu/software/rvl/rvl/doc/html/index.html),
> >> as outlined in the paper "Software framework for abstract expression
> >> of coordinate-free linear algebra and optimization algorithms"
> >> (http://portal.acm.org/citation.cfm?doid=1499096.1499097) by

Viral Shah

May 24, 2012, 12:31:14 PM
to juli...@googlegroups.com
On 24-May-2012, at 10:22 AM, Jonathan Dursi wrote:

> That's not to say that all, or even most, of parallelism in Julia or any other language/environment should be geared expressly to the very high-performance end of things, but one wouldn't want to be excluded from that community, either.

We will end up having some primitives for communication (we already have something in place), which can be implemented for multiple transports. The default could be TCP/IP. For all the other fancy stuff, I would prefer GASNet so long as it is maintained. Is Dan Bonachea still the primary author of GASNet?

What is more important is the programming model that is exposed to the parallel algorithm writer to write the eigensolver, PDE solver, etc. Once the right abstractions are in place, we can always plug in different communication libraries for performance.

Then, there will be I/O bound applications, which will pose a different set of challenges, relating to distributed filesystems and such.

-viral

Viral Shah

May 24, 2012, 1:38:35 PM
to juli...@googlegroups.com, ljd...@gmail.com
On Thursday, May 24, 2012 10:22:52 AM UTC+5:30, Jonathan Dursi wrote:
> But not everything can be done in that SETI-at-home type parallelism;
> some stuff is just really tightly coupled. PDE solvers, big
> eigenproblems, etc. (LU decomposition is fairly easy, which is why
> people often treat HPL scores unfairly dismissively.) And TCP/IP is a
> huge waste of expensive network fabrics; people buy those fast
> interconnects for a reason.

BTW, everyone keeps saying how easy LU decomposition is. However, all the fast implementations I have seen run into thousands of lines of C/Fortran and MPI. Are there fast parallel implementations in higher-level languages that people like to use for computation? The best I know of is the UPC implementation, but new work may have appeared since.

I actually think that writing HPC code in julia, while not difficult, is not exactly easy either, especially if it has to compete with the best implementations out there without sacrificing julia's productivity features. The other stuff is a much more difficult proposition. To start with, I'd be happy if julia made it possible to do master-slave and LU-decomposition-type problems without much effort on the user's part.

-viral

Jonathan Dursi

May 24, 2012, 1:46:56 PM
to Viral Shah, juli...@googlegroups.com
Oh, I absolutely agree that MPI (say), or even GASNet, isn't the *programming model* one would want to expose; it's far too low level, tedious, and error prone, and Julia could do so much more. The distributed array functionality already shows that. My question was about what you'd want to use as an underlying communications substrate to build stuff on top of. And that does matter more than one might like; having, e.g., collectives available at that level will affect design.

- Jonathan
Jonathan Dursi <ljd...@gmail.com>

From: Viral Shah <vi...@mayin.org>
Date: Thu, 24 May 2012 10:38:35 -0700 (PDT)
Subject: Re: [julia-dev] Matrix-free operations in Julia

Stefan Karpinski

May 24, 2012, 1:58:53 PM
to juli...@googlegroups.com, Viral Shah
On Thu, May 24, 2012 at 1:46 PM, Jonathan Dursi <ljd...@gmail.com> wrote:
> Oh, I absolutely agree that MPI (say), or even GASNet, isn't the *programming model* one would want to expose; it's far too low level, tedious, and error prone, and Julia could do so much more. The distributed array functionality already shows that. My question was about what you'd want to use as an underlying communications substrate to build stuff on top of. And that does matter more than one might like; having, e.g., collectives available at that level will affect design.

Can you clarify what you mean by "underlying communications substrate"? Do you mean alternate transport protocols to TCP?

Jonathan Dursi

May 24, 2012, 2:09:12 PM
to juli...@googlegroups.com
Yes, exactly; ideally there'd be some middleware that can take advantage of multiple transports - shared memory without going through the TCP stack on-node, high-speed RDMA-based transports on HPC cluster fabrics like InfiniBand, and whatever might lie in between - and that implements useful primitives efficiently on top of them.

There are libraries that do this in the scientific context, but none of them are perfect, and for a lot of things, TCP gives you what you want. But at scale or for really tightly coupled computations where you need that low latency / high bandwidth, it's nice to have alternatives.


- Jonathan
Jonathan Dursi <ljd...@gmail.com>

From: Stefan Karpinski <ste...@karpinski.org>
Date: Thu, 24 May 2012 13:58:53 -0400
Cc: Viral Shah<vi...@mayin.org>
Subject: Re: [julia-dev] Matrix-free operations in Julia

Stefan Karpinski

May 24, 2012, 2:17:03 PM
to juli...@googlegroups.com
Yeah, sure, that's entirely possible. The high-level stuff doesn't really depend on the transport mechanism, so the transport mechanism should be swappable, given that it can implement certain high-level functionality.

flex

May 25, 2012, 12:58:18 AM
to juli...@googlegroups.com
Hi Viral,

Thanks. We at SLIM are certainly interested in keeping the dialogue going. We will get back to you with details on what we would like to see.

In the short term, we would certainly be interested in having a look at some of the parallel capabilities of Julia, with a particular eye on how things scale. The reason we are interested is that we are planning to move to 3D seismic, which will eventually lead to the problem sizes industry is interested in (I mentioned this above). While parallel Matlab, in combination with our object-oriented environment, has allowed us to successfully develop numerous algorithms for 2D seismic involving low hundreds of workers (parallel Matlab sessions), we are really concerned about scaling to 3D, which entails a two-orders-of-magnitude increase in problem size. We raised some of our concerns with the Matlab community, and it is clear that they are not really ready to scale parallel Matlab to large (>500) numbers of parallel Matlab sessions.

Regarding my initial question, here is a thought. As far as I understand, Julia supports explicit matrix-vector multiplies, and at just-in-time compile time the compiler verifies that the domain ranges and vector sizes are consistent; if not, it throws a compile error.

Now suppose we were to allow matrix-free multiplies. In that case, it is one thing to check for size consistency, but quite another to ensure that the adjoint of the matrix actually is the adjoint. Remember, we needed to code adjoints explicitly in Matlab. While this seems a reasonable approach, there is no guarantee that the adjoint is actually correct.

To make sure this does not happen, I (often unsuccessfully) force my students to carry out the "dot test" numerically. However, as we know from automatic differentiation, adjoints can be computed automatically by the compiler using simple rules, and this would allow the compiler to derive the adjoint at compile time.

If this feature were built into the language, it would be truly transformative. I am not saying that we are too lazy or unwilling to compute the adjoint, but my experience working with students and industry has clearly shown that way too often people write highly optimized code for matrix-free matrix-vector multiplies, only to be at a complete loss as to how to get an adjoint that passes the dot test once the "forward" code has been written.
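
For readers unfamiliar with it, the dot test is easy to state concretely: for random x and y, a correct forward/adjoint pair must satisfy <Ax, y> = <x, A'y> up to rounding. A small illustrative sketch (Python/NumPy; the operator is a toy first-difference):

```python
import numpy as np

def dot_test(forward, adjoint, n, m, trials=5, rtol=1e-10):
    """For random x, y check <A x, y> == <x, A' y>, which any
    correct forward/adjoint pair must satisfy."""
    rng = np.random.default_rng(0)
    for _ in range(trials):
        x, y = rng.standard_normal(n), rng.standard_normal(m)
        lhs, rhs = forward(x) @ y, x @ adjoint(y)
        if abs(lhs - rhs) > rtol * max(abs(lhs), abs(rhs), 1.0):
            return False
    return True

# Matrix-free first-difference operator (maps length n to n-1)
# and its hand-coded adjoint.
fwd = lambda x: x[1:] - x[:-1]
adj = lambda y: np.concatenate([[-y[0]], y[:-1] - y[1:], [y[-1]]])
assert dot_test(fwd, adj, 10, 9)

# A sign error in the adjoint's first entry is caught immediately.
bad = lambda y: np.concatenate([[y[0]], y[:-1] - y[1:], [y[-1]]])
assert not dot_test(fwd, bad, 10, 9)
```
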

This has been, and continues to be, a major problem, closely related to the problem of computing matrix-free implementations of Jacobians, and their adjoints, in order to compute the gradient of some nonlinear functional. Apparently, this problem is not unique to our field but is also prominent in machine learning.

So a proposal could be to incorporate into Julia the automatic computation of adjoints, and of gradients of algorithms, via matrix-free Jacobians (and their adjoints, for e.g. Gauss-Newton methods).

I am a seismologist, so I apologize in case I said something stupid. Please let me know what you think.

Felix

flex

May 25, 2012, 1:07:09 AM
to juli...@googlegroups.com
Yes, I agree, and this has been the main premise on which we built our matrix-free environment in Matlab. This environment allows us to use large-scale solvers such as the ones you mention. In fact, the SPOT framework was originally developed to solve compressive-sensing problems involving matrix-free transforms and one-norm solvers such as spgl1.

Felix

david tweed

May 25, 2012, 3:29:36 AM
to juli...@googlegroups.com
I've worked with automatic differentiation, albeit in a different context and for different reasons, so I've done some reading of the literature. In that context, is the technique you'd use for _absolutely enormous_ problems actually known? I seem to recall quite a few papers using sophisticated techniques to compute very accurate low-rank approximations to the adjoint (because many of the "components" of the adjoint are minuscule) in order to be adequately usable, rather than using standard auto-diff ideas. Is the state of the field such that the "best" thing to do is known well enough to embed within the language itself yet?
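
One family of methods in that spirit is randomized low-rank approximation, which needs only the operator's action on random probe vectors; a rough sketch of the idea (illustrative only, not a claim about the specific papers David recalls):

```python
import numpy as np

# Randomized range-finding: build a rank-k basis for a matrix-free
# operator from k + oversample matvecs with random vectors, giving
# the low-rank approximation A ~= Q (Q^T A).
def lowrank_from_matvecs(apply_A, n, k, oversample=5, seed=None):
    rng = np.random.default_rng(seed)
    Omega = rng.standard_normal((n, k + oversample))
    Y = np.column_stack([apply_A(Omega[:, j]) for j in range(Omega.shape[1])])
    Q, _ = np.linalg.qr(Y)            # orthonormal basis for the range of A
    return Q

# Demo on a genuinely rank-3 operator available only as a matvec.
rng = np.random.default_rng(0)
U = rng.standard_normal((50, 3))
V = rng.standard_normal((50, 3))
apply_A = lambda x: U @ (V.T @ x)     # never forms the 50x50 matrix

Q = lowrank_from_matvecs(apply_A, 50, 3, seed=1)
A = U @ V.T                            # explicit matrix, for checking only
err = np.linalg.norm(A - Q @ (Q.T @ A)) / np.linalg.norm(A)
assert err < 1e-10                     # the probes captured the whole range
```
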

Stefan Karpinski

May 25, 2012, 11:50:09 AM
to juli...@googlegroups.com
We're very interested in doing matrix computations at the scale you're talking about and would love to work with you on getting things working for that.

As to the adjoint business, there certainly isn't going to be anything in the core language as magical as automatically doing dot tests or computing adjoints that aren't requested by the programmer. However, one of the beauties of Julia is that you can very easily a) create your own types that interoperate naturally with existing ones and perform just as well, b) modify the behavior of existing types by altering the source, which is pretty easy to read and to extend, or c) change the language itself for your purposes; since most of Julia is implemented in Julia, this is not too hard. I would say that those options are in descending order of how recommended they are, but they are all options.

Viral Shah

May 25, 2012, 12:35:13 PM
to juli...@googlegroups.com
On the issue of parallel computing, going to this many processors is certainly part of our thinking. Of course, at that scale we need some fault tolerance (either in the system or in the algorithm).

How complex is your codebase? How difficult would it be to port to julia? We already have some parallel capabilities (darray, spawn). Can we get your test codes to run on a few nodes with little effort? What primitives do you use for parallelism currently, and what would you like to see in julia?

As for the matrix-free methods and computations you describe, this is something that can be supported in a julia library for people using such methods. Someone who uses this stuff on a regular basis needs to take the lead in developing such a package.

-viral

thomaslai

May 25, 2012, 5:43:39 PM
to juli...@googlegroups.com
Hi, I'm Thomas, a developer in Felix's group, actively working on the SPOT and pSPOT Matlab libraries he mentioned above. You can find our codebase in its entirety on github:
SPOT (Matrix-free operators) - https://github.com/slimgroup/spot
pSPOT (Parallel extension to SPOT) - https://github.com/slimgroup/pSPOT
SeisDataContainer (Extension to Matlab data type with SPOT-specific utilities) - https://github.com/slimgroup/SeisDataContainer

We are excited about the possibilities offered by Julia and are looking into porting our functionality. I have been reading the Julia documentation and trying it out on the command line, though it seems to be missing help files for some of the functions, and I can't guess what they do. Also, there don't seem to be clear guidelines on how to implement object-oriented functionality in Julia, which our code is built on.

As for parallel computing, our design choices are mostly limited by Matlab's implementation of parallelism, which is based on MPI. Mostly it takes the form of applying serial matrix-free SPOT operators inside spmd blocks and micromanaging the data transfer. We're hoping that Julia's integrated parallelism means we can get away with just one library, instead of the separate serial and parallel libraries we had to implement in Matlab.

-Thomas

Jeff Bezanson

May 25, 2012, 6:15:58 PM
to juli...@googlegroups.com
Welcome Thomas!

Yes, we'll need to do a documentation blitz pretty soon, as we get close to 1.0.

Julia is purely object-oriented, albeit unconventionally so. See
extras/sparse.jl or extras/bitarray.jl for examples of implementing
new array types. Or base/rational.jl for an example of a numeric type,
or base/set.jl for an example of a general container.

We're interested in feedback/ideas about the parallel stuff since we
have some low-level functionality but don't really have the
higher-level APIs figured out. The "low-level" functionality is still
pretty high-level compared to MPI; it's basically asynchronous remote
function calls, and shared value cells that you can read and write.
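
Those two primitives have close analogues in most languages; as a loose illustration (Python futures standing in here for Julia's asynchronous remote calls and remote references, not the actual API):

```python
from concurrent.futures import ThreadPoolExecutor

def square(x):
    return x * x

# submit() returns immediately with a future (like a remote ref);
# result() blocks until the value is available (like fetch()).
with ThreadPoolExecutor(max_workers=2) as pool:
    refs = [pool.submit(square, i) for i in range(4)]   # async calls
    results = [r.result() for r in refs]                 # fetch

assert results == [0, 1, 4, 9]
```
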