Is there a use case for Plan based Workflows?


Timothy Baldridge

Jan 30, 2013, 2:18:15 PM
to numerica...@googlegroups.com
I've recently been doing work on generating native code in Clojure using LLVM. This has got me thinking quite a bit about how to leverage the new SSE/AVX extensions from Clojure and, more on topic, how to generate highly optimized code on the fly from core.matrix calls.

One of the issues I have with the current implementation of core.matrix is that it implies immediate execution. Has anyone thought about perhaps making a version of this that relies on a plan-based workflow? For instance, let's say I wanted to execute the following code:

vec1 = vec2 + (vec3 * vec4)

AVX2 allows this to happen as a single instruction, but with the current core.matrix API we would be required to do this in two steps: (a) the multiplication and (b) the addition.
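
Concretely, with core.matrix-style add and emul (stand-ins here for whichever element-wise operations the implementation exposes), that means an intermediate allocation and two passes over the data:

  (def tmp  (emul vec3 vec4)) ; intermediate vector allocated here
  (def vec1 (add vec2 tmp))   ; second pass over the data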

I think this issue will be even more pronounced once we start talking about running this sort of code on a GPU. 

So perhaps the protocol functions should set up a pipeline of operations, with the result only calculated when needed. This would allow the underlying implementation to chain/combine/compile pipelines into single functions for better optimization.

One example of where this is already used is .NET's IQueryable<T> interface: http://blogs.msdn.com/b/mattwar/archive/2008/11/18/linq-links.aspx
http://msdn.microsoft.com/en-us/library/bb351562.aspx

Clojure 1.5's Reducers library also follows this pattern. You construct a pipeline of functions that are then handed directly to the underlying collection.
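
For example, a pipeline like this just builds up a composed reducing function, and work only happens when the collection is finally reduced:

  (require '[clojure.core.reducers :as r])

  ;; r/filter and r/map merely compose functions; nothing runs yet
  (def pipeline (r/map inc (r/filter even? (range 1000))))

  ;; the fused pipeline executes in a single pass only here
  (r/fold + pipeline)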

Anyway, I'd love to hear some thoughts about this.

Timothy Baldridge

Joel Boehland

Jan 30, 2013, 3:46:50 PM
to numerica...@googlegroups.com
Very interesting! I've been thinking about how to do something like this as well. Thinking out loud: I wonder if we could have a special def- form (defkernel, def-element-op, ...) that preserved the source of the body, but otherwise compiled it as normal, so that immediate execution remains the default fallback. Then we could have optional specializers that, at runtime, compile that source to the low-level kernels of whatever platform (GPU/SSE/AVX) they are run on, replacing the default immediate-execution function body?

You probably already know about the Python Blaze project, but in case you don't, it sounds like they are doing exactly what you are describing: creating deferred-execution kernels that are LLVM JIT-compiled at runtime (with all the low-level type/size info of the arrays).

Do you have a public link to your clojure/llvm native codegen project? It sounds really interesting.

--Joel

kovas boguta

Jan 30, 2013, 7:17:58 PM
to numerica...@googlegroups.com
It's a good question.

The analogy with reducers is interesting. The difference is that with reducers, there is no way to dig into the stack of operations and optimize them, or get them to optimize against each other, which could be interesting. For the matrix optimization case, we need to retain the symbolic meaning of the original code: either explicitly as source code, as a purely declarative AST, or in a more complicated way via the operational interface of the plan object tree.

So the options could be:
1. Compile the instructions down to a map-and-vector based AST, and do further analysis and compilation from there.
2. Do symbolic analysis and transformation on the code directly, which would require keeping around the source code for functions.
3. Do the analysis and transformation using protocols on the "plan" object.

#3 is the least attractive IMHO; doing this kind of analysis on object-like things is making work for yourself. For instance,
http://msdn.microsoft.com/en-us/library/system.linq.expressions.expression.aspx
is not the way we want to be manipulating and rewriting our expressions.


#2 would be a great option if generic tools for such things already existed.

#1 seems like the easiest starting point.
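
For concreteness, the AST for vec1 = vec2 + (vec3 * vec4) under option 1 might look something like this (the exact shape is invented for illustration):

  {:op :add
   :args [{:ref :vec2}
          {:op :mul
           :args [{:ref :vec3} {:ref :vec4}]}]}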

Ben Mabey

Jan 30, 2013, 8:00:03 PM
to numerica...@googlegroups.com
On 1/30/13 12:18 PM, Timothy Baldridge wrote:
> One of the issues I have with the current implementation of
> core.matrix is that it implies immediate execution.

Yeah, this is an issue even in simple cases such as mapping and reducing. With the current API you are forced to allocate memory for the entire data structure produced by the mapping (assuming you don't want to map in place) before reducing can even begin. I've solved this in the past with a map-reduce function that combines the two operations, but that is obviously a one-off kludge. If the API moves to a more computation-based one then we allow for greater modularity (yay laziness!). Another nice side effect is that the API would feel more clojurey, with lazy semantics.
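
Something like this, roughly (a hypothetical sketch; map-reduce is not part of the current API):

  (defn map-reduce
    "Folds (f x) into acc with rf, element by element, so the
    mapped collection is never materialized."
    [f rf init coll]
    (reduce (fn [acc x] (rf acc (f x))) init coll))

  ;; e.g. sum of squares in one pass, no intermediate allocation:
  (map-reduce #(* % %) + 0.0 [1.0 2.0 3.0])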

-Ben

Mike Anderson

Jan 30, 2013, 11:16:42 PM
to numerica...@googlegroups.com
On Thursday, 31 January 2013 03:18:15 UTC+8, Timothy Baldridge wrote:
> I've recently been doing work on generating native code in Clojure using LLVM. This has got me thinking quite a bit about how to leverage the new SSE/AVX extensions from Clojure and, more on topic, how to generate highly optimized code on the fly from core.matrix calls.

Awesome stuff - definitely an important topic. I'd love to be able to run Clojure-generated code on an LLVM-based backend.
 

> One of the issues I have with the current implementation of core.matrix is that it implies immediate execution. Has anyone thought about perhaps making a version of this that relies on a plan-based workflow? For instance, let's say I wanted to execute the following code:

> vec1 = vec2 + (vec3 * vec4)

I know all the current implementations are focused on immediate execution. But I think the API should allow for deferred execution as well. That has certainly been my intention at least.

Have you spotted anything in the core.matrix API that prevents an implementation from doing this?
 

> AVX2 allows this to happen as a single instruction, but with the current core.matrix API we would be required to do this in two steps: (a) the multiplication and (b) the addition.

> I think this issue will be even more pronounced once we start talking about running this sort of code on a GPU.

> So perhaps the protocol functions should set up a pipeline of operations, with the result only calculated when needed. This would allow the underlying implementation to chain/combine/compile pipelines into single functions for better optimization.

I can see a few different approaches here that could be effective.

1) Write a core.matrix implementation that defers execution and builds up a pipeline of operations that can then be realised. The advantage is that it is mostly transparent to an API user.

2) Build a separate symbolic expression library that can optimise matrix expressions and then execute them via core.matrix calls on any implementation. This could do higher-level optimisations such as simplification of expressions, constant folding etc.

3) Extend the API with functions to handle and execute symbolic expressions. An LLVM-based implementation could compile the code directly; a normal non-optimising implementation can just execute the symbolic expression in the regular way. It could work somewhat like SQL "prepared statements", i.e. you get a handle to a precompiled code unit that you can then execute with different parameters as needed (see the sketch below).

If we want to be *really* state of the art then I think we need to do both 2 and 3.
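
A rough sketch of what 3 might feel like to use (compile-expr and execute are invented names, not existing core.matrix functions):

  ;; compile once, getting a handle to a precompiled code unit
  (def fma (compile-expr '[a b c] '(+ a (* b c))))

  ;; execute repeatedly with different parameters; an LLVM backend
  ;; compiles this to native code, a plain backend just interprets it
  (execute fma vec2 vec3 vec4)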
 

> One example of where this is already used is .NET's IQueryable<T> interface: http://blogs.msdn.com/b/mattwar/archive/2008/11/18/linq-links.aspx
> http://msdn.microsoft.com/en-us/library/bb351562.aspx

> Clojure 1.5's Reducers library also follows this pattern. You construct a pipeline of functions that are then handed directly to the underlying collection.

I like the reducers concept and analogy. However, I think we need to work at a higher level of abstraction (symbolic expressions) rather than with raw functions. Raw functions should be a compiled end product rather than an intermediate data format.

This becomes especially important when you want to send expressions over the wire for distributed execution, or if you want to do expression optimisation etc.
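
This is where plain-data expressions shine: an s-expression round-trips over the wire trivially:

  (def expr '(+ vec2 (* vec3 vec4)))
  (def wire (pr-str expr))      ; => "(+ vec2 (* vec3 vec4))"
  (= expr (read-string wire))   ; => true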

Mike Anderson

Jan 30, 2013, 11:43:21 PM
to numerica...@googlegroups.com
On Thursday, 31 January 2013 08:17:58 UTC+8, kovasb wrote:
> It's a good question.
>
> The analogy with reducers is interesting. The difference is that with reducers, there is no way to dig into the stack of operations and optimize them, or get them to optimize against each other, which could be interesting. For the matrix optimization case, we need to retain the symbolic meaning of the original code: either explicitly as source code, as a purely declarative AST, or in a more complicated way via the operational interface of the plan object tree.

Exactly! It is really important to keep the symbolic representation around as long as possible before compiling to an optimised form.
 

> So the options could be:
> 1. Compile the instructions down to a map-and-vector based AST, and do further analysis and compilation from there.

Clisk does something like this for image-generation operations. Worth looking at. I actually used a defrecord, but the principle is the same.
 
> 2. Do symbolic analysis and transformation on the code directly, which would require keeping around the source code for functions.

I think this is effectively pretty similar to #1. In many ways it's more "simple" and lightweight, although the downside is that you lose some of the ability to put metadata in the AST structure.
 
> 3. Do the analysis and transformation using protocols on the "plan" object.

I think you could make #3 pretty transparent to the user. For example,

(+ a (* b c))

could return a matrix that hasn't actually been evaluated yet; it just stores the expression and parameters for deferred execution. It would only need to compile and execute the expression when asked for a direct realisation of some part of the result (e.g. a subsequent mget).
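
A minimal sketch of that deferred value, using a map plus a delay (the names are illustrative, with add, emul and mget standing in for the implementation's element-wise operations):

  (defn deferred [expr thunk]
    {:expr expr                ; symbolic form, available to optimisers
     :result (delay (thunk))}) ; realised at most once, on demand

  ;; (+ a (* b c)) becomes a deferred value:
  (defn deferred-add-mul [a b c]
    (deferred '(+ a (* b c)) #(add a (emul b c))))

  ;; mget forces realisation before indexing:
  (defn deferred-mget [m & indices]
    (apply mget @(:result m) indices))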
 

> #3 is the least attractive IMHO; doing this kind of analysis on object-like things is making work for yourself. For instance,
> http://msdn.microsoft.com/en-us/library/system.linq.expressions.expression.aspx
> is not the way we want to be manipulating and rewriting our expressions.


Yuck. But we have s-expressions; no need to use an OOP syntax to build them :-)
 

> #2 would be a great option if generic tools for such things already existed.

> #1 seems like the easiest starting point.

My problem with #1 is that it risks baking a specific AST representation into the API. I'm not sure I like that idea.

I'd rather we tried to make something work with s-expressions at an API level, something like #2. What the API does under the hood can be much more flexible.

kovas boguta

Jan 31, 2013, 5:23:23 AM
to numerica...@googlegroups.com
On Wed, Jan 30, 2013 at 11:43 PM, Mike Anderson
<mike.r.an...@gmail.com> wrote:
> I think you could make #3 pretty transparent to the user. For example
>
> (+ a (* b c))
>
> could return a matrix that hasn't actually been evaluated yet; it just

What would the mechanics be?

Growing an inert symbolic structure from the inside out is a challenge.

AFAICT it is necessary to set up ahead of time, for each symbol (such
as + and * in this case), the capture mechanism.

Limiting the question to the matrix protocols, this is possible. But
it seems possible we'd want a wider range of constructs to capture the
full intent.
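
One illustrative capture mechanism: shadow each symbol in a dedicated namespace, so that evaluation grows a data structure instead of computing (a sketch, not a proposal for the real API):

  (ns capture-demo
    (:refer-clojure :exclude [+ *]))

  ;; each captured symbol is set up ahead of time to build data:
  (defn + [& args] {:op :add :args (vec args)})
  (defn * [& args] {:op :mul :args (vec args)})

  ;; (+ a (* b c)) now evaluates to
  ;; {:op :add :args [a {:op :mul :args [b c]}]}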

Konrad Hinsen

Jan 31, 2013, 6:04:10 AM
to numerica...@googlegroups.com
Timothy Baldridge writes:

> One of the issues I have with the current implementation of
> core.matrix is that it implies immediate execution. Has anyone
> thought about perhaps making a version of this that relies on a
> plan based workflow?

As the discussion has shown, yes :-)

I am also very interested in optimization options, but I'd also like
to have an API that guarantees immediate execution. Experience has
shown that sophisticated techniques tend to fail for some special
cases, and then it's always good to be able to go back to the basics.

I see the delayed-execution stuff as distinct from the core.matrix
project, which in my view is about a unifying API rather than about
any concrete implementation. In fact, I see the core.matrix API as
something in the same category as algo.generic:

http://github.com/clojure/algo.generic

So much so that perhaps algo.generic.array would be a good place
for it in the Clojure ecosystem, but that's not the most important
point for now.

Racket's new array library has delayed execution and draws on the
experience the Haskell people had with Repa. That's probably the most
advanced project along these lines, so it's worth looking there for
inspiration. Python's Blaze is of interest as well, as has been
mentioned already, but it hasn't generated any practical experience
until now. Yes, I am back to my favorite topic: the dangers of
theoretical API design ;-)

Konrad.

Timothy Baldridge

Jan 31, 2013, 1:23:34 PM
to Konrad Hinsen, numerica...@googlegroups.com
At this point I'm wondering if we should define what the end goal of this project is. Thus far we've basically said "An abstract API for performing matrix operations". But what does that actually mean?

When I see many of the operations currently in the API, I start thinking of my experience working with video/audio processing. A stereo audio stream is really a matrix of doubles [channels x number of samples]. A video frame is a matrix of [width x height x elements per pixel].

Many of the operations we're looking at in this library apply quite cleanly to these sorts of data. For instance, mixing two audio streams:

(add (scale audio-stream1 mix-amount-1)
     (scale audio-stream2 mix-amount-2))

In this example, it would be a poor implementation to have each operation return a new matrix, as that would kill the CPU cache, not to mention the cost of memory allocation.

Perhaps the easy answer is to do something like we currently have for sub-views: on some implementations add, scale, etc. would return a lazy view whose contents are only calculated once an mget operation is called.

But it does sound like some people are talking of a more involved implementation that would allow for a more functional and less declarative approach. For instance, something as simple as map:

(defkernel scale [v amount]
  (* v amount))

(matrix-map #(scale % 0.5) my-matrix)

An implementation that supports something like that will lock us out of LLVM and the GPU, and keep us firmly in the JVM camp (for the most part), as dealing with closures opens up a huge can of worms when it comes to generating code for a GPU, for instance.

Timothy



--
“One of the main causes of the fall of the Roman Empire was that–lacking zero–they had no way to indicate successful termination of their C programs.”
(Robert Firth)

Mike Anderson

Jan 31, 2013, 8:21:50 PM
to numerica...@googlegroups.com, Konrad Hinsen
On Friday, 1 February 2013 02:23:34 UTC+8, Timothy Baldridge wrote:
> At this point I'm wondering if we should define what the end goal of this project is. Thus far we've basically said "An abstract API for performing matrix operations". But what does that actually mean?

I've tried my best to articulate it in the README / Wiki - improvement patches welcome!

core.matrix needs to be more than just an API. In addition it needs:
1. Default implementations that are good enough to do meaningful numerical computing in Clojure
2. A plug-in system that supports other matrix implementations under the same API (an SPI, if you like)

1) makes core.matrix useful in its own right without extra dependencies and gives us something comparable to NumPy 
2) makes it extensible, so we can support an ecosystem of implementations (including many already very good implementations available from the Java world) and do more advanced stuff (e.g. the plan-based workflows)

A plan-based workflow implementation doesn't necessarily need to be part of core.matrix itself, but it should absolutely be compatible with the same API / SPI if at all possible.



> When I see many of the operations currently in the API, I start thinking of my experience working with video/audio processing. A stereo audio stream is really a matrix of doubles [channels x number of samples]. A video frame is a matrix of [width x height x elements per pixel].

> Many of the operations we're looking at in this library apply quite cleanly to these sorts of data. For instance, mixing two audio streams:

> (add (scale audio-stream1 mix-amount-1)
>      (scale audio-stream2 mix-amount-2))

> In this example, it would be a poor implementation to have each operation return a new matrix, as that would kill the CPU cache, not to mention the cost of memory allocation.

You always have the in-place operations: scale!, add! and suchlike are designed for efficient in-place updates of this type.

I think we also need an "add-multiple!" function or something similar - multiply-and-add is a sufficiently common case that I think it deserves its own API function, and it is important because it can often be optimised by the underlying implementation.
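
A naive reference sketch in terms of the existing mget/mset! accessors (add-multiple! is a proposed name, not a current function; a real implementation would map this onto FMA instructions or a BLAS call):

  (defn add-multiple!
    "dest <- dest + (a * b), element-wise, updated in place."
    [dest a b]
    (dotimes [i (row-count dest)]
      (dotimes [j (column-count dest)]
        (mset! dest i j
               (+ (mget dest i j)
                  (* (mget a i j) (mget b i j)))))))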
 

> Perhaps the easy answer is to do something like we currently have for sub-views: on some implementations add, scale, etc. would return a lazy view whose contents are only calculated once an mget operation is called.

> But it does sound like some people are talking of a more involved implementation that would allow for a more functional and less declarative approach. For instance, something as simple as map:

> (defkernel scale [v amount]
>   (* v amount))
>
> (matrix-map #(scale % 0.5) my-matrix)

> An implementation that supports something like that will lock us out of LLVM and the GPU, and keep us firmly in the JVM camp (for the most part), as dealing with closures opens up a huge can of worms when it comes to generating code for a GPU, for instance.

I think we need to avoid converting to closures for exactly this reason. I think we actually want some form of expression format that can be manipulated like an AST. My contrived version:

  (defexpression scale [v amount]
     (* v amount))    ;;; returns an expression AST object.

  (def scale-half (partial-expression [v] (scale v 0.5))) ;;; returns a new expression

  (apply-expression! scale-half my-matrix)

If we make the expression object a defrecord/deftype that implements IFn, then we could also do:

  (scale-half my-matrix)

How to execute the expression is then an implementation choice:
- A default implementation could turn the expression into a closure and use "emap" to map over all elements. Slower but general purpose.
- A GPU implementation could compile the expression into GPU code.
- An LLVM implementation could compile it down to native code and run that, exploiting SSE instructions etc.
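
As a sketch of that default fallback (the names here are invented; only emap is a real API function):

  ;; an expression is just data: an argument vector plus a body form
  (def scale-half-expr {:args '[v] :body '(* v 0.5)})

  (defn expr->fn
    "Generic fallback: compile the symbolic expression into an
    ordinary Clojure closure."
    [{:keys [args body]}]
    (eval (list 'fn args body)))

  ;; default execution path, mapping the closure over all elements:
  ;; (emap (expr->fn scale-half-expr) my-matrix)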