observations about array code


david tweed

Apr 21, 2012, 8:23:57 PM
to julia-dev
Hi, these observations may, or may not, be of interest.

In looking at whether temporary removal is worthwhile, I've started
by seeing how much can be done at the pure Julia level. Ignoring for
the moment how you'd extract it automatically, I've been comparing
timings for a function that does multiple array operations directly
(i.e., using +, .*, etc.) vs a specialised function built using
Julia's metaprogramming. As a simple warm-up example, consider
f(a,b,c)=(a.*b)+b.*(c-c') for square 2-D matrices generated by randn()
(i.e. Array{Float64,2}s). The "temporary-avoiding" function is
constructed by inlining both an argument list and an inner loop body
into two nested loops over the array dimensions (call the outer loop
variable out, the inner loop variable in). Timing anything that
involves lazy initialisation (LLVM, the Julia runtime) is obviously
non-trivial, and things will depend on relative CPU/memory speed, so
the timings used in the following are "best effort". Anyway, the
observations:

1. The initial inlining approach using "result[out,in] = <big
expression>" was about 8 times slower than the vectorised version.

2. Manually CSE-ing the repeated array indices in <big expression>
helps a bit and brings it down to about 6 times slower.

3. Realising how the array is stored in memory and doing things in
"result[in,out]" order brings things down to about 2 times slower.
(This is reasonably intuitive: once a cache line is read in, the
whole of it probably gets used before eviction.)

4. Manually converting the 2-D indexing into 1-D indexing, with the
array's first dimension stored in a local variable (rather than the
size(A,1) call used to go from 2-D to 1-D indexes in array.jl), makes
it about twice as fast as the matrix version. (You obviously wouldn't
want to do this when manually writing code, but I hope to figure out
some way to mechanically generate the expressions, so it's not really
a problem in that case.) Maybe if the inliner improves in future this
won't make such a performance difference. A sketch of the resulting
hand-fused loop is below.
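
For concreteness, this is roughly what the loop from step 4 boils down
to when written out by hand (a sketch only, assuming square,
column-major Float64 matrices; the actual generated version appears
later in this thread):

function fused_f(a, b, c)
    result = similar(a)
    D1 = size(a, 1)
    D2 = size(a, 2)
    for j = 1:D2
        dj = (j - 1) * D1            # linear offset of column j, hoisted out of the inner loop
        for i = 1:D1
            k = i + dj               # linear index of element (i, j)
            x = a[k]; y = b[k]; z = c[k]
            w = c[j + (i - 1) * D1]  # element (j, i), i.e. c'[i, j]
            result[k] = x * y + y * (z - w)
        end
    end
    return result
end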

These are the things that you'd do if you were writing C code (if
your matrices were stored in Fortran order). So it looks like
explicit-loop array code at the Julia level can actually be
competitive with carefully tuned array assembly libraries if you do
some manual optimisation when writing the Julia code. When I think
about it, that's an amazing testament to the Julia runtime (and LLVM
of course).

Hope this was useful,
Dave Tweed

Krzysztof Kamieniecki

Apr 22, 2012, 3:34:33 AM
to juli...@googlegroups.com
Hi David,

Do you have some benchmark/example code for this that you could share?

Thanks,
Krys

Elliot Saba

Apr 22, 2012, 3:34:12 AM
to juli...@googlegroups.com
May I ask how big the arrays you were testing on were?  Because I wager that these speedup factors are definitely a function of the size of the arrays.  (Square array copying being an O(n^2) operation, as opposed to the speedup you get by storing an indexing variable being somewhere around O(n), I'd think).

-E

david tweed

Apr 22, 2012, 5:37:37 AM
to julia-dev

On Apr 22, 8:34 am, Elliot Saba <staticfl...@gmail.com> wrote:
> May I ask how big the arrays you were testing on were?  Because I wager
> that these speedup factors are definitely a function of the size of the
> arrays.  (Square array copying being an O(n^2) operation, as opposed to the
> speedup you get by storing an indexing variable being somewhere around
> O(n), I'd think).

There are two things, obviously: the various indexing overheads look
to be "constant but incurred on each element lookup", so they're
actually O(elements)=O(N^2), while the savings from not creating
temporary matrices are more complicated and depend on how things sit
in the cache hierarchy. (The interest here is vectorised code (the
Matlab style of doing things), so big-ish arrays are what's expected.)
Starting at 200x200 (the smallest size I tried), where avoiding copies
is over twice as fast, it slowly drops (with some odd jaggedness)
until at 3000x3000 it's about the same as the vectorised computation
_in this benchmark_. My hypothesis is that avoiding temporaries helps
most at sizes where, without temporaries, things stay in the L2 cache
but with temporaries they don't.

Here's the library function
-------------

## build a 2-D-array-output evaluation function from a symbolic expression
function makeFn2D(ivars, assignExpr)
    eval(quote
        local _F_
        function _F_($(ivars...))
            result = similar($ivars[1])
            const D1 = size($ivars[1], 1)
            const D2 = size($ivars[1], 2)
            ctr = 0
            for j = 1:D2
                for i = 1:D1
                    $assignExpr
                end
            end
            return result
        end
        return _F_
    end)
end

function evalAtSizes(f, lo, step, nStep)
    for i = 0:nStep
        sz = lo + i*step
        aa = randn(sz, sz)
        bb = randn(sz, sz)
        cc = randn(sz, sz)
        f(aa, bb, cc)   # warm-up call so compilation isn't included in the timings
        startT = time()
        f(aa, bb, cc)
        endT = time()
        f(aa, bb, cc)
        endT2 = time()
        f(aa, bb, cc)
        endT3 = time()
        println(sz, " ", min(endT-startT, endT2-endT, endT3-endT2)/(sz*sz))
    end
end

-----------

and then test with

ss = makeFn2D({:(a::Array{Float64,2}), :(b::Array{Float64,2}), :(c::Array{Float64,2})},
              :(DJ=(j-1)*D1; x=a[i+DJ]; y=b[i+DJ]; z=c[i+DJ]; w=c[j+(i-1)*D1]; result[i+DJ]=(x * y)+y*(z - w)))

gg(a,b,c)=(a .* b)+b.*(c - c')

and then print out the time per output element with

evalAtSizes(ss,200,200,15)

Now I think about it, given that the use case I'm interested in is
repeatedly rerunning the same code while tweaking the code that
produces the inputs, perhaps I ought to generate new random matrices
before each run rather than just for each size...

Any other comments, or pointing out of issues in this code,
gratefully accepted.
Dave Tweed

Patrick O'Leary

Apr 22, 2012, 8:46:44 AM
to juli...@googlegroups.com
On Sunday, April 22, 2012 4:37:37 AM UTC-5, david tweed wrote:
My
hypothesis is that avoiding temporaries helps most at sizes where,
without temporaries things stay in the L2 cache but with temporaries
they don't.

One of the tricks that numexpr does when rewriting is to chunk large matrices to maintain cache locality. There was a good talk at PyData 2012 which you can watch online that goes into some detail on the technique.
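
(As a rough illustration of the chunking idea only, not numexpr's
actual implementation, and using a simplified elementwise expression
without the transpose: the pipeline of vectorised operations is run
one cache-sized block of elements at a time, so any temporaries are
only chunk-sized.)

function chunked_f(a, b, c, blocklen)
    result = similar(a)
    n = length(a)
    lo = 1
    while lo <= n
        hi = min(lo + blocklen - 1, n)
        # each vectorised op below only allocates a chunk-sized temporary,
        # so the working set of the whole pipeline stays cache-resident
        t1 = a[lo:hi] .* b[lo:hi]
        t2 = b[lo:hi] .* c[lo:hi]
        result[lo:hi] = t1 + t2
        lo = hi + 1
    end
    return result
end

Here blocklen would be tuned so that the chunks of all the operands
fit comfortably in cache.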

david tweed

Apr 22, 2012, 9:54:17 AM
to julia-dev
On Apr 22, 1:46 pm, Patrick O'Leary <patrick.ole...@gmail.com> wrote:
> On Sunday, April 22, 2012 4:37:37 AM UTC-5, david tweed wrote:
>
> > My
> > hypothesis is that avoiding temporaries helps most at sizes where,
> > without temporaries things stay in the L2 cache but with temporaries
> > they don't.
>
> One of the tricks that numexpr <https://code.google.com/p/numexpr/> does
> when rewriting is to chunk large matrices to maintain cache locality. There
> was a good talk at PyData 2012 which you can watch online
> <http://pyvideo.org/video/962/boosting-numpy-with-numexpr-and-cython>
> that goes into some detail on the technique.

Indeed. For a long while languages like SAC and Chapel have gone to
great lengths behind the scenes to access arrays in tiles, and even
to lay out vector and array data accordingly, to maximise this
effect. Unfortunately, it's clear that almost all users want
Matlab-style languages rather than that sort, so (time permitting)
the vague goal is to try to implement in Julia those of their
techniques with the greatest speedup-to-implementation-effort ratio.
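
(As a minimal sketch of what that kind of tiled traversal looks like
when written by hand for the transpose-containing kernel above,
assuming square matrices and an illustrative tile size T:)

function tiled_f(a, b, c, T)
    result = similar(a)
    D1 = size(a, 1)
    D2 = size(a, 2)
    for jt = 1:T:D2
        for it = 1:T:D1
            # within one TxT tile, both the c[i,j] and the transposed c[j,i]
            # accesses touch only two cache-sized blocks of c
            for j = jt:min(jt + T - 1, D2)
                for i = it:min(it + T - 1, D1)
                    result[i, j] = a[i, j]*b[i, j] + b[i, j]*(c[i, j] - c[j, i])
                end
            end
        end
    end
    return result
end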

cheers,
Dave Tweed

Patrick O'Leary

Apr 22, 2012, 10:28:53 AM
to juli...@googlegroups.com

Right, I didn't see that technique mentioned yet, and it seems like it might be a good optimization for the scaling limitation you indicated.

I'm curious what would be required to add optimization passes before the code hits LLVM. Assuming some of these experiments prove fruitful, it would be nice to build them in. It seems like automatically performing stream fusion and chunking array operations would require an IR at a higher level than LLVM assembly, but I'm not familiar with how Julia's code generation works.

Tim Holy

Apr 22, 2012, 10:31:55 AM
to juli...@googlegroups.com
On Sunday, April 22, 2012 05:46:44 AM Patrick O'Leary wrote:
> One of the tricks that numexpr <https://code.google.com/p/numexpr/> does
> when rewriting is to chunk large matrices to maintain cache locality. There
> was a good talk at PyData 2012 which you can watch online
> <http://pyvideo.org/video/962/boosting-numpy-with-numexpr-and-cython>
> that goes into some detail on the technique.

Interestingly, I'm working on this right now, but for reasons of "how to do
local operations on arrays too large to live in memory at once" rather than
performance on in-memory arrays.

The whole file is not yet ready for public consumption, but you can see all
operations referencing "BrickArray" in this file:
https://github.com/timholy/julia/blob/imagelib/extras/grid.jl
(methods may be a bit scattered)

BrickArray will interface nicely with the almost-ready-to-be-pulled support
for memory-mapped arrays:
https://github.com/JuliaLang/julia/pull/743#issuecomment-5265229

Best,
--Tim

david tweed

Apr 22, 2012, 11:11:45 AM
to julia-dev
The two really big issues that are "conceptual problems" for array
fusion (rather than complex but definitely doable issues) are:

1. Observability/evaluation order changes: removing _named_
temporaries means that, when optimised, a future Julia couldn't "set a
breakpoint and then look at that array" (because the elements are
never all there at any one time). Personally I'm happy to say that, as
with other optimisations, an optimised piece of code may no longer be
observable.

2. Detecting higher-level access patterns in code. You _could_ provide
new implementations of +, -, .*, ... that "build an AST" rather than
actually evaluating things, but I'd prefer to avoid that if possible.
This is because it provides a "parallel" set of code that needs to be
kept in sync with the normal evaluation routines (particularly given
Julia's method dispatch), which is probably more maintenance work than
is feasible. That's why I'm thinking about whether this info can be
gathered by tracing, since then it's essentially derived from the
normal evaluation process. However, the task of extracting the pattern
from the trace may be too hard, not sure yet.

But these are issues specific to the array case, while you were
asking about more general optimisation issues.

Cheers,
Dave Tweed

Krzysztof Kamieniecki

Apr 22, 2012, 11:18:58 AM
to juli...@googlegroups.com
This is the sort of optimization that I think the CUDA library I want needs to do to build up the kernels that are passed to the GPU card. I may have a chance this week to mock up the design with a Julia backend. The main idea is that execution does not occur until the assignment is done (and I think I want to try delaying execution of multiple assignments until the data is requested from the GPU). There would be some sort of barrier to force execution, and some way to force the barrier when you are using the command line. This is a common technique in C++, and Julia is MUCH more suited to this type of programming.

Stefan Karpinski

Apr 22, 2012, 5:47:37 PM
to juli...@googlegroups.com
These observations are fascinating. They show that there's still a lot of performance to be gained from automatically doing the kinds of tmp-elimination and inlining that you do by hand here. As you note, on the up side, you can already get big performance gains by doing it by hand, which means it's a) possible in Julia without resorting to writing C code, and b) potentially possible to achieve automatically, given sufficiently good optimization. Now the trick is just figuring out how to get there.

On Sat, Apr 21, 2012 at 5:23 PM, david tweed <david...@gmail.com> wrote:

Stefan Karpinski

Apr 22, 2012, 5:53:38 PM
to juli...@googlegroups.com
This is the kind of thing that might be possible to achieve at the LLVM optimization level using polyhedral optimization, as implemented by the Polly project. Polly is still very much a research project, which is why we haven't gone full steam ahead with integration, but it might be a really fruitful project to try to use Polly optimization passes in Julia's code gen. It may also enable auto-parallelization of loop kernels.

Stefan Karpinski

Apr 22, 2012, 5:58:57 PM
to juli...@googlegroups.com
On Sun, Apr 22, 2012 at 8:11 AM, david tweed <david...@gmail.com> wrote:

The two really big issues that are "conceptual problems" for array
fusion (rather than complex but definitely doable issues) are:

1. Observability/evaluation order changes: removing _named_
temporaries means that, when optimised, a future Julia couldn't "set a
breakpoint and then look at that array" (because the elements are
never all there at any one time). Personally I'm happy to say that, as
with other optimisations, an optimised piece of code may no longer be
observable.

Having optimized code be "unobservable" in this sense is completely ok. It's already the case since code gen for scalars can eliminate temporaries and do other transformations with the same sorts of effects. This is doing the same sort of thing at a higher level for arrays.
 
2. Detecting higher-level access patterns in code. You _could_ provide
new implementations of +, -, .*, ... that "build an AST" rather than
actually evaluating things, but I'd prefer to avoid that if possible.
This is because it provides a "parallel" set of code that needs to be
kept in sync with the normal evaluation routines (particularly given
Julia's method dispatch), which is probably more maintenance work than
is feasible. That's why I'm thinking about whether this info can be
gathered by tracing, since then it's essentially derived from the
normal evaluation process. However, the task of extracting the pattern
from the trace may be too hard, not sure yet.

But these are issues specific to the array case, while you were
asking about more general optimisation issues.

I agree that doing something like this in parallel to the normal evaluation is bound to be a maintenance nightmare. I'm still not sure about tracing, but whatever works and gives performance is welcome. I think stream fusion might be a better approach, but I just don't really understand the technique well enough yet. Guess I should do some reading up on stream fusion.

Stefan Karpinski

Apr 22, 2012, 6:03:03 PM
to juli...@googlegroups.com
On Sun, Apr 22, 2012 at 8:18 AM, Krzysztof Kamieniecki <kr...@kamieniecki.com> wrote:
This is the sort of optimization that I think the CUDA library I want needs to do to build up the kernels that are passed to the GPU card. I may have a chance this week to mock up the design with a Julia backend. The main idea is that execution does not occur until the assignment is done (and I think I want to try delaying execution of multiple assignments until the data is requested from the GPU). There would be some sort of barrier to force execution, and some way to force the barrier when you are using the command line. This is a common technique in C++, and Julia is MUCH more suited to this type of programming.

I've thought about barrier-and-force designs like this before, in the context of doing array operations lazily: essentially constructing an object that represents a collection of array operations and then implicitly calling a "materialize" function on that representation at the expression level, which then figures out how to do the full computation without temporaries. The repl could just automatically call materialize on each input expression, keeping things working the way you expect, while actual code could potentially be more aggressive, only calling materialize at points where it's deemed necessary by some criterion (e.g. before a loop). I've hesitated to do anything like this because it's uncomfortably magical and hidden from the user. We don't currently do anything this magical ever.
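
(A stripped-down sketch of that lazy/materialize idea follows; the
names are hypothetical, not an existing API, and a real design would
presumably overload +, .*, etc. on a wrapper type rather than use
closures.)

lazy(a)       = k -> a[k]            # wrap a concrete array as an index -> value function
lazyadd(f, g) = k -> f(k) + g(k)     # elementwise +, deferred
lazymul(f, g) = k -> f(k) * g(k)     # elementwise .*, deferred

function materialize(expr, proto)
    out = similar(proto)
    for k = 1:length(proto)
        out[k] = expr(k)             # one fused pass; no full-size temporaries
    end
    return out
end

# e.g. (a .* b) + (b .* c), only evaluated when materialize is called:
# r = materialize(lazyadd(lazymul(lazy(a), lazy(b)), lazymul(lazy(b), lazy(c))), a)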