My
hypothesis is that avoiding temporaries helps most at sizes where,
without temporaries, everything stays in the L2 cache, but with
temporaries it doesn't.
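To make the idea concrete, here is a minimal sketch (in Python for illustration; the thread is about Julia) of the difference between naive evaluation, which materializes a full-size temporary, and a fused single-pass loop that never allocates one:

```python
def add3_with_temporaries(a, b, c):
    # Naive evaluation of a + b + c: the intermediate (a + b) is
    # materialized as a full-size temporary array, doubling the
    # working set and potentially pushing it out of L2.
    tmp = [x + y for x, y in zip(a, b)]
    return [t + z for t, z in zip(tmp, c)]

def add3_fused(a, b, c):
    # Fused evaluation: each output element is computed in one pass,
    # so no intermediate array is ever allocated.
    return [x + y + z for x, y, z in zip(a, b, c)]

a, b, c = [1, 2, 3], [4, 5, 6], [7, 8, 9]
print(add3_with_temporaries(a, b, c))  # [12, 15, 18]
print(add3_fused(a, b, c))             # [12, 15, 18]
```

Both produce the same result; the difference is only in how much intermediate memory is touched, which is where the cache-size crossover in the hypothesis comes from.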
Interestingly, I'm working on this right now, but for reasons of "how to do
local operations on arrays too large to live in memory at once" rather than
performance on in-memory arrays.
The whole file is not yet ready for public consumption, but you can see all
operations referencing "BrickArray" in this file:
https://github.com/timholy/julia/blob/imagelib/extras/grid.jl
(methods may be a bit scattered)
BrickArray will interface nicely with the almost-ready-to-be-pulled support
for memory-mapped arrays:
https://github.com/JuliaLang/julia/pull/743#issuecomment-5265229
Best,
--Tim
The two really big issues that are "conceptual problems" for array
fusion (rather than complex but definitely doable issues) are:
1. Observability/evaluation-order changes: removing _named_
temporaries means that, when optimised, a future Julia couldn't "set a
breakpoint and then look at that array" (because the elements are
never all materialized at once). Personally, I'm happy to say that, as with
other optimisations, an optimised piece of code may no longer be
observable.
2. Detecting higher-level access patterns in code. You _could_ provide
new implementations of +, -, .*, ... that build an AST
rather than actually evaluating anything, but I'd prefer to avoid that if
possible, because it creates a "parallel" set of code that
needs to be kept in sync with the normal evaluation routines
(particularly given Julia's method dispatch), which is probably more
maintenance work than is feasible. That's why I'm thinking about whether
this info can be gathered by tracing, since then it's essentially
derived from the normal evaluation process. However, the task of
extracting the pattern from the trace may be too hard; I'm not sure yet.
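For what the "build an AST" alternative in point 2 would look like, here is a toy sketch (Python for illustration; the class and method names are hypothetical, not anything in Julia): arithmetic operators return expression nodes instead of results, and nothing is computed until the tree is explicitly evaluated.

```python
class Lazy:
    """Toy expression node: +, * build an AST rather than evaluating.
    A hypothetical sketch, not Julia's actual array types."""

    def __init__(self, data=None, op=None, args=()):
        self.data, self.op, self.args = data, op, args

    def __add__(self, other):
        return Lazy(op='+', args=(self, other))   # record, don't compute

    def __mul__(self, other):
        return Lazy(op='*', args=(self, other))   # record, don't compute

    def evaluate(self):
        if self.op is None:               # leaf: concrete data
            return self.data
        lhs = self.args[0].evaluate()
        rhs = self.args[1].evaluate()
        f = {'+': lambda x, y: x + y, '*': lambda x, y: x * y}[self.op]
        # This toy walks the tree node by node; a real implementation
        # would inspect the whole AST and emit a single fused loop.
        return [f(x, y) for x, y in zip(lhs, rhs)]

expr = Lazy([1, 2]) + Lazy([3, 4]) * Lazy([5, 6])
print(expr.evaluate())  # [16, 26]
```

The maintenance objection above is visible even in this toy: every operator and every dispatch rule needs a second, AST-building definition that must track the normal one.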
But these are issues specific to the array case, while you were
asking about more general optimisation questions.
This is the sort of optimization that I think the CUDA library I want needs to do to build up the kernels that are passed to the GPU card. I may have a chance this week to mock up the design with a Julia backend. The main idea is that execution does not occur until the assignment is done (and I think I want to try delaying execution across multiple assignments until the data is requested from the GPU). There would be some sort of barrier to force execution, and some way to force the barrier when you are using the command line. This is a common technique in C++ (expression templates), and Julia is MUCH more suited to this style of programming.
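A rough sketch of that delayed-execution idea (Python for illustration; the class and method names are hypothetical stand-ins, not a real CUDA binding): operations queue work instead of running it, and a fetch acts as the barrier that forces everything queued so far.

```python
class DeviceArray:
    """Sketch of delayed execution: operations queue kernels; nothing
    runs until the data is requested (fetch() plays the barrier role).
    Hypothetical names, not an actual GPU API."""

    def __init__(self, data):
        self._data = list(data)
        self._pending = []            # queued elementwise "kernels"

    def __add__(self, other):
        out = DeviceArray(self._data)
        # Queue the kernel instead of executing it, inheriting any
        # work already queued on this array.
        out._pending = self._pending + [('add', other)]
        return out

    def fetch(self):
        # Barrier: force all queued kernels, in order, then return data.
        data = list(self._data)
        for op, other in self._pending:
            if op == 'add':
                data = [x + y for x, y in zip(data, other._resolved())]
        self._pending = []
        return data

    def _resolved(self):
        return self.fetch() if self._pending else self._data

a = DeviceArray([1, 2])
b = DeviceArray([10, 20])
c = a + b          # nothing executes here, just queuing
print(c.fetch())   # [11, 22] -- the fetch forces execution
```

In a real GPU backend the queued operations would be inspected at the barrier and compiled into a single fused kernel, rather than replayed one by one as this toy does.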