Krys,
I actually explored this approach initially. The problem is that it introduces considerable run-time overhead.
In your package (DeMat), the delayed expression is actually built at run time:
an array is first wrapped in a DeVec or DeMat, then an operator (say +) applied to a DeVec produces a wrapped DeBinOp, and so on. If you write a complex expression, such as
sin(a + cos(b .* exp(c + d + e .* log(f + g))))
Building up these wrappers already incurs run-time overhead, which is noticeable when a and b are not very large. Then, at run time, the kernel calls through multiple levels of de_jl_do_op, which adds further overhead.
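To make the two costs above concrete, here is a minimal sketch of a delayed-expression scheme. It is written in current Julia syntax, and the names LazyExpr, LazyVec, LazyBinOp, value_at and evaluate are hypothetical stand-ins for illustration, not DeMat's actual types or functions. For simplicity, * stands for an element-wise product here.

```julia
# Hypothetical stand-ins for a delayed-expression package; not DeMat's real API.
abstract type LazyExpr end

struct LazyVec <: LazyExpr
    data::Vector{Float64}
end

struct LazyBinOp{F,L<:LazyExpr,R<:LazyExpr} <: LazyExpr
    op::F
    lhs::L
    rhs::R
end

# Cost 1: every operator call builds a new wrapper node at run time.
Base.:+(x::LazyExpr, y::LazyExpr) = LazyBinOp(+, x, y)
Base.:*(x::LazyExpr, y::LazyExpr) = LazyBinOp(*, x, y)   # element-wise in this sketch

# Cost 2: reading one element goes through one call per level of the tree.
value_at(x::LazyVec, i::Int) = x.data[i]
value_at(x::LazyBinOp, i::Int) = x.op(value_at(x.lhs, i), value_at(x.rhs, i))

function evaluate(x::LazyExpr, n::Int)
    r = Vector{Float64}(undef, n)
    for i in 1:n
        r[i] = value_at(x, i)
    end
    return r
end

a = LazyVec([1.0, 2.0, 3.0])
b = LazyVec([4.0, 5.0, 6.0])
c = LazyVec([7.0, 8.0, 9.0])

tree = a + b * c              # only builds the wrapper tree; no arithmetic yet
result = evaluate(tree, 3)    # arithmetic happens here, element by element
```

Whether the nested value_at calls get inlined is up to the compiler, which is exactly the uncertainty described above.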
There would be no problem if we were using C++ (with -O3): even ten levels of indirection would typically be inlined away completely. But in Julia, it seems that things are not inlined as aggressively, and I have had to put a lot of effort into avoiding indirection.
Here is just one example that I tried when comparing the run-time performance of different approaches:
type Wrapper{T<:Real}
    data::Array{T}
end
a = rand(10000)
w = Wrapper(a)
get_value(a::Array, i::Int) = a[i]
get_value(a::Wrapper{Float64}, i::Int) = a.data[i]
a[i] ------- (1)
get_value(a, i) ------- (2)
w.data[i] ------- (3)
get_value(w, i) ------- (4)
Using (4) in the kernel is sometimes (not always; it depends on what the whole expression looks like) two orders of magnitude slower than using (1), (2), or (3).
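For anyone who wants to repeat this comparison, a minimal timing sketch follows. It uses current Julia syntax (struct instead of type), and the kernel names sum_direct, sum_get_value and sum_field are hypothetical helpers introduced only for this illustration. The timings depend on the Julia version and the machine, so none are quoted here.

```julia
# Same definitions as above, in current Julia syntax.
struct Wrapper{T<:Real}
    data::Array{T}
end

get_value(a::Array, i::Int) = a[i]
get_value(a::Wrapper{Float64}, i::Int) = a.data[i]

# Hypothetical kernels: each sums the first n elements using one access form.
function sum_direct(a::Array, n::Int)       # form (1)
    s = 0.0
    for i in 1:n
        s += a[i]
    end
    return s
end

function sum_get_value(x, n::Int)           # form (2) for an Array, form (4) for a Wrapper
    s = 0.0
    for i in 1:n
        s += get_value(x, i)
    end
    return s
end

function sum_field(w::Wrapper, n::Int)      # form (3)
    s = 0.0
    for i in 1:n
        s += w.data[i]
    end
    return s
end

a = rand(10000)
w = Wrapper(a)
n = length(a)

# Warm-up calls, so that compilation time is not included in the measurements.
sum_direct(a, n); sum_get_value(a, n); sum_field(w, n); sum_get_value(w, n)

t1 = @elapsed sum_direct(a, n)
t2 = @elapsed sum_get_value(a, n)
t3 = @elapsed sum_field(w, n)
t4 = @elapsed sum_get_value(w, n)

println("(1) ", t1, "  (2) ", t2, "  (3) ", t3, "  (4) ", t4)
```

A single @elapsed call is noisy for loops this short; repeating each measurement and taking the minimum gives a steadier picture.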
I have experienced performance hits with all sorts of indirect constructs. So, after trying a bunch of other things, I decided to generate the most direct code (without any wrapper types): code that looks nearly the same as what you would write by hand as a simple for-loop.
In this way, the @devec version keeps run-time overhead negligible even when length(a) is as small as 10, so it can be used everywhere (for both small and large matrices).
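As an illustration of what "direct code" means here, the sketch below writes out by hand the kind of loop one would expect for r = a + b .* exp(c). This is an assumption about the general shape only, not the actual output of @devec, and the function name direct_loop! is hypothetical.

```julia
# Hand-written loop for r = a + b .* exp(c): no wrapper types, no intermediate
# arrays, and one pass over the data. Assumes all four vectors have equal length.
function direct_loop!(r::Vector{Float64}, a::Vector{Float64},
                      b::Vector{Float64}, c::Vector{Float64})
    n = length(a)
    for i in 1:n
        r[i] = a[i] + b[i] * exp(c[i])
    end
    return r
end

a = rand(10)
b = rand(10)
c = rand(10)
r = direct_loop!(zeros(10), a, b, c)
```

Because the loop body touches the arrays directly, there is nothing for the compiler to inline, which is why the overhead does not depend on the array length.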
For DeExpr (which will later be renamed to Devectorize), minimal overhead is a very important design goal. I don't want to put a notice saying that your array has to contain at least 5000 elements to get benefits from @devec.
Also, as I mentioned, the current approach does not actually introduce much compilation delay.