DeExpr package

Dahua

Feb 4, 2013, 3:23:59 PM
to juli...@googlegroups.com
The first version of DeExpr is now working. Check it out at https://github.com/lindahua/DeExpr.jl

Or simply type Pkg.add("DeExpr").

Basic use is very simple: just add the @devec macro in front of your expression, as in
@devec r = exp(a) + sin(b .* c) - sum(a + b)

It will automatically translate your expression into de-vectorized loops -- temporaries are eliminated and the results are computed in a single pass. My benchmarks show that this can yield a 2x - 8x performance gain in typical cases.
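
To make this concrete, here is a rough hand-written sketch of the kind of loops such an expression de-vectorizes into (illustrative only, not the macro's literal expansion; the reduction gets its own pass, and the element-wise part runs in a single pass with no temporary arrays):

s = 0.0
for i in 1 : length(a)        # reduction pass for sum(a + b)
    s += a[i] + b[i]
end
r = similar(a)
for i in 1 : length(a)        # element-wise pass, no temporaries
    r[i] = exp(a[i]) + sin(b[i] * c[i]) - s
end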

Currently, DeExpr supports a variety of expressions, including element-wise arithmetic, math functions, references (e.g. a[:,j]), reduction, colwise/rowwise reduction, blocks, and hybrid expressions.

Advanced users may use the tools provided in the package to analyze and process expressions and to generate code in a different way.

Refer to the project page ( https://github.com/lindahua/DeExpr.jl ) for more documentation and detailed benchmark results. Feedback is very much appreciated.

Stefan Karpinski

Feb 4, 2013, 3:29:57 PM
to Julia Dev
This is very exciting!

Viral Shah

Feb 4, 2013, 5:33:25 PM
to juli...@googlegroups.com
I have been wanting this for the longest time. This is indeed really exciting!

-viral

Stefan Karpinski

Feb 4, 2013, 5:35:22 PM
to Julia Dev
Regarding naming, might this package be better named Devectorize.jl?

Patrick O'Leary

Feb 4, 2013, 5:37:54 PM
to juli...@googlegroups.com
This does bear a resemblance to Python's NumExpr, which performs similar sorts of optimizations with an eye towards cache friendliness. Is that coincidental?

Dahua

Feb 4, 2013, 5:47:42 PM
to juli...@googlegroups.com
They aim to achieve similar goals, but the approaches are different.

NumExpr in Python breaks the computation into smaller chunks (for better cache performance), while DeExpr breaks it into scalar kernels. Also, the @devec macro is IMO more elegant than NumExpr's functions, which take strings as arguments.

All of this is made possible by Julia's strengths: a powerful meta-programming system and fast loops.

Dahua

Feb 4, 2013, 5:51:03 PM
to juli...@googlegroups.com
I am open to renaming the package, but will wait to see people's opinions.

"Devectorize" sounds a little bit too long for me as a package name -- what about DeVec (or DeVect) ?




Patrick O'Leary

Feb 4, 2013, 5:51:25 PM
to juli...@googlegroups.com
I definitely agree that this is a much friendlier interface for the programmer.

Stefan Karpinski

Feb 4, 2013, 5:56:39 PM
to Julia Dev
You only load the package once. After that, all you use is the @devec macro, which can keep the same name either way. Think about the readability of programs:

using DeExpr
@devec r = exp(a) + sin(b.*c) - sum(a+b)

versus

using Devectorize
@devec r = exp(a) + sin(b.*c) - sum(a+b)

In the former, I don't really have any idea what might be happening; in the latter I can already tell what's going on without having to look anything up.

Dahua

Feb 4, 2013, 6:43:58 PM
to juli...@googlegroups.com
This makes sense.

I will wait until tonight to see if there are other suggestions, and try to make the changes tomorrow.

Viral Shah

Feb 4, 2013, 7:05:42 PM
to juli...@googlegroups.com
I prefer the Devectorize package name too. Your README indicates that you may do some fancier things, and I look forward to those -- but until then, Devectorize would be the clearer name.

-viral

Krzysztof Kamieniecki

Feb 4, 2013, 10:44:05 PM
to juli...@googlegroups.com
Awesome! Two questions:
1. Does it use Jeff's JIT caching "trick" to prevent having to reprocess expressions?
2. How would you want to attach a GPU backend?



Stefan Karpinski

Feb 4, 2013, 10:52:09 PM
to Julia Dev
It would be lovely if we could factor out the common infrastructure needed for devectorization, symbolic manipulations (e.g. differentiation), generating GPU-targeted code, loop unrolling, and such.

Krzysztof Kamieniecki

Feb 4, 2013, 11:06:16 PM
to juli...@googlegroups.com
I agree! 

I think generating dynamic link libraries is also important; being able to write code that can ship in products would spur some people (at least me :) ) to jump ship from MATLAB-to-C++ translation and focus on improving Julia performance.

Dahua

Feb 4, 2013, 11:30:26 PM
to juli...@googlegroups.com
I also feel this is very important. 

In particular, I think Julia needs a common framework to support meta-programming, which can serve as the basis for all the tasks you mentioned. Currently, all we have is Expr, which carries little semantic information for useful analysis.

As a workaround, when I developed this library, I had to create a set of "meta-functions" to "memorize" the traits of operators and math functions, which restricts the use of @devec to the things those trait functions recognize.

It would be great if Julia could have facilities like C++'s std::result_of (at least for a subset of commonly used functions).

It would also be great to provide facilities for library authors and even end users to "register" their functions as element-wise functions (like Python's ufuncs).
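
A minimal sketch of what such a registration facility might look like (the names and API here are hypothetical, not something DeExpr actually provides):

ewise_funcs = Symbol[:exp, :sin, :sqrt]              # hypothetical registry of element-wise functions
register_ewise(f::Symbol) = push!(ewise_funcs, f)    # e.g. register_ewise(:sigmoid)
is_ewise(f::Symbol) = any(g -> g == f, ewise_funcs)  # the macro could consult this during expansion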

Dahua

Feb 4, 2013, 11:58:57 PM
to juli...@googlegroups.com
See answers below


On Monday, February 4, 2013 9:44:05 PM UTC-6, Krzysztof Kamieniecki wrote:
Awesome! Two questions:
1. Does it use Jeff's JIT caching "trick" to prevent having to reprocess expressions?

All expressions marked with @devec are analyzed and directly translated into for loops. Currently, the package does not explicitly exploit the expression caching capability.

To make such caching happen, some intermediate wrappers might have to be introduced --- which can incur noticeable run-time overhead (e.g. some functions that work on those wrappers may fail to be inlined). The entire development has been primarily driven by runtime benchmarks --- if you look at the commit log, I have made several substantial redesigns to reduce run-time overhead to a minimum.

That said, I have actually kept an eye on compilation time -- it is at a very reasonable level. The time taken to evaluate a @devec expression for the first time (which includes both compilation and evaluation) is less than evaluating it without @devec.

 
2. How would you want to attach a GPU backend?

The compile function takes a context argument and an expression argument. My thought was that if one supplies a CUDAContext, the call will be dispatched to a function that generates GPU code.
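
A hedged sketch of that dispatch (CUDAContext and the method bodies here are hypothetical; only the CPU path exists today):

abstract EvalContext
type ScalarContext <: EvalContext end   # the existing CPU context
type CUDAContext <: EvalContext end     # hypothetical GPU context

# Dispatch on the context type selects the code generator.
compile(ctx::ScalarContext, ex::Expr) = ex   # stand-in for the CPU loop generator
compile(ctx::CUDAContext, ex::Expr) = error("CUDA backend not implemented yet")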

However, the problem is more complicated than that. De-vectorizing into CPU for loops introduces very little (if any) overhead, so you can simply tag an expression with a macro and de-vectorize it. But writing GPU code is a lot more involved: transferring data back and forth between host and device for each expression is definitely not efficient and would introduce overwhelming overhead.

So people who want to port code to the GPU have to carefully plan the communication between GPUs and CPUs (e.g. when to transfer things to the GPU, when to fetch the results, whether to spawn async threads or make a blocking run). The ideal code in my mind would look something like this:

b0 = rand(2000, 2000)
@gpu a = rand(2000, 2000)   # creates a random matrix on GPU device
@gpu b = b0   # transfer b0 in CPU to b in GPU 
@gpu r = zeros(2000, 2000)

ker = @kernel begin
    ... some computation
end

evaluate(ker, a, b, r)  # allow optional arguments to specify #blocks & #threads

@cpu r2 = r   # triggers a transfer from gpu to cpu

Dahua

Feb 5, 2013, 12:03:35 AM
to juli...@googlegroups.com
I really hope to see Julia support SIMD vectors (as built-in types) and intrinsics --- both are supported by LLVM.

This could take the performance of de-vectorized code to the next level.

Dahua

Feb 5, 2013, 12:09:14 AM
to juli...@googlegroups.com
Agree.

In particular, I would like it to be able to generate MATLAB mex files (the shared objects / DLLs ending with *.mex???) --- which would make it much easier to interact with MATLAB.

I love writing Julia, but many colleagues at my institute like to use MATLAB. The ability to generate mex files from Julia might make my life (and perhaps others') easier.

Dahua

Feb 5, 2013, 10:47:46 AM
to juli...@googlegroups.com
Following the discussion yesterday, I am going to rename the package to Devectorize.

As DeExpr has already been registered in METADATA.jl, is there a way to make this change without messing things up?

John Myles White

Feb 5, 2013, 10:49:57 AM
to juli...@googlegroups.com
Stefan will correct me if I'm wrong, but I believe that there's no problem changing the name of something in METADATA.jl. The only quirk is that it may orphan some packages on people's systems, but those could be deleted manually without incident.

-- John

Dahua

Feb 5, 2013, 2:37:46 PM
to juli...@googlegroups.com
For your first question: I thought about it a bit more and found that it actually does trigger caching. The kernel composition functions look like:

function compose(ctx::ScalarContext, ex::SomeSpecificExprType)
    ....
end

So when this function is called on a certain kind of expression, the compiled method is cached. When another expression of the same kind is processed, the cached code is used to generate code for it.
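
The caching being relied on here is just Julia's normal behavior: a method is JIT-compiled once per concrete argument signature and reused afterwards. A minimal illustration:

f(x) = 2x + 1
@time f(1.0)   # first call with a Float64: includes JIT compilation
@time f(2.0)   # same signature: the compiled code is reused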

Krzysztof Kamieniecki

Feb 5, 2013, 4:06:43 PM
to juli...@googlegroups.com
Does this trigger the caching of Julia's JIT compiled version of the devectorized loop?

In DeMat, what happens is that a new assign is created along the lines of assign(lhs::LHSType, rhs::RHSType), where LHSType and RHSType encode the structure of the expression but not the scalar and vector values. The creation of the assign is done inside a generic assign, which should only be called once, when an expression appears for the first time.
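
For readers unfamiliar with DeMat, a rough sketch of that type-encoding idea (the type and field names are illustrative, not DeMat's actual definitions):

type DeBinOp{Op, L, R}   # Op, L, R encode the structure of the expression
    lhs::L
    rhs::R
end

# assign(lhs::LHSType, rhs::RHSType) is then generated and JIT-compiled
# once per structure; later expressions of the same shape reuse it.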

Dahua

Feb 5, 2013, 5:02:14 PM
to juli...@googlegroups.com
Krys,

I actually explored this approach initially.  The problem is that it introduces considerable run-time overhead.

In your package (DeMat), the delayed expression is actually built at run-time: first an array is wrapped into a DeVec or DeMat, then an operator (say +) on DeVec values produces a wrapped DeBinOp, and so on. If you write a complex expression, such as

sin(a + cos(b .* exp(c + d + e .* log(f + g))))

The initial process of building up such wrappers already incurs noticeable run-time overhead when a and b are not very large. Then at run time, the kernel calls through multiple levels of de_jl_do_op -- this also introduces overhead.

There would be no problem if we were using C++ (with -O3) -- even ten levels of indirection would be completely inlined and compressed into one instruction. But in Julia, things are not inlined as aggressively, and I had to make a lot of effort to avoid indirection.

Here is just one example that I tried when comparing the run-time performance of different approaches:

type Wrapper{T<:Real}
    data::Array{T}
end

a = rand(10000)
w = Wrapper(a)

get_value(a::Array, i::Int) = a[i]
get_value(a::Wrapper{Float64}, i::Int) = a.data[i]

a[i]              # ------- (1)
get_value(a, i)   # ------- (2)
w.data[i]         # ------- (3)
get_value(w, i)   # ------- (4)


Using (4) in the kernel is sometimes (not always, depending on what the whole expression looks like) two orders of magnitude slower than using (1), (2), or (3).
I have experienced performance hits with all sorts of indirect constructs. So, after trying a bunch of other things, I decided to generate the most direct code (without any wrapper types) -- code that looks nearly the same as what you would write when coding a simple for-loop.

In this way, the @devec version guarantees negligible run-time overhead even when length(a) is as small as 10 --- so it can be used everywhere (on both small and large matrices).

For DeExpr (which will later be renamed to Devectorize), minimal overhead is a very important design goal. I don't want to put up a notice saying that your array has to contain at least 5000 elements to benefit from @devec.

Also, as I mentioned, the current approach does not introduce much compilation delay.

Krzysztof Kamieniecki

Feb 5, 2013, 10:07:41 PM
to juli...@googlegroups.com
Dahua,

DeMat does not (or at least should not) generate multiple levels of de_jl_do_op; it should flatten the expression tree. de_jl_do_op is used to translate from the type that encodes the operation to the call to that op. The type and object of the RHS are assembled recursively, but the kernel should just pull the inputs out of that structure and run in a flat loop.

I did implement a Wrapper and, like you, I found the same problem; my workaround was for DeMat to copy a.data into a new variable (in the preamble of the kernel) and then access it directly within the loop body. That behavior mimics what is needed for my GPU backend.

I did try your complicated example, and I do see that DeMat runs in about 10 seconds vs. 9.5 for the Julia loop version, for 10 iterations on 10e6-sized arrays of Float32s.

Unfortunately, disassemble dumps the LLVM intermediate assembly before all the optimizations, so it's tough to see what the differences are.

I will ponder this...

Dahua

Feb 5, 2013, 10:32:14 PM
to juli...@googlegroups.com
Krys,

Thanks for clarifying your implementation.

It seems that your kernel runs as fast as a Julia loop, so the difference between 9.5s and 10s probably comes from the initial setup.

When running on large arrays with 1e6 elements, the overhead of the initial setup might not be that obvious. Perhaps you can try this setting: run your package on thousands or even millions of relatively short vectors (say len = 100); then you will see the cost of the initial setup more clearly.
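
For instance, something along these lines would expose the setup cost (a sketch only; the sizes and iteration counts are arbitrary):

using DeExpr   # soon to be Devectorize

a = rand(100); b = rand(100)
@time for k in 1 : 100000      # many short expressions back to back
    @devec r = exp(a) + b      # per-expression setup cost dominates at this length
end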

In my research domain (machine learning and probabilistic inference), this is a very common setting: for example, each MCMC/Gibbs sampling step only updates a vector of length 20 - 200, and every step depends on the previous one, so the steps are nontrivial to parallelize --- which makes it crucial for them to run very fast.

Krzysztof Kamieniecki

Feb 6, 2013, 9:17:55 AM
to juli...@googlegroups.com
Dahua,

Something strange is definitely going on with DeMat; other, less complex expressions have run times that essentially match the Julia loop times. And if I run with shorter arrays (length 1000), the times are equivalent for your complex example. I wish it were not such a pain to get the final x86 disassembly of JIT-compiled functions. Maybe it's a memory access pattern issue.

In general, what I want out of DeMat (or any delayed-expression library and its host language) is compiled GPU kernels (that can also be compiled for the CPU, depending on the overall system's computing hardware) surrounded by compiled CPU code -- all written in one "prototyping to production" language. The idea is that once the JIT processes the code on the first iteration, it will not need to do anything for the million iterations that follow.

And if Devectorize/DeVec is that library, I would be happy :)

Krys

P.S. I want to use Julia to do signal and 2d/3d image processing (http://youtu.be/OvDP75oFGXM)

Dahua

Feb 6, 2013, 9:38:01 AM
to juli...@googlegroups.com
I am working on stabilizing the interfaces, and will look at the internal implementation again after that is done -- and explore again the idea of introducing a caching mechanism (the assign1 in DeMat) to see how it would help.

In terms of the GPU, this is a much more complicated project, which probably needs the joint effort of multiple people. It may be advisable to first set up a separate package that provides common infrastructure for CUDA -- delayed expressions can then be built on top of it.

The common infrastructure may include functions/types for device memory management, communication between device and host, kernel compilation and evaluation, etc.

Dahua

Feb 6, 2013, 11:52:10 PM
to juli...@googlegroups.com
The package has been renamed to "Devectorize".

I also expanded its capabilities: it now accepts fields and various ref-expressions:

@devec x = a[1:3, :] + b[2:4,:] + c["z"]
@devec y = a[:, u:v] + x.y.z
@devec s.r = a.x + b.y + c.z

Dahua

Feb 7, 2013, 4:24:54 PM
to juli...@googlegroups.com
The link in "Available package doc" still points to the old DeExpr; please update it.

Viral Shah

Feb 10, 2013, 3:40:07 PM
to juli...@googlegroups.com
This will happen at the next refresh of the packages page. It would be nice to have something automated...

-viral