Re: [julia-dev] Big performance gap between vectorized codes and for-loops


John Myles White

unread,
Jan 28, 2013, 7:43:49 AM
to juli...@googlegroups.com
This is really something that needs improvement. I entirely agree that vectorized code is easier to read and write. Making it competitive with explicit loops would be a major gain for Julia's usability.

-- John

On Jan 27, 2013, at 11:41 PM, Dahua <lind...@gmail.com> wrote:

> I am starting to work on porting machine learning code to Julia.
>
> The first module that I created is Distances, which provides functions to compute various distances between vectors. (https://github.com/lindahua/Distances.jl)
>
> When I tried to test the best way to implement the distance functions, I did a simple benchmark and found a huge gap between vectorized and de-vectorized code. (See attached file)
>
> This is what I got by running the script,
>
> bench: sqeuc_raw_forloop
> elapsed time: 0.4254341125488281 seconds
>
> bench: sqeuc_sumsqr_vectorized
> elapsed time: 4.688630104064941 seconds
>
> bench: sqeuc_sumsqr_percol
> elapsed time: 8.979635000228882 seconds
>
> bench: sqeuc_norm_percol
> elapsed time: 4.917685031890869 seconds
>
> bench: sqeuc_norm_percol_s
> elapsed time: 16.51025891304016 seconds
>
> bench: sqeuc_map_norm
> elapsed time: 5.374030113220215 seconds
>
> This means that a completely de-vectorized for-loop is about 10x - 20x faster than any vectorized approach I tried. To maintain desirable performance, I had to implement many functions in a de-vectorized way, which is often much more tedious and time-consuming than writing vectorized code.
>
> I am amazed by the performance of the raw for-loops that Julia produces. However, vectorized code that is 10x slower virtually mandates de-vectorization everywhere that is performance-critical, which I am not sure is the way to go.
>

Dahua

unread,
Jan 28, 2013, 8:06:07 AM
to juli...@googlegroups.com
An error occurred when I tried to upload the code. The link to the benchmark code is here:

Dahua

unread,
Jan 28, 2013, 8:51:56 AM
to juli...@googlegroups.com
I found that this issue is related:


It was opened two years ago, but I am not sure how much progress has been made along that line. BTW, I actually tried to test the same thing in MATLAB:

sqrt(sum((a - b).^2, 1))

The statement above achieves performance similar to the de-vectorized code in Julia. I believe that MATLAB's JIT has some magic to avoid creating the temporaries that a straightforward implementation would produce.

Even this simple expression may lead to two temporary copies: a - b and (a - b) .^ 2. More complicated expressions that are typical in everyday computation, e.g. the formula to calculate the normal pdf, would lead to far more temporaries and kill performance.
According to the benchmark above, 90% of the time is wasted on something irrelevant to the computation -- a very low CPU utilization.
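For concreteness, the two shapes being compared look roughly like this in current Julia syntax (`dims=1` and `.-` are modern spellings; the function names here are illustrative, not the attached benchmark's actual code):

```julia
# Vectorized form: allocates a .- b, then its elementwise square, then
# the reduction result -- three arrays for one answer.
sqeuc_vec(a, b) = vec(sum((a .- b) .^ 2, dims=1))

# De-vectorized form: a single pass; only the result vector is allocated.
function sqeuc_loop(a, b)
    m, n = size(a)
    r = zeros(eltype(a), n)
    for j in 1:n, i in 1:m
        d = a[i, j] - b[i, j]
        r[j] += d * d
    end
    return r
end

a = [1.0 2.0; 3.0 4.0]
b = zeros(2, 2)
```

Both compute the same per-column squared Euclidean distances; the difference is purely in how many intermediate arrays get materialized.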

I know little about how Julia actually handles code generation with LLVM. Here is a speculative way to handle this in a very generic sense:

Introduce a specific tag (at compile time) on expressions to indicate whether an expression supports element-wise access, and a specific tag on functions to indicate whether a function is an element-wise evaluation.

Then anything as follows should be treated as element-wise accessible:

a : when a is an instance of AbstractArray
f(a, b, ...) : when a, b, ... are element-wise accessible and f is an element-wise function (e.g. exp, log, +, -)
bsxfun(f, a, b, ...)

Recursively applying this implies that expressions like exp((x - y).^2) are element-wise accessible.
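That recursive rule can be mimicked directly on quoted Julia expressions (a toy sketch in modern syntax; `iselementwise_fun`, `elementwise_accessible`, and the tiny operator table are hypothetical names of mine, not a proposed API):

```julia
# Whether a function symbol is (assumed to be) elementwise; a small
# illustrative table, not an exhaustive one.
iselementwise_fun(s) = s in (:exp, :log, :+, :-, :*, Symbol(".^"), Symbol(".*"))

# The recursion from the text: a bare symbol (standing for an array) is
# element-wise accessible, a literal number is, and f(a, b, ...) is when
# f is an elementwise function and all of its arguments qualify.
elementwise_accessible(x::Symbol) = true
elementwise_accessible(x::Number) = true
elementwise_accessible(x) = false
function elementwise_accessible(e::Expr)
    e.head == :call && iselementwise_fun(e.args[1]) &&
        all(elementwise_accessible, e.args[2:end])
end
```

On this rule, `elementwise_accessible(:(exp((x - y) .^ 2)))` holds, while something like `sort(x)` does not.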

Then, in the code generation part, any entire element-wise accessible expression could be automatically de-vectorized (at compile time) without creating any temporaries. (Furthermore, if you turn on LLVM's bb-vectorize pass, it may generate SIMD code for you.)

Ideally, users should be able to tag their own functions as element-wise (using a macro?) -- for example, I could write normpdf, tag it as element-wise, and it would then enjoy the benefits of automatic de-vectorization.

Tim Holy

unread,
Jan 28, 2013, 10:12:13 AM
to juli...@googlegroups.com
On Monday, January 28, 2013 07:43:49 AM John Myles White wrote:
> This is really something that needs improvement. I entirely agree that
> vectorized code is easier to read and write.

Often, but not always. Sometimes with Matlab, which generally has poor
performance unless you vectorize, I had to go through contortions to figure out
how to vectorize my computation.

Certainly this is something that will, in the fullness of time, be addressed.
I think there are a couple of efforts underway; one that is publicly available
is Krys' "DeVec" framework: https://github.com/kk49/julia-delayed-matrix
Obviously it would be better to have automatic elision of temporaries built
into core Julia; I don't know about other people's plans, and I'm not working
in this area myself, but possibly this is an area where joining the effort
might make it happen sooner.

Another resource, not immediately visible, is that many of the matrix routines
have a "preallocated output" form. For example, "A*B" calls A_mul_B, and
A_mul_B has both
C = A_mul_B(A, B)
and
A_mul_B(C, A, B)
forms. The latter lets you explicitly manage the memory for the output
(i.e., avoid temporaries), which is usually the main source of the difference
between vectorized and devectorized performance. A_mul_B(C, A, B) returns C,
so you can compose it with other such routines. Not as pretty as A*B, of
course, but it certainly gets you the performance without having to dive down
into low-level LAPACK.
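In later Julia versions this preallocated-output pattern survives as `mul!` in the LinearAlgebra standard library (the A_mul_B family was eventually folded into it); a sketch in the modern spelling:

```julia
using LinearAlgebra

A = [1.0 2.0; 3.0 4.0]
B = [5.0 6.0; 7.0 8.0]

C = similar(A)   # preallocated output buffer, reusable across calls
mul!(C, A, B)    # writes A*B into C and returns C; no temporary allocated
```

Because the in-place form returns its output argument, calls compose just as Tim describes.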

--Tim

Dahua

unread,
Jan 28, 2013, 10:41:12 AM
to juli...@googlegroups.com
Thanks for mentioning Krys' DeVec framework. It looks quite interesting, except that it hasn't been updated for 7 - 8 months. I am not sure whether it remains under active development.

Eventually, this needs to be part of the Julia compiler, so that it automatically de-vectorizes things rather than making users write de-vectorized code explicitly.

As a temporary workaround, I am considering writing a macro system, as follows:

@devec r = exp( (x - y) .^ 2)

The @devec macro could deconstruct the expression tree and reconstruct a block of de-vectorized code. As Julia's macros can access the expression tree, this seems feasible to me -- any insights?
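To sketch the idea (a minimal, hypothetical toy in current Julia syntax; `SCALAR_OP`, `scalarize`, and `first_array` are made-up names, and this is nothing like a complete implementation): the macro walks the expression tree, swaps each elementwise operator for its scalar counterpart, and emits a single loop, so no intermediate arrays are created.

```julia
# Toy @devec: rewrites `lhs = <elementwise expr>` into one explicit loop.
# Assumes every bare symbol on the right-hand side is an array of equal
# length, and supports only the small operator table below.
const SCALAR_OP = Dict(:+ => :+, :- => :-, Symbol(".*") => :*,
                       Symbol("./") => :/, Symbol(".^") => :^,
                       :exp => :exp, :log => :log, :sqrt => :sqrt)

# Replace array symbols with indexed access and (dotted) operators with
# their scalar counterparts.
scalarize(x, i) = x isa Symbol ? :($x[$i]) : x
function scalarize(e::Expr, i)
    e.head == :call || error("unsupported expression: $e")
    haskey(SCALAR_OP, e.args[1]) || error("not elementwise: $(e.args[1])")
    Expr(:call, SCALAR_OP[e.args[1]],
         [scalarize(a, i) for a in e.args[2:end]]...)
end

# Grab some array symbol to take the length and element type from.
first_array(x::Symbol) = x
first_array(e::Expr) = first_array(e.args[2])

macro devec(assign)
    (assign isa Expr && assign.head == :(=)) || error("@devec expects `lhs = rhs`")
    lhs, rhs = assign.args
    i = gensym(:i)                # loop index
    a = first_array(rhs)
    body = scalarize(rhs, i)
    esc(quote
        $lhs = similar($a, float(eltype($a)))
        for $i in eachindex($a)
            $lhs[$i] = $body
        end
        $lhs
    end)
end

x = [1.0, 2.0, 3.0]
y = [0.5, 1.0, 1.5]
@devec r = exp((x - y) .^ 2)      # expands to one loop, no temporaries
```

A real implementation would need shape checks, scalar broadcasting, and a far larger operator table, but it shows that the expression-tree rewriting is within reach of ordinary macros.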

Tim Holy

unread,
Jan 28, 2013, 11:04:46 AM
to juli...@googlegroups.com
On Monday, January 28, 2013 07:41:12 AM Dahua wrote:
> Thanks for mentioning Krys' DeVec framework. It looks quite interesting,
> except that it hasn't been updated for 7 - 8 months. I am not sure whether
> it remains under active development.

Check the recent mailing list archives for a recent post from Krys.

> Eventually, this needs to be part of the Julia compiler, so that it
> automatically de-vectorizes things rather than making users write
> de-vectorized code explicitly.
>
> As a temporary work-around, I am considering writing a macro system, as
> follows
>
> @devec r = exp( (x - y) .^ 2)
>
> The @devec macro could deconstruct the expression tree and reconstruct a block
> of de-vectorized code. As Julia's macros can access the expression tree,
> this seems feasible to me -- any insights?

Yes, it seems feasible. My guess is that "simple things will be pretty simple"
(e.g., if all operations are elementwise) but that the general problem will be
nontrivial (which is probably why it's not part of Julia proper yet).

--Tim

Krzysztof Kamieniecki

unread,
Jan 28, 2013, 1:28:59 PM
to juli...@googlegroups.com
DeMat is not currently under active development, but I am actively itching to start working on it again and feeling guilty about the hiatus. I hope to get some of my ducks in a row and resume work in the next month.

I use a new type for my vectors (which, for the Julia backend, just wraps the Julia vectors), build (or load from the cache) the Julia for-loop or PTX kernel, and execute it as part of the assignment.

The @devec macro you propose is feasible; it could even be a wrapper around my code, although I would be interested to see whether what I did could be done in a more elegant form.

Douglas Bates

unread,
Jan 28, 2013, 1:53:20 PM
to juli...@googlegroups.com
For unary operators and functions there is already a version of sum that uses mapreduce.  So the squared Euclidean length of a vector doesn't use temporaries if written as

sum(x->x*x,v)
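In current Julia this is the `sum(f, v)` method, and a named elementwise function like `abs2` expresses the same reduction (a sketch in modern syntax; today the anonymous and named forms typically perform alike, unlike at the time of this thread):

```julia
v = [3.0, 4.0]

s_anon  = sum(x -> x * x, v)  # the anonymous-function form from the thread
s_named = sum(abs2, v)        # named elementwise function, same reduction
```

Neither form materializes a squared copy of `v`; the mapping and the reduction are fused.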

Dahua

unread,
Jan 28, 2013, 4:50:30 PM
to juli...@googlegroups.com
Thanks for suggesting this.

While it may avoid temporaries, calling an anonymous function such as x -> x * x incurs very large overhead. I did a simple benchmark: this statement is about two orders of magnitude slower than even the vectorized code.

I am now working on the @devec macro, which shows quite promising performance -- basically the same as a hand-coded for-loop, but with much more concise syntax. I will post about this later.

Douglas Bates

unread,
Jan 28, 2013, 5:15:14 PM
to juli...@googlegroups.com
On Monday, January 28, 2013 3:50:30 PM UTC-6, Dahua wrote:
Thanks for suggesting this.

While it may avoid temporaries, calling an anonymous function such as x -> x * x incurs very large overhead. I did a simple benchmark: this statement is about two orders of magnitude slower than even the vectorized code.

How does sum(square(v)) stack up?  I had forgotten the overhead of anonymous functions.

Viral Shah

unread,
Jan 28, 2013, 5:37:48 PM
to juli...@googlegroups.com
We need two things. We need better performance for our vectorized code, and Jeff is addressing much of this on the `jb/gcwork` branch. Our vectorized routines are, after all, implemented with loops, which are fast; avoiding temporaries will improve cache behaviour and reduce GC pressure significantly. In the ideal case, our inner loops get optimized with SIMD instructions.

The second thing we need is automatic devectorization. For example, array expressions that are element-wise can be automatically devectorized. Your suggestion of a macro that marks functions as element-wise could also help with devectorization.

-viral

Krzysztof Kamieniecki

unread,
Jan 28, 2013, 6:13:21 PM
to juli...@googlegroups.com
Hi Viral,

The DeMat code can be modified to do automatic devectorization by replacing the normal vector operators with my operators that work on my new type. Now that symbols can be used as type parameters, implementing all the operators should be much easier, since you don't have to define a type for each operator. Take it and integrate it, please. :)

Krys
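The "symbol as a type parameter" trick can be sketched like this (hypothetical type and function names, not DeMat's actual API): one parametric node type stands in for every binary elementwise operator, with per-operator behavior supplied by dispatch on the symbol parameter rather than by a dedicated type per operator.

```julia
# One delayed-expression node for all binary elementwise operators; the
# operator is carried in the type as a Symbol parameter.
struct ElwiseOp{op, A, B}
    a::A
    b::B
end
ElwiseOp(op::Symbol, a, b) = ElwiseOp{op, typeof(a), typeof(b)}(a, b)

# Element access: arrays index directly; nodes recurse, dispatching on `op`.
getel(a::AbstractArray, i) = a[i]
getel(e::ElwiseOp{:+}, i) = getel(e.a, i) + getel(e.b, i)
getel(e::ElwiseOp{:*}, i) = getel(e.a, i) * getel(e.b, i)

# A delayed tree for (x .+ y) .* z; evaluating it element by element
# never materializes the intermediate x .+ y.
x, y, z = [1.0, 2.0], [3.0, 4.0], [2.0, 2.0]
tree = ElwiseOp(:*, ElwiseOp(:+, x, y), z)
result = [getel(tree, i) for i in eachindex(x)]
```

Adding an operator is then one `getel` method rather than a new struct, which is the simplification Krzysztof is pointing at.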

Viral Shah

unread,
Jan 28, 2013, 7:48:31 PM
to juli...@googlegroups.com
Ok - this is exciting. I am going to try it out.

Thanks,

-viral

Krzysztof Kamieniecki

unread,
Jan 28, 2013, 9:17:17 PM
to juli...@googlegroups.com
UGH, I just tried it and it looks like I have to update it to handle changes in Julia. I'll give it a shot tonight...

Dahua

unread,
Jan 28, 2013, 9:45:02 PM
to juli...@googlegroups.com
Hi Krys,

I carefully read through your DeMat code, and I found that it is similar to the approach I took in developing LightMatrix (https://github.com/lindahua/light-matrix), that is, constructing matrix accessors and computational kernels at compile time and applying them in a "de-vectorized" way at run-time -- except that LightMatrix uses C++ meta-programming instead of Julia macros.

I really like your design and am very excited about DeMat. A problem with using your code directly is that it seems to require users to manually call the constructors of DeVecJ or DeMatJ to wrap arrays in your new types. This is sometimes cumbersome, and a macro could save users from writing such boilerplate code.

What about the following syntax?

@devec r = f(x) + g(y) * h(z)  # map to the default backend (de-vectorized Julia code?)
@devec_simd r = ...            # map to the SIMD back-end
@devec_cuda r = ...            # map to the CUDA back-end, etc.

The implementation of these macros can simply be based on the DeMat module. The cool thing here is that users can freely choose or change the back-end as they see fit, simply by adding or modifying the macro. (I am now playing with this idea.)

IMHO, to make it easy for the community to contribute (e.g. adding newly supported operators/functions and even new back-ends), it might be worth moving the module to Julia's new package management system so that we can get it more easily (or, ideally, integrating it into the Base module). Also, it would be important to stabilize the common interface and type system so that people can rely on them when making extensions and other contributions.

Krzysztof Kamieniecki

unread,
Jan 28, 2013, 10:10:38 PM
to juli...@googlegroups.com
Hi Dahua,

I basically stole the pattern from traditional C++ meta-programming. (I wrote a proprietary C++ library at work to make the same application-level vectorized code compile to C++ or CUDA code.) My hope for Julia is to merge my C++ and MATLAB algorithm work into one tool.

I really like the idea of wrapping the library with macros. Updating to the module system is at the top of my list; I have some free time tonight, so I'll see where I can get to (the first step is to make it work again).

Krys

Toivo Henningsson

unread,
Jan 29, 2013, 12:43:44 AM
to juli...@googlegroups.com
I did some experiments along these lines about nine months ago. The discussion can be found at https://groups.google.com/forum/m/#!searchin/julia-dev/julia-kernels/julia-dev/2aqKe9Q9dEc, and you can look at the source at https://github.com/toivoh/julia-kernels. Unfortunately, Julia has changed quite a bit since then, so it won't run. Also, I have some ideas on how to implement it in a better way. I've been wanting to come back to this at some point, but I don't have a lot of time right now, and I have other interesting projects to work on as well.

Toivo Henningsson

unread,
Jan 29, 2013, 12:55:48 AM
to juli...@googlegroups.com
Sorry, it seems that the link was broken. This one should work: https://groups.google.com/forum/?fromgroups=#!topic/julia-dev/2aqKe9Q9dEc

Krzysztof Kamieniecki

unread,
Jan 29, 2013, 2:02:38 AM
to juli...@googlegroups.com
Progress but not quite there yet...

Krzysztof Kamieniecki

unread,
Jan 31, 2013, 12:03:36 AM
to juli...@googlegroups.com
FIXED! Again it's just a proof of concept.

https://github.com/kk49/julia-delayed-matrix/blob/e49803ffcf125e2c21d55799327cae27d99086fc/results002.txt is a test run from my computer. It looks like my delayed-expression / devectorization library can match the performance of a for-loop at the same scope level, but not of a for-loop in an inner function called from that scope level (see tests 1 and 1a). I played with this before, and I think I posted some results to this mailing list. The loop in the inner function gets better optimization from LLVM; the same thing happens in C++ with Clang (the LLVM-based C++ compiler).
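The effect Krzysztof describes is often called the "function barrier": a loop inside its own function is compiled for the concrete argument types, while the same loop at an outer scope (here, global scope, as an illustration) is not type-specialized. A hedged sketch, not DeMat code:

```julia
# Loop hoisted into a function: the compiler specializes on typeof(x),
# so the body works on unboxed Float64 values.
function inner_sum(x)
    s = zero(eltype(x))
    for v in x
        s += v
    end
    return s
end

x = collect(1.0:1000.0)

# The same loop at global scope: correct, but typically far slower,
# because the untyped global `s_outer` defeats specialization.
s_outer = 0.0
for v in x
    global s_outer += v
end
```

Both produce the same value; the difference is purely in how well the compiler can optimize each loop.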

The CUDA code runs slow the first time (about the same speed as the devectorized Julia code) due to the CUDA/PTX JIT cost, but on subsequent calls it is FAST: with these examples, 20x - 30x.

Enjoy!

Elliot Saba

unread,
Jan 31, 2013, 3:31:41 AM
to juli...@googlegroups.com
Krzysztof, I may be missing something, but shouldn't the sum() tests be yielding the same sum? It looks like the Julia result for sum( x .* x ) is 3.29971025e7, whereas the CUDA result is 3.30015025e7.

Dahua

unread,
Jan 31, 2013, 8:46:10 AM
to juli...@googlegroups.com
When you are summing a large number of floating-point values, there can be differences in the results if you sum things in a different order or group the sums in different ways, e.g.

in general 

a + b + c + d = (((a + b) + c) + d) != (a + b) + (c + d) ...

when the number of summands becomes large, the difference may become noticeable. I believe Krys parallelizes the sum in CUDA, and thus the order & grouping differ from those of a serial version.

Also, an x86 CPU (when not using SIMD) uses 80-bit FPU registers to represent numbers in extended double-precision format, while CUDA internally uses 32 bits for single precision and 64 bits for double precision. This may also contribute to the difference.
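The grouping effect is easy to demonstrate with the classic 0.1 + 0.2 example (standard IEEE-754 double-precision behavior, not specific to this thread's benchmark):

```julia
a, b, c = 0.1, 0.2, 0.3

# The same three summands, grouped two different ways.
left  = (a + b) + c   # 0.6000000000000001
right = a + (b + c)   # 0.6
```

A parallel reduction regroups in exactly this way, so bit-for-bit agreement with a serial sum should not be expected.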

Krzysztof Kamieniecki

unread,
Jan 31, 2013, 8:47:51 AM
to juli...@googlegroups.com
It depends: if LLVM uses a float64 or float80 to hold some intermediate values, you can accumulate some differences. But it is something I should prove one way or another.

Dahua

unread,
Jan 31, 2013, 9:05:02 AM
to juli...@googlegroups.com
I believe that people should expect differences (to a reasonable extent, depending on the computation) in floating-point computation.

Almost surely, a parallelized sum (mainly due to different grouping, taking advantage of associativity) will not produce exactly the same result as a serial sum (except in very special cases, such as when every number is an integer).

That said, there is a provable upper bound on the difference. For testing purposes, I think it should be fine if the difference is reasonably small.

Kevin Squire

unread,
Jan 31, 2013, 2:25:44 PM
to juli...@googlegroups.com
At some point KBN summation was used by default in Julia, and I have a vague recollection of Krzysztof updating his work to reflect that -- is that still the case? Because KBN caused a performance hit on some platforms, the default was changed back to simple summation, and KBN summation was moved to sum_kbn/cumsum_kbn. (https://github.com/JuliaLang/julia/issues/1258)
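For context, KBN (Kahan-Babuska-Neumaier) summation carries a compensation term alongside the running sum, recovering the low-order bits that naive accumulation rounds away; a minimal sketch of the idea (my own toy, not Julia's actual sum_kbn implementation):

```julia
# Compensated summation in the KBN style: `c` accumulates the rounding
# error that the naive running sum `s` loses at each step.
function kbn_sum(xs)
    s = 0.0
    c = 0.0
    for x in xs
        t = s + x
        c += abs(s) >= abs(x) ? (s - t) + x : (x - t) + s
        s = t
    end
    return s + c
end

# A pathological input: naive summation loses the two 1.0 terms entirely,
# while the compensated version recovers them.
data = [1.0, 1e100, 1.0, -1e100]
```

This extra bookkeeping is exactly the per-element cost that made KBN a performance hit on some platforms.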

Kevin

Krzysztof Kamieniecki

unread,
Jan 31, 2013, 2:32:59 PM
to juli...@googlegroups.com
I had looked into that, but I did not implement it. In this specific case of the CUDA discrepancy, the summing in both cases is actually done using Julia's sum.

But the change back to simple summation explains why my Julia devectorizing sum now matches Julia's sum. :-D