how long until vectorized code runs fast?

Anonymous

May 12, 2016, 1:22:24 AM
to julia-users
This remains one of the main drawbacks of Julia, and the Devectorize package is basically useless because it doesn't support some really crucial vectorized operations.  I'd really prefer not to rewrite all my vectorized code into nested loops if at all possible, but I really need more speed.  Can anyone tell me the timeline and future plans for making vectorized code run at C speed?

Kristoffer Carlsson

May 12, 2016, 1:35:28 AM
to julia-users
It is always easier to discuss if there is a piece of code to look at. Could you post a few code examples that do not run as fast as you would like?

Also, make sure to look at https://github.com/IntelLabs/ParallelAccelerator.jl. They have a quite sophisticated compiler that does loop fusion, parallelization, and other cool stuff.

Keno Fischer

May 12, 2016, 1:41:55 AM
to julia...@googlegroups.com
There seems to be a myth going around that vectorized code in Julia is slow. That's not really the case. Oftentimes it's just that devectorized code is faster, because one can manually perform operations such as loop fusion, which the compiler cannot currently reason about (and most C compilers can't either). In some other languages those benefits get drowned out by language overhead, but in Julia those kinds of constructs are generally fast. The cases where Julia can be slower are when there is excessive memory allocation in a tight inner loop, but those cases can usually be rewritten fairly easily without losing the vectorized look of the code.
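To make that concrete, here is a minimal sketch (with hypothetical function names) of what manual loop fusion looks like. On the Julia versions discussed in this thread, the vectorized form materializes a temporary array for `a .* x` before adding `b`, while the hand-written loop does both operations in a single pass, allocating only the output:

```julia
# Vectorized form: allocates a temporary for a .* x, then another
# array for the final result of .+ b.
vectorized(a, x, b) = a .* x .+ b

# Manually fused loop: one pass over the data, one output allocation.
function fused(a, x, b)
    y = similar(x)
    for i in eachindex(x)
        y[i] = a[i] * x[i] + b[i]
    end
    return y
end

a, x, b = rand(100), rand(100), rand(100)
@assert vectorized(a, x, b) == fused(a, x, b)
```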

Anonymous

May 12, 2016, 2:03:38 AM
to julia-users
In response to both Kristoffer and Keno's timely responses,

Originally I just did a simple @time test of the form

Matrix .* horizontal vector

and then tested the same thing with for loops, and the for loops were way faster (and used way less memory).

However, I just devectorized one of my algorithms and ran an @time comparison, and the vectorized version was actually twice as fast as the devectorized version, though it used way more memory.  Clearly I don't really understand the specifics of what makes code slow, and in particular how vectorized code compares to devectorized code.  Vectorized code does seem to use a lot more memory, but clearly for my algorithm it nevertheless runs faster than the devectorized version.  Is there a reference I could look at that explains this to someone with a background in math but not much knowledge of computer architecture?

Milan Bouchet-Valat

May 12, 2016, 2:56:46 AM
to julia...@googlegroups.com
Some major improvements are coming in 0.5, and more are currently being
worked on/discussed. See
https://github.com/JuliaLang/julia/issues/16285


Regards

Milan Bouchet-Valat

May 12, 2016, 3:06:45 AM
to julia...@googlegroups.com
On Wednesday, May 11, 2016 at 23:03 -0700, Anonymous wrote:
> In response to both Kristoffer and Keno's timely responses,
>
> Originally I just did a simple @time test of the form
> Matrix .* horizontal vector
>
> and then tested the same thing with for loops, and the for loops were
> way faster (and used way less memory)
>
> However I just devectorized one of my algorithms and ran an @time
> comparison and the vectorized version was actually twice as fast as
> the devectorized version, however the vectorized version used way
> more memory.  Clearly I don't really understand the specifics of what
> makes code slow, and in particular how vectorized code compares to
> devectorized code.  Vectorized code does seem to use a lot more
> memory, but clearly for my algorithm it nevertheless runs faster than
> the devectorized version.  Is there a reference I could look at that
> explains this to someone with a background in math but not much
> knowledge of computer architecture?
I don't know about a reference, but I suspect this is due to BLAS. Vectorized versions of linear algebra operations like matrix multiplication are highly optimized and run several threads in parallel. By contrast, your devectorized code isn't carefully tuned for a specific processor model and uses a single CPU core (Julia will soon support using several threads; see also [1]).

So depending on the particular operations you're running, the
vectorized form can be faster even though it allocates more memory. In
general, it will likely be faster to use BLAS for expensive operations
on large matrices. On the other hand, it's better to devectorize code if you
successively perform several simple operations on an array, because
each operation currently allocates a copy of the array (this may well
change with [2]).
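As a sketch of that allocation behavior (hypothetical function names): each elementwise step in the chained version below allocates its own temporary array, while the single loop makes one pass and allocates only the result.

```julia
# Three simple elementwise steps, three temporary arrays:
# (x .- 0.5), then (...) .^ 2, then (...) .+ 1.0.
chained(x) = ((x .- 0.5) .^ 2) .+ 1.0

# The same arithmetic in one pass, allocating only the output.
function onepass(x)
    y = similar(x)
    for i in eachindex(x)
        y[i] = (x[i] - 0.5)^2 + 1.0
    end
    return y
end

x = rand(1000)
@assert isapprox(chained(x), onepass(x))
```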


Regards


1: http://julialang.org/blog/2016/03/parallelaccelerator
2: https://github.com/JuliaLang/julia/issues/16285

Anonymous

May 12, 2016, 3:31:20 AM
to julia-users
Are operators such as

[1 2; 3 4] .* [1 2]

or

[1,2] .^ [1,2]

part of BLAS?

The latter is covered by Devectorize.jl; however, my understanding is that the former falls through the cracks, covered by neither Devectorize.jl nor BLAS.

Stefan Karpinski

May 12, 2016, 7:49:42 AM
to Julia Users
On Thu, May 12, 2016 at 7:41 AM, Keno Fischer <kfis...@college.harvard.edu> wrote:
There seems to be a myth going around that vectorized code in Julia is slow. That's not really the case. Oftentimes it's just that devectorized code is faster, because one can manually perform operations such as loop fusion, which the compiler cannot currently reason about (and most C compilers can't either). In some other languages those benefits get drowned out by language overhead, but in Julia those kinds of constructs are generally fast. The cases where Julia can be slower are when there is excessive memory allocation in a tight inner loop, but those cases can usually be rewritten fairly easily without losing the vectorized look of the code.

This. JMW's blog post on the subject is as relevant now as when he wrote it:


Conclusion:
  • Julia’s vectorized code is 2x faster than R’s vectorized code
  • Julia’s devectorized code is 140x faster than R’s vectorized code
  • Julia’s devectorized code is 1350x faster than R’s devectorized code
Julia's vectorized code is not slow – it's faster than other languages. It's just that Julia allows you to write even faster code when it matters.

Miguel Bazdresch

May 12, 2016, 8:51:44 AM
to julia...@googlegroups.com
The easiest way to write slow for loops is to make them row-major instead of column-major.
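A minimal sketch of that pitfall (hypothetical function names): Julia stores arrays in column-major order, so the inner loop should run over the first index to walk memory contiguously.

```julia
# Row-major traversal: the inner loop jumps across columns,
# striding through memory.
function rowmajor_sum(A)
    s = 0.0
    for i in 1:size(A, 1), j in 1:size(A, 2)
        s += A[i, j]
    end
    return s
end

# Column-major traversal: the inner loop runs down each column,
# matching Julia's memory layout.
function colmajor_sum(A)
    s = 0.0
    for j in 1:size(A, 2), i in 1:size(A, 1)
        s += A[i, j]
    end
    return s
end

A = rand(200, 200)
# Summation order differs, so compare approximately.
@assert isapprox(rowmajor_sum(A), colmajor_sum(A))
```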

-- mb

On Thu, May 12, 2016 at 8:46 AM, Anonymous <esp...@gmail.com> wrote:
So I guess the consensus is not that Julia's devectorized code is so much faster than its vectorized code (in fact I keep getting slowdowns when I test out different devectorizations of my algorithms), but that R's devectorized code just sucks; either that, or I really suck at writing for loops.

Honestly, I've been testing out different devectorizations of my algorithms and I keep getting slower results, not faster, so either I really suck at writing for loops or Julia is doing a good job with my vectorized code.

Tim Holy

May 12, 2016, 8:58:22 AM
to julia...@googlegroups.com
Did you run it twice? Remember that memory is allocated during JIT compilation, so the amount of memory reported on the first call is completely meaningless.
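That is, time the call twice (a quick sketch with a hypothetical function):

```julia
double_sum(x) = sum(x .* 2.0)

x = rand(10^6)
@time double_sum(x)   # first call: includes JIT compilation time and its allocations
@time double_sum(x)   # second call: measures only the function itself
```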

--Tim

Anonymous

May 12, 2016, 9:44:17 AM
to julia-users
I did run it multiple times, yes.  I've tried a couple of different devectorizations of my algorithms and none result in speed-ups; most result in slightly slower run-times.  I guess I find it a bit strange, because the memory allocations and garbage collection are far lower when I devectorize, but that doesn't translate into performance improvements.  Also, like I said before, I'm most curious about the current status of operations of the form:

[1 2; 3 4] .* [1 2]

Is such an operation covered by BLAS?

Steven G. Johnson

May 12, 2016, 9:46:32 AM
to julia-users


On Thursday, May 12, 2016 at 8:51:44 AM UTC-4, Miguel Bazdresch wrote:
honestly I've been testing out different devectorizations of my algorithms and I keep getting slower results, not faster, so either I really suck at writing for loops or Julia is doing a good job with my vectorized code.

Make sure your loops are in a function — don't benchmark in global scope (see the performance-tips section of the manual).  Try running your function through @code_warntype myfunction(args...) and see if it marks any variables as type "ANY" (which indicates a type instability in your code; see the performance tips).

Also, if you do "@time myfunc(args...)" and it reports a huge number of allocations, you could either have a type instability or be allocating new arrays in your inner loops (it is almost always better to allocate arrays once outside your inner loops and then update them in-place as needed).
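A sketch of that last point, with hypothetical function names: the first version allocates a new array on every loop iteration, while the second reuses one preallocated buffer and updates it in place.

```julia
# Allocates a fresh temporary array on every iteration.
function alloc_each_time(xs)
    total = 0.0
    for x in xs
        tmp = x .* 2.0
        total += sum(tmp)
    end
    return total
end

# Allocates one buffer up front and reuses it in place.
function preallocated(xs)
    total = 0.0
    tmp = similar(first(xs))
    for x in xs
        for i in eachindex(x)
            tmp[i] = x[i] * 2.0
        end
        total += sum(tmp)
    end
    return total
end

xs = [rand(100) for _ in 1:10]
@assert isapprox(alloc_each_time(xs), preallocated(xs))
```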

Tom Breloff

May 12, 2016, 9:53:32 AM
to julia-users
Also, it's possible that your vectorized versions are being passed to multithreaded routines. The setup might require more memory, but the execution would run in parallel.

Anonymous

May 12, 2016, 9:53:35 AM
to julia-users
Yes, the algorithm I'm testing this on is fairly polished at this point: all variables live inside a type and they all have strict type declarations.  The memory allocations are very low compared to the vectorized code, so memory-wise the loops are doing their job, but this doesn't translate into speed-ups.

Ford Ox

May 12, 2016, 10:05:28 AM
to julia-users
Why don't you just post your code here?

On Thursday, May 12, 2016 at 15:53:35 UTC+2, Anonymous wrote:

Stefan Karpinski

May 12, 2016, 1:19:07 PM
to Julia Users
I also have to ask... you're not working with global variables, right?

Tim Holy

May 12, 2016, 1:55:38 PM
to julia...@googlegroups.com
On Thursday, May 12, 2016 06:44:16 AM Anonymous wrote:
> Also like I said before, I'm most curious about the current status of
> operations of the form:
>
> [1 2; 3 4] .* [1 2]
>
> is such an operation covered by BLAS?

No, among other reasons because BLAS only handles floating-point numbers. That specific operation is handled by broadcasting.
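Concretely, the 1×2 array is broadcast across each row of the 2×2 matrix:

```julia
A = [1 2; 3 4]
v = [1 2]          # a 1×2 Array{Int,2}
B = A .* v         # v is recycled along the first dimension
@assert B == [1 4; 3 8]
```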

Best,
--Tim