function is approximately 100X faster in C than in Julia! What am I doing wrong?


Sam Kaplan

May 26, 2013, 3:08:28 PM
to julia...@googlegroups.com
Hi,

I needed to implement a 3D Laplacian operator.  I saw that in the tests/perf2 folder of Julia there is a 2D laplace example.  I guess in that example they are solving Laplace's equation, rather than applying the Laplacian operator.  But, the code pattern in these two cases seems similar to me.

I ran the tests/perf2 example, and in that example I get comparable run-times between the C and Julia versions.  However, when I run my 3D Laplacian example, my C code is approximately 100X faster than my Julia code.  I put my code into a gist:

https://gist.github.com/anonymous/5653687

When I run the laplace_bench.jl file, I get the following output on my computer:

Julia: elapsed time: 54.036786598 seconds
C (serial): elapsed time: 0.520510243 seconds

This seems strange, especially given the perf2 example.  Can anyone spot a mistake that I have made?

Thanks for your time!

Sam

John Myles White

May 26, 2013, 4:07:24 PM
to julia...@googlegroups.com
This is really interesting. As I mentioned on GitHub, reversing the order of the loops is a big gain: the code goes from 60s to 40s on my system. In contrast, gcc with no explicit optimization setting takes 3s.

Maybe the slowdown comes from bounds-checking?

-- John

Gunnar Farnebäck

May 26, 2013, 6:12:46 PM
to julia...@googlegroups.com
It might give some insight to compute the indices as in the C code and use only linear indexing into the arrays.

Or try doing the operation with imfilterN from https://github.com/GunnarFarneback/julia/blob/imfilter/extras/image.jl

John Myles White

May 26, 2013, 6:18:35 PM
to julia...@googlegroups.com
I tried doing linear indexing. To my amazement, it seemed to slow things down.

 -- John

Jameson Nash

May 26, 2013, 7:16:23 PM
to julia...@googlegroups.com
I figured out how to get it down from 150 seconds to 13.6 seconds on my machine (switching the order of the loops didn't make a difference). This is apparently a bug in the compiler's GC analysis: https://github.com/JuliaLang/julia/issues/3206

function laplace3(del_x::Array{Float64, 3}, x::Array{Float64, 3})
    n1, n2, n3 = size(x)
    for i1 = 2:n1-1, i2 = 2:n2-1, i3 = 2:n3-1
        a = x[i1+1,i2,i3] + x[i1-1,i2,i3] +
            x[i1,i2+1,i3] + x[i1,i2-1,i3]
        b = x[i1,i2,i3+1] + x[i1,i2,i3-1] -
            6.0 * x[i1,i2,i3]
        del_x[i1,i2,i3] = a + b
    end
end

Taylor Maxwell

May 26, 2013, 8:26:32 PM
to julia...@googlegroups.com
On my machine:
Original Julia laplace: ~6.6 seconds
Switching order of loops: ~4.5 seconds
Jameson's laplace3: ~1.5 seconds
Switching order of loops on laplace3: ~0.26 seconds



Taylor Maxwell

May 26, 2013, 8:31:03 PM
to julia...@googlegroups.com
I used @time laplace(del_field, field)  to time it.

Sam Kaplan

May 26, 2013, 10:20:49 PM
to julia...@googlegroups.com
Thanks everyone!  The Julia community is fantastic!

Switching the order of the loops as suggested by John takes me from 54s to 35s.  Subsequently, adding Jameson's suggestion to break the computation into two parts takes me down to 2.4s.  So, now we have (on my computer):

Julia: 2.4 seconds
C:     0.5 seconds

So, with this trick Julia is ~5X slower than C.  Obviously that's a huge improvement over 100X :)

Jameson: Thanks for pointing out the bug report.  I'll keep my eye on it.

Sam

Sam Kaplan

May 26, 2013, 10:25:43 PM
to julia...@googlegroups.com
Gunnar:  I did try the linear indexing in Julia, but it did not make a difference.  Thanks for the suggestion!

Gunnar Farnebäck

May 27, 2013, 11:11:45 AM
to julia...@googlegroups.com
Try this for better speed:

function laplace(del_x::Array{Float64, 3}, x::Array{Float64, 3})
    n1, n2, n3 = size(x)
    n12 = n1 * n2
    for i3 = 2:n3-1, i2 = 2:n2-1, i1 = 2:n1-1
        index = i1 + n1 * ((i2 - 1) + n2 * (i3 - 1))
        t = -6.0 * x[index]
        t += x[index - 1]
        t += x[index + 1]
        t += x[index - n1]
        t += x[index + n1]
        t += x[index - n12]
        t += x[index + n12]
        del_x[index] = t
    end
end

Chris Foster

May 27, 2013, 11:26:08 AM
to julia...@googlegroups.com
On Tue, May 28, 2013 at 1:18 AM, Chris Foster <chri...@gmail.com> wrote:
> Basically the native indexing seems to result in most (if not all) of the
> index calculations being emitted into the inner loop. OTOH, explicit
> linear indexing has encouraged the optimizer to hoist the loop invariant
> parts of the index calculation out of the inner loop. Perhaps this is
> what's going on... perhaps not though, I'm very far from an expert in
> LLVM IR :) Is there any way of getting at a disassembly of the actual
> machine code by the way?

On second thoughts, maybe all that inner loop complexity is required,
given that index checking happens in all three dimensions? The IR is
kinda hard to read :)

Chris Foster

May 27, 2013, 11:18:48 AM
to julia...@googlegroups.com
I tried linear indexing with the reordered loops and the trick from
Jameson, and got the code to within a factor of 2x of gcc-4.6.3 on my
linux machine:

Julia: elapsed time: 0.909373584 seconds
C (serial): elapsed time: 0.468050562 seconds

Here's my uglified version:

function laplace4(del_x::Array{Float64, 3}, x::Array{Float64, 3})
    n1, n2, n3 = size(x)

    for i1 = 1:n3-2
        for i2 = 1:n2-2
            for i3 = 1:n1-2
                idx1 = 1 + ( i1   *n2 +  i2   )*n3 +  i3
                idx2 = 1 + ((i1+1)*n2 +  i2   )*n3 +  i3
                idx3 = 1 + ((i1-1)*n2 +  i2   )*n3 +  i3
                idx4 = 1 + ( i1   *n2 + (i2+1))*n3 +  i3
                idx5 = 1 + ( i1   *n2 + (i2-1))*n3 +  i3
                idx6 = 1 + ( i1   *n2 +  i2   )*n3 + (i3+1)
                idx7 = 1 + ( i1   *n2 +  i2   )*n3 + (i3-1)

                tmp1 = x[idx2] + x[idx3] + x[idx4] + x[idx5]
                tmp2 = x[idx6] + x[idx7] - 6.0 * x[idx1]
                del_x[idx1] = tmp1 + tmp2
            end
        end
    end
end


In contrast, the following natively indexed version takes 2.25s on my
machine.

function laplace5(del_x::Array{Float64, 3}, x::Array{Float64, 3})
    n1, n2, n3 = size(x)
    for i2 = 2:n2-1, i3 = 2:n3-1, i1 = 2:n1-1
        a = x[i1+1,i2,i3] + x[i1-1,i2,i3] +
            x[i1,i2+1,i3] + x[i1,i2-1,i3]
        b = x[i1,i2,i3+1] + x[i1,i2,i3-1] -
            6.0 * x[i1,i2,i3]
        del_x[i1,i2,i3] = a + b
    end
end


I've found a very tentative explanation of the difference between linear
indexing and the builtin indexing by blundering through the LLVM IR, which
apparently can be obtained using

disassemble(laplace, (Array{Float64, 3}, Array{Float64, 3}))

Basically the native indexing seems to result in most (if not all) of the
index calculations being emitted into the inner loop. OTOH, explicit
linear indexing has encouraged the optimizer to hoist the loop invariant
parts of the index calculation out of the inner loop. Perhaps this is
what's going on... perhaps not though, I'm very far from an expert in
LLVM IR :) Is there any way of getting at a disassembly of the actual
machine code by the way?

On a side note, compiling the C code with current clang-3.4 (svn) is
significantly better than gcc-4.6 at -O3, giving only 0.4s. Using
-march=native with gcc beats this again with only 0.38s. Both clang and
gcc emit an SSE-vectorized version of the inner loop at -O3, which speeds
things up by a few percent but nothing amazing.


BTW, this is my first post here (hi all!). I've been following julia with
great interest for a while and finally got started writing some simple
code yesterday. Thanks for such an exciting new language! I hope it will
finally do what scipy hasn't quite managed, and free me of matlab for good
:-)

~Chris

Sam Kaplan

May 27, 2013, 11:51:13 AM
to julia...@googlegroups.com
Thanks Gunnar and Chris!

With Gunnar's version I get:
Julia: elapsed time: 1.080966755 seconds
C (serial): elapsed time: 0.489777748 seconds

That looks pretty similar to the numbers Chris is getting with his version.  Of course, now the C code looks nicer than the Julia code :)

Chris: thanks for pointing out the disassemble function!

Sam

Ariel Keselman

May 27, 2013, 2:28:09 PM
to julia...@googlegroups.com
This version is a bit faster, and more compact. It is a mix of all previous versions:

function laplace7!(del_x::Array{Float64, 3}, x::Array{Float64, 3})
    n1, n2, n3 = size(x)
    n12 = n1 * n2
    for i3 = 2:n3-1, i2 = 2:n2-1, i1 = 2:n1-1
        index = i1 + n1 * ((i2 - 1) + n2 * (i3 - 1))
        tmp1 = -6.0 * x[index] + x[index - 1] + x[index + 1]
        tmp2 = x[index - n1] + x[index + n1] + x[index - n12]
        del_x[index] = x[index + n12] + tmp1 + tmp2
    end
end

Ariel Keselman

May 29, 2013, 8:06:20 AM
to julia...@googlegroups.com
I've timed GCC 4.6 vs the latest Julia Windows binaries, and the results are:

GCC: 2.99 secs
Julia 0.2: 4.69 secs

i.e. Julia is just ~1.56X slower.

This could all just be due to bounds checking.

Stefan Karpinski

May 29, 2013, 1:33:23 PM
to julia...@googlegroups.com
56% slower is likely due to bounds checking. I'm happy we could get this faster relatively easily, but we should, as we improve various things, not need quite as much tinkering to get comparable performance. It's nice that the tinkering is a) possible and b) relatively simple and transparent, but it would be even better if no tinkering were required at all.

John Tillinghast

May 30, 2013, 11:18:06 AM
to julia...@googlegroups.com
I wrote a vectorized version last night and couldn't even get it to 1/4 the speed of the fastest loopy version. This is disappointing: if I wanted to write C code, I'd write it in C. :)

At the same time, for a 1-d array vectorization can be much faster.

Is the Julia community planning to write loops for anything complicated, or will this be fixed for 1.0?
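(John's vectorized code isn't shown in the thread. For concreteness, here is my own sketch -- not John's actual code -- of what a fully vectorized 3D Laplacian might look like. Every slice and every `+` allocates a temporary array, which is exactly why this style lagged so far behind the explicit loops in 2013-era Julia.)

# Hypothetical vectorized version, for illustration only.
function laplace_vectorized!(del_x::Array{Float64, 3}, x::Array{Float64, 3})
    # Each slice copies data and each binary op allocates a new array.
    del_x[2:end-1, 2:end-1, 2:end-1] =
        x[3:end,   2:end-1, 2:end-1] + x[1:end-2, 2:end-1, 2:end-1] +
        x[2:end-1, 3:end,   2:end-1] + x[2:end-1, 1:end-2, 2:end-1] +
        x[2:end-1, 2:end-1, 3:end  ] + x[2:end-1, 2:end-1, 1:end-2] -
        6.0 * x[2:end-1, 2:end-1, 2:end-1]
end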

Gunnar Farnebäck

May 30, 2013, 11:48:21 AM
to julia...@googlegroups.com
On Wednesday, May 29, 2013 at 7:33:23 PM UTC+2, Stefan Karpinski wrote:
> 56% slower is likely due to bounds checking. I'm happy we could get this faster relatively easily, but we should, as we improve various things, not need quite as much tinkering to get comparable performance. It's nice that the tinkering is a) possible and b) relatively simple and transparent, but it would be even better if no tinkering were required at all.

Bounds checking may explain more than that.

This is dangerous... but fast!

function laplace(del_x::Array{Float64, 3}, x::Array{Float64, 3})
    n1, n2, n3 = size(x)
    n12 = n1 * n2
    xp = pointer(x)
    del_xp = pointer(del_x)
    for i3 = 2:n3-1, i2 = 2:n2-1, i1 = 2:n1-1
        index = i1 + n1 * ((i2 - 1) + n2 * (i3 - 1))
        t = -6.0 * unsafe_load(xp, index)
        t += unsafe_load(xp, index - 1)
        t += unsafe_load(xp, index + 1)
        t += unsafe_load(xp, index - n1)
        t += unsafe_load(xp, index + n1)
        t += unsafe_load(xp, index - n12)
        t += unsafe_load(xp, index + n12)
        unsafe_store!(del_xp, t, index)
    end
end

In fact I'm measuring nearly twice the speed compared to safe, bounds checked indexing.

Stefan Karpinski

May 30, 2013, 1:17:27 PM
to Julia Users
Our current performance for vectorized code is definitely not good enough and we will certainly fix it before 1.0. The following planned improvements will go a long way:
  1. Make array slicing create views rather than copies (like NumPy).
  2. Be more aggressive about coalescing operations like a+b+c+d into a single loop that only creates one temporary.
  3. Improve garbage collection so that temporaries in loops get reused more.
  4. Get better about automatic bounds check elimination and allow explicitly skipping bounds checks.
  5. Provide some syntax (.= maybe) for writing vectorized operations succinctly.
That's a lot of work, but it's all doable.
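
(To illustrate item 2 above -- this is my own sketch, not how Julia actually implements anything: evaluating a + b + c + d on arrays currently allocates an intermediate array per +, whereas a coalesced version would allocate only the result.)

a = rand(1000); b = rand(1000); c = rand(1000); d = rand(1000)

s1 = a + b + c + d      # builds (a + b), then ((a + b) + c), then the final array

s2 = similar(a)         # what a coalesced version would do: one allocation
for i = 1:length(a)
    s2[i] = a[i] + b[i] + c[i] + d[i]
end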

John Tillinghast

May 30, 2013, 4:38:41 PM
to julia...@googlegroups.com
Hang on. Do += etc. not work yet for vectorized code? I think I used it in at least one version. Is it slower for some reason?

Stefan Karpinski

May 30, 2013, 5:18:01 PM
to Julia Users
`v += w` does work for arrays, but it means `v = v + w`, which creates a new vector rather than modifying v in place. There are a couple of different open issues about this:


The trouble with doing something like making `x += y` mean `x = (+=)(x,y)`, thereby allowing the operation to modify a mutable x in place, is that it ruins the ability to write generic code for immutable x that works in the same side-effect-free way for mutable x.
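
(A small illustration of the difference -- my own example, not taken from the issues mentioned above:)

v = rand(1000)
w = rand(1000)

v += w                  # same as v = v + w: allocates a new array and rebinds v

for i = 1:length(v)     # updates the existing array in place, no allocation
    v[i] += w[i]
end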

Sam Kaplan

May 30, 2013, 10:29:01 PM
to julia...@googlegroups.com
Hello,

I think that it is worthwhile to attempt some kind of summary of the discussion.  To that end, I created a new gist:

https://gist.github.com/samtkaplan/5682557

In 'laplace_benchmark.jl', I benchmark the original laplace written in what I guess would be considered idiomatic Julia, along with versions from Jameson, Ariel, and Gunnar (in particular, the version from Gunnar that avoids bounds checking), as well as the C version.  Further, I should note that I am running the following version of Julia:

Version 0.2.0-1746.ra9f61cea
Commit a9f61cea6f 2013-05-30 15:32:48

Importantly, this version includes the commit that fixes issue https://github.com/JuliaLang/julia/issues/3206, which Jameson pointed to earlier in this thread.

Here are the results of running laplace_bench.jl on my computer:

Julia (idiomatic): elapsed time: 2.75631323 seconds
Julia (Jameson): elapsed time: 2.435454939 seconds
Julia (Ariel): elapsed time: 1.037012132 seconds
Julia (Gunnar, no bounds): elapsed time: 0.526309314 seconds
C (serial): elapsed time: 0.515237944 seconds

Here are my conclusions:
1) The fix for issue 3206 largely solves the problem.  However, Jameson's version of the code is faster (by a very small margin) than the idiomatic version, so I guess there is a small (tiny?) overhead associated with the recursion introduced by the fix for 3206, and it may be worthwhile to hard-code the "+()" operator for more than just 5 arguments.

2) There is a cost to indexing into an array using 3 dimensions, and it seems faster to index into the array using linear indexing (this is in contrast to what both John and I thought way back at the top of this thread).  I think that this is shown by the comparison of Jameson's and Ariel's versions of the code.  But, I may be missing something here.

3) As others have stated, bounds checking is important.  Gunnar's no-bounds checking version of the code seems to cut the execution time in half compared to Ariel's version, and, importantly, gives us a time that is comparable to the C version.

Thanks everyone!

Sam

John Myles White

May 30, 2013, 10:41:04 PM
to julia...@googlegroups.com
Thanks for the summary, Sam.

I have to say: it's pretty amazing that, once bounds checking is removed, Julia is able to perform basically as well as C.

 -- John

Stefan Karpinski

May 30, 2013, 10:50:54 PM
to Julia Users
That's a really great summary. Having real-world examples like this helps a lot with figuring out what optimizations to prioritize so that we can get C-like performance with less effort in the future. Great thread.

Ariel Keselman

May 31, 2013, 2:05:09 AM
to julia...@googlegroups.com
One more optimization: the latest Julia now beats GCC 4.6.2 significantly on my machine:

GCC: 2.86 secs
Julia: 2.28 secs

i.e. Julia is ~1.25X faster than GCC (!) with the following version, which makes less use of +=:

function laplace3!(del_x::Array{Float64, 3}, x::Array{Float64, 3})
    n1, n2, n3 = size(x)
    n12 = n1 * n2
    xp = pointer(x)
    del_xp = pointer(del_x)
    for i3 = 2:n3-1, i2 = 2:n2-1, i1 = 2:n1-1
        index = i1 + n1 * ((i2 - 1) + n2 * (i3 - 1))
        t = -6.0 * unsafe_load(xp, index) +
                   unsafe_load(xp, index - 1) +
                   unsafe_load(xp, index + 1) +
                   unsafe_load(xp, index - n1) +
                   unsafe_load(xp, index + n1) +
                   unsafe_load(xp, index - n12)
        # note: the x[index + n12] term is dropped here -- see Gunnar's reply below
        unsafe_store!(del_xp, t, index)
    end
end

Stefan Karpinski

May 31, 2013, 2:21:08 AM
to Julia Users
Woah, that's not exactly pretty code, but very cool that you can get substantially faster than GCC for this.

Gunnar Farnebäck

May 31, 2013, 10:40:17 AM
to julia...@googlegroups.com
I'm afraid that this optimization is not admissible. It's missing a plus sign so that the last term falls off. Adding it back in gives about the same speed as my solution but should still be faster than your gcc build.

Chris Foster

Jun 7, 2013, 8:57:25 AM
to julia...@googlegroups.com
On Tue, May 28, 2013 at 1:18 AM, Chris Foster <chri...@gmail.com> wrote:
> Is there any way of getting at a disassembly of the actual machine code by
> the way?

... answering my own question in the hope that it will be helpful to others:
yes, you can get a disassembly of the machine code for the actual hardware by
setting the optional asm argument of the disassemble function to true.
E.g.,

disassemble(laplace, (Array{Float64, 3}, Array{Float64, 3}), true)

~Chris

Stefan Karpinski

Jun 7, 2013, 9:37:29 AM
to Julia Users
This was *very* recently added – like a couple days ago. We'll probably change the interface a bit.

Chris Foster

Jun 7, 2013, 10:04:52 AM
to julia...@googlegroups.com
On Fri, Jun 7, 2013 at 11:37 PM, Stefan Karpinski <ste...@karpinski.org> wrote:
> This was *very* recently added – like a couple days ago. We'll probably
> change the interface a bit.

Ah, and here I was feeling stupid for asking the original question.
Whichever interface you settle on, it's a very useful feature (cheers
ihnorton!)

Stefan Karpinski

Jun 7, 2013, 10:06:06 AM
to Julia Users
It is. It's very cool to have. The native disassembly is often shorter and easier to read than LLVM.

Sam Kaplan

Jul 13, 2013, 12:47:52 PM
to julia...@googlegroups.com
Hello,

Since Julia has gained the "@inbounds" macro, I decided to revisit this topic.  I made the following gist:

https://gist.github.com/samtkaplan/5991236

There are three functions for computing the 3D Laplacian:

1. laplace_idiomatic -- this function accesses the arrays using 3D indexing.
2. laplace_inbounds -- this is the same as 'laplace_idiomatic', except that we turn off array bounds checking using the @inbounds macro.
3. laplace_inbounds_linear -- this uses the @inbounds macro, and accesses the array using 1D indexing.
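
(For readers who don't open the gist, here is a sketch of what the @inbounds, linearly indexed variant might look like, modeled on Ariel's laplace7! from earlier in the thread; the exact code in the gist may differ.)

function laplace_inbounds_linear(del_x::Array{Float64, 3}, x::Array{Float64, 3})
    n1, n2, n3 = size(x)
    n12 = n1 * n2
    for i3 = 2:n3-1, i2 = 2:n2-1, i1 = 2:n1-1
        # explicit column-major linear index
        index = i1 + n1 * ((i2 - 1) + n2 * (i3 - 1))
        @inbounds begin
            tmp1 = -6.0 * x[index] + x[index - 1] + x[index + 1]
            tmp2 = x[index - n1] + x[index + n1] + x[index - n12]
            del_x[index] = x[index + n12] + tmp1 + tmp2
        end
    end
end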

Here are the timings on my computer:

Julia (idiomatic): elapsed time: 2.741377846 seconds
Julia (inbounds): elapsed time: 1.790284902 seconds
Julia (inbounds, linear): elapsed time: 0.542571549 seconds

The speed-up that we get using the @inbounds macro is great!  I guess it also exposes the next layer of the onion -- why is indexing into the array using all three dimensions approximately 3X slower than indexing into the array with one dimension?  Should I file an issue on GitHub for this?

Thanks!

Sam


Stefan Karpinski

Jul 13, 2013, 12:55:33 PM
to Julia Users
Very nice results! Multidimensional indexing needs to apply a linear transform to the indices to compute a linear index into the array data. When you just use linear indexing, that is avoided entirely. Still, it may be possible to get the cost of that down from 3x, which is pretty bad. In theory with sufficiently good code gen and optimization, you could generate the equivalent linear indexing code in the 3d case, but that's pretty hard to do.
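
(As a concrete illustration of that linear transform -- my own example, not from the thread: for a column-major Array{Float64, 3} of size (n1, n2, n3), x[i1, i2, i3] and the explicitly computed linear index used earlier in this thread address the same element.)

x = rand(4, 5, 6)
n1, n2, n3 = size(x)
i1, i2, i3 = 2, 3, 4
index = i1 + n1 * ((i2 - 1) + n2 * (i3 - 1))   # the transform applied by 3D indexing
@assert x[i1, i2, i3] == x[index]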

Sam Kaplan

Jul 13, 2013, 1:01:00 PM
to julia...@googlegroups.com
Thanks.  The thing is, in the function 'laplace_inbounds_linear', I am explicitly computing the linear index.  If I understand you correctly, Julia is doing a similar computation to find the linear index.  Why is the explicit computation of the linear index in 'laplace_inbounds_linear' so much faster than the implicit "linear transform" that you mention?

Sam

Stefan Karpinski

Jul 13, 2013, 1:12:29 PM
to Julia Users
Ah, yes – didn't look at the code. It's possible that explicitly computing the linear indices is allowing the optimizer to factor out the common computations there in a way that doesn't happen in the implicit version. But clearly this is a performance issue and worth looking into – please do file an issue.

Sam Kaplan

Jul 14, 2013, 9:22:32 AM
to julia...@googlegroups.com
Sounds good.  For reference, here is a link to the issue:

https://github.com/JuliaLang/julia/issues/3712

I spent a little time trying to figure out where to look in the Julia code to try to debug the issue, but I got a little lost in the weeds.  Any hints on where to go looking?

Sam Kaplan

Jul 14, 2013, 3:23:37 PM
to julia...@googlegroups.com
It looks like Jeff B. has fixed this. You guys work really fast!!  For the record, here are the new timings:

Julia (idiomatic): elapsed time: 1.828987957 seconds
Julia (inbounds): elapsed time: 0.581809846 seconds
Julia (inbounds, linear): elapsed time: 0.547825128 seconds

Still a tiny overhead in using the 3D indexing, but way better than before.  Great work!!

Sam

PS.  It's fun to note that with this version of Julia, the non-@inbounds version is about as fast as the @inbounds version was with the previous version of Julia.

Stefan Karpinski

Jul 14, 2013, 3:40:36 PM
to Julia Users
That's a remarkable improvement. Amazing work by Jeff, as usual. Thanks, Sam, for bringing up the issue in the first place, and thanks to everyone for doing such great investigation of the performance problem.