Difference between plus and dot-plus operators with SubArrays


Matthew Frank

Feb 17, 2014, 1:44:12 PM2/17/14
to julia...@googlegroups.com
I have been trying to use SubArrays to reduce the number of matrix copies that my code is doing, but I've been surprised to find that replacing slices and extra copies with SubArrays seems to slow things down.

My first question is what is the difference between the "+" operator and the (AFAIK undocumented) ".+" operator for SubArrays?  .+ seems to be about 3x faster, although the output seems to be the same.  (img is a 2048x3072 Array{Float32,2}).

julia> @time result_no_dot = sub(img,:,1:3072-2) + sub(img,:,3:3072)
elapsed time: 0.182192137 seconds (25151376 bytes allocated)

julia> @time result_with_dot = sub(img,:,1:3072-2) .+ sub(img,:,3:3072)
elapsed time: 0.060138815 seconds (25152360 bytes allocated)

julia> result_no_dot == result_with_dot
true

I also don't understand what the profiler is telling me.  With the "+" operator I get:
9  multidimensional.jl; getindex; line: 119
1  multidimensional.jl; getindex; line: 120
83 multidimensional.jl; getindex; line: 123
15 multidimensional.jl; getindex; line: 124
2  multidimensional.jl; getindex; line: 125
26 multidimensional.jl; getindex; line: 127
              30 profile.jl; anonymous; line: 14
                21 array.jl; +; line: 756
                7  multidimensional.jl; getindex; line: 119
While with the .+ operator I get the much clearer (to me):
3  abstractarray.jl; size; line: 9
              53 profile.jl; anonymous; line: 14
                53 broadcast.jl; .+; line: 148
                  12 broadcast.jl; broadcast!; line: 92
                  41 broadcast.jl; broadcast!; line: 101
                      3  abstractarray.jl; size; line: 9
                      19 broadcast.jl; _F_; line: 71
                      9  broadcast.jl; _F_; line: 72
                      1  broadcast.jl; _F_; line: 256
                      9  broadcast.jl; _F_; line: 257


My second question is, is there a way to avoid doing any allocation at all (short of de-vectorizing)?  Even when I preallocate an array for the result, the "+" and ".+" operators (or maybe the assignment operator) seem to create an extra copy of the array:

julia> my_result = Array(Float32, 2048, 3070)
julia> @time my_result = sub(img,:,1:3072-2) .+ sub(img,:,3:3072)
elapsed time: 0.059179547 seconds (25152360 bytes allocated)

I can't seem to get that extra 25Mbytes of allocation to go away.


My third question is what is the slicing operator doing that makes it so much faster than the sub operator, even though it has to make an extra copy?

julia> @time result_with_slicing = img[:,1:3072-2] + img[:,3:3072]
elapsed time: 0.029242747 seconds (75448992 bytes allocated)

That's another 2x faster than subarrays with the ".+" operator, but allocates 75M instead of 25M.

And the profile is:
              20 profile.jl; anonymous; line: 14
                2  array.jl; +; line: 754
                7  array.jl; +; line: 756
                11 multidimensional.jl; getindex; line: 47
                 11 multidimensional.jl; _getindex!; line: 33

Thanks,
-Matt

Tim Holy

Feb 17, 2014, 4:43:34 PM2/17/14
to julia...@googlegroups.com
Hi Matthew,

Great questions.

On Monday, February 17, 2014 10:44:12 AM Matthew Frank wrote:
> I have been trying to use SubArrays to reduce the number of matrix copies
> that my code is doing, but I've been surprised to find that replacing
> slices and extra copies with SubArrays seems to slow things down.

Julia's handling of SubArrays has long been...suboptimal, but it's getting
better. If you had tried this a year ago, you would have seen slowdowns of
30x. There's still room for improvement, however. This is why A[indx1, indx2]
currently returns a copy rather than a SubArray.

> My first question is what is the difference between the "+" operator and
> the (AFAIK undocumented) ".+" operator for SubArrays? .+ seems to be about
> 3x faster, although the output seems to be the same. (img is a 2048x3072
> Array{Float32,2}).
>
> julia> @time result_no_dot = sub(img,:,1:3072-2) + sub(img,:,3:3072)
> elapsed time: 0.182192137 seconds (25151376 bytes allocated)
>
> julia> @time result_with_dot = sub(img,:,1:3072-2) .+ sub(img,:,3:3072)
> elapsed time: 0.060138815 seconds (25152360 bytes allocated)

Best way to diagnose this kind of thing is to use @which:
julia> @which sub(img,:,1:3072-2) + sub(img,:,3:3072)
+{S,T}(A::Union(DenseArray{S,N},SubArray{S,N,A<:DenseArray{T,N},I<:
(Union(Range{Int64},Range1{Int64},Int64)...,)}),B::Union(DenseArray{T,N},SubArray{T,N,A<:DenseArray{T,N},I<:
(Union(Range{Int64},Range1{Int64},Int64)...,)})) at array.jl:754

julia> @which sub(img,:,1:3072-2) .+ sub(img,:,3:3072)
.+(As::AbstractArray{T,N}...) at broadcast.jl:148

So as you surmised, they're calling two different functions. All "dot"
operators are broadcasting, so the second isn't a surprise.

The issue with the first can be traced to these lines:
for i = 1:length(A)
    @inbounds F[i] = ($f)(A[i], B[i])
end
Here f = +, so it's summing them. The problem is that linear indexing (A[i])
is slow for SubArrays.

We're partway through a transition to Cartesian indexing, but not all
functions have been fixed yet.
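To make the distinction concrete, here is a sketch (illustrative names, not the actual Base code) of what the Cartesian-indexing alternative looks like for the 2-d case. Indexing a SubArray as `A[i,j]` maps directly onto the parent array's strides, whereas `A[i]` must first convert the linear index back into subscripts:

```julia
# Sketch: summing two 2-d arrays/SubArrays with explicit two-index
# (Cartesian-style) indexing, which sidesteps the slow linear-indexing
# path for SubArrays. A, B, and F are assumed to have matching sizes.
function add2d!(F, A, B)
    for j = 1:size(A,2), i = 1:size(A,1)
        F[i,j] = A[i,j] + B[i,j]
    end
    return F
end
```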

>
> julia> result_no_dot == result_with_dot
> true
>
> I also don't understand what the profiler is telling me. With the "+"
> operator I get:
>
> 9 multidimensional.jl; getindex; line: 119
> 1 multidimensional.jl; getindex; line: 120
> 83 multidimensional.jl; getindex; line: 123
> 15 multidimensional.jl; getindex; line: 124
> 2 multidimensional.jl; getindex; line: 125
> 26 multidimensional.jl; getindex; line: 127
> 30 profile.jl; anonymous; line: 14
> 21 array.jl; +; line: 756
> 7 multidimensional.jl; getindex; line: 119

Ugh, you've gotten bitten by a bad case of truncated backtraces (issue #3469).
That indeed makes this pretty hard to interpret. These should go away with the
MCJIT work that's ongoing (but it's a herculean effort).

> My second question is, is there a way to avoid doing any allocation at all
> (short of de-vectorizing)? Even when I preallocate an array for the
> result, the "+" and ".+" operators (or maybe the assignment operator) seem
> to create an extra copy of the array:

Sure, just "devectorize", e.g.,
    for j = 1:size(A,2), i = 1:size(A,1)
        my_result[i,j] = A[i,j] + B[i,j]
    end

> My third question is what is the slicing operator doing that makes it so
> much faster than the sub operator, even though it has to make an extra copy?

It's all because everything is better optimized for Arrays than SubArrays
right now. You could try the devectorized loop and add an @inbounds (after
first doing suitable bounds-checking) and see if it makes it better.
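Putting those two suggestions together, an allocation-free version of your original expression might look like this (a sketch; `addcols!` is an illustrative name, and the explicit size check is what justifies the `@inbounds`):

```julia
# Sketch: devectorized, allocation-free sum into a preallocated result.
# The up-front size check makes the @inbounds annotation safe.
function addcols!(result, A, B)
    size(result) == size(A) == size(B) || error("size mismatch")
    for j = 1:size(A,2), i = 1:size(A,1)
        @inbounds result[i,j] = A[i,j] + B[i,j]
    end
    return result
end

# e.g., filling the preallocated my_result with no extra 25MB copy:
# addcols!(my_result, sub(img,:,1:3072-2), sub(img,:,3:3072))
```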

--Tim

Matthew Frank

Feb 18, 2014, 11:47:34 AM2/18/14
to julia...@googlegroups.com


On Monday, February 17, 2014 3:43:34 PM UTC-6, Tim Holy wrote:
> > My second question is, is there a way to avoid doing any allocation at all
> > (short of de-vectorizing)?  Even when I preallocate an array for the
> > result, the "+" and ".+" operators (or maybe the assignment operator) seem
> > to create an extra copy of the array:
>
> Sure, just "devectorize", e.g.,
>     for j = 1:size(A,2), i = 1:size(A,1)
>         my_result[i,j] = A[i,j] + B[i,j]
>     end
>
> > My third question is what is the slicing operator doing that makes it so
> > much faster than the sub operator, even though it has to make an extra copy?
>
> It's all because everything is better optimized for Arrays than SubArrays
> right now. You could try the devectorized loop and add an @inbounds (after
> first doing suitable bounds-checking) and see if it makes it better.

Thanks for all the help!  I was going to save the devectorization question for another post,
but since you brought it up...  :)

The devectorized version of this loop takes about 14ms and allocates just 48 bytes.
And @inbounds reduces that to about 10ms.  (For comparison g++ -O2 gives 3.8ms or 4.1ms
depending on whether you are summing neighboring columns or rows, and g++ -O3 (turns on
simd vectorization) gives 2.4ms or 3.9ms).

But what's the intuition behind when I should devectorize and when I shouldn't?
I was modeling my code after the code in the Images package, very little of which is
devectorized.  And when I try to devectorize some of it the result is awful.

For example, Images' algorithm.jl has:

function hsi2rgb{T}(img::Array{T})
    H = img[:,:,1]*(2pi)
    S = img[:,:,2]
    I = img[:,:,3]
    R = zeros(T, size(img,1), size(img,2))
    G = zeros(T, size(img,1), size(img,2))
    B = zeros(T, size(img,1), size(img,2))
    RG = 0 .<= H .< 2*pi/3
    GB = 2*pi/3 .<= H .< 4*pi/3
    BR = 4*pi/3 .<= H .< 2*pi
    # RG sector
    B[RG] = I[RG].*(1 - S[RG])
    R[RG] = I[RG].*(1 + (S[RG].*cos(H[RG]))./cos(pi/3 - H[RG]))
    G[RG] = 3*I[RG] - R[RG] - B[RG]
    # GB sector
    R[GB] = I[GB].*(1 - S[GB])
    G[GB] = I[GB].*(1 + (S[GB].*cos(H[GB] - pi/3))./cos(H[GB]))
    B[GB] = 3*I[GB] - R[GB] - G[GB]
    # BR sector
    G[BR] = I[BR].*(1 - S[BR])
    B[BR] = I[BR].*(1 + (S[BR].*cos(H[BR] - 2*pi/3))./cos(-pi/3 - H[BR]))
    R[BR] = 3*I[BR] - G[BR] - B[BR]
    return cat(3, R, G, B)
end

On a 2048x3072 image this takes 1.4 seconds.

But here's my devectorized version of hsi2rgb().  It takes 7.2 seconds!  So either I did something
completely wrong in the devectorization (quite likely), or sometimes devectorization is
a bad idea performance-wise.  How did you know that hsi2rgb() should _not_ be devectorized?

function devect{T}(img::Array{T})
    H = img[:,:,1]*(2pi)
    S = img[:,:,2]
    I = img[:,:,3]
    rows, cols = size(H)
    R = Array(T, size(img,1), size(img,2))
    G = Array(T, size(img,1), size(img,2))
    B = Array(T, size(img,1), size(img,2))
    for c = 1:cols
        for r = 1:rows
            hue = H[r,c]
            sat = S[r,c]
            int = I[r,c]
            if 0 <= hue < 2*pi/3
                # RG sector
                B[r,c] = int * (1 - sat)
                R[r,c] = int * (1 + (sat * cos(hue)) / cos(pi/3 - hue))
                G[r,c] = 3*int - R[r,c] - B[r,c]
            elseif 2*pi/3 <= hue < 4*pi/3
                # GB sector
                R[r,c] = int * (1 - sat)
                G[r,c] = int * (1 + (sat * cos(hue - pi/3)) / cos(hue))
                B[r,c] = 3*int - R[r,c] - G[r,c]
            elseif 4*pi/3 <= hue < 2*pi
                # BR sector
                G[r,c] = int * (1 - sat)
                B[r,c] = int * (1 + (sat * cos(hue - 2*pi/3)) / cos(-pi/3 - hue))
                R[r,c] = 3*int - G[r,c] - B[r,c]
            else
                # hue is outside range
                R[r,c] = 0
                G[r,c] = 0
                B[r,c] = 0
            end
        end
    end
    return cat(3, R, G, B)
end


Tim Holy

Feb 18, 2014, 3:26:11 PM2/18/14
to julia...@googlegroups.com
On Tuesday, February 18, 2014 08:47:34 AM Matthew Frank wrote:
> The devectorized version of this loop takes about 14ms and allocates just
> 48 bytes.
> And @inbounds reduces that to about 10ms. (For comparison g++ -O2 gives
> 3.8ms or 4.1ms
> depending on whether you are summing neighboring columns or rows, and g++
> -O3 (turns on
> simd vectorization) gives 2.4ms or 3.9ms).

Glad to hear it's working better. Presumably with Arrays you'd be
approximately matching g++, but with SubArrays (even using Cartesian indexing)
you're currently doomed to a performance hit of 2-3 fold, due (I think) to a
lack of inlining. Still, not bad to be so close with a type that was formerly
infamous for performance problems.

> But what's the intuition behind when I should devectorize and when I
> shouldn't?
> I was modeling my code after the code in the Images package, very little of
> which is
> devectorized. And when I try to devectorize some of it the result is awful.
>
> For example, Images' algorithm.jl has:
>
> function hsi2rgb{T}(img::Array{T})
[snip]
> On a 2048x3072 image this takes 1.4 seconds.

Ugh, this is one of the old algorithms, dating from before I even discovered
Julia---notice it assumes outright that color is stored in the third
dimension. Most of the "modern" algorithms are devectorized, and you can tell
because so many of them are initially a little hard to understand because they
have @nloops in them :-). (See the Cartesian.jl documentation if you want to
understand how they work.)

Since all the images I personally care about are monochrome, the colorspace-
conversion part of Images has gotten zero love. You may get better mileage
from the immutable representations in Color.jl. The documentation for imread
(https://github.com/timholy/Images.jl/blob/master/doc/function_reference.md#image-
io) describes how you can read an image file using a particular ColorValue
representation, and other parts of the documentation describe how you can
convert it on-the-fly. But I suspect that not all representations are available
in Color.jl.

Alternatively, feel free to improve the versions in Images!

> But here's my devectorized version of hsi2rgb(). It takes 7.2 seconds! So
> either I did something
> completely wrong in the devectorization (quite likely), or sometimes
> devectorization is
> a bad idea performance-wise.

You almost surely have a type-inference problem. I haven't played with it
much, but Leah Hanson's TypeCheck.jl may tell you what you need to know. The
profiler (see description in the stdlib documentation) may also help you.

> How did you know that hsi2rgb() should _not_
> be devectorized?

I can't think of a single circumstance when devectorizing should make
something slower (well, unless the vectorized version has been written with
fancy features, e.g., parallelism). After all, all the vectorized algorithms
are written internally in devectorized state---remember that unlike Matlab,
looping is fast in Julia, and consequently there does not need to be any
"internal magic" (e.g., "make C do the work!") to most algorithms.

--Tim

Matthew Frank

Feb 18, 2014, 6:14:49 PM2/18/14
to julia...@googlegroups.com


On Tuesday, February 18, 2014 2:26:11 PM UTC-6, Tim Holy wrote:
> On Tuesday, February 18, 2014 08:47:34 AM Matthew Frank wrote:
> > But here's my devectorized version of hsi2rgb().  It takes 7.2 seconds!  So
> > either I did something
> > completely wrong in the devectorization (quite likely), or sometimes
> > devectorization is
> > a bad idea performance-wise.
>
> You almost surely have a type-inference problem. I haven't played with it
> much, but Leah Hanson's TypeCheck.jl may tell you what you need to know. The
> profiler (see description in the stdlib documentation) may also help you.

You are correct.  Declaring the H, S and I arrays as Array{T,2} makes the runtime of the
devectorized version drop to 0.3 sec (20x speedup over the badly typed devectorized version and
more than 4x faster than the version currently in algorithms.jl).  I don't understand Julia's type
inference well enough yet to explain what was going wrong.
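For the record, the change amounts to annotating the three locals at the top of the function (only the changed lines shown; this is the fix I applied, with the rest of devect() unchanged):

```julia
# Typed local declarations pin down the element type for the loop body,
# so the compiler no longer has to treat H, S, I as untyped locals.
H::Array{T,2} = img[:,:,1]*(2pi)
S::Array{T,2} = img[:,:,2]
I::Array{T,2} = img[:,:,3]
```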

I'll take a look at Color.jl and then see what I can contribute.

Thanks,
-Matt

Tim Holy

Feb 18, 2014, 6:25:01 PM2/18/14
to julia...@googlegroups.com
On Tuesday, February 18, 2014 03:14:49 PM Matthew Frank wrote:
> You are correct. Declaring the H, S and I arrays as Array{T,2} makes the
> runtime of the
> devectorized version drop to 0.3 sec (20x speedup over the badly typed
> devectorized version and
> more than 4x faster than the version currently in algorithms.jl).

Nice! But I'm actually surprised that happened--it seems the type should have
been inferred. It may be worth submitting an issue.

Note that for H, the type will be Array{T,2} only for T=Float64 (because of
the 2pi). So your type declaration could generate errors in some
circumstances. You can generalize it with promote_type, or even better using
typeof(one(T)*2pi).
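A sketch of what that looks like in practice (assuming the annotated devect() above; `HT` is an illustrative name). The point is that `Float32 * Float64` promotes to `Float64`, so the honest element type of H is whatever `one(T)*2pi` produces:

```julia
# Compute the element type that multiplication by 2pi actually yields,
# rather than assuming it stays T (it is Float64 when T == Float32).
HT = typeof(one(T)*2pi)
H = (img[:,:,1]*(2pi))::Array{HT,2}   # type-assert with the promoted type
```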

> I'll take a look at Color.jl and then see what I can contribute.

Thanks in advance!

--Tim
