Allocations in numerical linear algebra in v0.4

191 views
Skip to first unread message

Douglas Bates

unread,
May 9, 2016, 2:04:05 PM5/9/16
to julia-dev
In v0.4 I am encountering consdierable storage allocation and garbage collection in linear algebra operations that I think should be allocation free.  By way of background I am working with block-diagonal matrices in which all the diagonal blocks are square and of the same size.  In the particular case I am considering there are 960 blocks of size 3 by 3.  I store the diagonal blocks as an Array(Float64, (3, 3, 960)).  The type is called MixedModels.HBLkDiag (homoegeneous block diagonal) and the 3D-array is the arr member of the type.

An associated type is MixedModels.VectorReMat which contains a matrix with, in this case, 3 rows and  25,498 columns, and a DataArrays.PooledDataVector whose refs are of length 25,498 and whose pool is of length 960.

During iterations I need to update the contents of the HBlkDiag by taking the ith column, say c,  of the VectorReMat, forming c * c' and adding that to the refs[i]'th 3 by 3 diagonal block.  The current version of the code is 

function Base.Ac_mul_B!{T}(C::HBlkDiag{T}, A::VectorReMat{T}, B::VectorReMat{T})
    c, a, r = C.arr, A.z, A.f.refs
    fill!(c, 0)
    if !is(A, B)
        throw(ArgumentError("Currently defined only for A === B"))
    end
    for i in eachindex(r)
        Base.BLAS.syr!('U', 1.0, slice(a, :, i), sub(c, :, :, Int(r[i])))
    end
    for k in 1:size(c, 3)
        Base.LinAlg.copytri!(sub(c, :, :, k), 'U')
    end
    C
end

I thought I was being careful by using sub or slice (I think the effect is the same in this case) but apparently not.  When I track allocations while fitting the model being represented here I get

        - function Base.Ac_mul_B!{T}(C::HBlkDiag{T}, A::VectorReMat{T}, B::VectorReMat{T})
  1503384     c, a, r = C.arr, A.z, A.f.refs
        0     fill!(c, 0)
        0     if !is(A, B)
        0         throw(ArgumentError("Currently defined only for A === B"))
        -     end
        0     for i in eachindex(r)
-1352727904         Base.BLAS.syr!('U', 1.0, slice(a, :, i), sub(c, :, :, Int(r[i])))
        -     end
        0     for k in 1:size(c, 3)
1016371200         Base.LinAlg.copytri!(sub(c, :, :, k), 'U')
        -     end
        0     C
        - end

As far as the copytri! call goes, I can write 3 nested loops to achieve the copy without all this allocation.  I suppose I can also replace the syr! call with explicit loops.  It turns out that there are a lot of places like this that I would need to back off Lapack/BLAS calls and run explicit loops if I want to make this fast in v 0.4

Am I missing something here?  Is there a way to do the programming at a higher level and not suffer all these allocations?  Should I expect this problem to go away after the Arraypocalypse?

Yichao Yu

unread,
May 9, 2016, 2:20:50 PM5/9/16
to Julia Dev
slice and sub allocates, they should be cheaper than copying a big
array though. While we are trying to add stack allocation for these
objects, it's relatively hard to make a escaped object stack
allocated.

If you care about the performance of the loop, you should benchmark
the cost of the creation of these allocations and the efficiency of
calling blas function on subarray.

Douglas Bates

unread,
May 9, 2016, 2:25:35 PM5/9/16
to julia-dev
If I rewrite that function as

function Base.Ac_mul_B!{T}(C::HBlkDiag{T}, A::VectorReMat{T}, B::VectorReMat{T})
    c, a, r = C.arr, A.z, A.f.refs
    _, m, n = size(c)
    fill!(c, 0)
    if !is(A, B)
        throw(ArgumentError("Currently defined only for A === B"))
    end
    for k in eachindex(r)
        ri = Int(r[k])
        for j in 1 : m
            aj = a[j, k]
            c[j, j, ri] += abs2(aj)
            for i in 1 : j - 1
                aij = a[i, k] * aj
                c[i, j, ri] += aij
                c[j, i, ri] += aij
            end
        end
    end
    C
end

my execution time goes from 110 seconds to 68 seconds.  There are a few other places where I can do similar replacement of function calls by explicit loops, even though that is aesthetically unpleasant, to make the model fitting faster.

Andreas Noack

unread,
May 9, 2016, 2:32:25 PM5/9/16
to juli...@googlegroups.com
Death by abstraction. In a world of complete inlining, I guess it shouldn't actually be necessary to allocate the `SubArray`s at all. sub and slice are just convenient ways of calculating a pointer address and two integers and then pass them to BLAS. I think it would be more important to figure out how to completely eliminate the creation of the `SubArray`s instead of how to stack allocate them...but I'm probably missing something.

Yichao Yu

unread,
May 9, 2016, 2:40:42 PM5/9/16
to Julia Dev
On Mon, May 9, 2016 at 2:32 PM, Andreas Noack
<andreasno...@gmail.com> wrote:
> Death by abstraction. In a world of complete inlining, I guess it shouldn't
> actually be necessary to allocate the `SubArray`s at all. sub and slice are
> just convenient ways of calculating a pointer address and two integers and
> then pass them to BLAS. I think it would be more important to figure out how
> to completely eliminate the creation of the `SubArray`s instead of how to
> stack allocate them...but I'm probably missing something.

If the functions using subarray are inlined, it'll be easy to
eliminate their allocation.
Or of course, you can pass in the stride to the blas wrapper functions
separately.

Douglas Bates

unread,
May 9, 2016, 2:42:29 PM5/9/16
to julia-dev
Thanks for the responses.  I just wanted to check that I wasn't doing something dumb or missing an obvious way to circumvent the problem.

Thanks to the track-allocation option when running Julia I can easily find and eliminate the allocation bottlenecks.

Tim Holy

unread,
May 9, 2016, 5:54:09 PM5/9/16
to juli...@googlegroups.com
I definitely agree that the goal should be to get it to the point where LLVM
generates the "trivial" code. My understanding is that the heap allocation
interferes with LLVM's ability to realize how simple the problem really is,
and that implementing stack allocation may be an important step along the way
towards eliding the creation of the SubArray.

But I'm pretty fuzzy on such things, so I could be wrong.

Best,
--Tim
Reply all
Reply to author
Forward
0 new messages