Modified Gram-Schmidt: getting the best out of Julia


Ján Dolinský

unread,
Oct 21, 2014, 4:37:28 AM10/21/14
to julia...@googlegroups.com
Hello,

I have been using Julia for a couple of weeks now. I did some benchmarking, like computing a dot product etc., in Julia and Octave, but I soon realized that I was benchmarking the BLAS/LAPACK implementations of elementary algebra functions rather than Julia or Octave itself.

To learn Julia further, I tried to write the Modified Gram-Schmidt procedure (MGS) in Julia and Octave. First, I would like to write a serial MGS properly, and then perhaps move on to a parallelized version (and release it as open source learning material).

My attempt in Octave is as follows:
function [X, R] = mgs(X)

[n p] = size(X);
R = eye(p);

for l=1:p
    R(l,l+1:p)=(X(:,l)'*X(:,l+1:p))/(X(:,l)'*X(:,l));
    X(:,l+1:p)=X(:,l+1:p)-X(:,l)*R(l,l+1:p);
end

X = rand(1000);
octave:18> tic; [Q, R] = mgs(X); toc;
Elapsed time is 3.99396 seconds.

I rewrote the above in Julia:
using ArrayViews

function mgs(X)

  n, p = size(X)
  R = eye(nvars)

  for l=1:nvars-1
    R[l,l+1:p]=(view(X,:,l)'*view(X,:,l+1:nvars))/sumabs2(view(X,:,l))
    X[:,l+1:p]=view(X,:,l+1:nvars)-view(X,:,l)*view(R,l,l+1:p)
  end

  X, R

end

X = rand(1000,1000);
@time Q,R = mgs(X);
elapsed time: 7.986772918 seconds (8085500312 bytes allocated, 34.69% gc time)

At the moment, my Julia implementation is a lot slower, but I assume my understanding of Julia and Julian writing style can be improved a lot :).

Thanks in advance for any hints.

Best Regards,
Jan

Steven G. Johnson

unread,
Oct 21, 2014, 11:16:10 AM10/21/14
to julia...@googlegroups.com
Just write out the three nested loops of MGS, rather than using array operations.  It will be a lot faster (in Julia).
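
A hypothetical sketch of what the fully devectorized version might look like (same algorithm as the first post: the diagonal of R stays 1 and the columns of X end up orthogonal but unnormalized; the function name mgs_loops! is made up here):

```julia
# Sketch of a fully devectorized MGS: plain loops, no temporary arrays.
function mgs_loops!(X)
    n, p = size(X)
    R = zeros(p, p)
    for l = 1:p
        R[l, l] = 1.0
    end
    for l = 1:p
        ss = 0.0                       # squared norm of column l
        for i = 1:n
            ss += X[i, l]^2
        end
        for j = l+1:p
            d = 0.0                    # dot product of columns l and j
            for i = 1:n
                d += X[i, l] * X[i, j]
            end
            r = d / ss
            R[l, j] = r
            for i = 1:n                # subtract the projection in place
                X[i, j] -= r * X[i, l]
            end
        end
    end
    return X, R
end
```

Because the subtraction happens element by element in place, no intermediate arrays are allocated inside the loop.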

Douglas Bates

unread,
Oct 21, 2014, 12:43:03 PM10/21/14
to julia...@googlegroups.com
Also, in your Julia code you have used nvars without defining it.  You probably intended
n,nvars=size(X)



Ján Dolinský

unread,
Oct 22, 2014, 4:33:37 AM10/22/14
to julia...@googlegroups.com
Yes, indeed. It should be p. Thanks.

On Tuesday, October 21, 2014 at 18:43:03 UTC+2, Douglas Bates wrote:

Ján Dolinský

unread,
Oct 22, 2014, 5:27:30 AM10/22/14
to julia...@googlegroups.com
Thanks. I'll try to use loops too.

While still exploring my vectorized code I found out that matrix-vector multiplication is a lot faster when going column-wise, e.g.

X = rand(8000,8000);
a = rand(8000);
julia> @time r = a' * X;
elapsed time: 0.131689512 seconds (128224 bytes allocated)
going row-wise (slow)

julia> @time r = X' * a;
elapsed time: 0.057772737 seconds (64168 bytes allocated)
going column-wise (fast)

I therefore rewrote
R[l,l+1:p]=(view(X,:,l)'*view(X,:,l+1:nvars))/sumabs2(view(X,:,l))
into
R[l,l+1:p]=(view(X,:,l+1:nvars)'*view(X,:,l))/sumabs2(view(X,:,l))

and I got a little increase in performance
julia> @time Q,R = mgs(X);
elapsed time: 6.815198093 seconds (8017005584 bytes allocated, 39.99% gc time)

The built-in qr function does the job a lot faster. I'd like to approach that speed eventually.
julia> @time Q,R = qr(X);
elapsed time: 0.542084044 seconds (32580000 bytes allocated, 5.41% gc time)

I wonder why my vectorized code is not fast. I assume operations like "X' * a" are using BLAS (e.g. BLAS.gemv()) and thus should be as fast as devectorized code, at least in this case.

Thanks,
Jan





On Tuesday, October 21, 2014 at 17:16:10 UTC+2, Steven G. Johnson wrote:

Tim Holy

unread,
Oct 22, 2014, 5:56:31 AM10/22/14
to julia...@googlegroups.com
On Wednesday, October 22, 2014 02:27:30 AM Ján Dolinský wrote:
> I wonder why my vectorized code is not fast. I assume operations like "X' *
> a" are using BLAS (e.g. BLAS.gemv()) and thus it should be as fast as a
> devectorized code at least in this case.

Yes, but the result requires allocating a temporary, and temporaries mean
garbage collection. Try:
- Pre-allocating a couple of temp vectors to store the results of your matrix
multiplies, and using functions like `Ac_mul_B!(dest, A, B)` to calculate A'*B
and store the result in dest.
- Doing the subtractions & divisions in a loop, as suggested by Steven.
- Calling `view(X, :, l)` just once per iteration and storing the
result in a variable---no point recreating the same view 3 times per
iteration.
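
For illustration, the preallocation pattern looks like this (a sketch, not the thread's original code; mul! from the LinearAlgebra standard library is the later spelling of Ac_mul_B!):

```julia
using LinearAlgebra  # provides mul! in current Julia

A = rand(4, 3)
b = rand(4)
dest = zeros(3)               # preallocated once, reused on every iteration
mul!(dest, transpose(A), b)   # computes A'*b into dest without allocating a new vector
```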

--Tim

Ján Dolinský

unread,
Oct 22, 2014, 10:09:14 AM10/22/14
to julia...@googlegroups.com


On Wednesday, October 22, 2014 at 11:56:31 UTC+2, Tim Holy wrote:

Hello Tim,

Thanks for the tips. I rewrote the mgs routine a bit so that it is more readable, and did some profiling.

using ArrayViews

function mgs(X)
# mgs speed project

  nobs, nvars = size(X)

  R = eye(nvars)

  for l=1:nvars-1

    v_i = view(X,:,l+1:nvars) # --- line 10
    s = view(X,:,l);

    R[l,l+1:nvars]=(v_i'*s)/sumabs2(s) # --- line 13
    X[:,l+1:nvars]=v_i-(s*view(R,l,l+1:nvars)) # --- line 14
  end

  X, R

end

The profiling output is as follows:
1     linalg/blas.jl; gemm!; line: 526
10455 task.jl; anonymous; line: 96
 10455 REPL.jl; eval_user_input; line: 54
  10455 profile.jl; anonymous; line: 14
   2    ...me/Projects/julia-trials/mgs.jl; mgs; line: 7
    2 array.jl; eye; line: 176
     2 array.jl; fill!; line: 155
   2383 ...me/Projects/julia-trials/mgs.jl; mgs; line: 10
   41   ...me/Projects/julia-trials/mgs.jl; mgs; line: 13
    3  array.jl; ./; line: 756
    7  array.jl; ./; line: 758
    1  linalg/matmul.jl; gemv!; line: 217
     1 pointer.jl; pointer; line: 29
    16 multidimensional.jl; setindex!; line: 76
    3  multidimensional.jl; setindex!; line: 295
    6  reduce.jl; _mapreduce; line: 168
   8027 ...me/Projects/julia-trials/mgs.jl; mgs; line: 14
    31   ...3/ArrayViews/src/ArrayViews.jl; strided_view; line: 72
    1    abstractarray.jl; copy!; line: 152
    3109 array.jl; -; line: 719
    2770 array.jl; -; line: 721
    5    linalg/matmul.jl; *; line: 74
     1 abstractarray.jl; copy!; line: 150
    1807 multidimensional.jl; setindex!; line: 76
    1    multidimensional.jl; setindex!; line: 80
    303  multidimensional.jl; setindex!; line: 295
   1    abstractarray.jl; stride; line: 30
 
There is apparently a huge difference between lines 13 and 14.
Line 14 indeed deserves a loop, as suggested by Steven. The culprit seems to be the minus operation. I'll explore the problem further.
Line 13 seems to be reasonably fast already, but preallocating space for the output of (v_i'*s) and using something like "Ac_mul_B!(dest, v_i, s)" might provide a further speedup.


This is very interesting; I'll play with it a bit further and post the improvements.

Jan

Andreas Noack

unread,
Oct 22, 2014, 10:23:32 AM10/22/14
to julia...@googlegroups.com
I'd like to approach this speed at the end.

I don't think it is possible in Julia right now without using dirty tricks such as passing pointers around. You'd like to get the speed from BLAS by operating on panels of your matrix, but you'd like to avoid the copying and reallocation of arrays. If you devectorised your code completely, there'd be no allocation, but you wouldn't have the fast multiplications in BLAS. You can get some of the way by using one of the ArrayViews packages, but the garbage collector will charge a fee for each view you produce so LAPACK speed is not attainable.

Best regards

Andreas Noack

Ján Dolinský

unread,
Oct 23, 2014, 9:43:39 AM10/23/14
to julia...@googlegroups.com
I'd like to approach this speed at the end.

I don't think it is possible in Julia right now without using dirty tricks such as passing pointers around. You'd like to get the speed from BLAS by operating on panels of your matrix, but you'd like to avoid the copying and reallocation of arrays. If you devectorised your code completely, there'd be no allocation, but you wouldn't have the fast multiplications in BLAS. You can get some of the way by using one of the ArrayViews packages, but the garbage collector will charge a fee for each view you produce so LAPACK speed is not attainable.

Best regards

Andreas Noack

Indeed, achieving BLAS speed is rather difficult, for the good reasons you mentioned above. Nevertheless, I'd like to match (or beat :)) the speed of my MGS routine written in Octave. It takes about 4 seconds in Octave and about 6.6 seconds in Julia. In Octave I could write it in a very straightforward fashion (the Octave code is available in the very first post of this thread).

Apparently most of the computing time of my Julia MGS routine is consumed by line #14. I tried commenting out line 14; the computing time (and memory allocation too) then dropped sharply from 6.6s to 0.37s.

This indicates that line 13 is already fast in its vectorized form. This is very nice, since it is a very compact piece of code.

I'll try to devectorize line 14 (or try to find better semi-vectorized representation). Any tips here appreciated.

Jan 

Tim Holy

unread,
Oct 23, 2014, 10:02:23 AM10/23/14
to julia...@googlegroups.com
I believe I mentioned A_mul_B! and friends. For the subtraction, check out
NumericExtensions, specifically the subtract! function.

--Tim

Andreas Noack

unread,
Oct 23, 2014, 10:21:11 AM10/23/14
to julia...@googlegroups.com
I missed that you were already using the ArrayViews package. Sorry for that. 

The way you have written the rank one update in line 14 is very expensive. Unfortunately, we are not making all of BLAS available through our generic mat-vec interface right now, but the whole of BLAS is wrapped in the BLAS module, so you can use the ger! function there, i.e.

    function mgs2(X)
    # mgs speed project
    
        nobs, nvars = size(X)
        R = eye(nvars)
        
        for l=1:nvars-1
            v_i = view(X,:,l+1:nvars) # --- line 10
            s = view(X,:,l);
          
            R[l,l+1:nvars] = (v_i'*s)/sumabs2(s) # --- line 13
            BLAS.ger!(-1.0, s, vec(view(R,l,l+1:nvars)), v_i) # --- line 14
        end
        
        X, R
    
    end

which avoids a lot of allocation. I don't know how Octave handles such expressions, so I don't know why it is faster when not calling into BLAS directly.
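
For reference, ger! performs the in-place rank-one update A ← α·x·y' + A. A minimal check (module path as in current Julia, where the wrappers live in LinearAlgebra.BLAS):

```julia
using LinearAlgebra

x = [1.0, 2.0]
y = [3.0, 4.0]
A = ones(2, 2)
BLAS.ger!(-1.0, x, y, A)  # in place: A <- A - x*y', a rank-one downdate
# A is now [-2.0 -3.0; -5.0 -7.0]
```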

Best regards

Andreas Noack

Simon Byrne

unread,
Oct 23, 2014, 10:31:34 AM10/23/14
to julia...@googlegroups.com
On Thursday, 23 October 2014 15:02:23 UTC+1, Tim Holy wrote:
I believe I mentioned A_mul_B! and friends. For the subtraction, check out
NumericExtensions, specifically the subtract! function.

Also, I would like to shamelessly plug the InplaceOps package, which you might find useful.

Simon

Ján Dolinský

unread,
Oct 23, 2014, 10:57:44 AM10/23/14
to julia...@googlegroups.com
On Thursday, October 23, 2014 at 16:02:23 UTC+2, Tim Holy wrote:
I believe I mentioned A_mul_B! and friends. For the subtraction, check out
NumericExtensions, specifically the subtract! function.

--Tim

Yes, you did. Exploration along this line is in progress. :) I will also check the subtract! function. Thanks. Now I see why my line #14 is so expensive (with a lot of unnecessary allocation).

Jan 

Tim Holy

unread,
Oct 23, 2014, 10:58:00 AM10/23/14
to julia...@googlegroups.com
Personally, I greatly appreciate the shameless plug. There are too many cool
things available to keep track of them all, so it's great when people
advertise their own code.

I'll have to try this, it looks quite handy.

--Tim

Ján Dolinský

unread,
Oct 23, 2014, 11:05:08 AM10/23/14
to julia...@googlegroups.com
On Thursday, October 23, 2014 at 16:21:11 UTC+2, Andreas Noack wrote:
I missed that you were already using the ArrayViews package. Sorry for that. 

The way you have written the rank one update in line 14 is very expensive. Unfortunately, we are not making all of BLAS available through our generic mat-vec interface right now, but the whole of BLAS is wrapped in the BLAS module, so you can use the ger! function there, i.e.

    function mgs2(X)
    # mgs speed project
    
        nobs, nvars = size(X)
        R = eye(nvars)
        
        for l=1:nvars-1
            v_i = view(X,:,l+1:nvars) # --- line 10
            s = view(X,:,l);
          
            R[l,l+1:nvars] = (v_i'*s)/sumabs2(s) # --- line 13
            BLAS.ger!(-1.0, s, vec(view(R,l,l+1:nvars)), v_i) # --- line 14
        end
        
        X, R
    
    end

which avoids a lot of allocation. I don't know how Octave handles such expressions, so I don't know why it is faster when not calling into BLAS directly.

Best regards

Andreas Noack

Indeed, line #14 is very expensive. I see it now :). Thanks for the BLAS tip. I am just getting familiar with BLAS and the concepts of de-vectorized code. For the last five years I used Matlab/Octave and vectorization, and it turned into a rather strong habit :). I may also try to completely de-vectorize the code.

Thanks,
Jan

Ján Dolinský

unread,
Oct 23, 2014, 11:05:45 AM10/23/14
to julia...@googlegroups.com

Looks interesting, thanks.

Jan

Andreas Noack

unread,
Oct 23, 2014, 11:10:35 AM10/23/14
to julia...@googlegroups.com
Complete devectorisation can sometimes be useful, but I have come to prefer writing linear algebra code close to the style of LAPACK, where you organise the calculations into BLAS calls. Not only for speed, but also because it makes the code easier to read and the structure of the algorithm easier to infer.

Best regards

Andreas Noack

Ján Dolinský

unread,
Oct 24, 2014, 7:27:46 AM10/24/14
to julia...@googlegroups.com
 
The way you have written the rank one update in line 14 is very expensive. Unfortunately, we are not making all of BLAS available through our generic mat-vec interface right now, but the whole of BLAS is wrapped in the BLAS module, so you can use the ger! function there, i.e.

    function mgs2(X)
    # mgs speed project
    
        nobs, nvars = size(X)
        R = eye(nvars)
        
        for l=1:nvars-1
            v_i = view(X,:,l+1:nvars) # --- line 10
            s = view(X,:,l);
          
            R[l,l+1:nvars] = (v_i'*s)/sumabs2(s) # --- line 13
            BLAS.ger!(-1.0, s, vec(view(R,l,l+1:nvars)), v_i) # --- line 14
        end
        
        X, R
    
    end

which avoids a lot of allocation. I don't know how Octave handles such expressions, so I don't know why it is faster when not calling into BLAS directly.

Best regards

Andreas Noack

Thanks for pointing out BLAS.ger!; I was not aware of this function because it is not listed in the Julia manual. Using ger! resulted in a computing time of about 1 second, not to mention the huge memory allocation savings.
I will also play further with line #13, as this can probably be turned into a direct BLAS call too.

I assume some mat-vec (or mat-mat) operations like (v_i' * s) are directly translated into a BLAS call, e.g. BLAS.gemv('T',v_i,s), right? Is there a way to find out what is being called under the hood?

Thanks,
Jan

Ján Dolinský

unread,
Oct 24, 2014, 7:32:37 AM10/24/14
to julia...@googlegroups.com

Complete devectorisation can sometimes be useful, but I have come to prefer writing linear algebra code close to the style of LAPACK, where you organise the calculations into BLAS calls. Not only for speed, but also because it makes the code easier to read and the structure of the algorithm easier to infer.

Best regards

Andreas Noack


Organizing linear algebra code into BLAS calls makes sense to me too. It is like writing vectorized (easy to read) code, but with excellent speed and efficient memory allocation.

Jan

Ján Dolinský

unread,
Oct 24, 2014, 8:15:55 AM10/24/14
to julia...@googlegroups.com
Writing line #13 as a BLAS call further reduced memory allocation. Thanks for all the tips (BLAS, NumericExtensions, InplaceOps, etc.).

using ArrayViews

function mgs(X)
# mgs speed project

  nobs, nvars = size(X)

  R = eye(nvars)

  for l=1:nvars-1
    v_i = view(X,:,l+1:nvars)
    s = view(X,:,l);

    R[l,l+1:nvars]=BLAS.gemv('T', 1.0/sumabs2(s), v_i, s) # --- line 13

Andreas Noack

unread,
Oct 24, 2014, 9:21:29 AM10/24/14
to julia...@googlegroups.com
You can squeeze a tiny bit extra out of the implementation by using the inplace gemv!, i.e.
function mgs2(X)
# mgs speed project
  nobs, nvars = size(X)
  R = eye(nvars)
  for l=1:nvars-1
    v_i = view(X,:,l+1:nvars)
    s = view(X,:,l);
    r = vec(view(R,l,l+1:nvars))

    BLAS.gemv!('T', 1.0/sumabs2(s), v_i, s, 0.0, r) # --- line 13
    BLAS.ger!(-1.0, s, r, v_i) # --- line 14
  end
  X, R
end

Best regards

Andreas Noack

Tim Holy

unread,
Oct 26, 2014, 10:37:39 AM10/26/14
to julia...@googlegroups.com
On Friday, October 24, 2014 04:27:46 AM Ján Dolinský wrote:
> I assume some mat-vec (or mat-mat) operations like (v_i' * s) are directly
> translated into a BLAS call e.g. BLAS.gemv('T',v_i,s), right? Is there a
> way to find out what is being called under the hood?

@which

--Tim
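
For example (a sketch; @which returns the Method object a call dispatches to, so you can see whether a BLAS-backed method is hit; in current Julia, @which in a script requires InteractiveUtils):

```julia
using InteractiveUtils  # @which is auto-loaded only in the REPL

X = rand(3, 3)
a = rand(3)
m = @which X' * a  # the Method this call dispatches to
println(m)         # prints the method signature and its source location
```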

Ján Dolinský

unread,
Oct 27, 2014, 5:25:01 AM10/27/14
to julia...@googlegroups.com
You can squeeze a tiny bit extra out of the implementation by using the inplace gemv!, i.e.
function mgs2(X)
# mgs speed project
  nobs, nvars = size(X)
  R = eye(nvars)
  for l=1:nvars-1
    v_i = view(X,:,l+1:nvars)
    s = view(X,:,l);
    r = vec(view(R,l,l+1:nvars))

    BLAS.gemv!('T', 1.0/sumabs2(s), v_i, s, 0.0, r) # --- line 13
    BLAS.ger!(-1.0, s, r, v_i) # --- line 14
  end
  X, R
end

Best regards

Andreas Noack

Indeed, a bit of speed, and the memory allocation improved too. I originally rejected the use of gemv! because I thought the additional multiplication by beta*y was unnecessary overhead, but this seems not to be the case. I assume that if beta = 0.0 the multiplication is not performed at all.

This implementation is actually approaching the speed of the built-in qr() function. On my other computer I get 0.17s with qr() and 0.22s with mgs() for X = rand(1000,1000). Julia is installed with OpenBLAS.

Thanks a lot, it was an excellent exercise to get more acquainted with Julia.
Jan

p.s. I also learned that the above implementation actually overwrites the original X matrix in place, so I renamed it to mgs!(X). I also did a non-inplace version, which would be:


using ArrayViews

function mgs(X)
# mgs speed project

  nobs, nvars = size(X)

  R = eye(nvars)
  Q = copy(X);

  for l=1:nvars-1
    v_i = view(Q,:,l+1:nvars)
    s = view(Q,:,l);

    r = vec(view(R,l,l+1:nvars))

    BLAS.gemv!('T', 1.0/sumabs2(s), v_i, s, 0.0, r)
    BLAS.ger!(-1.0, s, r, v_i)
  end

  Q, R

end

Speed is pretty much the same. I assume copy() and deepcopy() work the same for arrays.

Ján Dolinský

unread,
Oct 27, 2014, 5:25:34 AM10/27/14
to julia...@googlegroups.com

Very handy, thanks.

Jan

Stefan Karpinski

unread,
Oct 27, 2014, 10:12:07 AM10/27/14
to Julia Users
On Mon, Oct 27, 2014 at 5:25 AM, Ján Dolinský <jan.do...@2bridgz.com> wrote:
Speed is pretty much the same. I assume copy() and deepcopy() works the same for arrays.

Yes, they do the same thing, although deepcopy probably creates a small (unnecessary) dictionary, but for this it may not be noticeable.
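
A quick sketch of the difference (for an array of plain numbers the observable behaviour is the same):

```julia
A = [1.0 2.0; 3.0 4.0]
B = copy(A)      # shallow copy: a new array holding the same values
B[1, 1] = -1.0   # modifying B leaves A untouched
C = deepcopy(A)  # recursive copy; identical effect for an array of numbers
C[2, 2] = 0.0
# A is still [1.0 2.0; 3.0 4.0]
```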