function [X, R] = mgs(X)
[n, p] = size(X);
R = eye(p);
for l = 1:p
  R(l,l+1:p) = (X(:,l)'*X(:,l+1:p)) / (X(:,l)'*X(:,l));
  X(:,l+1:p) = X(:,l+1:p) - X(:,l)*R(l,l+1:p);
end
X = rand(1000);
octave:18> tic; [Q, R] = mgs(X); toc;
Elapsed time is 3.99396 seconds.

using ArrayViews
function mgs(X)
    nobs, nvars = size(X)
    R = eye(nvars)
    for l = 1:nvars-1
        R[l,l+1:nvars] = (view(X,:,l)'*view(X,:,l+1:nvars)) / sumabs2(view(X,:,l))
        X[:,l+1:nvars] = view(X,:,l+1:nvars) - view(X,:,l)*view(R,l,l+1:nvars)
    end
    X, R
end
X = rand(1000,1000);
@time Q,R = mgs(X);
elapsed time: 7.986772918 seconds (8085500312 bytes allocated, 34.69% gc time)

Then I changed the line

    R[l,l+1:nvars] = (view(X,:,l)'*view(X,:,l+1:nvars)) / sumabs2(view(X,:,l))

to

    R[l,l+1:nvars] = (view(X,:,l+1:nvars)'*view(X,:,l)) / sumabs2(view(X,:,l))

and got a little increase in performance:
julia> @time Q,R = mgs(X);
elapsed time: 6.815198093 seconds (8017005584 bytes allocated, 39.99% gc time)
The built-in qr function does the job a lot faster. I'd like to approach that speed in the end.
julia> @time Q,R = qr(X);
elapsed time: 0.542084044 seconds (32580000 bytes allocated, 5.41% gc time)
I wonder why my vectorized code is not fast. I assume operations like X'*a use BLAS (e.g. BLAS.gemv()), so the code should be at least as fast as a devectorized version in this case.
Thanks,
Jan
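(As an aside, the assumption above can be checked directly, at least for dense Float64 arrays, where the high-level product does dispatch to BLAS. A sketch in current Julia syntax, where BLAS is reached through the LinearAlgebra standard library:)

```julia
using LinearAlgebra

X = rand(1000, 1000)
a = rand(1000)

# High-level transposed matrix-vector product; for dense Float64
# arrays this dispatches to BLAS under the hood.
y1 = X' * a

# The same product via an explicit BLAS call.
y2 = BLAS.gemv('T', 1.0, X, a)

@assert y1 ≈ y2
```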
using ArrayViews
function mgs(X)
    # mgs speed project
    nobs, nvars = size(X)
    R = eye(nvars)
    for l = 1:nvars-1
        v_i = view(X,:,l+1:nvars)                        # --- line 10
        s = view(X,:,l)
        R[l,l+1:nvars] = (v_i'*s)/sumabs2(s)             # --- line 13
        X[:,l+1:nvars] = v_i - (s*view(R,l,l+1:nvars))   # --- line 14
    end
    X, R
end
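(As a sanity check, this factorization can be verified against the input: the returned Q has mutually orthogonal but unnormalized columns, R is unit upper triangular, and Q*R reconstructs the original matrix. A sketch in current Julia syntax, with built-in view instead of ArrayViews, sum(abs2, s) instead of sumabs2, and Matrix{Float64}(I, n, n) instead of eye:)

```julia
using LinearAlgebra

function mgs(X)
    nobs, nvars = size(X)
    R = Matrix{Float64}(I, nvars, nvars)   # eye(nvars) in older Julia
    for l = 1:nvars-1
        v_i = view(X, :, l+1:nvars)
        s = view(X, :, l)
        R[l, l+1:nvars] = (v_i' * s) / sum(abs2, s)
        X[:, l+1:nvars] = v_i - s * view(R, l, l+1:nvars)'  # rank-one update
    end
    X, R
end

X = rand(100, 100)
X0 = copy(X)                # mgs overwrites X with Q
Q, R = mgs(X)
@assert Q * R ≈ X0          # the factorization reproduces the input
@assert abs(dot(Q[:,1], Q[:,2])) < 1e-6   # columns are orthogonal
```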
> I'd like to approach this speed at the end.

I don't think it is possible in Julia right now without using dirty tricks such as passing pointers around. You'd like to get the speed from BLAS by operating on panels of your matrix, but you'd like to avoid the copying and reallocation of arrays. If you devectorised your code completely, there'd be no allocation, but you wouldn't have the fast multiplications in BLAS. You can get some of the way by using one of the ArrayViews packages, but the garbage collector will charge a fee for each view you produce, so LAPACK speed is not attainable.

Kind regards
Andreas Noack
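(In later Julia versions this picture improved: view is built in, and inside a function a view of an array typically allocates little or nothing, whereas an ordinary slice copies its data. A small sketch of the difference, assuming a recent Julia:)

```julia
# Sum columns via copying slices vs. views and compare allocation.
function col_sum_slice(X)
    s = 0.0
    for j in 1:size(X, 2)
        s += sum(X[:, j])        # X[:, j] allocates a copy of the column
    end
    s
end

function col_sum_view(X)
    s = 0.0
    for j in 1:size(X, 2)
        s += sum(view(X, :, j))  # view(X, :, j) avoids the copy
    end
    s
end

X = rand(1000, 1000)
col_sum_slice(X); col_sum_view(X)   # warm up (compile both)
a1 = @allocated col_sum_slice(X)
a2 = @allocated col_sum_view(X)
@assert a1 > a2                     # the view version allocates far less
```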
I believe I mentioned A_mul_B! and friends. For the subtraction, check out
NumericExtensions, specifically the subtract! function.
--Tim
I missed that you were already using the ArrayViews package. Sorry for that.

The way you have written the rank one update in line 14 is very expensive. Unfortunately, we are not making all of BLAS available through our generic mat-vec interface right now, but the whole of BLAS is wrapped in the BLAS module, so you can use the ger! function there, i.e.

function mgs2(X)
    # mgs speed project
    nobs, nvars = size(X)
    R = eye(nvars)
    for l = 1:nvars-1
        v_i = view(X,:,l+1:nvars)                           # --- line 10
        s = view(X,:,l)
        R[l,l+1:nvars] = (v_i'*s)/sumabs2(s)                # --- line 13
        BLAS.ger!(-1.0, s, vec(view(R,l,l+1:nvars)), v_i)   # --- line 14
    end
    X, R
end

which avoids a lot of allocation. I don't know how Octave handles such expressions, so I don't know why they are faster when not calling into BLAS directly.

Kind regards
Andreas Noack
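(To see what the ger! call does: BLAS.ger!(alpha, x, y, A) performs the in-place rank-one update A += alpha*x*y'. A minimal check in current Julia, where the BLAS module lives in LinearAlgebra:)

```julia
using LinearAlgebra

x = rand(5)
y = rand(3)
A = rand(5, 3)
A0 = copy(A)

BLAS.ger!(-1.0, x, y, A)     # in-place: A ← A - x*y'

@assert A ≈ A0 .- x .* y'    # matches the rank-one update written out
```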
Complete devectorisation can sometimes be useful, but I have come to prefer writing linear algebra code close to the style of LAPACK, where you organise the calculations as BLAS calls. Not only for speed, but also to make the code easier to read and to infer the structure of the algorithm.

Kind regards
Andreas Noack
You can squeeze a tiny bit extra out of the implementation by using the in-place gemv!, i.e.

function mgs2(X)
    # mgs speed project
    nobs, nvars = size(X)
    R = eye(nvars)
    for l = 1:nvars-1
        v_i = view(X,:,l+1:nvars)
        s = view(X,:,l)
        r = vec(view(R,l,l+1:nvars))
        BLAS.gemv!('T', 1.0/sumabs2(s), v_i, s, 0.0, r)  # --- line 13
        BLAS.ger!(-1.0, s, r, v_i)                       # --- line 14
    end
    X, R
end

Kind regards
Andreas Noack
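(This version can be checked the same way as before: Q*R should reproduce the input. A sketch translated to current Julia syntax, where view is built in, sumabs2 has become sum(abs2, x), and BLAS lives in LinearAlgebra:)

```julia
using LinearAlgebra

function mgs2(X)
    nobs, nvars = size(X)
    R = Matrix{Float64}(I, nvars, nvars)
    for l = 1:nvars-1
        v_i = view(X, :, l+1:nvars)
        s = view(X, :, l)
        r = view(R, l, l+1:nvars)             # strided vector, no vec needed
        BLAS.gemv!('T', 1.0/sum(abs2, s), v_i, s, 0.0, r)  # r = v_i's / s's
        BLAS.ger!(-1.0, s, r, v_i)                          # v_i -= s*r'
    end
    X, R
end

X = rand(200, 200)
X0 = copy(X)
Q, R = mgs2(X)
@assert Q * R ≈ X0    # the factorization reproduces the input
```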
Speed is pretty much the same. I assume copy() and deepcopy() work the same for arrays.
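(That holds for arrays of plain numbers, where the elements carry no references; the two only differ for containers of mutable elements, e.g. an array of arrays. A quick illustration:)

```julia
# For an array of plain numbers, copy and deepcopy give the same result
# and neither shares storage with the original.
x = [1.0, 2.0, 3.0]
@assert copy(x) == deepcopy(x)
c = copy(x); c[1] = 0.0
@assert x[1] == 1.0            # the copy did not alias x

# They differ once the elements are themselves mutable:
y = [[1.0, 2.0], [3.0, 4.0]]
ys = copy(y)                   # shallow: shares the inner arrays
yd = deepcopy(y)               # deep: clones them too
ys[1][1] = 99.0
@assert y[1][1] == 99.0        # shallow copy aliased the inner array
@assert yd[1][1] == 1.0        # deep copy did not
```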