My (limited) understanding is that the fastest matrix multiply is GEMM
from an optimized BLAS, such as the one that comes with Intel's Math
Kernel Library (MKL).
AFAIK all major hardware makers produce optimized BLAS for their systems
(see also ATLAS). In turn, LAPACK uses BLAS, so the speed of LAPACK (and
the answer to your question) will depend on whether LAPACK was compiled
against an optimized BLAS or not.
Hope that helps.
Cheers,
Daniel.
--
Tim Prince
> On 05/25/2011 08:34 PM, Brad Weir wrote:
> > Suppose I have a general matrix and a vector and I want to multiply
> > them. Does anyone know how the performance of the Fortran 90/95
> > intrinsics (matmul, dot_product, etc.) compares to the LAPACK routines
> > for general matrices (_gemv, _nrm2, etc.)? What about if the matrix
> > is symmetric? I would guess that then the special structure of the
> > matrix makes the LAPACK routines more efficient.
>
>
> My (limited) understanding is that the fastest matrix multiply is GEMM
> from an optimized BLAS, such as the one that comes with Intel's Math
> Kernel Library (MKL).
The routines mentioned above are not LAPACK routines; they are BLAS
routines. _GEMM is a level-3 BLAS routine, _GEMV is a level-2 BLAS
routine, and _NRM2 is a level-1 BLAS routine (a function, in that case).
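The three levels can be seen side by side in a short program. This is a
minimal sketch, assuming double precision (the D prefix) and a linked
BLAS library; the array sizes and values are made up for illustration:

```fortran
program blas_levels
  implicit none
  integer, parameter :: n = 4
  double precision :: a(n,n), b(n,n), c(n,n), x(n), y(n), nrm
  double precision, external :: dnrm2
  a = 1.0d0; b = 2.0d0; c = 0.0d0; x = 3.0d0; y = 0.0d0
  ! Level 1 (vector-vector): Euclidean norm of x
  nrm = dnrm2(n, x, 1)
  ! Level 2 (matrix-vector): y := 1*A*x + 0*y
  call dgemv('N', n, n, 1.0d0, a, n, x, 1, 0.0d0, y, 1)
  ! Level 3 (matrix-matrix): C := 1*A*B + 0*C
  call dgemm('N', 'N', n, n, n, 1.0d0, a, n, b, n, 0.0d0, c, n)
  print *, nrm, y(1), c(1,1)
end program blas_levels
```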
>
> AFAIK all major hardware makers produce optimized BLAS for their systems
> (see also ATLAS). In turn, LAPACK uses BLAS, so the speed of LAPACK (and
> the answer to your question) will depend on whether LAPACK was compiled
> against an optimized BLAS or not.
This statement is a little confusing; I don't know what "compiled
against" means. You can link your user program with any LAPACK
library you have available, and with any BLAS library you have
available, and if the calling conventions match, your program will
work. On my Mac, for example, suppose I have the default
Apple-supplied LAPACK routines (in vecLib), the commercial MKL
LAPACK routines, and two versions of LAPACK compiled from Fortran
source code (with various compilers and various options). And
suppose I have the same four possibilities for BLAS. Then, if I get
the libraries in the right order on my ld command when I link my
program, I can create something like 16 different versions of my
program, all from the same fixed object files, where the compiler
(and its options) do not yet enter the picture.
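The link-step idea above can be sketched on a command line. The library
names here are illustrative only; actual names and paths vary by system
and BLAS vendor:

```shell
gfortran -c myprog.f90                    # compile once; no BLAS chosen yet
gfortran myprog.o -llapack -lblas         # reference LAPACK + reference BLAS
gfortran myprog.o -llapack -lopenblas     # same objects, optimized BLAS
gfortran myprog.o -framework Accelerate   # Apple-supplied LAPACK and BLAS
```

The object file myprog.o is identical in every case; only the libraries
named at link time decide which GEMM actually runs.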
I think the symmetric matrix question is just a red herring. You
have to use the specific BLAS routines that take advantage of
symmetric, banded, or triangular matrices; the BLAS does not
recognize such structure automatically and adjust the algorithm
accordingly. For example, DGEMM computes general matrix-matrix
products, while DSYMM computes a symmetric-matrix times
general-matrix product.
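The calling sequences make the difference concrete. A minimal sketch,
assuming A is symmetric with only its upper triangle stored ('U') and
multiplied from the left ('L'); the values are made up:

```fortran
program symm_demo
  implicit none
  integer, parameter :: n = 3
  double precision :: a(n,n), b(n,n), c(n,n)
  call random_number(b)
  a = 0.0d0                        ! only the upper triangle need be valid
  a(1,1:3) = [2.0d0, 1.0d0, 0.5d0]
  a(2,2:3) = [3.0d0, 1.5d0]
  a(3,3)   = 4.0d0
  ! C := 1*A*B + 0*C; DSYMM reads only the 'U'pper triangle of A,
  ! whereas DGEMM would read (and require) the full matrix.
  call dsymm('L', 'U', n, n, 1.0d0, a, n, b, n, 0.0d0, c, n)
  print *, c(1,1)
end program symm_demo
```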
Sometimes it is confusing knowing what is a low-level BLAS operation
rather than a high-level LAPACK operation. For example, the
solution of a linear system that involves a triangular matrix is a
BLAS operation rather than a LAPACK operation (i.e. _TRSV).
Triangular banded solves (_TBSV) are likewise BLAS operations,
not LAPACK operations.
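As a sketch of the triangular case: DTRSV solves U*x = b in place,
entirely within the level-2 BLAS (the matrix and right-hand side here
are made up for illustration):

```fortran
program trsv_demo
  implicit none
  integer, parameter :: n = 3
  double precision :: u(n,n), x(n)
  u = 0.0d0
  u(1,1:3) = [2.0d0, 1.0d0, 1.0d0]
  u(2,2:3) = [3.0d0, 1.0d0]
  u(3,3)   = 4.0d0
  x = [4.0d0, 7.0d0, 8.0d0]        ! right-hand side b
  ! 'U'pper triangular, 'N'o transpose, 'N'on-unit diagonal
  call dtrsv('U', 'N', 'N', n, u, n, x, 1)
  print *, x                       ! x now holds the solution
end program trsv_demo
```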
$.02 -Ron Shepard
For small matrices the MATMUL intrinsic is fine. For large matrices
(in a serial code) you would want a multithreaded level-3 BLAS like
Intel's MKL. I don't know if MATMUL is multithreaded in any of the
compilers. LAPACK is built on top of the BLAS, so a
faster/multithreaded BLAS means a faster LAPACK.
Note: I generally don't work with large dense matrices.
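For checking that MATMUL and the BLAS agree (and for timing one against
the other), a small sketch; the size n is arbitrary:

```fortran
program matmul_vs_dgemm
  implicit none
  integer, parameter :: n = 200
  double precision :: a(n,n), b(n,n), c_intr(n,n), c_blas(n,n)
  call random_number(a)
  call random_number(b)
  c_intr = matmul(a, b)                                   ! intrinsic
  call dgemm('N', 'N', n, n, n, 1.0d0, a, n, b, n, &
             0.0d0, c_blas, n)                            ! level-3 BLAS
  ! The two results should agree to roundoff
  print *, maxval(abs(c_intr - c_blas))
end program matmul_vs_dgemm
```

Wrapping each call in system_clock (or cpu_time) gives a crude timing
comparison between the intrinsic and the linked BLAS.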
True. I was momentarily confused about how compilation works. I had it
in my mind that if you compile a program using one library, you are
stuck with that library and can't switch it later. But thinking
about it, as long as the API is compatible (as the BLAS APIs are),
you are free to change the library at link time.
> I think the symmetric matrix question is just a red herring. You
> have to use the specific BLAS routines that take advantage of
> symmetric, banded, or triangular matrices; the BLAS does not
> recognize such structure automatically and adjust the algorithm
> accordingly. For example, DGEMM computes general matrix-matrix
> products, while DSYMM computes a symmetric-matrix times
> general-matrix product.
Interesting. I didn't know about all these other matrix product subroutines.
> Sometimes it is confusing knowing what is a low-level BLAS operation
> rather than a high-level LAPACK operation. For example, the
> solution of a linear system that involves a triangular matrix is a
> BLAS operation rather than a LAPACK operation (i.e. _TRSV).
> Triangular banded solves (_TBSV) are likewise BLAS operations,
> not LAPACK operations.
Interesting. I didn't know that either. Thanks. I think I'm going to go
read more about the BLAS.
Daniel.
You got me, I meant BLAS, not LAPACK.
> I think the symmetric matrix question is just a red herring. You
> have to use the specific BLAS routines that take advantage of
> symmetric, banded, or triangular matrices; the BLAS does not
> recognize such structure automatically and adjust the algorithm
> accordingly. For example, DGEMM computes general matrix-matrix
> products, while DSYMM computes a symmetric-matrix times
> general-matrix product.
I included the symmetric matrix question in case the intrinsic
procedures were actually _more_ efficient than the BLAS procedures,
but I gather that other than maybe dot_product on small vectors, this
is never the case. In any case, I come from a math background and am
self-taught, so I'm sure my code is far from optimal.
All - thanks for the suggestions/help. I've got plenty of
documentation to sift through now.
Brad
DSYMM is part of the level-3 BLAS, and it multiplies a symmetric matrix
by a general matrix. Maybe that's what you want?
> In any case, I come from a math background, and am
> self taught, so I'm sure my code is far from optimal.
I think most Fortran people come from a science rather than comp sci
background. My background is astrophysics.
Daniel.
> I included the symmetric matrix question in case the intrinsic
> procedures were actually _more_ efficient than the BLAS procedures,
> but I gather that other than maybe dot_product on small vectors, this
> is never the case.
Well, sure it can be the case. You seem to be talking about the
intrinsic procedures as though they are necessarily tied to some
specific implementation. They aren't. The standard just says what
results they have to give - not how they have to be implemented.
In order to meaningfully answer your question, one has to be talking
about a particular compiler, and probably even a particular version of a
compiler, as such things could change. The more general version of the
question as asked has no answer. Heck, the intrinsic procedures could
even *BE* the BLAS procedures under the covers in cases where that fits.
(I'm not judging how likely that is - just saying that it is possible.)
Compare a well-done and maybe inlined version of the intrinsic versus a
BLAS library compiled without optimization, and maybe with compiler
debugging options turned on just to make things worse. You might have
very good odds that the intrinsic version would win, maybe by a lot.
Yes, I'm sure that's not the kind of comparison you were thinking of, at
least in terms of how to compile the BLAS. I just mention it as an
extreme case.
There have been lots of implementations of the matmul intrinsic. I
presume that some are pretty good. I recall reading of some other cases
where it is hard to figure out how they managed to do quite so poorly
(namely, cases where you could do better using the same compiler on
the most simple-minded version of how to write a matrix multiply routine
in Fortran). Talking about performance of "the intrinsic matmul" without
being more specific is about as meaningful as considering the
above-mentioned pessimal compilation of BLAS.
--
Richard Maine | Good judgment comes from experience;
email: last name at domain . net | experience comes from bad judgment.
domain: summertriangle | -- Mark Twain
Well, I would say it is very likely if you compile the program with
"gfortran -fexternal-blas".
(There is also -fblas-matmul-limit=<n> which makes sure that for small
arrays, BLAS is not called - assuming the size is known at compile time.)
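For completeness, a hypothetical compile line using those flags; the
BLAS library name and the limit of 32 are illustrative:

```shell
# Route MATMUL calls through an external BLAS, but keep the inline code
# for matrices of order 32 or below (when sizes are known at compile time):
gfortran -O2 -fexternal-blas -fblas-matmul-limit=32 myprog.f90 -lopenblas
```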
Tobias
PS: I have never used that flag.
Dick Hendrickson