
Intrinsic matmul vs. LAPACK

Brad Weir

May 25, 2011, 2:34:40 PM
Suppose I have a general matrix and a vector and I want to multiply
them. Does anyone know how the performance of the Fortran 90/95
intrinsics (matmul, dot_product, etc.) compares to the LAPACK routines
for general matrices (_gemv, _nrm2, etc.)? What about if the matrix
is symmetric? I would guess that then the special structure of the
matrix makes the LAPACK routines more efficient.
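
For concreteness, here is roughly the kind of comparison I have in
mind (untested sketch, double precision, reference BLAS interface):

program compare
  implicit none
  integer, parameter :: n = 500
  double precision :: a(n,n), x(n), y1(n), y2(n)
  external :: dgemv
  call random_number(a)
  call random_number(x)
  y1 = matmul(a, x)                                       ! intrinsic
  call dgemv('N', n, n, 1.0d0, a, n, x, 1, 0.0d0, y2, 1)  ! BLAS level 2
  print *, maxval(abs(y1 - y2))                           ! should be tiny
end program compare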

Daniel Carrera

May 25, 2011, 3:23:27 PM


My (limited) understanding is that the fastest matrix multiply is GEMM
from an optimized BLAS, such as the one that comes with Intel's Math
Kernel Library (MKL).

AFAIK all major hardware makers produce optimized BLAS for their systems
(see also ATLAS). In turn, LAPACK uses BLAS, so the speed of LAPACK (and
the answer to your question) will depend on whether LAPACK was compiled
against an optimized BLAS or not.
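
If I understand the interface correctly, a matrix-matrix product via
BLAS looks roughly like this (untested sketch; with 'N','N' arguments
DGEMM computes C := alpha*A*B + beta*C):

program gemm_example
  implicit none
  integer, parameter :: m = 300, k = 200, n = 100
  double precision :: a(m,k), b(k,n), c(m,n)
  external :: dgemm
  call random_number(a)
  call random_number(b)
  ! same result as c = matmul(a, b)
  call dgemm('N', 'N', m, n, k, 1.0d0, a, m, b, k, 0.0d0, c, m)
end program gemm_example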

Hope that helps.

Cheers,
Daniel.

Tim Prince

May 25, 2011, 7:23:05 PM
I doubt I can guess correctly where you're going with this, particularly
as you mention the structure of the matrix, which is set by your
programming outside any of these functions, but I'll make a few comments:
1. gfortran, ifort, and maybe other compilers have an option to
implement matmul directly as a call to ?gemm for the case where both
operands are rank 2. gfortran makes the switch at a specifiable size
threshold. This is desirable when the data set is big enough to take
advantage of an optimized threaded library implementation. For
matrix-vector multiplication, in-line expansion is likely to run faster
for cases which aren't large enough to benefit from threading, but you
could compile your own ?gemv to get better performance for specific
cases which interest you, if you are at liberty to use the most
effective compiler options as well as to modify the source code to
favor your cases.
2. With a good compiler, in-line expansion of dot_product will normally
out-perform a library function until the case is large enough to benefit
from threading, and there may still be questions about data locality (I
don't know whether you include this in your comment about structure). If
you work under ground rules that require conservative compiler options
which don't optimize dot_product, or you otherwise run into missed
optimizations, then there is a case for the library.
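
For reference, the two forms being compared are roughly (untested
sketch):

program dot_example
  implicit none
  integer, parameter :: n = 100000
  double precision :: x(n), y(n), d1, d2
  double precision, external :: ddot
  call random_number(x)
  call random_number(y)
  d1 = dot_product(x, y)      ! intrinsic, normally expanded in line
  d2 = ddot(n, x, 1, y, 1)    ! BLAS level 1
  print *, d1 - d2
end program dot_example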

--
Tim Prince

Ron Shepard

May 25, 2011, 9:10:43 PM
In article <irjkvg$r7s$1...@dont-email.me>,
Daniel Carrera <dan...@gmail.com> wrote:

> On 05/25/2011 08:34 PM, Brad Weir wrote:
> > Suppose I have a general matrix and a vector and I want to multiply
> > them. Does anyone know how the performance of the Fortran 90/95
> > intrinsics (matmul, dot_product, etc.) compares to the LAPACK routines
> > for general matrices (_gemv, _nrm2, etc.)? What about if the matrix
> > is symmetric? I would guess that then the special structure of the
> > matrix makes the LAPACK routines more efficient.
>
>
> My (limited) understanding is that the fastest matrix multiply is GEMM
> from an optimized BLAS, such as the one that comes with Intel's Math
> Kernel Library (MKL).

The routines mentioned above are not LAPACK routines, they are BLAS
routines. _GEMM is a level-3 BLAS routine, _GEMV is a level-2 BLAS
routine, and _NRM2 is a level-1 routine (or function, in that case).

>
> AFAIK all major hardware makers produce optimized BLAS for their systems
> (see also ATLAS). In turn, LAPACK uses BLAS, so the speed of LAPACK (and
> the answer to your question) will depend on whether LAPACK was compiled
> against an optimized BLAS or not.

This statement is a little confusing. I don't know what "compiled
against" means. You can link your user program with any LAPACK
library that you might have available, and you can link your program
with any BLAS library that you might have available, and if the
calling conventions match, your program will work. On my Mac, for
example, suppose I have the default Apple-supplied LAPACK routines
(in veclib), the commercial MKL LAPACK routines, and two versions of
LAPACK that are compiled from fortran source code (with various
compilers and with various options). And suppose I have the same
four possibilities for BLAS. Then if I get the libraries in the
right order on my ld statement when I link my program together, I
have the possibility of creating something like 16 different
versions of my program, all with fixed object files where the
compiler (and its options) do not yet enter the picture.
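
For instance (purely illustrative; the library names and paths differ
from system to system), the very same prog.o might be linked as

  gfortran -o prog prog.o -L/ref/lib -llapack -lblas

or as

  gfortran -o prog prog.o -L/ref/lib -llapack -L/opt/atlas/lib -lf77blas -latlas

and the BLAS you get is simply whichever library the linker resolves
first.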

I think the symmetric matrix question is just a red herring. You
have to call the specific BLAS routines that take advantage of matrix
symmetry or banded matrices or triangular matrices; the library does
not recognize this structure automatically and then adjust the
algorithm accordingly. For example, DGEMM computes general
matrix-matrix products, while DSYMM computes the product of a
symmetric matrix and a general matrix.
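
For the matrix-vector case in the original question, that means
calling, say, DSYMV explicitly (untested sketch):

program symv_example
  implicit none
  integer, parameter :: n = 300
  double precision :: a(n,n), x(n), y(n)
  external :: dsymv
  call random_number(a)
  a = 0.5d0*(a + transpose(a))   ! make A symmetric
  call random_number(x)
  ! y := A*x, referencing only the upper triangle of A
  call dsymv('U', n, 1.0d0, a, n, x, 1, 0.0d0, y, 1)
end program symv_example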

Sometimes it is confusing knowing what is a low-level BLAS operation
rather than a high-level LAPACK operation. For example, the solution
of a linear system involving a triangular matrix is a BLAS operation
rather than a LAPACK operation (namely _TRSV). Triangular banded
solves (_TBSV) are likewise BLAS operations, not LAPACK operations.
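
A call along these lines, for example (untested; A upper triangular
with a non-unit diagonal, x holds b on entry and the solution on exit):

  call dtrsv('U', 'N', 'N', n, a, n, x, 1)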

$.02 -Ron Shepard

rusi_pathan

May 25, 2011, 9:35:56 PM

For small matrices the matmul intrinsic is fine. For large matrices
(in a serial code) you would want a multithreaded BLAS 3 like Intel's
MKL. I don't know if matmul is multithreaded in any of the compilers.

LAPACK is built on top of BLAS, so a faster/multithreaded BLAS means a
faster LAPACK.

Note: I generally don't work with large dense matrices.

Daniel Carrera

May 26, 2011, 5:42:53 AM
On 05/26/2011 03:10 AM, Ron Shepard wrote:
>> AFAIK all major hardware makers produce optimized BLAS for their systems
>> (see also ATLAS). In turn, LAPACK uses BLAS, so the speed of LAPACK (and
>> the answer to your question) will depend on whether LAPACK was compiled
>> against an optimized BLAS or not.
>
> This statement is a little confusing. I don't know what "compiled
> against" means. You can link your user program with any LAPACK
> library that you might have available, and you can link your program
> with any BLAS library that you might have available, and if the
> calling conventions match, your program will work.

True. I was momentarily confused about how compilation works. I had it
in my mind that if you compile a program using one library, you are
stuck with that library and can't just switch it later. But thinking
about it, as long as the API is compatible (as the BLAS are) you are
free to change the library later.


> I think the symmetric matrix question is just a red herring. You
> have to call the specific BLAS routines that take advantage of matrix
> symmetry or banded matrices or triangular matrices; the library does
> not recognize this structure automatically and then adjust the
> algorithm accordingly. For example, DGEMM computes general
> matrix-matrix products, while DSYMM computes the product of a
> symmetric matrix and a general matrix.

Interesting. I didn't know about all these other matrix product subroutines.


> Sometimes it is confusing knowing what is a low-level BLAS operation
> rather than a high-level LAPACK operation. For example, the solution
> of a linear system involving a triangular matrix is a BLAS operation
> rather than a LAPACK operation (namely _TRSV). Triangular banded
> solves (_TBSV) are likewise BLAS operations, not LAPACK operations.

Interesting. I didn't know that either. Thanks. I think I'm going to go
read more about the BLAS.

Daniel.

Brad Weir

May 26, 2011, 12:30:30 PM
On May 25, 6:10 pm, Ron Shepard <ron-shep...@NOSPAM.comcast.net>
wrote:
> In article <irjkvg$r7...@dont-email.me>,

>  Daniel Carrera <dan...@gmail.com> wrote:
>
> > On 05/25/2011 08:34 PM, Brad Weir wrote:
> > > Suppose I have a general matrix and a vector and I want to multiply
> > > them.  Does anyone know how the performance of the Fortran 90/95
> > > intrinsics (matmul, dot_product, etc.) compares to the LAPACK routines
> > > for general matrices (_gemv, _nrm2, etc.)?  What about if the matrix
> > > is symmetric?  I would guess that then the special structure of the
> > > matrix makes the LAPACK routines more efficient.

>
> > My (limited) understanding is that the fastest matrix multiply is GEMM
> > from an optimized BLAS, such as the one that comes with Intel's Math
> > Kernel Library (MKL).
>
> The routines mentioned above are not LAPACK routines, they are BLAS

> routines. _GEMM is a level-3 BLAS routine, _GEMV is a level-2 BLAS
> routine, and _NRM2 is a level-1 routine (or function, in that case).

You got me, I meant BLAS, not LAPACK.

> I think the symmetric matrix question is just a red herring.  You
> have to call the specific BLAS routines that take advantage of matrix
> symmetry or banded matrices or triangular matrices; the library does
> not recognize this structure automatically and then adjust the
> algorithm accordingly.  For example, DGEMM computes general
> matrix-matrix products, while DSYMM computes the product of a
> symmetric matrix and a general matrix.

I included the symmetric matrix question in case the intrinsic
procedures were actually _more_ efficient than the BLAS procedures,
but I gather that other than maybe dot_product on small vectors, this
is never the case. In any case, I come from a math background, and am
self-taught, so I'm sure my code is far from optimal.

All - thanks for the suggestions/help. I've got plenty of
documentation to sift through now.

Brad

Daniel Carrera

May 26, 2011, 12:53:28 PM
On 05/26/2011 06:30 PM, Brad Weir wrote:
> I included the symmetric matrix question in case the intrinsic
> procedures were actually _more_ efficient than the BLAS procedures,
> but I gather that other than maybe dot_product on small vectors, this
> is never the case.

DSYMM is part of BLAS 3; it multiplies a symmetric matrix by a general
matrix. Maybe that's what you want?
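
Something like this, I believe (untested; A symmetric with its upper
triangle stored, multiplied from the left):

  ! c := 1.0*a*b + 0.0*c, a is m-by-m symmetric, b and c are m-by-n
  call dsymm('L', 'U', m, n, 1.0d0, a, m, b, m, 0.0d0, c, m)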


> In any case, I come from a math background, and am
> self-taught, so I'm sure my code is far from optimal.

I think most Fortran people come from a science rather than comp sci
background. My background is astrophysics.


Daniel.

Richard Maine

May 26, 2011, 12:56:17 PM
Brad Weir <bria...@gmail.com> wrote:

> I included the symmetric matrix question in case the intrinsic
> procedures were actually _more_ efficient than the BLAS procedures,
> but I gather that other than maybe dot_product on small vectors, this
> is never the case.

Well, sure it can be the case. You seem to be talking about the
intrinsic procedures as though they are necessarily tied to some
specific implementation. They aren't. The standard just says what
results they have to give - not how they have to be implemented.

In order to meaningfully answer your question, one has to be talking
about a particular compiler, and probably even a particular version of a
compiler, as such things could change. The more general version of the
question as asked has no answer. Heck, the intrinsic procedures could
even *BE* the BLAS procedures under the cover in cases where that fits.
(I'm not judging how likely that is - just saying that it is possible).

Compare a well-done and maybe inlined version of the intrinsic versus a
BLAS library compiled without optimization, and maybe with compiler
debugging options turned on just to make things worse. You might have
very good odds that the intrinsic version would win, maybe by a lot.
Yes, I'm sure that's not the kind of comparison you were thinking of, at
least in terms of how to compile the BLAS. I just mention it as an
extreme case.

There have been lots of implementations of the matmul intrinsic. I
presume that some are pretty good. I recall reading of other cases
where it is hard to figure out how they managed to do quite so poorly
(namely, cases where you could do better by running the same compiler
on the most simple-minded hand-written matrix multiply loops in
Fortran; see the sketch below). Talking about the performance of "the
intrinsic matmul" without being more specific is about as meaningful as
considering the above-mentioned pessimal compilation of BLAS.
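
(By the most simple-minded version I mean something like the textbook
triple loop, with the inner loop running down a column:

  c = 0.0d0
  do j = 1, n
     do l = 1, k
        do i = 1, m
           c(i,j) = c(i,j) + a(i,l)*b(l,j)
        end do
     end do
  end do

for C = A*B with A m-by-k and B k-by-n.)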

--
Richard Maine | Good judgment comes from experience;
email: last name at domain . net | experience comes from bad judgment.
domain: summertriangle | -- Mark Twain

Tobias Burnus

May 26, 2011, 4:36:00 PM
Richard Maine wrote:
> Heck, the intrinsic procedures could even *BE* the BLAS procedures
> under the cover in cases where that fits.
> (I'm not judging how likely that is - just saying that it is possible).

Well, I would say it is very likely if you compile the program with
"gfortran -fexternal-blas".

(There is also -fblas-matmul-limit=<n> which makes sure that for small
arrays, BLAS is not called - assuming the size is known at compile time.)
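
For example (untested on my side), compiling something like

  gfortran -O2 -fexternal-blas -fblas-matmul-limit=32 test.f90 -lblas

should route the matmul below through the external ?gemm once the
matrices exceed the 32x32 limit:

program big_matmul
  implicit none
  integer, parameter :: n = 1000
  double precision, allocatable :: a(:,:), b(:,:), c(:,:)
  allocate (a(n,n), b(n,n), c(n,n))
  call random_number(a)
  call random_number(b)
  c = matmul(a, b)   ! candidate for the external DGEMM call
  print *, c(n,n)
end program big_matmul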

Tobias

PS: I have never used that flag.

Dick Hendrickson

May 27, 2011, 12:11:28 PM
It's even worse! 30 or so years ago, Cray was making a transition from
an old compiler to a new compiler. The old compiler was very good at
single DO loops, but not as good at multiple nested loops. So it did
template matching on the source code and would transform a set of matrix
multiply loops into a call to a finely tuned matrix multiply routine.
The new compiler was better at multiple nested loops, so it transformed
calls to matmul into inline code. Both strategies worked fine on the
same machine.

Dick Hendrickson
