
Matrix Multiplication


Evgenii Rudnyi
Jan 3, 2008, 2:32:08 PM
Hello,

I have written an introductory text about matrix multiplication where
I compare the performance in NumPy, Fortran, C, and C++. The direct
(naive) implementation is compared with ATLAS.
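For concreteness, the direct (naive) implementation being benchmarked is presumably the textbook triple loop; a minimal sketch in C (my own illustration, not the article's actual code):

```c
#include <stddef.h>

/* Naive O(n^3) matrix multiply: C = A * B, all n-by-n, row-major. */
void matmul_naive(size_t n, const double *A, const double *B, double *C)
{
    for (size_t i = 0; i < n; i++)
        for (size_t j = 0; j < n; j++) {
            double sum = 0.0;
            for (size_t k = 0; k < n; k++)
                sum += A[i * n + k] * B[k * n + j];
            C[i * n + j] = sum;
        }
}
```

Even compiled with full optimization, a loop like this trails a tuned BLAS by a large factor on big matrices, which is what the comparison with ATLAS demonstrates.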

http://matrixprogramming.com/MatrixMultiply/

Your comments are welcome.

Best wishes,

Evgenii Rudnyi

rmau...@googlemail.com
Jan 3, 2008, 3:40:45 PM
Surely modern performance issues worthy of discussion relate to vector
FPU and the choice of AoS/SoA data layout.
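A minimal sketch of the layout choice mentioned here (the `point` types are hypothetical, invented for illustration): summing one field is a strided walk through the AoS layout, but a unit-stride, vectorizer-friendly stream through the SoA layout.

```c
#include <stddef.h>

/* Array-of-structures: each point's fields are interleaved in memory. */
struct point_aos { double x, y, z; };

/* Structure-of-arrays: each field is its own contiguous array, which is
 * the layout vector (SIMD) units load most efficiently. */
struct points_soa { double *x, *y, *z; size_t n; };

/* Summing one field touches every third double in the AoS layout... */
double sum_x_aos(const struct point_aos *p, size_t n)
{
    double s = 0.0;
    for (size_t i = 0; i < n; i++) s += p[i].x;
    return s;
}

/* ...but a contiguous, easily vectorized stream in the SoA layout. */
double sum_x_soa(const struct points_soa *p)
{
    double s = 0.0;
    for (size_t i = 0; i < p->n; i++) s += p->x[i];
    return s;
}
```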

But it's a nice "introduction" ...

"it is assumed that you already know it. If you are in doubt, please
refresh your knowledge in Wikipedia"

- Richard Maudsley

Gordon Sande
Jan 3, 2008, 4:30:36 PM

If one were looking at this like a referee, one would be questioning an
author who seems to like to use a 30-year-old version of Fortran.
There have been three revisions since 1977. Fortran 90 introduced a lot
of matrix operations which are never mentioned. The G77 compiler has
not been supported for several years.

When an author goes out of the way to "think old" one wonders how much
of the rest of the material has the same attributes.

Both a good implementation of an interpreter and of Fortran 90 are free
to use the best available subroutine libraries for their purposes. The
GNU successor to G77 may or may not have reached the maturity of using
a highly optimized run time for matrix multiply. Since it is developed
by volunteers one can hardly criticize it for such a lapse when its
major achievement is that it is available at zero cost on many platforms.
Optimization is not its strong point and what is there is because of
work directed at other languages in the GCC language family.

There is also the question of the intended audience for an introductory
text and whether that audience should be treated to content which is so
dated. In a fast skim I noticed some comments on memory issues. I did not
notice comments on the issues of size scaling when the matrix fits in the
cache memory and when it does not. When it does not, there is the issue of
subscript order that may or may not be cache friendly. The advice that
good subroutine packages deal with such issues is good. One would like
some indication of the problem they address even if the full technical
nature of the solution is not given. Keeping such a discussion both
accurate and introductory is not easy.

Fortran 77 is important to know because of the large amount of excellent
developed software that is available. But it should be recognized for what
it is. It is a marvelous legacy foundation but there has been progress.

There even are some who argue that the array features for F90 are such that
they prefer to use it over things like MatLab as it does not suffer as severe
of performance penalties on "other" programming with the same compactness and
convenience of the interpreter. Such assessments would strongly depend on
the mix of applications.

Stephen Montgomery-Smith
Jan 3, 2008, 6:24:39 PM

Very interesting.

Do you know how the writers of ATLAS made their code so very much
faster? Do they write directly in assembler, or are there other
tricks they use?

user923005
Jan 4, 2008, 3:43:26 AM
On Jan 3, 3:24 pm, Stephen Montgomery-Smith wrote:

Here is the ATLAS FAQ:
http://math-atlas.sourceforge.net/faq.html

Evgenii Rudnyi
Jan 4, 2008, 1:43:23 PM
Hi Richard,

> Surely modern performance issues worthy of discussion relate to vector
> FPU and the choice of AoS/SoA data layout.
>
> But it's a nice "introduction" ...
>
> "it is assumed that you already know it. If you are in doubt, please
> refresh your knowledge in Wikipedia"

I am sorry for the confusion. I meant an introduction to the implementation
issues for a person who has already learned the theory.

Evgenii

Evgenii Rudnyi
Jan 4, 2008, 2:20:18 PM
Hello Gordon,

Thank you for your comments.

> If one were looking at this like a referee one would be questioning an
> author who seems to like to use a 30 year old version of Fortran.
> There are three revisions since 1977. Fortran 90 introduced a lot
> of matrix operations which are never mentioned.

I have written that I know Fortran 77 only and I have mentioned that
the newer versions are available. If you are interested why, the
answer is simple: I switched to C++ long ago.

You can find my comments on Fortran vs. C++ here

http://evgenii.rudnyi.ru/doc/misc/fortran.txt

but we do not have to agree on this. Tastes differ.

> When an author goes out of the way to "think old" one wonders how much
> of the rest of the material has the same attributes.

I would appreciate it if you could be more specific here.

> Both a good implementation of an interpreter and of Fortran 90 are free
> to use the best available subroutine libraries for their purposes. The
> GNU successor to G77 may or may not have reached the maturity of using
> a highly optimized run time for matrix multiply.

Is Intel Fortran good? If you want, I will compile the code with
it next week.

By the way, g95 produces exactly the same results as g77 with this
code. If you could contribute code in newer Fortran, please do. It would
be my pleasure to compile it and compare timings.

> There is also the question of the intended audience for an introductory
> text and whether that audience should be treated to content which is so
> dated.

I am not sure I understand you. Do you mean that NumPy, C, C++ and
ATLAS are outdated?

> In a fast skim I noticed some comments on memory issues. I did not
> notice comments on the issues of size scaling when the matrix fits in the
> cache memory and when it does not. When it does not, there is the issue of
> subscript order that may or may not be cache friendly. The advice that
> good subroutine packages deal with such issues is good. One would like
> some indication of the problem they address even if the full technical
> nature of the solution is not given. Keeping such a discussion both
> accurate and introductory is not easy.

Well, the goal was learning by doing. One first observes something and
then starts thinking about why it is so. I guess it is understandable that
a 1000x1000 matrix does not fit in the processor cache.
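For what it is worth, the subscript-order issue raised above can be demonstrated with the multiplication itself: keeping the same O(n^3) arithmetic but swapping the two inner loops gives the innermost loop unit stride in row-major storage, which is typically several times faster once the matrices spill out of cache. A sketch in C (my illustration, not the article's code):

```c
#include <stddef.h>
#include <string.h>

/* Same arithmetic as the textbook i-j-k loop, but with the k and j loops
 * swapped the innermost loop walks B and C row-by-row (unit stride in
 * row-major storage), which is far kinder to the cache for large n. */
void matmul_ikj(size_t n, const double *A, const double *B, double *C)
{
    memset(C, 0, n * n * sizeof *C);
    for (size_t i = 0; i < n; i++)
        for (size_t k = 0; k < n; k++) {
            double a = A[i * n + k];
            for (size_t j = 0; j < n; j++)
                C[i * n + j] += a * B[k * n + j];
        }
}
```

The result is bit-identical to the naive order for the same operands; only the traversal order of memory changes.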

Best wishes,

Evgenii
http://MatrixProgramming.com

Evgenii Rudnyi
Jan 4, 2008, 2:33:46 PM
Hi Stephen,

> Do you know how the writers of ATLAS made their code so very much
> faster? Do they write directly in assembler, or are there other
> tricks they use?

ATLAS is written in C, so you can read it. However, it is more
complicated than that. If I understood it correctly, it generates and
compiles code on the fly after many tests.

You can find some papers here

http://www.cs.utsa.edu/~whaley/papers.html

Evgenii

Gordon Sande
Jan 4, 2008, 3:04:49 PM
On 2008-01-04 15:20:18 -0400, Evgenii Rudnyi <use...@rudnyi.ru> said:

> Hello Gordon,
>
> Thank you for your comments.
>
>> If one were looking at this like a referee one would be questioning an
>> author who seems to like to use a 30 year old version of Fortran.
>> There are three revisions since 1977. Fortran 90 introduced a lot
>> of matrix operations which are never mentioned.
>
> I have written that I know Fortran 77 only and I have mentioned that
> the newer versions are available. If you are interested why, the
> answer is simple - I have switched to C++ long ago.
>
> You can find my comments Fortran vs. C++ here
>
> http://evgenii.rudnyi.ru/doc/misc/fortran.txt

I see that you have Van Snyder's citation of the Les Hatton work on
empirical error rates in real programs. C++ does not look good there.
Fortran 90 helps find some of the call mismatches that are possible in
Fortran 77 if one uses the newer facilities. The subscript errors
would remain and are a more serious issue.

Fortran 90 has matrix multiply in the language. Why do you then
talk about implementing matrix multiply? Because you choose
to make comparisons against the 1977 standard? Perhaps you should
also use the C++ version from 30 years ago. If you cannot, then
perhaps that indicates that things do change.

> but we do have to agree on this. Taste differs.
>
>> When an author goes out of the way to "think old" one wonders how much
>> of the rest of the material has the same attributes.
>
> I would appreciate if you will be more specific here.
>
>> Both a good implementation of an interpreter and of Fortran 90 are free
>> to use the best available subroutine libraries for their purposes. The
>> GNU successor to G77 may or may not have reached the maturity of using
>> a highly optimized run time for matrix multiply.
>
> Is Intel Fortran good? If you want it to, I will compile the code with
> it the next week.

As the saying goes "Old Fortran programmers can write Fortran in any
language" so using a newer compiler for Fortran 77 changes nothing
other than some operating system dependence.

You might want to redo things with Fortran 90 array notation and
use the matrix multiply intrinsics as an exercise. If Numerical
Recipes can have (and have had for a long time) a Fortran 90
version then one would expect others to meet even that minimal standard.

> By the way, g95 produces exactly the same results as g77 with this
> code. If you could donate the code in newer Fortran, please. It would
> be my pleasure to compile it and compare timing.

Gfortran has the intent of reproducing G77 on the source that G77
accepts, subject to exceptions over dropped extensions. So that is
hardly very surprising.

>> There is also the question of the intended audience for an introductory
>> text and whether that audience should be treated to content which is so
>> dated.
>
> I am not sure I understand you. Do you mean that NumPy, C, C++ and
> ATLAS are outdated?

There is an old saying giving two recipes for swill.

One has a tablespoon of swill and a gallon of fine wine. The other
has a gallon of swill and a tablespoon of fine wine. Both produce
swill.

>> In a fast skim I noticed some comments on memory issues. I did not
>> notice comments on the issues of size scaling when the matrix fits in the
>> cache memory and when it does not. When it does not, there is the issue of
>> subscript order that may or may not be cache friendly. The advice that
>> good subroutine packages deal with such issues is good. One would like
>> some indication of the problem they address even if the full technical
>> nature of the solution is not given. Keeping such a discussion both
>> accurate and introductory is not easy.
>
> Well, the goal was learning by doing. One first observes something and
> then starts thinking about why it is so. I guess it is understandable that
> a 1000x1000 matrix does not fit in the processor cache.

If you are trying to give advice then mentioning the effects of
subscript order on both cache and page hits as sizes grow would
seem to reflect the modern world. Time the filling of a large array
in random order, as some problems might suggest, against the
same content in sequential order.
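The suggested experiment can be sketched directly (my own illustration; the function names are invented). Timing each call with `clock()` over an array much larger than the cache shows the scattered version losing badly:

```c
/* Fill a[0..n-1] in sequential order; the checksum keeps the work
 * from being optimized away. */
double fill_sequential(double *a, unsigned n)
{
    double s = 0.0;
    for (unsigned i = 0; i < n; i++) { a[i] = (double)i; s += a[i]; }
    return s;
}

/* Store the same values through a scattered index stream (a crude LCG).
 * The arithmetic is identical, but each store lands on a different cache
 * line, so for n much larger than the cache this runs many times slower
 * than the sequential version. */
double fill_scattered(double *a, unsigned n)
{
    unsigned x = 1;
    for (unsigned i = 0; i < n; i++) {
        x = x * 1664525u + 1013904223u;   /* LCG step */
        a[x % n] = (double)i;
    }
    double s = 0.0;
    for (unsigned i = 0; i < n; i++) s += a[i];
    return s;
}
```

With n around 1 << 24 (roughly 128 MB of doubles) the gap is typically an order of magnitude; the same effect is what drives the subscript-order advice for matrices.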

Evgenii Rudnyi
Jan 4, 2008, 4:24:40 PM
I would suggest switching from words to actions. I do not know newer
Fortran and I am not sure I want to learn it. Could you please give
me a program in newer Fortran that multiplies two matrices and with
which you are satisfied? Then I will compile it with the Intel compiler
and compare it with my code. Does this sound reasonable?

user923005
Jan 4, 2008, 7:44:19 PM
On Jan 4, 1:24 pm, Evgenii Rudnyi <use...@rudnyi.ru> wrote:
> I would suggest switching from words to actions. I do not know newer
> Fortran and I am not sure I want to learn it. Could you please give
> me a program in newer Fortran that multiplies two matrices and with
> which you are satisfied? Then I will compile it with the Intel compiler
> and compare it with my code. Does this sound reasonable?

I guess that if you make a request to news:comp.lang.fortran for them
to provide a modern and fast F95 Fortran matrix multiply someone will
do it for you. Atlas will still clobber it, of course.

Gordon Sande
Jan 4, 2008, 9:00:08 PM

A Fortran 90 compiler is free to use Atlas, or any other method that
its authors care to choose, to implement its MATMUL intrinsic. The
early compiler implementations would not have but by now the run
time support can be expected to be as good as the state of the art.
A carefully written code like Atlas would be an obvious candidate
for inclusion in the run time library. A vendor specific
implementation might have some improvements on a more generic
version of the Atlas code.

The array operations available since Fortran 90 include matrix
multiply so there is no need to "do it longhand" in the style
of the algebraic definitions.


user923005
Jan 5, 2008, 12:25:57 AM
On Jan 4, 6:00 pm, Gordon Sande <g.sa...@worldnet.att.net> wrote:

> On 2008-01-04 20:44:19 -0400, user923005 <dcor...@connx.com> said:
>
> > On Jan 4, 1:24 pm, Evgenii Rudnyi <use...@rudnyi.ru> wrote:
> >> I would suggest switching from words to actions. I do not know newer
> >> Fortran and I am not sure I want to learn it. Could you please give
> >> me a program in newer Fortran that multiplies two matrices and with
> >> which you are satisfied? Then I will compile it with the Intel compiler
> >> and compare it with my code. Does this sound reasonable?
>
> > I guess that if you make a request to news:comp.lang.fortran for them
> > to provide a modern and fast F95 Fortran matrix multiply someone will
> > do it for you.  Atlas will still clobber it, of course.
>
> A Fortran 90 compiler is free to use Atlas, or any other method that
> its authors care to choose, to implement its MATMUL intrinsic. The
> early compiler implementations would not have but by now the run
> time support can be expected to be as good as the state of the art.
> A carefully written code like Atlas would be an obvious candidate
> for inclusion in the run time library. A vendor specific
> implementation might have some improvements on a more generic
> version of the Atlas code.

I would be surprised if they chose ATLAS. On the other hand, a
hardware specific version (e.g. using the Intel compiler with Intel's
BLAS library) would possibly do even better than ATLAS.

> The array operations available since Fortran 90 include matrix
> multiply so there is no need to "do it longhand" in the style
> of the algebraic definitions.

None of the routines in the original post do the N^3 naive way.

Evgenii Rudnyi
Jan 5, 2008, 2:37:25 AM

Thanks, probably at some point I will. It would be interesting to
see it indeed. I just wonder why Gordon prefers to keep silent when it
comes to the code.

b52b
Jan 5, 2008, 5:06:16 AM
On Jan 5, 8:37 am, Evgenii Rudnyi <use...@rudnyi.ru> wrote:
> I just wonder why Gordon prefers to keep silent when it
> comes to the code.

This is a matter of ethics. Writing an educational text is hard because
you should be up to date. That is what Gordon said, in my opinion.
You cannot blame him - it is your article.
My advice: send your request to comp.lang.fortran.
Otherwise this is another face of the language war.

Regards,
B52

Evgenii Rudnyi
Jan 5, 2008, 7:53:36 AM
On Jan 5, 11:06 am, b52b <jam...@wp.pl> wrote:
> On Jan 5, 8:37 am, Evgenii Rudnyi <use...@rudnyi.ru> wrote:
> > I just wonder why Gordon prefers to keep silent when it comes to the code.
>
> This is a matter of ethics. Writing an educational text is hard because
> you should be up to date. That is what Gordon said, in my opinion.
> You cannot blame him - it is your article.

Sorry, but I would disagree. My text is completely up to date. The
current practice for linear algebra with dense matrices is to use
LAPACK (by the way, it is written in Fortran 77) together with the
optimized BLAS. From which programming language one calls LAPACK and
the optimized BLAS does not matter.

Also if you read the conclusion, it says

http://matrixprogramming.com/MatrixMultiply/
-------------------------------
I hope that this chapter has convinced you that it is very important
to use libraries even for seemingly simple things, like matrix
multiplication.
--------------------------------

I am pretty sure this concerns the newer Fortran as well. There are no
miracles: either one uses the optimized DGEMM or performance suffers. The
use of the optimized DGEMM can be behind the scenes, as in NumPy, Matlab
and Mathematica; this does not matter.

> My advice: send your request to comp.lang.fortran.
> Otherwise this is another face of the language war.

This is what I wanted to avoid, as personally I believe that good
performance can be achieved in any programming language provided that
one understands the technology.

My text is more about technology, not about a programming language.

Evgenii

Gordon Sande
Jan 5, 2008, 10:51:22 AM

Thanks to Dave Dobson for a sensible response.

It is your article to write. You have picked a hard topic and asked for
comments. You got them. They pointed out consequences of your choice to
switch to C++ in 1993 and to extol its virtues. Your choice of systems
to compare is certainly idiosyncratic. Python and C++ vs. C and F77 seems
to be more a setup for language wars than about Matrix Programming.
Insisting that comments come with "your code" is following the best
newsgroup language war traditions.

If you want to write about matrix programming then where is the discussion
of MatLab and its clones? Fortran 90 lives well in that crowd with its
array operations. Matrix multiply is one of the supplied intrinsics. To
ignore that reality of current practice is to either be misinformed or
out of date. Once one has gotten the concepts sorted out with vector and
matrix capabilities then one can go back to the element based operations.
To dismiss F90 with the comment that you only know F77 would appear
to be an admission that you do not want to follow the contemporary style
of array notation. Comments on your pedagogy seem to be treated
as comments on your abilities. Such defensiveness is an overreaction but,
as the saying goes, "If the shoe fits ...".


Evgenii Rudnyi
Jan 5, 2008, 2:37:29 PM
Dear Gordon,

>It is your article to write. You have picked a hard topic and asked for
>comments. You got them.

I do appreciate your comments and your time. Thank you. Unfortunately,
I still did not completely get your points. Hence if you allow I will
ask some more questions.

>They pointed out consequences of your choice to
>switch to C++ in 1993 and to extol its virtues. Your choice of systems
>to compare is certainly idiosyncratic. Python and C++ vs. C and F77 seems
>to be more a setup for language wars than about Matrix Programming.
>Insisting that comments come with "your code" is following the best
>newsgroup language war traditions.

Here I would disagree. The use of several programming languages in one
project is common practice nowadays, at least from what I see. For
example, I am using LAPACK, which is written in Fortran 77, and MUMPS,
which is written in Fortran 90. ATLAS, TAUCS and UMFPACK are written in
C. My code is in C++ and now I am going to use Python as a scripting
language.

Anyway, do I have a freedom to use a programming language of my choice
or not?

>If you want to write about matrix programming then where is the discussion
>of MatLab and its clones?

In my text

http://matrixprogramming.com/MatrixMultiply/

there were links to both Matlab and Mathematica.

>Fortran 90 lives well in that crowd with its
>array operations. Matrix multiply is one of the supplied intrinsics. To
>ignore that reality of current practice is to either be misinformed or
>out of date.

The matrix multiply is also intrinsic in Matlab, Mathematica and NumPy
and I guess in many other systems. The example with matrix multiply as
intrinsic is at the beginning of my text in NumPy - See mm.py.

In my opinion, everything here depends on whether this intrinsic calls
the optimized BLAS or not. If yes, it will be efficient in any
language. If not, it will be inefficient in any language. Do you
agree with this statement?

>Once one has gotten the concepts sorted out with vector and
>matrix capabilities then one can go back to the element based operations.
>To dismiss F90 with the comment that you only know F77 would appear
>to be an admission that you do not want to follow the contemporary style of
>use of the array notations.

I see your point here. Thank you. Yes, I should have mentioned the
contemporary style of the array notation that is available in many
modern languages. Will do.

Finally, one more question about Fortran 77 and Fortran 9*. LAPACK is
written in Fortran 77 and many say that it is the most efficient
library for dense linear algebra (of course, when one uses it with the
optimized BLAS). Well, on Netlib there is a Fortran 95 interface to
LAPACK, but this is only an interface; the computational engine is
still written in Fortran 77. Could you please comment in
this respect?

Thank you again. If you find time to answer questions above, I would
appreciate it.

Best wishes,

Evgenii

Gordon Sande
Jan 5, 2008, 3:29:34 PM
On 2008-01-05 15:37:29 -0400, Evgenii Rudnyi <use...@rudnyi.ru> said:

> Dear Gordon,
>
>> It is your article to write. You have picked a hard topic and asked for
>> comments. You got them.
>
> I do appreciate your comments and your time. Thank you. Unfortunately,
> I still did not completely get your points. Hence if you allow I will
> ask some more questions.
>
>> They pointed out consequences of your choice to
>> switch to C++ in 1993 and to extol its virtues. Your choice of systems
>> to compare is certainly idiosyncratic. Python and C++ vs. C and F77 seems
>> to be more a setup for language wars than about Matrix Programming.
>> Insisting that comments come with "your code" is following the best
>> newsgroup language war traditions.
>
> Here I would disagree. The use of several programming languages in one
> project is common practice nowadays, at least from what I see. For
> example, I am using LAPACK, which is written in Fortran 77, and MUMPS,
> which is written in Fortran 90. ATLAS, TAUCS and UMFPACK are written in
> C. My code is in C++ and now I am going to use Python as a scripting
> language.
>
> Anyway, do I have a freedom to use a programming language of my choice
> or not?

Your choice is for you. When you start to write expositions, you need
to have well-based reasons for your recommendations. "I only know F77"
would not seem to make it as a reason for ignoring F90 in an exposition.
It may be OK for your next programming assignment but it will look curious
on your resume if you are making a point of well-based recommendations
on modern practice.

You asked for comments on your Matrix Multiply. You "forgot" to mention
that F90 has that as an intrinsic, which will be as efficient or not as
any other system that has run-time support for such an operation.
You claimed that it did not much matter as all you knew was F77. For
someone who seems to be proud of using modern features that is a most
curious omission.

LAPACK is MUCH more than just matrix multiply. Some of those things
are not particularly friendly to array notation so would not use
the array notation features of F90. The software engineering enhancements
of F90 are provided by the application interface offered by the F90
wrappers. One would expect to see the array operations in use in
setting up the data which is then operated on by LAPACK for the
more elaborate operations. The F90 type safety (which unfortunately
requires that you actually use the capabilities) would help avoid
programming errors. LAPACK was developed over considerable time so
one would not expect to see it redone instantly. There is no native
C version of it either.

Among the strengths of Fortran are both its long history and its upwards
compatibility. That should not be mistaken for a license to believe that
it is still your grandfather's F66. The history of the long delay in
producing F77 compilers, the long delay in getting to F90 and the
popularity of free Unix versions used to teach system programming
has led many computer scientists to think that it is smart computer
science to believe in F66. But those are the same folks who invert
matrices when they solve equations, etc. If you are going to discuss
matrix computations then one has every right to expect better and
to say so.

One of the advantages of MatLab and such is that they provide a relatively
simple conceptual model. This is important for those who only program
part time in otherwise technical jobs. Fortran has the same simple
conceptual base. The arcane technical issues that arise due to its
ability to support several differing implementation strategies make
that a bit of a stretch if pushed hard. For your apparent introductory
audience a simple conceptual model is a great bonus.

Evgenii Rudnyi
Jan 6, 2008, 4:25:15 PM
Gordon,

Once more, thank you very much for your comments. They helped me to
look at my text from a different viewpoint and see how it should be
modified.

I still cannot accept your suggestion to use Fortran 9*, as my plan is
different. I am a practitioner and I plan to write the text from a
practical viewpoint. The emphasis will be on the use of existing
numerical libraries, with stress on sparse matrices. So the use of
different programming languages is by design. However, I will limit
myself to Fortran 77, C and C++, as this is what I know. Unfortunately
it is impossible to learn everything. Still, such a choice allows me to
cover most numerical libraries.

An important topic that actually I plan to cover is integration. It
will include interoperability between different programming languages
(Fortran 77, C and C++), linking issues and then using a scripting
language as an integration platform. The latter is quite appealing
from many different considerations.

By the way, one library that I plan to cover is MUMPS and it is
written in Fortran 90.

I wish you success. Long live Fortran.

Evgenii

user923005
Jan 7, 2008, 2:54:04 PM

I think it is a good idea to let the Fortran 9x programmers give it a
try.
If they can come up with a marvelous solution, I would like to see it.

To the Fortran camp: this article,
http://matrixprogramming.com/MatrixMultiply/

discusses matrix multiplication. Can you provide an optimal Fortran
9x solution for comparison purposes?


Gordon Sande
Jan 7, 2008, 3:06:29 PM
On 2008-01-07 15:54:04 -0400, user923005 <dco...@connx.com> said:

> On Jan 6, 1:25 pm, Evgenii Rudnyi <use...@rudnyi.ru> wrote:

>>
>>> You asked for comments on your Matrix Multiply. You "forgot" to mention
>>> that F90 has that as an intrinsic which will be as efficient or not as
>>> any other system that has a run time support for such an operation.
>

> I think it is a good idea to let the Fortran 9x programmers give it a
> try.
> If they can come up with a marvelous solution, I would like to see it.
>
> To the Fortran camp, this article:
> http://matrixprogramming.com/MatrixMultiply/
>
> Discusses matrix multiplication. Can you provide an optimal Fortran
> 9x solution for comparison purposes?

Why would anyone develop a matrix multiply when there is a MATMUL intrinsic?

Mostly you can expect to be told to read the fine manual.

Are you still under the impression that Rudnyi is doing anything other
than a three-loop N^3 in his various examples? He has figured out that
there are various languages, including F90, that share the attribute
of having matrix multiply built in and that all of them will be equally
efficient, or not, depending upon how diligent their implementors are.
No effort is required on the part of the end programmer.


user923005
Jan 7, 2008, 3:40:49 PM
On Jan 7, 12:06 pm, Gordon Sande <g.sa...@worldnet.att.net> wrote:

> On 2008-01-07 15:54:04 -0400, user923005 <dcor...@connx.com> said:
>
> > On Jan 6, 1:25 pm, Evgenii Rudnyi <use...@rudnyi.ru> wrote:
>
> >>> You asked for comments on your Matrix Multiply. You "forgot" to mention
> >>> that F90 has that as an intrinsic which will be as efficient or not as
> >>> any other system that has a run time support for such an operation.
>
> > I think it is a good idea to let the Fortran 9x programmers give it a
> > try.
> > If they can come up with a marvelous solution, I would like to see it.
>
> > To the Fortran camp, this article:
> >http://matrixprogramming.com/MatrixMultiply/
>
> > discusses matrix multiplication. Can you provide an optimal Fortran
> > 9x solution for comparison purposes?
>
> Why would anyone develop a matrix multiply when there is a MATMUL intrinsic?

Because the matrix multiply library call is faster than the MATMUL
intrinsic?

> Mostly you can expect to be told to read the fine manual.
>
> Are you still under the impression that Rudnyi is doing anything other
> than a three loop N^3 in his various examples. He has figured out that
> there are various languages, including F90, that share the attribute
> of having matrix multiply built-in and that all of them will be equally
> efficient, or not, depending upon how diligent their implementors are.
> No effort required on the part of the end programmer.

Then if an F90 or F95 benchmark does not exceed the speed of the Atlas
code, Rudnyi has made his point, and all the complaints about his
not using F90 or F95 are just so much hot air.

Sebastian Hanigk
Jan 8, 2008, 4:46:28 AM
user923005 <dco...@connx.com> writes:

>> Why would anyone develop a matrix multiply when there is a MATMUL intrinsic?
>
> Because the matrix multiply library call is faster than the MATMUL
> intrinsic?

The whole discussion is rather strange. If you have no dire performance
needs, use the language's features, they're usually good enough. Anyway,
it's not a question of the language but of the algorithm and its
implementation. The original poster's naive three-loop implementation
shootout is not a good benchmark. For other strategies I could point you
in the direction of cache-obliviousness[1].
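For readers wondering what "cache-oblivious" means in practice: instead of tuning a block size to one particular cache, the algorithm recursively halves the largest dimension, so at some recursion depth the subproblems fit in every level of the memory hierarchy. A hedged sketch in C of the generic divide-and-conquer idea (my own illustration, not the Peano-curve ordering of the cited paper; the base-case threshold is arbitrary):

```c
#include <stddef.h>
#include <string.h>

/* Recursive cache-oblivious multiply-accumulate: C += A * B, where the
 * blocks are m x k (A), k x n (B) and m x n (C), all stored row-major
 * inside matrices with leading dimension ld. */
static void mm_rec(size_t m, size_t n, size_t k,
                   const double *A, const double *B, double *C, size_t ld)
{
    if (m + n + k <= 48) {              /* small base case: plain loops */
        for (size_t i = 0; i < m; i++)
            for (size_t p = 0; p < k; p++) {
                double a = A[i * ld + p];
                for (size_t j = 0; j < n; j++)
                    C[i * ld + j] += a * B[p * ld + j];
            }
        return;
    }
    if (m >= n && m >= k) {             /* split rows of A and C */
        size_t h = m / 2;
        mm_rec(h, n, k, A, B, C, ld);
        mm_rec(m - h, n, k, A + h * ld, B, C + h * ld, ld);
    } else if (n >= k) {                /* split columns of B and C */
        size_t h = n / 2;
        mm_rec(m, h, k, A, B, C, ld);
        mm_rec(m, n - h, k, A, B + h, C + h, ld);
    } else {                            /* split the inner dimension */
        size_t h = k / 2;
        mm_rec(m, n, h, A, B, C, ld);
        mm_rec(m, n, k - h, A + h, B + h * ld, C, ld);
    }
}

void matmul_oblivious(size_t n, const double *A, const double *B, double *C)
{
    memset(C, 0, n * n * sizeof *C);
    mm_rec(n, n, n, A, B, C, n);
}
```

A tuned library would instead stop the recursion at a block sized to fill registers and L1, which is roughly what ATLAS searches for empirically; the cache-oblivious version trades a little of that peak for portability.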

> Then if a F90 or F95 benchmark does not exceed the speed of the Atlas
> code, then Rudnyi has made his point, and all the complaints about his
> not using F90 or F95 are just so much hot air.

This is ridiculous. Take for example Intel's MKL and then try to write
a homegrown (serial) matrix-multiplication routine: regardless of your
choice of language you will most probably not exceed the MKL's
performance because that's what Intel's engineers have already done. On
the other hand look at the paper I've cited above: for parallel cases,
we have implemented two different algorithms and gained a bit of
performance compared to standard libraries[2], but the choice of
language was mostly driven by preference (Fortran 95).


Sebastian


[1] Bader & Zenger: Cache oblivious matrix multiplication using an
element ordering based on a Peano curve, Linear Algebra and Its
Applications 417, 301-313, Elsevier, 2006

[2] Bader et al.: Parallelisation of Block Recursive Matrix
Multiplication in Prefix Computations, in: Proceedings of the ParCo,
Parallel Computing 2007, 2008

pixel
Jan 14, 2008, 7:39:04 PM
Sebastian Hanigk wrote:
>
> The whole discussion is rather strange. If you have no dire performance
> needs, use the language's features, they're usually good enough. Anyway,
> it's not a question of the language but of the algorithm and its
> implementation. The original poster's naive three-loop implementation
> shootout is not a good benchmark. For other strategies I could point you
> in the direction of cache-obliviousness[1].

It's as good a benchmark as any - it dismantles your cache-obliviousness
claim which, if true, should produce a big performance gain, not a tiny
one. Furthermore, language does influence the performance - change
arrays from static to dynamic, make matrix sizes a power of two, and
reevaluate what's "ridiculous" about the *same* algorithm grinding to a
halt.

Sebastian Hanigk

Jan 15, 2008, 3:31:33 AM
pixel <pi...@pica.net> writes:

>> The original poster's naive three-loop implementation shootout is not
>> a good benchmark. For other strategies I could point you in the
>> direction of cache-obliviousness[1].
>
> It's as good a benchmark as any

If you want to benchmark the matrix multiplication performance, it is
not a good benchmark because the implemented algorithm is suboptimal
from a theoretical and practical point of view.

> - it dismantles your cache-obliviousness claim which, if true, should
> produce a big performance gain, not a tiny one.

Careful. Cache-oblivious algorithms will not provide a performance boost
if your existing algorithm and its implementation are already quite
good. Look again at matrix multiplication: Intel's MKL routines provide
already over 90% of theoretical peak performance, one does not expect
large benefits here. The real benefit of a cache-oblivious algorithm is
its portability; instead of fiddling around with cache sizes, optimal
blocking and so on for every new processor and architecture, you
implement a cache-oblivious algorithm once and it's good on every
architecture - but obviously it will have a hard time against
hardware-optimised implementations.

> Furthermore, language does influence the performance - change arrays
> from static to dynamic, make matrix sizes a power of two, and
> reevaluate what's "ridiculous" about the *same* algorithm grinding to
> a halt.

I do not negate the influence of the choice of languages, but compared
with the right choice of algorithms it should be a small effect. There
exists a language benchmark (<http://shootout.alioth.debian.org/>), but
beware that they compare compounds of implementation, language and
run-time system against each other.


Sebastian

user923005

Jan 15, 2008, 3:12:55 PM
On Jan 15, 12:31 am, Sebastian Hanigk <han...@in.tum.de> wrote:

> pixel <p...@pica.net> writes:
> >> The original poster's naive three-loop implementation shootout is not
> >> a good benchmark. For other strategies I could point you in the
> >> direction of cache-obliviousness[1].
>
> > It's as good a benchmark as any
>
> If you want to benchmark the matrix multiplication performance, it is
> not a good benchmark because the implemented algorithm is suboptimal
> from a theoretical and practical point of view.

I suspect that zero percent of benchmark software systems have
implemented algorithms that are both theoretically and practically
optimal. Generally speaking, benchmarks tend to be ANSI/ISO code that
will compile across a zillion platforms so that we can figure out how
well compilers or floating point units perform. I do not know
of any benchmark used to examine the optimality of an algorithm
(though it does sound interesting, now that I think of it!).

The addition of ATLAS probably pushed that particular benchmark closer
to the mark than most. The primitive multiplication is of limited
value but for the given problem size (1000) the theoretically optimal
algorithms will not offer a huge increase in performance anyway.

> > - it dismantles your cache-obliviousness claim which, if true, should
> > produce a big performance gain, not a tiny one.
>
> Careful. Cache-oblivious algorithms will not provide a performance boost
> if your existing algorithm and its implementation are already quite
> good. Look again at matrix multiplication: Intel's MKL routines provide
> already over 90% of theoretical peak performance, one does not expect
> large benefits here. The real benefit of a cache-oblivious algorithm is
> its portability; instead of fiddling around with cache sizes, optimal
> blocking and so on for every new processor and architecture, you
> implement a cache-oblivious algorithm once and it's good on every
> architecture - but obviously it will have a hard time against
> hardware-optimised implementations.

How well will those Intel MKL routines work on a Sun SPARC box or a
Mac? A benchmark that works on a single platform is a truly terrible
benchmark. In fact, I would go so far as to say a useless one. What
good is a benchmark that benchmarks a single platform?

> > Furthermore, language does influence the performance - change arrays
> > from static to dynamic, make matrix sizes a power of two, and
> > reevaluate what's "ridiculous" about the *same* algorithm grinding to
> > a halt.
>
> I do not negate the influence of the choice of languages, but compared
> with the right choice of algorithms it should be a small effect. There
> exists a language benchmark (<http://shootout.alioth.debian.org/>), but
> beware that they compare compounds of implementation, language and
> run-time system against each other.

I think it would be very nice to see the benchmark expanded. I would
like to see more optimal algorithms added and a harness to find the
crossover where the theoretically more efficient algorithms give
tangible benefit. It would be interesting even to compare external
libraries like MKL against ATLAS by having special hooks to add in
external libraries. Matrix multiplication is a very important
operation and so it would be nice if the benchmark could be expanded
into a large and well documented project.

In the meantime, I think it is interesting and worthwhile, especially
for a first effort. If there is a better matrix multiplication
benchmark, why don't we work on folding the two together?

When the OP originally made his post, he was lambasted for not using
Fortran 95. I doubt if it would have made a measurable difference.

Evgenii Rudnyi

Jan 15, 2008, 4:08:40 PM
I would like to re-iterate my point. The text that I have written was
not about making benchmarks at all. It was designed for novices who
are interested in programming of computations with matrices. The goal
was actually to show that a naive three loop implementation is not
good in any programming language. To this end I have chosen the
dimension 1000x1000 deliberately to make sure that the matrix does not
fit the processor cache. On the other hand, in my view it is useful to
make a three loop implementation at the beginning anyway, say as a
reference point.

For those who are interested in theory I would recommend the ATLAS
paper where there is a discussion of why a compiler cannot do the
things that ATLAS does.

So the main point was that it is important to use a good library even
in this seemingly simple case.

http://matrixprogramming.com/MatrixMultiply/

On Jan 15, 9:31 am, Sebastian Hanigk <han...@in.tum.de> wrote:

user923005

Jan 15, 2008, 11:06:24 PM
On Jan 15, 1:08 pm, Evgenii Rudnyi <use...@rudnyi.ru> wrote:
> I would like to re-iterate my point. The text that I have written was
> not about making benchmarks at all. It was designed for novices who
> are interested in programming of computations with matrices. The goal
> was actually to show that a naive three loop implementation is not
> good in any programming language. To this end I have chosen the
> dimension 1000x1000 deliberately to make sure that the matrix does not
> fit the processor cache. On the other hand, in my view it is useful to
> make a three loop implementation at the beginning anyway, say as a
> reference point.

When you time how long something takes by several different
approaches, it is a benchmark no matter what you call it.

> For those who are interested in theory I would recommend the ATLAS
> paper where there is a discussion why a compiler cannot do things that
> ATLAS does.
>
> So the main point was that it is important to use a good library even
> in this seemingly simple case.
>
> http://matrixprogramming.com/MatrixMultiply/

[snip of stuff under top-post]

Sebastian Hanigk

Jan 16, 2008, 3:52:18 AM
Evgenii Rudnyi <use...@rudnyi.ru> writes:

> I would like to re-iterate my point. The text that I have written was
> not about making benchmarks at all. It was designed for novices who
> are interested in programming of computations with matrices. The goal
> was actually to show that a naive three loop implementation is not
> good in any programming language.

This is a bit silly; your point is one of the textbooks' most prominent
examples of bad memory access, and it has nothing to do with programming
languages.

> To this end I have chosen the dimension 1000x1000 deliberately to make
> sure that the matrix does not fit the processor cache. On the other
> hand, in my view it is useful to make a three loop implementation at
> the beginning anyway, say as a reference point.

Reference point for what? I'm sorry if my comments are a bit acerbic,
but I still do not see what you are trying to accomplish. For pitting
languages against each other your example is too limited (a set of
problems like in the SPEC benchmarks or the language shootout is more
suitable), the only thing you could have shown would be the ability of
your compiler to do loop interchanging.

> So the main point was that it is important to use a good library even
> in this seemingly simple case.

Captain Obvious is calling ... ;-)

Usually one would recommend using language features first (Fortran,
for example, has matmul, dot_product, ... intrinsics), then looking for
vendor libraries, because the chance that they're optimised is not so
low.

Still, I don't know what this has to do with your choice of languages;
most of them provide a way to interface with C routines which lets you
use almost any library, be it a Fortran one like BLAS, be it C-API stuff
from the operating system and so on.


Sebastian

Evgenii Rudnyi

Jan 16, 2008, 4:26:26 PM
Sebastian,

The text is designed for people who have learned a bit of programming
and a bit of linear algebra and are making their first steps in practical
implementation. As such, the text is quite obvious for experts, I
agree with that.

I do not know if you have heard of "learning by doing". A person does
something and then observes the results. When someone compares a
three-loop implementation with an optimized BLAS by doing it, in my view
this can be much more effective than just reading the guidelines.

The choice of programming languages I have already explained. These
are languages that I know. Not a very scientific explanation, I
understand, but that's it.

This is what I do in my practice: I use C++ and then call numerical
libraries in C and Fortran 77. I personally find it very efficient.
However, I see that this way is difficult for many young people, as
using it in practice requires knowing many small technical things.
Hence quite often they reinvent the wheel. So I plan to share my
experience, with the goal of showing that it is easy to use libraries
written in one programming language from another, with examples from
these three languages.

I have nothing against Fortran 9*, but as I have already said, I just
do not know it. If someone donates the code in Fortran 9*, I will
add it as well. Well, I have already added a link to this discussion.

There are many interesting programming languages and it is just
impossible to learn all of them. Say, relatively recently I have
learned Scheme, and this book

Structure and Interpretation of Computer Programs
http://mitpress.mit.edu/sicp/full-text/book/book.html

is very impressive and it even changed my way of thinking. I highly
recommend it. I am really sorry that I missed Lisp when I was
young.

Evgenii
http://MatrixProgramming.com
