Matmul

169 views
Skip to first unread message

samig

unread,
Feb 22, 2012, 7:06:25 PM2/22/12
to GNU Fortran
Is MATMUL implemented inherently unoptimized in gfortran (and any
other compiler I have tested !)? Why is it always beaten with orders
of magnitude by explicit loops (even in rowwise order !) ? Which
compiler/linker options would improve its performance?

Best regards

Tobias Burnus

unread,
Feb 23, 2012, 1:44:55 PM2/23/12
to gnu-f...@googlegroups.com, samig

MATMUL should generate the same performance as any (normal but
reasonable) manual loop implementation. (The order of the generated code
is O(n**3), which can be brought down to approx. O(n**2.7) to O(n**2.3)
using special algorithms, which only pay off if "n" is large to
extremely large. gfortran can use a special matmul external function,
cf. -fexternal-blas and -fblas-matmul-limit=n)

For gfortran, compiling and linking with
-Ofast -flto -march=native -funroll-loops
produces fast programs.

However, I assume that your manual implementation makes assumptions
which the compiler cannot do. In particular, the performance is
extremely reduced if the compiler cannot rule out that the right-hand
side and the left-hand side expression depend on each other.
Additionally, it helps with the performance if the compiler knows that
all involved arrays are contiguous in memory.

For instance:
subroutine example(a, b)
integer, pointer :: a(:), b(:)
a = b
end subroutine

The compiler will generate something like:
subroutine example(a, b)
integer, pointer :: a(:), b(:)
integer :: tmp(size(a)), i
do i = 1, size(a)
tmp(i) = b(i)
end do
do i = 1, size(a)
a(i) = tmp(i)
end do
end subroutine

While you had probably written:
subroutine example(a, b)
integer, pointer :: a(:), b(:)
integer :: i
do i = 1, size(a)
a(i) = b(i)
end do
end subroutine

The latter version is much faster: It saves the time for allocatation
and free the memory and for the extra loop assigning to tmp. However, it
will give wrong results for the following callee:

integer, target :: a(2)
integer, pointer :: ptr1(:), ptr2(:)
a = [1, 2]
ptr1 => a
ptr2 => a(2:1:-1)
call example(ptr1, ptr2)
print *, ptr1
end

Using the version of the compiler or the loop one with temporary, it
will print "2 1", using the loop version without temporary, one gets "2 2".

I assume that in your example, the compiler will also insert a temporary
while your loop doesn't. You can check by compiling with
-Warray-temporaries which shows where the compiler inserts a temporary.

To avoid temporaries:
* Don't use pointers - allocatables or deferred-shape arrays are better
* Avoid "target" - while not as bad as pointers, it can also cause slow down
* Use explicit-size arrays, allocatables or the CONTIGUOUS attribute to
tell the compiler that the array is contiguous.

* * *

However, if you have found an algorithm which is faster than O(n**2.7)
which also works reasonable well for smaller "n", I am really
interested. (The expectation is that matmul can be done in O(n**2); the
best algorithms seem to reach something close to O(n**2.3), but only one
O(n**2.7) algorithm seems to work reasonably well with large but
reasonable array sizes. For smaller ones, the straight forward O(n**3)
algorithms works quite well.)

Tobias

samig

unread,
Feb 24, 2012, 8:08:36 PM2/24/12
to GNU Fortran
Tobias,

I followed your recommendations and indeed I have seen a great
improvement. Explicit looping is still optimizing slightly better. I
haven't noticed any difference by linking to BLAS. Surprisingly also,
it seems that the compiler autoparallelization of MATMUL is slightly
better than explicitely enclosing MATMUL in omp parallel workshare.
What surprises me though mostly is how much faster gfortran has become
lately. Actually it is comparably fast and even faster than two
commercial compilers I've tested against version 4.6.3 20120127
(prerelease). Great work!

/Sakis

Ondřej Čertík

unread,
Feb 24, 2012, 8:28:07 PM2/24/12
to gnu-f...@googlegroups.com
On Fri, Feb 24, 2012 at 5:08 PM, samig <sam...@gmail.com> wrote:
>
[...]

> What surprises me though mostly is how much faster gfortran has become
> lately. Actually it is comparably fast and even faster than two
> commercial compilers I've tested against version 4.6.3 20120127
> (prerelease). Great work!


Indeed I have the same experience. Many times (but not always) gfortran is
already as fast as other commercial compilers.

Ondrej

Reply all
Reply to author
Forward
0 new messages