MATMUL should generate the same performance as any (normal but
reasonable) manual loop implementation. (The order of the generated code
is O(n**3), which can be brought down to approx. O(n**2.7) to O(n**2.3)
using special algorithms, which only pay off if "n" is large to
extremely large. gfortran can use a special matmul external function,
cf. -fexternal-blas and -fblas-matmul-limit=n)
For gfortran, compiling and linking with
-Ofast -flto -march=native -funroll-loops
produces fast programs.
However, I assume that your manual implementation makes assumptions
which the compiler cannot do. In particular, the performance is
extremely reduced if the compiler cannot rule out that the right-hand
side and the left-hand side expression depend on each other.
Additionally, it helps with the performance if the compiler knows that
all involved arrays are contiguous in memory.
For instance:
subroutine example(a, b)
integer, pointer :: a(:), b(:)
a = b
end subroutine
The compiler will generate something like:
subroutine example(a, b)
integer, pointer :: a(:), b(:)
integer :: tmp(size(a)), i
do i = 1, size(a)
tmp(i) = b(i)
end do
do i = 1, size(a)
a(i) = tmp(i)
end do
end subroutine
While you had probably written:
subroutine example(a, b)
integer, pointer :: a(:), b(:)
integer :: i
do i = 1, size(a)
a(i) = b(i)
end do
end subroutine
The latter version is much faster: It saves the time for allocatation
and free the memory and for the extra loop assigning to tmp. However, it
will give wrong results for the following callee:
integer, target :: a(2)
integer, pointer :: ptr1(:), ptr2(:)
a = [1, 2]
ptr1 => a
ptr2 => a(2:1:-1)
call example(ptr1, ptr2)
print *, ptr1
end
Using the version of the compiler or the loop one with temporary, it
will print "2 1", using the loop version without temporary, one gets "2 2".
I assume that in your example, the compiler will also insert a temporary
while your loop doesn't. You can check by compiling with
-Warray-temporaries which shows where the compiler inserts a temporary.
To avoid temporaries:
* Don't use pointers - allocatables or deferred-shape arrays are better
* Avoid "target" - while not as bad as pointers, it can also cause slow down
* Use explicit-size arrays, allocatables or the CONTIGUOUS attribute to
tell the compiler that the array is contiguous.
* * *
However, if you have found an algorithm which is faster than O(n**2.7)
which also works reasonable well for smaller "n", I am really
interested. (The expectation is that matmul can be done in O(n**2); the
best algorithms seem to reach something close to O(n**2.3), but only one
O(n**2.7) algorithm seems to work reasonably well with large but
reasonable array sizes. For smaller ones, the straight forward O(n**3)
algorithms works quite well.)
Tobias
Indeed I have the same experience. Many times (but not always) gfortran is
already as fast as other commercial compilers.
Ondrej