It is clear to me that in the first case the data is, in principle,
copied in memory (to m), but I always thought that the optimizer would
recognize that m is never used afterwards and would 'inline' this?
Background:
The code is part of a simple 2D convolution example looping over the
pixels (dp is initialized to int(filterwidth/2)):
end do !jp, py
!-----------------------------------------------------------------
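For readers without the full listing, a pixel loop of this kind might look roughly like the sketch below. It is only an illustration under stated assumptions: the sizes nx, ny and fw are hypothetical, the names pic, picres, f, m, ip, jp and dp follow the ones used in this thread, the filter is applied without flipping, and border pixels are simply skipped.

```fortran
program conv_sketch
  implicit none
  integer, parameter :: wp = kind(1.0d0)           ! double precision kind
  integer, parameter :: nx = 64, ny = 64, fw = 9   ! hypothetical image and filter sizes
  integer, parameter :: dp = fw/2                  ! half filter width (integer division)
  real(wp) :: pic(nx,ny), picres(nx,ny), f(fw,fw), m(fw,fw)
  integer :: ip, jp

  pic = 1.0_wp
  f = 2.0_wp
  picres = 0.0_wp

  ! Column index jp in the outer loop, because Fortran arrays are column-major
  do jp = 1 + dp, ny - dp
     do ip = 1 + dp, nx - dp
        ! "first case": element-wise product stored in the temporary m, then summed
        m = pic(ip-dp:ip+dp, jp-dp:jp+dp) * f
        picres(ip,jp) = sum(m)
     end do
  end do

  write(*,*) picres(nx/2, ny/2)
end program conv_sketch
```

The temporary m is written and read once per pixel, which is the copy the question is about.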
The code is part of a MEX file used to speed up some MATLAB
routines (on Windows) (yes, this is lame, but I'm forced to use
MATLAB; I would prefer plain f95).
I could provide detailed information about the compiler used etc. if
necessary, but I think this is a more general misunderstanding on my
side?
The C version, which uses two more nested loops for the matrix-matrix
multiplication (and so should be slower than Fortran) plus pointer
arithmetic (fast), is in total faster than the first case:
}//end for jp
}//end for ip
!---------------------------------------------------------------
This somewhat shocked me. My colleagues using MATLAB need more speed.
I come from Fortran number crunching (electronic structures) and
therefore stated that they a) need a compiled language and b) should use
Fortran, because its syntax and memory layout are similar to MATLAB's,
and it's faster than plain C/C++.
As stated above, I always followed the philosophy 'write clean and
readable code (as in the first case above) - the optimizer will make
things fast'.
Different compilers have different optimization features. In this case it might be a feature that Andy has not yet implemented. If you haven't tried it yet, you may get a considerable speed-up of your test case by using, for example,
g95 -O2 test.f95 -o test.exe
to give the executable test.exe. Other optimization features are available.
----- Original Message -----
From: "alexpatgoogle" <alexander.pod...@gmx.net>
To: "gg95" <gg95@googlegroups.com>
Sent: Saturday, June 27, 2009 7:14 AM
Subject: Optimizer misunderstood?
> It is clear to me, that in the first case, data in principle gets
> copied in memory (to m), but I always thought that the optimizer would
> recognize that m is afterwards never used and would 'inline' this?
> I would add that real(8) is non-portable Fortran.
> Jimmy.
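To illustrate the portability remark: kind numbers are processor-dependent, so real(8) means an 8-byte real on many compilers but the standard does not guarantee it. A minimal sketch of the usual alternative, requesting a kind by precision and range, is given below; the parameter name wp and the program name are just illustrative choices.

```fortran
program kinds_demo
  implicit none
  ! Ask for at least 15 decimal digits and a decimal exponent range of 307
  integer, parameter :: wp = selected_real_kind(15, 307)
  real(wp) :: f(9,9)

  f = 2.0_wp
  ! Report which kind number this compiler chose and the precision it provides
  write(*,*) 'kind =', wp, ' precision =', precision(f), ' sum =', sum(f)
end program kinds_demo
```

The declarations in the test program could then read real(wp) :: pic(1000,1000) and so on, with literals written as 1.0_wp.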
On a Core2 Duo E7200, 4GB memory, using g95_MinGW and Vista, the program compiled with g95 -O2 test.f95 -o test.exe
program test
  real(8) :: pic(1000,1000)
  real(8) :: picres(1000,1000)
  real(8) :: f(9,9)
  real(8) :: m(9,9)
  pic(:,:)=1.0
  f(:,:)=2.0
!
  call cpu_time(t1)
  do i=1,10000000
    m(:,:)=pic(901:909,901:909)*f(:,:)
    picres(905,905)=sum(m)
  end do
  call cpu_time(t2)
  write(*,*)(t2-t1)

  call cpu_time(t1)
  do i=1,10000000
    picres(905,905)=sum(pic(901:909,901:909)*f(:,:))
  end do
  call cpu_time(t2)
  write(*,*)(t2-t1)
  stop
end program test
gives times of 9.84/5.56 s without the -O2 switch and 2.30/1.32 s with the -O2 switch. Both forms are sped up by a factor of 4+, although the one-liner is still faster. As you didn't get this speed-up, some other factors appear to be involved; e.g., as the arrays are large, it might be due to cache hits/misses.
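A third variant that may be worth timing alongside the two above is an explicit loop nest that accumulates the sum in a scalar, so no array temporary is needed at all. This is only a sketch, not something measured in this thread: the program name, the scalar s, and the indices j and k are my own additions, and an aggressive optimizer might hoist the loop-invariant work, so any timing should be read with care.

```fortran
program test_loops
  implicit none
  integer, parameter :: wp = kind(1.0d0)   ! double precision kind
  real(wp) :: pic(1000,1000)
  real(wp) :: picres(1000,1000)
  real(wp) :: f(9,9)
  real(wp) :: s
  real :: t1, t2
  integer :: i, j, k

  pic = 1.0_wp
  picres = 0.0_wp
  f = 2.0_wp

  call cpu_time(t1)
  do i = 1, 10000000
     s = 0.0_wp
     do k = 1, 9              ! column index outermost: Fortran is column-major
        do j = 1, 9
           ! multiply and accumulate directly, without storing into m
           s = s + pic(900+j, 900+k) * f(j, k)
        end do
     end do
     picres(905,905) = s
  end do
  call cpu_time(t2)

  ! Print the result too, so the computation is visibly used
  write(*,*) 'explicit loops: ', t2 - t1, ' s, result = ', picres(905,905)
end program test_loops
```

It can be compiled the same way, e.g. g95 -O2 test_loops.f95 -o test_loops.exe, and compared against the two forms in the program above.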