Optimizer misunderstood?


alexpatgoogle

Jun 27, 2009, 2:14:41 AM
to gg95
Hi all,

the following example

real(8) :: pic(1000,1000)
real(8) :: picres(1000,1000)
real(8) :: f(9,9)
real(8) :: m(9,9)


the code:

m(:,:)=pic(900:909,900,909)*f(:,:)
picres(905,905)=sum(m)

takes twice the runtime as the one-liner

picres(905,905)=sum(pic(900:909,900,909)*f(:,:))


I understand that in the first case the data is, in principle, copied
in memory (to m), but I always thought that the optimizer would
recognize that m is never used afterwards and would 'inline' this?



Background:

the code is part of a simple 2D convolution example looping over the
pixels (dp is initialized to int(filterwidth/2)):

!---------------------------------------------------------------

do jp=1,py

do ip=1,px

!ip+dp= shifted pixel; +/- dp
!m=pic(ip:ip+2*dp,jp:jp+2*dp)*f(:,:)
m=pic(ip:ip+2*dp,jp:jp+2*dp)
m=m*f(:,:)
picres(ip,jp)=sum(m)

!faster: picres(ip,jp)=sum(pic(ip:ip+2*dp,jp:jp+2*dp)*f(:,:))


end do !ip, px

end do !jp, py
!-----------------------------------------------------------------
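
For comparison, here is a minimal sketch of the same convolution written with explicit inner loops and a scalar accumulator, so that no array temporary is needed at all. The sizes px and py, the value dp=4, the padded extent of pic and the kind name wp are assumptions made for illustration; none of them come from the original code.

```fortran
program conv_sketch
  implicit none
  ! Assumed values for illustration only (not from the original post)
  integer, parameter :: wp = selected_real_kind(15, 307)
  integer, parameter :: dp = 4              ! half filter width -> 9x9 filter
  integer, parameter :: px = 100, py = 100  ! size of the result picture
  real(wp), allocatable :: pic(:,:), picres(:,:)
  real(wp) :: f(2*dp+1, 2*dp+1)
  real(wp) :: acc
  integer :: ip, jp, i, j

  ! pic is padded by 2*dp so that pic(ip:ip+2*dp, jp:jp+2*dp) stays in bounds
  allocate(pic(px+2*dp, py+2*dp), picres(px, py))
  pic = 1.0_wp
  f = 2.0_wp

  do jp = 1, py
     do ip = 1, px
        ! accumulate into a scalar instead of filling a temporary m(:,:)
        acc = 0.0_wp
        do j = 1, 2*dp+1
           do i = 1, 2*dp+1        ! innermost loop runs over the first index
              acc = acc + pic(ip+i-1, jp+j-1) * f(i, j)
           end do
        end do
        picres(ip, jp) = acc
     end do
  end do

  ! print one pixel next to the array-expression form for comparison
  print *, picres(1, 1), sum(pic(1:1+2*dp, 1:1+2*dp) * f)
end program conv_sketch
```

Whether this is faster than the one-liner depends on the compiler and would have to be measured; the sketch only shows the loop structure that corresponds to the C version.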


the code is part of a mex file to get performance in some matlab
routines (on windows) (yes, this is lame, but I'm forced to use
matlab, would prefer plain f95).
I could provide detailed information about the compiler used etc. if
necessary, but I think this is a more general misunderstanding on my
side?


The C version, which uses two more nested loops for the matrix-matrix
multiplication (and so should be slower than Fortran) together with
pointer arithmetic (fast), is overall faster than the first case:


!---------------------------------------------------------------

for(ip=0; ip<px;ip++)
{
//=== predetermine rowshift for original (larger) picture and respic
di=ip*px;
dipic=ip*(px+dp);

for(jp=0; jp<py;jp++)
{

//testcode *(picres+di+jp)=*(pic+dipic+jp);

//=== convolution for this pixel
i=0;
for(isub=(ip-dp_2);isub<=(ip+dp_2);isub++)
{

//predetermine rowshift for convolution
disub=isub*px;
disubpic=isub*(px+dp);

j=0;
for(jsub=(jp-dp_2);jsub<=(jp+dp_2);jsub++)
{

*(picres+di+jp)=*(picres+di+jp)+*(pic+disubpic+jsub)*filter[i*fn+j];

j++;
}//end for jsub

i++;
}//end for isub





}//end for jp
}//end for ip
!---------------------------------------------------------------


This somehow shocked me. My colleagues using matlab need more speed.
I'm coming from fortran number crunching (electronic structures) and
therefore stated that they a) need a compiled language and b) should use
fortran, because its syntax and memory layout are similar to matlab,
and it's faster than plain c/c++.

As stated above, I always followed the philosophy 'write clean and
readable code (as in the first case above) - the optimizer will get
things fast'.

Any hints?







Jimmy

Jun 28, 2009, 11:35:56 AM
to gg...@googlegroups.com
Different compilers have different optimization features. In this case it
might be a feature that Andy has not yet implemented. If you haven't tried
it yet, you may get a considerable speed-up of your test case by using, for
example,

g95 -O2 test.f95 -o test.exe

to give the executable test.exe. Other optimization features are
available.

I would add that real(8) is non-portable Fortran.
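
As a sketch of a more portable alternative, the kind can be requested by precision with selected_real_kind rather than written as a literal 8. The parameter name wp is only an assumption for illustration.

```fortran
program kind_sketch
  implicit none
  ! Ask for at least 15 decimal digits and a decimal exponent range of 307;
  ! wp is a hypothetical name, chosen here for illustration
  integer, parameter :: wp = selected_real_kind(15, 307)
  real(wp) :: x

  x = 1.0_wp / 3.0_wp
  ! show which kind number this compiler picked and what it provides
  print *, 'kind =', wp, ' precision =', precision(x), ' range =', range(x)
end program kind_sketch
```

The kind number printed may differ between compilers, which is exactly why the literal 8 is not portable.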

Jimmy.



Adelson Santos de Oliveira

Jun 28, 2009, 4:57:31 PM
to gg...@googlegroups.com
There's a dimensional problem in the original code. A segmentation fault
occurred, which might be the cause of the performance decrease.

Note that,

shape(pic(900:909,900,909)) => [10,10],

and,

shape(m) => [9,9]
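
A small sketch that prints the shapes in question, assuming the intended section was pic(900:909,900:909) (with a colon in the second subscript) and using a hypothetical kind name wp:

```fortran
program shape_sketch
  implicit none
  ! wp is an assumed name; the original declarations use real(8)
  integer, parameter :: wp = selected_real_kind(15, 307)
  real(wp) :: pic(1000, 1000)
  real(wp) :: m(9, 9)

  ! 900:909 covers ten elements, so this section has shape [10,10]
  print *, shape(pic(900:909, 900:909))
  ! 901:909 covers nine elements and therefore conforms with m
  print *, shape(pic(901:909, 901:909))
  print *, shape(m)
end program shape_sketch
```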




alexpatgoogle

Jul 2, 2009, 2:16:09 AM
to gg95
Hello,


>g95 -O2 test.f95 -o test.exe

I tested several optimization options, e.g.:
-ftree-vectorize -funroll-loops -malign-double -msse2 -fomit-frame-pointer
-march=nocona

but did not find an option that speeds up that particular part
(the one-liner is always far faster than the multi-line code).



>
> shape(pic(900:909,900,909)) => [10,10],
>
> and,
>
> shape(m) => [9,9]
>

Sorry, this was a typo in the message above; it is not present in
the original code.

Jimmy

Jul 2, 2009, 4:38:44 AM
to gg...@googlegroups.com
On a Core2 Duo E7200, 4GB memory, using g95_MinGW and Vista, the program
compiled with g95 -O2 test.f95 -o test.exe

program test
implicit none
integer :: i
real :: t1, t2

real(8) :: pic(1000,1000)
real(8) :: picres(1000,1000)
real(8) :: f(9,9)
real(8) :: m(9,9)

pic(:,:)=1.0
f(:,:)=2.0
!
call cpu_time(t1)
do i=1,10000000
m(:,:)=pic(901:909,901:909)*f(:,:)
picres(905,905)=sum(m)
end do
call cpu_time(t2)
write(*,*)(t2-t1)

call cpu_time(t1)
do i=1,10000000
picres(905,905)=sum(pic(901:909,901:909)*f(:,:))
end do
call cpu_time(t2)
write(*,*)(t2-t1)
stop
end program test

gives times of 9.84/5.56 secs without the -O2 switch and 2.30/1.32 secs with
the -O2 switch. Both forms are sped up by a factor of 4+, although the
one-liner is still faster. As you didn't get this speed-up, some other
factors appear to be involved; e.g., as the arrays are large, it might be due
to cache hits/misses.

Jimmy.

