fl32 vs df in cvf6.6

formulae translator

May 15, 2008, 10:47:36 AM

Today I tried a very simple speed test using the fl32 and df commands
(Compaq Visual Fortran 6.6):

program speed
  integer :: i, j, k, n
  real :: a, b, c
  call cpu_time(a)
  do i=1,1000
    do j=1,100
      do k=1,100
        c=(sin(real(i))**2.0+cos(real(j))**2.0)+log(real(i))*log(real(j)) &
          +real(k)
        ! write(*,*) c
      end do
    end do
  end do
  call cpu_time(b)
  b=b-a
  write(*,*) b,c
end program

The output of the df-compiled speed.exe is 0.000000e+00 133.2387
while the output of the fl32-compiled speed.exe is 4.203125 133.238703

I am just curious: why does the df-compiled exe always give zero CPU time?

Michael Metcalf

May 15, 2008, 11:13:18 AM

"formulae translator" <liwei1...@gmail.com> wrote in message
news:a64977fe-6588-41d4...@x35g2000hsb.googlegroups.com...

> Today I tried a very simple speed test using fl32 and df command
> (compaq visual fortran 6.6)
>
>

> the output of df compiled speed.exe is 0.000000e+00 133.2387
> while the output of fl32 compiled speed.exe is 4.203125 133.238703
>
> I am just curious why df compiled exe always gives zero cpu time?

Probably because the value of 'c' is never used and so doesn't need to be
calculated. Try printing its value after the loop. You'd be surprised how
smart optimizers can be!

Regards,

Mike Metcalf

Jugoslav Dujic

May 15, 2008, 11:19:20 AM

But he did print the value in the posted code :-).

...and I can't reproduce the problem -- I get 5.71 seconds in both
non-optimized and optimized version on my CVF 6.6C @ 3GHZ Intel P4.

--
Jugoslav
___________
www.xeffort.com

Please reply to the newsgroup.
You can find my real e-mail on my home page above.

GaryScott

May 15, 2008, 11:25:18 AM

On May 15, 9:47 am, formulae translator <liwei19742...@gmail.com>
wrote:

With cvf 6.6c, i get:
(cpu_time=4.187500)

4.156250 133.2387

Press any key to continue

Michael Metcalf

May 15, 2008, 11:27:30 AM

"Jugoslav Dujic" <jdu...@yahoo.com> wrote in message
news:6932jqF...@mid.individual.net...

> Michael Metcalf wrote:
>
> But he did print the value in the posted code :-).
>
> ...and I can't reproduce the problem -- I get 5.71 seconds in both
> non-optimized and optimized version on my CVF 6.6C @ 3GHZ Intel P4.
>

Sorry, I overlooked that. Maybe it just calculates the last value, but that
would be incredible!

Regards,

Mike Metcalf

formulae translator

May 15, 2008, 11:39:28 AM

Thanks for your prompt replies.

I forgot to mention that df and fl32 behave similarly if I uncomment
the "write c" statement within the loop.

I guess Mike is right. Maybe only the last "c" was calculated in the
df-compiled version.

I have no idea why Jugoslav cannot reproduce the result. My version
is CVF 6.6b on an Intel P4 D 3.2 GHz.

Wei

formulae translator

May 15, 2008, 11:56:24 AM

program speed
  integer :: i, j, k, n
  real :: a, b, c(10,100,100)
  call cpu_time(a)
  do i=1,10
    do j=1,100
      do k=1,100
        c(i,j,k)=(sin(real(i))**2.0+cos(real(j))**2.0)+log(real(i))*log(real(j)) &
          +real(k)
        !write(*,*) c
      end do
    end do
  end do
  call cpu_time(b)
  b=b-a
  write(*,*) b,c(1,1,1), c(10,100,100)
end program

I tried the above with "c" as a 3D array; the df-compiled version still
gives zero CPU time but the right answer.
I cannot believe that compiler optimization can be so smart.

Wei

Steve Lionel

May 15, 2008, 12:04:56 PM

On Thu, 15 May 2008 08:39:28 -0700 (PDT), formulae translator
<liwei1...@gmail.com> wrote:

>I have no idea why Jugoslav cannot reproduce the running. my version
>is cvf 6.6b @ Intel P4 D 3.2G

Perhaps because CVF supports the "fl32" command and just invokes CVF with some
FPS defaults?
--
Steve Lionel
Developer Products Division
Intel Corporation
Nashua, NH

For email address, replace "invalid" with "com"

User communities for Intel Software Development Products
http://softwareforums.intel.com/
Intel Fortran Support
http://support.intel.com/support/performancetools/fortran
My Fortran blog
http://www.intel.com/software/drfortran

glen herrmannsfeldt

May 15, 2008, 12:43:20 PM

formulae translator wrote:
> program speed
> integer :: i, j, k, n
> real :: a, b, c(10,100,100)
> call cpu_time(a)
> do i=1,10
> do j=1,100
> do k=1,100

Note: the DO loops nest in a different order than the array
elements are stored.

> c(i,j,k)=(sin(real(i))**2.0+cos(real(j))**2.0)+log(real(i))*log(real(j))
> +real(k)

You don't want to write out the whole array each time through
the loop!

> !write(*,*) c
> end do
> end do
> end do
> call cpu_time(b)
> b=b-a
> write(*,*) b,c(1,1,1), c(10,100,100)
> end program

> I tried above with "c" as a 3D array, the df compiled version still
> gives zero cpu time but right answer.
> I can not believe that the compilor optimization can be so smart.

How long does it take to compile?

There is a story from many years ago (about 40) of a Fortran
benchmark containing many statement functions doing very
complex calculations that was tested on different systems.
When tested with the OS/360 Fortran H compiler it compiled very
slowly, but ran in no time at all. It seems that Fortran H,
unusual for the time, expanded statement functions inline.
After that it did constant expression evaluation such that only
one constant was left.

It is possible in your case to evaluate the whole array
expression at compile time. I would be surprised to see
a compiler actually do it, but only a little surprised.

It might just be that your computer is fast enough that
the time rounds to zero.

-- glen

Richard Maine

May 15, 2008, 12:37:54 PM

formulae translator <liwei1...@gmail.com> wrote:

> I can not believe that the compilor optimization can be so smart.

Optimizers are known for being incredibly smart and dumb at the same
time. They sometimes fail to see the most "obvious" (to humans) things,
yet at the same time manage optimizations that you'd never expect.

This is part of why simple speed tests like the one you tried are so
notoriously poor. With amazing regularity, compilers optimize such
simple speed tests in ways that turn out to give completely misleading
results in that they don't tell you anything at all useful about what
you presumably really care about - the performance of actual
applications. This is why one of the main pieces of advice about speed
testing is to use actual applications instead of such artificial
benchmarks if at all practical.

Of course, it is harder to use an actual application for many reasons.
Sometimes you have to make do with something simple. You should just be
aware of the possibility of bogus results. Sometimes, as in the case
here, the results are so obviously strange that you can tell they aren't
what you wanted. Other times, the results can be just far enough off to
give you the wrong conclusion without standing out as obviously wrong.

Benchmarking is not at all a simple subject. Its apparent simplicity has
fooled many people for many years. This is all pretty much motherhood
and apple pie stuff, I might add - nothing that wasn't fundamentally
known 3 or 4 decades ago.
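
As a minimal sketch of a timing test that is harder for an optimizer to fold into a constant, the variant below reads the outer trip count at run time and accumulates every computed value into a sum that is printed. The program name speed_sum and the optional command-line argument are assumptions made for illustration, and command_argument_count / get_command_argument are Fortran 2003 intrinsics, so this assumes a newer compiler than CVF 6.6. A compiler may still hoist loop-invariant terms, so the caveats about artificial benchmarks apply here too.

```fortran
program speed_sum
  implicit none
  integer :: i, j, k, n
  real :: t0, t1
  double precision :: total
  character(len=32) :: arg

  ! Hypothetical convention: an optional first argument sets the outer
  ! trip count, so it is not known at compile time.
  n = 1000
  if (command_argument_count() >= 1) then
    call get_command_argument(1, arg)
    read(arg, *) n
  end if

  total = 0.0d0
  call cpu_time(t0)
  do i = 1, n
    do j = 1, 100
      do k = 1, 100
        ! Every iteration contributes to the printed result.
        total = total + (sin(real(i))**2 + cos(real(j))**2) &
                      + log(real(i))*log(real(j)) + real(k)
      end do
    end do
  end do
  call cpu_time(t1)

  write(*,*) 'cpu seconds:', t1 - t0, ' sum:', total
end program speed_sum
```

Because the sum depends on all iterations and on a run-time value, the reported time is more likely to reflect the loop body, though it still says little about a real application.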

--
Richard Maine | Good judgement comes from experience;
email: last name at domain . net | experience comes from bad judgement.
domain: summertriangle | -- Mark Twain

e p chandler

May 15, 2008, 1:26:21 PM

On May 15, 12:37 pm, nos...@see.signature (Richard Maine) wrote:

> formulae translator <liwei19742...@gmail.com> wrote:
> > I can not believe that the compilor optimization can be so smart.
>
> Optimizers are known for being incredibly smart and dumb at the same
> time. They sometimes fail to see the most "obvious" (to humans) things,
> yet at the same time manage optimizations that you'd never expect.

I've heard the story told several different ways - perhaps it's an
urban legend - that someone applied a profiler to an ancient operating
system - and optimized away the system's idle loop. :-). [Sorry I
don't remember if this was OS/360 or TOPS-10 or Multics or VMS
or ????]

- e

glen herrmannsfeldt

May 15, 2008, 2:11:41 PM

e p chandler wrote:
(snip)

> I've heard the story told several different ways - perhaps it's an
> urban legend - that someone applied a profiler to an ancient operating
> system - and optimized away the system's idle loop. :-). [Sorry I
> don't remember if this was OS/360 or TOPS-10 or Multics or VMS
> or ????]

There are plenty of good optimizer stories, but OS/360
doesn't have an idle loop. The processor enters a wait state
(and stops executing instructions) when there isn't anything
else to do. For S/360 hardware there was a light on the
console that would light up when in a wait state. The
average brightness would approximate the fraction of time
the machine was idle.

That is especially convenient when running in a virtual
machine, but as I understand it the reason traces
back to when leased machines were charged by the amount of
CPU time actually used.

An actual example from the Fortran H Extended Programmers
Guide (which came from earlier programmers guides)

DO 11 I=1,10
DO 12 J=1,10
9 IF (B(I).LT.0) GO TO 11
12 C(J)=SQRT(B(I))
11 CONTINUE

The compiler is smart enough to move the SQRT out
of the inner loop, but the IF statement stays inside
the loop.

-- glen

James Parsly

May 15, 2008, 2:57:16 PM

"formulae translator" <liwei1...@gmail.com> wrote in message
news:a64977fe-6588-41d4...@x35g2000hsb.googlegroups.com...

I tried running it from the IDE in debug and release configurations. The
release configuration gave me the 0 cpu time result. I've pasted the
assembler language listing at the bottom.
Now, my assembler is pretty rusty, but this line is pretty revealing:

REAL4 043053D1Cr ; REAL4 133.2387

It looks like the compiler has figured out that all of the looping wasn't
necessary, and has gone so far as to precompute the value that needed to
be printed at the end. That's pretty sharp!

.486
.MODEL flat
COMM .bss$:16
.data
.data$ BYTE "test"

BYTE 0
REPEAT 3
BYTE 0
ENDM
.data
.literal$ BYTE 1Ah
BYTE 01h
BYTE 01h
BYTE 00h

BYTE 0
REPEAT 3
BYTE 0
ENDM
BYTE 1Ah
BYTE 01h
BYTE 02h
BYTE 00h

BYTE 0
REPEAT 3
BYTE 0
ENDM
REAL4 043053D1Cr ; REAL4 133.2387
REAL4 03F53AE61r ; REAL4 0.8268796
SDWORD 00h ; SDWORD 0
BYTE 0
REPEAT 3
BYTE 0
ENDM
.data
.drectve$ BYTE "-defaultlib:dfor.lib "

BYTE "-defaultlib:libc.lib "

BYTE "-defaultlib:dfconsol.lib "

BYTE "-defaultlib:dfport.lib "

BYTE "-defaultlib:kernel32.lib "

BYTE 0
REPEAT 4
BYTE 0
ENDM
EXTERN _for_check_flawed_pentium@0:PROC
EXTERN _for_set_reentrancy@4:PROC
EXTERN _for_cpusec@4:PROC
EXTERN _for_cpusec@4:PROC
EXTERN _for_write_seq_lis:PROC
EXTERN _for_write_seq_lis_xmit:PROC
.CODE
; 1 program speed
PUBLIC _MAIN__
_MAIN__ PROC
push ebp
mov ebp, esp
sub esp, 40
call _for_check_flawed_pentium@0
sub esp, 4
lea eax, dword ptr .literal$+24
push eax
call _for_set_reentrancy@4
; 2 integer :: i, j, k, n
; 3 real :: a, b, c
; 4 call cpu_time(a)
lea eax, dword ptr .bss$+8 ; 000004
push eax
call _for_cpusec@4
; 5 do i=1,1000
; 6 do j=1,100
; 7 do k=1,100
; 8 c=(sin(real(i))**2.0+cos(real(j))**2.0)+log(real(i))*log(real(j))+real(k)
fld dword ptr .literal$+20 ; 000008
; 9 ! write(*,*) c
; 10 end do
; 11 end do
; 12 end do
; 13 call cpu_time(b)
lea eax, dword ptr .bss$+4 ; 000013
fstp st(1) ; 000008
ffree st(0) ; 000013
push eax
call _for_cpusec@4
; 14 b=b-a
fld dword ptr .bss$+4 ; 000014
; 15 write(*,*) b,c
xor edx, edx ; 000015
lea eax, dword ptr -40[ebp]
fsub dword ptr .bss$+8 ; 000014
mov dword ptr -32[ebp], edx ; 000015
lea edx, dword ptr .literal$+8
fstp dword ptr .bss$+4 ; 000014
mov ecx, dword ptr .bss$+4 ; 000015
mov dword ptr -40[ebp], ecx
push eax
lea ecx, dword ptr -32[ebp]
push edx
push 59047680
push -1
push ecx
call _for_write_seq_lis
add esp, 24
mov eax, dword ptr .literal$+16
mov dword ptr -40[ebp], eax
sub esp, 4
lea edx, dword ptr -40[ebp]
lea ecx, dword ptr .literal$
lea eax, dword ptr -32[ebp]
push edx
push ecx
push eax
call _for_write_seq_lis_xmit
add esp, 16
; 16 end program
mov esp, ebp ; 000016
mov eax, 1
pop ebp
ret
_MAIN__ ENDP
END

alex

May 15, 2008, 11:12:42 PM

It is a very interesting problem. I did two other tests.

TEST ONE:

program speed
integer :: i, j, k, n
real :: a, b

real :: c(100,100,1000) = 0.0

call cpu_time(a)

do i=1,1000
do j=1,100
do k=1,100

c(k,j,i) = &
(sin(real(i))**2.0+cos(real(j))**2.0)+log(real(i))*log(real(j)) &
+real(k)
! write(*,*) c
end do
end do
end do

call cpu_time(b)

b=b-a

write(*,*) b,c(1,1,1),c(100,100,1000)

pause
end program

TEST TWO:

program speed
integer :: i, j, k, n
real :: a, b

real :: c(100,100,1000)

call cpu_time(a)

do i=1,1000
do j=1,100
do k=1,100

c(k,j,i) = &
(sin(real(i))**2.0+cos(real(j))**2.0)+log(real(i))*log(real(j)) &
+real(k)
! write(*,*) c
end do
end do
end do

call cpu_time(b)

b=b-a

write(*,*) b,c(1,1,1),c(100,100,1000)

pause
end program

The only difference between those two tests is in the declaration of 'c'.
In TEST ONE, I initialized it with 'real :: c(100,100,1000) = 0.0', while
in TEST TWO I did nothing.

I compiled both programs from the DOS command line:

df speed.f90 /check:all /fpe:0 /traceback /warn:argument_checking /automatic /link /STACK:800000000

The results are:

TEST ONE: 0.1875000 2.000000 133.2387
TEST TWO: 9.3750000E-02 2.000000 133.2387

But another issue arose. I found that the sizes of the executable files
produced from those two programs were very different. In the first
case the size is 38.5MB and compilation took longer, while in the
second case the size is 395KB.

I don't know why.

glen herrmannsfeldt

May 15, 2008, 11:37:18 PM

alex wrote:
(snip)

> The difference is only in the statement field between those two tests.
> In TEST ONE, I initialized 'c' with 'c(100,100,1000) = 0.0'; while in
> TEST TWO i did nothing.

> i compiled those two code with DOC command-line:

> df speed.f90 /check:all /fpe:0 /traceback /warn:argument_checking /
> automatic /link /STACK:800000000

> The results are:

> TEST ONE: 0.1875000 2.000000 133.2387
> TEST TWO: 9.3750000E-02 2.000000 133.2387

> But another issue is arose. I found the size of executable files,
> which producted by those two codes, were very different. In the first
> case, the size is 38.5MB and more compilation time was spended, while
> the second case with the size of 395KB.

> I don't know why?

It puts 10 million zeros (4 bytes each) in the executable file.

40000000/1024/1024 is 38.14
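
The arithmetic can be checked with a small sketch like the one below. It assumes a 4-byte default real and that the initialized array is stored element by element in the executable image; the program name init_size and the constant names are made up for illustration.

```fortran
program init_size
  implicit none
  ! Array shape from TEST ONE: c(100,100,1000)
  integer, parameter :: n_elems = 100*100*1000
  ! Assumption: the default real occupies 4 bytes
  integer, parameter :: bytes_per_real = 4
  real :: mib

  ! Total initialized data, converted from bytes to units of 1024*1024 bytes
  mib = real(n_elems)*real(bytes_per_real)/1024.0/1024.0

  write(*,'(a,i0)') 'elements: ', n_elems
  write(*,'(a,i0)') 'bytes:    ', n_elems*bytes_per_real
  write(*,'(a,f0.2)') 'MB:       ', mib
end program init_size
```

The result is close to the 38.5MB reported for the TEST ONE executable, with the remainder plausibly being the code and runtime library that also make up the 395KB TEST TWO executable.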

-- glen

Ron Ford

May 15, 2008, 11:34:33 PM

Is there something about this that would convince a compiler that it were
raising a negative number to a non-integer power? That's the run-time I
get in both cases.

With loops this size, the debugger isn't much use.
--

Terence

May 16, 2008, 12:15:17 AM

On May 16, 3:26 am, e p chandler <e...@juno.com> wrote:
> I heard the story told several different ways - perhaps it's an

> urban legend - that someone applied a profiler to an ancient operating
> system - and optimized away the system's idle loop. :-). [Sorry I
> don't remember if this was OS/360 or TOPS-10 or Multics or VMS
> or ????]

Funny you should mention that!
I worked at IBM UK in Hursley labs where the 360 40 was designed and
built.
This machine falls between the 30 and 50 machines of the USA.

We did timing of running speeds and unfortunately the UK 40 and the USA
50 were pretty much the same. So we put delay loops in the circuitry
of the 40!

Another IBM story. There was an early printer that was geared by two
cog-wheels on identical-sized axles. There were two models, "fast" and
"slow" at standard work-capability rates.
Smart "for-resale" buyers bought the "slow" and interchanged the two
wheels and supplied "fast" printers at competitive prices...

glen herrmannsfeldt

May 16, 2008, 2:47:55 AM

Ron Ford wrote:
(snip)

>>(sin(real(i))**2.0+cos(real(j))**2.0)+log(real(i))*log(real(j))
(snip)

> Is there something about this that would convince a compiler that it were
> raising a negative number to a non-integer power? That's the run-time I
> get in both cases.

2.0 is a non-integer power. 2 is an integer.

-- glen

Richard Maine

May 16, 2008, 2:56:32 AM

glen herrmannsfeldt <g...@ugcs.caltech.edu> wrote:

And yes, it is illegal to raise a negative number to the 2.0 power. It
is legal to raise it to the 2 power. Don't use things like 2.0 as an
exponent; it is almost always a bad idea. Besides the issue of being
invalid for negative numbers, it is also slower and less accurate even
in the cases where it is valid. Some compilers are smart enough to
recognize your error and fix it for you. Far better to just do it right
in the first place.
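
A minimal sketch of the point, using an arbitrary negative value chosen for illustration (the program name int_power is hypothetical):

```fortran
program int_power
  implicit none
  real :: x

  x = -0.5
  ! Integer exponent: valid for a negative base.
  write(*,*) x**2
  ! Explicit multiplication gives the same square.
  write(*,*) x*x
  ! x**2.0 is deliberately not evaluated here: a negative real base
  ! with a real exponent is the invalid case described above.
end program int_power
```

Both statements square the value without going through a real-exponent power routine.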

Kurt Kallblad

May 17, 2008, 10:20:04 AM

"formulae translator" <liwei1...@gmail.com> wrote in message
news:a64977fe-6588-41d4...@x35g2000hsb.googlegroups.com...

Hi,

The difference depends on optimization.

With a Pentium 4, 3.40 GHz and CVF 6.6C I got 0 sec for optimized code
and 4.218750 sec for "debug" code.

I looked at the assembly code. The debug code had the loops, cos, sin
and log as external procedures and no constant value for c.

The optimized code had 133.2387 as a constant, no loops and no calls
of cos, sin or log functions. So the value was calculated at compile
time.

Kurt

glen herrmannsfeldt

May 17, 2008, 1:31:35 PM

Kurt Kallblad wrote:

> "formulae translator" <liwei1...@gmail.com> wrote in message
> news:a64977fe-6588-41d4...@x35g2000hsb.googlegroups.com...

>> Today I tried a very simple speed test using fl32 and df command
>> (compaq visual fortran 6.6)

>> program speed
>> integer :: i, j, k, n
>> real :: a, b, c
>> call cpu_time(a)
>> do i=1,1000
>> do j=1,100
>> do k=1,100
>> c=(sin(real(i))**2.0+cos(real(j))**2.0)+log(real(i))*log(real(j))
>> +real(k)
>> ! write(*,*) c
>> end do
>> end do
>> end do
>> call cpu_time(b)
>> b=b-a
>> write(*,*) b,c
>> end program

Previously you were storing into an array c, such that
the compiler might believe all the elements needed to
be computed. In either case, the compiler with optimization
turned on should figure out to move the i dependent terms
out of the j loop, and the j dependent terms out of the k loop.
That will make it much faster than it looks.

Soon after that, in the case of scalar c, the compiler will
figure out that only the last value of c is needed and
remove all the loops. For array c, it is somewhat harder to
figure out that only one or two elements are needed.
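
Written out by hand for the scalar case, the hoisting described above might look like the sketch below. The temporaries ti, li and tj and the program name hoisted are made up for illustration, and integer exponents are used instead of 2.0. This shows the idea only; it is not what any particular compiler emits.

```fortran
program hoisted
  implicit none
  integer :: i, j, k
  real :: c, ti, li, tj

  c = 0.0
  do i = 1, 1000
    ! Terms that depend only on i are computed once per i.
    ti = sin(real(i))**2
    li = log(real(i))
    do j = 1, 100
      ! Terms that depend only on i and j are computed once per (i,j).
      tj = ti + cos(real(j))**2 + li*log(real(j))
      do k = 1, 100
        ! Only the k-dependent addition remains in the inner loop.
        c = tj + real(k)
      end do
    end do
  end do

  write(*,*) c
end program hoisted
```

Since c is overwritten on every pass, only the final iteration (i=1000, j=100, k=100) determines the printed value, which is the observation that lets an optimizer drop the loops altogether.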

>> the output of df compiled speed.exe is 0.000000e+00 133.2387
>> while the output of fl32 compiled speed.exe is 4.203125 133.238703

>> I am just curious why df compiled exe always gives zero cpu time?

> The differences depends on optimizations.

> With Pentium 4, 3.40Ghz and CVF 6.6C I got 0 sec for optimized code
> and 4.218750 sec for "debug" code.

> I looked at the assembly code. The debug code had the loops, cos, sin
> and log as externel procdeures and no constant value for c.

> The optimized code had 133.2387 as a constant, no loops and no calls
> of cos, sin or log functions. So the value was calculated at compile
> time.

For scalar c, many optimizing compilers should figure that out.

With the more recent standards, compilers have to be able
to do transcendental functions at compile time for constant
expressions. It is a little less obvious in this case, but
once the whole thing has been moved out of the loop, and the
loops removed it isn't so hard.

-- glen

Gary Scott

May 17, 2008, 1:29:52 PM

glen herrmannsfeldt wrote:

I think that if code is so poorly written that the compiler can deduce
that dozens of lines of code can be removed and replaced with a
calculated constant, it should issue some sort of warning or
informational message such as "you dummy, what were you thinking".

>
> -- glen
>

--

Gary Scott
mailto:garylscott@sbcglobal dot net

Fortran Library: http://www.fortranlib.com

Support the Original G95 Project: http://www.g95.org
-OR-
Support the GNU GFortran Project: http://gcc.gnu.org/fortran/index.html

If you want to do the impossible, don't hire an expert because he knows
it can't be done.

-- Henry Ford

mecej4

May 17, 2008, 5:55:42 PM

How about

sin(x + 2*Pi) - cos(Pi/2 - x)

where x varies in the loop and Pi is suitably declared and initialized
to acos(-1d0) ? How much mathematical knowledge must the Fortran
compiler have? Perhaps, it should say "I smart" instead of "you dummy"?

-- mecej4

>>
>> -- glen
>>
>
>

Gary Scott

May 17, 2008, 7:23:19 PM

mecej4 wrote:

there should have been a smiley face in my prior post.

>
> -- mecej4

Ron Ford

May 18, 2008, 6:10:32 AM

On Thu, 15 May 2008 23:56:32 -0700, Richard Maine wrote:

> glen herrmannsfeldt <g...@ugcs.caltech.edu> wrote:
>
>> Ron Ford wrote:
>> (snip)
>>
>>>>(sin(real(i))**2.0+cos(real(j))**2.0)+log(real(i))*log(real(j))
>> (snip)
>>
>>> Is there something about this that would convince a compiler that it were
>>> raising a negative number to a non-integer power? That's the run-time I
>>> get in both cases.
>>
>> 2.0 is a non-integer power. 2 is an integer.
>
> And yes, it is illegal to raise a negative number to the 2.0 power. It
> is legal to raise it to the 2 power. Don't use things like 2.0 as an
> exponent; it is almost always a bad idea. Besides the issue of being
> invalid for negative numbers, it is also slower and less accurate even
> in the cases where it is valid. Some compilers are smart enough to
> recognize your error and fix it for you. Far better to just do it right
> in the first place.

I see. Half of sin(real(i))**2 or cos(real(j))**2 are going to have a
negative base with this loop.

The offending line is then unfortunate for multiple reasons.
c(k,j,i) = (sin(real(i))**2 + cos(real(j))**2)&
+log(real(i))*log(real(j)) + real(k)

Some whitespace, better grouping, lack of illegality and a continuation
that accounts for usenet posting does better. I'm happy that plato3 spit
it out as a runtime.
--