Two questions for the f90 gurus:
(a) When will an array argument be copied during a subroutine call?
(b) Consider the following two options for passing arguments to subroutines:
(i) a defined type consisting of integers, allocatable arrays
(through pointers) and characters.
(ii) an integer array, a real array and a character array containing the
same information as in (i)
Which one is the most efficient way to pass the arguments to subroutines,
as subtypes of (i) or as the arrays of (ii)?
Thanks a lot!
Christos Frouzakis
This is system dependent. The answer ranges from "never" to "always".
>
>(b) Consider the following two options for passing arguments to subroutines:
> (i) a defined type consisting of integers, allocatable arrays
> (through pointers) and characters.
> (ii) an integer array, a real array and a character array containing the
> same information as in (i)
> Which one is the most efficient way to pass the arguments to subroutines,
> as subtypes of (i) or as the arrays of (ii)?
>
This is system dependent, and also dependent on what use is made
in the subroutines. I would say that with (ii), the odds are better
for performance, but no guarantee whatsoever.
Michel
--
Michel OLAGNON email : Michel....@ifremer.fr
http://www.ifremer.fr/metocean/group/michel/michel_olagnon.htm
http://www.fortran-2000.com/
With only anecdotal evidence, I agree (FWIW :o). How you pass an array can be an issue
also, e.g. using a subscript triplet with a stride > 1. I _think_ this pretty much
guarantees an argument copy.
cheers,
paulv
--
Paul van Delst A little learning is a dangerous thing;
CIMSS @ NOAA/NCEP Drink deep, or taste not the Pierian spring;
Ph: (301)763-8000 x7274 There shallow draughts intoxicate the brain,
Fax:(301)763-8545 And drinking largely sobers us again.
paul.v...@noaa.gov Alexander Pope.
> In article <3ACD97BF...@lvv.iet.mavt.ethz.ch>, Christos Frouzakis <cfro...@lvv.iet.mavt.ethz.ch> writes:
> >Hi all,
> >
> >Two questions for the f90 gurus:
> >
> >(a) When will an array argument be copied during a subroutine call?
>
> This is system dependent. The answer ranges from "never" to "always".
Doesn't this result in significant performance degradation?
Is there a way to make sure it will not be copied?
Is this also the case in f77?
Christos.
It is a very good thing that it is system-dependent. Apart from immature
compilers, which may always exist, this gives you the best chance that
the compiler will choose the solution best suited to the machine architecture.
The idea that copies should be avoided at all costs is a very bad one, because
it means that you assume a given architecture and operating system are the
standard, and that you think that you are smarter than the optimizer.
(I agree that some optimizers lend themselves to making you believe it ;-), but
they are not the rule.)
In general, you should give as much information as possible to the compiler,
and avoid, as far as possible, redundancy or restating defaults.
For instance,
Real, Dimension (1025) :: A
...
Call SUB (A(1:16))
is better (in the sense "equal or better") than
Call SUB (A) ! Assuming A is used in such a way it is equivalent
or than
N = 16
... ! numerous lines of code
Call SUB (A(1:N))
or than
Call SUB (A(1:16:1))
or than
Call SUB (A(1)) ! F77 way
0.02 $
> > >(a) When will an array argument be copied during a subroutine call?
> > This is system dependent. The answer ranges from "never" to "always".
> Doesn't this result in significant performance degradation?
When it occurs, yes.
> Is there a way to make sure it will not be copied?
No. There are some situations where it can't be avoided (I think). In others,
it is a quality-of-implementation issue. See a recent thread on a CVF switch
that tells you about copies as they happen.
Jan
This is the essence of programming in a "high-level" language. The
compiler is supposed to "do the right thing" for you.
... [ copying arguments instead of just passing a descriptor ] ...
>Doesn't this result in significant performance degradation?
No. That's why compilers make copies sometimes. There are
cases where performance would be degraded if you had to always
access the original version of the data.
Passing copies is the best choice for distributed parallel
computations - passing a reference would mean that every
use of the data would be a network transaction.
Copying discontiguous to a contiguous temporary space speeds
up subsequent uses of that data as well ( and reduces the complexity
of the code the compiler would have to generate). This is one of the
causes of unnecessary copying that some compilers do. If the
compiler can't determine at compile time whether a given argument
is contiguous or not, it will make a copy.
--
J. Giles
> cases where performance would be degraded if you had to always
> access the original version of the data.
I do development for numerical meteorological and air pollution
modeling. Both of these are computationally intense applications,
and I wind up profiling alternative formulations. A lot.
It is my experience across the trio of TLA vendors on whose
machines I work most frequently that moving computational kernels
from a F77 separately compiled implementation (where the
implementation for these vendors is pass-by-reference) to a
F90 internal procedures approach which _should_ (if anything)
allow the vendors _more_ room for optimization *costs a
substantial performance penalty*.
And when you've a weather forecast that's starting at 11:30 PM
(because that's when the observational data comes in), that takes
4 hours, and is due out by 4 AM, you *can't afford* an "improvement"
that bloats your run-time up to 5.5 hours !!!!
Call it a quality of implementation issue if you will.
But the biggest improvement I would like for F9x compilers is
to be able to control the argument-passing mechanism. Or at
least, diagnostic listings that _document_ which mechanism is
being used, on a case-by-case basis. It takes too damned long
to have to decipher the assembly listing for every case of
potential significance.
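
As a hedged illustration of the kind of case where such a diagnostic helps: the sketch below passes a strided section to an explicit-shape (F77-style) dummy, which typically makes the compiler pass a contiguous temporary (copy-in/copy-out). The program and routine names are made up for illustration. The flags in the comment are gfortran options, a newer compiler than those discussed in this thread; other compilers have their own switches, or none.

```fortran
! Sketch only: names are hypothetical.
! With gfortran one can ask for a report of such temporaries, e.g.
!   gfortran -Warray-temporaries demo.f90     (compile-time warning)
!   gfortran -fcheck=array-temps demo.f90     (run-time warning)
program provoke_temporary
  implicit none
  real    :: x(100)
  integer :: i

  x = [(real(i), i = 1, 100)]

  ! The dummy is explicit shape, so the routine expects contiguous storage.
  ! x(1:100:5) is not contiguous, so a temporary is the usual outcome here.
  call double_all(x(1:100:5), 20)

  print *, x(1), x(3), x(6)
contains
  subroutine double_all(v, n)
    integer, intent(in)    :: n
    real,    intent(inout) :: v(n)
    v = 2.0 * v
  end subroutine double_all
end program provoke_temporary
```

Whether a temporary is actually created remains the compiler's decision; the flags only report what it did.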
Carlie J. Coats, Jr. co...@ncsc.org
MCNC-Environmental Programs phone: (919)248-9241
North Carolina Supercomputing Center fax: (919)248-9245
3021 Cornwallis Road P. O. Box 12889
Research Triangle Park, N. C. 27709-2889 USA
"My opinions are my own, and I've got *lots* of them!"
No. I made no blanket statement. I negated one. I did not
state that it was always faster to make a copy. I said that
it was sometimes *not* faster to avoid the copy. The
statement that's MUCH too strongly worded is that making
copies is a significant performance degradation. The fact that
that statement is disguised as a question notwithstanding.
...
>It is my experience across the trio of TLA vendors on whose
>machines I work most frequently that moving computational kernels
>from a F77 separately compiled implementation (where the
>implementation for these vendors is pass-by-reference) to a
>F90 internal procedures approach which _should_ (if anything)
>allow the vendors _more_ room for optimization *costs a
>substantial performance penalty*.
>
>And when you've a weather forecast that's starting at 11:30 PM
>(because that's when the observational data comes in), that takes
>4 hours, and is due out by 4 AM, you *can't afford* an "improvement"
>that bloats your run-time up to 5.5 hours !!!!
>
>Call it a quality of implementation issue if you will.
I agree entirely that the present state of development is sorely
lacking (at least, often). But, I do call it a quality of implementation
issue. Anyway, what you said still doesn't contradict the observation
that making copies is *sometimes* a more efficient way.
If you require procedures to accept and work with discontiguous
slices in all cases, your code will usually be slower than making
a copy and passing it as a contiguous piece. This will be true
even if the actual argument is contiguous because the procedure
will either have to have two copies of the code and make a run-time
decision of which to use, or it will have to treat every argument
as potentially discontiguous. A better implementation is to have
the compiler generate code for the procedure which expects
arguments to be contiguous and then copy actual arguments
to a contiguous temporary when necessary. It is this last point
that compilers are still weak on: when is it necessary to make the
copy?
Using old F77 style, you weren't even allowed to pass discontiguous
arguments (indeed, there was not even a way of directly expressing
discontiguous arguments). You either had to make a copy explicitly,
or rewrite the procedure to accept lots of extra arguments that describe
the slice it was to work on. Or, maybe you've never had to pass the
diagonal of a matrix, or the 'inner' elements of a mesh, or the upper
right quadrant, etc..
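
A minimal sketch of the "extra arguments" approach for the matrix diagonal, assuming a square matrix stored in the usual column-major order. The routine name scale_strided and its argument names are hypothetical; the pattern is the same one BLAS-style routines use with an increment argument.

```fortran
program diagonal_f77_style
  implicit none
  integer, parameter :: n = 4
  real :: a(n, n)

  a = 1.0

  ! F77 style: pass the first element plus a count and an increment.
  ! Successive diagonal elements of an n-by-n array lie n+1 elements apart.
  call scale_strided(a(1, 1), n, n + 1, 2.0)

  print '(4f6.1)', a
contains
  subroutine scale_strided(x, nelem, inc, factor)
    integer, intent(in)    :: nelem, inc
    real,    intent(in)    :: factor
    real,    intent(inout) :: x(*)      ! assumed size: no shape information
    integer :: k
    do k = 0, nelem - 1
      x(1 + k * inc) = factor * x(1 + k * inc)
    end do
  end subroutine scale_strided
end program diagonal_f77_style
```

No copy is involved, but the routine has to carry the slice description itself and the compiler cannot check that the count and increment stay inside the array.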
By the way, I agree with you that what Fortran really needs is a
more direct way for programmers to control these issues. Ken
Kennedy proposed such a mechanism in 1982 (both in an open
letter to the committee, and I believe he spoke at a meeting).
It was never acted upon.
--
J. Giles
If there is a lot of randomly placed data that needs to be acted on by
a subroutine, and that action needs multiple access, it is best to copy
the data into a linear array, pass the array, work on it *stride of
one*, and re-copy changes back to the source and drop the temp array.
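
The gather / work / scatter pattern just described can be sketched as follows. The index list idx and the routine smooth are made up for illustration. Note that a vector-subscripted section such as a(idx) cannot be passed directly to a dummy argument that gets modified, so here the explicit copy is needed anyway.

```fortran
program gather_scatter_demo
  implicit none
  integer, parameter :: idx(4) = [7, 2, 9, 4]  ! hypothetical scattered positions
  real    :: a(10)
  real    :: tmp(size(idx))
  integer :: i

  a = [(real(i), i = 1, 10)]

  tmp = a(idx)                    ! gather into a contiguous temporary
  call smooth(tmp, size(tmp))     ! the kernel works at stride one
  a(idx) = tmp                    ! scatter the results back

  print *, a
contains
  subroutine smooth(x, n)
    integer, intent(in)    :: n
    real,    intent(inout) :: x(n)
    integer :: k, pass
    ! Several passes over the data: this is what amortizes the two copies.
    do pass = 1, 3
      do k = 1, n
        x(k) = 0.5 * x(k)
      end do
    end do
  end subroutine smooth
end program gather_scatter_demo
```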
Stride access in powers of 2 can degrade speed by one to two orders of
magnitude; changing the stride by one (i.e. to 2**n-1 or 2**n+1) can be
another way to get a significant improvement.
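
One common way to apply this, sketched under the assumption that the power-of-two stride comes from the leading dimension of a 2-D array: pad the leading dimension by one. The sizes and names below are illustrative only, and the sketch does not measure anything; any gain depends on the machine's cache and memory-bank layout.

```fortran
program padded_leading_dimension
  implicit none
  integer, parameter :: n   = 1024     ! logical problem size, 2**10
  integer, parameter :: lda = n + 1    ! padded leading dimension, 2**10 + 1
  real, allocatable :: a(:, :)
  real    :: row_sum
  integer :: j

  allocate(a(lda, n))   ! only rows 1..n are used; row lda is padding
  a = 1.0

  ! Walking along a row touches a(1,1), a(1,2), ...: consecutive accesses
  ! are lda elements apart in memory. With lda = 1024 that stride would be
  ! a power of two; the padding makes it 1025 instead.
  row_sum = 0.0
  do j = 1, n
    row_sum = row_sum + a(1, j)
  end do

  print *, row_sum
end program padded_leading_dimension
```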
Mr David Bailey of Ames (now LBL?) has done some magnificent studies
of the stride effect, to improve FFT algorithms, and deserves credit
here.
> ... Anyway, what you said still doesn't contradict the observation
> that making copies is *sometimes* a more efficient way.
>
> If you require procedures to accept and work with discontiguous
> slices in all cases, your code will usually be slower than making
> a copy and passing it as a contiguous piece.
It may be useful to point out that there are actually two problems
that both can lead to significant slowing down of the code:
A) The compiler makes a copy when calling a routine, in a case
where we know it is not necessary.
B) The compiler assumes that an argument can be discontiguous,
and treats it accordingly in the routine.
If the argument really is discontiguous, then either A or B will
be necessary, and the only thing we could wish for is that the
compiler chooses the least of two evils. This seems pretty
difficult to do (and compilers don't do it, as far as I know).
If the argument is contiguous, especially if this is true for each
call to the routine, then one can still argue that A may be useful
in rare cases, but we would at least want the compiler to avoid the
overhead of B. Again, many compilers fail to do this.
In practice, this means that when passing a contiguous array, for
instance, to a time-critical routine, one should:
(1) Write the routine with an f77-style argument.
(2) Call the routine with a single array element as the argument, even
though you want it to work with the entire array, e.g.:
call sub( A(1,1,1) )
if A is a rank-3 array.
This seems a bit out of style, in f90, but if you forget (1), then
current compilers give you penalty B, and if you forget (2) they
often give you penalty A.
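
Points (1) and (2) together can be sketched as below. The module and routine names are hypothetical. The call relies on sequence association: passing the element A(1,1,1) of an ordinary (not assumed-shape, not pointer) array hands the routine that element and everything that follows it in storage.

```fortran
module kernels
  implicit none
contains
  ! (1) F77-style dummy: explicit shape, so the routine can assume
  !     contiguous storage and needs no descriptor handling.
  subroutine scale3(a, n1, n2, n3)
    integer, intent(in)    :: n1, n2, n3
    real,    intent(inout) :: a(n1, n2, n3)
    integer :: i, j, k
    do k = 1, n3
      do j = 1, n2
        do i = 1, n1
          a(i, j, k) = 2.0 * a(i, j, k)
        end do
      end do
    end do
  end subroutine scale3
end module kernels

program call_by_first_element
  use kernels
  implicit none
  real :: a(4, 3, 2)

  a = 1.0

  ! (2) Pass the first element rather than the array name or a section.
  call scale3(a(1, 1, 1), 4, 3, 2)

  print *, all(a == 2.0)
end program call_by_first_element
```

The price is that the compiler can no longer check the shape of the actual argument against the dummy, so the extents passed in must be right.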
> ...
> more direct way for programmers to control these issues. Ken
> Kennedy proposed such a mechanism in 1982 ...
Did this address (and solve) both the issues described above?
-- Jos
>
> But the biggest improvement I would like for F9x compilers is
> to be able to control the argument-passing mechanism. Or at
> least, diagnostic listings that _document_ which mechanism is
> being used, on a case-by-case basis. It takes too damned long
> to have to decipher the assembly listing for every case of
> potential significance.
>
Some compilers do better than others. You will find below a snippet
that gives some ideas of the compiler performance related to argument
passing. I added a small shell script that may be helpful to run this
small benchmark under Unix.
Interestingly enough, compilers made some real progress in the last
three years. On Unix, most of them show good results. On Windows,
the situation looked less good last time I tried.
Regards,
Arnaud
! tab2.f90
!-----------------------------------------------------------------------------
! Test program to show the difference of treatment between a "regular array"
! and an array in a derived type
!
! This snippet derives from the experience gained during a port of a large
! production system (http://www.telemac-system.com).
!
! Originally issued by Jean-Michel Hervouet <j-m.he...@edf.fr>
! Modified by Arnaud Desitter - Nag Ltd. - <arnaud....@nag.co.uk>
!
! use separate compilation to avoid inlining
! typical example on Unix:
! f95 -O -c tab1.f90; f95 -O -c tab2.f90; f95 -O tab1.o tab2.o; ./a.out
!-----------------------------------------------------------------------------
PROGRAM TEST
USE BIEF_DEF
USE TIMING
IMPLICIT NONE
INTEGER :: I
INTEGER, PARAMETER :: NPOIN=100000, N=1000
double precision :: t1, t2
double precision :: timings(4)
TYPE(BIEF_OBJ), TARGET :: A_STRUCT ,B_STRUCT
DOUBLE PRECISION,ALLOCATABLE :: A_NORMAL(:),B_NORMAL(:)
!
ALLOCATE(A_STRUCT%R(NPOIN))
ALLOCATE(B_STRUCT%R(NPOIN))
ALLOCATE(A_NORMAL(NPOIN))
ALLOCATE(B_NORMAL(NPOIN))
DO I=1,NPOIN
B_STRUCT%R(I) = 1.D0
B_NORMAL(I) = 1.D0
ENDDO
call TIME_IN_SECONDS(T1)
DO I=1,N
CALL OV(A_NORMAL,B_NORMAL,NPOIN)
ENDDO
call TIME_IN_SECONDS(T2)
timings(1)=(T2-T1)
call TIME_IN_SECONDS(T1)
DO I=1,N
CALL OV(A_STRUCT%R,B_STRUCT%R,NPOIN)
ENDDO
call TIME_IN_SECONDS(T2)
timings(2)=(T2-T1)
call TIME_IN_SECONDS(T1)
DO I=1,N
CALL OV_sh(A_NORMAL,B_NORMAL,NPOIN)
ENDDO
call TIME_IN_SECONDS(T2)
timings(3)=(T2-T1)
call TIME_IN_SECONDS(T1)
DO I=1,N
CALL OV_sh(A_STRUCT%R,B_STRUCT%R,NPOIN)
ENDDO
call TIME_IN_SECONDS(T2)
timings(4)=(T2-T1)
!
timings=timings/timings(1)
write(unit=*,fmt=10) timings(2:4)
10 format(&
1x,"Abstraction penalty versus regular array + assumed size",&
" dummy argument",/,&
1x,"(1 means no penalty)",/,&
1x,"Derived type + assumed size : ",F8.3,/,&
1x,"Regular array + assumed shape: ",F8.3,/,&
1x,"Derived type + assumed shape : ",F8.3)
END PROGRAM TEST
! tab1.f90
MODULE BIEF_DEF
IMPLICIT NONE
!
TYPE BIEF_OBJ
DOUBLE PRECISION, POINTER,DIMENSION(:)::R
END TYPE BIEF_OBJ
contains
SUBROUTINE OV(X,Y,NPOIN)
!
IMPLICIT NONE
INTEGER, INTENT(IN) :: NPOIN
DOUBLE PRECISION, INTENT(IN) :: Y(NPOIN)
DOUBLE PRECISION, INTENT(INOUT) :: X(NPOIN)
INTEGER :: I
!
DO I=1,NPOIN
X(I) = 2.D0*Y(I)
ENDDO
!
RETURN
END SUBROUTINE OV
SUBROUTINE OV_sh(X,Y,NPOIN)
!
IMPLICIT NONE
INTEGER, INTENT(IN) :: NPOIN
DOUBLE PRECISION, INTENT(IN) :: Y(:)
DOUBLE PRECISION, INTENT(INOUT) :: X(:)
INTEGER :: I
!
DO I=1,NPOIN
X(I) = 2.D0*Y(I)
ENDDO
!
RETURN
END SUBROUTINE OV_sh
END MODULE BIEF_DEF
MODULE TIMING
CONTAINS
subroutine TIME_IN_SECONDS ( t )
IMPLICIT NONE
!
double precision, intent(out) :: T
INTEGER :: TEMPS,PARSEC
intrinsic dble
!
CALL SYSTEM_CLOCK(COUNT=TEMPS,COUNT_RATE=PARSEC)
T = dble(TEMPS) / PARSEC
!
RETURN
END subroutine TIME_IN_SECONDS
END MODULE TIMING
go_test:
#! /bin/sh
compile_run(){
echo "======================"
echo "******* build *******"
(
set -x
${FC} ${FFLAGS} -c tab1.${SUFF}
${FC} ${FFLAGS} -c tab2.${SUFF}
${FC} ${FFLAGS} tab1.o tab2.o
) 2>&1
echo "********* run ********"
./a.out
echo "**********************"
echo
}
exe_test(){
compile_run
}
clean(){
rm -f a.out
rm -f *.o
rm -f *.mod *.MOD *.M
}
go_nagf95(){
FC='f95 -w'
VERSIONFLAGS='-V'
echo Platform: `uname`
${FC} ${VERSIONFLAGS} 2>&1
SUFF='f90'
FFLAGS='-O4'
exe_test
FFLAGS='-O4 -Oassumed=contig'
exe_test
}
go_hpf90(){
FC='f90'
VERSIONFLAGS='+version'
echo Platform: `uname`
${FC} ${VERSIONFLAGS} 2>&1
SUFF='f90'
FFLAGS='+O2'
exe_test
}
go_lx_sgif90(){
FC='sgif90'
VERSIONFLAGS='-version'
echo Platform: `uname`
${FC} ${VERSIONFLAGS} 2>&1
SUFF='f90'
FFLAGS='-O2'
exe_test
}
go_sgif90(){
FC='f90'
VERSIONFLAGS='-version'
echo Platform: `uname`
${FC} ${VERSIONFLAGS} 2>&1
SUFF='f90'
FFLAGS='-O2'
exe_test
}
go_duxf90(){
FC='f90'
VERSIONFLAGS='-version'
echo Platform: `uname`
${FC} ${VERSIONFLAGS} 2>&1
SUFF='f90'
FFLAGS='-O3'
exe_test
}
go_sunf90(){
FC='f90'
VERSIONFLAGS='-V'
echo Platform: `uname`
${FC} ${VERSIONFLAGS} 2>&1
SUFF='f90'
FFLAGS='-O3'
exe_test
}
go_xlf90(){
FC='xlf'
echo Platform: `uname`
lslpp -l | grep Fortran
ln -s tab1.f90 tab1.f 2> /dev/null
ln -s tab2.f90 tab2.f 2> /dev/null
SUFF='f'
FFLAGS='-qfree=f90 -O'
exe_test
rm -f tab1.f tab2.f
}
case `uname` in
HP*)
go_hpf90
;;
IRIX*)
go_sgif90
;;
OSF1)
go_duxf90
;;
Linux)
go_nagf95
case `uname -m` in
ia64)
go_lx_sgif90
;;
esac
;;
SunOS)
go_sunf90
;;
AIX)
go_xlf90
;;
esac
clean
> also, e.g. using a subscript triplet with a stride > 1. I _think_ this pretty much
> guarantees an argument copy.
Not if the dummy is assumed shape.
--
Richard Maine
email: my last name at domain
domain: qnet dot com
Really? Man, just when I think I've sussed something out, Richard, you return from
vacation! :o)
So if I have:
PROGRAM blah
REAL, DIMENSION( 100 ) :: X
....
CALL testcall( x( 1:100:5 ) )
....
CONTAINS
SUBROUTINE testcall( x )
REAL, DIMENSION( : ) :: x
INTEGER :: n
n = SIZE( x )
DO i = 1, n
.... do something with data
END DO
....
END SUBROUTINE testcall
END PROGRAM BLAH
then whether or not an argument copy occurs is still platform/compiler dependent? What
about if the array is 2-D, e.g. x( 50, 100 ) and passed as x( :, 1:100:5 ) ?
cheers,
paulv
p.s. Good to have you back. Hope your holiday was fun and fortran-free :o)
> So if I have:
...
> CALL testcall( x( 1:100:5 ) )
> ....
> CONTAINS
>
> SUBROUTINE testcall( x )
>
> REAL, DIMENSION( : ) :: x
...
> then whether or not an argument copy occurs is still platform/compiler dependent?
Correct. Odds are that most compilers will not do a copy. The
subroutine is likely to be coded to deal with the non-contiguous
argument. But that isn't guaranteed. For example, NAG has a compiler
switch that affects the decision here - by default, there is no copy,
but if you have reason to believe that the copy would give enough
better performance (because of the subroutine then working on a
contiguous array) to make up for the cost of the copy, you can force
one.
> What
> about if the array is 2-D, e.g. x( 50, 100 ) and passed as x( :, 1:100:5 ) ?
The same.
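
A small sketch for checking what an assumed-shape dummy actually receives in the two cases above. It assumes a compiler that supports the IS_CONTIGUOUS intrinsic, which comes from Fortran 2008 and is therefore newer than the f90/f95 compilers discussed in this thread. The routine names are made up. The output is deliberately not listed: if the compiler passes the section through a descriptor, the strided cases report non-contiguous; if it chose to make a copy, they may not.

```fortran
program assumed_shape_sections
  implicit none
  real :: x(100), y(50, 100)

  x = 0.0
  y = 0.0

  call report1('x(1:100:5)   ', x(1:100:5))
  call report1('x(1:20)      ', x(1:20))
  call report2('y(:, 1:100:5)', y(:, 1:100:5))
contains
  ! Rank-1 assumed-shape dummy: receives shape and stride information.
  subroutine report1(label, v)
    character(len=*), intent(in) :: label
    real,             intent(in) :: v(:)
    print *, label, ' size =', size(v), ' contiguous:', is_contiguous(v)
  end subroutine report1

  ! Rank-2 variant for the x(:, 1:100:5) case.
  subroutine report2(label, m)
    character(len=*), intent(in) :: label
    real,             intent(in) :: m(:, :)
    print *, label, ' shape =', shape(m), ' contiguous:', is_contiguous(m)
  end subroutine report2
end program assumed_shape_sections
```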
> p.s. Good to have you back. Hope your holiday was fun and fortran-free :o)
Yep. Aided by the fact that I lost my secure ID card, so I couldn't get
through the firewall into the NASA machines, and it would have been a
long distance call to access my home ISP. Perhaps my subconscious decided
I needed to lose it.