Pros and cons of assumed-shape arrays

deltaquattro

Nov 21, 2008, 7:37:18 AM

Hello,

I've always used extensively assumed-shape arrays in my subroutines,
partly because of the suggestions of most F90/95 books, and partly
because I like to use fancy new language features when they come
out :-) Until now, however, I've never thought seriously about their
pros and cons. Recently I've seen a lot of F95 codes which use them
sparingly, allegedly because of performance penalties. See also here:

http://www.baronams.com/staff/coats/epa2002_seminar.html#exmois

So I've started to think about the issue, and in the end I can't see a
lot of reasons to prefer assumed-shape. Why risk a performance hit when
I just need to pass a few extra integers to the subroutine? A
different case altogether is when the caller itself cannot
determine the dimensions of the array being passed. However, in this
case you can't use assumed-shape either - you have to resort to
allocatable array dummy arguments, and while at that, be grateful that
they exist, since they're so very useful :-) But other than this, I
can't see reasons against explicit-shape. Before changing my coding
habit, however, I'd like to ask your opinion. What do you think?

Cheers

deltaquattro

Jan Vorbrüggen

Nov 21, 2008, 8:13:39 AM

> So I've started to think about the issue, and in the end I can't see a lot
> of reasons to prefer assumed-shape. Why risk a performance hit when
> I just need to pass a few extra integers to the subroutine ?

If you think about it, this is just what the generated code will be
doing "behind your back" when you are using assumed-shape arrays. Why
should there be a difference in performance? If there is any, it should
favor the automatically-generated code.

A different case arises when an array has dimensions that are known at
compile time. The compiler can then - at least in some cases - optimize
more because of its knowledge of the actual value of the dimensions. But
that's not what you are talking about.
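
A minimal sketch of the two styles under discussion, added for illustration. The names scale_demo, scale_assumed and scale_explicit are hypothetical and not from the thread. It only shows that the explicit-shape version passes by hand the extent that the assumed-shape version receives implicitly; it says nothing about speed.

```fortran
! Sketch only: module and procedure names are hypothetical.
module scale_demo
   implicit none
contains
   ! Assumed-shape: the extent arrives with the array (explicit interface needed).
   subroutine scale_assumed(a, factor)
      real, intent(inout) :: a(:)
      real, intent(in)    :: factor
      a = factor * a
   end subroutine scale_assumed

   ! Explicit-shape: the caller passes the extent as an extra integer.
   subroutine scale_explicit(a, n, factor)
      integer, intent(in) :: n
      real, intent(inout) :: a(n)
      real, intent(in)    :: factor
      a = factor * a
   end subroutine scale_explicit
end module scale_demo

program demo
   use scale_demo
   implicit none
   real :: v(4), w(4)
   v = (/ 1.0, 2.0, 3.0, 4.0 /)
   w = v
   call scale_assumed(v, 2.0)
   call scale_explicit(w, size(w), 2.0)
   print *, all(v == w)   ! both routines perform the same arithmetic
end program demo
```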

> A different case altogether is when the caller itself cannot
> determine the dimensions of the array being passed. However, in this
> case you can't use assumed-shape either

I really can't understand what you want to say here. Allocatable arrays
are a subset of assumed-shape arrays, in any case.

Jan

Jugoslav Dujic

Nov 21, 2008, 8:20:23 AM

deltaquattro wrote:
> Hello,
>
> I've always used extensively assumed-shape arrays in my subroutines,
> partly because of the suggestions of most F90/95 books, and partly
> because I like to use fancy new language features when they come out
> :-) Until now, however, I've never thought seriously about their pros
> and cons. Recently I've seen a lot of F95 codes which use them
> sparingly, allegedly because of performance penalties. See also here:
>
>
> http://www.baronams.com/staff/coats/epa2002_seminar.html#exmois
>
> So I've started to think about the issue, and in the end I can't see a
> lot of reasons to prefer assumed-shape. Why risk a performance hit
> when I just need to pass a few extra integers to the subroutine ?

"Premature optimization is the root of all evils."

As you said yourself, the performance penalty is alleged rather than
proven. I've seen performance measurements which show that
assumed-shape is about equally fast or, paradoxically, even faster, on
several compilers (sorry that I don't have a link handy). Even if we
accept that non-contiguous arrays are slower, they form a fairly small
fraction of typical use cases.

Thus, changing an elegant and safe coding habit to a clumsier and less
safe one, for an alleged performance gain, doesn't look like a good idea
to me.

--
Jugoslav
www.xeffort.com
Please reply to the newsgroup.
You can find my real e-mail on my home page above.

Gordon Sande

Nov 21, 2008, 8:48:52 AM

These sorts of issues tend to be very date dependent. The very first version
of an F90 compiler may have used bulky, generic, and maybe slower, instruction
sequences to access assumed shape arrays. It is a safe bet that those generic
code sequences were very high on the list of optimizations to sort out if
the compiler treated performance as an important issue.

Some of the "complaints" may have been cover for folks who were feeling
pressure to move to F90 and away from their comfort zone of F66 (none
of that newfangled F77 nonsense!). Being fast is clearly more important
than being maintainable, updateable or even correct (for some). :-(

So look hard at both the timing and motivation of the complaints. Make
sure you have a ready supply of pinches of salt.

Dick Hendrickson

Nov 21, 2008, 10:39:01 AM

Jan Vorbrüggen wrote:
>> So I've started to think about the issue, and in the end I can't see a lot
>> of reasons to prefer assumed-shape. Why risk a performance hit when
>> I just need to pass a few extra integers to the subroutine ?
>
> If you think about it, this is just what the generated code will be
> doing "behind your back" when you are using assumed-shape arrays. Why
> should there be a difference in performance? If there is any, it should
> favor the automatically-generated code.

The one thing you lose with assumed shape arrays is knowledge about
dimension relations. In most subroutines with several arrays as
arguments almost all of the arrays have similar shapes. Some
are (N), some (N,M), some (N,M,L) etc. That's usually fixed
by the physics of the problem. It's a useful documentation for the
user to see the expected/required relation explicitly. It's a
potential optimization aid for the compiler. When it sees
something like
A(I,J) = B(I,J)
it can treat the (I,J) as a common subexpression. If it has to
evaluate the expression twice (or 5 or 10 times), there's more of a
chance of running out of registers.

Dick Hendrickson

deltaquattro

Nov 21, 2008, 10:51:17 AM

On 21 Nov, 14:13, Jan Vorbrüggen <Jan.Vorbrueg...@not-thomson.net>
wrote:

> > A different case altogether is when the caller itself cannot
> > determine the dimensions of the array being passed. However, in this
> > case you can't use assumed-shape either
>
> I really can't understand what you want to say here. Allocatable arrays
> are a subset of assumed-shape arrays, in any case.

If allocatable array dummy arguments were just a subset of
assumed-shape arrays, they should have worked in F90, while they don't.
Nor do they in F95 - but luckily there is the F95 Allocatable TR.

>
> Jan

Best Regards,

deltaquattro

Nov 21, 2008, 11:00:15 AM

On 21 Nov, 16:39, Dick Hendrickson <dick.hendrick...@att.net> wrote:

[..]

>
> The one thing you lose with assumed shape arrays is knowledge about
> dimension relations. In most subroutines with several arrays as
> arguments almost all of the arrays have similar shapes. Some
> are (N), some (N,M), some (N,M,L) etc. That's usually fixed
> by the physics of the problem. It's a useful documentation for the
> user to see the expected/required relation explicitly.

That's something I've also thought of, even though most of the time
you can grasp such relations easily by just looking at the code:

function add(a,b)
   implicit none
   real, dimension(:) :: a
   real, dimension(:) :: b
   real, dimension(size(a)) :: add
   add = a+b ! it's quite clear that a and b must have the same dimension
end function add

> It's a
> potential optimization aid for the compiler. When it sees
> something like
> A(I,J) = B(I,J)
> it can treat the (I,J) as a common subexpression. If it has to
> evaluate the expression twice (or 5 or 10 times), there's more of a
> chance of running out of registers.
>
> Dick Hendrickson

I'm not sure I understand the reason why, but you're saying that
assumed shape may be slower, aren't you?

Best Regards

deltaquattro

Nov 21, 2008, 11:11:06 AM

On 21 Nov, 14:20, Jugoslav Dujic <jdu...@yahoo.com> wrote:
>
> "Premature optimization is the root of all evils."
>
> As you said yourself, the performance penalty is alleged rather than
> proven. I've seen the performance measurements which prove that
> assumed-shape is about equally fast or, paradoxically, even faster, on
> several compilers (sorry that I don't have a link handy). Even if we
> accept that non-contiguous arrays are slower, they form a fairly small
> fraction of typical use cases.
>
> Thus, changing an elegant and safe coding habit to a clumsier and less
> safe, for an alleged performance gain, doesn't look like a good idea
> to me.
>

Hi, Jugoslav,

I also like the "style" of assumed shape best, otherwise I wouldn't
have used it extensively up to now. Clearly, the best way to settle the
issue would be to do some timings myself. However, I don't have time for
extensive timing tests. Also, as someone once said to me on this ng,
timing tests must be cleverly designed in order to be meaningful, and
since I don't understand the reasons for such (alleged) differences,
I'm not in a position to do that. So I asked here, wondering if anyone
on this ng had had similar performance problems with assumed-shape arrays.
This doesn't seem to be the case, so I could close the issue here.
However, if anybody is working on large, computationally intensive
codes, for example in meteorology, CFD or computational chemistry,
and can tell me they didn't find significant differences, I
would feel a bit safer ;-) Thanks,

Cheers

deltaquattro


deltaquattro

Nov 21, 2008, 11:30:16 AM

On 21 Nov, 17:11, deltaquattro <deltaquat...@gmail.com> wrote:
> On 21 Nov, 14:20, Jugoslav Dujic <jdu...@yahoo.com> wrote:
>
>
>
> > "Premature optimization is the root of all evils."
>
> > As you said yourself, the performance penalty is alleged rather than
> > proven. I've seen the performance measurements which prove that
> > assumed-shape is about equally fast or, paradoxically, even faster, on
> > several compilers (sorry that I don't have a link handy). Even if we

[..]

> However if anybody is working on large, computationally intensive
> codes such for example meteorology or CFD or computational chemistry
> etc. and he can tell me he didn't find significant differences, I
> would feel a bit safer ;-) Thanks,

Just a note: I'm absolutely not saying that the performance
measurements you saw aren't reliable. I'm only suggesting that if
anybody had "real-life" experience on the matter, for codes similar to
those I work on, I'd very much like to hear from him.

Cheers

deltaquattro

Jan Vorbrüggen

Nov 21, 2008, 12:15:17 PM

> If allocatable array dummy arguments were just a subset of
> assumed-shape arrays, they should have worked in F90, while they don't.
> Nor do they in F95 - but luckily there is the F95 Allocatable TR.

Tsk, tsk - they are a subset precisely because they have additional
properties that other assumed-shape arrays do not have. Richard has said
several times that the Allocatable TR was needed in order to get the
semantics correct, which couldn't be done in the F90 and F95 time frame.

Jan

Ron Shepard

Nov 21, 2008, 12:24:03 PM

In article
<pCAVk.146467$Mh5....@bgtnsc04-news.ops.worldnet.att.net>,
Dick Hendrickson <dick.hen...@att.net> wrote:

> It's a
> potential optimization aid for the compiler. When it sees
> something like
> A(I,J) = B(I,J)
> it can treat the (I,J) as a common subexpression. If it has to
> evaluate the expression twice (or 5 or 10 times), there's more of a
> chance of running out of registers.

This "common subexpression" would apply only when J is the innermost
loop (or is changing through evaluation in some other way). If J is
fixed and only I is changing, then the addresses would be computed
separately from the array column offset in all cases. But if J is
changing in the innermost loop, then your code is already
nonoptimal, and changing the dummy array declarations isn't going to
help.

The biggest performance hit you take when using assumed shape arrays
is when you pass noncontiguous array sections to a subprogram that
uses some array declaration that requires contiguous addressing
(such as explicit shape or assumed size, or arrays with an implicit
interface). In this case, the compiler will make a copy of the data
for you. But there is no other option, is there, regardless of
whether you make the copy or let the compiler make it? If your data
is not stored contiguously, and your subprogram requires contiguous
storage, then that is the only way to make the two fit. So even
here, it would be somewhat rare for there to be a practical
difference. One such example might be if you have a noncontiguous
array section that is passed to several such subprograms. Then the
most efficient code might be to make a single local copy that is
contiguous, pass that array to the subprograms, and then at the end
copy it back to where it needs to be. The compiler would likely do
copy-in/copy-out for each subprogram invocation, but if you do it
manually, there would be only a single copy in each direction. But,
an even better solution might be to rewrite the subprograms to use
assumed shape arrays, and in this case there would be no copying
required at all, not multiple times and not even a single time.
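
A minimal sketch of the situation described above, added for illustration. The names copy_demo, bump_assumed and bump_explicit are hypothetical. A row of a rank-2 array is a noncontiguous section; whether a given compiler really makes a temporary copy for the explicit-shape call is an assumption to check with that compiler's diagnostics, not something this code demonstrates by itself.

```fortran
! Sketch only: module and procedure names are hypothetical.
module copy_demo
   implicit none
contains
   ! Assumed-shape dummy: can work on the strided section in place.
   subroutine bump_assumed(v)
      real, intent(inout) :: v(:)
      v = v + 1.0
   end subroutine bump_assumed

   ! Explicit-shape dummy: expects contiguous storage, so a noncontiguous
   ! actual argument is typically passed via copy-in/copy-out.
   subroutine bump_explicit(v, n)
      integer, intent(in) :: n
      real, intent(inout) :: v(n)
      v = v + 1.0
   end subroutine bump_explicit
end module copy_demo

program demo
   use copy_demo
   implicit none
   real :: a(3,4)
   a = 0.0
   ! a(2,:) is a row: consecutive elements are 3 reals apart in memory.
   call bump_assumed(a(2,:))
   call bump_explicit(a(2,:), 4)
   print *, a(2,:)   ! each element of the row has been incremented twice
end program demo
```

Both calls give the same result; the difference, if any, is only in the hidden copying.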

So although there are some situations where there might be a
potential performance advantage to assumed size (or explicit
declarations) arrays, in practice I don't think that the casual
programmer needs to be concerned with it. And if you do want to go
to the effort to optimize these cases, you can optimize it with
assumed shape arrays, so again there is no need to avoid them.

And there are huge advantages during the code development stage in
favor of assumed shape arrays, so ultimately you have to figure out
the tradeoff between spending weeks (or months) debugging some code
using one of the alternatives compared to the execution time that it
might save.

$.02 -Ron Shepard

Dick Hendrickson

Nov 21, 2008, 12:31:35 PM

deltaquattro wrote:
> On 21 Nov, 16:39, Dick Hendrickson <dick.hendrick...@att.net> wrote:
>
> [..]
>
>> The one thing you lose with assumed shape arrays is knowledge about
>> dimension relations. In most subroutines with several arrays as
>> arguments almost all of the arrays have similar shapes. Some
>> are (N), some (N,M), some (N,M,L) etc. That's usually fixed
>> by the physics of the problem. It's a useful documentation for the
>> user to see the expected/required relation explicitly.
>
> That's something I've also thought of, even though most of the time
> you can grasp such relations easily by just looking at the code:
>
> function add(a,b)
> implicit none
> real, dimension(:) :: a
> real, dimension(:) :: b
> real, dimension(size(a)) :: add
> add = a+b ! it's quite clear that a and b must have the same
> dimension

Sure, all code is self documenting once you understand it ;).

> end function add
>
>
> It's a
>> potential optimization aid for the compiler. When it sees
>> something like
>> A(I,J) = B(I,J)
>> it can treat the (I,J) as a common subexpression. If it has to
>> evaluate the expression twice (or 5 or 10 times), there's more of a
>> chance of running out of registers.
>>
>> Dick Hendrickson
>
> I'm not sure I understood the reason why, however you're saying that
> assumed shape may be slower, isn't it?

I'm saying "maybe slower". With modern pipelined machines
it's hard to make generalizations based on operation counts.
To evaluate the subscript A(I,J), the compiler must compute
something like
stride_1*(I-LB_1) + stride_1*stride_2*size(A,1)*(J-LB_2)
For assumed shape arrays the strides and lower bounds (LB_*)
are passed in with the hidden dimensions. If there are several
arrays the compiler will have to evaluate similar expressions
for each reference. For explicit dimensions, the compiler will
know that the strides are 1 and that the lower bounds and sizes
are the same. It effectively only evaluates the expression once
and uses the value for all of the references.

It's hard to know how much difference this makes. As a practical
matter, all of the computation time in programs with arrays takes
place inside of DO loops. Almost all of the address computation
is loop invariant and will be done once in the loop preamble
(or even in the routine entry code) and saved in a register.
The final multiplies will be strength-reduced to adds and
pipelined in with the other loop computations. You're unlikely
to see a difference on reasonable codes.

Personally, I think the choice is a matter of what's easier
to code and read for you. A program that works and that you
understand is better than the other kind. :)

If timing is reasonably important, you could try writing
some subroutines like
      subroutine xxx(a,b,c,d,e,f...)
      real, dimension(:,:,:,:...) :: a,b,c,d,e,f...
      a = b+c+d+e+f...

versus
      subroutine yyy(a,b,c,d,e,f...I,J,K,L,...)
      real, dimension(I,J,K,L...) :: a,b,c,d,e,f...
      a = b+c+d+e+f...

and look at timings as you vary the number of arrays, the
number of dimensions, and the complexity of the working
computation. If you have optimization turned up to almost
anything above "brain-dead" I don't think you'll see a
significant difference.
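
One concrete way such a comparison might be set up, as a rough sketch rather than a careful benchmark. The names timing_demo, add_assumed and add_explicit, the array sizes and the repetition count are all assumptions chosen for illustration. It uses the F95 intrinsic cpu_time, and no timings are quoted because they depend entirely on compiler, options and machine.

```fortran
! Sketch only: names, sizes and repetition count are arbitrary assumptions.
module timing_demo
   implicit none
contains
   subroutine add_assumed(a, b, c)
      real, intent(out) :: a(:,:)
      real, intent(in)  :: b(:,:), c(:,:)
      a = b + c
   end subroutine add_assumed

   subroutine add_explicit(a, b, c, n, m)
      integer, intent(in) :: n, m
      real, intent(out) :: a(n,m)
      real, intent(in)  :: b(n,m), c(n,m)
      a = b + c
   end subroutine add_explicit
end module timing_demo

program time_it
   use timing_demo
   implicit none
   integer, parameter :: n = 500, m = 500, reps = 200
   real, allocatable :: a(:,:), b(:,:), c(:,:)
   real :: t0, t1, t2, check
   integer :: r

   allocate(a(n,m), b(n,m), c(n,m))
   call random_number(b)
   call random_number(c)
   check = 0.0

   call cpu_time(t0)
   do r = 1, reps
      call add_assumed(a, b, c)
      check = check + a(1,1)   ! use the result so the work is not discarded
   end do
   call cpu_time(t1)
   do r = 1, reps
      call add_explicit(a, b, c, n, m)
      check = check + a(1,1)
   end do
   call cpu_time(t2)

   print *, 'assumed-shape  (s):', t1 - t0
   print *, 'explicit-shape (s):', t2 - t1
   print *, 'checksum:', check
end program time_it
```

Varying the rank, the number of arrays and the optimization level, as suggested above, would be the next step.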

Dick Hendrickson
>
> Best Regards
>
> deltaquattro

Steven Correll

Nov 21, 2008, 1:12:38 PM

On Nov 21, 5:37 am, deltaquattro <deltaquat...@gmail.com> wrote:
> Hello,
>
> I've always used extensively assumed-shape arrays in my subroutines,
> partly because of the suggestions of most F90/95 books, and partly
> because I like to use fancy new language features when they come
> out :-) Until now, however, I've never thought seriously about their
> pros and cons. Recently I've seen a lot of F95 codes which use them
> sparingly, allegedly because of performance penalties.

A compiler which performs inter-procedural analysis and optimization
will often be able to generate the same code for assumed-shape and
explicit-shape arrays, giving you the benefit of assumed-shape (no
danger of discrepancy between actual and dummy argument shapes) for
free.

Richard Maine

Nov 21, 2008, 1:54:35 PM

Jugoslav Dujic <jdu...@yahoo.com> wrote:

> deltaquattro wrote:

> > Recently I've seen a lot of F95 codes which use them

> > [assumed-shape arrays] sparingly, allegedly because of
> > performance penalties.

> "Premature optimization is the root of all evils."

Amen, brother! Knuth is a really, really bright guy. Ignore his advice
at your risk. It is orders of magnitude more valuable than random
"allegations" that you might read on the net. It is more valuable than
most advice you are likely to get here (alas, including my own), except
to the extent that the advice here is to pay attention to Knuth.

Note also the many, many tales of people doing complicated things
allegedly to improve performance, only to actually make the performance
worse. I've been there and done that one. Another one of those
programming aphorisms is that, when it comes to performance, the most
important things are measurement, measurement, and measurement.

People regularly make excuses for why they don't make measurements, but
the excuses all pretty much sound the same, and they aren't good enough.
If you are going to change nice, clear code into something more
complicated just because of performance reasons, you need to actually
measure that performance. That's even more so when you are talking about
a change in coding style that will be pervasive throughout your code.

You also need to do a good job of making sure that the performance
measurement is actually measuring what you think it is. Gordon mentioned
the supply of salt you need to have handy when reading performance
reports on the net. Some of them are probably well done. Others... well,
we've had great counterexamples in this newsgroup; I'll avoid mentioning
names. I suspect that Sturgeon's Law applies. ("Ninety percent of
everything is crap.")

There is a fairly large subset of alleged performance comparisons on the
net where the comparison ends up depending on things completely
unrelated to what it purports. You'll see "I rewrote my code using
feature x and it slowed down a lot, so feature x is slow," but it will
turn out that the code rewrite involved lots of other changes in
addition to "feature x". Or the writer didn't understand how to use
"feature x" well.

> As you said yourself, the performance penalty is alleged rather than
> proven.

I've never measured such a performance penalty myself. I can believe there
might be one in some cases, but I've never documented one first hand.
Partly that's because it just wouldn't make any difference to me. It
would have to be a pretty big penalty to outweigh the clarity and
functionality benefits of assumed shape for me and to force me to redo
pretty much all array code I've written in the past decade and a half.

It isn't plausible for there to be that big a penalty. The alleged
explanations of how one would come about don't support penalties big
enough. Also, I have on many occasions done performance tests of my
overall applications. Those high-level performance tests don't tell me
precise numbers on how much was from assumed-shape issues, but they
bound it, and the bound is well within levels that are tolerable to me.

The one place where I definitely did directly measure huge and
unacceptable performance penalties was the copy-in/copy-out that you
sometimes get from randomly mixing assumed-shape and explicit shape
coding styles. That's not a consequence of one or the other, but of
inconsistency. Compilers have also gotten better about optimizing that
kind of thing away.

I mentioned above that Knuth is a really bright guy. But it's been too
many paragraphs, so maybe it is about time to repeat it. It merits the
repeat. If you concentrate too narrowly on performance, your programs
will end up as... well... material to support Sturgeon's Law. They will
be buggy, hard to read and maintain, and they probably won't actually
end up performing very well anyway because you'll have concentrated so
much on code optimization that you miss the bigger picture.

--
Richard Maine | Good judgment comes from experience;
email: last name at domain . net | experience comes from bad judgment.
domain: summertriangle | -- Mark Twain

Glen Herrmannsfeldt

Nov 21, 2008, 4:37:45 PM

Ron Shepard wrote:
(snip)

> The biggest performance hit you take when using assumed shape arrays
> is when you pass noncontiguous array sections to a subprogram that
> uses some array declaration that requires contiguous addressing
> (such as explicit shape or assumed size, or arrays with an implicit
> interface).

Non-contiguous arrays also don't make as efficient use
of the cache on most processors. Though if a column
(that addressed by the leftmost subscript) is contiguous the
cache effect should be small.

Even more, be sure the subscripts and loops are in
the right order.
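
A small illustration of what "the right order" means in Fortran, added as a sketch with an arbitrary array size. Fortran stores arrays column by column, so the leftmost subscript should vary fastest.

```fortran
! Sketch only: the array size is an arbitrary assumption.
program loop_order
   implicit none
   integer, parameter :: n = 1000
   real, allocatable :: a(:,:)
   integer :: i, j

   allocate(a(n,n))
   ! Column-major storage: a(1,j), a(2,j), ... are adjacent in memory,
   ! so i (the leftmost subscript) belongs in the innermost loop.
   do j = 1, n
      do i = 1, n
         a(i,j) = real(i + j)
      end do
   end do
   print *, a(1,1), a(n,n)
end program loop_order
```

Swapping the two loops computes the same values but walks memory with a stride of n elements.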

-- glen

Glen Herrmannsfeldt

Nov 21, 2008, 4:43:34 PM

deltaquattro wrote:

> On 21 Nov, 16:39, Dick Hendrickson <dick.hendrick...@att.net> wrote:

>>The one thing you lose with assumed shape arrays is knowledge about
>>dimension relations. In most subroutines with several arrays as
>>arguments almost all of the arrays have similar shapes.

(snip)

> That's something I've also thought of, even though most of the time
> you can grasp such relations easily by just looking at the code:

> function add(a,b)
> implicit none
> real, dimension(:) :: a
> real, dimension(:) :: b
> real, dimension(size(a)) :: add
> add = a+b ! it's quite clear that a and b must have the same

Well, at that point the compiler can assume that size(a).eq.size(b),
as the last statement would otherwise be illegal. If you do:

function add(a,b)
   implicit none
   integer i

   real, dimension(:) :: a
   real, dimension(:) :: b
   real, dimension(size(a)) :: add

   do i=1,size(a)
      add(i) = a(i)+b(i) ! it's NOT clear that a and b must have the same size
   enddo
end function add

then the compiler can't assume that. Adding

   if(size(a).ne.size(b)) stop 'a and b don''t have the same size!'

would fix that, though.
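
Put together as a complete unit, the loop version with the size check might look like the sketch below. The module name add_demo, the result variable c and the intent attributes are additions for illustration, not part of the code posted in the thread.

```fortran
! Sketch only: add_demo and the result name c are hypothetical.
module add_demo
   implicit none
contains
   function add(a, b) result(c)
      real, intent(in) :: a(:), b(:)
      real :: c(size(a))
      integer :: i
      ! Make the size relation explicit before the loop relies on it.
      if (size(a) /= size(b)) stop 'a and b don''t have the same size!'
      do i = 1, size(a)
         c(i) = a(i) + b(i)
      end do
   end function add
end module add_demo

program demo
   use add_demo
   implicit none
   print *, add((/ 1.0, 2.0 /), (/ 10.0, 20.0 /))
end program demo
```

Placing the function in a module gives callers the explicit interface that assumed-shape dummies require.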

-- glen

Glen Herrmannsfeldt

Nov 21, 2008, 4:58:45 PM

Gordon Sande wrote:

> On 2008-11-21 08:37:18 -0400, deltaquattro <deltaq...@gmail.com> said:

(snip)

>> Until now, however, I've never thought seriously about their
>> pros and cons. Recently I've seen a lot of F95 codes which use them
>> sparingly, allegedly because of performance penalties.

(snip regarding relative performance of assumed shape vs. assumed size)

> These sort of issues tend to be very date dependent. The very first
> version of an F90 compiler may have used bulky generic, and maybe
> slower, instruction sequences to access assumed shape arrays.

Since in most cases the address of a descriptor containing the
address is passed, instead of just the address, it takes one
more level of indirection. That is especially bad on register
starved architectures such as IA32. Not that I agree that it
is a problem, but the effect is likely there.

First, consider Amdahl's law: what fraction of time is
actually spent doing such array operations? In something like
matrix multiply that does spend just about all its time doing
array access, it might be noticeable. But matrix multiply
already has the problem that it accesses one of the arrays
in the wrong direction. There are known solutions to that,
but it is still likely slow.

My feeling, without actually measuring, is that you could
probably find a case where it was measurable, but most
likely overall a small effect. Add the time used getting
the algorithm right while keeping track of array dimensions,
that the compiler would do in assumed shape, and you
probably lose.
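
As a worked illustration of the Amdahl's law point, with assumed figures rather than measurements: let $p$ be the fraction of run time spent in array accesses and $s$ the factor by which that part gets faster. Taking $p = 0.5$ and $s = 1.1$ purely as hypothetical values:

```latex
S = \frac{1}{(1 - p) + \dfrac{p}{s}}
  = \frac{1}{0.5 + \dfrac{0.5}{1.1}}
  \approx \frac{1}{0.9545}
  \approx 1.048
```

So under these assumptions a 10% gain in the array-access part is worth a bit under 5% overall, and less still if $p$ is smaller.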

-- glen

Glen Herrmannsfeldt

Nov 21, 2008, 7:43:00 PM

deltaquattro wrote:

> I've always used extensively assumed-shape arrays in my subroutines,
> partly because of the suggestions of most F90/95 books, and partly
> because I like to use fancy new language features when they come
> out :-) Until now, however, I've never thought seriously about their
> pros and cons. Recently I've seen a lot of F95 codes which use them
> sparingly, allegedly because of performance penalties.

As a simple test, I compiled the following with g95 0.91!

subroutine shape(x,y,z)
real x(:),y(:),z(:)
z=x+y
end

subroutine size(x,y,z,n)
real x(n),y(n),z(n)
z=x+y
end

When compiled as g95 -S matest.f for IA32,
shape is 101 lines of assembly code, size is 89.

When compiled as g95 -S -O2 matest.f for IA32,
shape is 85 lines of assembly code, size is 31.

Lines of assembly code aren't the best measurement, but
they are a hint as to how complex some code sequence is.

It seems pretty definite that explicit shape (the size routine) optimizes
much better than assumed shape (at least for g95).
The actual time used by each still needs to be determined.

-- glen

Glen Herrmannsfeldt

Nov 21, 2008, 8:02:45 PM

Glen Herrmannsfeldt wrote:
(I previously wrote)

> As a simple test, I compiled the following with g95 0.91!

> subroutine shape(x,y,z)
> real x(:),y(:),z(:)
> z=x+y
> end

> subroutine size(x,y,z,n)
> real x(n),y(n),z(n)
> z=x+y
> end

To get a better idea how long they might be, I count
just the instructions inside the add loop.

When compiled as g95 -S matest.f for IA32, the loop for
shape has 33 lines of assembly code, size is 30.

When compiled as g95 -S -O2 matest.f for IA32, the loop for
shape is 9 lines of assembly code, size is 6.

For assumed shape, the loop looks like:

.L5:
flds (%eax)
incl %ebx
addl %edi, %eax
fadds (%edx)
fstps (%ecx)
addl -116(%ebp), %edx
addl -120(%ebp), %ecx
cmpl %esi, %ebx
jne .L5

And for explicit shape (the size routine):

.L13:
flds 4(%ecx,%edx,4)
fadds 4(%esi,%edx,4)
fstps 4(%ebx,%edx,4)
incl %edx
cmpl %edx, %eax
jne .L13

The difference seems to be that in the explicit shape case
the data must be contiguous, and, using the indexing
instructions off %edx the compiler can take advantage of that.

For assumed shape, it has to keep a separate stride for each,
and, because there aren't enough registers, keep two of them
in memory.

-- glen

James Van Buskirk

Nov 22, 2008, 12:33:44 AM

"deltaquattro" <deltaq...@gmail.com> wrote in message
news:99199932-b2d2-445f...@f3g2000yqf.googlegroups.com...

> So I've started to think about the issue, and in the end I can't see a lot
> of reasons to prefer assumed-shape. Why risk a performance hit when
> I just need to pass a few extra integers to the subroutine ? A
> different case altogether is when the caller itself cannot
> determine the dimensions of the array being passed. However, in this
> case you can't use assumed-shape either - you have to resort to
> allocatable array dummy arguments, and while at that, be grateful that
> they exist, since they're so very useful :-) But other than this, I
> can't see reasons against explicit-shape. Before changing my coding
> habit, however, I'd like to ask your opinion. What do you think?

One problem with assumed-shape arrays is that they can't be assumed
contiguous and SSE/SSE2 tries to carry out 2 double-precision or 4
single-precision operations with every instruction. If the data
aren't known to be contiguous, the processor has to issue instructions to
gather them together in registers and all the loads will dominate
your in-cache throughput. The very fastest code requires contiguous
data aligned to the size of the registers, which I think are supposed
to get twice as wide in the not too distant future.

The difference between fast code and the very fastest code may not
be significant and compilers typically aren't going to make the
very fastest code for you.

--
write(*,*) transfer((/17.392111325966148d0,6.5794487871554595D-85, &
6.0134700243160014d-154/),(/'x'/)); end

Glen Herrmannsfeldt

Nov 22, 2008, 3:22:36 AM

James Van Buskirk wrote:
(snip)

> One problem with assumed-shape arrays is that they can't be assumed
> contiguous and SSE/SSE2 tries to carry out 2 double-precision or 4
> single-precision operations with every instruction. If the data
> aren't known to be contiguous, the processor has to issue instructions to
> gather them together in registers and all the loads will dominate
> your in-cache throughput. The very fastest code requires contiguous
> data aligned to the size of the registers, which I think are supposed
> to get twice as wide in the not too distant future.

Are there compilers that will generate inline SSE2 or SSE3 code?
I can see doing that inside an (internally) called subroutine,
but it would surprise me to see in generated code.

------------------------------------------------------------

For the following program, explicit shape (the size routine) is about
10% faster than assumed shape. That is about what I would have guessed
without actually doing any timing.

! attempt to compare assumed shape and assumed size for speed
! this time with rank 2 arrays. Multiply without MATMUL

      real x(100,100),y(100,100),z(100,100),r
      integer i
      integer*8 t1,t2,t3,t4,rdtsc
      interface
         subroutine shape(x,y,z)
         real x(:,:),y(:,:),z(:,:)
         end
      end interface

      call random_number(x)
      call random_number(y)

      z=matmul(x,y)
      print *,'matmul',z(30,30),0

      do i=1,100
         t1=rdtsc(1)
         call shape(x,y,z)
         t2=rdtsc(1)
         print *,'shape',z(30,30),t2-t1
         t3=rdtsc(1)
         call size(x,y,z,100)
         t4=rdtsc(1)
         print *,'size',z(30,30),t4-t3
         t1=rdtsc(1)
         call shape(x,y,z)
         t2=rdtsc(1)
         print *,'shape',z(30,30),t2-t1
         print *,' '
      enddo
      end

      subroutine shape(x,y,z)
      real x(:,:),y(:,:),z(:,:)
      integer i,j,k,n
      n=min(ubound(x,1),ubound(x,2))
      n=min(ubound(y,1),ubound(y,2),n)
      n=min(ubound(z,1),ubound(z,2),n)
      do i=1,n
         do j=1,n
            z(j,i)=0
            do k=1,n
               z(j,i)=z(j,i)+x(j,k)*y(k,i)
            enddo
         enddo
      enddo
      end

      subroutine size(x,y,z,n)
      real x(n,n),y(n,n),z(n,n)
      integer i,j,k
      do i=1,n
         do j=1,n
            z(j,i)=0
            do k=1,n
               z(j,i)=z(j,i)+x(j,k)*y(k,i)
            enddo
         enddo
      enddo
      end

.file "rdtsc.s"
.text
.p2align 4,,15
.globl rdtsc_
.type rdtsc_, @function
rdtsc_:
rdtsc
ret
.size rdtsc_, .-rdtsc_

> The difference between fast code and the very fastest code may not
> be significant and compilers typically aren't going to make the
> very fastest code for you.

g95 doesn't seem to try hard at all without any -O options. Also,
the MATMUL intrinsic seems about as fast as the assumed-shape loop
above and about 10% slower than the explicit-shape (size) loop.
(That is, close to the assumed-shape time.)
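
The rdtsc routine above is tied to one CPU family and calling convention. A minimal portable sketch of the same comparison, assuming the standard SYSTEM_CLOCK intrinsic is precise enough when each kernel is repeated many times, could look like the following. The names mm_kernels, mm_shape, mm_explicit and nrep are made up for this illustration and are not from the original post, and the timings it prints will depend entirely on compiler and options.

```fortran
! Sketch only: portable re-run of the assumed-shape vs explicit-shape
! comparison. All names here are hypothetical.
module mm_kernels
  implicit none
contains
  subroutine mm_shape(x, y, z)            ! assumed-shape dummies
    real, intent(in)  :: x(:,:), y(:,:)
    real, intent(out) :: z(:,:)
    integer :: i, j, k, n
    n = min(size(x,1), size(x,2), size(y,1), size(y,2), &
            size(z,1), size(z,2))
    do i = 1, n
      do j = 1, n
        z(j,i) = 0
        do k = 1, n
          z(j,i) = z(j,i) + x(j,k)*y(k,i)
        end do
      end do
    end do
  end subroutine mm_shape

  subroutine mm_explicit(x, y, z, n)      ! explicit-shape dummies
    integer, intent(in) :: n
    real, intent(in)  :: x(n,n), y(n,n)
    real, intent(out) :: z(n,n)
    integer :: i, j, k
    do i = 1, n
      do j = 1, n
        z(j,i) = 0
        do k = 1, n
          z(j,i) = z(j,i) + x(j,k)*y(k,i)
        end do
      end do
    end do
  end subroutine mm_explicit
end module mm_kernels

program time_mm
  use mm_kernels
  implicit none
  integer, parameter :: i8 = selected_int_kind(18)
  integer, parameter :: n = 100, nrep = 50
  real :: x(n,n), y(n,n), z1(n,n), z2(n,n)
  integer(i8) :: c0, c1, c2, rate
  integer :: r

  call random_number(x)
  call random_number(y)

  call system_clock(c0, rate)
  do r = 1, nrep
    call mm_shape(x, y, z1)
  end do
  call system_clock(c1)
  do r = 1, nrep
    call mm_explicit(x, y, z2, n)
  end do
  call system_clock(c2)

  print *, 'assumed shape  (s):', real(c1 - c0)/real(rate)
  print *, 'explicit shape (s):', real(c2 - c1)/real(rate)
  print *, 'max difference    :', maxval(abs(z1 - z2))
end program time_mm
```

Repeating each kernel nrep times is a guess at what is needed to get above the clock resolution; an aggressive optimizer may still hoist or merge the repeated calls, so the numbers are only indicative.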

-- glen

Carlie J. Coats

Nov 22, 2008, 8:27:56 AM

deltaquattro wrote:
> Hello,
>
> I've always used extensively assumed-shape arrays in my subroutines,
> partly because of the suggestions of most F90/95 books, and partly
> because I like to use fancy new language features when they come
> out :-) Until now, however, I've never thought seriously about their
> pros and cons. Recently I've seen a lot of F95 codes which use them
> sparingly, allegedly because of performance penalties. See also here:
>
> http://www.baronams.com/staff/coats/epa2002_seminar.html#exmois

Actually the right section of that article is
http://www.baronams.com/staff/coats/epa2002_seminar.html#opt

and let's get the context and the quote right. The context is that of
"blue-book" high performance computing problems in meteorology and air
quality modeling with compilers available in 2002; we observed

...Explicit Fortran-90-style interfaces (from INTERFACEs,
MODULE USEs, and CONTAINs) may cause copy-in/copy-out argument
passing instead of pass-by-reference, with potentially substantial
performance penalties. Our experience is that using an internal
TRIDIAG solver CONTAINed within the VDIFF science module cost
a 50% penalty over the use of an F77-style (no explicit interface)
separately compiled TRIDIAG, for both Sun and SGI compilers.

This is already in a situation where all dimensioning is by PARAMETERs,
so the compiler has _lots_ of opportunity for optimizing address
calculation, and the particular TRIDIAG() solver was identified as
one of the computational bottlenecks in a time-critical forecast model.

All this was in a presentation to a somewhat domain specific audience,
and is a minor point compared with the main thrust of the talk, that
algorithms and data structures really matter, and their interaction with
computer hardware can kill performance. I still stand by:

...in bad cases, code structure can make a factor of 20-40
difference in a model's delivered computational performance.
Smaller effects (factors of 2-5) are quite common unless
care has been taken in the model design and coding...
...three things we wish to watch for:

* memory-system behavior;
* computational-unit behavior; and
* optimizer behavior.

Note that optimizer behavior is last in this list. Dependency-bound
coding in met models is a worse performance-killer on pipelined
microprocessors. And bad memory-system behavior can *really* destroy
you. Memory-system behavior, particularly, is an architecture/design
issue, not a low-level coding issue.

And in my experience, many of the large agencies and multi-agency
projects (in the environmental sciences field, anyway) do *not* take
the requisite architectural care...

-- Carlie Coats

Carlie J. Coats

Nov 22, 2008, 8:38:21 AM

In the original (2002-vintage) paper referenced, I was talking about
one of two time critical computational kernels in a model with a
run-time of about 50 CPU-hours; arguments to this tridiagonal solver
routine were PARAMETER-dimensioned 1-D and 2-D arrays, and with
then-available compilers, we saw factor of 1.5-2.5 performance
penalties for using it as an explicit-interface CONTAINed routine,
as compared to a separately-compiled Fortran-77-interface routine.

For comparison, manually in-lining the solver gave speedup factors
in the range of 1.02-1.03.

At least at that time, Glen Herrmannsfeldt's stride-assumption
analysis seems to have been correct, even in the best of possible
circumstances for the compiler.

fwiw -- Carlie Coats

Richard Maine

Nov 22, 2008, 12:15:17 PM

Carlie J. Coats <car...@jyarborough.com> wrote:

> ...Explicit Fortran-90-style interfaces (from INTERFACEs,
> MODULE USEs, and CONTAINs) may cause copy-in/copy-out argument
> passing instead of pass-by-reference,

I didn't go track down the article, but I find that *VERY* difficult to
believe.

Oh, I can well believe that you had something that caused
copy-in/copy-out and that this caused a huge performance penalty. I've
been there. What I find difficult to believe is that the cause of this
is accurately identified as being the explicit interface. I can't
imagine any circumstance in which an explicit interface should slow
things down. Mostly I'd expect it to have no effect, with a slight
possibility for speedup in some cases.

Was it the explicit interface per se, or was it the use of features that
require an explicit interface? I strongly suspect the latter.
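
One way to see the distinction between an explicit interface and the features that need one is to pass the same strided section to an assumed-shape dummy and to an explicit-shape dummy. The sketch below is only an illustration: pass_demo, by_shape and by_explicit are made-up names, and is_contiguous is a Fortran 2008 intrinsic that postdates this thread. Both routines have explicit interfaces here (they are module procedures), yet only the explicit-shape one obliges the compiler to present contiguous storage, which it typically does with a temporary copy.

```fortran
! Sketch only: both procedures have explicit interfaces; what differs
! is the kind of dummy argument. All names are hypothetical.
module pass_demo
  implicit none
contains
  subroutine by_shape(a)
    real, intent(in) :: a(:)        ! assumed shape: stride travels with it
    print *, 'assumed shape : size =', size(a), &
             ' contiguous =', is_contiguous(a)
  end subroutine by_shape

  subroutine by_explicit(a, n)
    integer, intent(in) :: n
    real, intent(in) :: a(n)        ! explicit shape: storage must be contiguous
    print *, 'explicit shape: size =', size(a), &
             ' contiguous =', is_contiguous(a)
  end subroutine by_explicit
end module pass_demo

program show_passing
  use pass_demo
  implicit none
  real :: v(10)
  integer :: i
  v = [(real(i), i = 1, 10)]
  call by_shape(v)                  ! whole array
  call by_shape(v(1:9:2))           ! strided section, no copy required
  call by_explicit(v(1:9:2), 5)     ! strided section, usually copied first
end program show_passing
```

With gfortran one would expect the second call to report a non-contiguous argument and the third a contiguous one, and the -Warray-temporaries option should flag the copy made for the third call. Other compilers may report this differently.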

I suppose I ought to go read the paper. Maybe after some coffee.

Richard Maine

Nov 22, 2008, 12:22:30 PM

Richard Maine <nos...@see.signature> wrote:

> Carlie J. Coats <car...@jyarborough.com> wrote:
>
> > ...Explicit Fortran-90-style interfaces (from INTERFACEs,
> > MODULE USEs, and CONTAINs) may cause copy-in/copy-out argument
> > passing instead of pass-by-reference,
>
> I didn't go track down the article, but I find that *VERY* difficult to
> believe....
>
> Was it the explicit interface per se, or was it the use of features that
> require an explicit interface? I strongly suspect the latter.
>
> I suppose I ought to go read the paper. Maybe after some coffee.

Looked at the paper. I can't find the word "explicit" other than in the
quoted para, which offers no supporting details. Lacking such supporting
details, I'm going to go with my earlier thesis that I believe you did
observe a problem, but I strongly suspect this description
mischaracterizes it.

Andrea

Nov 22, 2008, 5:02:44 PM

On 22 Nov, 14:27, "Carlie J. Coats" <car...@jyarborough.com> wrote:
> deltaquattro wrote:
[..]
>
> Actually the right section of that article is
> http://www.baronams.com/staff/coats/epa2002_seminar.html#opt
>
> and let's get the context and the quote right. The context is that of
> "blue-book" high performance computing problems in meteorology and air
> quality modeling with compilers available in 2002; we observed

Yeah, right: let's get the context *and* the quote right. Don't omit
to quote this:

dimensions and loop bounds typically cause speedups on the order of 5%
(not a lot, but every little bit helps:-) Array sections and
ALLOCATEable arrays cause performance penalties typically running from
10% to 100%, depending upon the vendor (Sun F90 before version 7 being
bad, HP being worse); Fortran POINTERs typically cause 50%-200%
penalties (basically reducing the optimizability of Fortran down to
that of C/C++).

otherwise other ng users could get the impression that I just
daydreamed about possible performance hits due to not using explicit
shape arrays.

Best Regards

deltaquattro

Richard Maine

Nov 22, 2008, 5:33:39 PM

Andrea <andrea.p...@gmail.com> wrote:

> Yeah, right: let's get the context *and* the quote right. Don't omit
> to quote this:
>
> dimensions and loop bounds typically cause speedups on the order of 5%
> (not a lot, but every little bit helps:-) Array sections and
> ALLOCATEable arrays cause performance penalties typically running from
> 10% to 100%, depending upon the vendor (Sun F90 before version 7 being
> bad, HP being worse); Fortran POINTERs typically cause 50%-200%
> penalties (basically reducing the optimizability of Fortran down to
> that of C/C++).
>
> otherwise other ng users could get the impression that I just
> daydreamed about possible performance hits due to not using explicit
> shape arrays.

Um. I might get the impression that you just daydreamed about the above
quotation containing the words "explicit shape" or any synonym or
antonym thereof.

I already commented about what I see as a likely mischaracterization in
the work cited. This seems to be an unrelated mischaracterization of the
report.

If you want to be serious about studying the cause of performance
issues, you need to start with being precise about what it is you are
even talking about. Some of the things in the quoted paragraph do have
at least some relationship to the question of using assumed shape versus
explicit shape, but none of them are precisely that.

Ron Shepard

Nov 22, 2008, 6:43:36 PM

In article <1iqt9g8.1fkhsj61bwokxcN%nos...@see.signature>,
nos...@see.signature (Richard Maine) wrote:

> Some of the things in the quoted paragraph do have
> at least some relationship to the question of using assumed shape versus
> explicit shape, but none of them are precisely that.

It would be reasonable for a compiler optimizer to produce two
machine-language versions of the inner-most do-loop(s) in situations
involving assumed shape arrays, one efficient version in which all
the strides are unity (meaning array addresses that increment by 4
or 8), and another general version in which one or more strides are
not unity. Do any current compilers do this?
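
Done by hand, that kind of loop versioning might be sketched as follows. This is an illustration under assumptions, not a description of what any compiler does: the names axpy_versions, axpy and axpy_unit are invented, and the run-time test uses the Fortran 2008 intrinsic is_contiguous, which was not available when this was posted.

```fortran
! Sketch only: pick a unit-stride kernel or a general loop at run time.
! All names are hypothetical.
module axpy_versions
  implicit none
contains
  ! y = y + a*x for assumed-shape arguments
  subroutine axpy(a, x, y)
    real, intent(in)    :: a, x(:)
    real, intent(inout) :: y(:)
    integer :: i, n
    n = min(size(x), size(y))
    if (is_contiguous(x) .and. is_contiguous(y)) then
      ! unit strides: hand the data to an explicit-shape kernel
      call axpy_unit(n, a, x, y)
    else
      ! general case: strides come from the descriptors
      do i = 1, n
        y(i) = y(i) + a*x(i)
      end do
    end if
  end subroutine axpy

  subroutine axpy_unit(n, a, x, y)
    integer, intent(in) :: n
    real, intent(in)    :: a, x(n)
    real, intent(inout) :: y(n)
    integer :: i
    do i = 1, n
      y(i) = y(i) + a*x(i)
    end do
  end subroutine axpy_unit
end module axpy_versions

program try_axpy
  use axpy_versions
  implicit none
  real :: x(8), y(8)
  x = 1.0
  y = 0.0
  call axpy(2.0, x, y)                  ! takes the unit-stride path
  call axpy(3.0, x(1:7:2), y(1:7:2))    ! takes the general path
  print *, y
end program try_axpy
```

After the two calls the odd-numbered elements of y hold 5.0 and the even-numbered ones 2.0, whichever path was taken; only the generated code for the inner loop differs.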

Another comment regarding matmul() and other intrinsic array
operations, some compilers give you the option to link in an
external blas library for these operations. In these cases, the
intrinsic will always be faster than the explicit do-loop based code
(with assumed-shape, explicit declarations, or assumed size arrays)
for everything but short, trivial cases.

$.02 -Ron Shepard

Carlie J. Coats

Nov 23, 2008, 2:26:54 AM

Richard Maine wrote:
> Carlie J. Coats <car...@jyarborough.com> wrote:
>
>> ...Explicit Fortran-90-style interfaces (from INTERFACEs,
>> MODULE USEs, and CONTAINs) may cause copy-in/copy-out argument
>> passing instead of pass-by-reference,
>
> I didn't go track down the article, but I find that *VERY* difficult to
> believe.
>
> Oh, I can well believe that you had something that caused
> copy-in/copy-out and that this caused a huge performance penalty. I've
> been there. What I find difficult to believe is that the cause of this
> is accurately identified as being the explicit interface. I can't
> imagine any circumstance in which an explicit interface should slow
> things down. Mostly I'd expect it to have no effect, with a slight
> possibility for speedup in some cases.
>
> Was it the explicit interface per se, or was it the use of features that
> require an explicit interface? I strongly suspect the latter.

No, it was decidedly the explicit interface that made the difference:
in both cases, there are PARAMETERs NDIFS, NLAYS from INCLUDE-files,
and arguments

SUBROUTINE TRIDIAG( LOWER, CENTR, UPPER, RHS, SOL )
REAL LOWER( NLAYS ) ! sub-diagonal
REAL CENTR( NLAYS ) ! diagonal
REAL UPPER( NLAYS ) ! super-diagonal
REAL RHS ( NDIFS, NLAYS ) ! right-hand side
REAL SOL ( NDIFS, NLAYS ) ! solution of the system

The subroutine is in fact identical in both cases, the only difference
being whether or not it is CONTAINed or in a separate file.

-- Carlie

Richard Maine

Nov 23, 2008, 3:59:17 AM

That's extremely odd. I've never seen anything vaguely like that, and I
can't imagine what would trigger it. I'd wonder what would happen if you
kept it in the separate file, but added an interface body to the calling
routine.
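
For concreteness, the experiment suggested here (an external routine with unchanged explicit-shape dummies, plus an interface body in the caller) might look like the sketch below. The routine scale_all and its PARAMETER values are invented for the illustration and stand in for TRIDIAG, whose real body is not reproduced; in practice the subroutine would sit in its own source file.

```fortran
! Sketch only: an F77-style external routine given an explicit
! interface through an interface body. All names are hypothetical.

! Normally compiled separately from the caller.
subroutine scale_all(a, f)
  implicit none
  integer, parameter :: nrow = 4, ncol = 3
  real, intent(inout) :: a(nrow, ncol)   ! PARAMETER-dimensioned dummy
  real, intent(in)    :: f
  a = f * a
end subroutine scale_all

program caller
  implicit none
  integer, parameter :: nrow = 4, ncol = 3
  real :: a(nrow, ncol)

  ! The interface body repeats the declarations of the dummies; the
  ! routine itself is neither CONTAINed nor changed.
  interface
    subroutine scale_all(a, f)
      implicit none
      integer, parameter :: nrow = 4, ncol = 3
      real, intent(inout) :: a(nrow, ncol)
      real, intent(in)    :: f
    end subroutine scale_all
  end interface

  a = 1.0
  call scale_all(a, 2.5)
  print *, 'all elements equal 2.5:', all(a == 2.5)
end program caller
```

Timing this against the same caller without the interface block would isolate the effect of the explicit interface from the effect of making the routine internal.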

Thomas Koenig

Nov 23, 2008, 4:19:54 AM

On 2008-11-23, Carlie J. Coats <car...@jyarborough.com> wrote:

> The subroutine is in fact identical in both cases, the only difference
> being whether or not it is CONTAINed or in a separate file.

Possibly, the compiler tried to inline the subroutine, and you got
pessimized code as a result (register pressure, ...)

In principle, this should not happen; in practice, it does.

Richard Maine

Nov 23, 2008, 4:37:26 AM

Thomas Koenig <tko...@netcologne.de> wrote:

That kind of thing is why I asked about the separate file and interface
body. That would seem to minimize the possible factors other than just
the explicit interface that might contribute to the problem.

Glen Herrmannsfeldt

Nov 23, 2008, 8:19:30 AM

Ron Shepard wrote:
(snip)

> Another comment regarding matmul() and other intrinsic array
> operations, some compilers give you the option to link in an
> external blas library for these operations. In these cases, the
> intrinsic will always be faster than the explicit do-loop based code
> (with assumed-shape, explicit declarations, or assumed size arrays)
> for everything but short, trivial cases.

In the matrix multiply I posted yesterday, I used a simple
three DO loop form, and seemed to be pretty close to the time
of MATMUL. Then I changed it to accumulate into a temporary
variable and store in the array outside the inner loop,
which was a little faster than MATMUL.
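
The change described is roughly the following, shown for the explicit-shape routine. This is a reconstruction from the description, not the code that was actually timed; mm_acc and the scalar s are names chosen for the sketch, and no speed claim is attached to it.

```fortran
! Sketch only: accumulate the dot product in a scalar and store it
! once, instead of updating z(j,i) inside the inner loop.
subroutine mm_acc(x, y, z, n)
  implicit none
  integer, intent(in) :: n
  real, intent(in)  :: x(n,n), y(n,n)
  real, intent(out) :: z(n,n)
  integer :: i, j, k
  real :: s
  do i = 1, n
    do j = 1, n
      s = 0
      do k = 1, n
        s = s + x(j,k)*y(k,i)
      end do
      z(j,i) = s                 ! single store per element
    end do
  end do
end subroutine mm_acc

program check_acc
  implicit none
  integer, parameter :: n = 100
  real :: x(n,n), y(n,n), z(n,n)
  call random_number(x)
  call random_number(y)
  call mm_acc(x, y, z, n)
  ! should be at rounding level for single precision
  print *, 'max abs difference from matmul:', maxval(abs(z - matmul(x, y)))
end program check_acc
```

The scalar makes it obvious to the compiler that the running sum cannot alias x or y, so it can stay in a register for the whole inner loop.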

-- glen

Ron Shepard

Nov 23, 2008, 10:37:57 AM

In article <2B7Wk.339$QX3...@nwrddc02.gnilink.net>,

I notice that you do not have IMPLICIT NONE in your subroutine. Is
it possible that there is an undeclared variable in the subroutine?
If so, that would be an undeclared local variable in the subroutine,
but it might be a host associated variable when it is a contained
internal subroutine due to an unfortunate use of the same variable
name in both. In this case, it is understandable that you might get
different results -- the variable in the host program would be
getting changed when you don't expect it.

Of course, this could well be due to a compiler bug, but there might
be other reasons for the observed behavior too. The above is one
possibility.
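
A small sketch of that pitfall, with invented names (host_assoc_pitfall, total, k): the internal function uses k as its loop index without declaring it, so it silently uses the host's k.

```fortran
! Sketch only: an undeclared name in an internal procedure is the
! host's variable, not a new local. All names are hypothetical.
program host_assoc_pitfall
  implicit none
  integer :: k
  real :: v(3), s

  k = 42
  v = [1.0, 2.0, 3.0]
  s = total(v)
  print *, 'sum          =', s
  print *, 'k after call =', k     ! no longer 42
contains
  function total(a) result(t)
    real, intent(in) :: a(:)
    real :: t
    ! k is not declared here, so this loop drives the host's k
    t = 0.0
    do k = 1, size(a)
      t = t + a(k)
    end do
  end function total
end program host_assoc_pitfall
```

Here the host's k ends up as 4 after the call. Had total been an external routine compiled without IMPLICIT NONE, k would instead have been an implicitly typed local and the host's value would have been left alone.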

$.02 -Ron Shepard

Carlie J. Coats

Nov 23, 2008, 1:32:24 PM

Richard Maine wrote:
> Thomas Koenig <tko...@netcologne.de> wrote:
>
>> On 2008-11-23, Carlie J. Coats <car...@jyarborough.com> wrote:
>>
>>> The subroutine is in fact identical in both cases, the only difference
>>> being whether or not it is CONTAINed or in a separate file.
>> Possibly, the compiler tried to inline the subroutine, and you got
>> pessimized code as a result (register pressure, ...)
>>
>> In principle, this should not happen; in practice, it does.
>
> That kind of thing is why I asked about the separate file and interface
> body. That would seem to minimize the possible factors other than just
> the explicit interface that might contribute to the problem.

The old code was separate-file/separate routine F77 (and, Ron, it
*was* IMPLICIT NONE--I abbreviated the post to what I thought was
the essentials); I had hoped that by simply CONTAINing it without
further changes to its body, I could maybe get the compilers to inline
the subroutine body and speed up the model as a result, at least at
high optimization levels. However, the CONTAINed version was
substantially slower even with full optimization for the 2001-2002
vintage editions of all of the following: Sun f90, SGI f90, and
IBM XLF.

An eventual hand in-line (which *was* straightforward--cut, paste,
and fixup array-names) did give a 2-3% speedup across all the
platforms, so I seriously doubt the CONTAINed version suffered
from a register-pressure issue. And this was on SPARC, R12000,
and POWER3, which aren't badly register-starved to begin with.

-- Carlie

Richard Maine

Nov 23, 2008, 3:11:58 PM

Carlie J. Coats <car...@jyarborough.com> wrote:

> Richard Maine wrote:
> > Thomas Koenig <tko...@netcologne.de> wrote:
> >
> >> On 2008-11-23, Carlie J. Coats <car...@jyarborough.com> wrote:
> >>
> >>> The subroutine is in fact identical in both cases, the only difference
> >>> being whether or not it is CONTAINed or in a separate file.
> >> Possibly, the compiler tried to inline the subroutine, and you got
> >> pessimized code as a result (register pressure, ...)
> >>
> >> In principle, this should not happen; in practice, it does.
> >
> > That kind of thing is why I asked about the separate file and interface
> > body. That would seem to minimize the possible factors other than just
> > the explicit interface that might contribute to the problem.
>
> The old code was separate-file/separate routine F77 (and, Ron, it
> *was* IMPLICIT NONE--I abbreviated the post to what I thought was
> the essentials); I had hoped that by simply CONTAINing it without
> further changes to its body, I could maybe get the compilers to inline
> the subroutine body and speed up the model as a result, at least at
> high optimization levels. However, the CONTAINed version was
> substantially slower even with full optimization for the 2001-2002
> vintage editions of all of the following: Sun f90, SGI f90, and
> IBM XLF.

That's a broad enough spectrum of compilers to make it unlikely to be a
compiler quirk. I guess I'd really want to see the actual code for clues
as to what was happening, because the described behavior just doesn't
make sense to me. I've seen plenty of cases where copy-in/copy-out gets
triggered and kills performance. That does tend to happen across a
fairly broad range of compilers, and I have a pretty good feeling for
the kinds of things that trigger it. But I've never seen a case where it
was triggered solely by having an explicit interface; not once have I
ever seen that. I'd like to see it to understand.

I find it easier to imagine that it would result from making the routine
internal, as there are plenty of non-obvious consequences of that. I
can't off-hand think of a real likely candidate, but at least it is
easier for me to imagine. (For one quick example - not that it seems
particularly likely, but it sort of matches the symptoms - I think an
internal routine is recursive if its host is; I could easily imagine
that triggering copy-in/copy-out).

And, of course, I'm assuming that you meant it literally when you said
the part about having no changes to the body. In particular, I'm
assuming that you did not change explicit-shape to assumed-shape dummy
arguments (which is, after all, the original subject of this thread).
You did say that you did it "without further changes to its body", but
words can be so much slippier than code. That's part of why I tend to be
so insistent on seeing actual code when helping with debugging (which
this is, in a way). I guess I could imagine someone envisioning a change
to assumed shape as being part of the change to having an explicit
interface and thus not count as a "further" change. I wouldn't describe
it that way, but I could imagine some people doing so, thus my
assumption seems worth verifying. That would be a whole different ball
game, pretty much making 90% of what I've said in this subthread
irrelevant.

James Van Buskirk

Nov 23, 2008, 3:31:36 PM

"Richard Maine" <nos...@see.signature> wrote in message
news:1iqu2la.121snvw1fhrvv0N%nos...@see.signature...

I have seen this effect. It was with DVF or CVF. If we look at
ISO/IEC 1539-1:1997(E) Note 12.12:

"The standard does not allow internal procedures to be used as
actual arguments, in part to simplify the problem of ensuring that
internal procedures with recursive hosts access entities from the
correct instance of the host. If, as an extension, a processor
allows internal procedures to be used as actual arguments, the
correct instance in this case is the instance in which the procedure
is supplied as an actual argument, even if the corresponding dummy
argument is eventually invoked from a different instance."

DVF, or at least CVF implemented this extension and as a consequence
it would poke a stub procedure onto the stack that loaded the frame
pointer for the host procedure and then jumped to the internal
procedure. The address of this stub procedure could then be passed
to other procedures which could then invoke it. This resulted in
the correct instance being loaded, but the stub was close to the
data on the stack, close enough that access to data on the stack
and invoking the stub could cause access to the same cache line as
both code and data. That could cause flushing the cache line and
slow things down.

DVF/CVF also had an extension where any procedure could be recursive.
I can't recall the particulars any more, but I do recall that
making module procedures into internal procedures definitely caused
code to slow down by my measurements in at least one case in DVF or
CVF. A safer way to attempt to force inlining might be to make the
candidate for inlining a private procedure in the module of the
procedure that invokes it. The compiler could then see that its
address never gets passed to any other procedures and then might
not have to compile stand-alone code for it.
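
As a sketch of that arrangement (names such as vdiff_sketch, step and halve are invented here and the bodies are placeholders), the helper is a private module procedure next to its only caller rather than an internal procedure of it. Whether a given compiler then inlines it is not guaranteed.

```fortran
! Sketch only: a private module procedure as an inlining candidate.
! All names are hypothetical.
module vdiff_sketch
  implicit none
  private
  public :: step
contains
  subroutine step(a)
    real, intent(inout) :: a(:)
    call halve(a, size(a))
  end subroutine step

  ! Not visible outside the module, so the compiler can see every call
  ! site and can tell that its address is never passed anywhere.
  subroutine halve(a, n)
    integer, intent(in) :: n
    real, intent(inout) :: a(n)
    a = 0.5 * a
  end subroutine halve
end module vdiff_sketch

program use_step
  use vdiff_sketch
  implicit none
  real :: v(4) = [2.0, 4.0, 6.0, 8.0]
  call step(v)
  print *, v
end program use_step
```

After the call v holds 1.0, 2.0, 3.0 and 4.0. Unlike an internal procedure, halve has no access to the local variables of step, so there is no host instance for the compiler to keep track of.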

JB

Nov 23, 2008, 3:34:45 PM

On 2008-11-21, Richard Maine <nos...@see.signature> wrote:
> Note also the many, many tales of people doing complicated things
> allegedly to improve performance, only to actually make the performance
> worse. I've been there and done that one. Another one of those
> programming aphorisms is that, when it comes to performance, the most
> important things are measurement, measurement, and measurement.

There's a guy called Steve Yegge who writes a frequently quite
insightful blog, mostly about programming, over at

http://steve-yegge.blogspot.com

A long time ago, he was employed by a company called Geoworks, that
produced a graphical OS and applications. They were really focused on
performance, to the extent of writing everything (yes, everything!) in
assembler. He claimed that eventually their competitors outperformed
them because while their code was highly optimized assembler, the
codebase had gotten so big and unwieldy that they missed the important
global optimizations. The entire article where this is mentioned is at

http://steve-yegge.blogspot.com/2008/05/dynamic-languages-strike-back.html

--
JB

Richard Maine

Nov 23, 2008, 7:29:28 PM

James Van Buskirk <not_...@comcast.net> wrote:

> "Richard Maine" <nos...@see.signature> wrote in message
> news:1iqu2la.121snvw1fhrvv0N%nos...@see.signature...
>
> > Carlie J. Coats <car...@jyarborough.com> wrote:
>
> >> The subroutine is in fact identical in both cases, the only difference
> >> being whether or not it is CONTAINed or in a separate file.
>
> > That's extremely odd. I've never seen anything vaguely like that, and I
> > can't imagine what would trigger it....

> I have seen this effect. It was with DVF or CVF....
...

> DVF/CVF also had an extension where any procedure could be recursive.
> I can't recall the particulars any more, but I do recall that
> making module procedures into internal procedures definitely caused
> code to slow down.

Some context got lost along the way there (and I'll admit I might have
been the one mostly responsible for dropping it), but this would
reinforce the point that I intended to make (which might not be quite
the one that I actually stated).

In particular, your examples are about slowdowns that are most
distinctly *NOT* caused by having an explicit interface. They are caused
by being internal procedures. There is a relationship between being an
internal procedure and having an explicit interface, but that
relationship is not an equivalence one.

James Van Buskirk

Nov 23, 2008, 8:17:35 PM

"Richard Maine" <nos...@see.signature> wrote in message

news:1iqv9fm.1fo9exk1vlpb0gN%nos...@see.signature...

> In particular, your examples are about slowdowns that are most
> distinctly *NOT* caused by having an explicit interface. They are caused
> by being internal procedures. There is a relationship between being an
> internal procedure and having an explicit interface, but that
> relationship is not an equivalence one.

Yes, I tried to emphasize that the difference was due to internal vs.
other kinds of procedures. Sorry if I didn't make that clear.

Glen Herrmannsfeldt

Nov 25, 2008, 2:19:16 PM

James Van Buskirk wrote:
(snip)

> I have seen this effect. It was with DVF or CVF. If we look at
> ISO/IEC 1539-1:1997(E) Note 12.12:

> "The standard does not allow internal procedures to be used as
> actual arguments, in part to simplify the problem of ensuring that
> internal procedures with recursive hosts access entities from the
> correct instance of the host. If, as an extension, a processor
> allows internal procedures to be used as actual arguments, the
> correct instance in this case is the instance in which the procedure
> is supplied as an actual argument, even if the corresponding dummy
> argument is eventually invoked from a different instance."

The C setjmp/longjmp also keeps the context except, according to
the one I am looking at, of automatic variables local to the
routine doing setjmp that are modified after setjmp and
before longjmp, and not declared volatile.

The IBM PL/I description is:

"The environment of a procedure invoked from within a recursive
procedure by means of an entry variable is the one that was
current when the entry constant was assigned to the variable."

In addition PL/I also has statement label variables that can be
used in GOTO statements, which also keep the appropriate
recursion nest level.

-- glen

nm...@cam.ac.uk

Nov 25, 2008, 2:54:35 PM

In article <gghj57$82k$1...@aioe.org>,
Glen Herrmannsfeldt <g...@ugcs.caltech.edu> wrote:

>James Van Buskirk wrote:
>
>> I have seen this effect. It was with DVF or CVF. If we look at
>> ISO/IEC 1539-1:1997(E) Note 12.12:
>
>> "The standard does not allow internal procedures to be used as
>> actual arguments, in part to simplify the problem of ensuring that
>> internal procedures with recursive hosts access entities from the
>> correct instance of the host. If, as an extension, a processor
>> allows internal procedures to be used as actual arguments, the
>> correct instance in this case is the instance in which the procedure
>> is supplied as an actual argument, even if the corresponding dummy
>> argument is eventually invoked from a different instance."

Of course, Algol 68 allowed that, and all forms of procedure call
were usually much faster than the much simpler calls in Fortran 66.
It's not hard to do, if you do it right - and there are half a
dozen good ways of doing it.

>The C setjmp/longjmp also keeps the context except, according to
>the one I am looking at, of automatic variables local to the
>routine doing setjmp that are modified after setjmp and
>before longjmp, and not declared volatile.

As I pointed out during the C89 standardisation process, that
restriction is completely unnecessary, and (as with the SIGFPE
shambles) merely enshrines one implementation's bugs into the
standard. My C run-time system had no such restriction.

>The IBM PL/I description is:
>
> "The environment of a procedure invoked from within a recursive
> procedure by means of an entry variable is the one that was
> current when the entry constant was assigned to the variable."

Quite. As in Algol 68.

>In addition PL/I also has statement label variables that can be
>used in GOTO statements, which also keep the appropriate
>recursion nest level.

Now, they WERE/ARE a perversion! Fortran has got rid of ASSIGN,
which was a fraction as revolting. Just because something isn't
hard to implement doesn't mean that it is a good idea ....

Regards,
Nick Maclaren.

Carlie J. Coats

Nov 27, 2008, 12:50:25 PM

Richard Maine wrote:
[snip...]

> I find it easier to imagine that it would result from making the routine
> internal, as there are plenty of non-obvious consequences of that. I
> can't off-hand think of a real likely candidate, but at least it is
> easier for me to imagine. (For one quick example - not that it seems
> particularly likely, but it sort of matches the symptoms - I think an
> internal routine is recursive if its host is; I could easily imagine
> that triggering copy-in/copy-out).
>
> And, of course, I'm assuming that you meant it literally when you said
> the part about having no changes to the body...

What I did was quite literally to add a CONTAINS statement,
and then paste in the subroutine between CONTAINS and END.
Except for making the two dimensioning PARAMETERs local instead
of INCLUDE-file resident, the subroutine is below. (And, btw.,
the _only_ thing in the INCLUDE files was PARAMETER definitions).

The whole thing was originally valid F77, by the way.

-- Carlie

------------------------------- cut here -----------------------------

      SUBROUTINE TRIDIAG( L, D, U, B, X )
C-----------------------------------------------------------------------
C  FUNCTION:
C    Solves tridiagonal system by Thomas algorithm.
C    The associated tri-diagonal system is stored in 3 arrays
C      D : diagonal
C      L : sub-diagonal
C      U : super-diagonal
C      B : right hand side function
C      X : return solution from tridiagonal solver
C
C    [ D(1) U(1) 0    0    0 ...       0    ]
C    [ L(2) D(2) U(2) 0    0 ...       .    ]
C    [ 0    L(3) D(3) U(3) 0 ...       .    ]
C    [      .    .    .    .           .    ] X(i) = B(i)
C    [           .    .    .    .      0    ]
C    [                .    .    .    .      ]
C    [ 0                     L(n)   D(n)    ]
C
C    where n = NLAYS
C-----------------------------------------------------------------------

      IMPLICIT NONE

      INTEGER     NLAYS
      PARAMETER ( NLAYS = 31 )
      INTEGER     N_SPC_DIFF
      PARAMETER ( N_SPC_DIFF = 25 )

C Arguments:

      REAL        L( NLAYS )              ! subdiagonal
      REAL        D( N_SPC_DIFF,NLAYS )   ! diagonal
      REAL        U( NLAYS )              ! superdiagonal
      REAL        B( N_SPC_DIFF,NLAYS )   ! R.H. side
      REAL        X( N_SPC_DIFF,NLAYS )   ! solution

C Local Variables:

      REAL        GAM( N_SPC_DIFF,NLAYS )
      REAL        BET( N_SPC_DIFF )
      INTEGER     V, K

C Decomposition and forward substitution:

      DO V = 1, N_SPC_DIFF
         BET( V ) = 1.0 / D( V,1 )
         X( V,1 ) = BET( V ) * B( V,1 )
      END DO

      DO K = 2, NLAYS
         DO V = 1, N_SPC_DIFF
            GAM( V,K ) = BET( V ) * U( K-1 )
            BET( V ) = 1.0 / ( D( V,K ) - L( K ) * GAM( V,K ) )
            X( V,K ) = BET( V ) * ( B( V,K ) - L( K ) * X( V,K-1 ) )
         END DO
      END DO

C Back-substitution:

      DO K = NLAYS - 1, 1, -1
         DO V = 1, N_SPC_DIFF
            X( V,K ) = X( V,K ) - GAM( V,K+1 ) * X( V,K+1 )
         END DO
      END DO

      RETURN
      END
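
For anyone who wants to experiment with the recurrence, here is a hedged free-form sketch of the same algorithm with a residual check. It is a paraphrase, not the model code: the names thomas_sketch, thomas and check_thomas are invented, the two extents are passed as arguments instead of being PARAMETERs, the inner V loops are written in array syntax, and the small diagonally dominant test system is made up for the check.

```fortran
! Sketch only: Thomas algorithm for nspc independent systems that
! share one sub- and super-diagonal. All names are hypothetical.
module thomas_sketch
  implicit none
contains
  subroutine thomas(nspc, nlays, l, d, u, b, x)
    integer, intent(in) :: nspc, nlays
    real, intent(in)  :: l(nlays), u(nlays)
    real, intent(in)  :: d(nspc,nlays), b(nspc,nlays)
    real, intent(out) :: x(nspc,nlays)
    real :: gam(nspc,nlays), bet(nspc)
    integer :: k

    ! decomposition and forward substitution
    bet = 1.0 / d(:,1)
    x(:,1) = bet * b(:,1)
    do k = 2, nlays
      gam(:,k) = bet * u(k-1)
      bet = 1.0 / (d(:,k) - l(k)*gam(:,k))
      x(:,k) = bet * (b(:,k) - l(k)*x(:,k-1))
    end do

    ! back-substitution
    do k = nlays - 1, 1, -1
      x(:,k) = x(:,k) - gam(:,k+1)*x(:,k+1)
    end do
  end subroutine thomas
end module thomas_sketch

program check_thomas
  use thomas_sketch
  implicit none
  integer, parameter :: nlays = 5, nspc = 2
  real :: l(nlays), u(nlays), d(nspc,nlays)
  real :: b(nspc,nlays), x(nspc,nlays), xt(nspc,nlays)
  integer :: v, k

  ! made-up, diagonally dominant test system
  l = -1.0
  u = -1.0
  d(1,:) = 4.0
  d(2,:) = 5.0
  call random_number(xt)          ! the solution we want to recover

  ! right-hand side: row k is L(k)*x(k-1) + D(k)*x(k) + U(k)*x(k+1)
  do k = 1, nlays
    do v = 1, nspc
      b(v,k) = d(v,k)*xt(v,k)
      if (k > 1)     b(v,k) = b(v,k) + l(k)*xt(v,k-1)
      if (k < nlays) b(v,k) = b(v,k) + u(k)*xt(v,k+1)
    end do
  end do

  call thomas(nspc, nlays, l, d, u, b, x)
  print *, 'max abs error:', maxval(abs(x - xt))
end program check_thomas
```

The printed error should be at single-precision rounding level. Note that L(1) and U(n) are never referenced, which matches the matrix layout in the comment block of the original routine.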