Practice of code generation for elemental procedures

Bastiaan Braams

unread,

Dec 14, 2021, 6:48:48 AM12/14/21

to

I am wondering how compiler authors deal with possible code bloat in connection with elemental procedures. If such a procedure has k intent(in) arguments then any subset of those might be arrays in an invocation of the procedure (all arrays conformable), which looks like 2^k possibilities right there.
Here is the reason for my interest. Think of a procedure for some special function of mathematical physics taking arguments (n, x); an integer and a real number. The integer controls some recurrence and for fixed argument (n) the operations involving (x) are all identical. If the procedure is going to be called with scalar argument (n) and array argument (x) then there is a lot of efficiency to be gained by putting the operations involving x in an inner loop. On the other hand, if the argument (n) should be an array then the loop might as well be put on the outside. Any chance that the compiler will produce such properly optimized code?

Thomas Koenig

unread,

Dec 14, 2021, 12:23:18 PM12/14/21

to

Bastiaan Braams <bjbr...@gmail.com> schrieb:

> I am wondering how compiler authors deal with possible code
> bloat in connection with elemental procedures.

By implementing the procedure as scalar and looping.

An example: you can look over gfortran's shoulders, so to speak,
by specifying -fdump-tree-original.

An example:

module x
implicit none
contains
elemental function ele (a, b, c) result(res)
real, intent(in) :: a, b, c
real :: res
res = a + 2*b + 4*c
end function ele
end module x

program main
use x
implicit none
real :: a(3), b(3), c
call random_number(a)
call random_number(b)
call random_number(c)
print *,ele(a,b,c)
end program main

results in (part of the dump):

_gfortran_st_write (&dt_parm.2);
{
real(kind=4) D.3909;

D.3909 = c;
{
integer(kind=8) S.3;

S.3 = 1;
while (1)
{
if (S.3 > 3) goto L.1;
{
real(kind=4) D.3911;

D.3911 = ele (&a[S.3 + -1], &b[S.3 + -1], &D.3909);
_gfortran_transfer_real_write (&dt_parm.2, &D.3911, 4);
}
S.3 = S.3 + 1;
}
L.1:;
}
}
_gfortran_st_write_done (&dt_parm.2);

gah4

unread,

Dec 15, 2021, 12:17:55 AM12/15/21

to

On Tuesday, December 14, 2021 at 9:23:18 AM UTC-8, Thomas Koenig wrote:
> Bastiaan Braams <bjbr...@gmail.com> schrieb:
> > I am wondering how compiler authors deal with possible code
> > bloat in connection with elemental procedures.

> By implementing the procedure as scalar and looping.

In some cases, it would be an optimization to pass an array and
have it processed inside the routine.

I do remember about 1973, when the IBM OS/360 compilers
changed to doing implied-DO loops from outside to inside
the I/O routines. The new library was backwards compatible,
but old libraries were not forward compatible.

In the case of compilers doing global optimization,
I suppose it is possible to do it in the routine. Only code
for the actual used combination would be generated.

Bastiaan Braams

unread,

Dec 15, 2021, 2:21:29 AM12/15/21

to

Thank you Thomas and Glen, this is very informative. So I think that the following code structure is the proper way to deal with the situation mentioned in my original post. Function Foo(n,x) can be written as an elemental function, but this will not lead to efficient code for the case that argument (n) is scalar and argument (x) is vector. (Let's say that this case is of special interest.) So then we still write Foo as an elemental function, but we also supply a specific function for the special case and compose the two versions into a generic interface.

module M
implicit none ; private
interface Foo
module procedure Foo_Scalar
module procedure Foo_Vector
end interface Foo
contains
elemental function Foo_Scalar (n, x) result (r)
integer, intent (in) :: n
real, intent (in) :: x
real :: r
! Return Foo[n](x). Elemental function.
...
end function Foo_Scalar
pure function Foo_Vector (n, x) result (r)
integer, intent (in) :: n
real, intent (in) :: x(0:)
real :: r(0:size(x)-1)
! Return Foo[n](x) for scalar argument n and rank-1 array argument x.
...
end function Foo_Vector
end module M

Additional versions can be written and included in the generic interface for higher-rank array arguments x.

gah4

unread,

Dec 15, 2021, 3:11:50 AM12/15/21

to

On Tuesday, December 14, 2021 at 11:21:29 PM UTC-8, bjbr...@gmail.com wrote:

(snip)

> Thank you Thomas and Glen, this is very informative.
> So I think that the following code structure is the proper way to deal
> with the situation mentioned in my original post. Function Foo(n,x)
> can be written as an elemental function, but this will not lead to
> efficient code for the case that argument (n) is scalar and argument (x)
> is vector. (Let's say that this case is of special interest.)
> So then we still write Foo as an elemental function, but we also supply a
> specific function for the special case and compose the two versions
> into a generic interface.

I am sometimes surprised how much work, over the years, has been done
to make function calls faster, even as computers got faster, so there was
less need for the speed.

Assuming your function does a reasonable amount of computation,
the function call overhead should be small enough for most of us.

The rules for GENERIC are funny, and I was going to guess that what
you asked wouldn't work. That is, it would be an ambiguity, and so
not compile.

But I put the part you wrote, removing the ... lines, into gfortran,
and it seems to take it.

So, yes, that is how you do it.

Thomas Koenig

unread,

Dec 15, 2021, 4:24:14 PM12/15/21

to

gah4 <ga...@u.washington.edu> schrieb:

> On Tuesday, December 14, 2021 at 11:21:29 PM UTC-8, bjbr...@gmail.com wrote:
>
> (snip)
>
>> Thank you Thomas and Glen, this is very informative.
>> So I think that the following code structure is the proper way to deal
>> with the situation mentioned in my original post. Function Foo(n,x)
>> can be written as an elemental function, but this will not lead to
>> efficient code for the case that argument (n) is scalar and argument (x)
>> is vector. (Let's say that this case is of special interest.)
>> So then we still write Foo as an elemental function, but we also supply a
>> specific function for the special case and compose the two versions
>> into a generic interface.
>
> I am sometimes surprised how much work, over the years, has been done
> to make function calls faster, even as computers got faster, so there was
> less need for the speed.

In languages like C++, almost everything is a function call
to some virtual function. Reducing that overhead can go a
long way towards making things faster.

Also, if anything, bloat has been increasing faster than the
increase in processor speed in the last few years...

Dick

unread,

Dec 15, 2021, 7:12:28 PM12/15/21

to

On 12/15/21 4:24 PM, Thomas Koenig wrote:
>
> In languages like C++, almost everything is a function call
> to some virtual function.

Not necessarily, it depends on your programming style.

Message has been deleted

Bastiaan Braams

unread,

Dec 16, 2021, 2:06:07 AM12/16/21

to

Just to be sure... I was not concerned about the overhead of the function call. Once again the structure of that function Foo, here written as a scalar elemental function. This is the three-term recurrence for Laguerre polynomial r = L[n](x).

if (n==0) then
r = 1
else if (n==1) then
r = 1-x
else
t1 = 1
t2 = 1-x
do k = 2, n
t0 = t1 ; t1 = t2
t2 = ((2*k-1-x)*t1-(k-1)*t0)/k
end do
r = t2
end if

Suppose now that this function is called with scalar argument (n) and array argument (x). Then I would think that we really want a loop within each branch of the if-then-else construct and maybe within the (do k = 2, n) as well; we don't want one single loop over the entire code, with the test on (n) inside the loop. And if Foo is declared elemental and there is not a special version for the case of scalar (n) and array (x), then I expect the compiler to do just that, place the loop outside the if-then-else.

However, to add to my confusion I just did a timing test. One instance where Foo is just an elemental function, no further overloading, and another version where the elemental function is overloaded with a function for arguments scalar (n) and array (x). Makes no difference in the timing when using gfortran; I tried it with -O0 and with -O3.

gah4

unread,

Dec 16, 2021, 2:19:13 AM12/16/21

to

C++ or not, modern programming style more often uses smaller
functions and more function calls.

Fortran has a long history of especially large subroutines and
functions. I used to wonder about this related to IBM S/360
and S/370 code, where instructions have a 12 bit offset, and
one normally has a base register to index from. That works well
for program units up to 4K bytes. For slightly bigger programs,
compiler will generate two or three base registers. Past that,
the might do indirect addressing for each, loading an address
from memory each time.

It is not so obvious what other systems do with large routines,
but in general optimizing compilers do better with ones that aren't
so big.

gah4

unread,

Dec 16, 2021, 2:46:43 AM12/16/21

to

On Wednesday, December 15, 2021 at 11:06:07 PM UTC-8, bjbr...@gmail.com wrote:

(snip, I wrote)

> > > I am sometimes surprised how much work, over the years, has been done
> > > to make function calls faster, even as computers got faster, so there was
> > > less need for the speed.

(snip)

> Just to be sure... I was not concerned about the overhead of the function call.
> Once again the structure of that function Foo, here written as a scalar
> elemental function. This is the three-term recurrence for
> Laguerre polynomial r = L[n](x).

Oh, that one. Many years ago (Fortran 66 days) I was working on a program
written by someone else that way too slow. It computed a bunch of
integrals with Gaussian quadrature, the first time I knew about that.

Now, Gaussian quadrature is pretty fast, but it was doing a lot of them.
As well as remember now, there was a subroutine that would compute
two of them, then divide, with the second one being a normalization.

In any case, for reasons similar to what you say, it was recomputing
the same thing many times. I think to fix it, I separately computed the
integrals, moved some things out of a loop, and so eliminated much
duplication.

> Suppose now that this function is called with scalar argument (n) and
> array argument (x). Then I would think that we really want a loop within
> each branch of the if-then-else construct and maybe within the (do k = 2, n)
> as well; we don't want one single loop over the entire code, with the test
> on (n) inside the loop. And if Foo is declared elemental and there is not
> a special version for the case of scalar (n) and array (x), then I expect the
> compiler to do just that, place the loop outside the if-then-else.

> However, to add to my confusion I just did a timing test.
> One instance where Foo is just an elemental function, no further
> overloading, and another version where the elemental function is
> overloaded with a function for arguments scalar (n) and array (x).
> Makes no difference in the timing when using gfortran;
> I tried it with -O0 and with -O3.

I think that sounds right. The statements outside the inner loop
are fairly fast, and no so many, so doing them mutliple times
make a small difference. Branch prediction will mean almost
no delay to doing the if, that you might have expected.

If there was a large part of the calculation depending on n,
but not on x, then the difference would be larger.

I think your idea is good, but doesn't apply to this example.

James Van Buskirk

unread,

Dec 18, 2021, 11:09:19 PM12/18/21

to

> "gah4" wrote in message
> news:0ab6410a-af0e-4f5d...@googlegroups.com...

> On Wednesday, December 15, 2021 at 11:06:07 PM UTC-8, bjbr...@gmail.com
> wrote:

> (snip, I wrote)

> Just to be sure... I was not concerned about the overhead of the function
> call.
> > Once again the structure of that function Foo, here written as a scalar
> > elemental function. This is the three-term recurrence for
> > Laguerre polynomial r = L[n](x).

> Oh, that one. Many years ago (Fortran 66 days) I was working on a program
> written by someone else that way too slow. It computed a bunch of
> integrals with Gaussian quadrature, the first time I knew about that.

> Now, Gaussian quadrature is pretty fast, but it was doing a lot of them.
> As well as remember now, there was a subroutine that would compute
> two of them, then divide, with the second one being a normalization.

> In any case, for reasons similar to what you say, it was recomputing
> the same thing many times. I think to fix it, I separately computed the
> integrals, moved some things out of a loop, and so eliminated much
> duplication.

Here you know that the O.P. is not doing Gauss-Laguerre quadrature because
his function returns only L[n](x) and not also L[n-1](x). I'm not sure what
application needs so many values of Laguerre polynomials: one I can think
of is hydrogenlike atoms, but they need generalized Laguerre polynomials
and there is no extra parameter in his code either.

> > Suppose now that this function is called with scalar argument (n) and
> > array argument (x). Then I would think that we really want a loop within
> > each branch of the if-then-else construct and maybe within the (do k =
> > 2, n)
> > as well; we don't want one single loop over the entire code, with the
> > test
> > on (n) inside the loop. And if Foo is declared elemental and there is
> > not
> > a special version for the case of scalar (n) and array (x), then I
> > expect the
> > compiler to do just that, place the loop outside the if-then-else.

> > However, to add to my confusion I just did a timing test.
> > One instance where Foo is just an elemental function, no further
> > overloading, and another version where the elemental function is
> > overloaded with a function for arguments scalar (n) and array (x).
> > Makes no difference in the timing when using gfortran;
> > I tried it with -O0 and with -O3.

> I think that sounds right. The statements outside the inner loop
> are fairly fast, and no so many, so doing them mutliple times
> make a small difference. Branch prediction will mean almost
> no delay to doing the if, that you might have expected.

> If there was a large part of the calculation depending on n,
> but not on x, then the difference would be larger.

> I think your idea is good, but doesn't apply to this example.

Well, I looked at some of my code:

subroutine laguerre(n,alpha,x,y,y1,nsturm)
integer, intent(in) :: n
real(wp), intent(in) :: alpha
real(wp), intent(in) :: x
real(wp), intent(out) :: y
real(wp), intent(out) :: y1
integer, optional :: nsturm
integer k
real(wp) y2

y = 1
y1 = 0
if(present(nsturm)) nsturm = 0
do k = 0, n-1
y2 = y1
y1 = y
y = ((2*k+alpha+1-x)*y1-k*y2)/(k+alpha+1)
if(present(nsturm)) nsturm = merge(nsturm+1,nsturm, y/=0 .AND. y*y1 <=
0)
end do
end subroutine laguerre

On comparison I see that there are early-out tests in the O.P.'s code which
are unnecessary. Of course for his usage there is no need for the Sturm
sequence value nor the parameter alpha which makes it generalized rather
than simple Laguerre. Also he doesn't have to return L[n-1](x) because, as
pointed out earlier, he can't be doing Gaussian quadrature.

But there is a big potential advantage in the array call vs. the elemental
call: an x86 processor can perform 4 wide multiply-accumulates on double
precision data via two pipelines. Have you checked whether gfortran produces
4-wide AVX code for the array version? Those divisions take a long time. It
might be useful to create an array of reciprocals on function entry and then
reuse it in the main loop.

But in any case, don't try to guess what the compiler is doing here but look
at the assembly code it generates.

James Van Buskirk

unread,

Dec 18, 2021, 11:56:58 PM12/18/21

to

> "gah4" wrote in message
> news:0ab6410a-af0e-4f5d...@googlegroups.com...

[snip]

It seems that my reply has not appeared, so here it is again, perhaps as a
duplicate:

> "gah4" wrote in message
> news:0ab6410a-af0e-4f5d...@googlegroups.com...

> On Wednesday, December 15, 2021 at 11:06:07 PM UTC-8, bjbr...@gmail.com
> wrote:

> (snip, I wrote)

> Just to be sure... I was not concerned about the overhead of the function
> call.
> > Once again the structure of that function Foo, here written as a scalar
> > elemental function. This is the three-term recurrence for
> > Laguerre polynomial r = L[n](x).

> Oh, that one. Many years ago (Fortran 66 days) I was working on a program
> written by someone else that way too slow. It computed a bunch of
> integrals with Gaussian quadrature, the first time I knew about that.

> Now, Gaussian quadrature is pretty fast, but it was doing a lot of them.
> As well as remember now, there was a subroutine that would compute
> two of them, then divide, with the second one being a normalization.

> In any case, for reasons similar to what you say, it was recomputing
> the same thing many times. I think to fix it, I separately computed the
> integrals, moved some things out of a loop, and so eliminated much
> duplication.

Here you know that the O.P. is not doing Gauss-Laguerre quadrature because
his function returns only L[n](x) and not also L[n-1](x). I'm not sure what
application needs so many values of Laguerre polynomials: one I can think
of is hydrogenlike atoms, but they need generalized Laguerre polynomials
and there is no extra parameter in his code either.

> > Suppose now that this function is called with scalar argument (n) and
> > array argument (x). Then I would think that we really want a loop within
> > each branch of the if-then-else construct and maybe within the (do k =
> > 2, n)
> > as well; we don't want one single loop over the entire code, with the
> > test
> > on (n) inside the loop. And if Foo is declared elemental and there is
> > not
> > a special version for the case of scalar (n) and array (x), then I
> > expect the
> > compiler to do just that, place the loop outside the if-then-else.

> > However, to add to my confusion I just did a timing test.
> > One instance where Foo is just an elemental function, no further
> > overloading, and another version where the elemental function is
> > overloaded with a function for arguments scalar (n) and array (x).
> > Makes no difference in the timing when using gfortran;
> > I tried it with -O0 and with -O3.

> I think that sounds right. The statements outside the inner loop
> are fairly fast, and no so many, so doing them mutliple times
> make a small difference. Branch prediction will mean almost
> no delay to doing the if, that you might have expected.

> If there was a large part of the calculation depending on n,
> but not on x, then the difference would be larger.

> I think your idea is good, but doesn't apply to this example.