
gcc -O, missing feature?


dat%uk...@nsfnet-relay.ac.uk

Mar 31, 1990, 5:45:24 AM
To whom it may concern:

This is not exactly a bug report - it concerns an easy and quite
important optimisation which I would have expected gcc -O to perform,
namely tail-recursion elimination. That is, if the last action of a
function, before returning, is to call another function, the inner
function invocation should be treated as an unconditional jump, not a
call (i.e. so that a single return will exit both functions).

Here is a trivial example. The following function can run in constant
space (using only one stack frame):

f(n)
int n;
{ mumble...;
f(n+1);
}

This optimisation becomes very important when tail recursive calls are
deeply nested, as in the "continuation passing" style of programming. I
am currently writing a compiler for a functional language, and was
hoping to use C, rather than assembler for the target code (for obvious
reasons of portability). My generated code uses a continuation-passing
style (each function ends by calling another). The viability of my
project therefore depends on having a C compiler that handles tail
recursion intelligently.

The -O flag of gcc does not appear, currently, to perform this
optimisation. Is there some other flag that will turn it on? If not,
do you have any plans to add this optimisation in the future? I imagine
it would be very easy to put this in.

Many thanks for any information you can give me.

Yours sincerely

David Turner

--------------------------------------------------------------------
Prof. D. A. Turner phone: +44 227 764000 ext 7695
Computing Laboratory fax: +44 227 762811
University of Kent email: d...@ukc.ac.uk
Canterbury CT2 7NF internet: dat%u...@nsfnet-relay.ac.uk
England
--------------------------------------------------------------------

Michael Tiemann

Mar 31, 1990, 9:23:59 PM

> To whom it may concern:
>
> This is not exactly a bug report - it concerns an easy and quite
> important optimisation which I would have expected gcc -O to perform.
> Namely tail recursion elimination. That is, if the last action of a
> function, before returning, is to call another function, the inner
> function invocation should be treated as an unconditional jump, not a
> call (ie so that a single return will exit both functions).
>
> Here is a trivial example. The following function can run in constant
> space (use only one stack-frame):-

Try this:

f(n)
int n;
{ mumble...;
return f(n+1); /* note `return'. */
}

Michael

David Keppel

Apr 1, 1990, 1:20:24 AM
In article <900401022...@teacake.sun.com> you write:
>[Gcc doesn't do tail recursion elimination.]

You should tell us which version and target of `gcc' you are running.
The VAX target has been doing tail recursion since at least 1.34, and I
think before that. As far as I know, tail recursion is done in the
FRONT end, so all targets SHOULD be doing tail-recursion elimination.

Here's an example:

int
f (int n)
{
if (!zork (n))
return (0);
return (f (n+1));
}

/uns/bin/gcc -v -O -S /tmp/zork.c
gcc version 1.36.92
/uns/usr/local/lib/gcc-cpp -v -undef -D__GNUC__ -Dvax -Dunix -D__vax__ -D__unix__ -D__OPTIMIZE__ /tmp/zork.c /usr/tmp/cca16136.cpp
GNU CPP version 1.36.92
/uns/usr/local/lib/gcc-cc1 /usr/tmp/cca16136.cpp -quiet -dumpbase /tmp/zork.c -O -version -o zork.s
GNU C version 1.36.92 (vax) compiled by GNU C version 1.36.92.
default target switches: -munix

#NO_APP
gcc_compiled.:
.text
.align 1
.globl _f
_f:
.word 0x40
movl 4(ap),r6
L2:
pushl r6
calls $1,_zork
tstl r0 # if `zork' returns zero...
jneq L1
clrl r0 # ... then set to zero (huh!?)
ret # and return
L1:
incl r6 # ... else tail recurse.
jbr L2

I remember being particularly impressed after I spent an hour trying to
write a highly-optimized version of one of the standard benchmarks
(the Ackerman function, as I remember). I had finally realized that I
could code it as half-recursive and half-tail-recursive. After coding
it up in assembly, I ran the C source through `gcc', and (except that I
used jsb/rsb instead of calls/ret) the code produced by `gcc' was
IDENTICAL to my hour-long exercise in assembly language.

Ok, so I can write assembly faster than that, now :-)

(I think:) Because tail-recursion is done ``early'' (by the front end
of `gcc'), the optimizer sometimes cannot take advantage of information
that is often discovered later. For example, I would expect the same
result code (as above) from this semantically equivalent (I think :-)
program:

int
f (int n)
{
int m;

if (!zork (n))
return (0);
m = f (n+1);
return (m);
}

However, the call assigned to `m' is not itself in tail position, so it
isn't recognized by the front end:

#NO_APP
gcc_compiled.:
.text
.align 1
.globl _f
_f:
.word 0x40
movl 4(ap),r6
pushl r6
calls $1,_zork
tstl r0 # if `zork' returns zero...
jneq L1
clrl r0 # ... then set it to zero (again!)
ret # and return
L1:
addl3 r6,$1,-(sp)
calls $1,_f # ... else recurse the `hard way'.
ret

In general you want to look at a recursive call and see if it can be
moved past any remaining computations so that it appears as the last
statement (along that particular branch of execution). That can catch
things like:

int
f (int n)
{
int m;

if (!zork (n))
return (n); /* Changed: was `0', now `n'. */
m = f (n+1);
m += 4;
return (m);
}

in which the last fragment can be rewritten as:

return (4 + f (n+1));

or:

_f:
.word ^M<r6,r7>
movl 4(ap),r6
clrl r7
L0:
pushl r6
calls $1,_zork
tstl r0 # if `zork' returns zero...
jneq L1
addl3 r7,r6,r0 # ... figure out the total
ret # and return
L1:
addl2 $4,r7 # ... else increment induction variable
incl r6 # and parameter
jmp L0 # and tail-recurse.


Clearly, it is possible to do better than `gcc' is doing right now. I
speculate that doing so requires moving tail recursion into the back
end. Moreover, planning an order in which to apply optimizations is an
NP-complete (a.k.a. ``hard'') problem.

;-D on ( Feature present and accounted for... ) Pardo
--
pa...@cs.washington.edu
{rutgers,cornell,ucsd,ubc-cs,tektronix}!uw-beaver!june!pardo

dat%uk...@nsfnet-relay.ac.uk

Apr 1, 1990, 10:32:53 AM
Michael,

Many thanks for your reply. I tried putting in the explicit "return"
and got the optimisation. I notice however that the optimisation is
inhibited if the function takes the address of a local variable
anywhere. This is not too serious, I can restrain myself in that
respect. However, more serious is that the optimisation seems to work
only for immediate tail recursion, and not for tail calls in general.
For example, this should also run in constant stack space:

f(n)
int n;
{ mumble...;
return g(n+1);
}

g(n)
int n;
{ mumble1...;
return f(n);
}

and doesn't. For my intended application I need the optimisation to
work for mutual tail-recursion (and in general, the tail call will be
to a computed function pointer, rather than to a named function). Is
there any way I can get gcc -O to do this right?

Many thanks for any further information.

David Turner

Professor Simon Peyton-Jones

Apr 2, 1990, 11:26:36 AM

Further to the conversation about tail recursion...

Coincidentally I had a brief discussion with Richard Stallman about
this very topic (and for the same reason, to boot) a few weeks ago. He
pointed out that because the CALLER deallocates the parameters, you
can't in general optimise tail recursion in C. For example:

f(n) { ...mumble...; return( g(n, n+1) ); }

Since f's caller will deallocate the space for its one parameter,
f cannot tail-call g, which uses space for two parameters. I think
the constraint is this:

a tail call can (and should) be compiled as a jump provided the
tail-called procedure's parameters can fit within the caller's
parameter block on the stack.

As David points out, this should all work for mutual recursion. It
should also work for *conditionals*:

f(n) { ...mumble...; if (...stuff...)
{ return( g(n+1) ); }
else { return( h(n+1) ); }
}

and *when the called procedure is held in a variable*, thus:

f(n) { ...mumble...; x = foo(); return( (*x)(n+1) ); }

Both of these are important when compiling functional languages to C.


Finally, here's a *portable* trick to fake tail-recursion, in the
special case where all the functions involved take no arguments. This
is actually the case I am interested in (and I suspect David is also),
because the parameters are passed on a separate stack. This trick
allows a functional language to be compiled to any C (not just one
which optimises tail recursion), and I am using it for just this
purpose.

First, define PFUNC to be the type of
pointers-to-functions-returning-integers.

Every function which you want to be part of the tail-call world
takes no arguments, and returns an int. Here's the trick: the int
it returns is the address of the function to be called next. So
instead of writing

f() { ...mumble...; return( g() ); }
you write
f () { ...mumble...; return( g ); }

The ...mumble... can of course include arbitrary calls to ordinary
non-tail-recursive C.

Finally, you write a trivial "interpreter" to do the calling. It
calls the first function, which returns a pointer to the next one to
call, which returns a pointer... Here's the code fragment for the
interpreter

PFUNC C = initial_function;
while (TRUE) { C = (PFUNC) (*C)(); }

This is one of the weirdest pieces of C I have ever seen. (See if your
friends can figure out what it is meant to do.) Of course you have to
get out of the loop sometime. For me this happens only when the
functional program is finished, so I use a longjmp rather than test C
for nil (for example) each time round.

This trick only works in rather special situations, but they are
exactly the situations in which one wants to use C as a code generator. The
only alternative (used by the Gambit compiler people) is to map all
labels onto integers, and write the program as a giant switch
statement. But this kills separate compilation (or requires you to
write a separate linker) and stresses the C compiler more than
somewhat.


Simon L Peyton Jones

Dept of Computing, University of Glasgow, G12 8QQ.
sim...@cs.glasgow.ac.uk

David Keppel

Apr 2, 1990, 7:56:59 PM
[This is not a bug report, this describes a tail-call mechanism.]

Simon Peyton Jones writes:
>[Because the CALLER deallocates the parameters, you can't, in general,
> optimize tail recursion in C. Example:]


> f(n) { ...mumble...; return( g(n, n+1) ); }
>
>Since f's caller will deallocate the space for its one parameter,
>f cannot tail-call g, which uses space for two parameters.

I think you're assuming something that is not necessarily true. For
example, the standard MIPS calling convention (as implemented by both
the MIPS compiler and by gcc) passes the first N integral arguments in
registers. The registers can be clobbered by the callee, so it must be
the callee and not the caller that is deallocating the parameters.

The VAX `ret' instruction takes an implicit argument that is the number
of 32-bit words to pop as a part of function return. Read that as
``callee-deallocate''.

In many machines it is possible to adjust the number of args. If the
stack layout looks like:

unused
unused
local <--sp
local
local
return pc <--fp
arg
arg
arg

then the tail call from n bytes of args plus locals to m bytes of args
plus locals (with n less than m) looks like (assume the stack grows up):

add $(m*4), sp
mov (fp), (sp)
# mucking with order/placement of operands

So for the particular example above, and assuming all the variables
used for `mumble' fit into registers, we can do the tail call as:

push (fp) # allocate space for 1 more arg
# ... and move return pc
mov -4(fp), r0 # move `n'
mov r0, -4(sp) # ... to first arg position
add r0, $1, -8(sp) # ... and push `n+1'.
jmp g

compared to the normal recursion:

mov -4(fp), r0
add $1, r0, +(sp) # push `n+1'
push r0 # push `n'
call g
ret # `extra' return

Points to ponder:

* For the i386, `jmp g' and `call g' take about the same amount of
time, 7 and 10 cycles, respectively. The added overhead of the `ret'
is about 10 cycles. A register-to-register integer op takes about 2
cycles. These performance ratios are the same for the i486; divide
the cycle counts by two.

* If call/return is very fast, then the difference for the above code
fragments won't be much.

* If the order of the parameters to the example were reversed, then
the tail call code would be

push (fp) # allocate space for 1 more arg
# ... and push return pc
add $1, -4(fp), -4(sp) # push `n+1'
# `n' already pushed calling `f'.
jmp g

which is faster not (just) because it avoids call/ret instructions,
but because it optimizes memory loads and stores.

* This optimization will NOT work if different functions use different
callee/caller register save protocols. To the best of my knowledge
most widely-available machines have a single model.

* Profitability analysis: I will hazard a guess that when the code
``eaten'' by the return instruction is a single word, then tail call
optimization like this is never more expensive than the equivalent
``push values, call, return'' kind of code UNLESS it is necessary to
swap stack values via a temporary. (That's just a guess. I'd
welcome a counterexample or two.) For some machines, e.g., the VAX,
the value ``eaten'' by the return instruction is much larger than a
single machine word and the WHOLE value must be moved to adjust the
stack for more parameters (actually, the VAX is complicated because
it uses a non-constant register save model, needs to adjust more
registers, etc.)

* Tail calls can make debugging harder, since `g' appears to have been
called by the caller of `f'.

This kind of optimization is machine-dependent. There is, however, a
class of machines -- with a stack layout like the one described
above -- that are all similar enough that the machine description could
probably contain a flag that says either ``do the most straightforward
tail call'' or ``do harder tail calls.''

Note that tail call optimizations can be performed without help from
the front end, but that the code produced for tail recursion by this
kind of optimization will be worse than that currently provided by
gcc's front end:

int
f (int n)
{
if (!zork (n))
return (0);
return (f (n+1));
}

produces (more or less) the following:

_f:
movl 4(ap),r6
L2:
push r6
call _zork
tst r0
jneq L1
ret
L1:
inc r6
jmp L2

Note that the value is stored in `r6' across the tail call. The
optimization mechanism described above will not do that. Instead, it
will do something like:

_f:
mov 4(fp),r6
push r6
call _zork
tst r0
jneq L1
ret
L1:
add $1, r6, 4(fp)
jmp _f

This is still OK, but it has an extra store and reload compared to the
code produced when gcc recognizes the tail recursion in the front end.
For the i386, the frame pointer is treated as a callee-save register, so
the tail-call version uses an extra two instructions at procedure entry
(save old frame pointer, create new frame pointer) and an extra one
at function exit (restore frame pointer) compared to the tail-recursive
version.

_f:
push fp # callee saves
movl sp, fp # set up new frame pointer
mov 4(fp),r6
push r6
call _zork
tst r0
jneq L1
ret
L1:
add $1, r6, 4(fp)
pop fp
jmp _f

My handy-dandy memory of i386 instruction timings says:

* gcc-recognized tail recursion, `L2:' to `jmp L2': 31 cycles
Note that the gcc-recognized tail-recursion timing is independent of
whether the system uses a frame pointer.
* tail-call, no frame pointer, `_f:' to `jmp _f': 36 cycles
* tail-call, frame pointer, `_f:' to `jmp _f': 44 cycles
* default (no tail call, frame pointer): 56 cycles

_f:
push fp
movl sp, fp
mov 4(fp),r6
push r6
call _zork
tst r0
jneq L1
ret
L1:
add $1, r6, +(sp) # push `n+1'
call _f # recursion
sub $4, sp # pop `n+1'
pop fp
ret

This optimization is easy to recognize. Any time a function call is
(or can be turned into) the last thing in a function, this technique
can be applied.

I hope somebody finds this discussion useful. Whether these
optimizations are useful or not, I won't say :-)

;-D on ( My other VAX is a RISC ) Pardo
--
pa...@cs.washington.edu
{rutgers,cornell,ucsd,ubc-cs,tektronix}!uw-beaver!june!pardo

Mikael Pettersson

Apr 3, 1990, 11:41:51 PM
In article <11021.90...@tutuila.cs.glasgow.ac.uk> simonpj%cs.glasg...@NSFNET-RELAY.AC.UK (Professor Simon Peyton-Jones) writes:
>
>Further to the conversation about tail recursion...
>
>Coincidentally I had a brief discussion with Richard Stallman about
>this very topic (and for the same reason, to boot) a few weeks ago. He
>pointed out that because the CALLER deallocates the parameters, you
>can't in general optimise tail recursion in C. For example:
>
> f(n) { ...mumble...; return( g(n, n+1) ); }

Yes. The pre-ANSI rules more or less mandate a caller-pops strategy
because of printf() and other variadic functions (Ritchie and Johnson
imply this in their report "The C Language Calling Sequence").

>Finally, here's a *portable* trick to fake tail-recursion, in the
>special case where all the functions involved take no arguments. This
>is actually the case I am interested in (and I suspect David is also),
>because the parameters are passed on a separate stack. This trick
>allows a functional language to be compiled to any C (not just one
>which optimises tail recursion), and I am using it for just this
>purpose.

> ...


> while TRUE { C = (PFUNC) (*C)(); }

This is a neat trick; I've used it myself in a Scheme interpreter
implemented on top of a non-tail-recursive Lisp.

However, as pointed out by Joel Bartlett (he wrote the Scheme->C
compiler at DECWRL), there are huge advantages to sticking to the
standard C calling conventions:

* register allocation of arguments and optimization of
leaf procedures (especially true for RISC processors)
* interfacing to routines written in other languages

I have devised a semi-portable solution to this problem that
works if all arguments and return values are 32-bit quantities
(i.e., it's ideally suited for compilers for functional languages
using C as their intermediate code). The idea is to identify
non-trivial (*) cases of tail-recursion and replace them as follows:

f(..) { ... ; return g(..); }
becomes: f(..) { ... ; return TAILCALL(ac1,ac2,g,`..'); }

where ac1 is the number of arguments to f(), ac2 is the number of
arguments to g(), and `..' is the actual argument list.

(*) trivial cases such as self-calls or calls to local procedures
are turned into goto's.

Then I preprocess the C code using a machine dependent M4-script
that looks at ac1 and ac2 (they are compile-time constants) and
turns the TAILCALL into a call to a library routine optimized for
the values of ac1 and ac2. These library routines (written in
assembler) do the necessary shuffling of arguments to achieve the
tail-recursive effect.

I have done implementations for the M680x0 and the Sparc (assuming
Sun's calling conventions) that are fairly efficient for the common
cases. Using GCC's souped-up asm()'s one could probably improve it
even further.

I can make this available if there is sufficient interest. It will
probably take a couple of weeks before I'm finished with the code
and the tech report describing it, so don't hold your breath :-)
--
Mikael Pettersson, Dept of Comp & Info Sci, University of Linkoping, Sweden
email: m...@ida.liu.se or ...!{mcsun,munnari,uunet,unido,...}!sunic!liuida!mpe
