I am currently working on the register allocation for a compiler
generating code for the IBM RS/6000 architecture.
The register sets of this architecture are split into caller and callee
saved registers. Parameters are always passed in caller saved registers.
I'm having some difficulty grasping the key idea behind the division into
caller/callee-saved registers:
What would happen if all registers were saved by the called procedure
(callee saved) ?
The kind (caller saved/callee saved) of register used for a variable is
determined locally. Won't this lead to a suboptimal global allocation?
Why are there no callee saved parameter registers ?
How did the designers of the architecture decide on the relative number of
caller/callee saved registers ? Should the division be chosen differently
for different high level languages ?
I would be very interested in your comments, suggestions, and advice, and
would appreciate any references to papers dealing with this topic.
Thanks,
Jakob Paul Magun, pma...@iiic.ethz.ch
(currently graduating from the Swiss Federal Institute of Technology)
The idea is a compromise.
Using a caller-saves discipline (where the caller saves only the registers
with useful information in them) may result in the caller saving registers
that the callee wasn't going to trash; hence, wasted saves.
Using a callee-saves discipline (where the callee saves only the registers
that will be trashed by the callee) may result in the callee saving
registers that had no useful information in them; hence, wasted saves.
>The kind (caller saved/callee saved) of register used for a variable is
>determined locally. Won't this lead to a suboptimal global allocation?
The kind is usually determined by a local competition. If a variable is
live across a call, then it needs to go into the set of callee-saves
registers. If a variable isn't live across a call, then it can go into
either set, but should be placed in the set of caller-saves registers to
avoid wasting the callee-saves regs.
When using graph coloring allocators, it's a matter of making the
long-lived variables interfere with the caller-saves registers at each
procedure call. Note that caller-saves registers are usually not
explicitly saved at the call; instead, we just ensure that there's
nothing in them that needs saving. Similarly, callee-saves registers need
not be saved unless they are actually trashed.
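For concreteness, here is a minimal C sketch of that interference step;
the graph, live-range, and register numbering are invented for
illustration, not taken from any particular allocator:
------------
typedef struct LiveRange { int node; struct LiveRange *next; } LiveRange;
typedef struct Graph Graph;                   /* interference graph  */
extern void add_edge(Graph *g, int a, int b);      /* assumed helper */
extern int  phys_reg_node(Graph *g, int phys_reg); /* assumed helper */
enum { FIRST_CALLER_SAVE = 3, LAST_CALLER_SAVE = 12 }; /* made-up split */

/* At each call site, make every live range that crosses the call
 * interfere with every caller-saves register; the colorer is then
 * forced to put call-crossing values in callee-saves registers. */
void add_call_interferences(Graph *ig, LiveRange *live_across_call)
{
    for (LiveRange *lr = live_across_call; lr != NULL; lr = lr->next)
        for (int r = FIRST_CALLER_SAVE; r <= LAST_CALLER_SAVE; r++)
            add_edge(ig, lr->node, phys_reg_node(ig, r));
}
------------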
>Why are there no callee saved parameter registers ?
Usually the value isn't required after the procedure call.
>How did the designers of the architecture decide on the relative number of
>caller/callee saved registers ?
Experiments!
>Should the division be chosen differently for different high level languages ?
Probably, though this would make it harder to arrange calls between
routines written in different languages.
Preston Briggs
Register sets are divided into caller-save and callee-save because you
have to save and restore registers sometime, and because procedure
boundaries seem like a good place to do it. Yes, interprocedural
allocation can give better save/restore placement, but (a) interprocedural
allocation is more expensive, (b) it can't be parallelized as easily, and
(c) it can't always be done (e.g. dynamic linking; you can't do caller
register allocation for a callee that doesn't yet exist), so you have to
have a back-up anyway.
If all regs are caller-save, then callers may save and restore registers
that the callee never uses. Even if the callee uses the registers, each
caller has to have code to save and restore those registers. By contrast,
if all registers are callee-save, then only those registers that are used
by the callee actually get saved and restored, and the code to do so is
concentrated in one place (the callee) rather than being duplicated in
each caller. However, the callee may save and restore registers that are
`dead' -- that is, registers that will be set by the caller before they
are read by the caller.
How do you decide what's the best division? The best description that
I've seen is that some registers tend to be dead at every call site, that
few procedures use all registers directly (though most registers get used
sooner or later), and that empirically a 1/2 caller-save, 1/2 callee-save
split works well.
If you believe that most leaf functions are small, then most leaf
functions will not use all registers, so a strict caller-save policy will
behave badly (e.g. useless spills and restores) and will waste code space
(and locality) by performing spills and restores in each caller instead of
in the callee. If you believe there are some dead registers at each call
site then full callee-save will behave badly (e.g. useless spills and
restores). Note that some optimizations can change the profitability of
each strategy and that e.g. better inlining of small routines tends to
reduce the relative costs of either scheme (by reducing the frequency of
calls). Sorry, I can't point you at definitive studies.
Callee-save parameter registers could be useful if, say, several routines
are called with the same parameters and/or one routine is called
repeatedly with most of the same parameters. However, making good use of
these registers would require the caller and callee to agree in some way
on which parameters should be passed in which registers. I think the
decision to just use caller-save registers is based on simplicity and
common use patterns.
;-D on ( The square dance caller ) Pardo
This debate has been going in circles for decades. I don't think that any
a priori scheme will ever be optimum (standards committees therefore take
note: you can't win). Gerry Sussman and Guy Steele came up with a 'best
of both worlds' hardware scheme called 'Dream of a Lifetime', and their
paper is in the 1980 Lisp Conference available from ACM. I guess that
'register windows' tried to finesse this problem too, but they have
substantial problems of their own.
More recently, Andrew Appel's group at Princeton have done a lot of
interesting work in which 'registers' are abstracted into additional
arguments which are then returned as additional values. These map
elegantly into callee-saves registers. Given this insight, one can then
do 'lambda-lifting' kinds of optimizations which avoid consing closure
environments so that their contents can be passed around as additional
arguments & returned values. However, one is then left with the uneasy
feeling that any architecture which requires these sorts of heroics can't
be the right thing.
The dataflow people have rightly decoupled the _naming_ of intermediate
results from their _storage_. Perhaps it is time for RISC'ers to take
notice of this approach.
Henry Baker
hba...@netcom.com
Callee saved registers are those registers that the called routine must
save before it may use them. The VAX is one big proponent of this.
Caller saved registers are registers that must be saved by the calling
routine before a call can be made. Otherwise, the contents of the
register may be destroyed across the call.
It is a religious war as to which is more correct. The caller-saves
convention places the smarts in the calling routine, which knows what
values are live across the call. The callee-saves convention places the
smarts in the called routine, which knows which values it must save.
If you wish to minimize register spills, a hybrid approach may be the
best... When scheduling registers, use the caller saved registers first
because your caller has already saved the information if it was needed.
If you need a value across a subroutine call, move it into a callee saved
register and let the called routine decide whether or not to spill its
value into memory. If your requirements exceed the number of either
register set, you must start spilling.
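As a sketch in C, that preference order might look like this (the set
names and the take_free helper are invented):
------------
enum { CALLER_SAVE_SET, CALLEE_SAVE_SET };
extern int take_free(int set);  /* assumed: a free register from the
                                   given set, or -1 if none remain */

/* Returns a register for the value, or -1 meaning "start spilling". */
int pick_register(int live_across_call)
{
    int r = -1;
    if (!live_across_call)
        r = take_free(CALLER_SAVE_SET);  /* cheapest: the caller has
                                            already saved what it needed */
    if (r < 0)
        r = take_free(CALLEE_SAVE_SET);  /* callee pays, and only if it
                                            actually uses the register */
    return r;                            /* -1: both sets exhausted */
}
------------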
Thank you.
Steve Simmons
Saving all registers would negate some of the advantages gained by passing
parameters in registers. There's often no reason to save parameter
registers since the values can often be used right where they are.
> The kind (caller saved/callee saved) of register used for a variable is
> determined locally. Won't this lead to a suboptimal global allocation?
It's impossible to be optimal unless you are looking at all of the callers
and callees at once. Register saving conventions are designed to do as
well as possible most of the time for only being able to consider a single
routine at once. Typically the biggest win from a good convention is very
cheap leaf routines. If they are not too big, they typically won't have
to save any registers and may not need to do any stack adjustments at all.
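For example, a small leaf routine like this made-up one can typically run
entirely in caller-save registers, with no saves and no stack adjustment:
------------
/* A leaf: calls nothing and needs only a few scratch registers, so
 * a good convention lets it run with no prologue or epilogue work. */
float dot3(const float *a, const float *b)
{
    return a[0]*b[0] + a[1]*b[1] + a[2]*b[2];
}
------------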
> How did the designers of the architecture decide on the relative number of
> caller/callee saved registers ? Should the division be chosen differently
> for different high level languages ?
This isn't an architecture issue. If you're designing a register
allocator, you can design your own conventions. However, if you have to
be able to call or be called by code generated by other compilers, you
need to know what their conventions are.
Different divisions may be better for different languages, but then you
have problems with calls between languages. Coding style is probably an
even bigger determinant than language on what makes a good division.
--
Paul Beusterien pa...@travis.csd.harris.com
M.S. #161
Harris Computer Systems (305) 973 5270
2101 W. Cypress Creek Road
Ft. Lauderdale, FL 33309
When procedure A calls procedure B:
(1) Procedure A must save any caller-saved registers it still needs, and
(2) Procedure A relies on B to save all the callee-save registers.
Thus from A's point of view, B can trash the parameter registers (because
they are caller-save, and A, the caller, has already saved them).
|> What would happen if all registers were saved by the called procedure
|> (callee saved) ?
Then B might save some registers that A does not care about. In this
case, B does extra work (of the worst kind: memory traffic).
|> The kind (caller saved/callee saved) of register used for a variable is
|> determined locally. Won't this lead to a suboptimal global allocation?
Why, of course. But caller and callee might be compiled in separate
object files, so must agree on which parameters are in registers and which
are not without ever seeing each other.
|> Why are there no callee saved parameter registers ?
Because the compiler-writer assumes that procedure A doesn't need the
parameter it makes for B. The value is dead after the call to B, so there
is no point in having B save and restore it.
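For instance (with sqrt standing in for any external callee):
------------
#include <math.h>

double hypot2(double a, double b)
{
    /* The argument value a*a + b*b is dead in this caller the moment
     * sqrt receives it; there is nothing for sqrt to preserve. */
    return sqrt(a*a + b*b);
}
------------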
|> How did the designers of the architecture decide on the relative number of
|> caller/callee saved registers ? Should the division be chosen differently
|> for different high level languages ?
Probably. I suspect the IBM designers used a "best-guess" approach,
perhaps backed up by a few test cases. There's a paper out there on IBM's
design decisions in the ABI, but I can't find my copy of it right now.
--
cli...@cs.rice.edu -- Massively Scalar Compiler Group, Rice University
Contra M. Beusterien, I feel it is an architectural issue in as
much as it is a problem most appropriately solved by throwing
hardware at it: The obvious architectural solution seems to be for
the caller to set a use mask, for the callee to set another
use mask, and for the processor to do a save of the intersection,
with special-purpose logic. Anything like this out there?
-- Anthony L Kimball -- a...@think.com, a...@msc.edu, {uunet,harvard}!think!alk
Preston Briggs writes:
>The idea is a compromise.
I've often thought this problem could easily be solved in
hardware. A call instruction that indicates which regs need saving in a
bitmask could be combined with a stack-frame-creation instruction
(similar to the mc68k family's link instr.) that logically ands this
bitmask with another bitmask that indicates which regs the routine needs
to trash. The result of this logical and would be the actual regs that
need saving. This could all be done in software, but it would be far
more efficient in hardware.
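A sketch in C of the save such hardware would perform (the 32-register
mask layout is invented for illustration):
------------
typedef unsigned int regmask_t;          /* bit r set => register r */

/* Save exactly the registers the caller still needs (live_mask) AND
 * the callee intends to trash (trash_mask) -- the logical and above. */
void save_intersection(regmask_t live_mask, regmask_t trash_mask,
                       const unsigned long regs[32],
                       unsigned long save_area[32])
{
    regmask_t to_save = live_mask & trash_mask;
    for (int r = 0; r < 32; r++)
        if (to_save & (1u << r))
            save_area[r] = regs[r];
}
------------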
so why hasn't this been done yet?
gears,
ye wilde ryder
--
robe...@agcs.com
[The vax call instruction has a save mask that supports a callee saves
convention. It's slower than a snoozing slug. -John]
robe...@agcs.com (Ye Wilde Ryder) writes:
>I've often thought this problem could easily be solved in
>hardware. A call instruction that indicates which regs need saving in a
>bitmask could be combined with a stack-frame-creation instruction
>(similar to the mc68k family's link instr.) that logically ands this
>bitmask with another bitmask that indicates which regs the routine needs
>to trash. The result of this logical and would be the actual regs that
>need saving. This could all be done in software, but it would be far
>more efficient in hardware.
Suppose A calls B, which calls C in a loop.
A needs registers 1 and 2 and 5 saved past the call to B.
B uses 2 and 3, but needs nothing saved past the call to C.
C uses 1 and 3 and 4, calls nobody.
Who saves register 1? More precisely, what mask does B pass to C? If
this mask includes registers 1 and 4 and 5 (i.e., all but 2 and 3), then
register 4 is needlessly saved every time C is called, and register 1 is
saved every time through the loop in B rather than once at the start of B.
If the mask passed to C excludes registers 1, 4, 5, then register 1 must
be saved at the start of B; do we also save register 5 there even though
it is never used in B or C?
It may be possible to achieve this with a smart linker which sees
that C needs register 1 but not 5, and sets up the right mask at the start
of B. But this scheme will not work when the addresses of subroutines
being called are unknown at compile/link time, such as if the reference to
C is
(*Cfunc)(...);
--
Peter L. Montgomery pmon...@cwi.nl
One additional point about caller vs. callee save which I'm surprised no
one has mentioned: callee-save makes it rather more difficult to implement
tail-call optimization. Recall that full tail-call (as opposed to mere
self tail-recursion) optimization avoids saving any state (including a
return address) whenever the last action a procedure performs is a call
to another procedure.
Callee-save does not make tail-call optimization more difficult. The
procedure making the tail-call needs to first get rid of everything it put
on the stack (in the process restoring any of the callee-save registers it
used). This is what makes it a properly tail-recursive call. The called
procedure can do anything with the stack it wants. Would you say an
implementation was not properly tail recursive if some procedure that was
tail-called temporarily spilled some registers to the stack? The
situation with callee save registers is analogous.
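Concretely (work is a made-up helper; a compiler that optimizes sibling
calls runs f's epilogue, restoring its callee-save registers, before
jumping to g):
------------
extern long work(long);
extern long g(long);

long f(long x)
{
    long t = work(x) + x;  /* x is live across the call, so f keeps it
                              in a callee-save register (and saves that
                              register's old value in its prologue)   */
    return g(t);           /* tail call: f restores its callee-save
                              registers, then jumps rather than calls */
}
------------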
>This is a very valuable optimization for recursive programs...
It does not matter so much for recursive programs; if they are
tail-recursive, then for most people, most of the time, it is just as easy
to use/read a for/while/until loop. The lack of tail-call optimization
really hurts if you want to use a continuation-passing style of
programming.
John W.F. McClain
>Contra M. Beusterien, I feel it is an architectural issue in as
>much as it is a problem most appropriately solved by throwing
>hardware at it: The obvious architectural solution seems to be for
>the caller to set a use mask, for the callee to set another
>use mask, and for the processor to do a save of the intersection,
>with special-purpose logic. Anything like this out there?
A study of architectural support to reduce register save/restore (RSR)
traffic across function calls is presented in the following paper:
Architectural Support for Reduced Register Saving/Restoring in
Single-Window Register Files. Miquel Huguet and Tomas Lang. ACM
Transactions on Computer Systems. Volume 9, No 1, Feb 1991.
A more detailed study can be found in Huguet's Ph.D. thesis:
Architectural and Compiler Support for Efficient Function Calls, Ph.D.
Dissertation, Computer Science Dept., University of California at Los
Angeles, Sep 1989.
They study the impact of hardware support (use masks) and compiler
optimizations, individually, and also when put together. The experimental
results are very interesting (though difficult to summarize here!).
Rakesh Ghiya.
School of Computer Science,
McGill University,
Montreal, Canada.
|> hba...@netcom.com (Henry G. Baker) writes:
|> Thus, so long as you don't run out of registers, the overhead of tail
|> recursion is _independent_ of the number of arguments, and essentially
|> identical to the imperative code.
Actually, a tail-recursive function (or loop) that makes other function
calls inside its body can benefit even more from good use of callee-save
registers. For example, consider the following C code, which iteratively
applies function "f" to "x" until it converges to satisfy the
predicate "p":
------------
extern int p(float a, float r);
extern float f(float a);

float iter(float x) {
    float a, r;
    a = x; r = 1.0;
    while (!p(a,r)) {    /* loop until the predicate is satisfied */
        r = a;
        a = f(a);
    }
    return a;
}
------------
[the difference between "iter" here and "fact1" in Henry's example is
that "fact1" doesn't make any other function calls in its loop body.]
Both "p" and "f" here are functions defined in other modules which you
don't really know how they treat the machine registers. Assuming initially
all variables (e.g., "a","p", and "r") all stay in registers. Then,
inside the above "while" loop, both "a" and "p" must be saved into memory
(i.e., stack frame) before the function call to "p", and later be restored
after the call to "p" returns.
If we establish a "callee-save register" convention: "every function
reserves k callee-save registers (say, r1-r4 for k=4)," then we can do the
following:
(1) Before entering the "while" loop:
    (a) save "r1-r3" into memory by building a 3-element record "CR",
        and put the address of "CR" into r3.
    (b) move "a" into r1, and "p" into r2.
(2) Run the while loop as usual, except that now, even when calling "p"
    and "f", no register saves/restores are necessary, because every
    function promises to preserve the callee-save register contents
    (r1-r4).
(3) At the exit of the "while" loop, restore the old contents of r1-r3
    (basically, retrieve them from the "CR" record).
In Standard ML, the above C program is normally written as the following
tail-recursive function, which benefits from the "callee-save registers"
in the same way:
----------------
fun iter(x) =
let fun h(a,r) = if p(a,r) then a
else h(f(a),a)
in h(x,1.0)
end
-----------------
The algorithm that does the above "callee-save register" allocation is
based on the compile-time variable lifetime information and control-flow
information (and is described in our LFP94 paper).
-Zhong Shao
(z...@cs.princeton.edu)
The KSR-1 compiler/runtime system requires that every procedure have (or
at least behave `as if' it has) two normal entry points: one for when the
caller expects a return value that's a structure and one for when the
caller expects a non-struct return value (or none at all). In addition,
non-leaf procedures must have (or behave `as if' they had) one minor entry
point for each call site they contain.
The struct return entry is first, at offset 0 from the procedure start.
It is typically a branch to the "real" function entry code (the second
delay slot instruction comes from the first instruction of the second
entry point but the first entry squashes it to avoid side effects).
The non-struct-return entry is second at an offset 16 from the procedure
start. It typically sets a flag (to indicate non-struct invocation, for
use at return) and falls through to the "real" function entry code. Even
routines that do not return structures support (`as if') the dual-entry
behavior.
In addition, each call site is organized like this (each line is actually
an
8-byte instruction pair):
call <target>
delay-slot-1
delay-slot-2
nargs-instruction
The return pc is set to be the address of the `call' instruction. The
callee can make use of the built-in `nargs' routine to determine how many
(words of) arguments the caller passed. The `nargs' builtin is
implemented in the callee as:
jump 24(return pc) ! jump to caller's nargs-instruction
jump L1 ! jump back to callee -- in delay slot !!
L1:
That is, the callee branches to the `nargs-instruction' of the caller, but
the callee branch is followed by a delay slot instruction that branches
back to the callee, so only a single caller instruction is executed. That
instruction is typically something like `mov N,%reg', to set N, the number
of (words of) caller args. Note that this makes it hard to implement
`nargs' in a threads package that supports varargs, since there must be an
`nargs-instruction' for each possible value of `N'.
Whether you think these `nargs-instruction' bits qualify as entry points
probably depends on how you squint at the code. The struct/nonstruct
entries, though, certainly qualify. (As an aside, all calls are also done
using an indirect call through a descriptor, but it's implemented as an
indirect pointer, not as a stub code fragment.)
Disclaimer: I'm doing this all from memory and probably got some bits
wrong.
;-D on ( Caller-slave registers ) Pardo
Typical caller/callee-save divisions leave a "modest" number of free
caller-save registers (e.g. 10) for the body of each procedure, so only
"larger" routines (those that need many registers) need to save and
restore callee-save regs.
Further, since tail-recursion can be transformed into iteration, spills
and restores of registers can be placed outside of the iteration, e.g.
f: ... ! normal entry
spill
f': ... ! tail-recursion loop entry
body
bn.cnd f' ! iteration/recursion
restore
return
I believe this is what GNU CC does in some cases (it is thus able to
inline recursive calls). In other cases the code at f' allocates stack
space (but is less general than f) and in some cases it uses the standard
calling convention.
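For example, a self tail-call like the one below can be compiled as a
branch back to f', with the spill/restore hoisted out of the loop:
------------
unsigned long fact(unsigned long n, unsigned long acc)
{
    if (n == 0)
        return acc;
    return fact(n - 1, acc * n);  /* tail call: becomes a branch back
                                     to the loop entry, not a call   */
}
------------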
;-D on ( Have your registers been SAVED? ) Pardo
Anton Ertl <an...@mips.complang.tuwien.ac.at> wrote:
> 2) The arguments are callee-saved: Then you need not save and restore
> r1 around g, because g does that. Again, no pop r1 and you can
> optimize the tail call.
Huh??? Let's take it slowly here:
(1) In any general callee-saves convention the value in
r1 when f exits should be the same as the value in r1
when f is entered (after all, the caller of f may be
using r1).
(2) If r1 is being used to pass an argument to g, then
the value in r1 when g is entered will in general need
to be different than the value in r1 when f is entered.
(3) g does not in general have access to the value which
was in r1 when f was entered, so g can't put it back.
Presumably, when g is exited, r1 will contain one of
two things -- either garbage, or the value in r1 when g
was entered.
(4) Thus, f, and f alone, must restore the value of r1.
This must happen after the call to g, but before f
exits.
(5) Given (4), that "pop" instruction (or some analogue)
must remain firmly planted where it is, and tail-calling
g is thus impossible.
Am I missing something?
Bart Massey
ba...@cs.uoregon.edu
If you allocate any local storage in the 'tail-recursive' routine and keep
pointers to it, you lose. Since some C programs do this, and testing for
it is kinda hard, I can understand why it isn't the default behavior. I
believe that this thread discussed this very issue about 2 months ago. A
reference to an article about the problems with C tail recursion was
posted, but I didn't write it down.
Since getting C to do this routinely is difficult, I have 'gone with the
flow' and suggested utilizing this behavior to incorporate 'tail recursion'
and a first-level generational GC in the same mechanism. My paper is
called 'Cheney on the MTA', and I'll be happy to send a copy to anyone who
wants it. I also have a version of the Boyer Benchmark using this style
of C coding, which I'll also be happy to send. Perhaps these should be
stored in some ftp directory? Any volunteers?
John McClain <pd...@ai.mit.edu> wrote:
|> > The lack of tail-call optimization really hurts if you want
|> > to use a continuation passing style of programming.
The 'Cheney on the MTA' technique gives you O(1) continuation captures
'for free'.
(The Xerox Alto microcode had some of this same bizarre behavior, and was
strictly pipelined, so that a sequence of 3 tests would have you quickly
pulling your hair out trying to figure out exactly which instructions
would be executed under what conditions. It was fairly simple to compile
for, however, using a kind of backwards 'dynamic programming' technique.)
Why does this remind me so much of the 7090 & 360 'execute' instructions?
When I first saw them, I thought "what a curious thing, but what are they
good for?" They're kinda like indirect addressing for the instruction
stream. The only good use I ever saw for the 7090 version was a
table-driven emulation of a PDP-10 style byte extraction instruction. I
believe that there's an article from the early 1960's (CACM???) which
attempts to convince us of their utility. It's interesting to see them
pop up in this new context.
[They were popular in the 1960s, and the PDP-6 and 10 had them. I sometimes
found them useful as switches, and they were very handy for handling Algol
thunks, but I can't say I miss them terribly. On the 360 they were typically
used in stylized ways to fake variable length string instructions. -John]
I've long been philosophizing about the `meaning' of dynamic code and I
like to view the `execute' instruction as an early attempt at structured
self-modifying code. I guess the form of the `execute' instruction, as
compared to some more general support for dynamic compilation, was
motivated by hardware designs of the day (easy to move a register into
the internal instruction register), and its fall from favor was driven in
part by the trend towards more regular instruction sets that do less work
on each instruction (e.g., no implied looping); these changes make it hard
to pack big commands like "walk the following memory and apply this op. to
each nonzero element" into a single register.
>[I've rarely seen it used.]
The Mimic S/370 simulator lacks general support for dynamic compilation
and SMC (I-cache flushing and moving bits from D-space to I-space), but
`execute' is widely used enough that Mimic supports it.
Cathy May, "Mimic: A Fast S/370 Simulator," Proceedings of the ACM
SIGPLAN 1987 Symposium on Interpreters and Interpretive Techniques,
SIGPLAN Notices 22(6), St. Paul, MN, June 1987, pp. 1-13.
;-D on ( Coming soon: self-modifying hardware ) Pardo
I have a public domain compiler for the Amiga which generates separate
entry points for calls with the arguments on the stack or in registers.
This is fairly normal on the Amiga, where system calls pass their
arguments through registers.
This compiler has various other interesting aspects: a cmd-line flag for
generating reentrant code, and support for dynamic linking (that is,
select-code-object-at-runtime-and-load-it, not just shared libraries).
Such properties are important/useful on a machine with no hardware memory
protection :-)
Thomas
--
| Thomas Conway
| Computer Science
| Melbourne University
| con...@mundil.cs.mu.oz.au