
Optimal # of registers?


James Barr

May 10, 1992, 8:48:36 AM
Hi,

I'm very ill-informed about processor architecture, but I find something
confusing, so I wonder if anyone could clear it up for me.

The average C function is usually not very big and so can only make
use of so many registers, or so I would think.

When another function is invoked some of the registers have to be saved,
since a function doesn't know what has been used and what is free;
SPARC register windows or something analogous to /bin/sh's shift
instruction would get around this.

Some RISC machines have lots of registers, 32 in MIPS and Alpha, I believe.
But can your average function/procedure make use of so many and doesn't
this impact on context switching speed? If a compiler did lots of inlining
to make the object code routine bigger, I imagine better use would be made.

Is there something obvious I'm missing here, and if not does anyone have
any idea what the optimal number of available registers is for a C program?

thanks,
james barr

--
James Barr ............... ......
+31 20-592-5104 ........... ..........
James...@nikhef.nl ...... ..............
NIKHEF, Kruislaan 409, 1009DB Amsterdam .. ..................

John Mashey

May 11, 1992, 12:42:07 AM
In article <16...@nikhefh.nikhef.nl> a...@nikhefh.nikhef.nl (James Barr) writes:
>Hi,

>Is there something obvious I'm missing here, and if not does anyone have
>any idea what the optimal number of available registers is for a C program?

There's no such thing as one number; it depends on the code and the
compilers.

At one point years ago, {MIPS, HP, and IBM} independently reached the conclusion
that on the order of 25 allocatable integer registers [i.e., not including
stack pointer, zero, frame-pointer, or other dedicated things]
was a "good" number, in the sense that no one at that point could find
much use for any more, but people found examples where the incremental
value of adding a register was still noticeable.
In our case, it was some quick experiments done by telling the register
allocator that some registers were unavailable, and varying the number.
I don't recall if HP or IBM ever published such things; this was
bar- and/or conference-discussion-type info.

Hennessy and Patterson devote a section to this topic.

A subtle point often missed:
IF you have a good register allocator
AND IF you have "plenty" of registers
THEN you can allocate some of them as CALLEE_SAVE registers
and some as CALLER_SAVE registers, and you can have a reasonable
number of each flavor. In this case, you have enough CALLER_SAVE
(i.e., scratch registers) that you can drastically reduce the
number of register save/restores that need to be done.

THAT IS:
suppose you have a CPU with a total of 8 integer registers.
It would be real easy to chew up at least 1 with stack-pointer,
1 with return-address, and 1 with return-value, and if you
use a frame-pointer ... that's 4 gone, one of which (return-value)
is a natural scratch register.
That leaves 4 registers, and to get enough to do much work in, you
may have to save/restore all 4 on most function calls.
IF you have 24 registers available (plus sp, fp, return-address, etc),
you could easily allocate 8-12 each for scratch and CALLEE_SAVE,
and end up being able to compute what you need in the scratch registers,
without saving more than a return-address, and in a leaf-routine,
not even that...
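
To make the leaf-routine point concrete, here is a minimal sketch, assuming
MIPS-style register conventions ($a0-$a3 arguments, $v0 result, $t0-$t9
caller-save scratch); the exact allocation is of course up to the compiler:

/* A leaf routine: it calls nothing, so the compiler can do all of its
 * work in caller-save (scratch) registers and never touch the stack. */
int dot3(int *a, int *b)
{
    int s;                   /* lives in a scratch register, say $t0      */
    s  = a[0] * b[0];        /* loads and products use $t1, $t2, ...      */
    s += a[1] * b[1];
    s += a[2] * b[2];
    return s;                /* result moved to $v0; nothing was saved or */
}                            /* restored, and no stack frame was needed   */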

Floating-point: it's real nice to have 16-32 registers; graphics folks
especially like 32...
--
-john mashey DISCLAIMER: <generic disclaimer, I speak for me only, etc>
UUCP: ma...@mips.com OR {ames,decwrl,prls,pyramid}!mips!mash
DDD: 408-524-7015, 524-8253 or (main number) 408-720-1700
USPS: MIPS Computer Systems M/S 5-03, 950 DeGuigne, Sunnyvale, CA 94086-3650

Bruce...@bbs.actrix.gen.nz

May 11, 1992, 3:31:21 AM
In article <l0ruov...@spim.mips.com> ma...@mips.com (John Mashey) writes:

> A subtle point often missed:
> IF you have a good register allocator
> AND IF you have "plenty" of registers
> THEN you can allocate some of them as CALLEE_SAVE registers
> and some as CALLER_SAVE registers, and you can have a reasonable
> number of each flavor. In this case, you have enough CALLER_SAVE
> (i.e., scratch registers) that you can drastically reduce the
> number of register save/restores that need to be done.

The 68K doesn't exactly have a surplus of registers, but that's exactly
what the convention is on the Mac. A0,A1,D0,D1 and D2 are CALLER_SAVE
(scratch), A2,A3,A4,D3,D4,D5,D6 and D7 are CALLEE_SAVE and A5,A6 and A7
are special (global, frame and stack pointers).

It would be nice to have more registers, but even so, many routines can
take good advantage of the scheme. Of course, different compilers have
to agree on this if you're going to mix code, but this is encouraged a
little by the ROMs doing it to *your* A0-1,D0-2 when you call toolbox
code :-)

I know this sort of thing isn't usual on other machines I've used (VAX/VMS,
DG's AOS/VS, MS-DOS), but I guess I'm a little surprised to hear that
it's "often missed". Doesn't anyone else do it?

--
Bruce...@bbs.actrix.gen.nz Twisted pair: +64 4 477 2116
BIX: brucehoult Last Resort: PO Box 4145 Wellington, NZ
"Cray's producing a 200 MIPS personal computer with 64MB RAM and a 1 GB
hard disk that fits in your pocket!" "Great! Is it PC compatable?"

Rick Noah Zucker

May 11, 1992, 12:50:48 PM
In article <l0ruov...@spim.mips.com> ma...@mips.com (John Mashey) writes:
>At one point years ago, {MIPS, HP, and IBM} independently reached the conclusion
>that on the order of 25 allocatable integer registers [i.e., not including
>stack pointer, zero, frame-pointer, or other dedicated things]
>was a "good" number, in the sense that no one at that point could find
>much use for any more, but people found examples where the incremental
>value of adding a register was still noticeable.
>In our case, it was some quick experiments done by telling the register
>allocator that some registers were unavailable, and varying the number.
>I don't recall if HP or IBM ever published such things; this was
>bar- and/or conference-discussion-type info.

A paper along these lines appears in last year's architecture conference.
It is by Bradlee, Eggers and Henry, and it considers the performance
benefit of different register set sizes and organizations under different
code generation strategies.

Rick Zucker

mj...@minster.york.ac.uk

May 11, 1992, 8:42:50 AM
In article <16...@nikhefh.nikhef.nl> a...@nikhefh.nikhef.nl (James Barr) writes:
>The average C function is usually not very big and so can only make
>use of so many registers, or so I would think.

You imply that everyone programs in C -- this is most definitely not the
case!

>When another function is invoked some of the registers have to be saved,
>since a function doesn't know what has been used and what is free;
>SPARC register windows or something analogous to /bin/sh's shift
>instruction would get around this.

The _called_ function knows _exactly_ what is used and what is free -- so do
register saving/restoring inside the _callee_, not the _caller_.

>James Barr ............... ......
>+31 20-592-5104 ........... ..........
>James...@nikhef.nl ...... ..............

Mat

| Mathew Lodge | "A conversation with you, Baldrick, |
| mj...@minster.york.ac.uk | and somehow death looses its sting..." |
| Summer: lodge%alsys@uknet | -- Blackadder II |

Othman Ahmad

May 11, 1992, 11:25:38 AM
James Barr:

:
: Some RISC machines have lots of registers, 32 in MIPS and Alpha, I believe.


: But can your average function/procedure make use of so many and doesn't
: this impact on context switching speed? If a compiler did lots of inlining
: to make the object code routine bigger, I imagine better use would be made.
:
: Is there something obvious I'm missing here, and if not does anyone have
: any idea what the optimal number of available registers is for a C program?
:
: thanks,
: james barr

Registers are part of the memory hierarchy, along with cache and main memory.
Obviously, if we had extremely fast main memory we would not need any registers
at all; the speed of the ALU could match the speed of our main memory.
This problem can be viewed as an impedance-matching problem, which in
electronics is solved by a transformer. However, computers are information
processing machines.

1) Registers are introduced to overcome these speed differences. They behave
like a cache, but are loaded manually, unlike true caches. Maybe we can design
special-purpose caches in place of these registers!

2) Another advantage of introducing registers is the reduction in instruction
size: the number of address bits needed for a register is much smaller than
for a main-memory location.

Once you understand these two functions, the optimum number of registers
should be obvious: it comes from optimizing both of them.
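
As a rough illustration of point 2, assuming a hypothetical machine with 32
registers and 32-bit memory addresses (the numbers are just for illustration):

#include <stdio.h>

int main(void)
{
    int reg_bits  = 5;      /* 2^5 = 32 registers                       */
    int addr_bits = 32;     /* one full main-memory address             */

    /* a three-operand register instruction needs only 3*5 = 15 bits of
       operand specifiers; naming even one memory operand needs 32 bits */
    printf("three register operands: %d bits\n", 3 * reg_bits);
    printf("one memory operand:      %d bits\n", addr_bits);
    return 0;
}
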
Other responses to this question take the software approach without
regard to these basic functions. In fact theirs is the universal view,
supported by textbooks and tons of published papers.
However, that view has an important flaw: software is not static. It
depends on the application, and statistics taken for one class of application
are not valid for another. Of course we cannot escape this dilemma.
The best approach is the hardware point of view. Technology is the
deciding factor, not software needs. The reason we need the different levels
of memory hierarchy is the drive limitation of transistors. Refer to Carver
Mead's book, Introduction to VLSI Systems; he notes that the optimum
transistor size is about e (~2.7) times that of the driver transistor.
However, ALUs are not made up of a single transistor stage, because of
functional needs. The ALU is the brain of a microprocessor; it needs fast
memory to keep up with its processing rate, and that is the function of the
registers. The number of registers should be the maximum such that their
access time still matches the ALU. The access time depends on address
generation and data transfer.
However, the access time also depends on process technology and the
complexity of the ALU: BiCMOS can drive more inputs, and DSP multipliers are
more complex than RISC integer ALUs.
The point is, you cannot really estimate the optimal number based
on software alone.
--
Othman bin Ahmad, School of EEE,
Nanyang Technological University, Singapore 2263.
Internet Email: eoa...@ntuix.ntu.ac.sg
Bitnet Email: eoa...@ntuvax.bitnet

Dennis O'Connor

May 11, 1992, 5:51:04 PM

mj...@minster.york.ac.uk writes:
] In article <...> a...@nikhefh.nikhef.nl (James Barr) writes:
] >When another function is invoked some of the registers have to be saved,

] >since a function doesn't know what has been used and what is free;
] >SPARC register windows or something analogous to /bin/sh's shift
] >instruction would get around this.
]
] The _called_ function knows _exactly_ what is used and what is free -- so do
] register saving/restoring inside the _callee_, not the _caller_.

The calling procedure knows what registers have values that need to be
preserved across the call. The called procedure knows which registers
it will use. The set of registers that need to be saved and then
restored is the intersection of these two sets. Minimizing this
intersection reduces the memory traffic associated with a call/return.
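
The same point in a few lines of C, with the register sets as bitmasks
(the particular mask values here are made up):

#include <stdio.h>

int main(void)
{
    unsigned caller_live = 0x00f3; /* regs holding values live across the call */
    unsigned callee_used = 0x0f30; /* regs the callee will overwrite           */
    unsigned must_save   = caller_live & callee_used;   /* the intersection    */

    printf("registers actually needing a save/restore: 0x%04x\n", must_save);
    return 0;
}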

Register windows provide limited disjoint sets of registers: as long as
these sets are sufficiently large, the intersection of the in-use-by-caller
and used-by-callee sets can be empty. In any case, it's smaller than it
otherwise would have been by the size of the non-shared window. But you
pay for it in other places, as discussed in another very recent thread.

Protocols that define one set of registers as the preserve-responsibility
of the caller, and another as the responsibility of the callee, are also
meant to minimize the intersection of in-use-by-caller/callee-used registers.
The caller will try not to have the registers in its responsibility-set
in use at the time of a call, by clever register allocation. The callee
will try to use only those registers that were the caller's responsibility.
If both the caller's objective and the callee's objective can be accomplished,
no register saves/restores are needed. Of course, once the callee becomes
a caller itself, things break down a bit. But if 50% of all calls are to
leaf procedures (and inlining-optimizing compilers may improve upon this)
you won't have this reversal of roles a lot of the time.

In contrast, putting the responsibility for saving/restoring just on the
caller or just on the callee forces whichever it is to make worst-case
assumptions about the other's used or in-use register set, which will
cause the set of registers that needs to be saved to be larger than it
otherwise would be.
--
Dennis O'Connor doco...@sedona.intel.com

CP/M lives!

May 11, 1992, 11:49:12 AM
In article <1992May11.0...@actrix.gen.nz>, Bruce...@bbs.actrix.gen.nz writes:
>
> The 68K doesn't exactly have a surplus of registers, but that's exactly
> what the convention is on the Mac. A0,A1,D0,D1 and D2 are CALLER_SAVE
> (scratch), A2,A3,A4,D3,D4,D5,D6 and D7 are CALLEE_SAVE and A5,A6 and A7
> are special (global, frame and stack pointers).
>
> I know this sort of thing isn't usual on other machines I've used (VAX/VMS,
> DG's AOS/VS, MS-DOS), but I guess I'm a little surprised to hear that
> it's "often missed". Doesn't anyone else do it?

The convention _inside_ the VMS kernel is typically that you can do what you
like to R0 through R5 (caller saved), but if you mutilate R6 and higher you
have to save them (callee saved).

Roger Ivie
iv...@cc.usu.edu

Mike Moore

May 12, 1992, 4:43:58 AM
In article <16...@nikhefh.nikhef.nl> a...@nikhefh.nikhef.nl (James Barr) writes:
>Hi,

>
>The average C function is usually not very big and so can only make
>use of so many registers, or so I would think.
>
>Some RISC machines have lots of registers, 32 in MIPS and Alpha, I believe.
>But can your average function/procedure make use of so many and doesn't
>this impact on context switching speed? If a compiler did lots of inlining
>to make the object code routine bigger, I imagine better use would be made.
>
>Is there something obvious I'm missing here, and if not does anyone have
>any idea what the optimal number of available registers is for a C program?
>
>thanks,
>james barr

I'm sure there is such a thing as an 'optimal' number of registers, but it
would only be optimal for a specific application. Registers have lots of
uses other than as place-holders for function arguments. Indeed, on many
compilers, function arguments are placed on a stack in regular memory, *not*
in registers. The ideal solution to speed up context switching/interrupts
etc. is to have a single instruction which would stack all or most of the
currently viewable registers. The first chip I came across which did this
was the Z80, a CISC (*many* years ago), which had an alternate register set
(but only one). Modern RISC architectures allow you to alter the viewable
window of registers with one command. So all in all, there should be no
noticeable impact with these kinds of chips (of course, what happens when
you run out of registers has always been an intriguing challenge!).


--
Mike Moore | Sprechen Sie Deutsch? Ich moechte | The world is coming to an
IXI Limited | gerne eine(n) Elektroniker(in) als | end. Please log off.
mi...@x.co.uk | Brieffreund(in) haben. |

Tim Olson

May 12, 1992, 12:30:07 PM
John Mashey writes
| The part that is missed is *not* that you split the registers, as
| few people miss that. What is missed is the drastic reduction of
| save/restores that happen by having plenty of registers and good
| allocation.
|
| As an example, recall that the experiments that led to Berkeley RISC II
| (and hence to SPARC) were based on experiments with the portable
| C compiler. I recall that somebody from Sun gave a talk at Stanford
| in which they observed that if they were doing it again, they might
| arrange things to have more smaller windows, rather than fewer
| bigger ones.

Yes, smaller, fixed-sized windows would probably be a win over
larger fixed-sized windows (however, variable sized windows are
better). For example, data from 710 non-leaf functions in our
benchmark suite show the following distribution of non-temporary
register allocation:

2: 48 ( 6.76%)
4: 188 ( 26.48%)
6: 169 ( 23.80%)
8: 107 ( 15.07%)
10: 46 ( 6.48%)
12: 40 ( 5.63%)
14: 29 ( 4.08%)
16: 16 ( 2.25%)
18: 9 ( 1.27%)
20: 10 ( 1.41%)
22: 11 ( 1.55%)
24: 8 ( 1.13%)
26: 5 ( 0.70%)
28: 7 ( 0.99%)
30: 8 ( 1.13%)
32: 2 ( 0.28%)
34: 2 ( 0.28%)
42: 3 ( 0.42%)
46: 1 ( 0.14%)
48: 1 ( 0.14%)

(non-temporary registers are allocated as register-pairs in the
Am29000 calling convention, so I only show even numbers).


--
Tim Olson
Advanced Micro Devices
t...@amd.com

mj...@minster.york.ac.uk

May 12, 1992, 1:01:09 PM
In article <DOCONNOR.92...@potato.sedona.intel.com> doco...@potato.sedona.intel.com (Dennis O'Connor) writes:
>mj...@minster.york.ac.uk writes:
>] In article <...> a...@nikhefh.nikhef.nl (James Barr) writes:
>] >When another function is invoked some of the registers have to be saved,
>] >since a function doesn't know what has been used and what is free;
>] >SPARC register windows or something analogous to /bin/sh's shift
>] >instruction would get around this.
>]
>] The _called_ function knows _exactly_ what is used and what is free -- so do
>] register saving/restoring inside the _callee_, not the _caller_.
>
>The calling procedure knows what registers have values that need to be
>preserved across the call. The called procedure knows which registers
>it will use. The set of registers that need to be saved and then
>restored is the intersection of these two sets. Minimizing this
>intersection reduces the memory traffic associated with a call/return.

But this isn't true when you've got separate compilation -- the callers of a
library-level subprogram are unknown until link time.

I am indebted to Charles Forsyth for the following information:

Under callee-save, the reg save code only appears once but it might save
registers needlessly.

Under caller-save, the caller saves the registers it uses before the call,
and restores them afterwards. An advantage of this method is that context
switches can be done in a few instructions, and system calls need not save
registers.
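
A sketch of where the saves land under each scheme -- not any particular
ABI; "rS" and "rT" below are just stand-in names for a callee-save and a
caller-save (scratch) register:

static int g(int v) { return v + 1; }   /* stand-in for some other routine */

int f_callee_save(int x)
{
    /* prologue: f saves rS exactly once, because it is about to use it */
    int kept = x * 3;       /* compiler keeps 'kept' in rS              */
    int y = g(x);           /* the call clobbers nothing f cares about  */
    return kept + y;        /* epilogue: restore rS exactly once        */
}

int f_caller_save(int x)
{
    int kept = x * 3;       /* compiler keeps 'kept' in scratch rT      */
    /* f must spill rT to the stack here, before the call ...           */
    int y = g(x);
    /* ... and reload it here, afterwards -- once per call site         */
    return kept + y;
}

int main(void) { return f_callee_save(1) + f_caller_save(2); }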

A paper discussing the schemes (and a hybrid of the two) is:

%T Methods for saving and restoring register values across function calls
%K save/restore, complex instruction set computing, function calls, CISC
machines, compiler, data-flow analysis, calling function, saving, disjoint
sets, register values
%O Softw. - Pract. Exp. (UK)
%J Software - Practice and Experience
%A Davidson, J.W.
%A Whalley, D.B.
%V 21
%N 2
%P 149-65
%D Feb. 1991

>Dennis O'Connor doco...@sedona.intel.com

Robert Walker

May 12, 1992, 6:23:58 PM
In article <1992May12....@dvorak.amd.com>, t...@monolith.amd.com (Tim Olson) writes:

--- deleted text -----

Examining these figures, surely there is a case for multi-size windows (a
subset of variable-size windows).

With window sizes of, say, 4, 8 or 12 registers (4 overlapped) out of 16
registers, surely the improved efficiency of register allocation is going
to be very advantageous!

Taking your figures above:

33.24% of all functions are handled with a window size of 4 registers
72.11% of all functions are handled with a window size of 8 registers
84.22% of all functions are handled with a window size of 12 registers
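
These figures can be reproduced from Tim's histogram, assuming a window of N
registers covers every function that allocated at most N non-temporaries
(modulo rounding in the last digit):

#include <stdio.h>

int main(void)
{
    /* (registers allocated, function count) pairs from Tim's table */
    static const int regs[]  = {  2,   4,   6,   8, 10, 12, 14, 16, 18, 20,
                                 22,  24,  26,  28, 30, 32, 34, 42, 46, 48 };
    static const int count[] = { 48, 188, 169, 107, 46, 40, 29, 16,  9, 10,
                                 11,   8,   5,   7,  8,  2,  2,  3,  1,  1 };
    int i, cum = 0;

    for (i = 0; i < 20; i++) {
        cum += count[i];
        if (regs[i] == 4 || regs[i] == 8 || regs[i] == 12)
            printf("<= %2d registers: %5.2f%% of the 710 functions\n",
                   regs[i], 100.0 * cum / 710);
    }
    return 0;
}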

By allocating smaller windows it is possible to reduce the number of wasted
registers per window and reduce usage of the overall register file (and thus
allow more windows to be formed from fewer registers).

Only 15.78% of functions in the above example will need to have a stack
activation frame formed for them.

This could of course be reduced by allowing additional window sizes.
E.g. windowing a total of 24 registers with 4 overlapped and window sizes
of 4, 8, 12, 16 and 20 (handling ~96% of all the above function calls).

Any comments on this technique??

Rob

Email at: r...@uk.ac.ed.dcs

Cliff Click

May 13, 1992, 10:12:53 AM
In article <33...@skye.dcs.ed.ac.uk> r...@dcs.ed.ac.uk (Robert Walker) writes:

> In article <1992May12....@dvorak.amd.com>, t...@monolith.amd.com (Tim Olson) writes:
>
> > Yes, smaller, fixed-sized windows would probably be a win over
> > larger fixed-sized windows (however, variable sized windows are
> > better). For example, data from 710 non-leaf functions in our
> > benchmark suite show the following distribution of non-temporary
> > register allocation:
>

> Examining these figures, surely there is a case for multi-size windows (a
> subset of variable-size windows).

This smells a lot like a Forth chip: allow the top 16 (32, whatever) stack
locations to be accessed "fast" (i.e. cached on chip, no aliasing with memory
allowed). When you need more registers you push stuff; when you need old
registers you pop stuff. An interrupt signals on-chip stack cache over-/
under-run so you can fixup (I guess at least 1 chip read or wrote stack
results "on-demand" instead of taking an interrupt).

A variety of such chips exist, but they never picked up by the major chip
houses (at least not that I heard of). Could it be because they were a pain
to compile for?

Cliff
--
The Sparc ABI has the most brain-damaged calling convention I've ever seen.
Cliff Click (cli...@cs.rice.edu) | Disclaimer: My lawyer made me say it.

adr...@cobblers.uk.sun.com

May 14, 1992, 8:47:40 AM
In article 72...@minster.york.ac.uk, mj...@minster.york.ac.uk () writes:
>>
>>The calling procedure knows what registers have values that need to be
>>preserved across the call. The called procedure knows which registers
>>it will use. The set of registers that need to be saved and then
>>restored is the intersection of these two sets. Minimizing this
>>intersection reduces the memory traffic associated with a call/return.
>
>But this isn't true when you've got separate compilation -- the callers of a
>library-level subprogram are unknown until link time.
>

It also isn't true when you have dynamically linked libraries, or in object-oriented
systems where the implementation of a method is hidden. Both these reasons were
cited in the original papers that described why Sun chose to include register windows
in SPARC (The SPARC Technical Papers - Springer Verlag ISBN 0-387-97634-5).

The key thing to realise about registers is that they are just another type of
cache, and as such can be characterised by their hit rate and miss cost.

The SPARC register window architecture provides a larger cache with an improved hit
rate compared to a basic 32-register file. It also has a larger miss cost. There
are algorithms and examples that will thrash any type of cache.

The arguments are on the following points:

1. Are the algorithms that cause problems for register windows common?
2. What is the improvement in hit rate (i.e. how many load/stores are saved)
3. What is the miss cost (i.e. register window spill/fill time)
4. How much state needs to be saved/restored on context switches (OS dependent)

The miss cost is a critical factor in this; in current SPARC implementations the
window spill/fill is done by a trap routine, the core of which is the save/load
of 16 registers. Let's summarise what happens on various CPUs. I have ignored
possible cache misses - but you would get them without windows anyway.


SPARCstation 2 - Cypress SPARC write-through cache with twin 64-bit write buffers
Spill takes 8 store-doubles @ 4 clocks each = 32 clocks, plus write stalls after the
first two stores. Fill takes 8 load-doubles @ 3 clocks each = 24 clocks.

SPARCserver 600MP - Cypress SPARC write-back cache
Same as SPARCstation 2 but doesn't stall on writes.

SuperSPARC directly on MBus with on-chip write-back cache
Spill takes 8 store-doubles @ 1 clock each = 8 clocks.
Fill takes 8 load-doubles @ 1 clock each = 8 clocks.

SuperSPARC with on-chip write-through cache with 8 64-bit write buffers and 2nd-level
write-back cache
Spill takes 8 store-doubles @ 1 clock each = 8 clocks.
Fill takes 8 load-doubles @ 1 clock each = 8 clocks.

SuperSPARC will substantially reduce the miss cost - the spill/fill time - which
strengthens the case for register windows.

regards Adrian

---
Adrian Cockcroft - adrian.c...@uk.sun.com or adrian.c...@sun.co.uk
Sun Microsystems, 306 Science Park, Milton Road, Cambridge CB4 4WG, UK
Phone +44 223 420421 - Fax +44 223 420257 Sun Mailstop: ECBG01

John Mashey

May 14, 1992, 6:22:19 PM
In article <utnhc...@grapevine.EBay.Sun.COM> adr...@cobblers.uk.sun.com writes:
>In article 72...@minster.york.ac.uk, mj...@minster.york.ac.uk () writes:
>>>
>>>The calling procedure knows what registers have values that need to be
>>>preserved across the call. The called procedure knows which registers
>>>it will use. The set of registers that need to be saved and then
>>>restored is the intersection of these two sets. Minimizing this
>>>intersection reduces the memory traffic associated with a call/return.
>>
>>But this isn't true when you've got separate compilation -- the callers of a
>>library-level subprogram are unknown until link time.

>SuperSPARC will substantially reduce the miss cost - the spill/fill time - which
>strengthens the case for register windows.

Actually, I don't think it makes a lot of difference....

Register windows are not exactly caches, but let's ignore that, in that it's
pretty easy to take an instruction & reference stream and model that,
whereas use of register windows vs other things changes the streams...

In any case:

1) There has been, over time, a reasonable amount of data published
(here or elsewhere, years ago) that shows with:
- 32 integer registers (i.e., MIPS, HP PA, IBM, Alpha, etc)
- Good global optimizer, but WITHOUT turning on inlining or
inter-procedural optimization
- Hence, applicable to dynlinked libraries or object-oriented stuff..
That the dynamic statistics say that you can get the number of registers
saved/restored per function call across benchmark suites like SPEC
(or the internal ones that MIPS has used, i.e., fairly large C and
FORTRAN programs), and actually including FP registers, down to
around 2. I've looked at dozens of profiles of programs to say this.
The reason for this is that leaf functions (or things where the compiler
can make them act like leaf functions for the most frequent calls),
seldom save/restore *any* registers.

2) Implementations (such as HP or MIPS, for example) that started with
split I & D caches normally have 1-cycle loads and 1-2 cycle stores,
depending on the implementation. Hence, a lower bound on the
cost is something like:
2 registers * (1-2 (per save) + 1 per restore)
= 4-6 cycles.
One expects that the restore should seldom cache-miss.
In a write-back cache, the first store may cause a writeback and/or
a memory read.
Of course, the instructions may I-cache miss.

3) The last time I looked, I observed that realistic C programs typically
averaged 80-100 instructions per function call; FORTRAN programs varied wildly
but were usually higher; I recall 200 instructions or so was common.
There are of course more cycles due to everything else, and a 1.2 CPI
would be typical; hence, what you're talking about is 100-120 cycles ...
of which 4-6 are spent saving/restoring registers.

So, as a gross rule of thumb, what's happening is that a MIPS-like
CPU is probably spending about 5% of the cycles saving/restoring
registers across function calls, in user-level code typical of
the early 1980s. It's probably less for FORTRAN code (I guess
3%) and maybe for kernel code.
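
A back-of-envelope version of that estimate, using only the figures quoted
above (nothing here is measured):

#include <stdio.h>

int main(void)
{
    double regs_per_call   = 2.0;   /* registers saved/restored per call   */
    double save_cycles     = 1.5;   /* 1-2 cycle stores: take the middle   */
    double restore_cycles  = 1.0;   /* 1-cycle loads                       */
    double instrs_per_call = 90.0;  /* 80-100 instructions per C call      */
    double cpi             = 1.2;

    double sr    = regs_per_call * (save_cycles + restore_cycles);
    double total = instrs_per_call * cpi;

    printf("save/restore cycles per call: %.1f\n", sr);      /* 5.0      */
    printf("total cycles per call:        %.0f\n", total);   /* 108      */
    printf("fraction:                     %.1f%%\n", 100.0 * sr / total);
    return 0;
}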

4) Now, most earlier SPARCs use joint caches, and hence have 2-cycle
loads and 2-3-cycle stores, and so the penalty *there* for
explicit save/restores, assuming the same compiler technology,
is 2 * (2 + 2-3) or 8-10 cycles; hence there are more cycles to
go after in that kind of implementation (although %-wise it's probably
the same, as the joint-cache SPARCs have higher CPIs than 1.2).
Put another way: if you have fast loads and stores, and don't have
to save/restore many registers anyway, you're less interested in
register windows.

5) Anyway, it all depends on the program behavior and the effects of
compilers, certainly as Patterson said quite often.
*IF* you have portable-c-compiler, joint caches, and your model of
programs is user-level C programs, THEN register windows or
stack caches or whatever are probably good ideas.
*IF* the code has gotten *more layered* than it was around 1980,
that is, that it goes up and down the levels faster, and hence hits overflows/
underflows faster [i.e., one must count a *complete* exception-handling
sequence, not just the register load/store time itself], then
the penalties may end up equaling the gains. Now, as it happens,
I suspect that GUIs and similar layered code have more levels,
but I could be wrong.

6) I've watched UNIX OS's with debuggers. One time, just following
a regular system call, I saw it quickly go down 12 levels, and then back
up the same 12 very quickly ... besides the several levels deep it already
was in the user code. Note that this was *not* getpid, which is dandy
for a register window machine because it is so shallow you need not
save/restore any windows. Anyway, it's amusing to watch something
like *stat* or *read* go at it.

For this reason, a peculiarity of register windows, not seen on other
machines, is that reworking code to make it more layered, rather than having
a slight performance impact, may surprise you, if adding 1-2 levels
causes the windows to thrash.

7) Just so I can defend Sun occasionally, I point out, once again,
that in UNIX, context-switching of register windows is pretty much
a non-issue, in workstations, and in fact, in most systems.

It certainly *does* cost cycles ... but these are heavily dominated by other
effects in UNIX. When you add everything up, I'd expect a SuperSPARC,
even at 33MHz, to require no more than 5-6 microseconds to
save/restore integer registers + FP registers (which of course are not
part of the windows), assuming you actually save/restore them.
(A 50MHz R4000 does this in 1-2 microseconds with similar cache-miss
assumptions, as it would save and restore ~48 registers.)

Of course, if you're running a large transaction system with light transactions
you might care.

The knees in curves that people have often seen in SPARC systems
have often come from the SRAM-based MMU designs used in earlier
systems, where there is some fixed number of MMU contexts,
and when you switch between N+1 tasks often, there is a major overhead
for unloading and loading the MMU ... but this has zero to do with
register windows.

Now, all of the above is appropriate for most UNIXes. Whether or not it's
appropriate for other OSs depends on what they do. For real-time
OS's, or things like telephone-switch OSs (zillions of tasks,
high context switch rate), people who do these things do *not* generally
like register windows much, to put it mildly. Note that the
common reply of "dedicate register windows to processes" doesn't
fly with most of these folks, because there are often too many processes
to do that ... plus you need the compilers to act different.
[You'd better *not* act like you've got register windows if a task only
has 1-2 of its own... :-)]

8) Another non-argument is that there is a huge difference in
register access time just because there are more of them.
This might be an effect in some implementations in the future
(i.e., with register renaming, some implementations might get to be
awkward, perhaps; perhaps not), but I do not believe has been a major
issue so far. [I doubt this has very much to do with SuperSPARC's
getting 40MHz so far... I'd be happy to be proved wrong, of course.]

Anyway: bottom line is:

1) Register windows go after about 5% of cycles that integer C user
programs spend saving/restoring registers.
a) It is fairly easy to model how much might be saved,
and hence the motivation for doing it in the first place.

2) They lose some of it back via:
a) User-level code causes overflow/underflow traps
b) Heavy use of OS code (which seems to go up and down call
stacks rapidly) causes more overflow/underflow traps
and, way down the list in UNIX
c) A little extra overhead on context switch.

It is reasonable to model a), and trivial to compute c), and harder
to figure out what's going on with b). It does say you'd
expect register windows to be at their best for single-thread user
programs.

Carl Wuebker

May 12, 1992, 12:43:03 PM

> Protocols that define one set of registers as the preserve-responsibility
> of the caller, and another as the responsibility of the callee, are also
> meant to minimize the intersection of in-use-by-caller/callee-used registers.
> The caller will try not to have the registers in its responsibility-set
> in use at the time of a call, by clever register allocation. The callee
> will try to use only those registers that were the caller's responsibility.
> If both the caller's objective and the callee's objective can be
> accomplished, no register saves/restores are needed. Of course, once the
> callee becomes a caller itself, things break down a bit.

Dennis' explanation is one of the clearest I've seen. In most cases,
programs call subroutines by name (e.g. pr_string("hello, World\n")), and a
compiler which is given a chance to compile all routines could allocate
registers efficiently. However, I can think of several cases where a compiler
is more limited:

* Calling library routines (compiler can't remap registers)
* Calling via pointers (compiler doesn't usually know which routine
is being called)
* Recursion (subroutine uses same set of registers as
itself)

A fast solution to this problem might be a "register block pointer"
(similar to a stack pointer, but for registers). This is also useful for
interrupt service routines... it would work something like this:

* Assume a CPU with 64 internal registers, divided into 8 blocks of 8
each

* A "register block pointer", also in the CPU, specifies the upper 3
bits of the register

* The main routine starts with register block 0
When it calls a subroutine or interrupts, the register block pointer
is automatically incremented, so the subroutine or ISR gets a
"clean slate" of registers

* To handle recursion or deep nesting, a register block pointer overflow
(i.e. 7->0) would save (say) the first 2 blocks of registers and
rotate the rest of them... an underflow would reload the registers

* Relative instructions would allow the subroutine to access the
previous 1 or 2 register blocks

The proposal is not original with me -- it is fairly similar to the
Burroughs 5500 or HP 3000 register stack... just uses blocks to avoid an
adder in accessing the registers. Of course, this sort of thing starts
to look more like a CISC (many cycles in certain cases) machine...
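
A toy C model of the mechanism -- purely hypothetical, not any real CPU; for
simplicity it spills one block at a time rather than two, and it ignores the
relative instructions for reaching previous blocks:

#include <stdio.h>

#define NBLOCKS 8                     /* 64 registers = 8 blocks of 8      */
#define BLKSIZE 8

static int regs[NBLOCKS][BLKSIZE];    /* the physical register file        */
static int rbp = 0;                   /* register block pointer            */
static int spill[1024][BLKSIZE];      /* backing store in memory           */
static int spilled = 0;               /* blocks currently living in memory */

static void call(void)                /* what CALL would do in hardware    */
{
    if (rbp == NBLOCKS - 1) {         /* overflow: spill the oldest block  */
        int b, r;
        for (r = 0; r < BLKSIZE; r++)
            spill[spilled][r] = regs[0][r];
        spilled++;
        for (b = 0; b < NBLOCKS - 1; b++)      /* rotate the rest down     */
            for (r = 0; r < BLKSIZE; r++)
                regs[b][r] = regs[b + 1][r];
        /* rbp stays at the top; the callee reuses the block just freed    */
    } else {
        rbp++;                        /* callee gets a clean slate         */
    }
}

static void ret(void)                 /* what RETURN would do              */
{
    if (rbp == 0 && spilled > 0) {    /* underflow: reload caller's block  */
        int r;
        spilled--;
        for (r = 0; r < BLKSIZE; r++)
            regs[0][r] = spill[spilled][r];
    } else if (rbp > 0) {
        rbp--;
    }
}

int main(void)
{
    int i;
    for (i = 0; i < 20; i++) call();  /* nest far deeper than 8 blocks     */
    for (i = 0; i < 20; i++) ret();
    printf("rbp=%d, blocks still spilled=%d\n", rbp, spilled);  /* 0 and 0 */
    return 0;
}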

Disclaimer: I have very little to do with HP's RISC CPUs -- these are
my opinions alone.

Thanks,
Carl Wuebker * HP Roseville * c...@hprnd.rose.hp.com * (916) 785-4296

Dennis O'Connor

May 16, 1992, 2:10:14 AM

c...@hprnd.rose.hp.com (Carl Wuebker) writes:
] [ about split caller/callee save protocols ]
] Dennis' explanation is one of the clearest I've seen.

Aw, gosh. Thanks.

] In most cases, programs call subroutines by name, and a compiler which is
] given a chance to compile all routines could allocate registers efficiently.
] However, I can think of several cases where a compiler is more limited:
]
] * Calling library routines (compiler can't remap registers)
] * Calling via pointers (compiler doesn't usually know which routine
] is being called)
] * Recursion (subroutine uses same set of registers as
] itself)
]
] A fast solution to this problem might be a "register block pointer"
] (similar to a stack pointer, but for registers).

]
] [ ... feature set edited out ... ]
]
] * Relative instructions would allow the subroutine to access the
] previous 1 or 2 register blocks

What you describe, with the exception of an efficient mechanism for
performing the above-quoted previous-block-reference feature, is pretty
much what the Intel i960(tm) processor family does if you use the "call"
instruction. Exclusive of spills and restores of register sets, which
you alluded to, this instruction takes 2 cycles on an i960MM or i960MX,
and 4 cycles on an i960CA or i960CF. The "register block" size is 16 regs.

A recent comp.arch thread discussed the wisdom of this feature.

Note that the i960 also defines a RISC-style single-cycle branch-and-link
instruction for "calling" subroutines. The compiler or assembly language
programmer is free to choose which method to use.

Tim Olson

May 16, 1992, 3:33:39 PM
Dennis O'Connor writes

| What you describe, with the exception of an efficient mechanism for
| performing the above-quoted previous-block-reference feature, is pretty
| much what the Intel i960(tm) processor family does if you use the "call"
| instruction. Exclusive of spills and restores of register sets, which
| you alluded to, this instruction takes 2 cycles on an i960MM or i960MX,
| and 4 cycles on an i960CA or i960CF. The "register block" size is 16 regs.
|
| A recent comp.arch thread discussed the wisdom of this feature.
|
| Note that the i960 also defines a RISC-style single-cycle branch-and-link
| instruction for "calling" subroutines. The compiler or assembly language
| programmer is free to choose which method to use.

Wouldn't the decision actually be pushed down to the linker?

The "CALL" instruction on the i960 binds the "branch & link"
semantics with the "allocate a new register window" semantics, so
different instructions are required to call leaf functions and
non-leaf functions. However, the compiler does not necessarily
know the type of function it has to generate the call for.
