Ideal number of CPU registers (various scenarios)

James Harris

unread,

Sep 2, 2013, 10:05:17 AM9/2/13

to

Is there any consensus on the ideal number of integer and FP registers a CPU
*should* have in order to efficiently execute most common algorithms?

Some thoughts on the basics:

Disadvantages of smaller register sets
* Register spilling needed for more algorithms.
* Fewer regs available for parameter passing.

Disadvantages of larger register sets
* More state to save and load
- increased interrupt latency
- slower task switch.
* Slower to access (capacitive effects).
* More bits needed in the instruction space.

For example, to hold integers and pointers:
* 15 integer registers for load-store architectures.
* 7 integer registers for architectures which allow one or more operands to
be in memory.

To hold floating point values
* 16 FP registers for load-store architectures.
* 8 FP registers where one or more operands can be in memory.

(For the integer cases I have reserved one for a stack pointer.)

For whatever the answer is for the above how would the answer change if,
rather than having one bank of integer registers and one bank of FP
registers, the integer and FP instructions operated on the same register
set?

James

MitchAlsup

unread,

Sep 2, 2013, 12:42:14 PM9/2/13

to

On Monday, September 2, 2013 9:05:17 AM UTC-5, James Harris wrote:
> Is there any consensus on the ideal number of integer and FP registers a CPU
> *should* have in order to efficiently execute most common algorithms?
>

In general, there is no consensus; however, most of the RISC machines used 32 Int and 32 FP registers. So that may be some kind of consensus.

>
> Some thoughts on the basics:
>
> Disadvantages of smaller register sets
>
> * Register spilling needed for more algorithms.

Stak machines (0-registers) suffer little from this. Also spilling to the stack results in good hit rates for the reload operation, and few stack misses for the write operation.

> * Fewer regs available for parameter passing.

4-5 registers is over the knee of the curve for parameter passing.

>
>
> Disadvantages of larger register sets
> * More state to save and load
> - increased interrupt latency

Not if the interrupt routine has a spare set of registers!
> - slower task switch.
Not if the register swap is done in the background!

>
> * Slower to access (capacitive effects).

Easily hidden in the pipeline.

> For example, to hold integers and pointers:
>
> * 15 integer registers for load-store architectures.

How amny would you need for a CDC6600-like architecture--which is sort-of load/store and different at the same time.

> * 7 integer registers for architectures which allow one or more operands to

I would venture closer to 12 than 7.

> be in memory.
>
>
>
> To hold floating point values
> * 16 FP registers for load-store architectures.

My guess is that the consensus, here, is that one wants at least 32 double precision FP registers.

> * 8 FP registers where one or more operands can be in memory.

Closer to 24.

>
> (For the integer cases I have reserved one for a stack pointer.)
>
> For whatever the answer is for the above how would the answer change if,
> rather than having one bank of integer registers and one bank of FP
> registers, the integer and FP instructions operated on the same register
> set?

I have always been a fan of combined registers, however, in a combined register set one will want at least 32 double precision registers. This is well over the knee of the curve for integers, and close enough to the knee of the curve for FP. It is no where close to the knee of the curve when registers are used programatically to absorb memory latency (where you end up needing 256+ registers.

Mitch

Stephen Sprunk

unread,

Sep 2, 2013, 3:32:24 PM9/2/13

to

On 02-Sep-13 11:42, MitchAlsup wrote:
> On Monday, September 2, 2013 9:05:17 AM UTC-5, James Harris wrote:
>> Is there any consensus on the ideal number of integer and FP
>> registers a CPU *should* have in order to efficiently execute most
>> common algorithms?
>
> In general, there is no consensus; however, most of the RISC machines
> used 32 Int and 32 FP registers. So that may be some kind of
> consensus.

I've read somewhere (here?) that 16 registers is roughly the tipping
point, but that exceeds the capacity of a 16-bit instruction encoding,
and once you go to a 32-bit encoding, there's enough bits left over that
you might as well do 32 registers.

S

--
Stephen Sprunk "God does not play dice." --Albert Einstein
CCIE #3723 "God is an inveterate gambler, and He throws the
K5SSS dice at every possible opportunity." --Stephen Hawking

Anton Ertl

unread,

Sep 3, 2013, 8:15:46 AM9/3/13

to

"James Harris" <james.h...@gmail.com> writes:
>Is there any consensus on the ideal number of integer and FP registers a CPU
>*should* have in order to efficiently execute most common algorithms?

Wasn't there something like:

"Imagine an architecture with so many register that you never spill.
Imagine an architecture with 16 registers."

A few years later:

"Imagine an architecture with so many register that you never spill.
Imagine an architecture with 32 registers."

It depends not just on the algorithm, but also on the
(micro)architecture and on the compiler how many registers you need.
If the instructions have long latencies, you software-pipeline loops,
and use modulo variable renaming to deal with write-after-read
dependencies, you can have the need for very many registers
(programmer-visible for in-order processors, or physical for OoO
processors with hardware register renaming).

OoO hardware does the equivalent of software pipelining in hardware,
and I think they have 100-200 physical integer and a similar number of
FP registers these days. IA-64 is intended to do real software
pipelining (with rotating register files instead of modulo variable
renaming), and it has 128 integer and 128 FP registers.

- anton
--
M. Anton Ertl Some things have to be seen to be believed
an...@mips.complang.tuwien.ac.at Most things have to be believed to be seen
http://www.complang.tuwien.ac.at/anton/home.html

Noob

unread,

Sep 3, 2013, 10:22:24 AM9/3/13

to

James Harris wrote:

> Is there any consensus on the ideal number of integer and FP
> registers a CPU *should* have in order to efficiently execute
> most common algorithms?

I think 42 is about right.

Ivan Godard

unread,

Sep 3, 2013, 12:14:58 PM9/3/13

to

On 9/3/2013 5:15 AM, Anton Ertl wrote:
> "James Harris" <james.h...@gmail.com> writes:
>> Is there any consensus on the ideal number of integer and FP registers a CPU
>> *should* have in order to efficiently execute most common algorithms?
>
> Wasn't there something like:
>
> "Imagine an architecture with so many register that you never spill.
> Imagine an architecture with 16 registers."
>
> A few years later:
>
> "Imagine an architecture with so many register that you never spill.
> Imagine an architecture with 32 registers."
>
> It depends not just on the algorithm, but also on the
> (micro)architecture and on the compiler how many registers you need.
> If the instructions have long latencies, you software-pipeline loops,
> and use modulo variable renaming to deal with write-after-read
> dependencies, you can have the need for very many registers
> (programmer-visible for in-order processors, or physical for OoO
> processors with hardware register renaming).
>
> OoO hardware does the equivalent of software pipelining in hardware,
> and I think they have 100-200 physical integer and a similar number of
> FP registers these days. IA-64 is intended to do real software
> pipelining (with rotating register files instead of modulo variable
> renaming), and it has 128 integer and 128 FP registers.
>
> - anton
>

Well, sort of, but heavily dependent on micro-architecture.

All that's really needed for the program model of a software pipeline is
the product of the number of loop-carried values and the iteration
stride. Thus in:
a[i] = b[i]*b[i-5]+c[i]*c[i-5]
there are two loop-carried values (from a and b) and the stride is 5, so
10 registers would be needed.

The physical implementation does not change the loop-carried count from
the program model, but increases the stride by the latency of the
longest carried operation. In this example that latency is probably four
cycles or so, so the stride increases to 9 and 18 registers are needed
for the body. Then there will be a register or two for the iteration
control (the variable i) and a couple for scratch storage in the
expression, but 32 will e plenty.

Notice that I assumed that the multiply was the longest-latency
operation of the loop. That assumption implies that the loads have less
than four-cycle latency, i.e. will hit in the L$1. If that assumption is
false and the data is in the L$2 (10 cycle latency) then the body needs
30 registers plius extra, and 32 will be at the ragged edhe of trouble.
If the data is in 300+ cycle DRAM then 600+ registers are needed, which
is well beyond practicality.

But this discussion assumes that the only way to do pipelining is by
using registers for values, and that's a micro-architectural choice.
Many machines use special streaming mechanisms to avoid using registers
as stream buffers in the pipeline: DAE machines like the Astranautics
ZS-1; DMA machines like the Sony Cell; coupled co-processor machines
like the Transputer; prefetch machinery all over; others. Registers are
not the only way to deal with long pipes and keep them moving, and IMO
certainly not the best way.

If you are not trying to hide memory latency with them, 32 seems
adequate for machines with issue widths of less than eight, and even for
the wide end of the Mill spectrum (30+ issue width) 64 seems to handle
everything but pathological test cases.

And if you *are* trying to hide DRAM latency, then you can't put enough
on the chip anyway, so the answer is "all you can fit in the power budget".

timca...@aol.com

unread,

Sep 3, 2013, 12:18:46 PM9/3/13

to

You might be right:

http://arxiv.org/ftp/arxiv/papers/1205/1205.1871.pdf

- Tim

Terje Mathisen

unread,

Sep 3, 2013, 4:30:03 PM9/3/13

to

Interesting paper, thanks for the link!

Terje

--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"

John D. McCalpin

unread,

Sep 3, 2013, 5:46:10 PM9/3/13

to

On Tuesday, September 3, 2013 11:14:58 AM UTC-5, Ivan Godard wrote:
> On 9/3/2013 5:15 AM, Anton Ertl wrote:
>

Sometimes it is important to remember that the number of registers needed for a particular pipeline should be computed using the "occupancy" rather than the "latency". In many cases these are similar, but they don't have to be.

The failure to check the occupancy led to an embarrassment in the IBM POWER4 processor. The Floating-Point latency was 6 cycles and the direct computation of the number of registers needed appeared to ne significantly lower than the number of physical rename registers provided by the hardware. Unfortunately, the *occupancy* of the registers was more like 14 cycles, and the number of physical registers was nowhere near enough to maintain issue of two four-register floating-point multiply-add instructions per cycle. This limited the LINPACK performance (dominated by DGEMM, which has an inner loop composed of several sets of two floating-point Multiply-Add instructions and one Load instruction) to about 69% of peak. Since people are used to seeing numbers in the 85%-95% range, this value was a problem. (The overall performance was still roughly the best available at the time, given the dual cores and the high clock speed, but customers and competitors will give a vendor grief over any metric that falls below expectations.)

POWER5 increased the number of rename registers significantly, so that DGEMM ran very efficiently in single-threaded mode. It was a few registers short of the number needed when running one thread in multi-threaded mode, but this only cost a few percent in DGEMM performance.

My (limited) understanding is that the ~14 cycle occupancy in POWER4 and POWER5 came from a combination of decisions related to the microarchitecture for recovery in the event of error conditions.

POWER6 changed this aspect of the architecture completely so that the register occupancy was much shorter.

James Harris

unread,

Sep 4, 2013, 6:55:48 AM9/4/13

to

"MitchAlsup" <Mitch...@aol.com> wrote in message
news:5c8cbd5a-c16a-4236...@googlegroups.com...

> On Monday, September 2, 2013 9:05:17 AM UTC-5, James Harris wrote:
> > Is there any consensus on the ideal number of integer and FP registers a
> > CPU
> > *should* have in order to efficiently execute most common algorithms?
> >
> In general, there is no consensus; however, most of the RISC machines used
> 32 Int and 32 FP registers. So that may be some kind of consensus.

Thanks, though I was thinking more about consensus from the viewpoint of
algorithms rather than implementations. I thought there might be a consensus
on how many registers a CPU ought to provide in order to maintain values
over the iterations of the loops of common algorithms. Naturally, it will
depend on the algorithm but I presumed there might be an accepted number
that CPU designers aim for.

I don't know what the designers of the various RISC machines thought but
w.r.t. their common provision of 32 registers it seems at least possible
that once the decision had been made for some of those machines to use 32
bits per instruction there was the temptation to provide more registers
simply because there was so much encoding space available. Even with three
register operands per instruction and five bits per register there are still
17 bits left to encode the rest of the instruction. Apart from the encoding
of immediate values that's a lot of room to encode opcode and other stuff.

> >
> > Some thoughts on the basics:
> >
> > Disadvantages of smaller register sets
> >
> > * Register spilling needed for more algorithms.

> Stak machines (0-registers) suffer little from this. Also spilling to the
> stack results in good hit rates for the reload operation, and few stack
> misses for the write operation.

Stack machines are great in some ways. They might be able to handle the top
part of the evaluation stack in registers and have the potential to be very
efficient at spilling/restoring the lower parts of the stack because all
such accesses can be a cache line at a time. They can also encode
instructions in a tiny amount of space. However, AFAICS they would tend to
have to process instructions serially because all ALU ops work on one common
stack. To allow overlapping of instructions, AFAIK CPU designs have
generally gone over to a set-of-registers model.

> > * Fewer regs available for parameter passing.
> 4-5 registers is over the knee of the curve for parameter passing.
> >
> >
> > Disadvantages of larger register sets
> > * More state to save and load
> > - increased interrupt latency

> Not if the interrupt routine has a spare set of registers!

> > - slower task switch.

> Not if the register swap is done in the background!

The idea of swapping registers in the background sounds like it could be
helpful. Is there an accepted name for it, something that can be
web-searched?

> >
> > * Slower to access (capacitive effects).

> Easily hidden in the pipeline.

> > For example, to hold integers and pointers:
> >
> > * 15 integer registers for load-store architectures.

> How amny would you need for a CDC6600-like architecture--which is sort-of
> load/store and different at the same time.

> > * 7 integer registers for architectures which allow one or more operands
> > to

> I would venture closer to 12 than 7.

I presume that means that 12 values may need to change during each iteration
of certain loops with any other values being read-only and coming from
memory/cache. Twelve is a surprisingly large number but it's the kind of
info I wanted. It suggests a minimum of 16 integer registers would be
desirable even for a register-memory architecture.

> > be in memory.
> >
> >
> >
> > To hold floating point values
> > * 16 FP registers for load-store architectures.

> My guess is that the consensus, here, is that one wants at least 32 double
> precision FP registers.

> > * 8 FP registers where one or more operands can be in memory.

> Closer to 24.

Thanks.

> >
> > (For the integer cases I have reserved one for a stack pointer.)
> >
> > For whatever the answer is for the above how would the answer change if,
> > rather than having one bank of integer registers and one bank of FP
> > registers, the integer and FP instructions operated on the same register
> > set?
>
> I have always been a fan of combined registers, however, in a combined
> register set one will want at least 32 double precision registers. This is
> well over the knee of the curve for integers, and close enough to the knee
> of the curve for FP. It is no where close to the knee of the curve when
> registers are used programatically to absorb memory latency (where you end
> up needing 256+ registers.

I am thinking principally about encoding space at the moment, i.e. how many
bits to designate in each instruction to encode a register. I have some
other ideas about dealing with memory latency which may tie in with your
comment, above, on swapping registers in the background.

Good to hear that combined registers are acceptable. I wasn't sure whether
my suggesting them would invoke howls of derision. ;-)

James

James Harris

unread,

Sep 4, 2013, 7:01:39 AM9/4/13

to

"Anton Ertl" <an...@mips.complang.tuwien.ac.at> wrote in message
news:2013Sep...@mips.complang.tuwien.ac.at...

> "James Harris" <james.h...@gmail.com> writes:
>>Is there any consensus on the ideal number of integer and FP registers a
>>CPU
>>*should* have in order to efficiently execute most common algorithms?
>
> Wasn't there something like:
>
> "Imagine an architecture with so many register that you never spill.
> Imagine an architecture with 16 registers."
>
> A few years later:
>
> "Imagine an architecture with so many register that you never spill.
> Imagine an architecture with 32 registers."
>
> It depends not just on the algorithm, but also on the
> (micro)architecture and on the compiler how many registers you need.
> If the instructions have long latencies, you software-pipeline loops,
> and use modulo variable renaming to deal with write-after-read
> dependencies, you can have the need for very many registers
> (programmer-visible for in-order processors, or physical for OoO
> processors with hardware register renaming).
>
> OoO hardware does the equivalent of software pipelining in hardware,
> and I think they have 100-200 physical integer and a similar number of
> FP registers these days. IA-64 is intended to do real software
> pipelining (with rotating register files instead of modulo variable
> renaming), and it has 128 integer and 128 FP registers.

The biggest trouble with having so many architectural registers, ISTM, is
the time it takes to change them over on a context switch. I am not sure how
the physical register file behaves when the context changes. Can the PRF
still hold values from one task while the CPU starts executing a different
one, i.e. a different program or an interrupt?

James