Number of registers


Russell Wallace

Mar 20, 2018, 6:50:57 AM
to RISC-V ISA Dev
I've been looking over the RISC-V ISA, and most of the design decisions make a lot of sense in terms of 'we now know from experience what the best way to do this is'. One that I'm still trying to figure out, however, is the number of registers.

Traditionally, RISC CPUs stuck to 32 registers based on the theory that this was sufficient for most code, but later, out of order CPUs found it was most definitely not enough. Unable to increase the number of architectural registers, they enabled a larger number of physical registers by implementing register renaming, presumably at considerable cost.

Given that high-performance implementations will have a larger number of physical registers anyway, why not make them architectural and go for 64 or 128 architectural registers? This would decrease the number of values that need to be spilled to memory in the admittedly small minority of programs that need the extra registers, and it would also seem, in at least some cases, to avoid the need for expensive register renaming hardware.

What am I missing?

David Chisnall

Mar 20, 2018, 7:03:59 AM
to Russell Wallace, RISC-V ISA Dev
On 20 Mar 2018, at 10:50, Russell Wallace <russell...@gmail.com> wrote:
>
> Traditionally, RISC CPUs stuck to 32 registers based on the theory that this was sufficient for most code, but later, out of order CPUs found it was most definitely not enough. Unable to increase the number of architectural registers, they enabled a larger number of physical registers by implementing register renaming, presumably at considerable cost.

I’ve not come across that interpretation before. Registers, at an ISA level, are an abstraction for expressing the liveness of values. The ideal number of registers is the maximum number of live values in code. There’s been a lot of experimentation that has shown that for almost all input sources there is no difference between the code generated for 16 and 32 registers (the compiler simply doesn’t use half of them). The main exceptions are crypto algorithms (though these are increasingly implemented as single instructions, to avoid timing side channels, so that’s less important now than 5-10 years ago).

The number of architectural registers also defines the *minimum* number of physical registers, but the number of required physical registers is defined by both the number of live values at any given point and the size of the speculation window. If you are speculatively executing a large number of instructions into the ‘future’ then you need to have enough physical registers to describe all of the live values at all points in that window.

Adding more architectural registers doesn’t reduce the number of rename registers required for a RISC architecture. Unless you go to a VLIW / EPIC model (which has its own disadvantages), the number of live registers at any given point is bounded by the number of architectural registers, and the number of rename registers is bounded by the number of architectural registers written during a speculation window plus the size of the architectural register set.
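
To get a feel for the scale that implies, here is a purely illustrative calculation (the window depth and write fraction are made-up numbers, not a claim about any particular core): with 32 architectural registers, a 128-instruction speculation window, and roughly 3 out of 4 in-flight instructions writing a register, you need on the order of

    32 + (0.75 * 128) = 32 + 96 = 128 physical registers

regardless of how many of the 32 architectural names the compiler actually used.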

David

Message has been deleted

David Chisnall

Mar 20, 2018, 7:48:45 AM
to Russell Wallace, RISC-V ISA Dev
On 20 Mar 2018, at 11:31, Russell Wallace <russell...@gmail.com> wrote:
>
> So if I'm understanding you correctly, it's basically that there is no point in going halfway. If you are going to commit to outright VLIW/EPIC, great, have 128 architectural registers. But unless you are going to make that commitment (and given the track record of such, it's reasonable to decide to avoid that), having more architectural registers does nothing; you still have to have the register renaming and OOO logic just as if you had stuck with 32?

Worse, you’ll likely need more rename logic if you have 64 architectural registers and speculative execution than if you have 32 registers and speculative execution. Even VLIW doesn’t really let you avoid speculative execution, because the compiler can’t expose opportunities for parallelism that cross conditional branches. The only high-performance processors that avoid register renaming are things like GPUs, which rely on thread-level parallelism instead of instruction-level parallelism.

Architectural registers are actually a pretty poor abstraction. The compiler represents the program in a static single assignment (SSA) form that captures value lifetimes and data dependencies. It then transforms this into a representation that pretends you have a finite, fixed set of physical registers and inserts moves to and from memory if it needs more. In a processor doing register renaming, the hardware then attempts to reconstruct an SSA representation. Using a fixed number of registers as an intermediate representation between a producer of SSA and a consumer of SSA is a pretty bad idea; it’s just the best one that’s been found to work so far...
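
As a toy illustration of that round trip (not any particular compiler's output; the function is made up):

    C source:  int f(int a, int b) { return (a + b) * a; }

    SSA form inside the compiler (each value defined exactly once):
        %t1 = add %a, %b
        %t2 = mul %t1, %a
        ret  %t2

    After allocation onto RISC-V's fixed register set:
        add t0, a0, a1    # %t1 -> t0; a0 (holding %a) is still live
        mul a0, t0, a0    # %t2 -> a0, the return-value register
        ret

    An out-of-order core then renames those destinations onto per-value
    physical registers, effectively reconstructing %t1 and %t2.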

David

Cesar Eduardo Barros

Mar 20, 2018, 8:00:20 AM
to Russell Wallace, RISC-V ISA Dev
Em 20-03-2018 07:50, Russell Wallace escreveu:
> I've been looking over the RISC-V ISA, and most of the design decisions
> make a lot of sense in terms of 'we now know from experience what the
> best way to do this is'. One that I'm still trying to figure out,
> however, is the number of registers.

The "official" explanation is on pages 9-10 of riscv-spec-2.2.pdf.

> Traditionally, RISC CPUs stuck to 32 registers based on the theory that
> this was sufficient for most code, but later, out of order CPUs found it
> was most definitely not enough. Unable to increase the number of
> architectural registers, they enabled a larger number of physical
> registers by implementing register renaming, presumably at considerable
> cost.

The reason for register renaming is not to have more registers; the
reason for register renaming is to decouple multiple uses of a register,
so the instructions can be executed in parallel without false
dependencies (which is why it is found on out-of-order implementations).
Without register renaming, you'd be depending on the compiler to avoid
reusing registers, but as shown by the Itanium, "magical" compilers
don't quite exist. Furthermore, some registers are fixed by the ABI,
like a0 for the first argument, so it's hard to avoid reusing them;
register renaming decouples "virtual" registers like a0 from the
"physical" registers.

(Yes, Itanium used something like register windows for calls, but one
call followed by another call at the same level would still reuse
registers.)

> Given that high-performance implementations will have a larger number of
> physical registers anyway, why not make them architectural, go for 64 or
> 128 architectural registers? This would decrease the number of values
> that need to be spilled to memory in the admittedly minority of programs
> that need the extra for that purpose, but it would also seem in at least
> some cases to avoid the need for expensive register renaming hardware.
>
> What am I missing?

First, as I mentioned above, that it wouldn't quite avoid the need for
register renaming.

And there are several costs to having more ABI-visible registers:

- Even small implementations without out-of-order execution would be forced to
have all these registers. Even 32 registers is too many for some, which
is why people wanted the RV32E variant, which has only 16 registers.

- Each doubling of the number of registers adds one bit to each register
field in the instruction encoding. For 3-operand instructions, that is 3
bits. For instructions with 2 register operands and an immediate (like
branches), each doubling of the number of registers would cut the reach
of the immediate to a quarter of what it was (a worked example follows the list).

- All registers have to be saved and restored on a context switch.
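
To put numbers on the encoding point, using the existing RV32 conditional-branch (B-type) format and assuming the extra register-specifier bits come out of the immediate field:

    Today, with 32 registers:   7 (opcode) + 3 (funct3) + 5 (rs1) + 5 (rs2) + 12 (imm) = 32 bits
                                12 immediate bits give a +/-4 KiB branch reach.
    With 64 registers:          7 + 3 + 6 + 6 + 10 = 32 bits
                                10 immediate bits give a +/-1 KiB reach, i.e. a quarter.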

32 registers is plenty. Even in math-heavy code, 32 registers can hold a
full 4x4 matrix (16 registers), the 5 "special function" registers (x0,
ra, sp, gp, tp), a vector to be applied to the matrix (4 registers), and
still have a few to spare for intermediate values. (This is not an
artificial example; the core of the BLAKE2 hash and the ChaCha20 cipher
is a 4x4 integer matrix.)
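
Tallying that explicitly:

    4x4 state matrix       16
    x0, ra, sp, gp, tp      5
    input vector            4
                          ----
                           25 of 32, leaving 7 for intermediates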

--
Cesar Eduardo Barros
ces...@cesarb.eti.br

Alex Elsayed

Mar 22, 2018, 4:20:31 AM
to RISC-V ISA Dev
Just realized I sent this off-list by accident.

On Tue, Mar 20, 2018, 04:48 David Chisnall <David.C...@cl.cam.ac.uk>
wrote:

<snip>

> Architectural registers are actually a pretty poor abstraction. The
> compiler represents the program in terms of a static single assignment (SSA)
> form that captures value lifetime and data dependencies. It then transforms
> this into a representation that pretends that you have a finite, fixed, set
> of physical registers and inserts moves to and from memory if it needs
> more. In a processor doing register renaming, the hardware then attempts to
> reconstruct an SSA representation. Using a fixed number of registers as an
> intermediate representation between a producer of SSA and a consumer of SSA
> is a pretty bad idea, it’s just the best one that’s been found to work so
> far...

There's been some pretty interesting research on queue machines in the last
few years:

"Efficient compilation for queue size constrained queue processors"
http://web-ext.u-aizu.ac.jp/~benab/publications/journals/parco09/Preprint_PARCO-D-07-00103.9-33.pdf

"Compiler Support for Code Size Reduction Using a Queue-Based Processor"
https://www.researchgate.net/profile/Abderazek_Ben_Abdallah/publication/220284148_Compiler_Support_for_Code_Size_Reduction_Using_a_Queue-Based_Processor/links/02e7e517aa5a95818e000000.pdf

"Queue Machines for Next Generation Computer Systems"
http://web-ext.u-aizu.ac.jp/~benab/publications/conferences/co/sowa08IWMST.pdf

"On a Practical Queue Execution Model"
https://www.researchgate.net/profile/Abderazek_Ben_Abdallah/publication/283727503_On_a_Practical_Queue_Execution_Model/links/5645de3b08aef646e6cd77b4.pdf

"Parallelizing Queue Compiler"
https://uec.repo.nii.ac.jp/index.php?action=pages_view_main&active_action=repository_action_common_download&item_id=1130&item_no=1&attribute_id=20&file_no=1&page_id=13&block_id=21

However, the only group I know of working to bring such research to
application is the Mill team, and there are trends that leave me
unenthusiastic about their odds:

"An update on the architecture purge"
https://lwn.net/Articles/749292/


> Bergmann pointed out a pattern in the architectures that are on their way out:
>
>> In the end, it seems that while the eight architectures are
>> extremely different, they all suffered the same fate: There
>> was one company in charge of an SoC line, a CPU
>> microarchitecture and a software ecosystem, which was
>> more costly than licensing newer off-the-shelf CPU cores
>> from a third party (typically ARM, MIPS, or RISC-V).


Bruce Hoult

Mar 22, 2018, 7:08:49 AM
to Alex Elsayed, RISC-V ISA Dev
It seems to me the Mill's "belt" is equivalent to doing register allocation on a conventional register machine using the following algorithm:

1) each new IR virtual register is allocated to the next higher architectural register modulo the number of registers.

2) if the virtual register already in the architectural register about to be written is still live then insert code to copy it to a "scratchpad" area in SRAM.
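
As a made-up trace of that scheme, with just 4 architectural registers r0-r3 and values v1..v6 produced in order, where v1 is still live when v5 is produced:

    v1 -> r0
    v2 -> r1
    v3 -> r2
    v4 -> r3
    v5 -> r0    (v1 is still live, so a copy of r0 to the scratchpad is inserted first)
    v6 -> r1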

This pretty clearly makes less efficient use of a given number of architectural registers (or belt slots) from a software point of view than the usual, more sophisticated register allocation algorithms do. Maybe there are hardware compensations, allowing the belt size to be larger than the usual number of registers.

Also, function calls don't fit easily into the above description. The next available register number would need to be made known to each function call.

My recollection from previous talks by Ivan is that each function invocation sees an essentially empty belt with only its arguments already there.


Allen Baum

Mar 22, 2018, 10:24:46 AM
to Bruce Hoult, Alex Elsayed, RISC-V ISA Dev


-Allen

> On Mar 22, 2018, at 4:08 AM, Bruce Hoult <br...@hoult.org> wrote:
>
> It seems to me the Mill's "belt" is equivalent to doing register allocation on a conventional register machine using the following algorithm:
>
> 1) each new IR virtual register is allocated to the next higher architectural register modulo the number of registers.
>
> 2) if the virtual register already in the architectural register about to be written is still live then insert code to copy it to a "scratchpad" area in SRAM.
>
> This pretty clearly make less efficient use of a given number of architectural registers (slash belt) from a software point of view than does the usual more sophisticated register allocation algorithms. Maybe there are hardware compensations, allowing the belt size to be larger than the usual number of registers.
>
> Also, function calls don't fit easily into the above description. The next available register number would need to be made known to each function call.
>
> My recollection from previous talks by Ivan is that each function invocation sees an essentially empty belt with only its arguments already there.

On the Mill, there is a very clever, cheap mechanism that explicitly renames all belt operands in one shot, so that all belt arguments are moved to the front of the belt (and the rest are marked for background spill).

The same mechanism can move (any number of) live, about-to-be-lost belt operands to the front of the belt.

All this is single cycle and happens in parallel with other ops.

Another way to look at this is that everything OoO hardware requires logic to discover is explicit and available to the compiler.

Oddly, Transmeta used a similar concept in their architecture.

Alex Elsayed

Mar 22, 2018, 10:39:55 AM
to RISC-V ISA Dev
That assessment is incorrect, and "Efficient compilation for queue size constrained queue processors" explains (much of) why.

For one, register allocation for queue machines has a direct connection to specific traversal orders of the SSA graph.

In addition, code performs within a small epsilon of optimal given 32 queue slots - just like registers, adding more is unnecessary.

Those allow it to not be worse. Meanwhile, the lack of need for destination specifiers in instructions permits a more compact encoding, and new values being in fresh slots helps eliminate false dependencies, making OoO easier.
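
For a flavour of how that works, here is a toy example in the usual queue execution model (operands dequeued from the front, results enqueued at the rear; the instruction order falls out of a level-order traversal of the expression tree, which is the traversal-order connection mentioned above):

    compute (a + b) * (c + d):

    ld a        ; queue: a
    ld b        ; queue: a b
    ld c        ; queue: a b c
    ld d        ; queue: a b c d
    add         ; dequeue a,b, enqueue a+b       -> queue: c d (a+b)
    add         ; dequeue c,d, enqueue c+d       -> queue: (a+b) (c+d)
    mul         ; dequeue both, enqueue product  -> queue: (a+b)*(c+d)

None of the arithmetic instructions names a source or destination at all, which is where the encoding savings come from.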

Finally, it can be a systolic array rather than a true register file in compact implementations - read ports must still be capable of reading from any register, but only the last register is a valid destination for writes.


Richard Herveille

Mar 23, 2018, 4:34:51 AM
to Russell Wallace, RISC-V ISA Dev, Richard Herveille
On 20/03/2018, 11:51, "Russell Wallace" <russell...@gmail.com> wrote:

> Traditionally, RISC CPUs stuck to 32 registers based on the theory that this was sufficient for most code, but later, out of order CPUs found it was most definitely not enough. Unable to increase the number of architectural registers, they enabled a larger number of physical registers by implementing register renaming, presumably at considerable cost.

That’s not entirely true. At some point throwing more registers at the problem doesn’t help; diminishing returns …

Register renaming helps in avoiding hazards in out-of-order machines.

For example:

    ld   a0, …
    addi a0, a0, 12
    sd   a0, …
    add  a0, a6, a5

Note the repeated use of a0. In an in-order machine there are a few read-after-write hazards: the ‘ld’ must complete before the ‘addi’, and the ‘addi’ must complete before the ‘sd’. The ‘add’ has no hazard, because its result is independent of the instructions above.

If this were an out-of-order machine, there would be more hazards. For example, the ‘add’ may be ready to execute before the ‘sd’, but it must not overwrite ‘a0’ before the ‘sd’ has read it: a write-after-read hazard (and a write-after-write hazard with respect to the ‘addi’). Without renaming, this would stall the ‘add’. By renaming the ‘a0’ destination of the ‘add’ (e.g. to physical, but not logical, register x32), the ‘add’ can start as soon as ‘a6’ and ‘a5’ are available, which may be before any of the ‘ld’, ‘addi’, and ‘sd’ instructions have completed.
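
To make that concrete, here is one possible renamed form of the sequence above (the pN physical register names are made up for illustration; a6 and a5 would of course be renamed too, but are left with their architectural names for readability):

    ld   p7, …          # architectural a0 -> p7
    addi p8, p7, 12     # a0 -> p8 (a fresh physical register for the new value)
    sd   p8, …
    add  p9, a6, a5     # a0 -> p9; no dependence on p7/p8, so it can issue early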

Now that there are suddenly many more instructions in flight, with a number of renamed registers in use (i.e. the same logical register mapped to different physical registers), the need for more physical registers becomes apparent. It is not uncommon to have over a hundred physical registers but only 32 logical registers.

Richard

> Given that high-performance implementations will have a larger number of physical registers anyway, why not make them architectural, go for 64 or 128 architectural registers? This would decrease the number of values that need to be spilled to memory in the admittedly minority of programs that need the extra for that purpose, but it would also seem in at least some cases to avoid the need for expensive register renaming hardware.
>
> What am I missing?
