Alternative RV128: 16x 64b + 16x 128b registers are better than 32x 128b


Xan Phung

Feb 6, 2023, 6:42:57 AM
to RISC-V ISA Dev
[Please read the attached PDF file on my Alternative RV128 if you need further background information to my discussion below]

Below, I'm going to challenge an assumption in "Original" RV128 that all 32 integer registers should be the same (128b) size.

I instead propose the Alternative RV128 register file consist of 16x 64b plus 16x 128b as follows:

[Image: Alternative RV128 Register File.png]
The above naming of the registers is for what I call the X64 ABI.  This is intended to be a 10 year transitional ABI - until 128 bit pointers are needed (note addresses in current hardware max out at 57b).  Note in X64 ABI, the pointers are plain 64b, there is no segmentation, no tricks to get anything more than 64b.  My attached PDF has details on the next stage 128 ABI, and discusses design choices to allow greater interoperability between X64 ABI and 128 ABI.  Also, please note that the transition from X64 ABI to 128 ABI is a software only transition - the whole point of X64 is to get the 128b hardware installed & bedded down, so that the software-only transition to 128b pointers can then be done in a highly granular way over 10+yrs.

My motivation is the fact that 128b registers are more energy hungry than 64b registers:
128b registers should be exposed as a separate size class in the ISA, as this is something we actually want to make *harder* for compiler writers.  We don't want them assuming <=64b quantities can just be shoved into 128b registers indiscriminately.  The CPU shouldn't be abstracting away the fundamental energy consumption difference between large (128b) & small (64b) registers.  We want compilers to make thoughtful choices about handling of large data sizes vs small sizes, to produce more energy efficient code.

Note: in Alternative RV128, I've retained separate 5b register fields for each source & the destination, to select from the 16x64b + 16x128b registers.  So integer operations can still effortlessly mix and match any combination of 64b and 128b source operands (once they've allocated registers appropriately).

Of course, the obvious disadvantage of my proposal is that the number of 128b registers is now reduced to only 16 (the other 16 registers being only 64b):

(i) Firstly, this disadvantage is a matter of choice of benchmark, such as compared to "Original" RV128 (which has 32x 128b registers).  Compared to x86_64, the total number of registers (of any size class) in my proposal has actually increased from 16 to 32, and likewise my proposed Alternative RV128 is superior to RV64 & ARM64 in its register file capacity & "future-proofing".  I believe the latter three comparisons are more appropriate, as I anticipate the target market for 128b computing is to upgrade existing Legacy 64b systems, and RV64/ARM64 will be compared as alternative options in this upgrade process.

(ii) Secondly there are diminishing returns in increasing the register file beyond 16 registers (of any size), and many programs do not use more than 16 registers.

(iii) Thirdly, even amongst programs using all 32 registers, a substantial fraction of registers will be for 8b/16b/32b/64b data.  Many scalar magnitudes in the "real world" are less than 2^32, for which an Int32 is sufficient & pointer sizes/array indices will remain 64b or less for many years (we don't even use all 64b of x86_64 virtual address space).  It is likely for the foreseeable future that most programs will use a range of integer sizes, and universal 128b integers for everything will *not* happen.

Apart from energy consumption, the other reasons for having dual (64b & 128b) size classes of registers are:

a.  Compactness & orthogonality of 64/128 size class encoding.  In Alternative RV128, the size class is the MSB of every 5 bit src/dest register field, so any instruction referencing one or more registers automatically supports 64b and 128b operations, including for mixed data size operations.  In comparison, RV64 has a triplicate system for encoding operation size, with different function codes used to represent data sizes in LOAD, STORE and OP/OP-32.  Original RV128 adds a further OP-64 and a new encoding for LQ.  Furthermore, RV64 only has one size class for branch instructions, so all branches must do a 64b compare, even if they are comparing against zero or one.  This can waste up to 63 bits, but in "Original" RV128, the problem is even worse, with branches only able to do 128b compares, and so up to 127 bits are wasted.  In Alternative RV128, because branches encode two source registers, any combination of 64b & 128b comparison/branch is automatically & consistently supported.
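To make the encoding idea in (a) concrete, here is a minimal Python sketch of how a 5 bit register field could be split into a size class and a register index. The post only says the size class is the MSB of the field; which MSB value selects which class is an assumption here (MSB = 1 means 128b), and the names `decode_reg_field` and `operand_widths` are hypothetical.

```python
def decode_reg_field(field):
    """Split a 5 bit register field into (index within class, width in bits).

    Assumption (not stated in the post): MSB = 0 selects the 64b class,
    MSB = 1 selects the 128b class.
    """
    if not 0 <= field < 32:
        raise ValueError("register field must fit in 5 bits")
    size_class = (field >> 4) & 1   # MSB of the 5 bit field
    index = field & 0xF             # low 4 bits pick one of 16 registers
    width = 128 if size_class else 64
    return index, width


def operand_widths(rs1, rs2):
    """Return the widths of the two source operands of an instruction,
    for example a compare-and-branch with two source register fields."""
    return decode_reg_field(rs1)[1], decode_reg_field(rs2)[1]
```

Under these assumptions, `operand_widths(5, 21)` gives `(64, 128)`: the same opcode covers a mixed 64b/128b compare without a separate size field.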

I admit though that the 32b size isn't dealt with so nicely as there is no separate 32b register size class.  But this size class is less important than the 64b size class - as a 64b register will only waste 32b (when holding a 32b quantity), whereas a 128b register wastes 64b (holding a 64b quantity).

b.  Existing x86_64 ABIs (which assume 16x 64b registers) can be transplanted onto Alternative RV128 easily.  Legacy 64b code (eg. a processor with an x86_64 mode switch) can make function calls to new X64 ABI code, and vice versa, new X64 ABI code can call Legacy 64b code.  Please also see the attached PDF for further thoughts & design choices on how to allow Legacy 64b (x86_64) code to interoperate with 128 ABI code.

c.  Reduced transistor budget for the register file.  Alternative RV128's mixed 64b/128b register file is only 50% more transistors than RV64 (not 100% more like Original RV128).  This is quite a modest transistor budget for a much more future-proof, wider register set.  This is one of the reasons why I assert the case that RV64 should be bypassed entirely by Legacy 64b systems, which should instead upgrade directly to Alternative RV128.
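The 50% and 100% figures in (c) follow from counting storage bits. The sketch below redoes that arithmetic, assuming storage bits are a fair proxy for transistor count (it ignores read/write ports, decoders and the hardwired zero register); the helper names are hypothetical.

```python
def regfile_bits(groups):
    """Total storage bits for a list of (register count, width in bits)."""
    return sum(count * width for count, width in groups)


def percent_more(new_bits, base_bits):
    """How much larger new_bits is than base_bits, in percent."""
    return 100 * (new_bits - base_bits) / base_bits


rv64 = regfile_bits([(32, 64)])                 # 32 x 64b
original_rv128 = regfile_bits([(32, 128)])      # 32 x 128b
alternative_rv128 = regfile_bits([(16, 64), (16, 128)])  # 16 x 64b + 16 x 128b

alt_vs_rv64 = percent_more(alternative_rv128, rv64)
orig_vs_rv64 = percent_more(original_rv128, rv64)
```

This gives 2048, 4096 and 3072 bits respectively, so `alt_vs_rv64` is 50.0 and `orig_vs_rv64` is 100.0.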

RV128 Alternative Proposal.pdf

Allen Baum

Feb 6, 2023, 11:59:24 PM
to Xan Phung, RISC-V ISA Dev
1. I think you should quantify the energy difference, and not just assume that it is significant.
2. There are microarchitectural techniques that can reduce the energy consumption if a register is loaded with a smaller value
3. The amount of register spilling and filling with only 16 registers of any particular size may overwhelm any energy savings you get by partitioning

to name a few possible objections.


--
You received this message because you are subscribed to the Google Groups "RISC-V ISA Dev" group.
To unsubscribe from this group and stop receiving emails from it, send an email to isa-dev+u...@groups.riscv.org.
To view this discussion on the web visit https://groups.google.com/a/groups.riscv.org/d/msgid/isa-dev/a91edcdb-26fa-4a02-97e1-1d6620f07167n%40groups.riscv.org.

BGB

Feb 7, 2023, 12:55:26 AM
to isa...@groups.riscv.org
On 2/6/2023 10:59 PM, 'Allen Baum' via RISC-V ISA Dev wrote:
> 1. I think you should quantify the energy difference, and not just
> assume that it is significant.
> 2. There are microarchitectural techniques that can reduce the
> energy consumption if a register is loaded with a smaller value
> 3. the amount of register spilling and filling with only 16 registers of
> any particular size may overwhelm any energy savings you get by partitioning
>
> to name a few possible objections.
>

Yeah, partitioning the register space and then having multiple register
sizes that depend on the implementation seems like a terrible idea IMO.


In my ISA, I instead went the route of handling 128 bit data as
pairs of 64-bit registers (so, one can see them as either 32x128b or
64x64b).

Generally works OK. Doesn't require the expense of actual 128 bit
registers. It also leveraged that the core already has enough register
ports for multiple instructions to run at the same time, so the register
ports combine together into the larger virtual registers (when operating
on 128 bit values, the core effectively dropping down to scalar operation).


In this case, whether one has one logical 128-bit register, or two 64 bit
registers, is mostly left up to the software. This sorta made sense as
128-bit values are a relative minority, and "actual" 128 bit registers
seemed kinda like overkill in my case.
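As a rough illustration of the register-pair idea (not the actual ISA described above, whose details are not given here), a 128 bit add can be modelled as two 64 bit adds with a carry between the low and high halves. All names below are hypothetical.

```python
MASK64 = (1 << 64) - 1


def split128(x):
    """Split a 128 bit value into (low 64 bits, high 64 bits)."""
    return x & MASK64, (x >> 64) & MASK64


def join128(lo, hi):
    """Rebuild a 128 bit value from a (low, high) pair of 64 bit halves."""
    return (hi << 64) | lo


def add128_pair(a_lo, a_hi, b_lo, b_hi):
    """Add two 128 bit values held as pairs of 64 bit registers."""
    lo_sum = a_lo + b_lo
    carry = lo_sum >> 64                  # carry out of the low half
    lo = lo_sum & MASK64
    hi = (a_hi + b_hi + carry) & MASK64   # wraps modulo 2^128 overall
    return lo, hi
```

For example, `join128(*add128_pair(*split128(2**64 - 1), *split128(1)))` equals `2**64`, showing the carry crossing from the low register into the high one.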

Also, it means that there is effectively no difference between the
64-bit mode and 128-bit mode at the ISA level; as things like
"sizeof(void*)" mostly only exist at the C ABI level (though, they did
end up with different PE/COFF magic numbers, since the loader and "OS"
will still need to deal with this).



Xan Phung

Feb 7, 2023, 7:59:08 AM
to RISC-V ISA Dev, Allen Baum, RISC-V ISA Dev, Xan Phung
Hi Allen,
Thanks for your constructive questions, which I appreciated - as they demonstrated intelligence and interest in what I proposed.

I should also point out "Original RV128" doesn't quantify any energy:cost benefit in its spec, it's simply a theoretically derived, 100% extrapolation of RV64 with no empirical or data driven justification of their RV128 design decisions.... so please keep that in perspective when looking at the data I present for my Alternative spec below.

Before I answer your questions, I should clarify that in Alternative RV128 the 128b registers are still useable as 64b registers (with options for sign & zero extension in most operations).
So it's not a complete "partition" in the sense that other data sizes can't be used in larger registers - but it is preferable that 64b or smaller data goes into the 64b registers first before being put into 128b.

Q1. Quantifying the energy difference: this is the product of (magnitude of register file energy usage)  x   (% energy saving achieved).

There is a lot of literature which asserts the register file is a high magnitude energy user:
(a) J.Tseng and K.Asanovic (2000): "Register files represent a substantial portion of the energy budget in modern microprocessors [2, 3, 9]. For example, in Motorola’s M.CORE architecture, the register file consumes 16% of the total processor power and 42% of the data path power"
(b) Jones et al (2009) "Furthermore, the register file is a hotspot and is already one of the most energy-consuming structures within a modern superscalar processor. Any technique that can reduce the register file’s energy requirements would have a significant impact on the processor’s total power budget."
(c) Tavana et al (2015) "the 32 nm Westmere (WSM) core [45], which is used on mobile, desktop and servers, register files contribute to 30% of both dynamic and leakage power dissipation."
(d)  [This relates to register file size/transistor count, as well as power] J. Yamada et al (2018)  "Figure 1 shows a die photograph of the AMD Bulldozer processor, which is one of the most documented processors among recent ones [8]. The integer core of the processor is a moderate-sized, non-multithreaded 4-issue one. Nevertheless, as shown in this figure, the 96-entry integer register file with 8-read+4-write (i.e., 12-port) is comparable with the 16 KB level-1 data cache (L1D) in area, even though their sizes are different: 16K ÷ (96 × 8) ≈ 21.3 times. This means that the register file cell is approximately 20 times larger than the L1D cell"
 
What would be the % energy saving for narrowing half of the registers to 64b?

Ergin (2006) simulated a 4 wide superscalar 32b processor in which the register file suppressed read/writes on the upper 24b of the register if they were all 1's or 0's (ie: value was a "narrow" 8 bit value).  Ergin found there was a 20% dynamic power saving.  This technique is similar to what I am proposing (although at the scale of 8+/- bit size partitioning rather than 64+/- bits), but Ergin's approach does not encode the partitioning in the ISA.  It thus suffers greater leakage current (as all registers are full size still) & also has the overhead for detection of all 0's & all 1's for every write into the register as well as additional multiplexer circuitry in the read path.  As stated by Waterman & Asanovic (2022) "comparisons against zero require non-trivial circuit delay (especially after the move to static logic in advanced processes) and so are almost as expensive as arithmetic magnitude compares".
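The narrow-value check summarised above can be sketched in a few lines of Python. This models only the detection rule as described in this post (upper 24b of a 32b value all 0's or all 1's), not Ergin's actual circuit; `is_narrow` and its parameters are hypothetical names.

```python
def is_narrow(value, width=32, narrow_bits=8):
    """Return True if the bits above the low `narrow_bits` bits of a
    `width`-bit register value are all 0's or all 1's."""
    value &= (1 << width) - 1                    # keep only `width` bits
    upper = value >> narrow_bits                 # the upper 24b by default
    all_ones = (1 << (width - narrow_bits)) - 1
    return upper == 0 or upper == all_ones
```

With the defaults, `is_narrow(0x0000007F)` and `is_narrow(0xFFFFFF80)` are both True, while `is_narrow(0x00000100)` is False. In hardware this check has to run on every register write, which is the overhead discussed above.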

Hence Ergin's approach is a very high overhead approach, and the 20% power saving was achieved despite this overhead.  My approach does not require the detection of all 1's and 0's, nor multiplexing, and so has lower overhead than Ergin's.  But as you rightly point out, my approach may cause more register spills for 128 bit data sizes.  (But there won't be more register spills of 64b data sizes, as my ISA allows all 32 registers to hold 64b data).
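For a feel of the Q1 product, the sketch below multiplies a register file share by a register file saving. The inputs are the figures quoted above (16% and 30% register file share, 20% dynamic power saving), but they come from different studies on different processors, so combining them is purely illustrative arithmetic and not a prediction for Alternative RV128. The function name is hypothetical.

```python
def core_power_saving(regfile_share, regfile_saving):
    """Fraction of total processor power saved, given the register file's
    share of total power and the fractional saving within the register file."""
    return regfile_share * regfile_saving


# Illustrative only: figures from different papers are being mixed here.
low_estimate = core_power_saving(0.16, 0.20)    # roughly 3.2% of total power
high_estimate = core_power_saving(0.30, 0.20)   # roughly 6% of total power
```

Any real estimate would also have to subtract the cost of extra 128b register spills raised in Q3.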

Q2. Are there microarchitectural techniques that can reduce the energy consumption? Yes, but...
I already summarised Ergin 2006 above.  The full title is:
Exploiting Narrow Values for Energy Efficiency in the Register Files of Superscalar Microprocessors

I view the above technique as (mostly) additive to my approach.  My approach doesn't need to do all 1's or all 0's detection (which as per Waterman and Asanovic, is almost as expensive as a 64b adder itself!)
Also, Ergin seemed to find the partitioning between <8b vs >8b width to be most effective, rather than larger sizes of 16b or 24b.  So there is no reason why it can't be added to the more static, ISA exposed approach I use, with both techniques working concurrently (mine targeting the 64b +/- split, and Ergin's targeting the 8b +/- split).

Note that techniques like the above (non static/non ISA) can only add to the complexity & latency of the register file, which in superscalar processors is already 20x the size (per bit of storage) of Level 1 cache storage, and whose complexity increases as a quadratic function of the number of read/write ports.  My static ISA approach has the benefit of reducing microcircuit level complexity rather than increasing it.

This all goes back to the issue of how much should be exposed to the compiler, or how much should be hidden away/abstracted by the hardware?  My opinion (like much of the RISC approach overall) is to get the compiler to do more of the work - so that the underlying hardware can be simpler.  Compilers already need to distinguish between 32b ints, 64b longs, and pointers.  My philosophy is why do we need to then hide this at the hardware/register level?

Q3. The amount of register spilling and filling with only 16 registers [CLARIFICATION - only 16 registers for 128b data; there are 32 registers for 64b data]
As above, for 64b only code, there is no additional spilling, as the ISA provides the option for 128b registers to also hold 64b values (with either zero or sign extension).
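The two extension options mentioned here can be modelled as follows. This is a minimal sketch of what zero and sign extension of a 64b value into a 128b register mean numerically; the function names are hypothetical and nothing here reflects the actual instruction encoding.

```python
MASK64 = (1 << 64) - 1
MASK128 = (1 << 128) - 1


def zero_extend_64_to_128(value):
    """Place a 64b value in a 128b register with the upper 64b cleared."""
    return value & MASK64


def sign_extend_64_to_128(value):
    """Place a 64b value in a 128b register, copying bit 63 into the upper 64b."""
    value &= MASK64
    if value >> 63:                    # sign bit of the 64b value is set
        value |= MASK128 ^ MASK64      # fill bits 64..127 with 1's
    return value
```

For the input `0x8000000000000000`, zero extension leaves the upper 64b as 0, while sign extension sets them all to 1.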

For 128b register spills, most of the push back I am getting is that 128b data sizes are not needed, it's only 1% of computation, 64b is plenty for most people's needs for the next 20yrs, etc etc

So given the amount of pushback that 128b isn't needed, I would consider it a badge of honour if anything I could do with RV128 would cause it to have problems with shortages of 16x 128b registers within the next 20yrs!!!  It means I have succeeded with RV128!

Best regards
Xan Phung