Re: [isa-dev] Digest for isa-dev@groups.riscv.org - 1 update in 1 topic


Robert Lipe

Aug 24, 2022, 4:42:37 PM
to isa...@groups.riscv.org
I understand die space isn't free.

Neither is spilling and restoring registers constantly, effectively running from RAM/cache because the working set no longer fits in registers. Pusha and popa admittedly get smaller, and there's likely less interrupt latency to first opcode while the registers are saved.

In my own code (bias alert noted) with interprocedural analysis and inlining, GCC does a great job of using ALL the registers. 

Are there any studies available on the resulting performance hit of halving the available register set?

Or is the target market being served here the PIC class of price and performance? That microwave oven timer might fit comfortably in a $.15 part and never notice a performance hit even if the code incurs excessive spills. 

Thanks for any additional justification for further fragmenting our tool and code collections.

On Tue, Aug 23, 2022, 6:48 AM <isa...@groups.riscv.org> wrote:
Stephano Cetola <step...@riscv.org>: Aug 22 04:59PM -0700

Many of you are already familiar with the two reduced bases RV32E and
RV64E. The RV32E base had been around for a long time in draft form in
the Unprivileged ISA document, and the RV64E form has also been used
for a long time. Implementations have been shipped from various
vendors and support for these bases exists in toolchains. We are now
pushing these forward to ratification.
 
The public review period begins today, August 22, and ends on October
6 (inclusive).
 
You can find the PDF on GitHub in Chapter 6 of the Unprivileged ISA:
https://github.com/riscv/riscv-isa-manual/releases/download/draft-20220707-f518c25/riscv-spec.pdf
 
To respond to the public review, please email comments to the public
isa-dev mailing list or add issues to the github repo. We welcome all
input and appreciate your time and effort in helping us by reviewing
the specification.
 
During the public review period, corrections, comments, and
suggestions, will be gathered for review by the Unprivileged ISA
Committee. Any minor corrections and/or uncontroversial changes will
be incorporated into the specification. Any remaining issues or
proposed changes will be addressed in the public review summary
report. If there are no issues that require incompatible changes to
the public review specification, the Unprivileged ISA Committee will
recommend the updated specifications be approved and ratified by the
RISC-V Technical Steering Committee and the RISC-V Board of Directors.
 
Kind regards,
Stephano
--
Stephano Cetola
Director of Technical Programs
RISC-V International
You received this digest because you're subscribed to updates for this group. You can change your settings on the group membership page.
To unsubscribe from this group and stop receiving emails from it send an email to isa-dev+u...@groups.riscv.org.

Bruce Hoult

Aug 24, 2022, 11:34:15 PM
to Robert Lipe, RISC-V ISA Dev
I seem to recall early Embench results showing RV32E caused about a 30% slow down because of extra register spills and reloads.

The performance probably doesn't matter in many cases, but I'd have thought that at some point the extra ROM space for the extra instructions would outweigh the savings from a smaller register file. And that point probably hits by the time you've got even a couple of KB of code.

The same goes for not implementing the C extension. It really boggles my mind that someone in India is apparently making a high end Linux capable RISC-V CPU without the C extension. It's just a complete false economy that could only be done by a bureaucrat, not an engineer. And some poor bugger is building every single package as non-C and porting as needed. 


BGB

Aug 25, 2022, 2:35:40 PM
to isa...@groups.riscv.org
On 8/24/2022 10:34 PM, Bruce Hoult wrote:
> I seem to recall early Embench results showing RV32E caused about a 30%
> slow down because of extra register spills and reloads.
>
> The performance probably doesn't matter in many cases, but I'd have
> thought that at some point the extra ROM space for the extra
> instructions would outweigh the savings from a smaller register file.
> And that point probably hits by the time you've got even a couple of KB
> of code.
>

Generally agreed.


My own experiences with ISA design sort of imply that 32 GPRs is
basically the local optimum:
16 has slow-down due to a higher rate of register spills;
32 fits well into a 32-bit word, typically has low spill rate;
64 suffers due to encoding space issues with 32-bit instruction words.


On some machines with 16 registers, there can be a pretty big hit for
some workloads due to high rates of register spill and fill. The
situation is particularly bad when working with a lot of types that
require pairs of registers, such as 64-bit types on a 32-bit machine or
128-bit types on a 64-bit machine (in cases where these types are held
in GPRs).


There are also cases where 64 can pay off (such as graphics processing
and "OpenGL style" 3D rasterization; where a fair amount of stuff may be
"in flight" at the same time). Though probably not really as ideal for a
general purpose ISA (where for much general purpose code, the spill rate
is already fairly low with 32 GPRs).

Having 64 registers does potentially allow a majority of leaf functions
to fit entirely into scratch registers though (in which case one can
often skip creating a stack frame or similar).

Though, with the drawback that having 6-bit register fields would not
leave as many bits for things like opcode or immediate fields (so, one
might need to pay for them by making immediate fields smaller).


Though, this is slightly less of an issue if the encoding sticks with
5-bit register fields, but then supports the extended registers via
larger-format instructions and special-case encoding hacks.

Say:
R0..R31: Encoded with 32-bit instruction words;
R32..R63: May require 48 or 64 bit encodings.
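
As a rough illustration of the encoding-space point above, the bit budget
can be tallied directly. This is only a sketch: it borrows the RISC-V
I-type layout (7-bit opcode, 3-bit funct3, two register fields, immediate)
as an assumption, and the helper names are hypothetical.

```python
WORD_BITS = 32     # fixed 32-bit instruction word
OPCODE_BITS = 7    # RISC-V major opcode width (assumed layout)
FUNCT3_BITS = 3    # RISC-V funct3 width (assumed layout)

def bits_left_after_regs(reg_fields, reg_bits, word_bits=WORD_BITS):
    """Bits remaining for opcode/immediate once register fields are paid for."""
    return word_bits - reg_fields * reg_bits

def i_type_imm_bits(reg_bits):
    """Immediate width of an I-type-like format with two register fields."""
    return WORD_BITS - OPCODE_BITS - FUNCT3_BITS - 2 * reg_bits

# Three-operand instruction: 5-bit fields leave 17 bits, 6-bit fields leave 14.
three_op_5 = bits_left_after_regs(3, 5)
three_op_6 = bits_left_after_regs(3, 6)

# I-type-like instruction: the immediate shrinks from 12 bits to 10 bits.
imm_5 = i_type_imm_bits(5)
imm_6 = i_type_imm_bits(6)
```

Under these assumptions, going from 5-bit to 6-bit register fields costs
3 bits in a three-operand format and 2 immediate bits in a two-operand one.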


> The same goes for not implementing the C extension. It really boggles my
> mind that someone in India is apparently making a high end Linux capable
> RISC-V CPU without the C extension. It's just a complete false economy
> that could only be done by a bureaucrat, not an engineer. And some poor
> bugger is building every single package as non-C and porting as needed.
>

Would be nicer if the 'C' extension's encoding were less like something
a dog chewed up, but alas.

Trying to write a decoder for the 16-bit instructions is kind of
demotivating given how messy it is.



kr...@sifive.com

Aug 28, 2022, 8:06:44 PM
to BGB, isa...@groups.riscv.org

The use case is small microcontrollers, where a 32-entry regfile
represents about half the area. Cutting this in half saves ~25% of
the area and makes RISC-V competitive with other solutions in this
space.
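
The arithmetic behind that estimate can be sketched as follows, assuming
(as a simplification not stated in the message) that register-file area
scales linearly with the number of entries. The function name is
hypothetical.

```python
def core_area_saved(regfile_fraction, regs_before=32, regs_after=16):
    """Fraction of core area saved by shrinking the register file.

    Assumes register-file area is proportional to the number of entries.
    """
    return regfile_fraction * (1 - regs_after / regs_before)

# Regfile is about half the core; halving it saves about a quarter of the core.
saved = core_area_saved(0.5)
```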

These small cores are often used as glorified programmable state
machines and every gate counts, as there may be hundreds of them on
one SoC embedded in other IP blocks on a chip, e.g., think of adding
one per SerDes lane.

I don't think the performance penalty is as high as 30% in general. I
remember it more in the 0-10% range going from 32->16 registers across
most things, though haven't looked at recent results. Note that some
users will be coding these in assembly and will see even less
difference than when using compilers.

Krste

Bruce Hoult

Aug 28, 2022, 10:10:35 PM
to Krste Asanovic, BGB, RISC-V ISA Dev
On Mon, Aug 29, 2022 at 12:06 PM <kr...@sifive.com> wrote:

> The use case is small microcontrollers, where a 32-entry regfile
> represents about half the area. Cutting this in half saves ~25% of
> the area and makes RISC-V competitive with other solutions in this
> space.

That's just the core, not counting SRAM and flash/ROM? I don't have a good feel for the trade-off between random gates and ROM (or SRAM) and would love for someone with more knowledge to comment.

 
> These small cores are often used as glorified programmable state
> machines and every gate counts, as there may be hundreds of them on
> one SoC embedded in other IP blocks on a chip, e.g., think of adding
> one per SerDes lane.

Appreciate that. If the code is 50 or 100 instructions then it obviously makes complete sense.

 
> I don't think the performance penalty is as high as 30% in general. I
> remember it more in the 0-10% range going from 32->16 registers across
> most things

That sounds right for performance. I should have said size.

Recently I did a weird experiment where I cut RV32I down to just 10 instructions (ADDI, ADD, SLL, SRA, NAND [1], JAL, JALR, BLT, LW, SW). It's still capable of efficiently compiling anything. In general, missing instructions require no more than 4 of the remaining instructions to emulate them. LB/LH need more, and SB/SH even more again. DEC Alpha guys thought that was ok :-) 

I then compiled a few programs and semi-automatically converted them to the cut-down ISA. Code size expansion was around 25-30%. Runtime slowdown was much less, because most of the expansion was function prologs saving more registers and loading constants into them (outside of loops), or conditional branches expanding to 2 or 3 instructions, of which the first is most commonly the one taken (e.g. BLE for a loop condition usually hits on the BLT).

This is pretty pointless because it doesn't save much at all in gates. Maybe it's interesting to get something of a feel for what programming a PDP-8 or other early computer feels like. 

[1] NAND isn't an RV instruction, but B extension ANDN would work instead, a little worse
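
To make the "no more than 4 instructions" claim concrete, here is a small
sketch of how some missing RV32I operations could be built from NAND, ADD
and ADDI alone. These sequences are standard identities chosen for
illustration, not necessarily the ones used in the experiment; the Python
function names are hypothetical and registers are modelled as 32-bit
unsigned values.

```python
MASK = 0xFFFFFFFF  # model 32-bit registers

def nand(a, b):
    """The one logical primitive kept in the cut-down ISA."""
    return ~(a & b) & MASK

def add(a, b):
    """ADD / ADDI with 32-bit wraparound."""
    return (a + b) & MASK

def not_(a):
    # 1 instruction: NAND a, a
    return nand(a, a)

def and_(a, b):
    # 2 instructions: NAND, then NAND of the result with itself
    t = nand(a, b)
    return nand(t, t)

def or_(a, b):
    # 3 instructions: De Morgan, a | b == ~(~a & ~b)
    return nand(nand(a, a), nand(b, b))

def xor(a, b):
    # 4 instructions: the classic four-NAND XOR
    t = nand(a, b)
    return nand(nand(a, t), nand(b, t))

def sub(a, b):
    # 3 instructions: a - b == a + ~b + 1 (NAND, ADD, ADDI)
    return add(add(a, nand(b, b)), 1)
```

For example, `xor(0b1100, 0b1010)` evaluates to `0b0110`, and `sub(5, 7)`
wraps to `0xFFFFFFFE`, the 32-bit two's-complement encoding of -2.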

 

kr...@sifive.com

Aug 28, 2022, 11:16:02 PM
to Bruce Hoult, Krste Asanovic, BGB, RISC-V ISA Dev

>>>>> On Mon, 29 Aug 2022 14:10:20 +1200, Bruce Hoult <br...@hoult.org> said:

| On Mon, Aug 29, 2022 at 12:06 PM <kr...@sifive.com> wrote:
| The use case is small microcontrollers, where a 32-entry regfile
| represents about half the area.  Cutting this in half saves ~25% of
| the area and makes RISC-V competitive with other solutions in this
| space.

| That's just the core, not counting SRAM and flash/ROM? I don't have a good feel for the trade-off between
| random gates and ROM (or SRAM) and would love for someone with more knowledge to comment.

For a given system and technology, you can make the analysis, but
factors will vary greatly depending on absolute size of the memory and
whether the memory in question is shared, e.g., consider 100 cores
sharing a common flash image, or if your code is small enough that
you're holding it in flops. The main point is there is a wide
diversity of cores out there, and many of them are mostly invisible
unless you're actively working in that area.

[...]

| I don't think the performance penalty is as high as 30% in general.  I
| remember it more in the 0-10% range going from 32->16 registers across 
| most things

| That sounds right for performance. I should have said size.

OK - I can see size being somewhat larger in some cases due to
additional static spill/fill code that doesn't appear in dynamic path
difference, but it would be good to have more comprehensive data on
that - 30% seems too high as an average number. To be clear, there is
obvious demand for RV32E cores - the analysis might help folks learn
about the boundary, but many that choose RV32E will have (hopefully)
done the analysis on their own code.

| Recently I did a weird experiment where I cut RV32I down to just 10 instructions (ADDI, ADD, SLL, SRA, NAND
| [1], JAL, JALR, BLT, LW, SW). It's still capable of efficiently compiling anything. In general, missing
| instructions require no more than 4 of the remaining instructions to emulate them. LB/LH need more, and SB/SH
| even more again.
| DEC Alpha guys thought that was ok :-) 

Until they realized it wasn't.

[...]

Krste

Jan Gray

Aug 28, 2022, 11:55:24 PM
to kr...@sifive.com, Bruce Hoult, BGB, RISC-V ISA Dev
| On Mon, Aug 29, 2022 at 12:06 PM <kr...@sifive.com> wrote:
| The use case is small microcontrollers, where a 32-entry regfile
| represents about half the area. Cutting this in half saves ~25% of
| the area and makes RISC-V competitive with other solutions in this
| space.

Also consider an n-hart hardware-multithreaded RISC-V core. RV32E's 16 registers per hart saves 16n registers, and perhaps shortens a critical path.

For example, the UntetherAI Boqueria [1][2], described last week at Hot Chips 34, has 1,458 4-threaded RV32EMC cores. (Each pair of such cores manages 512 inference PEs (373,000 PEs in all).) Here RV32E saves over 90,000 registers.
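
The figures quoted above can be checked with a few lines of arithmetic.
This is a sketch only: the variable names are hypothetical, and it assumes
each hardware thread would otherwise carry a full 32-entry register file.

```python
CORES = 1458                  # RV32EMC cores on the chip
HARTS_PER_CORE = 4            # 4-way hardware multithreading
REGS_SAVED_PER_HART = 32 - 16 # RV32I has 32 registers, RV32E has 16
PES_PER_CORE_PAIR = 512       # inference PEs managed by each pair of cores

# 1458 * 4 * 16 = 93,312 registers saved ("over 90,000")
registers_saved = CORES * HARTS_PER_CORE * REGS_SAVED_PER_HART

# 729 pairs * 512 = 373,248 PEs (quoted as roughly 373,000)
total_pes = (CORES // 2) * PES_PER_CORE_PAIR
```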

[1] https://www.nextplatform.com/2022/08/23/untether-ai-pulls-the-curtain-rope-for-its-next-gen-inferencing-system/
[2] https://www.untether.ai/inthenews/untether-ai-unveils-its-second-generation-at-memory-compute-architecture-at-hot-chips-2022-r45np

Jan Gray | Gray Research LLC
