Comments on Zc* proposal (v1.0.0-RC5.7)


L Peter Deutsch

Oct 12, 2022, 12:04:27 PM
to isa...@groups.riscv.org
Given the scarcity of available C encodings, I think it would strengthen
this proposal considerably to add a short Appendix with some real-example
data of the percentage reduction in code size resulting from each section of
the proposal. (I assume that gcc has already been extended to use the
proposed instructions.)

It seems logical to me that there should be a proposal parallel to Zcmp for
similar 32-bit instructions that can span the full 32-register set, and that
the two should be coordinated so that Zcmp is a subset of the 32-bit
instructions. Has this been considered?

Assuming that measurements justify the proposed instructions, I support all
of the proposal other than the Table Jump instructions. However, I have
great concern about the management of the JVT CSR and of the jump table(s).
A number of ABIs for other ISAs have a quasi-reserved base register for some
kind of global data. How is this register, and the storage it addresses,
managed? I believe the most common approach is the one used in WebAssembly,
where the global base is treated as a per-module register and saved and
restored by cross-module calls (possibly trampolines): this ("module") would
have to be a new ABI concept that is not currently part of the RISC-V ABI
design. Alternatively, the Table Jump Overview section of the proposal
suggests that the linker could replace instruction sequences with shorter
ones based on the JVT. This would require that linkable code include enough
relocation information to allow instructions to be replaced with shorter
ones and code compressed without the use of nop padding, and it would also
eliminate the possibility of using the JVT in shared-page libraries unless
they used the trampoline mechanism on entry and exit. In short, I think the
issues are more complex than the proposal discusses at present, and I would
advocate removing the Table Jump section from this proposal for further
consideration.

--

L Peter Deutsch :: Aladdin Enterprises :: Healdsburg, CA & Burnaby, BC

Tariq Kurd

Oct 13, 2022, 4:28:02 AM
to L Peter Deutsch, isa...@groups.riscv.org
Hi,


For the 32-bit versions - they were considered but we agreed to only concentrate on 16-bit encodings for this extension - we may consider wider encodings in the future.

For the ABI question I'll let someone else answer.


Tariq


--
You received this message because you are subscribed to the Google Groups "RISC-V ISA Dev" group.
To unsubscribe from this group and stop receiving emails from it, send an email to isa-dev+u...@groups.riscv.org.
To view this discussion on the web visit https://groups.google.com/a/groups.riscv.org/d/msgid/isa-dev/20221012160420.B7FF9EC2AA2%40serpent.at.major2nd.com.


--

Tariq Kurd | Lead IP Architect | Codasip UK Design Centre | www.codasip.com

L Peter Deutsch

Oct 13, 2022, 8:42:31 AM
to Tariq Kurd, isa...@groups.riscv.org
Thanks. Was the code linked with shared libraries (which necessarily must
have been compiled without the use of JVT)? If so, are the code sizes and
savings measured only on the non-shared part? And in either case, did the
linker use JVT to reduce the code size as suggested in the proposal?

What I see in the benchmarks is that Zcmp accounts for significantly more
space savings than all the others together. I'll be interested to hear the
ARC's thoughts about the fact that AFAIK this is the first Zc extension
where one instruction does not map to a single 32-bit instruction.

I found the "Max benefit" line a bit misleading: the max benefits from all
but Zcmt are large outliers, and in all 3 cases, the code size with the max
benefit was very small (<= 8K). Because of the skewed distribution, I would
suggest adding a line for the overall weighted median, as well as the
overall weighted average.
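
To illustrate the difference between the summary statistics suggested above, here is a minimal Python sketch. The (saving, size) pairs are made-up placeholders, not figures from the spreadsheet, and `weighted_mean` and `weighted_median` are hypothetical helper names.

```python
def weighted_mean(values, weights):
    """Mean of values, each counted in proportion to its weight."""
    total = sum(weights)
    return sum(v * w for v, w in zip(values, weights)) / total


def weighted_median(values, weights):
    """Smallest value at which the cumulative weight reaches half the total."""
    half = sum(weights) / 2
    running = 0
    for value, weight in sorted(zip(values, weights)):
        running += weight
        if running >= half:
            return value
    raise ValueError("empty input")


# Hypothetical data: percentage saving and code size in KiB per benchmark.
savings_pct = [30.0, 5.0, 6.0, 7.0]
size_kib = [8, 500, 300, 200]

print("max:", max(savings_pct))
print("weighted mean:", weighted_mean(savings_pct, size_kib))
print("weighted median:", weighted_median(savings_pct, size_kib))
```

With this placeholder data the single 8 KiB outlier sets the maximum at 30%, while the size-weighted mean and median both stay near 6%.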

A lot of good work has clearly gone into this proposal, and aside from the
JVT issues, I'm fine with the result. The one final bit of information I
would be interested in seeing is a list of the "runners-up" -- proposed
instructions that were considered seriously enough to benchmark but that
didn't make the cut. I know I could compile this by going through the
minutes of the TG, but I'm hoping that it already exists somewhere, or would
be easy to create.

Tariq Kurd

Oct 13, 2022, 9:55:16 AM
to L Peter Deutsch, isa...@groups.riscv.org
All the benchmarking results came from an analysis script:


which replaces sequences in existing compiler output with new encodings, and reports the code-size saving.

We also did some work on the released toolchains to correlate the results with the script, but the script output drove the decisions.

Zcmb is still in the spreadsheet (16-bit load/store byte/half with longer immediates) but the benefit is low so I dropped it. We've considered many other instructions over time.

Tariq


Michael Zoran

Oct 13, 2022, 6:40:29 PM
to RISC-V ISA Dev, tariq...@codasip.com, isa...@groups.riscv.org, L Peter Deutsch
Looking at the results, it appears that 10%-20% is at the high end of the compression.

I read that the C extension can cause a whole stream of uncompressed instructions following the compressed instruction to be unaligned.  So I'm wondering if the 10%-20% compression is worth the possible execution performance decrease.

I also wonder if, given these results, which seem like a good mix of binaries, the standard binary distributions should back away from the C extension and just use G for now until instruction compression gets further along.

Tariq Kurd

Oct 14, 2022, 5:22:22 AM
to RISC-V ISA Dev, mic...@michaelzoran.net, Tariq Kurd, isa...@groups.riscv.org, L Peter Deutsch
The concept is that instructions are composed of 16-bit elements. Currently instructions can be formed of 1 or 2. In the future there may be more.
Implementing C is certainly a choice, but the improvement in I$ utilisation is very compelling and therefore increases performance. Zc takes this further than using only standard C.

Tariq

Bruce Hoult

Oct 14, 2022, 7:15:37 AM
to L Peter Deutsch, Tariq Kurd, isa...@groups.riscv.org
On Fri, Oct 14, 2022 at 1:42 AM L Peter Deutsch <gh...@major2nd.com> wrote:
>> Benchmarking results are here:
>> https://docs.google.com/spreadsheets/d/1bFMyGkuuulBXuIaMsjBINoCWoLwObr1l9h5TAWN8s7k/edit#gid=1837831327

> Thanks.  Was the code linked with shared libraries (which necessarily must
> have been compiled without the use of JVT)?  If so, are the code sizes and
> savings measured only on the non-shared part?  And in either case, did the
> linker use JVT to reduce the code size as suggested in the proposal?

This extension is aimed primarily at very small systems that run statically-linked code from ROM.

Code using Zcmp or Zcmt is incompatible with the RV64GC ISA assumed by current Linux etc OSes as it redefines and reuses the 16 bit opcodes for double precision floating point load and store.  Full-size instructions must be used instead.

It might be possible for some future platform specification (RVA22, RVA25, whatever) to forbid the use of those encodings and permit the use of Zcmp and Zcmt on machines running shrink-wrapped Linux distros. Such platform specification might also specify an ABI with thunks to swap JVT on shared library entry/exit.

However I don't think that has been thought about much and is out of scope for the ISA extension.

This standard extension is wanted/needed to replace existing practice on small embedded systems with custom extensions from companies such as Andes (e.g. CoDense) and Huawei (IoT platform) that have been shipping hardware for a few years already.

If it is completely ignored (at least Zcmp and Zcmt) on applications processors then that is fine.

Bruce Hoult

Oct 14, 2022, 7:30:21 AM
to Michael Zoran, RISC-V ISA Dev, tariq...@codasip.com, L Peter Deutsch
On Fri, Oct 14, 2022 at 11:40 AM Michael Zoran <mic...@michaelzoran.net> wrote:
> Looking at the results, it appears that 10%-20% is at the high end of the compression.

That is in addition to the 25% to 30% savings from using the original C extension. And some more from the B extension. 5%-10% ?

> I read that the C extension can cause a whole stream of uncompressed instructions following the compressed instruction to be unaligned.

Undeniably true -- except that with typically 50%-60% of all instructions being C instructions you seldom get more than a couple of unaligned 32 bit opcodes in a row!

> So I'm wondering if the 10%-20% compression is worth the possible execution performance decrease.

I'm not aware of any existing RISC-V core in commercial production that gets a performance decrease from using C instructions. Maybe if a branch target is to an unaligned 32 bit opcode on a machine that only fetches 32 bits at a time. Compiler options can avoid that.

> I also wonder if, given these results, which seem like a good mix of binaries, the standard binary distributions should back away from the C extension and just use G for now until instruction compression gets further along.

Absolutely NOT. The C extension gives massive advantages on any RISC-V CPU that runs code larger than a KiB or two. Only student projects or tiny RV32I cores doing very simple FSM replacement tasks would ever consider not implementing C.
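
The "seldom more than a couple in a row" remark can be checked with a back-of-envelope model. The sketch below assumes, purely for illustration, that each instruction is independently a 16-bit C instruction with probability `p_compressed`; real instruction streams are not independent, and the function names are hypothetical. A run of misaligned 32-bit instructions ends as soon as the next 16-bit instruction restores 4-byte alignment, so the run length is geometric.

```python
def expected_misaligned_run(p_compressed):
    """Mean number of consecutive misaligned 32-bit instructions,
    given that a misaligned run has started (geometric distribution)."""
    return 1.0 / p_compressed


def prob_run_longer_than(p_compressed, k):
    """Probability that a misaligned run contains more than k instructions."""
    return (1.0 - p_compressed) ** k


for p in (0.5, 0.6):
    print(p, expected_misaligned_run(p), prob_run_longer_than(p, 4))
```

Under this assumption, a 50%-60% share of C instructions gives an average run of about two misaligned 32-bit opcodes or fewer.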

L Peter Deutsch

Oct 14, 2022, 1:09:59 PM
to Bruce Hoult, tariq...@codasip.com, isa...@groups.riscv.org
> This extension is aimed primarily at very small systems that run
> statically-linked code from ROM.

Thanks for this background information. I'm coming from the desktop /
larger system perspective, and I have two suggestions that I believe would
significantly increase the value of the proposal for larger codes while not
impairing it for small ones.

1) The lion's share of size reduction, even for larger codes, comes from
Zcmp. I therefore suggest swapping the encodings for Zcmp with some of
those for Zcb, so that Zcmp doesn't conflict with Zcf/Zcd. Specifically, I
suggest swapping c.mul, c.*ext*, and c.lh* for cm.*. If cm.popret[z] and/or
cm.mv* contribute substantially less reduction, fewer encodings might need
to be swapped. Benchmarking would obviously be needed to determine which of
these swaps, if any, would be helpful.

2) Rather than a new CSR for JVT, I suggest that the base address for cm.jt
and cm.jlt be bits [XLEN-1:N] of the instruction's address, with bits
[N-1:0] set to 0. N is a fixed value (or perhaps settable by a CSR), I
imagine optimal at some value between 14 and 18.

The idea behind suggestion #2 is that it allows small codes (size less than
2^N) to be compressed just as well as a CSR, while it supports PIC and
sharable libraries, with no need for CSR management, as long as each
sharable module is mapped at a base address that is a multiple of 2^N.
Compiler and/or linker intervention is required for codes larger than size
2^N: while some padding may be required, it would increase the amount of
address space but not the amount of code executed.

I realize #2 is a radical idea for this stage of the review process: I offer
it as an alternative that may help support separating the ratification
process for Zcmt from that for the other Zc* proposals.

kr...@sifive.com

Oct 14, 2022, 3:00:15 PM
to Bruce Hoult, L Peter Deutsch, Tariq Kurd, isa...@groups.riscv.org

>>>>> On Sat, 15 Oct 2022 00:15:21 +1300, Bruce Hoult <br...@hoult.org> said:

| On Fri, Oct 14, 2022 at 1:42 AM L Peter Deutsch <gh...@major2nd.com> wrote:

| Code using Zcmp or Zcmt is incompatible with the RV64GC ISA assumed by current
| Linux etc OSes as it redefines and reuses the 16 bit opcodes for double
| precision floating point load and store.  Full-size instructions must be used
| instead.

| It might be possible for some future platform specification (RVA22, RVA25,
| whatever) to forbid the use of those encodings and permit the use of Zcmp and
| Zcmt on machines running shrink-wrapped Linux distros. Such platform
| specification might also specify an ABI with thunks to swap JVT on shared
| library entry/exit.
| However I don't think that has been thought about much and is out of scope for
| the ISA extension.

A conscious decision was made to not make these available to RVA
profiles, as these instructions are awkward for high-end processors.
For example, ARM dropped push/pop when moving from A32 to A64.

Krste

Bruce Hoult

Oct 14, 2022, 7:36:14 PM
to kr...@sifive.com, L Peter Deutsch, Tariq Kurd, isa...@groups.riscv.org
I completely agree that these extensions (at least Zcmp part, maybe Zcmt) are not a good idea for high end processors, not only because of the conflict with D encodings.

I think it would help a LOT if this was made clear as commentary in the proposal, as the VAST majority of comments I'm seeing here, reddit, telegram etc are from people assuming they are going to be seen in shrink-wrapped Linux distros, required on CPUs that run them, incompatible with current software and hardware etc.




Allen Baum

Oct 14, 2022, 8:38:10 PM
to Tariq Kurd, RISC-V ISA Dev, mic...@michaelzoran.net, L Peter Deutsch
For what it is worth: The architect of ARM's Thumb ISA attributed the success of the company to those compressed ops.
That's because it enabled early cellphones to use fewer ROM chips than other solutions - and even 20% would prevent spilling into another chip.


Bruce Hoult

Oct 14, 2022, 9:01:26 PM
to RISC-V ISA Dev, Allen Baum, RISC-V ISA Dev, mic...@michaelzoran.net, L Peter Deutsch, tariq...@codasip.com
While code size was important, the even more important factor was that the ARM7TDMI could run on either a 16 bit or a 32 bit bus -- and this could be different for different address ranges. MANY phones  took advantage of this and put, for example, 32 bit wide ROM with quite a bit of A32 code in it, and 16 bit wide RAM for application T16 program code.  Maybe there would be a small area of 32 bit wide RAM for the stack.

Thumb instructions being 16 bits long meant that instruction fetches took 1 cycle not 2. As instruction execution was in most cases 1 cycle this could almost double the execution speed on a low cost RAM system. The reason for not actually doubling the speed was mostly extra MOV instructions because of the lack of 3-address operations and to move data between upper 8 and lower 8 registers, and load/store for extra register spills.
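
The "almost double" figure can be sketched as simple arithmetic. The model below assumes execution is bound by instruction fetch on a 16-bit bus (2 cycles per 32-bit instruction, 1 cycle per 16-bit instruction); the 30% extra-instruction figure in the example is a hypothetical placeholder, not a measurement, and `thumb_speedup` is a made-up name.

```python
def thumb_speedup(extra_instruction_fraction):
    """Speedup of 16-bit code over 32-bit code on a 16-bit bus,
    when the 16-bit version needs proportionally more instructions."""
    cycles_32bit_code = 2.0                                  # two fetches each
    cycles_16bit_code = 1.0 * (1.0 + extra_instruction_fraction)
    return cycles_32bit_code / cycles_16bit_code


print(thumb_speedup(0.0))   # no extra instructions: exactly 2x
print(thumb_speedup(0.3))   # hypothetical 30% more instructions
```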

Paul Campbell

Oct 14, 2022, 11:11:06 PM
to Bruce Hoult, RISC-V ISA Dev, Allen Baum, mic...@michaelzoran.net, L Peter Deutsch, tariq...@codasip.com
Paul Campbell <tan...@gmail.com>
(oops meant to reply to all - Bruce sorry about the dupe)

yeah, well, retrofitting these into an existing high-end processor
(like mine) would be a bit of a night-mare - it would essentially
stall the decode stages in a bad way (because it would overload the
subsequent renaming/scheduling hardware) - but then dropping CISC
instructions that decode into multiple micro-ops into a RISC
instruction set is going to do that anywhere you do it

There's also a bunch of issues around exceptions that probably need to
be defined better - I'd have to expand this to a bunch of load/stores
followed by an add - because these instructions can take a page fault
in the middle - do we restart the whole stream? (remember I'm
executing them all out of order) or just the last portion? I think you
need to specify this because if you expect instructions to restart in
the middle there's additional state that needs to be stored (and you
architecturally need somewhere to store it), if not then there's a
minimum number of TLBs that need to be specified to be available (and
pages present in memory) at each access level to make progress
(probably at least 3, so at least 6 system wide, more for a
hypervisor) and you have to define the intermediate states where some
registers or memory are written and not others as valid

(I once worked on an x86 clone that was natively a RISC engine, we
handled the restart problem by having a fractional portion to the EPC
that indicated how far we had got through the expanded micro-ops so
that we could restart and handle state correctly - there's a very
small number of x86 instructions that, like this one, could create
multiple page faults - unless you can trust the underlying OS to be
very careful, you have to be able to handle the pathological case
where for some reason it decides to page out page N to page in page
N+1, and vice-versa - this might happen on a process with a tiny
working set - this is a traditional problem with CISCy instructions,
there was a Vax POLY instruction that needed something like 27
pages/tlbs to make worst case progress)

Finally, OpenSBI currently emulates misaligned accesses on machines that only do
aligned accesses - will it be able to emulate this instruction
reliably if an access crosses a page boundary (and the second page is
not paged in) - I suspect the answer is probably "yes" but it's
probably worth someone who understands this better spending some
quality time with the issue.

- Paul

Paul Campbell

Oct 14, 2022, 11:17:46 PM
to Bruce Hoult, RISC-V ISA Dev, Allen Baum, mic...@michaelzoran.net, L Peter Deutsch, tariq...@codasip.com
One other out of order case I forgot there - if a multiple store
writes to memory and crosses a page boundary is it permissible for it
to write the portion after the page boundary while faulting on the
first portion? is it OK if the second portion gets written twice?

- Paul

Paul Campbell

Oct 14, 2022, 11:22:31 PM
to Bruce Hoult, RISC-V ISA Dev, Allen Baum, mic...@michaelzoran.net, L Peter Deutsch, tariq...@codasip.com
Oh - and just in general what if it's an IO space (because you just
know someone is going to use this for cheap soft DMA) - if you cross a
page boundary in IO space what happens, can it be relied on to write
everything once? can it be relied on to write the data in order, even on
an out of order machine (that might normally explicitly issue fences)?
this stuff has to be written down to avoid people making non-portable
code

Bruce Hoult

Oct 14, 2022, 11:37:15 PM
to Paul Campbell, RISC-V ISA Dev, Allen Baum, mic...@michaelzoran.net, L Peter Deutsch, tariq...@codasip.com
On Sat, Oct 15, 2022 at 4:11 PM Paul Campbell <tan...@gmail.com> wrote:
> yeah, well, retrofitting these into an existing high-end processor
> (like mine) would be a bit of a night-mare

The people on the working group that designed these ISA extensions are WELL AWARE of that and they are NOT INTENDED to be implemented on high performance processors.

They are for processors competing against 8051, AVR, PIC, MSP430, ARMv6-M, ARMv7-M.

mike

Oct 15, 2022, 5:53:43 AM
to RISC-V ISA Dev, Bruce Hoult, RISC-V ISA Dev, Allen Baum, mic...@michaelzoran.net, L Peter Deutsch, tariq...@codasip.com, Paul Campbell
Hi all,

I'm curious about the design of Zcmp and in particular how it interacts with RV32E.  The lowest-end RV32 processors (competing with 8051, AVR, PIC etc) are likely to be RV32E, and also these are the processors that would get the most benefit from code compression.  But Zcmp seems significantly less useful on RV32E, since it can only push or pop at most 3 registers (ra, optionally s0, and if s0 then also optionally s1).
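
The three-register limit follows from the register numbering: RV32E only has x0-x15, while s2-s11 live in x18-x27. A small Python sketch of that filter (the helper name `pushable_registers` is hypothetical, and this models only which registers exist, not the instruction encoding):

```python
# ABI names of ra and the callee-saved registers, mapped to x-register numbers.
RA = ("ra", 1)
S_REGS = [("s0", 8), ("s1", 9)] + [(f"s{i}", 16 + i) for i in range(2, 12)]


def pushable_registers(num_x_regs):
    """Registers from {ra, s0-s11} that exist on a core with num_x_regs x-registers."""
    return [name for name, xnum in [RA] + S_REGS if xnum < num_x_regs]


print(pushable_registers(16))        # RV32E
print(len(pushable_registers(32)))   # RV32I
```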

I haven't stared at so much RV32 assembly in my time, but I've looked at a fair bit of Thumb-2, and as I recall the STMIA/STMDB and LDMDB/LDMIA instructions (analogous to push and pop respectively) are very useful for code compression on that platform.  However, they have significant flexibility that cm.push and cm.pop don't have:
* They can use any pointer register, not just SP, so they can be used to load structs or arrays.
* They can push or pop many subsets of registers other than just the ABI callee-saved ones.
* They can increment or decrement, and can adjust the pointer register or not.  These cost only one bit of encoding space each, but also seem less critical.

If we're opening the load/store-multiple can of worms, would there be some way to do it that benefits these use cases more than the current cm.push/pop instructions do?  Or does RV32C just not have the encoding space?  Or does that turn out not to be necessary/useful with how RV32 works in practice compared to Thumb-2?  Or is something for this case forthcoming in a future spec?

Thanks,
-- Mike

Bruce Hoult

Oct 15, 2022, 10:58:14 AM
to mike, Allen Baum, L Peter Deutsch, Paul Campbell, RISC-V ISA Dev, mic...@michaelzoran.net, tariq...@codasip.com
Quite apart from the desirability (which I think is of very questionable additional benefit), you physically can’t fit a bitmap for 16 registers in the C extension, which has only 15.58 bits of encoding space in total. The same applies to fitting a bitmap for 32 registers into the 29.95 bit encoding space of “32 bit” instructions.

Also IIRC rv32e increases code size by 20% or 30% so you’d only want to use it for extremely small programs that probably wouldn’t follow any ABI anyway.
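
The 15.58-bit figure for the C extension can be reproduced with a short calculation: the two low bits of a 16-bit instruction may take three of the four possible values (the fourth marks longer instructions), leaving 3 * 2^14 encodings. The sketch below only works through that 16-bit case; the variable names are illustrative.

```python
import math

# 16-bit instructions: 3 usable values of bits [1:0], 14 free bits above them.
c_encodings = 3 * 2**14
c_bits = math.log2(c_encodings)

bitmap_bits_needed = 16  # one bit per register for a 16-register bitmap

print(round(c_bits, 2))
print(c_bits < bitmap_bits_needed)
```

A 16-register bitmap alone would need more bits than the whole 16-bit encoding space provides, before any opcode bits are spent.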

Iztok Jeras

Oct 15, 2022, 4:33:45 PM
to RISC-V ISA Dev
I have a comment regarding overhead of unaligned instruction fetch, and a comment about the decrementing address order in push/pop.

Regarding overhead of accessing unaligned opcodes after a branch or jump, I proposed a modified memory decoder which would speed up unaligned access. This could be used either for tightly coupled memories or caches.

Regarding Zcmp push/pop instructions, the pseudo code defines memory accesses in the decrementing address order. This is the same order used in the GCC compiler, but the RISC-V instruction set supports filling the stack in incrementing address order.
I have experience with hardware optimized for incrementing address order, so I know this has an impact on performance.
One example is wrapped bursts issued by caches to external RAM: the first and incrementing addresses can be accessed fast, but to access a decrementing address, the full burst must finish first, since start address - 4 is the last beat in the burst.
Another example is a simple memory prefetch, which can be used to reduce the SRAM latency, this prefetch is also optimized for incrementing addresses.
Incrementing addresses are present in instruction fetch, array accesses, ...
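
The wrapped-burst point can be illustrated with a small model. The 32-byte line and 4-byte beat are assumptions chosen for the example, and `wrapped_burst_order` is a hypothetical helper, not part of any bus specification.

```python
def wrapped_burst_order(start, line_bytes=32, beat_bytes=4):
    """Addresses in the order a wrapping burst delivers them,
    beginning at `start` and wrapping at the line boundary."""
    base = start - (start % line_bytes)
    beats = line_bytes // beat_bytes
    return [base + (start - base + i * beat_bytes) % line_bytes
            for i in range(beats)]


start = 0x108
order = wrapped_burst_order(start)
print([hex(a) for a in order])
print(order.index(start + 4))   # next incrementing address: early beat
print(order.index(start - 4))   # next decrementing address: final beat
```

In this model the incrementing neighbour arrives in the second beat, while the decrementing neighbour arrives only when the burst completes.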

I also looked for signs that instructions would consume large additional resources like adders, and I did not find any.

I would like to see additional tables, which would specify for all instructions:
- compatibility with C extension,
- whether there is a mapping to 32-bit encoding,
- how each extension depends on other extensions.

Iztok

Michael Chapman

Oct 15, 2022, 6:11:09 PM
to isa...@groups.riscv.org

Compilers allocate registers in a predetermined order so you only need to encode the number of registers to store. A bit map of registers to save is not very useful.
Should SP be kept 16-byte aligned on embedded systems? If so the number of registers to save could be always a multiple of 4 to save another 2 bits.
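
A rough sketch of the trade-off being discussed, assuming RV32 (4-byte registers) and up to 13 saved registers (ra plus s0-s11). The helper names are hypothetical and this does not reflect the actual Zcmp encoding.

```python
def bits_for_count(max_regs):
    """Bits needed to encode a register count in the range 1..max_regs."""
    return (max_regs - 1).bit_length()


def round_up(n, multiple):
    """Round n up to the next multiple."""
    return -(-n // multiple) * multiple


print(bits_for_count(13), "bits for a count vs", 13, "bits for a bitmap")

# Forcing the saved-register count to a multiple of 4 keeps SP 16-byte
# aligned on RV32, but small functions then store registers they do not need.
for wanted in (1, 2, 5):
    stored = round_up(wanted, 4)
    print(wanted, "wanted ->", stored, "stored,", stored * 4, "bytes of stack")
```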


Bruce Hoult

Oct 15, 2022, 11:16:06 PM
to Michael Chapman, isa...@groups.riscv.org
On Sun, Oct 16, 2022 at 11:11 AM Michael Chapman <michael.c...@gmail.com> wrote:

> Compilers allocate registers in a predetermined order so you only need to encode the number of registers to store. A bit map of registers to save is not very useful.

That's what I said in the message you replied to :-)

> Should SP be kept 16-byte aligned on embedded systems? If so the number of registers to save could be always a multiple of 4 to save another 2 bits.

That would impose a pretty significant speed penalty on functions that only save 1 or 2 registers.
 

Tariq Kurd

Oct 17, 2022, 4:56:18 AM
to Bruce Hoult, Michael Chapman, isa...@groups.riscv.org
>I would like to see additional tables, which would specify for all instructions:
>- compatibility with C extension,
>- whether there is a mapping to 32-bit encoding,
>- how each extension depends on other extensions.

these details are all in the document.

Tariq


kr...@sifive.com

Oct 17, 2022, 5:17:55 PM
to Paul Campbell, Bruce Hoult, RISC-V ISA Dev, Allen Baum, mic...@michaelzoran.net, L Peter Deutsch, tariq...@codasip.com

>>>>> On Sat, 15 Oct 2022 16:10:51 +1300, Paul Campbell <tan...@gmail.com> said:

| Paul Campbell <tan...@gmail.com>
| (oops meant to reply to all - Bruce sorry about the dupe)

| yeah, well, retrofitting these into an existing high-end processor
| (like mine) would be a bit of a night-mare - it would essentially
| stall the decode stages in a bad way (because it would overload the
| subsequent renaming/scheduling hardware) - but then dropping CISC
| instructions that decode into multiple micro-ops into a RISC
| instruction set is going to do that anywhere you do it

| There's also a bunch of issues around exceptions that probably need to
| be defined better

The spec section 2.3 explicitly states that the entire sequence is
replayed, and that the stack should only be held in idempotent memory.
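
Why full replay is safe on idempotent memory can be shown with a toy model. This is a Python sketch, not the spec's pseudo-code: it ignores the stack-adjust rounding, assumes 4-byte registers, and uses made-up names (`push`, `push_with_replay`, `PageFault`). The key property is that SP is only updated after every store has succeeded, so a restart sees the original SP and simply rewrites the same values.

```python
class PageFault(Exception):
    pass


def push(regs, sp, memory, fault_at=None):
    """Store each register below sp and return the new sp.
    fault_at is the index of the store that faults (None = no fault)."""
    addr = sp
    for i, value in enumerate(regs):
        addr -= 4
        if i == fault_at:
            raise PageFault(addr)
        memory[addr] = value
    return addr  # sp changes only after all stores completed


def push_with_replay(regs, sp, memory, fault_at):
    """Take one fault, then restart the whole sequence from the original sp."""
    try:
        return push(regs, sp, memory, fault_at)
    except PageFault:
        # A handler would map the page here; earlier stores are repeated.
        return push(regs, sp, memory, fault_at=None)


clean, replayed = {}, {}
sp_clean = push([1, 2, 3], 0x1000, clean)
sp_replayed = push_with_replay([1, 2, 3], 0x1000, replayed, fault_at=2)
print(clean == replayed, hex(sp_clean), hex(sp_replayed))
```

The same replay on non-idempotent (for example I/O) memory would repeat side effects, which is why the stack is restricted to idempotent memory.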

Krste