[isa-dev] P.S. re: Zcmp vs expanded Zcmt


L Peter Deutsch

Oct 18, 2022, 10:46:58 AM
to isa...@groups.riscv.org
I wrote:

> - All of the other cm.* opcodes can be replaced with cm.jalt if we use bit 0
> of the address in the jump table to indicate that the return address should
> be placed in x5 rather than x1, the ABI standard for millicode.

Alternatively, and more in line with the current cm.j*t design, either the
opcode or the range of the index could indicate using x5 rather than x1 for
the return address.
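
For illustration only (the split of the index space below is hypothetical, not
part of the current Zcmt proposal), the index-range variant might look like:

    cm.jalt  40     # index in the ordinary range: return address written to x1 (ra)
    cm.jalt  250    # index in a reserved upper range: return address written to x5 (t0),
                    # the ABI's alternate link register for millicode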

--

L Peter Deutsch :: Aladdin Enterprises :: Healdsburg, CA & Burnaby, BC

Krste Asanovic

Oct 18, 2022, 3:56:41 PM
to L Peter Deutsch, isa...@groups.riscv.org
This was considered during design but entails a non-negligible
performance loss (two pipeline flushes instead of zero in simpler
designs).

Krste

Allen Baum

Oct 19, 2022, 3:20:06 AM
to Krste Asanovic, L Peter Deutsch, isa...@groups.riscv.org
The question that was asked is whether, for an implementation where code size is important,
it is worth the performance hit to use the cm.j* ops if they have a measurable effect on code size.
For implementations where these are important, that pipeline flush may not be that significant.
(It's a different micro-architectural model and application space than a server or workstation.)

(Side note: the architect of the ARM Thumb ISA has publicly said that the Thumb instruction
set saved the company because of its code size advantages in that application space.
They were able to get better performance because they needed less cache and less bandwidth from the I-cache,
not to mention they could use fewer or smaller ROMs. That's why code density improvements shouldn't be overlooked.)


Tariq Kurd

Oct 19, 2022, 3:51:00 AM
to Allen Baum, Krste Asanovic, L Peter Deutsch, isa...@groups.riscv.org
For small platforms like IoT chipsets the code-size is critical, as the system cost is strongly influenced by the cost of the external flash chips (which is why I started looking at code-size in the first place). If moving from another architecture to RISC-V means a larger flash chip is required then it's a bigger problem than any performance loss in the CPU which can be mitigated in other ways. There is a balance to be struck here, but fundamentally the smallest code-size wins for this class of processor. 

After a lot of analysis, the current spec for Zcmt gave the best trade-off between code size, implementation complexity and performance.

Tariq





--

Tariq Kurd | Lead IP Architect | Codasip UK Design Centre | www.codasip.com

Krste Asanovic

Oct 19, 2022, 4:03:35 AM
to Allen Baum, L Peter Deutsch, isa...@groups.riscv.org
I believe the question that was actually asked was whether we could remove push/pop and friends, and just use table jumps for these as well.
If implementation complexity and code size were the only concerns, this might be a reasonable approach.
However, performance is still a concern in these systems, hence the additional instructions.

Krste

Tariq Kurd

Oct 19, 2022, 6:16:46 AM
to Krste Asanovic, Allen Baum, L Peter Deutsch, isa...@groups.riscv.org
Hi,

In fact that approach doesn't work as well as you might think - this comes up repeatedly - because push/pop include the extra stack adjustment (spimm field). The extra stack adjustment varies independently of the register list, so by using a standard software routine you are forced to have a separate addi sp instruction, meaning that you need 32-bits (in the common case) compared to 16-bits if you use push/pop.
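
A minimal sketch of the size difference (the save-routine name here is illustrative, in the style of the -msave-restore helpers):

    cm.push  {ra, s0-s1}, -32     # Zcmp: one 16-bit instruction saves ra/s0/s1 and
                                  # performs the whole 32-byte adjustment
versus
    jal   t0, __save_ra_s0_s1     # call a shared save routine (x5/t0 link register)
    addi  sp, sp, -16             # the extra (spimm-style) part of the adjustment has
                                  # to be a separate instruction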

Tariq

L Peter Deutsch

Oct 19, 2022, 10:27:39 AM
to Tariq Kurd, kr...@sifive.com, allen...@esperantotech.com, isa...@groups.riscv.org
> In fact that approach doesn't work as well as you might think - this comes up
> repeatedly - because push/pop include the extra stack adjustment (spimm
> field). The extra stack adjustment varies independently of the register list,
> so by using a standard software routine you are forced to have a separate
> addi sp instruction, meaning that you need 32-bits (in the common case)
> compared to 16-bits if you use push/pop.

No, what I was proposing was that there be a separate millicode routine for
each combination of register list and stack adjustment, for the combinations
that occur frequently enough to make the approach worthwhile in the first
place. One can reduce the number of such routines by rounding up the stack
adjustment more coarsely. Also, routines with the same adjustment can be
tail-merged, e.g., pop(N+1)add(D) can fall through into pop(N)add(D).
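
As a hedged sketch (RV32, hypothetical routine names, assuming ra sits at the top of
the save area), tail-merged epilogues that share a total adjustment of 32 bytes could
look like:

__pop2_add32:                  # restores ra, s0, s1, then releases 32 bytes and returns
        lw   s1, 20(sp)
__pop1_add32:                  # restores ra, s0
        lw   s0, 24(sp)
__pop0_add32:                  # restores ra only
        lw   ra, 28(sp)
        addi sp, sp, 32
        ret

A caller just jumps (or table-jumps) to the entry point matching its register list.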

Tariq Kurd

Oct 19, 2022, 10:55:52 AM
to L Peter Deutsch, kr...@sifive.com, allen...@esperantotech.com, isa...@groups.riscv.org
There is a surprising number of combinations of register list and stack adjustment, so the approach doesn't work well; we've already tried it.
Rounding up the stack more coarsely wastes memory on embedded processors, which have very little, so that doesn't fly either.

Tariq


BGB

Oct 19, 2022, 12:18:48 PM
to isa...@groups.riscv.org
On 10/19/2022 9:55 AM, 'Tariq Kurd' via RISC-V ISA Dev wrote:
> There is a surprising number of combinations of register list and stack
> adjustments so the approach doesn't work well as we've already tried it.
> Rounding up the stack more coarsely wastes memory on embedded processors
> which have very little, so that doesn't fly either.
>

Albeit for a different ISA, one approach I had used in my compiler was
to emit any prolog/epilog sequences which were suitable (saved/restored
a minimum number of registers) as separate code blobs. If another
function comes along which matches a previously seen pattern, it calls
or branches into the previous code sequence.

Albeit, yes, the stack pointer does typically end up being double-adjusted
in these cases (since the total size of the stack frame is more variable
than the number of registers being saved/restored).


Originally this was intended as a size optimization feature, but it ended
up enabled in speed-optimized modes as well because on average it
improved performance, most likely by reducing the number of I$ misses
enough to offset what is lost to the additional branch instructions and
similar.
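
A rough RISC-V-flavoured sketch of the reuse (labels are hypothetical, and my
ISA differs; this only illustrates the shared epilogue and the double
adjustment):

func_a_epilogue:               # emitted inline for the first function seen
        lw   s1, 4(sp)
        lw   s0, 8(sp)
        lw   ra, 12(sp)
        addi sp, sp, 16
        ret

        ...
        addi sp, sp, 48        # later function with the same save pattern: drop its
                               # extra frame space first
        j    func_a_epilogue   # then reuse the earlier epilogue (second adjustment,
                               # restores, and return)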

Rogier Brussee

Oct 20, 2022, 11:11:07 AM
to isa...@groups.riscv.org


If one allows the push and popret(z) instructions to clobber register t0, one should be able to allow an implementation of the push instruction as an elaborate call instruction, with the actual storing of registers on the stack done in millicode. However, it also allows implementations that avoid millicode, exactly as in the current Zcmp proposal, when pushing all of the s-registers or only a small number of them (e.g. only for at most ra, s0, ..., s2), except that t0 is (or might be) also set.
 

push ra, s0, ..., s<n>, imm -->
          <indivisible>
              add  sp, sp, -((((n+1)*XLEN/8 + imm + 15) >> 4) << 4)   # stack adjustment for n+1 registers + 16*imm extra, rounded up in units of 16 bytes
              jalr t0, zero, <push_millicode> + (11 - n)*4            # jump to the magic millicode address; t0 now points to the instruction after the push instruction
          </indivisible>

push_millicode:   # magic absolute address, e.g. -2048; internally jalr t0, zero, imm may not have the 12-bit restriction on imm
          sx  s11, (13*XLEN/8)(sp)   # sx = sw for 32-bit, sd for 64-bit; assumed to be 32-bit instructions (if using c.sxsp, the (11-n)*4 above becomes (12-n)*2)
          sx  s10, (12*XLEN/8)(sp)
          ...
          sx  s1,  (3*XLEN/8)(sp)
          sx  s0,  (2*XLEN/8)(sp)
          sx  ra,  (1*XLEN/8)(sp)
          jr  t0

The expansion of push would still be two instructions whose combination must be indivisible (i.e. it must complete uninterrupted once started). Once the jalr has run, the normal rules apply, however. An implementation should also be allowed to behave exactly as proposed in Zcmp for some or all values of n, for performance reasons (essentially inlining the millicode), except that it should always set t0 as a link address to avoid implementation-dependent behaviour, even though the value of t0 should not matter as far as the calling convention is concerned, so leaving t0 implementation-defined may also be acceptable.

Popping can be done similarly, with most of the work done in millicode, while also allowing an implementation to behave as in the proposal for all or some n. However, AFAICS we again need to clobber an extra register in this case, this time as an argument to the millicode. We take t0 for this because of the precedent set with push. To avoid implementation-defined behaviour we should set it to the stack adjustment, but since we are going to return, the value of t0 should not matter under the rules of the calling convention, and leaving the value of t0 implementation-defined may be acceptable.
 
popret[z] ra, s0, ..., s<n>, imm -->
            <indivisible>
                      [li a0, 0]                                          # only for popretz
                      li   t0, +((((n+1)*XLEN/8 + imm + 15) >> 4) << 4)   # stack adjustment
                      jalr zero, zero, <pop_millicode> + (11-n)*4         # tail call
            </indivisible>

pop_millicode:   # magic address, e.g. -2048 + 64
          lx  s11, (13*XLEN/8)(sp)   # lx = lw for 32-bit, ld for 64-bit; assumed to be 32-bit instructions (if using c.lxsp, the (11-n)*4 above becomes (12-n)*2)
          lx  s10, (12*XLEN/8)(sp)
          ...
          lx  s1,  (3*XLEN/8)(sp)
          lx  s0,  (2*XLEN/8)(sp)
          lx  ra,  (1*XLEN/8)(sp)
          add sp, sp, t0
          jr  ra


Rogier





Mark Hill

Oct 20, 2022, 2:23:12 PM
to BGB, isa...@groups.riscv.org
Hi BGB,

The code size optimisation you describe is already supported by the gcc/LLVM RISC-V compilers using the -msave-restore flag. However, there are a number of issues with this approach, several of which have already been discussed on this thread, but to summarise:
- the performance overhead of three additional jumps (call/return to prologue milli-code and jump to epilogue)
- code size and performance overhead of additional stack increments
- performance cost of pessimistic spilling (-msave-restore millicode routines always spill/fill a full 16-byte block)
- disappointing code size savings in production/large embedded software stacks, especially when milli-code functions are, for example, in a ROM and the functions that use them are far away (in address space terms) in a combination of external flash, embedded flash, SRAM, TCMs etc. In these situations the millicode calls/jumps will typically require 8 bytes of code each compared to the 2-byte load and 2-byte store per register they replace (a sketch of this overhead follows below).
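
As a hedged illustration of that far-call cost (the save-routine name below is hypothetical; GCC's real helpers are the __riscv_save_*/__riscv_restore_* family):

1:      auipc t0, %pcrel_hi(__millicode_save)   # 4 bytes
        jalr  t0, %pcrel_lo(1b)(t0)             # 4 bytes: 8 bytes per call once the
                                                # routine is out of direct jal range
        # versus the inline compressed stores it replaces:
        sw    s0, 8(sp)                         # 2 bytes (c.swsp)
        sw    s1, 4(sp)                         # 2 bytes (c.swsp)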

BTW if anyone is aware of a technique to coax the toolchains into producing multiple localised milli-code routines to keep the jumps/calls more compact in these situations it would be useful to know about.

Mark

BGB

Oct 20, 2022, 6:36:57 PM
to isa...@groups.riscv.org
On 10/20/2022 1:23 PM, 'Mark Hill' via RISC-V ISA Dev wrote:
> Hi BGB,
>
> The code size optimisation you describe is already supported by the gcc/LLVM RISC-V compilers using the -msave-restore flag. However, there are a number of issues with this approach, several of which have already been discussed on this thread, but to summarise:
> - the performance overhead of three additional jumps (call/return to prologue milli-code and jump to epilogue)
> - code size and performance overhead of additional stack increments
> - performance cost of pessimistic spilling (-msave-restore millicode routines always spill/fill a full 16-byte block)
> - disappointing code size savings in production/large embedded software stacks. Especially when milli-code functions are, for example, in a ROM and the functions that use them are far away (in address space terms) in a combination of external flash, embedded flash, SRAM, TCMs etc. In these situations the millicode calls/jumps will typically require 8 bytes of code each compared to the 2-byte load and 2-byte store per register they replace.
>

As noted, my case is a different ISA, and I'm using neither GCC nor LLVM...
But in terms of "where it matters", it should be close enough.


But, yeah, as for the points:
* The additional jumps are unavoidable.
* The extra stack increments are also unavoidable.
* Save/restore in my case is padded to a multiple of 2 registers, which
also happens to be 16 bytes in this case.
* The compiler ignores any prologs/epilogs that are "out of range" of the
direct (20-bit) branch, in which case they will be emitted again.

As noted, there is also a set minimum number of registers.
So, in this case, the feature will not be used if saving/restoring fewer
than 6 registers (since in these cases it is cheaper to use an inline
prolog/epilog).


Switching to a larger branch would be undesirable in my case as well
(the larger branches in this case being absolute-addressed and not
handled by the branch predictor; or needing to compose the branch as a
multi-instruction sequence). So, in this case, it is cheaper to "forget"
that the previous version existed.


In the ISA in question, there are no PUSH/POP style instructions, only
SP-rel Load/Store (also no auto-increment either).

Partly this was because (early on) neither of these really won out in a
cost/benefit sense (PUSH/POP being "not free" in terms of FPGA LUT cost,
and not saving much over the use of Load/Store ops).

In a lot of cases, the double-stack-adjust would have been needed either
way on this ISA, since it is not particularly unheard of for the stack
frame to be larger than what is directly reachable by the Load/Store
displacements (9-bit for 32-bit ops, 4-bit for 16-bit ops).

(Exceeding the 9-bit displacement field requires spending 8 bytes on the
Load/Store.)


With stack-frame layout typically:
(Caller's red-zone and arguments)
-- high address --
Saved LR and GBR (analogous to RA and GP in RV);
Saved registers;
(First stack canary, 1);
Structures and arrays (amorphous storage area);
(Optional, second stack canary, 1);
Local and temporary variables (just above the argument area);
Space for stack arguments ("red zone" + additional arguments);
-- low address --

Where:
  The stack-frames are fixed size at runtime:
    alloca() and VLAs implemented via implicit heap allocation;
    Any large structs/arrays are also folded into heap allocs (2);
  There is no dedicated frame-pointer register.
  Structures and arrays are passed by reference,
    with implicit memory copy where needed.
  Functions typically save both the link-register and global-pointer;
  Different registers are used for arguments and return values:
    For structs, the return register is used for the output struct;
    If used, 'this' is also given its own dedicated register.
  Up to 8 arguments may be passed in registers;
  ...

1: Canary values will be put onto the stack if any structures or arrays
are present in the stack frame. These are initialized to a randomized
value on function entry and validated on return.
Note that this would be per-function, not part of the reused sections.
This adds basic protection against buffer overflow and similar.

Well, there are other things, like my compiler will shuffle functions
and variables into a randomized order on each rebuild (intended as a
security feature), along with a limited amount of ASLR.

2: Typical stack sizes are 128K or 256K in this case, and a program
trying to put large arrays on the stack would otherwise bomb the stack.
In this case, the function will automatically free this memory before it
returns. Still generally a good idea not to put big arrays on the stack
though. It depends some on the program how much stack is needed (many
are fine on 64K, Doom needs 128K and Quake needs 256K, even with this
trick).


(Decided to leave out a bunch more stuff related to the ABI design).

But, yeah, I am essentially using a shared ABI for both MMU and NoMMU cases.

This basically requires functions to save and restore GBR and go through a
ritual to get it reloaded to the correct value in each function prolog,
with all global variables also being accessed relative to GBR.
PC-relative access to global variables is not allowed in this case, as
this would be effectively incompatible with NoMMU use-cases
(effectively, the passed-in GBR needs to be used to reload the correct
GBR for the called function, with the ABI defining the specifics for how
this ritual works).

I could have used ELF FDPIC, but my ABI has a lower runtime overhead than
the FDPIC approach: FDPIC would handle GOT reload on the caller side,
whereas my ABI handles GBR reload on the callee side.


Tradeoffs may depend some on design specifics of the C ABI in question.
But, at least in my case, it turned out to be a net positive.


> BTW if anyone is aware of a technique to coax the toolchains into producing multiple localised milli-code routines to keep the jumps/calls more compact in these situations it would be useful to know about.
>

This is what my compiler does, at least; no idea about GCC or LLVM as I
am not using them for this (nor are there back-ends for my ISA in these
compilers).


It seemed like it should have been able to work OK on RISC-V, as there
isn't anything really in the ISA design that would prevent it (nor
necessarily make it "obviously worse" than the situation in my ISA).