On 10/20/2022 1:23 PM, 'Mark Hill' via RISC-V ISA Dev wrote:
> Hi BGB,
>
> The code size optimisation you describe is already supported by the gcc/LLVM RISC-V compilers using the -msave-restore flag. However, there are a number of issues with this approach, several of which have already been discussed on this thread, but to summarise:
> - the performance overhead of three additional jumps (call/return to prologue milli-code and jump to epilogue)
> - code size and performance overhead of additional stack increments
> - performance cost of pessimistic spilling (-msave-restore millicode routines always spill/fill a full 16-byte block)
> - disappointing code size savings in production/large embedded software stacks. Especially when milli-code functions are, for example, in a ROM and the functions that use them are far away (in address space terms) in a combination of external flash, embedded flash, SRAM, TCMs etc. In these situations the millicode calls/jumps will typically require 8 bytes of code each compared to the 2 byte load and 2 byte store per register they replace.
>
As noted, my case is a different ISA, and I am using neither GCC nor
LLVM...
But, in terms of "where it matters", it should be close enough.
But, yeah, as for the points:
* The additional jumps are unavoidable.
* The extra stack increments, also unavoidable.
* Save restore in my case is padded to a multiple of 2 registers, which
also happens to be 16B in this case.
* The compiler ignores any prologs/epilogs that are "out of range" of
the direct (20-bit) branch, in which case they will be emitted again.
As noted, there is also a set minimum number of registers.
So, in this case, the feature will not be used if saving/restoring fewer
than 6 registers (since in these cases it is cheaper to use an inline
prolog/epilog).
Switching to a larger branch would be undesirable in my case as well
(the larger branches in this case being absolute-addressed and not
handled by the branch predictor; or needing to compose the branch as a
multi-instruction sequence). So, in this case, it is cheaper to "forget"
that the previous version existed.
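The reuse rule above (minimum register count plus direct-branch reach) can be sketched in C; the names and thresholds here (can_reuse_stub, MIN_SAVED_REGS, BRANCH_RANGE, byte-granular reach) are my own illustrative assumptions, not the actual compiler's logic:

```c
#include <stdint.h>

/* Illustrative sketch only: decide whether a function at 'call_addr'
   can reuse an existing shared prolog/epilog stub at 'stub_addr'. */

#define MIN_SAVED_REGS 6             /* below this, inline is cheaper  */
#define BRANCH_RANGE   (1 << 19)     /* assumed +/- reach of a 20-bit
                                        displacement branch            */

static int can_reuse_stub(int64_t call_addr, int64_t stub_addr,
                          int n_saved)
{
    int64_t delta;
    if (n_saved < MIN_SAVED_REGS)
        return 0;                    /* emit an inline prolog/epilog  */
    delta = stub_addr - call_addr;
    if (delta < -BRANCH_RANGE || delta >= BRANCH_RANGE)
        return 0;                    /* out of range: "forget" the old
                                        stub, emit a new local copy   */
    return 1;                        /* shared stub is reusable       */
}
```

When reuse fails due to range, the stub is simply re-emitted locally rather than reached via a larger absolute or multi-instruction branch, matching the tradeoff described above.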
In the ISA in question, there are no PUSH/POP style instructions, only
SP-relative Load/Store (and no auto-increment addressing either).
Partly this was because (early on) neither of these really won out in a
cost/benefit sense (PUSH/POP being "not free" in terms of FPGA LUT cost,
and not saving much over the use of Load/Store ops).
In a lot of cases, the double-stack-adjust would have been needed either
way on this ISA, since it is not particularly unheard of for the stack
frame to be larger than what is directly reachable by the Load/Store
displacements (9-bit for 32-bit ops, 4-bit for 16-bit ops).
(Exceeding the 9-bit displacement field requires spending 8 bytes on
the Load/Store.)
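The double-stack-adjust can be modeled roughly as follows; this assumes an unsigned, byte-granular 9-bit displacement, and the function and parameter names are made up for illustration:

```c
/* Illustrative sketch: split a large frame into two SP adjustments so
   that the register-save area stays within reach of the 9-bit store
   displacement, with a second adjustment covering the rest. */

#define DISP9_MAX 511   /* assumed max byte offset of a 9-bit field */

static void split_frame(int frame_size, int save_area,
                        int *adj1, int *adj2)
{
    if (frame_size <= DISP9_MAX) {
        *adj1 = frame_size;             /* one adjustment suffices   */
        *adj2 = 0;
    } else {
        *adj1 = save_area;              /* saves land close to SP    */
        *adj2 = frame_size - save_area; /* rest of the frame         */
    }
}
```

With this split, the register saves always use short-form Load/Store encodings even when the overall frame is large.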
With stack-frame layout typically:
(Caller's red-zone and arguments)
-- high address --
Saved LR and GBR (analogous to RA and GP in RV);
Saved registers;
(First stack canary, 1);
Structures and arrays (amorphous storage area);
(Optional, second stack canary, 1);
Local and temporary variables (just above the argument area);
Space for stack arguments ("red zone" + additional arguments);
-- low address --
Where:
The stack-frames are fixed size at runtime:
alloca() and VLAs implemented via implicit heap allocation;
Any large structs/arrays are also folded into heap allocs (2);
There is no dedicated frame-pointer register.
Structures and arrays are passed by reference,
with implicit memory copy where needed.
Functions typically save both the link-register and global-pointer;
Different registers are used for arguments and return values:
For structs, the return register is used for the output struct;
If used, 'this' is also given its own dedicated register.
Up to 8 arguments may be passed in registers;
...
1: Canary values will be put onto the stack if any structures or arrays
are present in the stack frame. These are initialized to a randomized
value on function entry and validated on return.
Note that this would be per-function, not part of the reused sections.
This adds basic protection against buffer overflow and similar.
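A rough C-level model of the canary scheme described in (1); the PRNG and all names here are purely illustrative (the real mechanism presumably lives in the generated prolog/epilog, not in C source):

```c
#include <stdint.h>

/* Illustrative per-function stack canary: a randomized value written
   at the edge of the amorphous storage area on entry, checked on
   return.  Seed/update constants are arbitrary for the sketch. */

static uint64_t canary_seed = 0x243F6A8885A308D3ull;

static uint64_t fresh_canary(void)
{
    /* stand-in for whatever randomization the runtime actually uses */
    canary_seed = canary_seed * 6364136223846793005ull
                + 1442695040888963407ull;
    return canary_seed;
}

int canary_demo(void)
{
    uint64_t canary = fresh_canary();   /* init on function entry    */
    char buf[32];                       /* amorphous storage area    */
    uint64_t saved = canary;            /* slot at the frame edge    */

    buf[0] = 0;                         /* ... function body ...     */

    if (canary != saved)                /* validate on return;       */
        return -1;                      /* mismatch => overflow trap */
    return 0;
}
```

An overflow of buf past the canary slot would be caught at function return, which is the "basic protection" mentioned above.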
Well, there are other things, like my compiler will shuffle functions
and variables into a randomized order on each rebuild (intended as a
security feature), along with a limited amount of ASLR.
2: Typical stack sizes being 128K or 256K in this case, and a program
trying to put large arrays on the stack would otherwise bomb the stack.
In this case, the function will automatically free this memory before it
returns. Still generally a good idea not to put big arrays on the stack
though. It depends some on the program how much stack is needed (many
are fine on 64K, Doom needs 128K and Quake needs 256K, even with this
trick).
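What the compiler does with an oversized stack array, per (2), can be modeled in plain C roughly like this; the function name and sizes are made up for illustration:

```c
#include <stdlib.h>

/* Illustrative sketch: a large array that would otherwise bomb a
   128K/256K stack is rewritten into a heap allocation on entry,
   with a compiler-inserted free before every return path. */
int sum_table(int n)
{
    /* originally:  int big[65536];  -- 256K, too big for the stack */
    int *big = malloc(65536 * sizeof(int));  /* implicit heap alloc */
    int i, s = 0;

    if (!big)
        return -1;
    for (i = 0; i < n; i++)
        big[i] = i;
    for (i = 0; i < n; i++)
        s += big[i];

    free(big);   /* automatically freed before the function returns */
    return s;
}
```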
(Decided to leave out a bunch more stuff related to the ABI design).
But, yeah, I am essentially using a shared ABI for both MMU and NoMMU cases.
This basically requires functions to save and restore GBR, and to go
through a ritual in each function prolog to reload it to the correct
value, with all global variables also being accessed relative to GBR.
PC-relative access to global variables is not allowed in this case, as
it would be effectively incompatible with NoMMU use-cases
(effectively, the passed in GBR needs to be used to reload the correct
GBR for the called function, with the ABI defining the specifics for how
this ritual works).
Could have used ELF FDPIC, but my ABI has a lower runtime overhead than
the FDPIC approach.
Where FDPIC would handle GOT reload on the caller side; and my ABI
handles GBR reload on the callee side.
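One way to picture the callee-side reload is via a per-image table reachable through the incoming GBR; this is a guess at the mechanism for illustration only, and the struct layout and names are my assumptions, not the actual ABI:

```c
/* Illustrative model of callee-side GBR reload: each loaded image's
   global area begins with a pointer to a shared table of global-area
   pointers, so the callee can index its own area via the caller's
   GBR.  This is a sketch, not the actual ABI layout. */

typedef struct GlobalArea {
    struct GlobalArea **reload_tab;  /* shared per-image table       */
    /* ... global variables follow ... */
} GlobalArea;

static GlobalArea *reload_gbr(GlobalArea *caller_gbr, int callee_image)
{
    /* the callee-side "ritual": fetch the correct GBR through the
       table reachable from the incoming (caller's) GBR */
    return caller_gbr->reload_tab[callee_image];
}

static int demo_reload(void)
{
    GlobalArea img0, img1;
    GlobalArea *tab[2] = { &img0, &img1 };
    img0.reload_tab = tab;
    img1.reload_tab = tab;
    return reload_gbr(&img0, 1) == &img1;  /* callee lives in image 1 */
}
```

The point of the callee-side placement is that only functions which actually access globals need to pay for the reload, versus FDPIC paying at every indirect call site.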
Tradeoffs may depend some on design specifics of the C ABI in question.
But, in my case at least, it turned out to be a net positive.
> BTW if anyone is aware of a technique to coax the toolchains into producing multiple localised milli-code routines to keep the jumps/calls more compact in these situations it would be useful to know about.
>
This is what my compiler does, at least; no idea about GCC or LLVM, as
I am not using them for this (nor are there back-ends for my ISA in
these compilers).
It seemed like it should have been able to work OK on RISC-V, as there
isn't anything really in the ISA design that would prevent it (nor
necessarily make it "obviously worse" than the situation in my ISA).