On 1/10/2026 2:17 PM, K. York wrote:
> Once you have such a specification, it's really easy to specify how FinX
> & DinX should work:
>
> Call the normal algorithm, except with is_float always set to false
> (even when decomposing structures).
>
> Note on zero size objects: Represented as Align 1, Dataful bytes 0,
> Stride either 1 or 0 depending on the source language, is_struct either
> true or false. Recommended handling is to skip them without assigning a
> register, but some platforms do assign them a register and that has to
> be representable.
>
> Note: An argument with alignment of 0 means the algorithm can be
> immediately terminated, as this function will never be called.
>
Kinda makes sense, at least as far as the idea that size and alignment
remain power-of-2 up to the maximum size.
I am personally against decomposing structs though: whatever
potential/theoretical gain could be made by decomposing a struct is
offset by the needless complexity it adds, and it often ends up worse
than the simpler options of always either treating them as a register
pair, or passing a pointer to somewhere in memory.
Implicitly, for any struct passed/returned by value, it may also make
sense to pad the copy up to a multiple of the native register size
(which may be larger than its "sizeof()" in the case of structs with
smaller member alignment).
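
As a rough illustration of the padding idea (a sketch only; the helper
name "padded_copy_size" and the 8-byte default register size are
assumptions made for this example):

```python
def padded_copy_size(sizeof_bytes, reg_bytes=8):
    """Round a by-value struct copy up to a whole number of registers."""
    return (sizeof_bytes + reg_bytes - 1) // reg_bytes * reg_bytes

# A struct of three 16-bit members has sizeof() == 6 (member alignment 2),
# but its copy would occupy one full 8-byte register slot.
six_byte_struct = padded_copy_size(6)
twelve_byte_struct = padded_copy_size(12)
```

Here the 6-byte struct is copied as 8 bytes and the 12-byte struct as
16 bytes, so the copy can always be done with full-width loads/stores.
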
As noted, my interpretation of the ABI looked like:
Start assigning arguments from X10;
If type fits in a single register, pass as a single register;
If it needs two registers, pass in two registers;
If bigger than 2 regs, pass a pointer to somewhere in memory;
If 2 regs, and misaligned, pad to next alignment;
If no more argument regs are available, spill to stack.
Stack otherwise follows the same rules as for registers.
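
The rules above could be sketched roughly as follows. This is only an
illustration, not my actual compiler code: the names ("assign_args",
"ARG_REGS", etc.) are made up, it assumes 64-bit registers and
non-empty arguments, it assumes only register pairs with alignment
larger than one register get padded to an even register, and it
assumes an argument is never split between registers and stack (and
that registers are not back-filled after the first spill).

```python
XLEN_BYTES = 8                                # assumption: 64-bit registers
ARG_REGS = [f"X{n}" for n in range(10, 18)]   # X10..X17

def assign_args(args):
    """args: list of (size_bytes, align_bytes), with size_bytes >= 1.
    Returns one (kind, where, by_ref) tuple per argument."""
    next_reg = 0     # index of the next free argument register
    stack_off = 0    # next free byte offset in the stack argument area
    placed = []
    for size, align in args:
        by_ref = size > 2 * XLEN_BYTES
        if by_ref:
            # Too big for a register pair: pass a pointer instead.
            size, align = XLEN_BYTES, XLEN_BYTES
        nregs = (size + XLEN_BYTES - 1) // XLEN_BYTES
        # Assumption: only pairs with alignment > one register get padded.
        want_even = nregs == 2 and align > XLEN_BYTES
        if want_even and next_reg % 2:
            next_reg += 1
        if next_reg + nregs <= len(ARG_REGS):
            placed.append(("reg", ARG_REGS[next_reg:next_reg + nregs], by_ref))
            next_reg += nregs
        else:
            # Assumption: no splitting and no back-filling once we spill.
            next_reg = len(ARG_REGS)
            slot = 2 * XLEN_BYTES if want_even else XLEN_BYTES
            stack_off = (stack_off + slot - 1) // slot * slot
            placed.append(("stack", stack_off, by_ref))
            stack_off += nregs * XLEN_BYTES
    return placed
```

For example, assign_args([(4, 4), (16, 8), (24, 8)]) puts the first
argument in X10, the second in X11:X12, and passes the third as a
pointer in X13.
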
As for type-specific rules:
Integer types are always sign or zero extended to the full width of the
register (differs from the normal RV ABI in that "unsigned int" is zero
extended in my case, as sign-extension is both a mess and on average
worse than zero-extension even on plain RV64G or similar, as "ADDW" is
not exactly the bottleneck here).
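
To make the difference concrete, a small sketch (the function name
"extend_to_reg" is made up for this example, and it assumes a 64-bit
register):

```python
def extend_to_reg(value, bits, signed, xlen=64):
    """Return the xlen-bit register image of a bits-wide integer value."""
    v = value & ((1 << bits) - 1)       # keep only the low 'bits' bits
    if signed and (v >> (bits - 1)):    # top bit set: sign-extend
        v -= 1 << bits
    return v & ((1 << xlen) - 1)        # wrap into the register width

u = 0x80000000                          # an "unsigned int" with bit 31 set
mine = extend_to_reg(u, 32, signed=False)   # zero-extended, as in my case
rv   = extend_to_reg(u, 32, signed=True)    # sign-extended, as in normal RV64
```

Here "mine" is 0x0000000080000000, whereas "rv" is 0xFFFFFFFF80000000
(the normal RV64 ABI keeps 32-bit values sign-extended even when the
type is unsigned).
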
Narrower scalar floating point types are always promoted to register
width in my case:
So, for a 64-bit machine, this would mean passing "float" arguments as
Binary64 (implicitly promoting them to "double");
On a 32-bit machine, "float" would remain as Binary32, but a 16-bit type
("short float") would still be passed in a form where it is promoted to
"float".
This promotion excludes SIMD though, where vector elements remain in
their native form.
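
As a sketch of what the promotion means for the register contents of a
scalar "float" argument (the function name "float_arg_bits" is made up
for this example; it only covers the 64-bit and 32-bit cases described
above, not "short float" or SIMD):

```python
import struct

def float_arg_bits(x, xlen=64):
    """Bit pattern placed in the register for a scalar 'float' argument."""
    # A C "float" value only has Binary32 precision to begin with.
    as_f32 = struct.unpack("<f", struct.pack("<f", x))[0]
    if xlen == 64:
        # 64-bit machine: promoted to Binary64 (as if it were "double").
        return struct.unpack("<Q", struct.pack("<d", as_f32))[0]
    # 32-bit machine: stays as Binary32.
    return struct.unpack("<I", struct.pack("<f", as_f32))[0]

bits64 = float_arg_bits(1.5, xlen=64)
bits32 = float_arg_bits(1.5, xlen=32)
```

For 1.5, this gives 0x3FF8000000000000 on the 64-bit machine and
0x3FC00000 on the 32-bit machine.
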
Had noted, in general:
I tend to see best efficiency when the number of callee-save and scratch
registers is roughly balanced.
For the X registers, it isn't quite, but "close enough", and to what
extent it was unbalanced is offset by the ISA design needing generally a
few extra scratch registers to avoid getting trapped in a corner (I put
X5..X7 in a special category I call "stomp" registers, or basically a
special category of registers that are ignored by normal register
allocation, and exist solely as spare registers for "when we need extra
registers to make the desired operation actually work").
Example uses of stomp registers being, say, when you want a load/store
displacement, and it doesn't fit into 12 bits, so your LW or similar
needs to break apart into "LUI+ADD+LW" or similar, but then one needs a
register to hold this value.
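
A sketch of that decomposition (names like "split_disp" and
"expand_load" are made up for this example, X5 is just picked as the
stomp register, and it assumes the displacement is small enough that
the upper part fits the 20-bit LUI immediate):

```python
def split_disp(disp):
    """Split disp into (hi20, lo12) such that disp == (hi20 << 12) + lo12,
    with lo12 in the signed 12-bit range used by LW."""
    lo12 = ((disp & 0xFFF) ^ 0x800) - 0x800   # sign-extended low 12 bits
    hi20 = (disp - lo12) >> 12
    return hi20, lo12

def expand_load(rd, base, disp, stomp="X5"):
    """Expand a logical 'LW rd, disp(base)' into true instructions."""
    if -2048 <= disp <= 2047:
        return [f"LW {rd}, {disp}({base})"]   # fits: no stomp register needed
    hi20, lo12 = split_disp(disp)
    return [
        f"LUI {stomp}, {hi20 & 0xFFFFF}",     # stomp = upper part of disp
        f"ADD {stomp}, {stomp}, {base}",      # stomp += base
        f"LW {rd}, {lo12}({stomp})",          # load with the remaining 12 bits
    ]

seq = expand_load("X10", "X11", 0x12345)
```

The logical LW here turns into three true instructions, and X5 gets
stomped along the way, which is why it can't be used to hold anything
that needs to survive between logical instructions.
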
In this case, there was a partial split between "logical instructions"
(which may include pseudo instructions), and "true instructions", with
the stomp registers not required to be preserved between logical
instructions (typically because they got stomped by decomposing a
pseudo-instruction).
For the F registers, the 12/20 split (of callee-save vs scratch) was off
balance: my compiler kept running out of callee-save F registers in
some cases (particularly when dealing with 128-bit register-pair SIMD),
so I re-balanced it to a 16/16 split.
Note that going further, namely a 20/12 split, would have begun to
negatively impact leaf functions and leaf-blocks, which benefit more
from being able to allocate variables in scratch registers.
The plain LP64 ABI (with all F registers as scratch) is a bad situation
due to it effectively making the F registers useless for holding
floating-point local variables in non-leaf functions.
For non-leaf functions, only dynamically-assigned variables may go into
scratch registers, with anything that was held in a scratch register
needing to be spilled across function calls. This makes scratch
registers preferable for temporaries, where the "lifespan" of a given
temporary rarely extends past a given basic-block, and the compiler can
note short-lived temporaries and skip over needing to save their values
to the stack.
Where, if there does not exist a control path where a variable's value
is used as an input to another expression beyond the current basic
block, its value can be discarded once it goes out of scope (no need to
save to the stack or anything, just "poof" and it is gone). Otherwise,
the value needs to be spilled to the stack if it is in a scratch
register, but not if it is in a callee-save register. Leaf functions can
leave variables inside scratch registers though, as there are no pesky
function calls to stomp all of them.
If a block ends in a function call (in my compiler, function calls break
up basic-blocks), then only non-argument scratch registers may be used
for dynamic assignment, and temporaries whose lifespan ends as a function
argument are mapped directly to said argument register (reduces register
pressure slightly and also saves on register moves).
But, say:
Run out of callee-save registers, then you need to start spilling stuff;
Run out of scratch registers, then you may need to use callee-save
registers (slightly less cheap, due to prolog/epilog costs).
If registers needed for variables get tight (in a non-leaf function), it
is generally preferable to be running out of scratch registers than
running out of callee-save registers though (since you need either a
callee-save register, or a stack spill, for anything that may need to
survive across a function call).
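
Very roughly, the preference order looks something like the following
(a simplified sketch with made-up names, not the actual allocator
logic, which has more cases than this):

```python
def pick_home(lives_across_call, is_leaf, free_scratch, free_callee_save):
    """Pick where a variable lives, given how many registers of each
    kind are still free."""
    if is_leaf or not lives_across_call:
        # Nothing will stomp it: prefer the cheaper scratch registers.
        if free_scratch > 0:
            return "scratch"
        # Out of scratch: fall back to callee-save (prolog/epilog cost).
        if free_callee_save > 0:
            return "callee-save"
        return "stack"
    # Must survive a function call: callee-save register or stack spill.
    if free_callee_save > 0:
        return "callee-save"
    return "stack"

temp_in_nonleaf = pick_home(False, False, free_scratch=0, free_callee_save=3)
local_in_nonleaf = pick_home(True, False, free_scratch=5, free_callee_save=0)
```

In the first case running out of scratch registers just costs a
callee-save register; in the second, running out of callee-save
registers forces a stack spill even though scratch registers are free.
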
Generally, seems to make sense to leave around 1/4 of the total
registers for function arguments.
Though, did recently experiment with using F10..F17 for arguments 9 to
16 (in a variant ABI where all arguments are passed the same regardless
of type, more like in the RV LP64 ABI), but have noted that in the case
of normal RV64 variants, this is worse for code density and performance
than simply spilling these arguments to the stack (or, IOW, using F
registers to pass integer or pointer arguments is net-negative).
So, the existing "X10..X17, everything more goes onto stack" still seems
to be the best match for my use-cases (while one could argue that LP64D
is preferable if one has F registers available, for more subtle reasons,
this is actually worse in my use-case than always using the X registers
regardless of argument type).
For a different ISA mode (where it has access to both the X and F
registers as a single unified register space), increasing the argument
count to 16 does show a useful performance advantage though; just not so
great for RV64G or similar.
...
> ~Kane
>
> On Sat Jan 10, 2026, 08:04 PM GMT, K. York <mailto:kane...@gmail.com>
> wrote:
>
> In my opinion, an ABI's register allocation should be defined as an
> algorithm taking as input a sequence of these objects: {stride,
> alignment, dataful_byte_count, {is_struct, is_float, is_vector,
> decompose_struct() -> [Arg, offset]}}
>
> Stride: If this argument was placed in an array, the pointer
> increment between array members. Also known as size.
> Alignment: Must be power of 2.
> Dataful byte count: The total number of non padding bytes. Arguments
> with padding can be detected as dataful != stride.
> Note: Primitive integers have these three values all equal
> The type information object is hopefully self explanatory -- the
> three booleans are mutually exclusive -- except for the
> decompose_struct method which allows for conditional splitting of
> structures into registers when profitable.
>
> A union of float and int is represented as a struct with two members
> both at offset 0.
>
> Preferably, this would be written as a WHATWG style algorithm
> specification.
>
> ~Kane
>