Calling convention for RV32_Zdinx


L Peter Deutsch

Jan 10, 2026, 10:26:56 AM
to RISC-V ISA Dev
With Zdinx, floating point argument and result values presumably follow the
integer calling convention. Section 2.1 of the ABIs specification says that
"Scalars that are 2*XLEN bits wide are passed in a pair of argument
registers," but I don't see a requirement that these be an even-odd pair. I
infer that this applies to 64-bit floats, so for example with:
void f(int x, double y)
x would be passed in x10, and y would be passed with the low bits in x11 and
the high bits in x12. Is this correct? It does mean that in this case, the
RV32 load-pair instruction can't be used, the two lw instructions will be
different for big- and little-endian targets, and both halves of the 64-bit
float will have to be moved individually to other registers for computation.
This doesn't seem unreasonable to me, but it's odd enough that I wanted to
check.
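
A minimal sketch of the assignment described above, assuming the integer calling convention is applied unchanged to a 64-bit float on RV32: eight argument registers x10..x17, no even-register alignment for pairs, and the low word in the lower-numbered register. The names `assign_ilp32` and `arg_loc` are hypothetical and exist only for this illustration.

```c
#include <assert.h>

/* Illustrative model only: the integer argument registers are x10..x17. */
enum { FIRST_ARG_REG = 10, LAST_ARG_REG = 17, ON_STACK = -1, UNUSED = 0 };

/* Location of the low and high 32-bit words of one argument:
   a register number (10..17), ON_STACK, or UNUSED for a 1-word argument. */
struct arg_loc { int lo; int hi; };

/* Assign one scalar of `words` 32-bit words (1 or 2), starting at *next_reg.
   Pairs are not aligned to an even register, so a pair can begin at an odd
   register and can also be split between x17 and the stack. */
static struct arg_loc assign_ilp32(int *next_reg, int words)
{
    struct arg_loc loc = { ON_STACK, words == 2 ? ON_STACK : UNUSED };
    assert(words == 1 || words == 2);
    if (*next_reg <= LAST_ARG_REG)
        loc.lo = (*next_reg)++;          /* low word: next free register */
    if (words == 2 && *next_reg <= LAST_ARG_REG)
        loc.hi = (*next_reg)++;          /* high word: the register after it */
    return loc;
}
```

For `void f(int x, double y)`, calling this with 1 word and then 2 words from x10 places x in x10 and y in x11 (low) and x12 (high), which is the odd-even pair in question.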

--

L Peter Deutsch :: Aladdin Enterprises :: Healdsburg, CA & Burnaby, BC

Andrew Waterman

Jan 10, 2026, 2:24:16 PM
to L Peter Deutsch, RISC-V ISA Dev
Yeah, the RV32 ILP32 calling convention passes FP64 values in x-register pairs that are not necessarily aligned.  The explanation is that this calling convention was defined to support soft-float, which doesn't benefit from register alignment.  You're right that this design is suboptimal for RV32_Zdinx; the question is to what extent.  My suspicion is that the perf loss is small enough, and the use case is uncommon enough, that the toolchain folks won't want to support another calling convention, another multilib set, etc.

--
You received this message because you are subscribed to the Google Groups "RISC-V ISA Dev" group.
To unsubscribe from this group and stop receiving emails from it, send an email to isa-dev+u...@groups.riscv.org.
To view this discussion visit https://groups.google.com/a/groups.riscv.org/d/msgid/isa-dev/20260110152644.4E13CEC25B4%40serpent.at.major2nd.com.

BGB

Jan 10, 2026, 2:41:32 PM
to isa...@groups.riscv.org
I don't know about the standard ABI here (it implies that misaligned pairs
are allowed), but in my ISA variant I was in a similar situation (although
for 128-bit types on a 64-bit target), and in my case I did end up adding
an additional constraint:
The value is required to be in an even+odd register pair.

Dealing with misaligned pairs adds more cost (to both code density and
overall performance) than "wasting" an argument by padding to the next
even register. Likewise, with padding, since the number of argument
registers is even, the value is either entirely in registers or entirely
on the stack, with no possibility of a "half in register, half on stack"
edge case.
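
The "no half in register, half on stack" property can be checked with a small sketch. It assumes eight argument registers x10..x17 (so the first register is even and the count is even); `pair_is_split` is a hypothetical helper, not part of any ABI document.

```c
#include <assert.h>
#include <stdbool.h>

enum { FIRST_ARG_REG = 10, LAST_ARG_REG = 17 };

/* True if a two-register value, with `next_reg` as the first free argument
   register, would end up half in a register and half on the stack.
   next_reg == LAST_ARG_REG + 1 means all argument registers are used. */
static bool pair_is_split(int next_reg, bool pad_to_even)
{
    assert(next_reg >= FIRST_ARG_REG);
    if (pad_to_even && (next_reg & 1))
        next_reg++;                      /* waste the odd register as padding */
    /* Split: the first half still fits in a register, the second does not. */
    return next_reg <= LAST_ARG_REG && next_reg + 1 > LAST_ARG_REG;
}
```

With padding enabled the function returns false for every starting register from x10 through "none left"; without padding it returns true when only x17 remains.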

In my case, this does potentially add ambiguity for things like returning
structs by reference (used if too large to fit into a register pair) or
for "thiscall" or similar, which may implicitly add 1 or 2 extra hidden
arguments, with the 1-argument case changing the effective alignment.
Here, pairs still need to start at even registers, so if the first argument
is a pair or similar, it is as if the hidden argument always takes 2 slots.
That is, the pair always starts at X12 in this case, rather than at X11
(misaligned), where a single-register argument would go, or at X10 (when
there are no hidden arguments).

But, yeah, such an ABI is potentially nonstandard...


Whenever they were in registers, 128-bit values were in general also
required to be in even+odd pairs. I would assume a similar constraint could
also be imposed on RV32+Zdinx. Depending on the implementation, mandating
an even pair also allows for reducing hardware cost in some areas compared
with allowing odd-aligned pairs.

Here, one can treat each even pair as a larger virtual register. But many
ways of supporting larger virtual registers as pairs are closed off if odd
pairs are allowed (say, if one must then always use two ports to the
register file, rather than a single port that may optionally be used at
twice the normal width when accessing an even register, etc).

Say, for example, you were to implement an RV32 machine in a way
where nominally it had 16x or 32x 64-bit registers internally rather
than 32x or 64x 32-bit. Then, when accessing an even register, you see the
full width, but when accessing an odd register, the high half is mirrored
into the low half. Storing to an even register may either modify the whole
register or just the low half, and storing to an odd register only modifies
the upper half.
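
A rough software model of such a register file, as a sketch only. It assumes 16 internal 64-bit registers backing 32 architectural 32-bit registers and ignores the hardwired-zero x0; the function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define NUM_WIDE_REGS 16            /* 16 x 64-bit backs 32 x 32-bit */

static uint64_t wide[NUM_WIDE_REGS];

/* 32-bit read: an even index sees the low half, an odd index sees the
   high half (mirrored down into the low 32 bits of the read port). */
static uint32_t read32(unsigned idx)
{
    assert(idx < 2 * NUM_WIDE_REGS);
    uint64_t w = wide[idx >> 1];
    return (uint32_t)((idx & 1) ? (w >> 32) : w);
}

/* 64-bit (pair) read: only legal on an even index, one port at full width. */
static uint64_t read64(unsigned idx)
{
    assert(idx < 2 * NUM_WIDE_REGS && (idx & 1) == 0);
    return wide[idx >> 1];
}

/* 32-bit write: even modifies only the low half, odd only the high half. */
static void write32(unsigned idx, uint32_t v)
{
    assert(idx < 2 * NUM_WIDE_REGS);
    uint64_t *w = &wide[idx >> 1];
    if (idx & 1)
        *w = (*w & 0x00000000FFFFFFFFull) | ((uint64_t)v << 32);
    else
        *w = (*w & 0xFFFFFFFF00000000ull) | v;
}

/* 64-bit (pair) write: only legal on an even index, replaces the whole register. */
static void write64(unsigned idx, uint64_t v)
{
    assert(idx < 2 * NUM_WIDE_REGS && (idx & 1) == 0);
    wide[idx >> 1] = v;
}
```

An odd-aligned pair such as x11/x12 would straddle two internal registers in this model, which is why it cannot use the single wide port.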



Others may or may not disagree, but this is the direction I went.

As for big endian targets, dunno there. Personally I have been ignoring
the possibility of native BE.

It is preferable IMO to leave hardware as LE by default, and to deal
with BE via byte-swap instructions, say:
Swap 16/32/64 bit values (64b swap N/A for RV32);
Support sign and zero extension for types narrower than the width.

This means that accessing BE data would have a higher latency than
accessing LE data, but in most cases this is unlikely to become a
serious issue.
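
In C terms, the load-then-swap approach might look like the sketch below. It assumes a little-endian host, and the helper names (`bswap16`, `bswap32`, `load_be32`, `load_be16_signed`) are made up for illustration; real code would typically map the swaps to a byte-reverse instruction or a compiler builtin.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Reverse the byte order of a 32-bit value. */
static uint32_t bswap32(uint32_t v)
{
    return (v >> 24) | ((v >> 8) & 0x0000FF00u) |
           ((v << 8) & 0x00FF0000u) | (v << 24);
}

/* Reverse the byte order of a 16-bit value. */
static uint16_t bswap16(uint16_t v)
{
    return (uint16_t)((v >> 8) | (v << 8));
}

/* Big-endian 32-bit load on a little-endian machine: plain load, then swap. */
static uint32_t load_be32(const void *p)
{
    uint32_t raw;
    assert(p != NULL);
    memcpy(&raw, p, sizeof raw);
    return bswap32(raw);
}

/* Big-endian 16-bit load with sign extension to the full register width. */
static int32_t load_be16_signed(const void *p)
{
    uint16_t raw;
    assert(p != NULL);
    memcpy(&raw, p, sizeof raw);
    return (int16_t)bswap16(raw);
}
```

The extra swap (and extension) step after the load is where the added latency for BE data comes from.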

Admittedly, in my case there is also a "__bigendian" keyword added as a
C extension, where:
Applied to a pointer, means pointed-to value is big endian;
Applied to a struct type, means all of its members default to big endian;
For normal scalar types, mostly N/A (does nothing to a local variable
unless you take its address).

...


K. York

Jan 10, 2026, 3:05:02 PM
to RISC-V ISA Dev, BGB
In my opinion, an ABI's register allocation should be defined as an algorithm taking as input a sequence of these objects: {stride, alignment, dataful_byte_count, {is_struct, is_float, is_vector, decompose_struct() -> [Arg, offset]}}

Stride: If this argument was placed in an array, the pointer increment between array members. Also known as size.
Alignment: Must be power of 2.
Dataful byte count: The total number of non padding bytes. Arguments with padding can be detected as dataful != stride.
Note: Primitive integers have these three values all equal
The type information object is hopefully self explanatory -- the three booleans are mutually exclusive -- except for the decompose_struct method which allows for conditional splitting of structures into registers when profitable.

A union of float and int is represented as a struct with two members both at offset 0.

Preferably, this would be written as a WHATWG style algorithm specification.

~Kane


K. York

Jan 10, 2026, 3:17:23 PM
to RISC-V ISA Dev, BGB
Once you have such a specification, it's really easy to specify how FinX & DinX should work:

Call the normal algorithm, except with is_float always set to false (even when decomposing structures).

Note on zero size objects: Represented as Align 1, Dataful bytes 0, Stride either 1 or 0 depending on the source language, is_struct either true or false. Recommended handling is to skip them without assigning a register, but some platforms do assign them a register and that has to be representable.

Note: An argument with alignment of 0 means the algorithm can be immediately terminated, as this function will never be called.
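
One possible C rendering of the argument descriptor and the Zfinx/Zdinx rule, as a sketch only. The three mutually exclusive booleans are folded into a hypothetical `arg_kind` enum, the `decompose_struct` method is omitted, and all names here are illustrative rather than taken from any specification.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* is_struct / is_float / is_vector are mutually exclusive, so one enum
   suffices; KIND_INTEGER covers "none of the three". */
enum arg_kind { KIND_INTEGER, KIND_STRUCT, KIND_FLOAT, KIND_VECTOR };

struct arg_desc {
    size_t stride;         /* pointer increment between array members */
    size_t alignment;      /* power of 2; 0 means "never called" */
    size_t dataful_bytes;  /* number of non-padding bytes */
    enum arg_kind kind;
};

static bool is_power_of_two(size_t n)
{
    return n != 0 && (n & (n - 1)) == 0;
}

/* Arguments with padding are detected as dataful != stride. */
static bool has_padding(const struct arg_desc *a)
{
    return a->dataful_bytes != a->stride;
}

/* Zfinx/Zdinx rule: erase float-ness, then run the normal algorithm
   on the result. */
static struct arg_desc as_zdinx(struct arg_desc a)
{
    assert(a.alignment == 0 || is_power_of_two(a.alignment));
    if (a.kind == KIND_FLOAT)
        a.kind = KIND_INTEGER;
    return a;
}
```

A full version would also apply `as_zdinx` to each member produced when decomposing a struct, as the message notes.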

~Kane


BGB

Jan 10, 2026, 4:58:52 PM
to K. York, RISC-V ISA Dev
On 1/10/2026 2:17 PM, K. York wrote:
> Once you have such a specification, it's really easy to specify how FinX
> & DinX should work:
>
> Call the normal algorithm, except with is_float always set to false
> (even when decomposing structures).
>
> Note on zero size objects: Represented as Align 1, Dataful bytes 0,
> Stride either 1 or 0 depending on the source language, is_struct either
> true or false. Recommended handling is to skip them without assigning a
> register, but some platforms do assign them a register and that has to
> be representable.
>
> Note: An argument with alignment of 0 means the algorithm can be
> immediately terminated, as this function will never be called.
>

Kinda makes sense, at least as far as the idea that size and alignment
remain power-of-2 up to the maximum size.

I am personally against decomposing structs though, as what
potential/theoretical gain could be made by decomposing a struct is
offset by it adding needless complexity and often ending up worse than
the simpler options of always either treating them as a register pair,
or passing a pointer to somewhere in memory.

Implicitly, for any struct passed/returned by value, it may also make sense
to pad the copy up to a multiple of the native register size (which may
potentially be larger than its "sizeof()" in the case of structs with
smaller member alignment).


As noted, my interpretation of the ABI looked like:
Start assigning arguments from X10;
If type fits in a single register, pass as a single register;
If it needs two registers, pass in two registers;
If bigger than 2 regs, pass a pointer to somewhere in memory;
If 2 regs, and misaligned, pad to next alignment;
If no more argument regs are available, spill to stack.
Stack otherwise follows the same rules as for registers.
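
The rule list above can be sketched as a small placement function. This is an illustration under stated assumptions, not the actual compiler's code: argument registers are X10..X17, `reg_bytes` is the native register size, and `place_arg`, `placement`, and `pass_how` are hypothetical names. Stack layout is not modeled.

```c
#include <assert.h>
#include <stddef.h>

enum { FIRST_ARG_REG = 10, LAST_ARG_REG = 17, ON_STACK = -1 };
enum pass_how { PASS_ONE_REG, PASS_REG_PAIR, PASS_BY_POINTER };

struct placement {
    enum pass_how how;
    int first_reg;          /* first register used, or ON_STACK */
};

/* Place one argument of `size` bytes, starting at *next_reg. */
static struct placement place_arg(int *next_reg, size_t size, size_t reg_bytes)
{
    struct placement p;
    int regs;
    assert(size > 0 && reg_bytes > 0);

    if (size <= reg_bytes) {
        p.how = PASS_ONE_REG;    regs = 1;
    } else if (size <= 2 * reg_bytes) {
        p.how = PASS_REG_PAIR;   regs = 2;
    } else {
        p.how = PASS_BY_POINTER; regs = 1;   /* the pointer takes one register */
    }

    if (regs == 2 && (*next_reg & 1))
        (*next_reg)++;                       /* misaligned pair: pad to even */

    if (*next_reg + regs - 1 > LAST_ARG_REG) {
        p.first_reg = ON_STACK;              /* no registers left: spill */
    } else {
        p.first_reg = *next_reg;
        *next_reg += regs;
    }
    return p;
}
```

On a 32-bit target (`reg_bytes` of 4), placing an `int` and then a `double` from X10 gives X10 for the `int` and the pair X12/X13 for the `double`, with X11 left unused as padding.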

As for type-specific rules:
Integer types are always sign or zero extended to the full width of the
register (differs from the normal RV ABI in that "unsigned int" is zero
extended in my case, as sign-extension is both a mess and on average
worse than zero-extension even on plain RV64G or similar, as "ADDW" is
not exactly the bottleneck here).

Narrower scalar floating point types are always promoted to register
width in my case:
So, for a 64-bit machine, this would mean passing "float" arguments as
Binary64 (implicitly promoting them to "double");
On a 32-bit machine, "float" would remain as Binary32, but a 16-bit type
("short float") would still be passed in a form where it is promoted to
"float".

This promotion excludes SIMD though, where vector elements remain in
their native form.



I had noted, in general:
I tend to see best efficiency when the number of callee-save and scratch
registers is roughly balanced.

For the X registers, it isn't quite, but "close enough", and to what
extent it was unbalanced is offset by the ISA design needing generally a
few extra scratch registers to avoid getting trapped in a corner (I put
X5..X7 in a special category I call "stomp" registers, or basically a
special category of registers that are ignored by normal register
allocation, and exist solely as spare registers for "when we need extra
registers to make the desired operation actually work").

Example uses of stomp registers being, say, when you want a load/store
displacement, and it doesn't fit into 12 bits, so your LW or similar
needs to break apart into "LUI+ADD+LW" or similar, but then one needs a
register to hold this value.
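
The arithmetic behind that decomposition can be sketched as follows. LW sign-extends its 12-bit displacement, so the upper part has to be adjusted whenever the low 12 bits have their top bit set. `split_disp` and `disp_split` are hypothetical names used only for this illustration.

```c
#include <assert.h>
#include <stdint.h>

struct disp_split {
    int32_t hi20;   /* 20-bit immediate for LUI (placed in bits 31..12) */
    int32_t lo12;   /* signed 12-bit displacement for LW, -2048..2047 */
};

/* Split a 32-bit displacement so that (hi20 << 12) + lo12 == disp. */
static struct disp_split split_disp(int32_t disp)
{
    struct disp_split s;
    /* Low 12 bits, sign-extended the same way the LW immediate is. */
    s.lo12 = (int32_t)(((uint32_t)disp & 0xFFFu) ^ 0x800u) - 0x800;
    /* Whatever remains goes into the LUI immediate. */
    s.hi20 = (int32_t)(((uint32_t)disp - (uint32_t)s.lo12) >> 12);
    assert((int32_t)(((uint32_t)s.hi20 << 12) + (uint32_t)s.lo12) == disp);
    return s;
}
```

The emitted sequence would then be roughly `lui t0, hi20; add t0, t0, base; lw rd, lo12(t0)`, where t0 (X5) is one of the stomp registers holding the intermediate address.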

In this case, there was a partial split between "logical instructions"
(which may include pseudo instructions), and "true instructions", with
the stomp registers not required to be preserved between logical
instructions (typically because they got stomped by decomposing a
pseudo-instruction).


For the F registers, the 12/20 split (of callee save vs scratch) was off
balance, and my compiler kept running out of callee-save F registers in
some cases (particularly when dealing with 128-bit register-pair SIMD),
so I re-balanced it to a 16/16 split.

Note that going further, namely a 20/12 split, would have begun to
negatively impact leaf functions and leaf-blocks, which benefit more
from being able to allocate variables in scratch registers.


The plain LP64 ABI (with all F registers as scratch) is a bad situation
due to it effectively making the F registers useless for holding
floating-point local variables in non-leaf functions.



For non-leaf functions, only dynamically-assigned variables may go into
scratch registers, with anything that was held in a scratch register
needing to be spilled across function calls. This makes scratch registers
preferable for temporaries, where the "lifespan" of a given temporary
rarely extends past a given basic-block, and the compiler can note
short-lived temporaries and skip over needing to save their values
to the stack.

Where, if there does not exist a control path where a variable's value
is used as an input to another expression beyond the current basic
block, its value can be discarded once it goes out of scope (no need to
save to the stack or anything, just "poof" and it is gone). Otherwise,
one would need to spill the value to the stack (if it was a scratch
register) but not if it is a callee-save register. Leaf functions can
leave variables inside scratch registers though, as there are no pesky
function calls to stomp all of them.

If a block ends in a function call (in my compiler, function calls break
up basic-blocks), then only non-argument scratch registers may be used
for dynamic assignment, and temporaries whose lifespan ends as a function
argument are mapped directly to said argument (this reduces register
pressure slightly and also saves on register moves).


But, say:
Run out of callee-save registers, then you need to start spilling stuff;
Run out of scratch registers, then you may need to use callee-save
registers (slightly less cheap, due to prolog/epilog costs).
If registers needed for variables get tight (in a non-leaf function), it
is generally preferable to run out of scratch registers rather than
callee-save registers (since you need either a callee-save register or a
stack spill for anything that may need to survive across a function call).


Generally, seems to make sense to leave around 1/4 of the total
registers for function arguments.

Though, I did recently experiment with using F10..F17 for arguments 9 to
16 (in a variant ABI where all arguments are passed the same regardless
of type, more like in the RV LP64 ABI), but have noted that in the case
of normal RV64 variants, this is worse for code density and performance
than simply spilling these arguments to the stack (or, IOW, using F
registers to pass integer or pointer arguments is net-negative).


So, the existing "X10..X17, everything more goes onto stack" still seems
to be the best match for my use-cases (while one could argue that LP64D
is preferable if one has F registers available, for more subtle reasons,
this is actually worse in my use-case than always using the X registers
regardless of argument type).

For a different ISA mode (where it has access to both the X and F
registers as a single unified register space), increasing the argument
count to 16 does show a useful performance advantage though; just not so
great for RV64G or similar.


...


