On 12/1/2025 11:18 PM, Krste Asanovic wrote:
> Just to point out that V is not expensive in area compared to SIMD
> extensions with comparable peak throughput.
> There are very small full V open-source implementations out there.
>
As noted, there are reasons behind a lot of the seemingly arbitrary
restrictions I would assume imposing on a SIMD extension. Namely, if the
goal is to be cheap, you don't want to allow too many things that would
make it "not cheap".
> It is a bit more complex than the simplest SIMD extensions, but the
> complexity is in control logic which is a small part of total area.
> This complexity helps improve efficiency, and supports portability.
>
Possibly. If the SIMD unit is natively 4x Binary32 or 2x/4x Binary64 and
supports full IEEE semantics in hardware, this is also going to be
expensive...
It is more practical to cut some corners here, and assume that full IEEE
is limited to scalar operations, which can be implemented via trap
handlers for the cases the hardware can't handle natively.
Even so, a Binary64 FPU is still pretty expensive (even with a lot of
corner cutting).
The SIMD unit is expensive in my case, but generally on par with the
scalar FPU; partly because it operates at lower precision.
The original (cheaper but slower) implementation of SIMD ops was via
multi-pumping the FPU (the FPU basically pipelines each element
internally, producing the final result once each element makes it out
the other side). The result was SIMD ops that took around 10 clock
cycles to run.
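As a toy model of that timing (my own illustration in C, not the actual
pipeline; the numbers are made up):

    /* Elements enter the scalar FP pipeline back to back; the op retires
       once the last element drains out the far end. */
    static int multipump_latency(int elems, int pipe_depth)
    {
        return pipe_depth + (elems - 1);  /* e.g., ~10 cycles for 4 elements */
    }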
It is a little faster to do 2 or 4 elements in parallel though. Going
much higher, the cost goes up, and it doesn't really seem worth it; 2
and 4 element SIMD is where the benefits are the greatest.
> The various Zve* vector extensions remove some larger datatypes to
> reduce cost for embedded applications.
> If you’re at the point where the vector reg file itself is dominating
> cost, then Zfinx+V can provide lower area by removing the separate
> scalar floating-point register file.
>
I would assume that, if one wants compatibility with existing binary
code, Zfinx+V is going to pose a bigger problem than RV64GC with no V,
or than wonky SIMD glued onto the F/D extensions.
While wonky SIMD glued onto F/D is wonky, at least it doesn't actively
break things. This is unlike Zfinx/Zdinx, where code compiled to assume
use of F/D will not work correctly.
Though, seemingly V requires 32x 128-bit for the V registers
(VLEN>=128), which is bigger than 32x 64-bit.
Also, 128-bit register ports would be expensive (while my CPU uses some
128-bit operations, they are still implemented using 64-bit register ports).
Say, for example, we assume a CPU with a 4R2W 64-bit register file or
similar (though, admittedly, my main configuration is typically up to
3-wide with a 6R3W register file, which can function as 3R1W for 128-bit
operations).
As-is, my CPU core internally uses a 64x 64-bit unified register file
(holding both X and F registers). But, I guess using a unified register
file is far from universal here.
Increasing the size of the register file to 128x, or increasing register
port width, would both come with a fairly steep cost increase. Most of
the 128-bit operations are handled as modified 64-bit operations internally.
So, in a way, a 128-bit 4x Binary32 operation is actually treated as two
64-bit 2x Binary32 operations in the pipeline. In effect, 2x Binary32
can be initiated from either of the first two lanes, and two such
operations may co-execute if compatible.
Note that scalar FPU operations may not co-execute, as there is only one
scalar FPU in this case. Also, 4x Binary16 ops may not co-execute, as
they effectively use both "halves" of the SIMD unit (so the SIMD unit
deals with Binary16 much as if it had seen dual-issued 2x Binary32 ops).
It is possible to glue SIMD onto the existing FPU in a way that is
mostly invisible to existing code, but which does allow code to run a
test case to detect whether or not the SIMD extension is present.
Like, for example, a program can feed a 2x Binary32 SIMD vector through
FADD.S or similar:
  Does it give the expected result for SIMD?
    If yes: assume SIMD works;
    Else (say, it gives a NaN-boxed or otherwise incorrect result):
      no SIMD.
In the case of code using F/D as before, it works just as it did before.
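A minimal sketch of such a probe, in C with inline asm, assuming RV64
with F/D, a GCC-style toolchain, and the semantics described in this
post (have_fpr_simd and its constants are my own; note the explicit RNE,
since under the rounding-mode scheme described further below, DYN stays
scalar):

    #include <stdint.h>

    /* Feed a non-NaN-boxed 2x Binary32 pair through FADD.S. On a plain
       F/D implementation both inputs decode as canonical NaN, so the
       result is a NaN-boxed NaN rather than the packed sum. */
    static int have_fpr_simd(void)
    {
        uint64_t a = 0x3F8000003F800000ull;  /* packed (1.0f, 1.0f) */
        uint64_t b = 0x4000000040000000ull;  /* packed (2.0f, 2.0f) */
        uint64_t r;
        __asm__ volatile(
            "fmv.d.x  fa0, %1\n\t"
            "fmv.d.x  fa1, %2\n\t"
            "fadd.s   fa0, fa0, fa1, rne\n\t"
            "fmv.x.d  %0, fa0"
            : "=r"(r) : "r"(a), "r"(b) : "fa0", "fa1");
        return r == 0x4040000040400000ull;   /* packed (3.0f, 3.0f)? */
    }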
Here, I would also assume that SIMD only exists natively for 16 and 32
bit elements. As noted, since Binary64 occupies the entire register, by
definition there is no SIMD on 64-bit elements.
Can still sort of fake it though:
  FMUL.D F10, F12, F14, RNE
  FMUL.D F11, F13, F15, RNE
One can use their imagination for this one.
(These will not co-execute, as the FPU can't do this.)
This is also a possible option for 128-bit 4x Binary32 cases:
FMUL.S F10, F12, F14, RNE //(X,Y)
FMUL.S F11, F13, F15, RNE //(Z,W)
(May be understood as potentially co-executing in this case).
Though, one option is to also support explicit 4x SIMD.
In my implementation, this can be encoded via one of the 2 reserved
rounding modes (only defined here for even register pairs). But, on a
cheaper implementation it could make sense to disallow this (only
allowing 2x Binary32 here).
Say, rounding modes:
  000=RNE  001=RTZ  010=RDN   011=RUP
  100=RMM  101=QRTZ 110=QRNE  111=DYN
Where, say:
  RNE/RTZ:     SIMD if not NaN-boxed;
  RDN/RUP/RMM: Scalar only;
  QRTZ/QRNE:   Only valid if 128-bit SIMD exists;
  DYN:         Scalar only.
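Or, as a rough C sketch of the routing (names here are my own; QRTZ/QRNE
sit on the two rm encodings that F/D leaves reserved):

    enum frm { RNE=0, RTZ=1, RDN=2, RUP=3, RMM=4, QRTZ=5, QRNE=6, DYN=7 };

    static int rm_allows_simd(int rm) { return rm == RNE  || rm == RTZ;  }
    static int rm_is_quad(int rm)     { return rm == QRTZ || rm == QRNE; }
    static int rm_scalar_only(int rm) { return !rm_allows_simd(rm) && !rm_is_quad(rm); }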
If using a scalar-only rounding mode, the FPU may assume that the
operation is scalar (if an input is not NaN-boxed: trap or similar).
As-is, QRTZ/QRNE would indicate 128-bit SIMD, but are only valid for
".S", else trap. In this case, it is decoded as a single instruction
that effectively splits in half in the pipeline (and uses two lanes,
much like in the co-execute case). But, doing it this way is more
compact (one instruction rather than two, and with a stronger guarantee
they will co-execute; whereas with two ops it is possible both halves
could end up being run on different clock cycles, ...).
Note that SIMD cases would not update FPU flags or similar.
So, say, FADD.S:
  Is it NaN-boxed?
    Yes: Does scalar things (in IEEE mode);
      May update flags, and trap on denormals.
    No: Does SIMD things;
      No flag updates, DAZ/FTZ, ...
If not in IEEE mode:
  FADD.S/FSUB.S/FMUL.S:
    Always behave as SIMD for RNE/RTZ.
    RDN/RUP/RMM/DYN: Behave as scalar;
      Route to the main FPU rather than the SIMD unit.
Can note that for scalar operations, FLW and FLH will always load the
value as NaN-boxed. So, unless one does something like:
  FLD    F11, 128(X10)
  FADD.S F12, F11, F10, RNE  //(would be understood as SIMD)
it is unlikely that this would be stumbled on by chance.
Where, as noted, for typical operations:
  Both items NaN-boxed:               Scalar
  One item NaN-boxed, other zero:     Scalar
  Neither item NaN-boxed:             SIMD
  One item NaN-boxed, other non-zero: Trap
Where zero-extended items may be encountered sometimes, say:
  LUI     X6, 0x3F800
  FMV.D.X F10, X6
So, one can't be overly strict about the NaN-boxing rules in these sorts
of cases.
Though, if a NaN boxed value is mixed with some other non-zero and
non-NaN value, something is sus and trapping seems justified.
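Roughly, in C (nan_boxed/classify are invented names; the boxing test
itself is the standard upper-32-bits-all-ones pattern):

    #include <stdint.h>

    enum route { ROUTE_SCALAR, ROUTE_SIMD, ROUTE_TRAP };

    static int nan_boxed(uint64_t v) { return (v >> 32) == 0xFFFFFFFFu; }

    /* Operand classification per the table above. */
    static enum route classify(uint64_t a, uint64_t b)
    {
        int ba = nan_boxed(a), bb = nan_boxed(b);
        if ( ba &&  bb) return ROUTE_SCALAR;  /* both NaN-boxed */
        if (!ba && !bb) return ROUTE_SIMD;    /* neither NaN-boxed */
        /* One boxed: tolerate a zero-extended other operand (as with the
           LUI+FMV.D.X case above), otherwise something is sus: trap. */
        uint64_t other = ba ? b : a;
        return ((other >> 32) == 0) ? ROUTE_SCALAR : ROUTE_TRAP;
    }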
The existence of 4x Binary16 would then depend on having both this SIMD
extension and Zfh. If both exist, one can probably assume that Binary16
SIMD exists and is 4-wide.
Doing 4-wide for Binary16 is easier to justify, as Binary16 lanes are
cheaper (so one can justify 2 dedicated Binary16 lanes more easily than
2 additional Binary32 lanes).
In my existing implementation, there is a little weirdness with the
semantics, but existing code works without issue.
The bigger tradeoff is the internal limitations of the FPU:
  Only DAZ+FTZ semantics in hardware;
    IEEE mode: denormals need to trap.
  Reduced precision for Binary64;
    Ops like FMUL may need to trap to give IEEE results in some cases.
    Besides denormals, FMUL may also need to trap on non-zero low-order
    bits (LOBs) in the inputs:
      A full-width multiplier costs significantly more;
      Most of the time, one or both sets of LOBs are zero;
      When the LOBs are 0, the result will match the full IEEE result.
FADD/FSUB/FMUL:
  Native, but with some limitations.
  In IEEE mode:
    Inputs are denormal: Trap
    Exponent out of range: Trap
    Rounding carry propagation fail: Trap
    Non-zero LOB combination for FMUL: Trap
    ...
FMADD/FMSUB/FNMADD/FNMSUB:
  Most likely trap (unless FMA is supported);
    The hardware doesn't actually do FMA in this case (reasons).
    Actual FMA would be nice, but doesn't exactly come cheap.
  Though, in some cases:
    It is possible to route FMADD.S through the Binary64 FPU;
    Can at least fake single-rounded FMA for Binary32;
    ... with a latency of around 12 clock cycles ...
  No FMA for SIMD: Trap.
FDIV/FSQRT:
  Trap.
Mostly works, though with GCC one needs "-mno-fdiv -ffp-contract=off" to
avoid stepping into bad performance... This is a problem either way
though. Otherwise, it is possible to get acceptable performance despite
the reliance on trap handlers in some scenarios.
But, alas:
double x, y, z, w;
...
w=x*y+z; //... GCC may try to use FMADD.D here ...
This can hurt badly (and is more common than FDIV).
FMADD.S is less bad, but still slower than using non-fused ops.
As can be noted, for FMUL.D, at present the hardware only computes the
high-order results internally (results that would fall below the ULP or
similar are discarded). By looking at the input values, it is possible
to detect the cases where these discarded regions would be non-zero
(which, if aiming for IEEE-compliant results, requires using a trap to
deal with them). In the non-IEEE mode, it will give an inexact result
(namely, one where all low-order products were assumed to be zero).
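A sketch of what such an input check might look like in C (my reading of
the mechanism; FMUL_LOW_BITS is a placeholder, not the actual cutoff):

    #include <stdint.h>
    #include <string.h>

    #define FMUL_LOW_BITS 26  /* assumed width of the discarded region */

    /* If either operand's low mantissa bits are all zero, the low-order
       partial products vanish and the truncated product matches the full
       IEEE result; otherwise fall back to a trap. */
    static int fmul_d_needs_trap(double a, double b)
    {
        uint64_t ua, ub, mask = (1ull << FMUL_LOW_BITS) - 1;
        memcpy(&ua, &a, sizeof ua);
        memcpy(&ub, &b, sizeof ub);
        return ((ua & mask) != 0) && ((ub & mask) != 0);
    }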
One other limitation (that affects both FADD and FMUL) is that the
rounding can only carry-propagate for a certain number of bits (in IEEE
mode, it faults if it could not round). This is an issue with the
latency of the carry propagation in this case.
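Again as a sketch (carry_width being a placeholder for however far the
narrow incrementer can ripple in time):

    #include <stdint.h>

    /* A round-up increments the mantissa at the LSB; if the bits above
       the round point are a solid run of ones, the carry would ripple
       further than the incrementer allows, hence the trap. */
    static int round_carry_needs_trap(uint64_t mant, int carry_width)
    {
        uint64_t run = ((uint64_t)1 << carry_width) - 1;
        return (mant & run) == run;
    }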
This is mostly because a fully IEEE-compliant FPU is also expensive, so
I ended up mostly using trap handlers in an attempt to get the standard
semantics (faster is possible, but the FPU in this case is DAZ+FTZ and
does not give 0.5 ULP rounding).
These problems would still exist regardless of whether or not the FPU
does SIMD. Though, in this case, SIMD comes with additional
restrictions, and trying to do something that isn't allowed in HW
necessarily needs to be handled by trapping.
In some cases though, trapping isn't quite as terrible of an option as
it may at first seem.
Related irony being that I ended up implementing some of the Q
instructions for Binary128 on FPR pairs, in this case pretty much
exclusively via trap handlers. The relative cost of trap handling is low
enough, relative to Binary128 emulation, that it isn't as absurd as it
may seem on the surface.
Also, trapping on an FMUL.Q is like 1/8 the code footprint of a function
call. There is no actual intent to implement Binary128 ops in HW, mostly
because doing Binary128 in hardware is far too expensive; but it still
makes sense to use it for "long double", which is already accepted as
the "I want precision but don't care about speed" option.
While some could argue for "Double-Double", this has no advantage over
Binary128 in the absence of hardware support for Single-Rounded Binary64
FMA (and is not an attractive option when emulating the FMA would be
slower than emulating the corresponding Binary128 operations).
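For reference, the usual building block double-double leans on (the
standard TwoProdFMA idiom; only valid if fma() is a true single-rounded
FMA):

    #include <math.h>

    /* hi+lo == a*b exactly, provided fma() is single-rounded. Without
       hardware FMA this needs a slow splitting sequence instead (or,
       here, a trap to the emulator). */
    static void two_prod(double a, double b, double *hi, double *lo)
    {
        *hi = a * b;
        *lo = fma(a, b, -*hi);  /* exact residual: a*b - hi */
    }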
So, alas, cheaper to pretend that we have FMUL.Q and FADD.Q and similar
than to try to do it using FMADD.D or FMSUB.D or similar (fewer
emulation traps required, and FMUL.Q is also faster than FMADD.D in this
case).
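Illustratively (assuming a compiler that maps "long double" to Binary128
and emits the paired-register form):

    /* The multiply below becomes a single (trapping) FMUL.Q rather than
       a call into a soft-float library. */
    long double mulq(long double a, long double b)
    {
        return a * b;
    }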
It still operates on register pairs (unlike the actual Q extension), as
in this case I found register pairs to be the preferable option. Also,
it seems like pretty much no one is using the actual Q extension.
This also partly overlaps with me using FLQ/FSQ for Load/Store-Pair
(partly because my original use of LDU/SDU got stomped). And, since I
had also ended up using FSGNJ.Q to express a 128-bit MOV, it kinda made
sense to use the same pattern for the emulated FADD.Q and similar.
Though, potentially an implementation could trap on these as well; but
implementing these other cases via trap handlers would have a more
significant adverse impact on performance (if using these features). One
would effectively need a mechanism to disable these other features to
have a proper implementation of the Q extension (but then I would again
be left without Load/Store-Pair or MV-Pair, which would be a bigger loss
than not having the Q extension).
...
But, yeah, probably all kinda sucks.
I guess it is a pretty open question as to how much a lot of this could
be generalized to other implementations though.
Granted, I guess maybe one could also debate whether a lot of this is
"actually all that cheap".