Proposal for a consumer baseline profile between RV64GC and RVA23


sven boertjens

Dec 1, 2025, 1:19:27 PM
to RISC-V ISA Dev
Hi all,

I would like to propose and get feedback on the concept of a RV64 consumer-class baseline profile, positioned between RV64GC and RVA23.

Currently, the RVA profiles, RVA23 being the latest, are the only real standards for general-purpose processors. A profile more expansive than RV64GC is necessary for this class, as RV64GC is too barebones to run modern operating systems and user applications. Allowing CPU vendors to implement whatever extensions they feel are necessary for consumers would lead to fragmentation in the software space, which is a valid concern. But a standard profile that is too inclusive is also troublesome, and that is my other concern: the RVA23 profile is very, very inclusive.

A baseline profile is also the industry norm. Major ISAs like x86 and ARM both stabilized around realistic baselines rather than extensive standards, because too much freedom and too much rigidity both proved troublesome.

Thus, I believe RISC-V is in need of a baseline profile.


What would such a baseline look like?

A baseline is meant to be the minimum set of instructions, or extensions in this case, that hardware must support. For the consumer class, it should contain the essentials for OSes and applications to function properly: an OS needs to be able to run in supervisor mode, and applications sometimes require FP, for example. A baseline should not require convenience extensions, though it can absolutely recommend them. The goal is a modern minimum, not something completely bare, so commonly used features like bitmanip are included as well.

An example of an existing baseline is x86-64-v2. It contains the modern necessities for software without forcing you to own hardware that supports every x86 SIMD instruction out there.

RVA23 is far stricter than this, and mandates a very large collection of extensions, including significant ones like Vector and Hypervisor support. A baseline would focus on extensions that realistically require support in the general consumer environment, instead of covering every extension that some consumers might need but that is unnecessary for the larger consumer group.

Note that this baseline is not meant to replace RVA. This would be an attempt to set a clear baseline that software can expect and hardware can target.

The specific extensions that this baseline would require or recommend are up for debate.


What about software fragmentation?

A proper baseline should prevent fragmentation, not encourage it. Fragmentation in software tends to be the result of vendor-specific extensions and the lack of a stable baseline that covers the necessities.

Most software runs perfectly well when built against a baseline. Only a minority of software benefits from certain extra features. That software tends to come from performance-sensitive domains, which already build locally or use runtime dispatch rather than distributing generalized binaries, so fragmentation barely shows up there. If this costs some performance for users with a narrower extension set or older hardware, that's expected.
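Runtime dispatch, as mentioned above, mostly boils down to a function pointer chosen at startup. A minimal C sketch; `cpu_has_vector()` is a hypothetical stand-in for a real platform probe (hardcoded to 0 here), and the "vector" path simply forwards to the scalar loop for illustration:

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in probe: real code would query a platform facility
 * (e.g. hwcaps on Linux). Hardcoded here for illustration. */
static int cpu_has_vector(void) { return 0; }

static long sum_scalar(const int *v, size_t n) {
    long s = 0;
    for (size_t i = 0; i < n; i++) s += v[i];
    return s;
}

/* A real vector-unit implementation would live here; for this
 * sketch it just forwards to the scalar loop. */
static long sum_vector(const int *v, size_t n) { return sum_scalar(v, n); }

/* Dispatch once, lazily, through a function pointer. */
static long (*sum_impl)(const int *, size_t) = NULL;

long sum(const int *v, size_t n) {
    if (!sum_impl)
        sum_impl = cpu_has_vector() ? sum_vector : sum_scalar;
    return sum_impl(v, n);
}
```

The same shape works whether the probe checks AVX levels on x86 or vector support on RISC-V; only the probe and the fast path change.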

A standard that's too extensive might actually cause fragmentation rather than prevent it. If the standard is unrealistic and many consumers own non-compliant hardware, software must either drop support for a lot of users, or every project must define its own set of requirements. That scenario is even worse than fragmentation caused by hardware diversity.


Hardware diversity

Hardware diversity is only an issue if it causes bad fragmentation, which, as clarified above, is not the case with a proper baseline. Hardware diversity is important in the consumer field; devices can have wildly different needs. Mobile devices focus on energy efficiency, and laptops, desktops, and servers all have cases where power consumption is a concern as well. Forcing compliance with all RVA23 extensions reduces flexibility and potentially conflicts with some designs.


Hardware needs

We mustn't forget that hardware also has needs. Software needs are more visible to most, but that doesn't justify exclusive attention.

When we force hardware to accommodate every nitpick through an extensive standard, hardware complexity escalates significantly. Even if modern CPUs are inherently complex, that does not justify adding more, especially complex features like hypervisor mechanics; if anything, it is a reason to add less. Software evolves on hardware, yes, but hardware itself can't innovate if it's strangled by maintaining compliance alone. Hardware innovation matters: it directly affects performance and power consumption, and brings us new inventions, much as SIMD or superscalar execution once were.


The current ecosystem

The RISC-V ecosystem is still very much growing. Unlike x86, with its exclusivity to Intel/AMD, or ARM, with its large licensing fees, RISC-V finally gives us freedom. This means that startups, academia, and even hobbyists work on RISC-V hardware. A realistic baseline allows everyone, even ambitious hobbyists, to actually make a consumer-compliant design, whereas requiring compliance with RVA23 makes it much harder, nearly impossible even, for anyone without an established team.


Original intentions

A core feature of RISC-V is its extensions: hardware can choose which to add and which to leave out. If we create extensive standards that hardware must comply with, we undermine that principle. At that point, a lot of extensions aren't really extensions anymore; they're just modular requirements.

---

I'd love to hear what others think of this idea, and whether it's missing or overlooking anything.

Regards,
Sven

Greg Favor

Dec 1, 2025, 1:34:24 PM
to sven boertjens, RISC-V ISA Dev
What about RVB23?

--
You received this message because you are subscribed to the Google Groups "RISC-V ISA Dev" group.
To unsubscribe from this group and stop receiving emails from it, send an email to isa-dev+u...@groups.riscv.org.
To view this discussion visit https://groups.google.com/a/groups.riscv.org/d/msgid/isa-dev/3f61a4f4-b477-4ed4-9508-58314d948351n%40groups.riscv.org.

L Peter Deutsch

Dec 1, 2025, 1:51:31 PM
to Greg Favor, boertje...@gmail.com, isa...@groups.riscv.org
> What about RVB23?

From the Introduction to the ratified RVB23 Profiles doc:

"Unlike the RVA profiles, it is explicitly a non-goal of RVB profiles to
provide a single standard ISA interface supporting a wide variety of binary
kernel and binary application software distributions."

I looked through the contents of RVB23U64, and while it seems like a
reasonable candidate for the goals that Sven discussed (which I generally
support), I read this disclaimer as incompatible with those goals.

--

L Peter Deutsch :: Aladdin Enterprises :: Healdsburg, CA & Burnaby, BC

sven boertjens

Dec 1, 2025, 3:00:33 PM
to RISC-V ISA Dev, L Peter Deutsch, boertje...@gmail.com, isa...@groups.riscv.org, Greg Favor
I had somehow missed RVB23 when writing my post, which is why it isn't mentioned; apologies for that.

RVB23 does look similar to what a baseline profile could be, but what I'm aiming at is something that explicitly does try to be a standard.

Greg Favor

Dec 1, 2025, 3:17:32 PM
to sven boertjens, RISC-V ISA Dev, L Peter Deutsch
On Mon, Dec 1, 2025, 12:00 PM sven boertjens <boertje...@gmail.com> wrote:
I had somehow missed RVB23 when writing my post, which is why it isn't mentioned; apologies for that.

RVB23 does look similar to what a baseline profile could be, but what I'm aiming at is something that explicitly does try to be a standard.

RVB23 _is_ a ratified RVI standard.  And it establishes a mandatory baseline that is in between RV64GC and RVA23.  So can you clarify what is inappropriate in RVB23 relative to what you're looking for?  And what do you mean by a "standard"?

Greg

sven boertjens

Dec 1, 2025, 3:53:40 PM
to RISC-V ISA Dev, Greg Favor, RISC-V ISA Dev, L Peter Deutsch, sven boertjens
By a “standard,” I meant a profile defining the mandatory ISA features that software can expect to be present on all compliant hardware.

RVB23 isn't intended for that purpose; it explicitly says so, as Peter quoted. While it also mentions it can be used as a foundation for ecosystems to make their own interfaces, that implies that it itself isn't a universal baseline for software to rely on.

Since I'm not familiar with the detailed requirements of every software domain, I can't say whether RVB23 contains exactly the right set of extensions to serve as such a baseline.

Ved Shanbhogue

Dec 1, 2025, 3:58:27 PM
to sven boertjens, RISC-V ISA Dev
sven boertjens wrote:
>A baseline profile is also the industry standard. Major ISAs like x86 and
>ARM both stabilized towards realistic baselines, not extensive standards,
>because too much freedom or rigidity both proved themselves troublesome.

RISC-V defines the current profiles for user-mode (not listing
extensions exhaustively):
- RVA20U64 - I, M, A, F, D, C, CSRs, basic counters
- RVA23U64 - Adds V, bit-manip, Zicond, Zawrs, pointer masking

x86 provides the following four:
- x86-64-v1 - 64-bit integer, CMOV, CMPXCHG8B, SSE2
- x86-64-v2 - adds CMPXCHG16B, SSE4.2
- x86-64-v3 - adds AVX2, BMI
- x86-64-v4 - adds AVX512

Industry-wise, we are seeing many operating systems, such as RHEL10, move to x86-64-v3 as the baseline, and v2 seems to be prevalent. All x86 profiles include SSE or AVX instructions, and so some form of SIMD/vector support.

RVA20U64 could be considered as a baseline, but it lacks vector and would be at a disadvantage even compared to x86-64-v1, which supports SSE2. Without vectors, most modern uses (multimedia, cryptography, compression, etc.) may provide a poor user experience. RVA23U64 looks like the right profile to consider as a baseline, equivalent to x86-64-v2.

regards
ved

Greg Favor

Dec 1, 2025, 4:05:34 PM
to sven boertjens, RISC-V ISA Dev, L Peter Deutsch
On Mon, Dec 1, 2025, 12:53 PM sven boertjens <boertje...@gmail.com> wrote:
By a “standard,” I meant a profile defining the mandatory ISA features that software can expect to be present on all compliant hardware.

RVB23 meets the preceding criterion.


RVB23 isn't intended for that purpose, it explicitly states so as Peter quoted.

The intent of RVB23 is to set a lower mandatory standard than RVA23 - which provides the flexibility to be used in a wider variety of systems.  And it is a non-goal to provide a single ISA standard as would be needed by binary software distributions.

While it also mentions it can be used as a foundation for ecosystems to make their own interfaces, that implies that it itself isn't a universal baseline for software to rely on.

A "universal baseline" implies having a relatively low baseline.  But you want something higher than RV64GC, and do you want a baseline higher or lower than RVB23?

Greg 

sven boertjens

Dec 1, 2025, 4:07:16 PM
to RISC-V ISA Dev, Ved Shanbhogue, RISC-V ISA Dev, sven boertjens
I'm aware that RVA23U64 is the current profile, but it just seemed so extensive to me. Even if many extensions are necessary in modern software, including Vector, I thought that the hypervisor extension was pretty overkill to require.

sven boertjens

Dec 1, 2025, 4:19:11 PM
to RISC-V ISA Dev, Greg Favor, RISC-V ISA Dev, L Peter Deutsch, sven boertjens
Thank you for your clarification.

I'm unsure where it would be positioned relative to RVB23; it could overlap, I'd say. I hadn't thought about the specific set of instructions to require yet, nor could I accurately do so. My goal was more to highlight the need for a baseline in the first place.

BGB

Dec 1, 2025, 5:19:36 PM
to isa...@groups.riscv.org
SIMD is needed, but IMHO 'V' isn't ideal, as it would be unreasonably
expensive to support for cost-optimized in-order implementations.

And, if a program is compiled to assume V, it would be at a severe
performance disadvantage if run on a CPU that lacks V in hardware and,
say, provides it via trap-and-emulate or similar.

Better IMHO is to not assume 'V' as required.


Preferable though would be an intermediate SIMD option that wouldn't be
horribly expensive to support on a cost-optimized in-order CPU.

Say, for example, compare with ARM land, where for a long time many
parts of the space have been dominated by the Cortex A53 and A55,
both of which are in-order cores.

Are they the biggest, fanciest, or fastest ARM processors? No.

If one excludes this sort of thing as a possibility (by essentially
disallowing such a segment by making the ISA too expensive),
one is effectively shooting oneself in the foot.

Well, more so in a world where Moore's Law is already on the edge of
giving out entirely. It may make sense to target the most efficient use
of transistor budget, rather than assume the same trajectory that
Intel had taken (namely, a small number of CPU cores that are absurdly
expensive in terms of power and area).



Contrast:
RV64GC or RVA20U64 is a little more sensible.


I would personally like it if some cheaper SIMD existed (probably
something that shares the F registers, preferably with FLEN=64 for
consistency with the existing F/D extensions).

With preferred configurations likely as:
  2x Binary32
  4x Binary16 (Half Precision)
  4x Int16 (ideally wrapping/modulo only)

Then, say, if 128-bit vectors are needed (for 4x Binary32), they can be
implemented as pairs of 64-bit FPRs.
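The layout described above (two Binary32 lanes in one 64-bit FPR, with wider vectors as register pairs) can be modeled in plain C. This is only an illustration of the lane arithmetic under that assumption, not any particular implementation:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Pack two Binary32 lanes into one 64-bit "FPR" (lane 0 low). */
static uint64_t pack2f(float lo, float hi) {
    uint32_t a, b;
    memcpy(&a, &lo, 4);
    memcpy(&b, &hi, 4);
    return (uint64_t)a | ((uint64_t)b << 32);
}

/* Extract lane i (0 or 1) back out as a float. */
static float lane(uint64_t r, int i) {
    uint32_t w = (uint32_t)(r >> (32 * i));
    float f;
    memcpy(&f, &w, 4);
    return f;
}

/* 2x Binary32 add on 64-bit registers: each lane is an independent
 * scalar add, as the proposed SIMD unit would do in hardware. A
 * 128-bit 4x operation would just apply this to a register pair. */
static uint64_t fadd_2s(uint64_t x, uint64_t y) {
    return pack2f(lane(x, 0) + lane(y, 0), lane(x, 1) + lane(y, 1));
}
```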

To optimize for cost, some options exist, say:
  One additional Binary32 unit,
    along with a more powerful Binary32/64 scalar unit,
    and two additional Binary16-only units.
Or:
  HW natively only does scalar or 2x ops,
  but internally double-pumps as needed.
  Say, 2x Binary32 double-pumps a 32/64-bit FPU,
  and 4x Binary16 double-pumps a 2x Binary16 unit.


Such a "cheap SIMD" should not try to address all of V though, as
otherwise then it loses out by being no longer cheap, so:
Bigger Registers: No, better to stay 64-bit for this.
More registers: No, expanding the register file further is a big cost;
Larger vectors: No, 2 and 4 element only.
Going larger than 2 and 4 quickly goes into diminishing returns.

Technically, one can add a lot of this without needing to add much of
anything in terms of new instructions to RV64G; just sort of
tweak/reinterpret the behavior of the existing F/D encodings.


Such a SIMD need not preclude or exclude 'V' though; I suspect that
using both could actually have a synergistic effect, while leaving 'V'
more for big/expensive CPUs that can actually afford it.



Otherwise, the main things I see as still lacking from RV64 are:
  Indexed Load (or Indexed Load/Store):
    Indexed Load (without Store) is cheaper,
    and gives most of the performance advantage.
  Load/Store Pair.
  A way to encode larger Immediate and Displacement fields:
    When a value doesn't fit in Imm12/Disp12, falling back to a
    3-op sequence is a weak area; this could be addressed effectively
    with a prefix encoding or similar, where the prefix expands the
    immediate or displacement to 33 bits.

At least on my implementation, this particular combination of features
can gain a roughly 40% speedup (over plain RV64GC) for some programs
(though, for others, the gain is less).
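To make the immediate-size point concrete: when a constant or displacement doesn't fit in 12 bits, RISC-V code falls back to a LUI+ADDI (or LUI+ADD+load) sequence, and the split needs a carry adjustment because ADDI sign-extends its field. A small C sketch of that standard split:

```c
#include <assert.h>
#include <stdint.h>

/* Split a 32-bit constant into the LUI field (upper 20 bits) and the
 * ADDI field (signed lower 12 bits), mirroring the %hi/%lo relocation
 * split. Because ADDI sign-extends its 12-bit field, the upper part
 * must absorb a carry whenever bit 11 of the low part is set. */
static void split_imm32(int32_t imm, int32_t *hi20, int32_t *lo12) {
    int32_t lo = imm & 0xFFF;
    if (lo >= 0x800)          /* ADDI would sign-extend: borrow up */
        lo -= 0x1000;
    *lo12 = lo;
    *hi20 = (int32_t)(((uint32_t)imm - (uint32_t)lo) >> 12);
}
```

The invariant is `(hi20 << 12) + lo12 == imm`; a single wider-immediate encoding (such as the 33-bit prefix idea above) would make this two-instruction dance unnecessary for most constants.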

Well, I would also like it if ADDWU and SUBWU were revived:
Zero-extending "unsigned int" and similar is less messy and better for
performance than sign-extending it, even on plain RV64G (where a
zero-extended unsigned add is a 3-instruction ADD+SLLI+SRLI sequence).
While add and subtract are slightly more expensive, nearly everything
else is cheaper with zero extension.

Here, ADDWU/SUBWU also make the add/subtract cheaper (and while
ADD+ADD.UW or SUB+ADD.UW works, it is less satisfactory).
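The requested semantics can be stated precisely in C: ADDWU is a 32-bit add whose result is zero-extended to 64 bits, versus the 3-instruction sequence plain RV64G needs. A small sketch contrasting the two:

```c
#include <assert.h>
#include <stdint.h>

/* Semantics of the (dropped) ADDWU: 32-bit add, result zero-extended
 * to 64 bits -- what C "unsigned int" arithmetic needs when the value
 * lives in a 64-bit register. */
static uint64_t addwu(uint64_t a, uint64_t b) {
    return (uint64_t)(uint32_t)(a + b);
}

/* What plain RV64G has to do instead: ADD, then shift left and right
 * by 32 to clear the upper bits. */
static uint64_t addwu_3op(uint64_t a, uint64_t b) {
    uint64_t t = a + b;   /* ADD        */
    t <<= 32;             /* SLLI t, 32 */
    t >>= 32;             /* SRLI t, 32 */
    return t;
}
```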



Also better IMHO to not assume a hypervisor in hardware.

It actually makes more sense to assume that the HW does TLB refill in
trap handlers or similar, and then also fakes things like nested page
tables in a trap handler (either with no HW page walker, or possibly a
limited/partial page walker that is treated as optional).

The partial page-walker option would be, say, the MMU hardware only
managing the last level of the page table, and then trapping if:
  the TLB misses;
  the requested page isn't within the cache of last-level page-table pages.
This could in theory give much of the performance advantage of a full
hardware page walker, without the flexibility disadvantages (or the
hardware complexity of needing a mechanism to perform multiple
subsequent memory accesses to walk an N-level page table).
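The trap-handler refill scheme can be sketched as a toy model: a direct-mapped software TLB where a miss invokes a "handler" (here just a function) that walks a flat table. All names and sizes are illustrative; a real design would walk a multi-level tree in the handler:

```c
#include <assert.h>
#include <stdint.h>

#define TLB_SIZE   16
#define PAGE_SHIFT 12

/* Toy direct-mapped TLB entry: virtual page -> physical page. */
typedef struct { uint64_t vpn; uint64_t ppn; int valid; } tlb_entry;

static tlb_entry tlb[TLB_SIZE];
static uint64_t page_table[64];  /* flat VPN -> PPN map (toy)     */
static int refill_traps;         /* count of simulated refill traps */

static uint64_t translate(uint64_t vaddr) {
    uint64_t vpn = vaddr >> PAGE_SHIFT;
    tlb_entry *e = &tlb[vpn % TLB_SIZE];
    if (!e->valid || e->vpn != vpn) {  /* TLB miss: "trap" to handler */
        refill_traps++;
        e->vpn = vpn;
        e->ppn = page_table[vpn];      /* software page-table walk    */
        e->valid = 1;
    }
    return (e->ppn << PAGE_SHIFT) | (vaddr & ((1u << PAGE_SHIFT) - 1));
}
```

Because the walk lives in software, the same handler could consult a guest's nested tables before installing the entry, which is the virtualization trick described above.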


In effect, the virtualized OS can then be run in usermode, with trap
handlers dealing with it whenever it steps outside of the limits of
normal usermode.

Though, probably for most use cases, virtualization is overkill.


Doing the 'A' extension in hardware is also unreasonably expensive, but
it has the merit that 'A' instructions are very rarely used, so it is
more practical to handle the 'A' extension using trap handlers.

Granted, for 'A' via traps, this still leaves it up to the handler to
deal with things like inter-processor memory coherence (say, for
example, if one assumes CPU cores that natively only do weak coherence
and require explicit cache flushing for things like memory ordering, etc).

Granted, not exactly the highest performance option, but still possible
to do it this way without totally wrecking things.


Granted, in such an implementation, it is likely a lot of these would
need to be handled in the firmware (possibly with the OS proper
pretending it is running on hardware that actually does this stuff in
hardware).


Could still be passable to "pretend" that such hardware has hypervisor
support though, so it isn't on my "bad list", unlike the 'V' extension
which is likely going to be "either it is present in hardware, or any
software that tries to use it is likely going to get horrible
performance..."


Well, and particularly if compilers try to auto-vectorize.

Like, say, enabling AVX when running on a PC with Zen+ (the compiler
then assumes the 256-bit YMM registers), where the program runs, but
performance is significantly worse than if one does not enable AVX.

Or, on a "Core 2 Duo" or "Xeon E5410" where trying to use AVX causes the
program to immediately crash due to it not supporting AVX entirely (and
"Trap and Emulate" being "less of a thing" in x86 land).

Or, neither Windows nor Linux wanting to make it look like a late 2000s
era "Core 2 Duo" or similar has AVX support (even if, they could do so
in theory, if they wanted).


The situation is different for RISC-V, but it is better not to shoot
oneself in the foot, which ideally means assuming a userland ISA that
looks more like RV64GC or similar.


...




Ved Shanbhogue

Dec 1, 2025, 5:43:18 PM
to sven boertjens, RISC-V ISA Dev
sven wrote:

>I'm aware that RVA23U64 is the current profile, but it just seemed so
>extensive to me. Even if many extensions are necessary in modern software,
>including Vector, I thought that the hypervisor extension was pretty
>overkill to require.

I think most software benefits from vectors directly or indirectly through
commonly used libraries—for example, memcpy, strcmp, and others.

I left out the supervisor extensions because I understood the discussion to
be about the ISA baseline for applications, analogous to something like
x86-64-v2. The presence of the hypervisor extension in hardware does not
imply that a hypervisor must be installed on the platform. But if a
hypervisor is needed, then hardware that is RVA23S64-compliant can be
relied upon to provide the required capabilities. In that sense, the
supervisor extensions are not in the same category as the application
baseline.

Hypervisors are not particularly uncommon or niche. For instance, many
Android phones run a hypervisor such as protected KVM (pKVM) [1] or Knox
HDM [2].

regards
ved

[1] https://android-developers.googleblog.com/2023/12/virtual-machines-as-core-android-primitive.html
[2] https://www.samsungknox.com/en/blog/knox-hdm

Krste Asanovic

Dec 2, 2025, 7:33:23 AM
to sven boertjens, RISC-V ISA Dev, Greg Favor, L Peter Deutsch
The RVA line was expressly designed to support a baseline for binary software ecosystems for high-end offerings to compete with other architectures, and it was a non-goal to support low-effort implementations.
While I’m very sympathetic to the needs of academics and hobbyists, RISC-V needs high-end designs to be successful, and watered-down specifications will limit the growth of the whole RISC-V ecosystem.

It is unrealistic to expect there to be a large and flourishing binary software ecosystem for a less-demanding ISA profile.
Software investments (including open-source) follow where the sockets are, and if the profile is not competitive, those won’t be RISC-V sockets.

RVA will continue to expand in future revisions, but earlier versions do remain valid and supported.
It hasn’t been mentioned in this thread yet, but the earlier RVA22 profile did not mandate vectors or hypervisor, and could fit your needs.

Krste



Krste Asanovic

Dec 2, 2025, 7:33:26 AM
to BGB, isa...@groups.riscv.org
Just to point out that V is not expensive in area compared to SIMD extensions with comparable peak throughput.
There are very small full V open-source implementations out there.

It is a bit more complex than the simplest SIMD extensions, but the complexity is in control logic which is a small part of total area.
This complexity helps improve efficiency, and supports portability.

The various Zve* vector extensions remove some larger datatypes to reduce cost for embedded  applications.
If you’re at the point where the vector reg file itself is dominating cost, then Zfinx+V can provide lower area by removing the separate scalar floating-point register file.

Krste



BGB

Dec 2, 2025, 7:34:32 AM
to Krste Asanovic, isa...@groups.riscv.org
On 12/1/2025 11:18 PM, Krste Asanovic wrote:
> Just to point out that V is not expensive in area compared to SIMD
> extensions with comparable peak throughput.
> There are very small full V open-source implementations out there.
>

As noted, there are reasons for a lot of the seemingly arbitrary
restrictions I would impose on the SIMD extension. Namely, if the goal
is to be cheap, you don't want to allow too many things that would make
it "not cheap".


> It is a bit more complex than the simplest SIMD extensions, but the
> complexity is in control logic which is a small part of total area.
> This complexity helps improve efficiency, and supports portability.
>

Possibly. But if the SIMD unit is natively 4x Binary32 or 2x/4x Binary64
and supports full IEEE semantics in hardware, it is also going to be
expensive...


It is more practical to cut some corners here, and assume that full IEEE
is limited to scalar operations, which can be implemented via trap
handlers for cases the hardware can't handle natively.



Even as such, a Binary64 FPU is still pretty expensive (with a lot of
corner cutting).

The SIMD unit is expensive in my case, but generally on par with the
scalar FPU; partly because it operates on lower precision.


The original (cheaper but slower) implementation of SIMD ops was via
multi-pumping the FPU (the FPU basically pipelines each element
internally, producing the final result once each element makes it to the
other side). The result of this was basically SIMD ops that took 10
clock cycles to run.

It is a little faster to do 2 or 4 elements in parallel though. Going
much higher, cost goes up, and it doesn't really seem worth it.


But 2- and 4-element SIMD is where the benefits are the greatest.


> The various Zve* vector extensions remove some larger datatypes to
> reduce cost for embedded  applications.
> If you’re at the point where the vector reg file itself is dominating
> cost, then Zfinx+V can provide lower area by removing the separate
> scalar floating-point register file.
>

I would assume that, if one wants compatibility with existing binary
code, Zfinx+V is going to pose a bigger problem than RV64GC with no V,
or than wonky SIMD glued onto the F/D extensions.


While wonky SIMD glued onto F/D is wonky, at least it doesn't actively
break things. This is unlike Zfinx/Zdinx, where code compiled to assume
use of F/D will not work correctly.


Though, seemingly V requires 32x 128-bit for the V registers
(VLEN>=128), which is bigger than 32x 64-bit.

Also, 128-bit register ports would be expensive (while my CPU uses some
128-bit operations, they are still implemented using 64-bit register ports).

Say, for example, if we assume a CPU with a 4R2W 64-bit register file or
similar (though, admittedly, my main configuration is typically up to
3-wide with a 6R3W register file; which can function as 3R1W for 128-bit
operations).


As-is, my CPU core internally uses a 64x 64-bit unified register file
(holding both X and F registers). But, I guess using a unified register
file is far from universal here.

Increasing the size of the register file to 128x, or increasing register
port width, would both come with a fairly steep cost increase. Most of
the 128-bit operations are handled as modified 64-bit operations internally.

So, in a way, a 128-bit 4x Binary32 operation is actually treated as two
64-bit 2x Binary32 operations in the pipeline. In effect, 2x Binary32
can be initiated from either of the first two lanes, and two such
operations may co-execute if compatible.

Note that scalar FPU operations may not co-execute as there is only one
scalar FPU in this case. Also 4x Binary16 ops may not co-execute, as
they effectively use both "halves" of the SIMD unit (so, the SIMD unit
deals with Binary16 much as-if it had seen dual-issued 2x Binary32 ops).



It is possible to glue SIMD onto the existing FPU in a way that is
mostly invisible to existing code, but that allows code to run some
test cases to detect whether or not the SIMD extensions are present.

Like, for example, a program can feed a 2x Binary32 SIMD vector through
FADD.S or similar:
  Does it give the expected result for SIMD?
    If yes: assume SIMD works.
    Else: no SIMD (say, it gives a NaN-boxed or otherwise incorrect result).

In the case of code using F/D as before, it works just as it did before.

Here, would also assume that SIMD only exists natively for 16 and 32 bit
elements.
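The detection trick leans on RISC-V's NaN-boxing convention: a scalar Binary32 held in an FLEN=64 register must have all ones in its upper 32 bits, so a register holding two live lanes is distinguishable from a legally boxed scalar. A C sketch of the boxing and the check:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* NaN-box a Binary32 value the way FLW does on an FLEN=64 machine:
 * the payload sits in the low 32 bits, the upper 32 bits are all ones. */
static uint64_t nan_box(float f) {
    uint32_t w;
    memcpy(&w, &f, 4);
    return 0xFFFFFFFF00000000ull | w;
}

/* The SIMD-vs-scalar test from the post: a register whose upper half
 * is not the all-ones pattern cannot be a legally boxed scalar, so it
 * would be treated as holding two live Binary32 lanes. */
static int is_nan_boxed(uint64_t r) {
    return (r >> 32) == 0xFFFFFFFFu;
}
```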


Since, as noted, Binary64 occupies the entire register, by definition
there is no SIMD on 64-bit elements.

Can still sort of fake it though:
  FMUL.D F10, F12, F14, RNE
  FMUL.D F11, F13, F15, RNE
A person can use their imagination for this one
(it will not co-execute, as the FPU can't do this).

This is also a possible option for 128-bit 4x Binary32 cases:
FMUL.S F10, F12, F14, RNE //(X,Y)
FMUL.S F11, F13, F15, RNE //(Z,W)
(May be understood as potentially co-executing in this case).

Though, one option is to optionally also support explicit 4x SIMD.

In my implementation, this can be encoded by one of the 2 reserved
rounding modes (only defined here for even register pairs). But, on a
cheaper implementation it could make sense to disallow this (only
allowing for 2x Binary32 here).

Say, rounding modes:
000=RNE 001=RTZ 010=RDN 011=RUP
100=RMM 101=QRTZ 110=QRNE 111=DYN

Where, say:
RNE/RTZ: SIMD if not NaN boxed;
RDN/RUP/RMM: Scalar Only
QRTZ/QRNE: Only valid if 128-bit SIMD exists
DYN: Scalar Only

If using a Scalar-Only option, FPU may assume that the operation is
scalar. If not NaN boxed: Trap or similar.


As-is, QRTZ/QRNE would indicate 128-bit SIMD, but are only valid for
".S", else trap. In this case, it is decoded as a single instruction
that effectively splits in half in the pipeline (and uses two lanes,
much like in the co-execute case). But doing it this way is more
compact: one instruction rather than two, and with a stronger guarantee
they will co-execute, whereas with two ops it is possible both halves
end up being run on different clock cycles, ...


Note that SIMD cases would not update FPU flags or similar.
So, say, FADD.S:
  Is the input NaN boxed?
    Yes: Does scalar things (in IEEE mode).
      May update flags, and trap on denormals.
    No: Does SIMD things.
      No flag updates, DAZ/FTZ, ...

If not in IEEE Mode:
  FADD.S/FSUB.S/FMUL.S:
    Always behave as SIMD for RNE/RTZ.
    RDN/RUP/RMM/DYN: Behave as Scalar;
      route to the main FPU rather than the SIMD unit.


Can note that for scalar operations:
FLW and FLH will always load the value as NaN boxed.

So, unless one does something like:
FLD F11, 128(X10)
FADD.S F12, F11, F10, RNE //(Would be understood as SIMD)
It is unlikely that this would be stumbled on by chance.

Where, as noted, for typical operations:
  Both items NaN-boxed: Scalar
  One item NaN-boxed, other zero: Scalar
  Neither item NaN-boxed: SIMD
  One item NaN-boxed, other non-zero: Trap

Where, Zero extended items may be encountered sometimes, say:
LUI X6, 0x3F800
FMV.D.X F10, X6

So, one can't be overly strict about the NaN-boxing rules in these
sorts of cases. Though, if a NaN-boxed value is mixed with some other
non-zero, non-NaN value, something is sus and trapping seems justified.




The existence of 4x Binary16 would then depend on having both this SIMD
extension and Zfh. If both exist, can probably assume that Binary16 SIMD
exists, and is probably 4 wide.

Doing 4-wide for Binary16 is easier to justify as Binary16 lanes are
cheaper (so one can justify 2 dedicated Binary16 lanes more easily than
they could justify 2 additional Binary32 lanes).



In my existing implementation, there is a little weirdness with the
semantics, but existing code works without issue.

Bigger tradeoff is the internal limitations of the FPU:
  Only DAZ+FTZ semantics in hardware;
    in IEEE mode, denormals need to trap.
  Reduced precision for Binary64;
    ops like FMUL may need to trap to give IEEE results in some cases.
    Besides denormals, FMUL may also need to trap on non-zero LOBs:
      a full-width multiplier costs significantly more;
      most of the time, one or both sets of LOBs are zero;
      when the LOBs are 0, the result will match the full IEEE result.
  FADD/FSUB/FMUL:
    Native, but with some limitations.
    In IEEE mode:
      Inputs are denormal: Trap
      Exponent out of range: Trap
      Rounding carry-propagation fail: Trap
      Non-zero LOB combination for FMUL: Trap
      ...
  FMADD/FMSUB/FNMADD/FNMSUB:
    Most likely Trap (unless FMA is supported);
    the hardware doesn't actually do FMA in this case (reasons).
    Actual FMA would be nice, but doesn't exactly come cheap.
    Though, in some cases:
      It is possible to route FMADD.S through the Binary64 FPU;
      can at least fake single-rounded FMA for Binary32,
      with a latency of around 12 clock cycles.
    No FMA for SIMD: Trap.
  FDIV/FSQRT:
    Trap.
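The "non-zero LOBs" trap condition can be approximated in C: with a truncated multiplier that drops low-order partial products, the product is exact whenever at least one operand's low mantissa bits are zero. The 26-bit cut point here is an assumption for illustration, not the actual hardware's:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical check matching the trap condition described above: with
 * a truncated Binary64 multiplier that discards low-order partial
 * products, the product matches the full IEEE result when the low bits
 * of at least one operand's 52-bit mantissa field are zero. LOB_BITS
 * is an assumed cut point for illustration. */
#define LOB_BITS 26

static int fmul_would_trap(double a, double b) {
    uint64_t x, y;
    memcpy(&x, &a, 8);
    memcpy(&y, &b, 8);
    uint64_t mask = (1ull << LOB_BITS) - 1;
    /* Trap only when BOTH operands have non-zero low-order bits,
     * i.e. when some discarded partial product could be non-zero. */
    return (x & mask) != 0 && (y & mask) != 0;
}
```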


Mostly works, though GCC needs "-mno-fdiv -ffp-contract=off" to avoid
stepping into bad performance. That is a problem either way though.
Otherwise, it is possible to give acceptable performance despite the
reliance on trap handlers in some scenarios.

But, alas:
double x, y, z, w;
...
w=x*y+z; //... GCC may try to use FMADD.D here ...
This can hurt bad (and is more common than FDIV).
FMADD.S is less bad, but still slower than using non-fused ops.


As can be noted, for FMUL.D, at present it only computes the high-order
results internally (results that would fall below the ULP or similar are
discarded). By looking at the input values, it is possible to detect
cases where these discarded regions would be non-zero (which, if aiming
for IEEE-compliant results, would require using a trap to deal with).

In the non-IEEE mode, it will give an inexact result (namely, one where
all low-order products were assumed to be zero).

One other limitation (which affects both FADD and FMUL) is that the
rounding can only carry-propagate for a certain number of bits (in IEEE
mode, it faults if it could not round). This is an issue with the
latency of the carry propagation in this case.




This is mostly because a fully IEEE-compliant FPU is also expensive, so
I ended up mostly using trap handlers in an attempt to get the standard
semantics (faster is possible, but the FPU in this case is DAZ+FTZ and
does not give 0.5 ULP rounding).

These problems would still exist regardless of whether or not the FPU
does SIMD. Though, in this case, SIMD comes with additional
restrictions, and trying to do something that isn't allowed in HW
necessarily needs to be handled by trapping.


In some cases though, trapping isn't quite as terrible of an option as
it may at first seem.

A related irony is that I ended up implementing some of the Q
instructions for Binary128 on FPR pairs, in this case pretty much
exclusively via trap handlers. The relative cost of trap handling is
low enough relative to Binary128 emulation that it isn't as absurd as
it may seem on the surface.


Also, trapping on an FMUL.Q is like 1/8 the code footprint of a function
call. There is no actual intent to implement Binary128 ops in HW, mostly
because doing Binary128 in hardware is far too expensive; but it still
makes sense to use it for "long double", which is already accepted as
the "I want precision but don't care about speed" option.

While some could argue for "Double-Double", this has no advantage over
Binary128 in the absence of hardware support for Single-Rounded Binary64
FMA (and is not an attractive option when emulating the FMA would be
slower than emulating the corresponding Binary128 operations).

So, alas, it is cheaper to pretend that we have FMUL.Q, FADD.Q, and
similar than to try to do it using FMADD.D or FMSUB.D or similar (fewer
emulation traps are required, and FMUL.Q is also faster than FMADD.D in
this case).


It still operates on register pairs (unlike the actual Q extension), as
in this case I found register pairs to be the preferable option. It also
seems like pretty much no one is using the actual Q extension.

This also partly overlaps with my use of FLQ/FSQ for Load/Store Pair
(partly because my original use of LDU/SDU got stomped). And I had also
ended up using FSGNJ.Q to express a 128-bit MOV, so it made sense to use
the same pattern for the emulated FADD.Q and similar.


Though, an implementation could potentially trap on these as well; but
implementing these other cases via trap handlers would have a more
significant adverse impact on performance (if these features are used).

One would effectively need a mechanism to disable these other features
to have a proper implementation of the Q extension (but then I would
again be left without Load/Store Pair or MV-Pair, which would be a
bigger loss than not having the Q extension).

...

But, yeah, probably all kinda sucks.

I guess it is a pretty open question as to how much a lot of this could
be generalized to other implementations though.

Granted, I guess maybe one could also debate whether a lot of this is
"actually all that cheap".

sven boertjens

unread,
Dec 2, 2025, 10:25:20 AM (yesterday) Dec 2
to RISC-V ISA Dev, Krste Asanovic, RISC-V ISA Dev, Greg Favor, L Peter Deutsch, sven boertjens
That clarifies the intentions behind the RVA profiles a lot, thank you.

I'm still unsure of how exactly these profiles are intended for software, though. Are software ecosystems expected to maintain support for earlier profiles, since they remain valid and supported? Or is the expectation that they will converge on a specific profile?
If it's up to software ecosystems to independently decide on the supported profile, would that not risk fragmentation in support from different ecosystems?

I also noticed that there isn't a profile that requires vectors but no hypervisor. So if a profile is decided, it must either pull in the hypervisor if V is wanted, or V must be seen as entirely optional.
I'm not entirely aware of how important a hypervisor is for all software ecosystems. Mobile devices make use of it already, but not as much in desktops as far as I've read about.
If an ecosystem were to require "RVA23 without hypervisor" because of this, that might complicate the function of profiles as it ignores their purpose as a solid baseline.

And since RVA23 appears quite heavy and inclusive to me, even if that's purposeful, could that conflict with high-end designs seeking efficiency rather than throughput?
RVA23 does mandate some rather significant extensions, such as the hypervisor extension, that such efficiency-focused designs would rather avoid.

For startups, could an ecosystem where a heavier profile like RVA23 is expected make entry into the high-end industry more difficult? If only existing high-end vendors are capable of keeping up with the expected profiles, that might limit competition somewhat.

These are the reasons why I originally thought a clearer baseline could be beneficial. But if the RVA line already answers these concerns in its own way, that would suffice too.
Will there at least be "official" expectations once things play out some more, or is it entirely up to the ecosystem what to expect with the ISA itself staying out of it?

Tom Gall

unread,
Dec 2, 2025, 11:57:57 AM (yesterday) Dec 2
to sven boertjens, RISC-V ISA Dev, Krste Asanovic, Greg Favor, L Peter Deutsch
On Tue, Dec 2, 2025 at 9:25 AM sven boertjens <boertje...@gmail.com> wrote:
That clarifies the intentions behind the RVA profiles a lot, thank you.

I'm still unsure of how exactly these profiles are intended for software, though. Are software ecosystems expected to maintain support for earlier profiles, since they remain valid and supported? Or is the expectation that they will converge on a specific profile?
If it's up to software ecosystems to independently decide on the supported profile, would that not risk fragmentation in support from different ecosystems?

The Linux distros can be pointed to as an example. Red Hat, SUSE, and Ubuntu will utilize RVA23. Put another way, the binaries they build will target hardware that meets the RVA23 profile. Previously, a number of distros have taken the tack of tying support to specific hardware rather than to specific profiles.

Distros where end users build from source, such as Yocto, Arch, Gentoo, etc., are mixed as of yet. Yocto, for instance, will have RVA23 support. These distros have a lot more freedom to support different profiles or mixtures of extensions.

Zephyr, an open source RTOS, identifies extensions it supports. 
 
I also noticed that there isn't a profile that requires vectors but no hypervisor. So if a profile is decided, it must either pull in the hypervisor if V is wanted, or V must be seen as entirely optional.
I'm not entirely aware of how important a hypervisor is for all software ecosystems. Mobile devices make use of it already, but not as much in desktops as far as I've read about.
If an ecosystem were to require "RVA23 without hypervisor" because of this, that might complicate the function of profiles as it ignores their purpose as a solid baseline.

Android, Chromebooks for instance need a hypervisor. It's important to understand just what sorts of consumer devices you have in mind.    
 
And since RVA23 appears quite heavy and inclusive to me, even if that's purposeful, could that conflict with high-end designs seeking efficiency rather than throughput?
RVA23 does mandate some rather significant extensions, such as the hypervisor extension, that such efficiency-focused designs would rather avoid.

For startups, could an ecosystem where a heavier profile like RVA23 is expected make entry into the high-end industry more difficult? If only existing high-end vendors are capable of keeping up with the expected profiles, that might limit competition somewhat.

I feel RVB23 or RVA22 as Krste mentioned might address what you're thinking. 

The good thing about the RISC-V ecosystem is there is a path for consumer focused companies in this space to join RVI as members, work together to identify gaps through our structure of HCs, SIGs, Task Groups etc. It can ultimately lead to some combination of best practices, new extensions, a new profile, etc, or companies might find they're able to work within the bounds of what already exists or is covered by current efforts.  Regardless, involvement helps ensure there is a solid foundation on which to build products. It takes effort, yet it is well worth it.   
 


--

Tom Gall (he / him)

VP Technology, RISC-V International

t...@riscv.org

Greg Favor

unread,
Dec 2, 2025, 12:07:57 PM (yesterday) Dec 2
to sven boertjens, RISC-V ISA Dev, Krste Asanovic, L Peter Deutsch
There will never be one profile for all software ecosystems.  For example, RVA is too heavy for many embedded Linux and similar systems, while RVB is too light for ecosystems with substantial binary software distribution.  And there will be an RVM profile attuned to the needs of microcontroller-based systems.

At the same time, only a small number of profiles are expected to exist (3-4?).

Greg

Earl Killian

unread,
Dec 2, 2025, 2:03:03 PM (yesterday) Dec 2
to sven boertjens, RISC-V ISA Dev
I would go back to: what does Sven mean by a consumer core? Is it meant to be a desktop/notebook processor, as opposed to a datacenter processor? Or is it meant to be some IoT thing? To me, the term “consumer core” is not well defined.

Sven seems to object to vector and hypervisor. I would argue that a desktop/notebook processor, if that is what is meant, at least requires vector. RISC-V has explicitly done RVV instead of SIMD, and SIMD is pretty much required in the desktop/notebook space. Also, it is important for crypto. Moreover, in modern process nodes, a quad-core or something like that is probably pad limited, and so RVV is not adding to the cost. Also, the size of RVV is dwarfed by a reasonable L3, which also is probably required in desktop processors. I understand that some RV cores are targeted at old process nodes, but they aren’t going to be competitive on the desktop.

If consumer core means IoT, then I think RVB probably is the appropriate profile.

Anyway, can Sven define what is proposed?

-Earl

sven boertjens

unread,
Dec 2, 2025, 2:59:01 PM (yesterday) Dec 2
to RISC-V ISA Dev, Earl Killian, RISC-V ISA Dev, sven boertjens
I mainly meant all machines running general-purpose software, like desktops/laptops, if that clarifies it better?

I agree that SIMD is pretty much required in this space. The baseline I described would contain the instruction set necessary for modern software, which you could argue that SIMD is part of.

BGB

unread,
Dec 2, 2025, 5:40:58 PM (yesterday) Dec 2
to isa...@groups.riscv.org
On 12/2/2025 1:59 PM, sven boertjens wrote:
> I mainly meant all machines running general-purpose software, like
> desktops/laptops, if that clarifies it better?
>

Most CPUs you would want to put in a desktop or laptop could probably
justify RVA23...

Typically, these types of CPUs prioritize single-threaded performance
above all else, so it makes sense to have whichever combination of
features can best maximize single-threaded performance.

Usually, the thinking in the PC/laptop space is that the only reason one
needs multiple cores is so that background tasks and similar are less
likely to disrupt the performance of the user-facing task, which is
usually assumed to be primarily single-threaded (or a small number of
threads if multi-threaded). In this case, core count is modest or small.



Contrast to, say, the cellphone space, which has been largely dominated
by one of:
ARM Cortex A53, A55, or A510 (newer).

Which are generally all in-order designs.
Some higher-end OoO cores exist, but usually in more expensive
"flagship" phones, and not everyone is going and buying the most
expensive cellphones.

But, like, if one goes and buys a $150 phone or similar, it generally
(at least for a long time) comes with one of the in-order CPU designs.

A lot of the higher end phones will come with 1 or 2 fast cores
(typically OoO), but these are often absent on the lower end models.

Often, in this case, cost and power efficiency are the dominant concerns
(or, CPU needs to be as cheap as possible, with the least power use when
in-operation; but CPU is expected to spend much of its time being idle).

Typical configurations seem to be:
4 little
1 big, 3 little
8 little
2 big, 6 little
...

RVA23 might not necessarily make sense for the cellphone space; as V
wouldn't necessarily align with minimizing cost and energy use.



Then one has things like supercomputers, where single threaded
performance is less of a concern than how many processors one can cram
into the same area and power budget (and whichever combination of
features gives the highest overall performance for the target area).

RVA23 could make sense here, assuming V can in-fact give the highest
throughput here.

Though, a lot does depend on what dominates the workload.
Like, if it is floating-point dominated, it makes sense.

Like, V would probably be a good option for running neural nets.


If your goal is instead to run something like raytracers, V doesn't make
as much sense. Here, you would want cheaper SIMD and to try to
effectively maximize core count and memory bandwidth (where the per-core
performance is a much lower concern; just the ability to fire up as many
cores as possible and have them trace rays through the scene).

Or, if the goal is effectively something like a "build farm" or similar
(compiling lots of code), then something like V (and SIMD in general) is
no longer particularly relevant (likely, goal here would be to have
moderately fast CPU cores, and to maximize memory bandwidth as the
dominant concern).


...


So, say, if trying to pick optimal RV configurations:
Laptop: RVA23 (most likely, prioritize single-threaded perf)
Cellphone: RV64GC/RVA20 (disfavors V for cost)
Neural Nets: RVA23 (V favored strongly)
Raytracers: RV64G (preferably with cheap SIMD)
Compilers: RV64IM (prioritizing RAM bandwidth and big caches)



> I agree that SIMD is pretty much required in this space. The baseline I
> described would contain the instruction set necessary for modern
> software, which you could argue that SIMD is part of.
>

Yeah, I think it is pretty well understood that we need SIMD.


More a question of which SIMD strategy is best for area cost, energy
use, etc.

I guess one possibility could be (probably not allowed, but alas):
V, but where only V0..V15 exist, and are aliased with F pairs.
So:
V0=F1:F0, V1=F3:F2, ..., V15=F31:F30

This combination allows both V and F/D to exist, and avoids the
implementation footprint of having twice the register-file area.

Downside: it wouldn't exactly be transparent for code that expects
V0..V31 to exist and not overlap with F0..F31.


Can note that typically best results with SIMD are for 16 and 32 bit
elements, and 2 or 4 elements per vector.

So:
2x16: 32-bit, uncommon
2x32: 64-bit, common
4x16: 64-bit, less common
4x32: 128-bit, common

It is much less common to see cases where 8x vectors make sense, and in
many cases the added complexity of dealing with an 8x vector would eat
any gains (many new complexities are introduced if trying to work with
8x or larger vectors).

For 2x and 4x vectors, one can also get good use of "PACK" style
instructions for shuffles, but this doesn't scale well to larger vectors.


I can make the case that it isn't really obvious to me if 128-bit SIMD
registers are "worth it" in general case use (vs paired 64-bit). If
efficient pairing is possible, handling 4x32 and 2x64 using pairs seems
to make the most sense (and also the use of paired registers can have
lower instruction counts as well by reducing how often one needs to use
things like PACK or SHUF type instructions).

Though, this makes more sense for code which has sporadic use of SIMD
thrown in at random places, rather than code which is dominated by
processing large arrays of floating-point data (which seems more like
V's domain).


Though, in my uses (in my own ISA), it is very common to need 4x Int16
SIMD, which isn't currently addressed by my own attempt at SIMD on RV
(and is more the 'P' extension's domain).

Typically, it makes more sense to use Int16 SIMD for stuff like working
with RGB/RGBA pixel data (but, less often Int8, even when the
input/output is something like RGBA32 or RGB555; typically one ends up
needing 16 bits or so for the intermediate calculations).

Going directly from RGBA32 or RGB555 to 4x Binary16 is possible in
theory (and is more the pattern implied by GLSL), but going directly
between these formats doesn't make as much sense (more often these paths
are RGB555 <-> 4x Int16 <-> 4x Binary16 or similar).

To some extent, I have often ended up using Binary16 SIMD along audio
processing paths and similar (works pretty well here).



But, a lot of this is stuff that exists in my ISA that hasn't been
mapped over to RV encodings (well, I have my XG3 thing, gluing RV64G
and my ISA together in a shared encoding space, but... this isn't the
same as having these instructions available as native RISC-V encodings).

Well, and I don't really expect many people would want to adopt my XG3
thing (particularly as its encodings are mutually incompatible with the
16-bit 'compressed' instructions; so switching between RV64GC and XG3
effectively requires an internal mode change; with RV64G as the common
subset between RV64GC and XG3).

Well, and it adds the architectural wonk of CPU mode being implicitly
encoded into PC and the link register, ...

Though, in this case, both sets of ISAs used register pairs for 128-bit
SIMD.


...

kr...@sifive.com

unread,
2:01 AM (21 hours ago) 2:01 AM
to BGB, isa...@groups.riscv.org

>>>>> On Tue, 2 Dec 2025 16:38:18 -0600, BGB <cr8...@gmail.com> said:

| On 12/2/2025 1:59 PM, sven boertjens wrote:
|| I mainly meant all machines running general-purpose software, like
|| desktops/laptops, if that clarifies it better?
||

| Most CPUs you would want to put in a desktop or laptop could probably
| justify RVA23...
[...]
| Contrast to say, the cellphone space, which had been largely dominated
| by one of:
| ARM Cortex A53, A55, or A510 (newer).

| Which are generally all in-order designs.

I'll just note that you can make a competitive RVA23 core that is
smaller than these ARM cores.

| Some higher-end OoO cores exist, but usually in more expensive
| "flagship" phones, and not everyone is going and buying the most
| expensive cellphones.
[...]

Many smartphones and tables, have considerably higher performance
cores than these.

Krste

Krste Asanovic

unread,
2:02 AM (21 hours ago) 2:02 AM
to BGB, isa...@groups.riscv.org
^tablets



>
> Krste

Al Martin

unread,
2:54 AM (20 hours ago) 2:54 AM
to RISC-V ISA Dev

I just wanted to note that profiles (RVA*, RVB*, etc.) are shortcuts that make life easier for software.  A binary compiled for RVA23 should run and behave the same (+/-performance) on any implementation that complies with RVA23.  This does not prevent one from explicitly stating what has been implemented.  The compiler just needs to be told which extensions are available.  If you implement more than is required for RVA23, you are guaranteed that a binary compiled for RVA23 will work with that implementation.

I don't think a separate "baseline profile" is needed, if it isn't going to be widely adopted.  You can still take RV64GC or RV32GC as your baseline, and then add whatever extensions, or your own custom extensions, or even do something that isn't exactly RISC-V.  You just have to own more of the software stack than a standard profile like RVA23.

2c.
Al Martin (@Akeana)

K. York

unread,
11:37 AM (11 hours ago) 11:37 AM
to RISC-V ISA Dev, sven boertjens, Krste Asanovic, RISC-V ISA Dev, Greg Favor, L Peter Deutsch, sven boertjens
Windows 11 has a hard requirement on either a TPM or a hypervisor, preferably both.

The last decade and a half has demonstrated that you really, really want vector stuff on consumer hardware, and limiting it to only data center class machines causes extreme pain.

There is a very good reason none of the profiles saw significant software uptake for general purpose operating systems until both H and V were included. Don't try to yank them out.

~Kane

Sent with Shortwave
