On 9/18/2025 3:08 PM, Kevin Cameron wrote:
> The ISA is somewhat irrelevant; the software ecosystem isn't designed to
> handle heterogeneous computing. RISC-V is a classic open-source project
> in that it is just trying to be cheaper than ARM, and ignores a bunch of
> practical issues.
>
> From an ecosystem perspective, routine-level off-load works a lot
> better than ISA extension, because you can deal with it mostly in the
> linker, and don't need an LLVM team to go with your processor team.
>
It depends on what you want:
  Something to run Linux on a PC or server?
    Not an ideal situation at present.
    But, almost good; it just needs a few more things.
  Something for embedded/DSP tasks?
    A lot more promise.
But, for the latter, you mostly want a CPU that is both cheap and fast.
Also, you can just sort of compile the code for whatever core happens to
be in the target device.
Something like RVA23 maybe makes sense for PC or server.
But may well end up too expensive for embedded use-cases.
Ubuntu wanting RVA23 may make sense, as (AFAIK) Ubuntu doesn't really
target, and isn't widely used on, embedded systems.
For a PC or server use case, ISA stability is likely to be more important.
In the embedded space, last I checked, one of the dominant processors
for a long time had been the "ARM Cortex A53".
There are other cores that are faster in terms of single-threaded
performance, but use more energy and cost more. Likely whatever
comes along would need to out-compete the A53 at its own game.
I don't think people chose the A53 for its stability, but mostly because
it is good at the things it is used for.
Well, and RV32IMFC makes a lot of sense as a competitor to the Cortex-M4
or similar (or RV32IMC vs Cortex-M0+), ...
Going smaller, there is the MSP430, but not clear if it makes as much
sense for RISC-V to try to compete with the MSP430 or similar (well,
outside the range where it would also compete with Cortex-M).
All that said, I am not actually opposed to Zibi or anything, but more
that it exists in a territory where it is more debatable if it makes
enough of a difference to be particularly worthwhile.
For a lot of things, I mostly go on a "does the performance delta cross
1%?" heuristic. I am more pessimistic of features which are most likely
to fall well short of 1%, ...
> If RISC-V was 10x faster than ARM/X86 then it would be a different
> story, SiFive refused my help on that years ago, so I'm not surprised by
> the current mess.
>
There are ways to make it faster, but it sometimes seems to me like
things more often go off in random directions that add more complexity
than ideal, or previous poor choices compound to make the situation
worse (say, the proliferation of ".UW" instructions is a dark path; that
".UW" instructions are seen as beneficial should IMHO be taken as a bad
omen).
Granted, my own project is also subject to needless complexity, so I
can't say too much, but alas...
Like, while some of my own efforts seem promising, I couldn't reasonably
expect anyone to use them (or, widespread adoption might actually end up
being net-negative...). So, many things I still consider experimental.
In some areas, RISC-V likely actually needs "less" than what it has already.
I would assume optimizing things mostly for cost/benefit
tradeoffs. So, if a feature is expensive, or fails to cross some minimum
level of benefit, it can be trimmed.
Like, even within RV64GC, there is still a lot of stuff that could be
trimmed or demoted to emulation through traps without much negative
impact on overall performance.
For example, one can limit JAL to X0 and X1 only (faulting if Rd is not
either X0 or X1), and handle pretty much all of the 'A' extension with
traps, and code basically continues to run as it did before.
One can also turn FDIV and FSQRT into traps (FDIV is infrequently used;
FSQRT "hardly ever"; so the cost of the emulation traps tends to fall
below the "relevance threshold").
Then there are some cases that would be preferable to handle as traps
(like FMADD), except that GCC merges "a*b+c" into FMADD or similar often
enough that trapping it would be a bad thing for performance. But, if one
assumes a single-rounded result, then with an affordable FPU design a
trap may still be needed to deal with it.
Though, for contrast, things like 64-bit integer multiply and divide
happen often enough that trapping would be a bad option, but still not
enough to justify making them "actually fast".
So, one ends up with an implementation where, say:
  80 cycle MUL/DIV makes sense (1-bit shift-and-add);
  500 cycles is too much cost (tanks performance);
  10 cycles, while possible, is too expensive to justify
    (e.g., 4-bit Radix-16 logic).
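For reference, the 1-bit shift-and-add case is basically the following
loop done one iteration per clock (a C model of the datapath; ~64
iterations for a 64-bit multiply, which with setup and writeback lands in
that ~80 cycle ballpark):

  #include <stdint.h>

  /* Radix-2 (1 bit per cycle) shift-and-add multiply: 64 iterations for
     a 64x64 -> low-64-bit result. A radix-16 unit would retire 4 bits
     per cycle (~16 iterations) but needs considerably more adder and
     mux logic. */
  uint64_t mul64_shift_add(uint64_t a, uint64_t b)
  {
      uint64_t acc = 0;
      for (int i = 0; i < 64; i++) {
          if (b & 1)       /* conditionally add the shifted multiplicand */
              acc += a;
          a <<= 1;         /* multiplicand shifts left each step */
          b >>= 1;         /* multiplier shifts right each step */
      }
      return acc;
  }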
One can bemoan all the stuff that RISC-V does that wastes excessive
amounts of encoding space, or could have been done better with a
different mechanism, but "it is what it is" sometimes...
As for stuff to add:
  Something like the recent Zilx/Zisx is strongly needed.
  Load/Store Pair and Jumbo-Prefixes in larger cores:
    For small cores: Optional / Absent.
    Needs 64-bit instruction fetch and a 4R2W register file.
  ...
But, granted, this could likely be because a lot of the code I use for
testing is more strongly affected by this.
I was mildly annoyed by the Zilx/Zisx proposal breaking my LDP/SDP
encodings, but I have now resolved to move them over to FLQ/FSQ (since I
have already determined that, barring some external force, I am not
going to implement the Q extension, I can consolidate all of LDP/SDP
under FLQ/FSQ now that the LDU/SDU space got stomped).
Though, I am left to consider the possibility of a "pseudo Q":
  Binary128 values are represented as register pairs;
  If you try to do FADD.Q or FMUL.Q or whatever, it traps;
  The trap handler then deals with it.
This leaves open the option of eventual hardware support, and isn't that
much more expensive than using runtime calls. In my case, it can also be
added with almost no change to the ISA, decoder, or pipeline as they
already exist.
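For, say, the FADD.Q case, the handler's job is mostly gluing register
pairs to a soft-float routine; a sketch (the even/odd pairing convention
and the softfp128_add() helper are assumptions of mine, just to show the
shape of it):

  #include <stdint.h>

  /* Binary128 value reconstructed from an even/odd FP register pair;
     assumed convention: low 64 bits in the even register, high 64 bits
     in the odd one. */
  typedef struct { uint64_t lo, hi; } f128_t;

  /* Assumed soft-float Binary128 add (could be a libgcc __addtf3 style
     routine or a handwritten one). */
  f128_t softfp128_add(f128_t a, f128_t b);

  static f128_t get_f128(const uint64_t *fregs, int r)
  {
      f128_t v = { fregs[r & ~1], fregs[r | 1] };
      return v;
  }

  static void put_f128(uint64_t *fregs, int r, f128_t v)
  {
      fregs[r & ~1] = v.lo;
      fregs[r | 1]  = v.hi;
  }

  /* Called from the illegal-instruction handler when it decodes an
     FADD.Q (OP-FP opcode, funct7=0000011). */
  void emulate_fadd_q(uint64_t *fregs, uint32_t ir)
  {
      int rd  = (ir >>  7) & 0x1F;
      int rs1 = (ir >> 15) & 0x1F;
      int rs2 = (ir >> 20) & 0x1F;
      put_f128(fregs, rd,
               softfp128_add(get_f128(fregs, rs1), get_f128(fregs, rs2)));
  }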
I still have some reservations about pre/post increment, but it does
have some merit at least for code density, so OK.
Though, it might seem paradoxical to strive for performance while also
endorsing handling various parts of the ISA with a bunch of emulation
traps (and burning many hundreds of cycles every time one of these
happens).
One could also debate, say, whether unaligned load/store could also be
handled by traps (to reduce the cost of the L1 cache). I had assumed
keeping unaligned load/store fast, as these have a few major "killer
apps": Huffman and LZ77 compression. In effect, not having unaligned
memory access has a strong negative effect on the ability to do
semi-fast data compression.
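To make that concrete: a typical fast bitstream reader, as used for
Huffman decoding, refills its bit buffer with an unaligned 64-bit load.
A generic sketch (nothing implementation-specific; the memcpy is the
unaligned load, and on hardware without fast unaligned access it
degenerates into a byte-at-a-time gather or a trap on nearly every
refill):

  #include <stdint.h>
  #include <string.h>

  typedef struct {
      const uint8_t *src;   /* compressed input (little-endian target) */
      uint64_t bitbuf;      /* bits not yet consumed, LSB first */
      int      bitcount;    /* number of valid bits in bitbuf */
  } bitreader_t;

  /* Branchless refill: afterwards, bitbuf holds at least 56 valid bits.
     The memcpy becomes a single unaligned 64-bit load on targets where
     that is fast; src rarely lands on an 8-byte boundary in practice. */
  static inline void br_refill(bitreader_t *br)
  {
      uint64_t chunk;
      memcpy(&chunk, br->src, 8);
      br->bitbuf   |= chunk << br->bitcount;
      br->src      += (63 - br->bitcount) >> 3;
      br->bitcount |= 56;
  }

  /* Peek/consume n bits (n < 32), e.g. to index a Huffman lookup table. */
  static inline uint32_t br_peek(bitreader_t *br, int n)
  {
      return (uint32_t)br->bitbuf & ((1u << n) - 1);
  }

  static inline void br_consume(bitreader_t *br, int n)
  {
      br->bitbuf   >>= n;
      br->bitcount  -= n;
  }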
But, perfection is impossible...