On 9/18/2025 3:08 PM, Kevin Cameron wrote:
> The ISA is somewhat irrelevant; the software ecosystem isn't designed to
> handle heterogeneous computing. RISC-V is a classic open-source project
> in that it is just trying to be cheaper than ARM, and ignores a bunch of
> practical issues.
>
> From an ecosystem perspective, routine-level off-load works a lot
> better than ISA extension, because you can deal with it mostly in the
> linker, and don't need an LLVM team to go with your processor team.
>
It depends on what you want:
  Something to run Linux on a PC or server?
    Not an ideal situation at present.
    But, almost good; it just needs a few more things.
  Something for embedded/DSP tasks?
    A lot more promise.
But, for the latter, you mostly want a CPU that is both cheap and fast.
Also, you can just sort of compile the code for whatever core happens to
be in the target device.
Something like RVA23 maybe makes sense for PC or server.
But may well end up too expensive for embedded use-cases.
Ubuntu wanting RVA23 may make sense, as (AFAIK) Ubuntu doesn't really
target, and isn't widely used on, embedded systems.
For a PC or server use case, ISA stability is likely to be more important.
In the embedded space, last I checked, one of the dominant processors
for a long time had been the "ARM Cortex A53".
There are other cores that are faster in terms of single-threaded
performance, but use more energy and cost more. Likely whatever
comes along would need to out-compete the A53 at its own game.
I don't think people chose the A53 for its stability, but mostly because
it is good at the things it is used for.
Well, and RV32IMFC makes a lot of sense as a competitor to the Cortex-M4
or similar (or RV32IMC vs Cortex-M0+), ...
Going smaller, there is the MSP430, but not clear if it makes as much
sense for RISC-V to try to compete with the MSP430 or similar (well,
outside the range where it would also compete with Cortex-M).
All that said, I am not actually opposed to Zibi or anything, but more
that it exists in a territory where it is more debatable if it makes
enough of a difference to be particularly worthwhile.
For a lot of things, I mostly go on a "does the performance delta cross
1%?" heuristic. I am more pessimistic of features which are most likely
to fall well short of 1%, ...
> If RISC-V was 10x faster than ARM/X86 then it would be a different
> story, SiFive refused my help on that years ago, so I'm not surprised by
> the current mess.
>
There are ways to make it faster, but it sometimes seems to me like
things more often go off in random directions that add more complexity
than ideal, or previous poor choices compound to make the situation
worse (say, the proliferation of ".UW" instructions is a dark path; that
".UW" instructions are seen as beneficial should IMHO be taken as a bad
omen).
Granted, my own project is also subject to needless complexity, so I
can't say too much, but alas...
Like, while some of my own efforts seem promising, I couldn't reasonably
expect anyone to use them (or, widespread adoption might actually end up
being net-negative...). So, many things I still consider experimental.
In some areas, RISC-V likely actually needs "less" than what it has already.
I would assume optimizing things mostly for cost/benefit
tradeoffs. So, if a feature is expensive, or fails to cross some minimum
level of benefit, it can be trimmed.
Like, even within RV64GC, there is still a lot of stuff that could be
trimmed or demoted to emulation through traps without much negative
impact on overall performance.
For example, one can limit JAL to X0 and X1 only (faulting if Rd is not
either X0 or X1), and handle pretty much all of the 'A' extension with
traps, and code basically continues to run as it did before.
One can also turn FDIV and FSQRT into traps (FDIV is infrequently used;
FSQRT "hardly ever"; so the cost of the emulation traps tends to fall
below the "relevance threshold").
Then there are some cases that would be preferable to handle as traps
(like FMADD), except that GCC merges "a*b+c" into FMADD or similar often
enough that trapping it would be a bad thing for performance. But, if one
assumes a single-rounded result, then with an affordable FPU design a
trap may still be needed to deal with it.
Though, for contrast, things like 64-bit integer multiply and divide
happen often enough that trapping would be a bad option, but still not
enough to justify making them "actually fast".
So, one ends up with an implementation where, say:
  80 cycle MUL/DIV makes sense (1-bit shift-and-add);
  500 cycles is too much cost (tanks performance);
  10 cycles, while possible, is too expensive to justify
    (e.g., 4-bit Radix-16 logic).
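For reference, the 1-bit shift-and-add case is basically the following
loop done one iteration per clock (a C model of the datapath; ~64
iterations for a 64-bit multiply, which with setup and writeback lands in
that ~80 cycle ballpark):

  #include <stdint.h>

  /* Radix-2 (1 bit per cycle) shift-and-add multiply: 64 iterations for
     a 64x64 -> low-64-bit result. A radix-16 unit would retire 4 bits
     per cycle (~16 iterations) but needs considerably more adder and
     mux logic. */
  uint64_t mul64_shift_add(uint64_t a, uint64_t b)
  {
      uint64_t acc = 0;
      for (int i = 0; i < 64; i++) {
          if (b & 1)       /* conditionally add the shifted multiplicand */
              acc += a;
          a <<= 1;         /* multiplicand shifts left each step */
          b >>= 1;         /* multiplier shifts right each step */
      }
      return acc;
  }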
One can bemoan all the stuff that RISC-V does that wastes excessive
amounts of encoding space, or could have been done better with a
different mechanism, but "it is what it is" sometimes...
As for stuff to add:
  Something like the recent Zilx/Zisx is strongly needed.
  Load/Store Pair and Jumbo-Prefixes in larger cores:
    For small cores: Optional / Absent.
    Needs 64-bit instruction fetch and a 4R2W register file.
  ...
But, granted, this could likely be because a lot of the code I use for
testing is more strongly affected by this.
I was mildly annoyed by the Zilx/Zisx proposal breaking my LDP/SDP
encodings, but I have now resolved to move them over to FLQ/FSQ (since I
have already determined that, barring some external force, I am not
going to implement the Q extension, I can consolidate all of LDP/SDP
under FLQ/FSQ now that the LDU/SDU space got stomped).
Though, I am left to consider the possibility of a "pseudo Q":
  Binary128 values are represented as register pairs;
  If you try to do FADD.Q or FMUL.Q or whatever, it traps;
  The trap handler then deals with it.
This leaves open the option of eventual hardware support, and isn't that
much more expensive than using runtime calls. In my case, it can also be
added with almost no change to the ISA, decoder, or pipeline as they
already exist.
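For, say, the FADD.Q case, the handler's job is mostly gluing register
pairs to a soft-float routine; a sketch (the even/odd pairing convention
and the softfp128_add() helper are assumptions of mine, just to show the
shape of it):

  #include <stdint.h>

  /* Binary128 value reconstructed from an even/odd FP register pair;
     assumed convention: low 64 bits in the even register, high 64 bits
     in the odd one. */
  typedef struct { uint64_t lo, hi; } f128_t;

  /* Assumed soft-float Binary128 add (could be a libgcc __addtf3 style
     routine or a handwritten one). */
  f128_t softfp128_add(f128_t a, f128_t b);

  static f128_t get_f128(const uint64_t *fregs, int r)
  {
      f128_t v = { fregs[r & ~1], fregs[r | 1] };
      return v;
  }

  static void put_f128(uint64_t *fregs, int r, f128_t v)
  {
      fregs[r & ~1] = v.lo;
      fregs[r | 1]  = v.hi;
  }

  /* Called from the illegal-instruction handler when it decodes an
     FADD.Q (OP-FP opcode, funct7=0000011). */
  void emulate_fadd_q(uint64_t *fregs, uint32_t ir)
  {
      int rd  = (ir >>  7) & 0x1F;
      int rs1 = (ir >> 15) & 0x1F;
      int rs2 = (ir >> 20) & 0x1F;
      put_f128(fregs, rd,
               softfp128_add(get_f128(fregs, rs1), get_f128(fregs, rs2)));
  }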
I still have some reservations about pre/post increment, but it does
have some merit at least for code density, so OK.
Though, it might seem paradoxical to strive for performance while also
endorsing handling various parts of the ISA with a bunch of emulation
traps (and burning many hundreds of cycles every time one of these
happens).
One could also debate, say, whether unaligned load/store could also be
handled by traps (to reduce the cost of the L1 cache). I had assumed
keeping unaligned load/store fast, as these have a few major "killer
apps": Huffman and LZ77 compression. In effect, not having unaligned
memory access has a strong negative effect on the ability to do
semi-fast data compression.
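To make that concrete: a typical fast bitstream reader, as used for
Huffman decoding, refills its bit buffer with an unaligned 64-bit load.
A generic sketch (nothing implementation-specific; the memcpy is the
unaligned load, and on hardware without fast unaligned access it
degenerates into a byte-at-a-time gather or a trap on nearly every
refill):

  #include <stdint.h>
  #include <string.h>

  typedef struct {
      const uint8_t *src;   /* compressed input (little-endian target) */
      uint64_t bitbuf;      /* bits not yet consumed, LSB first */
      int      bitcount;    /* number of valid bits in bitbuf */
  } bitreader_t;

  /* Branchless refill: afterwards, bitbuf holds at least 56 valid bits.
     The memcpy becomes a single unaligned 64-bit load on targets where
     that is fast; src rarely lands on an 8-byte boundary in practice. */
  static inline void br_refill(bitreader_t *br)
  {
      uint64_t chunk;
      memcpy(&chunk, br->src, 8);
      br->bitbuf   |= chunk << br->bitcount;
      br->src      += (63 - br->bitcount) >> 3;
      br->bitcount |= 56;
  }

  /* Peek/consume n bits (n < 32), e.g. to index a Huffman lookup table. */
  static inline uint32_t br_peek(bitreader_t *br, int n)
  {
      return (uint32_t)br->bitbuf & ((1u << n) - 1);
  }

  static inline void br_consume(bitreader_t *br, int n)
  {
      br->bitbuf   >>= n;
      br->bitcount  -= n;
  }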
But, perfection is impossible...