Add with carry instructions


Vadím Sukhomlínov

Feb 12, 2025, 5:58:21 PM
to RISC-V ISA Dev
Hi All,

When implementing big-number operations on RISC-V (e.g. for cryptography or scientific applications), the lack of a carry flag makes the code relatively inefficient compared to x86, where the `adc` instruction exists. On RISC-V you need to compute and add the carry explicitly, resulting in code like the following for the common primitive add_with_carry(a, b, carry) -> (sum, carry):

```
// In:  a0 = a, a1 = b, a2 = carry
// Out: a0 = (a + b + carry) mod 2^XLEN, a1 = carry out = (a + b + carry) >> XLEN

 add    a3, a2, a0    // a3 = a + carry
 sltu   a2, a3, a2    // a2 = carry out of (a + carry)
 add    a0, a3, a1    // a0 = a + carry + b (the sum)
 sltu   a1, a0, a3    // a1 = carry out of the second add
 add    a1, a1, a2    // a1 = combined carry out
```

On x86 that would be a single `adc` instruction, taking both operands plus the carry in CF.

My proposal is to introduce `addlo` and `addhi` instructions, which would replace the sequence above with:

```
 mv    a3, a0     // can possibly be avoided with proper unrolling
 addlo a0, a1, a2 // a0 = a0 + a1 + a2 (low XLEN bits)
 addhi a1, a2, a3 // a1 = a1 + a2 + a3 (high XLEN bits)
```

If 4-operand instructions are possible, or, say, paired destinations are used, then it could be a single instruction producing the 2*XLEN-bit result of a 3-value addition.
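For clarity, a C model of the proposed semantics (a sketch; `addlo`/`addhi` are the hypothetical instructions above, shown for XLEN=64, using the GCC/Clang `unsigned __int128` extension):

```
#include <stdint.h>

/* addlo: low XLEN bits of the 3-way sum (wraps mod 2^64). */
static inline uint64_t addlo(uint64_t a, uint64_t b, uint64_t c) {
    return a + b + c;
}

/* addhi: high bits of the 3-way sum; 0, 1, or 2 for arbitrary c. */
static inline uint64_t addhi(uint64_t a, uint64_t b, uint64_t c) {
    unsigned __int128 s = (unsigned __int128)a + b + c;
    return (uint64_t)(s >> 64);
}
```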

I'm not sure if this is a viable proposal (maybe it was discussed earlier), or whether it should be an addition to an existing extension or a new extension.

Thanks,
Vadim

Robert Finch

Feb 12, 2025, 8:11:50 PM
to RISC-V ISA Dev, Vadím Sukhomlínov

In my Q+ architecture I handle extended precision with adds that have three source operands.

ADD(a, b, c) => sum

and

ADDGC(a, b, c) => carry    (add generate carry, two bits; three XLEN-bit operands sum to just under 3*2^XLEN, so the carry-out is 0, 1, or 2)

Both add instructions can be performed simultaneously on a machine more than one wide.

Something similar could be done in RISC-V using wider opcodes (48/64-bit).

But a simple Add with carry instruction would be more compact. Adding a carry flag to the architecture would be a PITA but maybe it could be done for the special cases of crypto / scientific extended precision. It is additional state that would need to be managed. If the extended precision addition could be performed in an atomic manner then the flag could be made temporary and not need its state managed.

In Q+ there is an ATOM instruction modifier that makes a short sequence of instructions atomic, disabling interrupts between them. Therefore, extended precision arithmetic could be performed as an atom.

```
ATOM 3
ADD r1,r3,r5
ADC r2,r4,r6
MOV r7,C        ; if C is needed later
```

Alternatively, the C flag could be resident in a CSR somewhere.

I think having an explicit carry flag would not be too much burden for an OoO machine supporting register renaming. The C flag would just need its own register designation to be renamed.

BGB

Feb 12, 2025, 8:16:20 PM
to isa...@groups.riscv.org
Seems to make sense...


In my case, I have some ADC/SBB style instructions in my ISA, but these
don't map over to RV as they imply the existence of architectural state
that doesn't exist in RISC-V (namely, the existence of 'S' and 'T'
status flags).

Another option could be ADDX and SUBX instructions, which operate on
pairs of registers (though even-numbered only, in my case). These are
not yet mapped over to the RV encoding space; I would have to check
whether anyone else has defined encodings for them.


Though, I am not sure how well something like 128-bit ALU operations
would go over in the rest of RISC-V land...


Say:
```
ADDX X10, X12, X14 // X11:X10 = X13:X12 + X15:X14
```

These would get convoluted for larger adders though.
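For reference, a C model of the pair-register semantics (a sketch; ADDX is the hypothetical instruction above):

```
#include <stdint.h>

/* ADDX as a 128-bit add over even/odd register pairs:
 * d[1]:d[0] = s[1]:s[0] + t[1]:t[0], wrapping mod 2^128. */
static void addx(uint64_t d[2], const uint64_t s[2], const uint64_t t[2]) {
    unsigned __int128 a = ((unsigned __int128)s[1] << 64) | s[0];
    unsigned __int128 b = ((unsigned __int128)t[1] << 64) | t[0];
    unsigned __int128 r = a + b;
    d[0] = (uint64_t)r;
    d[1] = (uint64_t)(r >> 64);
}
```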

And, say:
```
CLRT               // Clear T status bit
ADCX X10, X16, X20 // 128-bit ADD with Carry (via T)
ADCX X12, X18, X22
```
Likely wouldn't go over well in RV land...


While it could arguably be nicer to support odd-numbered pairs, this
opens up a lot of issues in the instruction decoder, so it seemed better
to limit these sorts of instructions to even-numbered register pairs.

So, one can probably assume only having "basic" cases with no added
architectural state.



But, say, for a 256-bit adder, one could do:
```
ADDX X10, X16, X20   // low 128 bits:  X11:X10 = X17:X16 + X21:X20
ADDX X12, X18, X22   // high 128 bits: X13:X12 = X19:X18 + X23:X22
OR   X14, X17, X21   // OR of the operands' top words
XORI X15, X11, -1    // invert the top word of the low sum
AND  X14, X14, X15
SRL  X14, X14, 63    // X14 = (a[127] | b[127]) & ~sum[127]
LI   X15, 0          // high word of the carry pair
ADDX X12, X12, X14   // add the carry into the high half
```


Though, an option could be to add a 3-input instruction to figure out
whether there was a carry, say:
```
ADDX  X10, X16, X20
ADDX  X12, X18, X22
MV    X14, X11       // top word of the low sum (ADDCO reads Xd)
LI    X15, 0         // high word of the carry pair
ADDCO X14, X17, X21  // X14 = carry out of the low half
ADDX  X12, X12, X14
```

Where ADDCO does Xd = ZExt( (Xs[63] || Xt[63]) && !Xd[63] );

An ADDCO operation would also work on 64-bit values (it doesn't need
to model the carry, just infer whether one had occurred). It could be a
little nicer with a 4R encoding, though this raises different issues; no
great solution.
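For reference, the full carry-out identity also has an (a AND b) term, which covers the case where both top bits are set and a carry-in makes the sum's top bit 1 as well; a portable C sketch of detecting carry-out without a wider type:

```
#include <stdint.h>

/* Carry-out of a 64-bit add, without a wider type:
 * cout = (a & b) | ((a | b) & ~sum), taken at bit 63. */
static uint64_t carry_out(uint64_t a, uint64_t b, uint64_t sum) {
    return ((a & b) | ((a | b) & ~sum)) >> 63;
}
```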


Though, in my case, I had rarely needed values over 128 bits.


But, 256-bit integer values may come up when trying to implement support
functions for Binary128 FMUL and FDIV, so...

As for how to best implement a 128 x 128 -> 256 bit multiplier... Yeah...


My experiments in hardware support for 128-bit integer operations did
not extend to multiply or similar; and in this case the fastest option
had generally been to build it piecewise from 32-bit unsigned widening
multiplies (sadly, RV doesn't natively have this instruction either; in
premise it can be done with a 64-bit multiply, though at present my
64-bit multiplier is a fair bit slower).

Say, one builds values from the intermediate piecewise products, and
then adds them all together. All of this is made easier if one has
instructions to select the high/low halves of two registers to build a
new register. One could borrow these from the 'P' extension (for the
low-low case, the B extension has the 'PACK' instruction; but for best
results one really needs all 4 cases).
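As a sketch of the piecewise approach, a 64x64 -> 128-bit unsigned multiply built from four 32-bit widening multiplies (a 128x128 -> 256 multiply composes the same way from these pieces):

```
#include <stdint.h>

/* 64x64 -> 128-bit unsigned multiply from 32-bit widening multiplies. */
static void mul64to128(uint64_t a, uint64_t b, uint64_t *hi, uint64_t *lo) {
    uint64_t a0 = (uint32_t)a, a1 = a >> 32;
    uint64_t b0 = (uint32_t)b, b1 = b >> 32;
    uint64_t p00 = a0 * b0, p01 = a0 * b1;   /* partial products */
    uint64_t p10 = a1 * b0, p11 = a1 * b1;
    uint64_t mid = (p00 >> 32) + (uint32_t)p01 + (uint32_t)p10; /* can't overflow */
    *lo = (mid << 32) | (uint32_t)p00;
    *hi = p11 + (p01 >> 32) + (p10 >> 32) + (mid >> 32);
}
```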



In my case, I have the cheaper/easier and more useful cases:
ADDX/SUBX
ANDX/ORX/XORX: Just a decoder hack to run two operations in parallel.
CMPEQX/CMPGTX: Would be renamed SEQX and SLTX for RV.
MOVX: Decoder hack, was two parallel MOV / MV operations.
SHADX/SHLDX: Shift, would be SLLX/SRLX/SRAX in RV terms.
Internally, can be done with two independent funnel shifters.


The ADDX/SUBX and CMPxxX cases were special in that they required
internal signaling (mostly of carry-flag and zero-flag bits) between the
two ALUs.


Though, the main incentive here was wanting "long double" to not be a
boat anchor, but also the cost of supporting Binary128 in hardware is
impractical (so, at the time, adding 128-bit integer math instead seemed
like a reasonable compromise).



Bruce Hoult

Feb 12, 2025, 8:18:09 PM
to Vadím Sukhomlínov, RISC-V ISA Dev
People always intuitively think it looks inefficient, but have you
benchmarked it in the context of a real application, not a
micro-benchmark?

With a natively 64-bit CPU, the need for small multi-word calculations
is vastly reduced compared to 8-bit machines, where such operations
were very frequent and having an ADC instruction and carry flag made a
huge difference. On RISC-V it should usually either be a very rare
operation where the speed doesn't matter, or else be interleaved with
a lot of loads and stores, possibly missing cache for true bignums.

Also if you really care about performance then you'll use the VADC and
VMADC instructions in the V extension, which use a mask register to
hold the carry bits.

Vadím Sukhomlínov

Feb 13, 2025, 12:45:55 AM
to Bruce Hoult, RISC-V ISA Dev
Bruce, yes, I benchmarked and compared to specialized HW (OTBN in OpenTitan, https://opentitan.org/book/hw/ip/otbn/index.html) for RSA operations. The difference was 20x+ vs the Ibex core (32-bit):
1. OTBN uses a 256-bit bus and loads/stores 256 bits at a time, while Ibex needs 8 32-bit loads/stores.
2. OTBN has an add-with-carry instruction (https://opentitan.org/book/hw/ip/otbn/doc/isa.html#bnaddc) which does in 1 cycle what needs 8x5 cycles on Ibex.
3. OTBN requires loading microcode, copying data in/out, etc.

Adjusting for the same width, the difference seems to be ~3x as measured, and 5x theoretical for addition only.

I agree that the "V" extension with VADC/VMADC/VSBC addresses the problem when available. My proposal doesn't require a fully-featured "V" extension and is suitable for low power devices.


BGB, your proposal is interesting, but it adds a new type of architectural state. In my opinion, if register pairing is used, then `ADCX X2, X0, X3` - a 3-way 32-bit addition with a 64-bit output - can be a simpler primitive, performing X3:X2 = X2 + X0 + X3; the carry flag would remain in X3, and you'd load the values to add into X0 and X2 sequentially.

Regards,
Vadim

BGB

Feb 13, 2025, 6:24:12 AM
to isa...@groups.riscv.org
On 2/12/2025 11:45 PM, 'Vadím Sukhomlínov' via RISC-V ISA Dev wrote:
> I agree that the "V" extension with VADC/VMADC/VSBC addresses the
> problem when available. My proposal doesn't require a fully-featured "V"
> extension and is suitable for low-power devices.

Yeah, V seems like a pretty heavyweight option in my view as well.

Big OoO chip for a PC or high-end phone, why not?...


Smaller, dunno.

Seems like, in terms of implementation, it would be fairly expensive
for what it does (vs a more minimalist design).



Can note that seemingly nothing has come along to dethrone the
Cortex-A53. There are plenty of newer / bigger / faster cores. But, last
I checked, if one gets a cheap smartphone, very possibly it will still
have an A53.

I guess, a person could try to figure out how to capture the same sort
of magic...

Reasonably fast but also cheap. Ideally needs to be cheaper, faster, and
more energy efficient...


One can seemingly almost judge the relative cost of a core by its L1
data cache:
16K direct-mapped: cheapish
32K direct-mapped: medium / balance
64K set-associative: money to burn
...
More so if they also have a similarly large instruction cache.

Whereas, if cache is 8K or 4K (or less), likely cost was a much bigger
concern than it performing well.


But, this is partly by observing that once one reaches 32K of L1 cache,
adding any more is solidly in diminishing returns territory. Whereas,
going below 16K, performance drops off rapidly. This pattern seems to
hold across a range of workloads.

Set associativity gains a certain amount of performance relative to
cache size, but at a chunk of logic cost; cost optimization often
favors direct-mapped caches.


Though, this is assuming 2-wide in-order superscalar. If the CPU is OoO,
they have money to burn. Cache is probably set-associative as well, as
seemingly no one is going to bother pairing an OoO core with a
direct-mapped cache.


Not sure if a common practice, but had noted the ability to mop up a lot
of the conflict misses by putting a much smaller set-assoc cache between
the L1 and L2 caches (it is almost like one has a set-assoc L2 cache,
but cheaper).

Say, gluing a 32-entry x 4-way cache onto the front-end of a 128K
direct-mapped L2 cache, shielding the L2 cache from a lot of the
conflict misses.


...


> BGB, your proposal is interesting, but it adds a new type of
> architectural state. [...]
Yeah.

It is a tradeoff. As I can note, my CPU isn't a pure RISC-V core (or
even natively a RISC-V core), but more of a somewhat wonky design...

I have doubts that anyone would likely adopt my design as-is, but it
does exist as a possible platform for design experimentation.

So, my wonky way of doing things is partly reflective of this...



Its native ISA design can actually be traced back to the Hitachi SuperH
SH-4 (the CPU/ISA used in the Sega Dreamcast), but it has mutated quite
significantly:
SH-4: Had 16x 32-bit GPRs, and fixed-length 16-bit instruction words.

First ISA design attempts were a "BJX1" project:
BJX1-32:
Core ISA was still SH-4;
Added some custom instructions;
Added some stuff borrowed from SH-2A.
Added some constant load and displacement load/store ops (*1).
Also borrowed stuff from SH-4A.
BJX1-64A
Switched to 64-bit registers
Used Mode flags to switch in/out parts of the ISA
Similar to the SH-4 FPU
Not enough encoding space to fit everything at the same time.
BJX1-64C
Switched to 64-bit registers
Dropped some parts of the SH ISA to free up encoding space.
No more mode-twiddling was needed.
Though, this was itself essentially a sub-mode.


*1: In SH-4, direct displacements weren't really a thing (for the most
part), as it had effectively a few major address modes:
@Rm //Register used as an address
@(R0,Rm) //Rm+R0
@-Rm //Pre-Decrement
@Rm+ //Post-Increment
So, in the general case, if you wanted a displacement, you did, say:
MOV #Disp8s, R0
MOV.L @(R0,R4), R8
There were a few instructions that had displacements, but could often
only load into R0. Though, for a subset of cases, 4-bit displacements
existed.

Variants like SH-2A had added instructions that could do, say:
MOV.L @(R4,0x180), R8

The SH-4 also had a wonky FPU:
There were 16x 32-bit FPRs (FR0..FR15);
They would be paired into logical 64-bit registers for Double;
...

You couldn't encode both Single and Double operations at the same time
(the FPU instructions were effectively bank-swapped depending on
mode-bits in FPSCR). Moving values between Single and Double was a pain,
and multiple instructions were needed to Load/Store Double values.

The SH-4A had added a mode that partially fixed some of the worst of the
issues here, and I then realized it was "better" to just leave the FPU
in this mode and then use format-conversion to load/store Single.

IIRC: "MOV R8, R1; MOV.L @R1+, FPUL; FCNVSD DR2"
to load from R8 into DR2 (with only "@Rm+" addressing existing for a
load into FPUL, ...). Yes, this still kinda sucks (but was "less bad"
than the alternatives).


Had initially went with PE/COFF and took some inspiration from WinCE's
design.

However, the ISA design was kind of a horrible mess...
When I started trying to implement it on an FPGA, the extent of the mess
became quite evident.


I then did a "soft reboot", into an ISA I then called "BJX-2" (hyphen
added for "reasons"), which was in its initial form, mostly backwards
compatible with 64C's ASM code.

The encoding scheme however was quite significantly redesigned (at this
stage, it was a 16/32 ISA). But, not in the same way as RISC-V (in that
its 16-bit ISA was still fairly similar to SuperH; it was the 32-bit
encodings that got kinda weird).

Things like Post-Increment or Pre-Decrement no longer existed in the ISA
proper, but the assembler could still fake them if needed (using
multiple instructions). It also eliminated the use of branch delay-slots.

Not long into the process, I ended up also dropping the SH-4 style FPU
in favor of merging FPU operations into GPR space (each GPR being
understood as a Binary64 value).

The 16-bit ops could mostly only access R0..R15, whereas 32-bit ops also
had access to R16..R31.

At this stage, the primary addressing modes were now:
(Rm, Disp) //Fixed displacement
(Rm, Ri) //Index register [Rm+Ri*Scale]

R0 and R1 became special for memory ops (as base or index registers),
and could be used to mimic many of the addressing modes that had existed
in SuperH.

The interrupt mechanism was also partly redesigned, and the CR (Control
Register) and SR (Status Register) spaces were merged and reorganized
somewhat.

Basically, trying to clean up a lot of the worst of the awfulness.


It also gained a few features:
Predicated instructions (partly inspired by 32-bit ARM);
Explicit bundle encoding (like TMS320 and MSP32);
Jumbo-Prefixes, which allowed larger 64 and 96 bit encodings, using the
same underlying mechanism as the bundle encoding (in the vast majority
of cases, the prefixes are used solely to make immediate and
displacement fields bigger; but can also be used for other things).

Then also added SIMD (still using GPRs), ...


When something was lacking or broken in the design, would generally try
to address it. Design was not "frozen", so I also reserved the right to
break things or drop things if needed (though I do try to limit this).



The ISA was later expanded again to 64 GPRs, but initially only a subset
of the ISA had direct access to R32..R63, which was very annoying.

I later ended up adding a modified form of the ISA that I had called XG2
(I had then renamed the original form "Baseline", but am now calling it
XG1). This was added as a new operating mode.

As for the XG2 variant:
Dropped the 16-bit encoding space;
Used the freed up bits to go natively over to 64 GPRs for the entire ISA;
Makes some immediate and displacement fields slightly bigger.


Then RISC-V enters the story:

Around this point, I noticed that my ISA had basically become almost a
direct superset of RISC-V...

So, I basically just had to throw an alternate set of instruction
decoders on it and, "poof", now it also runs RISC-V code. Wasn't quite
that easy, but more or less that is what happened. Some of the features
I ended up adding to fill the gaps for RISC-V support also got
back-ported to my ISA (excluding stuff that "wouldn't make sense" due to
design differences).

For FPU, I split the register space in half:
Low 32 registers go to the X registers;
High 32 registers go to the F registers.

Some of the registers are mapped around for RISC-V.
X0 -> ZR
X1 -> LR (C1)
X2 -> SP (R15)
X3 -> GBR (C6)
X4..X13: R4..R13
X14/X15: R2/R3
X16..X31: R16..R31


I initially thought, "well, GCC is super clever, and my compiler sucks,
so maybe RISC-V will be faster..."

But... It was not. Initially it was close, but with some more beating on
my compiler, my ISA is in the lead (both in terms of speed and also
making smaller binaries).


More recently though, I also went and added support for RISC-V output to
my compiler.

Had noted that if a few major features are glued onto RISC-V, it can get
a notable performance boost.

Most notably:
Register Indexed Load/Store;
Jumbo prefixes;
Load/Store Pair.

This can at least somewhat close the gap.

But, I can also note that in my CPU, most instructions have a 2 or 3
cycle latency, where it appears other RV cores would have 1 or 2 cycles
of latency. Higher average-case latency may have a non-zero effect on
performance tradeoffs.

Though, this doesn't seem unreasonably unfair to RV64 IMO (both ISAs
have to work with the same instruction latency values in this case).

Where, in this case, Integer/FPU/SIMD all use the same registers and
pipeline.



Did add yet another ISA variant for my ISA, which I call XG3:
Shuffles bits around to be able to coexist in the same encoding space as
RV64G;
Switches to the same register numbering and ABI as RISC-V.

This allows basically crazy-gluing both ISA's together into a sort of
conglomerate ISA.

Also it allows direct function calls between XG3 and RV64G with no
special handling needed (however, this does not fully extend RV64GC,
which still requires some special handling as the encoding spaces exist
in separate operating modes).

This mode is able to operate at roughly similar speeds to XG2, while
also being (mostly) function-call compatible with code compiled by GCC.


My compiler (BGBCC) also mostly treats XG3 as if it were part of the
RV64 space (it targets RV64 by hacking around in the same backend as
XG1/XG2, which is itself a fork of the original SH/BJX1 backend; parts
of the compiler are still not entirely free of SH-isms).

Yet somehow all works, and manages to be performance competitive.



Though, with all of this, things have gotten hairy again.
Almost tempting to consider another soft reboot, but would need to
decide what direction to go with it...

Might be almost tempting to do a CPU with RV64G as the primary ISA, but
there is still the matter (from my perspective) that XG2 is still around
30-40% faster for programs like Doom.

( Where, ~ 30% due to known features, and another ~ 7%-ish due to
unknown/uncertain factors... )


For GLQuake and Quake 3 Arena, plain RV64G currently gets kinda wrecked;
the OpenGL implementation being entirely unusable on a 50MHz CPU (the GL
implementation benefits a fair bit from SIMD and lots of registers).


The likely compromise being to keep a similar set of extensions to what
I have already (just maybe change the boot-time ISA and hardware memory
maps, properly implement the various RV CSRs, ...).

Maybe also try to reduce the LUT budget (my existing core uses a lot
more FPGA resources than I would prefer; but unclear how to really
reduce budget without dropping a lot of features).


Probably a bit heavyweight: to fit a reasonable amount of the feature
set, one needs to go towards the higher end of the Spartan or Artix
class FPGAs...


But, seemingly, can't fit that much more than a fairly basic RISC style
core on an XC7S25 or similar, dunno if skill issue or just that this is
about as much as one can do on such an FPGA. At least on an XC7S50, can
fit a superscalar core that can also chew through Binary16 SIMD
operations (but, 128 bit Binary32 SIMD is a bit much for this FPGA); ...



Though, it may all be kinda moot, when one can just buy an existing SBC
with either an ARM or RISC-V processor, which is then going to promptly
run circles around any kind of soft processor that one can implement on
an FPGA.

Like, I am not going to get GHz speeds out of an FPGA, but (not having
any resources to do an ASIC), an FPGA is all it will ever be...

Kinda burnt a good number of years on all this, but am faced with the
issue that seemingly, whatever one does, it is essentially moot.

...

Vadím Sukhomlínov

Feb 13, 2025, 1:24:37 PM
to BGB, isa...@groups.riscv.org
BGB, thanks for the deep overview! It is really interesting for me to trace the evolution of microarchitectures. There are many design choices which depend on the intended use cases.

For example, we typically have a register count that is a power of 2 - it is straightforward to encode in the ISA - but I couldn't find any research on the impact of register count on code density and performance for common tasks (e.g. evaluating register pressure for compilers, and the trade-offs). Is 32 registers a sweet spot, or would, say, 24 be not much worse while resulting in lower power consumption? Or do we need 50? Techniques like software pipelining (e.g. on Itanium) with rotating register windows may enable interesting optimizations without the need to expose direct access to registers in the ISA.

I was also looking into "unregistered" architectures where, instead of a register, I'd just point to the relative instruction which produced the result, or to a memory location (directly encoding the data-flow graph in the ISA), but this was purely theoretical and I never had time to evaluate its efficiency.

There is a great overview of various ideas tried in the past: https://gordonbell.azurewebsites.net/computer_structures__readings_and_examples/contents.html


Regards,
Vadim




Vadím Sukhomlínov

Feb 13, 2025, 4:38:22 PM
to Robert Finch, RISC-V ISA Dev
Robert,

What you've described, with the 3-source ADD and ADDGC, is very similar to my proposal. Indeed, the carry can be 2 bits for the general case with arbitrary `c`. No extended encoding is necessary if `rd` is also a source register - this seems to be the primary difference from existing conventions.

In my proposal, the output of `addhi` is exactly that explicit carry flag, kept in the specified register. The only drawback I see is that the source and destination registers are the same, but you can specifically designate one as a "carry register" in your code. This way you avoid creating a new architectural entity, and you save the instructions otherwise needed to load the carry flag into a GPR when you need it.
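For illustration, a minimal sketch of a multi-word add built on these semantics, reusing the hypothetical addlo/addhi model from the first message, with an ordinary variable standing in for the designated carry register:

```
#include <stdint.h>

/* Hypothetical addlo/addhi semantics, modeled for XLEN=64. */
static inline uint64_t addlo(uint64_t a, uint64_t b, uint64_t c) {
    return a + b + c;                        /* low 64 bits, wraps */
}
static inline uint64_t addhi(uint64_t a, uint64_t b, uint64_t c) {
    unsigned __int128 s = (unsigned __int128)a + b + c;
    return (uint64_t)(s >> 64);              /* carry out */
}

/* r = a + b over n words; 'carry' plays the role of the designated
 * carry register (it stays in {0,1} here, so addhi yields 0 or 1). */
static void bignum_add(uint64_t *r, const uint64_t *a,
                       const uint64_t *b, int n) {
    uint64_t carry = 0;
    for (int i = 0; i < n; i++) {
        uint64_t lo = addlo(a[i], b[i], carry);
        carry = addhi(a[i], b[i], carry);
        r[i] = lo;
    }
}
```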

BGB

Feb 13, 2025, 6:58:53 PM
to Vadím Sukhomlínov, isa...@groups.riscv.org
On 2/13/2025 12:24 PM, Vadím Sukhomlínov wrote:
> [...] Is 32 registers a sweet spot or say 24 wouldn't be much worse but
> result in lower power consumption? or do we need 50? [...]
>

For register count:
16: Not really enough
There is still a fairly high amount of spill and fill.
32: Mostly good
For most normal programs, spill/fill rate is low.
64: Useful for code with unusually high register pressure.
For many functions, allows assigning all variables to registers.
For leaf functions, allows most to use purely scratch registers.

I would expect 24 to sit between 16 and 32.


So, with 64, it is possible to achieve a state where, apart from the
prolog/epilog for non-leaf functions, many functions can be made
free of spill-and-fill.

Though, this is excluding cases where the program had taken the address
of a local variable or similar. Then you need to spill and reload the
variable as needed, against the possibility that its backing memory may
have been modified.

For argument registers:
8 arguments covers around 98% of functions.
16 arguments covers ~ 100%
Sometimes there may be a few outliers that exceed 16 arguments.

For how many registers needed (eg, callee save) for full coverage of all
the local variables:
8: 60% (SH-4, 7)
12: 70% (RISC-V, 12, or 12+12 X/F)
16: 76% (XG1, 15)
24: 87% (XG3, 24, X+F are combined)
32: 93% (XG2, 31)

Any remaining functions (beyond the percentage) would need to make use
of at least some spill-and-fill. Reaching 100% would require a machine
with 256 GPRs, which would be impractical.


Granted, whatever registers are left to callee save are taken away from
scratch, and often for leaf functions, scratch registers are preferable.

The RISC-V ABI balances it around:
60% / 40%: Scratch versus Callee-Save
My ISA's native ABI:
48% / 52%: Scratch versus Callee-Save


Despite RV64G and XG3 having the same registers, the main difference is
that XG3 can use any register at any time. This can significantly help
with things like register allocation. Usually, the register pressure is
highly uneven, so an architectural split leads to inefficiency here.

Though my Jumbo-Prefix extension does allow RV to use non-default
registers (say, allowing the X side to overflow onto the F side), I have
noted that using a 64-bit encoding every time an instruction uses a
non-default register is worse for code density and performance than the
register spills...


This differs from XG2 and XG3, which instead have 6 bits for each
register. The tradeoff, though, is that XG2 and XG3 have fewer bits for
opcode and also smaller immediate/displacement fields (10 bits).

However, the statistical "sweet spot" seems to be around 9 bits, so the
loss due to the smaller immediate fields is small.

XG2 and XG3 use scaled displacements, so the average-case loss is smaller:
RISC-V: 2K for Load/Store
XG2/XG3, depends on Size:
B: +/- 0.5K, less
H: +/- 1K, less
W: +/- 2K, same
Q: +/- 4K, more
Bcc in XG3 uses a 4-byte scale, so +/- 2K, vs +/- 4K for RV.

XG3 can't directly encode misaligned displacements in a 32-bit encoding,
but can do so in a 64-bit encoding, however misaligned displacements are
rare in practice.


The shorter displacement on Bcc doesn't appear to be too big of an issue
in a statistical sense (around 92% vs 97% for local branches). At
present, in my existing implementation, Bcc+JAL or similar is preferable
as the fallback case when a Bcc misses the target range.



For note, XG3's instruction formats:
* XXXX-oooooo-mmmmmm-ZZZZ-nnnnnn-QY-YYPw (3R / 3 Registers)
* jiii-iiiiii-mmmmmm-ZZZZ-nnnnnn-QY-YYPw (3RI, 2 Registers and Imm10)
* iiii-iiiiii-iiiiii-aZZZ-nnnnnn-bY-YYPw (2RI, Register and Imm16)
* iiii-iiiiii-iiiiii-aZZZ-jjjjjj-bY-YYPw (Branch)
* iiii-iiiiii-iiiiii-jjjj-jjjjaa-a1-1Y10 (Jumbo Prefix)

Where: X/Y/Z/Q: Opcode
Y: Divides into major instruction block.
Z: Instruction (or group within this block)
X: 3rd part of Opcode
Q: Modifier bit or Opcode Bit
a/b: Opcode or extension bits.
And, n/m/o: Register fields.
n: Destination register (Rd)
m: Source Register 1 (Rs/Rs1)
o: Source Register 2 (Rt/Rs2)

This format tries to follow a similar structure to the RISC-V format,
albeit with the difference of 6-bit register fields, and internally the
bits are mostly shuffled around to turn it back into my original ISA.

If the low 2 bits are 11, it encodes a RISC-V instruction.


If the Imm10/Disp10 field fails to be sufficient, a jumbo prefix may be
used which will expand it to 33 bits. The remaining 3 'a' bits may go to
opcode.

Alternatively, the prefixes can be used to encode larger forms for
less-common instructions or less common edge cases (such as gluing an
immediate field onto an instruction which doesn't otherwise have one).

For example, if the Immediate-extension form of the jumbo prefix is used
on a 3R instruction, it is understood as turning Rt/Rs2 into an
immediate field (also 33 bits in this case, but encoded in a different
way, 24+3+6 bits).

In this case, 33 bits was preferred partly as this is what my pipeline
had used, but also because 33 bits turns out to be a bit of a magic
number of bits.

Outside of small values, the distribution is not a bell curve.
There are big spikes near 16/17 and 32/33 bits.
Between 18 and 30 bits is fairly sparse.
And, 34..47, or 49..62, there is often hardly anything.
There is a small bump at 48,
but mostly due to MMIO addresses in my case.
So, next stop is a full 64 bits.

Up at 64 bits, there tends to be an island of 64-bit patterns and
Binary64 constants.



Though, a large majority of floating-point immediate values can be
(losslessly) converted to Binary16. So, my ISA had an instruction to
load a Binary16 immediate and convert it to Binary64 in-register. To the
extent I have experimented with FPU instructions with immediate values,
64-bit encodings with Binary16 immediates have often been what I went
with, for this reason.

These are exceptions though, say, 0.1, 0.2, 1.0/3.0, ... can't be
expressed exactly as Binary16; but 3.0 or 0.375 or similar can.

Theoretically, one could embed a smaller format in NaN range, say:
B.E5.M4
B selects a base:
0=2.0/3.0 (0.66..)
1=4.0/5.0 (0.8)
E5.M4 encodes a value that is multiplied by the base.

But, this likely isn't really worth it...
And, more fractional bases would require fetching too many bits.
But, could improve hit rate.
And would result in repeating 4 bit-patterns for low-order bits,
making lookup cheaper.
5 -> 12 bits (4x LUT5's)
Repeating the last 4 bits for the rest of the mantissa.
However, the coverage isn't likely enough to offset the logic cost.
Even if it could express 0.1 and 0.2 and similar exactly,
which are a bit of a hassle as far as FPU constants go...

Note, compiler isn't allowed to round here...

If the user types 3.1415926538359... and gets 3.1416016, they will not
be happy. But, one still gets pretty good coverage with Binary16 even
when rounding is not allowed (the compiler is only allowed to use the
immediate form if the result is bit-identical to the original value).
And constants like PI still need to be encoded as a 64-bit value.



XG3 can encode 3RI ops with a 64-bit immediate, but my RV+Jumbo
experiment can't do so, as there are not enough bits in the prefix in
this case:
* tiiiiii-iiiii-jjjjj-100-kkkkk-01-11111

One can do 21+21+12, but this is still mostly in the "no man's land".
22+22+20 gives 64, but this only covers LUI and similar.

A case could almost be made for limiting RV+Jumbo to 64-bit encodings
(to be more friendly to 2-wide machines).

However, cases where this would make a difference are rare; say:
ANDI X10, X18, 0x7F7F7F7F7F7F7F7F
would not be all that common even if it could be encoded directly.

Note that currently, it is not allowed to mix the jumbo prefixes between
ISAs (eg: using an RV Jumbo Prefix with an XG3 op, or XG3 Prefix with an
RV op).



Register Windows:
Unless the hardware has access to special facilities that the pipeline
lacks, doing it in hardware tends to be worse for performance than
having software do it. Load/Store Pair or dual-issue Load/Store seems to
be better both for hardware cost and performance.


Positional Encodings (like Mill's "Belt Architecture") are possible, but
I lack any first hand experience, and don't know of any real-world
implementations.

My guess is that this could likely be used to improve code density, but
would likely have no advantage in practice over a traditional register
machine.


My design is fairly conventional in these areas.
I generally tried to avoid things where the hardware would get too
"hands on".

I was actually a lot more on the "compiler and software deals with
everything" side. Did end up relenting and going from static-scheduling
to superscalar, partly as static-scheduling does add a bit of pain, and
going superscalar ended up being cheaper than I had originally imagined.

Though, my implementation of superscalar is basically entirely defeated
if one tries to use the 'C' extension (in this case, it falls back to
fetching and dispatching instructions one at a time). Dealing with the
'C' instructions here would make things a fair bit more difficult.

...



Vadím Sukhomlínov

Feb 14, 2025, 2:57:29 PM
to BGB, isa...@groups.riscv.org
BGB, thank you for the very detailed analysis of the pros and cons of various designs. This is extremely insightful. And thank you for referencing Mill's "Belt Architecture" - I had never heard about it before, but when I researched it, I found it very close to an idea I came up with. The "belt" representation (I referred to it as SSA, static single assignment) and direct addressing of instruction results seem like a very nice way to express parallelism in the architecture; the implementation can then decide how many different ALU blocks are needed for various purposes. Ivan seems to have made some prototypes to demonstrate the advantages, and I'm not sure why this approach is not being explored further (or maybe I'm wrong and it is).

Floating point constants: I didn't look into this area and don't have data on which constants are commonly used in computations. To me the major advantage vs. loading from memory is the reduction of memory bandwidth (only 1 instruction vs. instructions to form the address plus the load itself). This would improve the memory/FP ratio for performance-critical loops.

I'd explore the use of the BF16 format, a load-exponent immediate (covers 2^exp constants, e.g. 2, 1/2, etc.), loads of small integer immediates, and common constants like e and pi. Having an instruction to set/update the exponent can be combined with a load immediate to form a wide range of numbers, but since that would be 2 instructions, it is not clear it is better than loading from memory. Another common operation is loading the inverse of a small integer, e.g. 1/3, 1/5 - if the immediates are in a small range, this can be a LUT, and only odd values are needed if you can also update the exponent on top of it. Power-of-10 constants may also be valuable.
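A sketch of the inverse-of-small-integer idea (a hypothetical helper, not an existing instruction): a LUT over the odd divisors plus an exponent update for the factors of two, which is exact since scaling by 2^-k only changes the exponent field:

```
#include <math.h>

/* 1/d for small d: LUT for odd divisors; ldexp() models the exponent
 * update (d = odd * 2^k  =>  1/d = (1/odd) * 2^-k, an exact scaling). */
static double inv_small(unsigned d) {        /* assumes 1 <= d <= 16 */
    static const double inv_odd[8] = {
        1.0, 1.0/3, 1.0/5, 1.0/7, 1.0/9, 1.0/11, 1.0/13, 1.0/15
    };
    int k = 0;
    while ((d & 1) == 0) { d >>= 1; k++; }   /* strip factors of two */
    return ldexp(inv_odd[d >> 1], -k);
}
```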

Integer constants/immediates: I think for this case code/data density would be a higher priority. Loading a full-XLEN constant means the instruction length will be larger, so if code size is important, a variable instruction length will be needed. The RISC-V approach seems like a reasonable trade-off. I'd think it can be extended with loading various patterns, e.g. 0x7f7f7f7f7f7f7f7f as an 8-bit 0x7f pattern; see the sketch below. So you can encode the bit length in 3 bits (1-8) and the pattern in 8 bits and still fit in 12 bits; 20 bits would allow 4 bits for length and 16 bits for the pattern. For fixed-point values, including the integer inverses ((2^(2*XLEN)/r) >> XLEN) used to replace division with multiplication, a dedicated set of instructions can be used, same as for floats.
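A sketch of the pattern-replication decode (the 3-bit length / 8-bit pattern split is the hypothetical encoding from the paragraph above):

```
#include <stdint.h>

/* Replicate a short pattern across a 64-bit register.
 * len is the pattern width in bits (1..8), pat the pattern itself;
 * e.g. expand_pattern(8, 0x7f) == 0x7f7f7f7f7f7f7f7f. */
static uint64_t expand_pattern(unsigned len, uint64_t pat) {
    uint64_t r = 0;
    pat &= (1ull << len) - 1;                /* mask to len bits */
    for (unsigned pos = 0; pos < 64; pos += len)
        r |= pat << pos;                     /* top copy truncates if 64 % len != 0 */
    return r;
}
```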

Overall, it seems that ISA optimization could be automated given a set of target applications, design constraints (e.g. power, performance, performance/power, cost, etc.) and a compiler that, given an ISA, would be able to use it effectively and produce code which can be evaluated for the desired metrics and drive further optimization. This would help answer what number of registers, and what size of immediates, displacements, etc., is optimal. It also needs new tools - I found that extending LLVM with a new instruction in an existing ISA is already a challenge, and making it use the instruction effectively even more challenging, so custom instruction sets today shift cost to custom compiler maintenance. Having a compiler framework that, given an ISA description, can come up with good back-end optimization would greatly help RISC-V and all the alternative experiments :)

Andrew Waterman

Feb 14, 2025, 7:39:15 PM
to Vadím Sukhomlínov, BGB, isa...@groups.riscv.org
On Fri, Feb 14, 2025 at 11:57 AM 'Vadím Sukhomlínov' via RISC-V ISA Dev <isa...@groups.riscv.org> wrote:
> Floating point constants: I didn't look into this area and don't have
> data on what constants are commonly used in the computations. [...]

We performed this analysis for the FLI instructions in the Zfa extension.  The unsurprising TL;DR is that a very small number of constants dominate.


BGB

Feb 14, 2025, 9:49:15 PM
to Vadím Sukhomlínov, isa...@groups.riscv.org
On 2/14/2025 1:57 PM, Vadím Sukhomlínov wrote:
> [...] And thank you for referencing Mill's "Belt Architecture" - I never
> heard about it before [...] I'm not sure why this approach is not being
> explored further (or maybe I'm wrong and it is).

It has some possible merits, but I think the downside is mostly that, in
terms of bare performance, it has no obvious advantage over a
traditional register machine with a slightly larger number of
architectural registers.

Some of the problems it solves can be addressed:
By having a sufficient number of registers;
By compiler-side handling (allocating temporaries round-robin);
...

And, on the other side, it would add a fair bit of architectural
complexity and some issues of its own.
Belt doesn't fully eliminate the need for registers;
Local variables still need to go somewhere.
As do function arguments, ...
And, memory is still a worse option here.
Addressing the belt is more complex for hardware;
...

Such a machine could allow for smaller binaries, but will not
necessarily save anything in terms of instruction-count, which is a
higher priority concern for performance.


Well, and sometimes, actually, going for bigger instruction formats can
be beneficial as well:
If a single 64-bit instruction can do the work of 3 or 4 32-bit
instructions, one may be ahead.

Similar to how a machine with 32-bit instruction words can beat one with
16-bit instruction words.

Say, the code size for the fixed-length 16-bit machine is 25% smaller;
but one can also realize this means it is running 50% more instructions
to do a similar amount of work (thus, it is going to be a lot slower).


Can note that when comparing performance to instruction counts, it isn't
strictly linear:
The delta in terms of performance tends to be slightly larger than the
delta in terms of instruction counts (mostly due to things like register
RAW penalties and similar).


And, when the machine with 32/64 bit instructions is also generating
smaller code size than one with 32-bit instructions, it is likely
faster, because to get a smaller code size with larger instruction
formats, also means a comparably larger reduction in the total
instruction count. Even if it may seem counter-intuitive that bigger
instruction formats could lead to a smaller code size.


Not, say, that 16-bit instructions are "bad", but rather: if one simply
sees smaller ".text" as "better", this may be misguided (unless the
primary goal is, in fact, to make the smallest code size, say, to fit
into a smaller ROM chip). If the goal is performance, it is misleading.

Though, instruction count is arguably better here, but not without its
own issues: If the option is to throw high-complexity or high-latency
instructions at the problem, then performance may end up worse.

Say, for example, "Load with Post-Increment" may seemingly check off the
"reduces instruction count" box, but can actually end up being slower.


Say:
```
x = *cs++;
y = *cs++;
z = *cs++;
```
As:
```
MOV.L @R10+, R12    //2c
MOV.L @R10+, R13    //2c
MOV.L @R10+, R14    //1c
```
Vs:
```
MOV   R10, R6       //0c
ADD   12, R10       //1c
MOV.L (R6, 0), R12  //1c
MOV.L (R6, 4), R13  //1c
MOV.L (R6, 8), R14  //1c
```


If the auto-increment form has a 2 cycle latency on Rm, it may be slower
than the latter (5 cycles vs 4 cycles), despite using fewer instructions.


> Floating point constants: I didn't look into this area and don't have
> data on what constants are commonly used in the computations. [...]

IME, the majority of floating point constants are fairly simple values
(either integer values or small fractions).


The most common "headache case" being fractional powers of 10 or
multiples of fractional powers of 10. Or, multiples of 1/3.

4/5 would address multiples of 0.1, but not 0.001 or 0.0001, which are
still an issue. Downside though is that more complex bases will have
longer bit patterns (so, not likely worth trying to deal with).


Storing a full set of constants is undesirable though, as ideally one
wants values they can generate from a smallish lookup table of
bit patterns with no real computation.

But, even this is debatable:
```
case({valIn[9], valIn[3:0]})
    5'h00: tHiFill = 12'hXXX;
    ...
endcase
tExp    = valIn[8:4];
tLoFill = tHiFill[3:0];
if(valIn[14:10] != 5'h1F)
begin
    tExp    = valIn[14:10];
    tHiFill = { valIn[9:0], 2'b0 };
    tLoFill = 4'h0;
end
tValOut[63:32] = { valIn[15],
    tExp[4], tExp[4] ? 6'h00 : 6'h3F,
    tExp[3:0],
    tHiFill[11:0],
    tLoFill[3:0], tLoFill[3:0] };
tValOut[31:24] = tValOut[39:32];
tValOut[23:16] = tValOut[39:32];
tValOut[15: 0] = tValOut[31:16];
```


But, this is debatable, as it would add a lot of cost over a simpler
converter:
```
tValOut = {
    valIn[15],
    valIn[14], valIn[14] ? 6'h00 : 6'h3F, valIn[13:10],
    valIn[9:0], 42'h0 };
```

Where, for immediate values, we can simply ignore Inf/NaN/Zero (they are
disallowed).



But, the majority of cases map fine to Binary16.

Interestingly, it is possible to also get an OK hit rate with an S.E3.F2
microfloat.


> I'd explore the use of BF16 format, load of exponent immediate (covers
> 2^exp constants e.g. 2, 1/2, etc), load of small integer immediates and
> common constants like e, pi. [...]
>

Binary16 has a more limited dynamic range than BF16, but more mantissa
bits, so on average it will have a better hit rate than the larger
dynamic range would give.

As is, it can be noted that Binary16->Binary64 in this case is mostly a
bit-repacking scheme.

You don't really want to get all that fancy here...
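As a reference for the repacking claim, a C sketch matching the simpler Verilog converter above (normals only; zero/Inf/NaN are disallowed as immediates, as noted):

```
#include <stdint.h>

/* Binary16 -> Binary64 as pure bit repacking (normal values only).
 * The 5-bit exponent's top bit is kept; its complement fills the six
 * extra Binary64 exponent bits (re-biasing 15 -> 1023). */
static uint64_t f16_bits_to_f64_bits(uint16_t h) {
    uint64_t sign  = (uint64_t)(h >> 15) << 63;
    uint64_t e4    = (h >> 14) & 1;          /* top exponent bit */
    uint64_t efill = e4 ? 0x00 : 0x3F;       /* six copies of ~e4 */
    uint64_t exp   = (e4 << 10) | (efill << 4) | ((h >> 10) & 0xF);
    uint64_t mant  = (uint64_t)(h & 0x3FF) << 42;
    return sign | (exp << 52) | mant;
}
```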



The issue with constants like E or PI is that:
The constant itself is fairly large;
They are not usually expressed in a pure form;
They are not usually expressed EXACTLY.
One may have variants that are almost the magic PI.
But, rounded differently, or very slightly off...
Does the compiler then go "Well, it is within 8 ULP" ?...

Ironically, magic constants like PI, E, and variants thereof, are one of
the places where compiler generated constant pools can actually make sense.

Likely, a dedicated instruction for loading PI / 2*PI / PI/2 / E /
sqrt(2) / ... could be possible, but the relative cost of the mechanism
to load these constants would make them unattractive.



> Integer constants/immediates: I think for this case code/data density
> would be a higher priority. [...] I'd think that it can be extended with
> loading various patterns, e.g. 0x7f7f7f7f7f7f7f7f as an 8-bit 0x7f
> pattern. [...]

The issue quickly becomes that there are too many patterns, and the
logic cost to try to optimize for all of them quickly exceeds the savings.

You don't want to get too complicated here.


Usually, it also makes sense to reuse cases where the logic already
exists in the CPU (if there is already an instruction to do something,
it is possible to leverage that instruction and just feed it an
immediate as input).

Say, for example:
Load simple integer constant:
Basic case.
Load Binary16 to Binary64:
I already had a converter for this, could reuse it.
Nevermind if I ended up with a few faster-path options as well.
Ended up special-casing it into a 1-cycle operator;
The normal converter would be 2 cycle.
Loading an Imm32 into the high bits of the destination register.
Likewise (similar logic to RV's "PACK" instruction);
...


A case could almost be made for ARM32-style constants, except (unlike
ARM) there wasn't a good way to route the inputs into the shift unit,
and (also unlike ARM) there was no shift unit in the Register Fetch path.

Actually doing so seems like it would be expensive: a shifter is a
fairly expensive chunk of logic, so not really something one would want
to stick in the RF path...



( Decided to leave out going down a rabbit hole about SIMD ops with
integrated shuffle and similar... )



> Overall, it seems that ISA optimization can be automated given a set of
> target applications, design constraints [...] and a compiler that given
> an ISA would be able to effectively use it [...] It also needs new tools
> - I found extending LLVM for new instruction in existing ISA is already
> a challenge [...]

I am mostly using my own compiler, as trying to work on GCC or LLVM is a
decidedly unpleasant experience (much preferable to have a compiler one
can recompile in 20 or 30 seconds, vs one where a rebuild may take
several hours...).

Much like trying to hunt down bugs in Verilog simulations, long build
times kinda drain the motivation to do anything.


My architecture isn't really parametric though, as (much like RISC-V)
most major ISA features are already established.

Though, in some regards, both ended up in similar areas, despite me
taking a different path to get here.



At least on an FPGA, Vivado seems to estimate my CPU core at ~ 0.8W, but
most of this is estimated to go into the Block-RAMs and similar.

For my FPGA board (a Nexys A7), I did end up sticking a RasPi CPU heatsink
onto the FPGA, after noting that it did in fact start to get a little
warm when in use.

However, the heating was still significantly less than if trying to run
MicroBlaze, where the FPGA gets good and hot. But, I am also running at
a lower clock speed than MicroBlaze (50MHz vs 100MHz). Though,
MicroBlaze is also a 32-bit design (sorta comparable to RV32IM).



I also have an Arty S7-50 (XC7S50), but need to trim down the core to
make it fit. Design needs to be smaller, but ironically (if it fits) it
is easier to push the XC7S50 to a higher speed (*).


*: The other possible route to 75MHz being to drop to 2-wide and also
disable the FPU-SIMD unit. But, this is kinda lame.

Where, say, known paths to 75MHz:
A: Boosting latency on everything;
B: Tiny L1 caches;
C: 2-wide core with no FPU-SIMD.

But, generally, everything I had tried left resulting performance worse
than just leaving it at 50.

But, one also has to drop some stuff to make it fit onto the XC7S50,
which is ironically also seemingly enough to loosen up timing enough to
allow the higher clock-speeds (but, ... I also like having things like
SIMD and similar ...).

It is kind of a similar thing with the Binary64 FPU: It could free up a
lot of LUTs and improve timing, but a hard sell if no FPU.

Though, a pretty big chunk of LUTs also goes to the instruction decoders.


Going smaller (XC7A35T or XC7S25) had generally involved more severe
limitations.



Not sure what an ASIC version would look like, if it existed...

...

BGB

unread,
Feb 14, 2025, 11:35:23 PMFeb 14
to Andrew Waterman, Vadím Sukhomlínov, isa...@groups.riscv.org
On 2/14/2025 6:38 PM, Andrew Waterman wrote:
>
>
> On Fri, Feb 14, 2025 at 11:57 AM 'Vadím Sukhomlínov' via RISC-V ISA Dev
> <isa...@groups.riscv.org <mailto:isa...@groups.riscv.org>> wrote:
>
> BGB, thank you for the very detailed analysis of the pros and cons
> of various designs.
>  This is extremely insightful. And thank you for referencing Mill's
> "Belt Architecture" <https://millcomputing.com/docs/belt> -
>  I never heard about it before, but when I researched it - I found
> it very close to the idea
>  came up with. "Belt" representation (i referred to it as SSA
> static single assignment) and
>  direct addressing of instruction results seems like a very nice
> way to express parallelism
>  in the architecture, and then implementation can decide how many
> different ALU blocks
>  are needed for various purposes. Ivan seems to make some
> prototypes to demonstrate
>  advantages and I'm not sure why this approach is not being
> explored further
> (or maybe I'm wrong and it is).
>
> Floating point constants: I didn't look into this area and don't
> have data on what
> constants are commonly used in the computations.
>
>
> We performed this analysis for the FLI instructions in the Zfa
> extension.  The unsurprising TL;DR is that a very small number of
> constants dominate.
>

I am not so sure about a "small number of constants".

But, can note that I did get OK hit rates IIRC with an S.E3.M2 microfloat...

Say:
...
0x08: 0.500
0x09: 0.625
0x0A: 0.750
0x0B: 0.875
0x0C: 1.000
0x0D: 1.250
0x0E: 1.500
0x0F: 1.750
0x10: 2.000
0x11: 2.500
0x12: 3.000
0x13: 3.500
0x14: 4.000
...
Where:
0x20..0x3F: Same values, but negative.

Tried a few schemes, but IIRC, the simple format basically did best of
the options I tried (and, is cheaper to unpack, vs a table of ad-hoc
constants).
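
A guess at the decode in C, assuming the obvious S.E3.M2 layout (sign in
bit 5, 3-bit exponent with bias 3, 2-bit mantissa with implicit leading 1;
the encodings below 0x08 are not shown in the table, so their handling
here is a guess):

```
#include <math.h>

double fp6_decode(unsigned v)
{
    int sign = (v >> 5) & 1;
    int exp  = (v >> 2) & 7;
    int man  =  v       & 3;
    /* value = (1 + man/4) * 2^(exp-3) = (4+man) * 2^(exp-5) */
    double mag = ldexp((double)(4 + man), exp - 5);
    return sign ? -mag : mag;
}
```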


But, did get better hit rates with Binary16 (S.E5.M10).
Binary16 can also express, say, every integer value between 1 and 1023, ...

Where small integer values, and small integers divided by a power of 2,
seem to be fairly common.



So, can note (checking some stats):
Percentage of FP constants that hit with Binary16 : 72.7%
Percentage of FP constants that hit with a 6b microfloat: 51.9%

Something like S.5.4 could probably also be usable.



It is unclear what is the best option for things like PI / etc, but most
likely options being:
Load it inline as a 64-bit constant;
Put if off in ".rodata" or similar, and do a memory load.

Currently, BGBCC tends to use an inline constant.

Though, might make sense to count how many times each constant is used
in the program, and use a memory load if it exceeds some predefined
limit (such as 8 or something).


There is the factor, though, that an inline constant is faster in my
ISA than a constant load from memory (1 cycle vs 3 cycles).

But, for plain RISC-V, a stronger case could be made for using a memory
load here. Since, generally, single-cycle 64-bit constants are not a
thing in RV.


Luckily, these sorts of constants tend to be rare...


Semi common, but annoying, are things like:
y=x*0.1;
y=x*0.9;

Which people may casually type, but (as-is) can't be represented exactly
in a smaller format.

Well, or multiples of 0.01 or 0.001 ...

As noted, can't round in these cases; as the precision of a constant
can't be notably worse than the nominal target format of said constant,
and for Binary32 or Binary64, rounding 0.1 to 0.099976 or similar would
not be acceptable.
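
For illustration, the 0.099976 value mentioned above can be reproduced in a
few lines of C by rounding 0.1 to a Binary16-style significand:

```
#include <stdio.h>
#include <math.h>

int main(void)
{
    /* Round 0.1 to an implicit 1 plus 10 mantissa bits. frexp() gives
       f in [0.5,1.0), so scaling by 2^11 keeps 11 significant bits. */
    int e;
    double f = frexp(0.1, &e);
    double h = ldexp(round(ldexp(f, 11)), e - 11);
    printf("%.13f\n", h);   /* prints 0.0999755859375 */
    return 0;
}
```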


A case could almost be made for, say:
Imm12*0.001
Which can also represent 0.01 and 0.1 and multiples thereof...

But, there is no obvious "sufficiently cheap" way to decode it...

And, any option that would be unreasonably expensive or slower than a
memory load is basically no-go, ...


But, as I can note, a fair number of the constants that miss with
Binary16 seem to be things in this area (eg: integer multiples of a
power-of-10). Most likely because, well, humans tend to write
floating-point numbers in decimal.


But, as can be noted, this is within the roughly 28% that misses with
Binary16 (would need to write code for a more accurate breakdown, but as
a guesstimate): 16% = integer multiples of powers of 10 or similar, 9% =
irrationals or other complex fractions; 3% = exponent is too big/small.


...



> To me major advantage vs.
> loading from memory is the reduction of memory bandwidth (only 1
> instruction
> vs. instructions to form address and load + load itself).
> This would improve memory/fp ratio for performance critical loops.
>
> I'd explore the use of BF16 <https://en.wikipedia.org/wiki/Bfloat16_floating-point_format> format, load of exponent immediate

BGB

unread,
Feb 15, 2025, 2:39:33 AMFeb 15
to Andrew Waterman, Vadím Sukhomlínov, isa...@groups.riscv.org
...


>
> But, as I can note, a fair number of the constants that miss with
> Binary16 seem to be things in this area (eg: integer multiples of a
> power-of-10). Most likely because, well, humans tend to write floating-
> point numbers in decimal.
>
>
> But, as can be noted, this is within the around 28% that misses with
> Binary16 (would need to write code for a more accurate breakdown, but it
> looks like a guestimate): 16% = integer multiples of powers of 10 or
> similar, 9% = irrationals or other complex fractions; 3% exponent is too
> big/small.
>

Wrote some stats gathering code for these (after missing Binary16):
Exponent out of range : 0.00% (1)
Has 5 as a divisor : 1.62%
Has 3 as a divisor : 2.07%
Has 7 as a divisor : 1.08%
Large but terminating : 14.70% (2)
Other Or Irrational : 13.23%


1: None of the floating point constants sampled were outside of the
2^(-15) to 2^15 exponent range, so, BF16 in this case would gain,
effectively, nothing... Granted, I would have expected at least SOME...

I suspect they may be non-zero, just too rare to register as a
percentage (and/or my code is failing to recognize them for some reason).


Looks like BF16 would have a higher miss rate than Binary16 mostly due
to the shorter mantissa. But, exponent range would appear to not be much
of an issue.


2: Looks like (looking at stats for FPU constants):
72.6% Hit as Binary16 (can be represented exactly)
7.8% Hit as Binary32 (after missing Binary16)
6.9% Loaded into high bits (Truncated Binary64).
12.7% Fall back to Binary64

It appears the misses from Binary16 are not exponent related, but
mantissa related. Some amount of constants still hit with one of the
32-bit constant loads though (resulting in a 64-bit encoding), and the
rest fall back to a full 64-bit constant load.

So, rather than repeating patterns, apparently a lot of cases are simply
"mantissa is too wide" to fit into a narrower format (but does terminate
at some point before reaching the ULP).


Looks like:
E3.F2 (5b)
S.E3.F2 (6b)
S.E4.F3 (8b)
E4.F4 (8b)
S.E5.F4 (10b)
E5.F5 (10b)
...
Could probably also be usable, if a smaller immediate is needed.



Granted, my analysis may be broken.

Detecting 5 as a divisor was based on an assumption: if multiplying by
5, most of the central mantissa bits should drop to 0 (ignoring both the
high and low order bits), indicating that the value is now in alignment
with a power of 2.

However, given powers of 2 are implicit, detecting 5 is equivalent to
detecting a factor of 10.
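
A sketch of what such a check might look like in C; this is my
reconstruction of the described heuristic, not the actual analysis code,
and the exact masks are guesses:

```
#include <stdint.h>
#include <string.h>

/* Multiply the 53-bit significand by 5 and see whether the middle of the
   product collapses to (nearly) all zeros or all ones, i.e. the value
   re-aligns with a power of two. Low bits are ignored to absorb the
   original rounding. */
int had_divisor_5(double x)
{
    uint64_t bits;
    memcpy(&bits, &x, sizeof bits);
    uint64_t sig = (bits & 0x000FFFFFFFFFFFFFull) | (1ull << 52);
    uint64_t p   = sig * 5;                 /* 53 bits * 5 fits in 56 bits */
    uint64_t mid = (p >> 3) & 0x0000FFFFFFFFFFFFull;
    return mid == 0 || mid == 0x0000FFFFFFFFFFFFull;
}
```

E.g. the significand of 0.1 times 5 comes out as 2^55+2, so everything
between the rounding noise at the bottom and the top bit is zero.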


Factors of 3 and 7 were similar.

Similarly, there were no significant hits on higher powers either.


So, I had seemingly significantly underestimated the number of floating
point constants that appear to be "mostly random" or irrational numbers.

Granted, this final group would also include anything where the value is
a ratio involving multiple prime factors (such as both 3 and 5). But, I
am no closer to being able to represent these in a compact form than I
am irrational numbers.


Either way, it doesn't appear there is all that much that can be done
to significantly increase the hit rate here...


Earl Killian

unread,
Feb 15, 2025, 12:50:47 PMFeb 15
to BGB, Vadím Sukhomlínov, isa...@groups.riscv.org


On Thursday, February 13, 2025, BGB <cr8...@gmail.com> wrote:

Register Windows:
Unless the hardware has access to special facilities that the pipeline lacks, doing it in hardware tends to be worse for performance than having software do it. Load/Store Pair or dual-issue Load/Store seems to be better both for hardware cost and performance.

I haven’t seen data to back this up. My personal experience is that register windows if done better than SPARC are potentially quite effective. Tensilica’s Xtensa ISA has a much more effective window option allowing the compiler to rotate by 4, 8, or 12. In Tensilica’s case it was chosen for its code size savings, which was significant. Code size was the #1 priority for the ISA and also led to the unusual instruction width. 

Earl Killian

unread,
Feb 15, 2025, 1:01:20 PMFeb 15
to BGB, Vadím Sukhomlínov, isa...@groups.riscv.org

Kyle Hayes

unread,
Feb 15, 2025, 1:35:13 PMFeb 15
to Earl Killian, BGB, Vadím Sukhomlínov, isa...@groups.riscv.org
I thought that studies * had shown that having the hardware do renaming etc. as is done in "modern" OoO cores is a lot more efficient with resources as well as not limiting the number of "windows" to that in the ISA.   However, hardware for such renaming is not exactly simple and uses a lot of power.  AMD's version of the x86 ISA with 16 registers does fairly well.  24 or 32 would probably have been better, but IMHO AMD did pretty well at hitting all their goals with that extended ISA.  Too bad they just did not go for a RISC-ish fixed 32-bit instruction word...

Itanium was the culmination of software visible register windows as far as I can tell.   That was not particularly what led to its demise, but it probably did not help (especially the extremely weird spill/fill mechanism).

Best,
Kyle

PS * This was discussed on comp.arch a _long_ time ago.   20, 30 years?

BGB

unread,
Feb 15, 2025, 3:40:26 PMFeb 15
to Kyle Hayes, Earl Killian, Vadím Sukhomlínov, isa...@groups.riscv.org
On 2/15/2025 12:34 PM, Kyle Hayes wrote:
>
> On Sat, Feb 15, 2025 at 10:01 AM Earl Killian <earl.k...@arilinc.com
> <mailto:earl.k...@arilinc.com>> wrote:
>
>
>
> On Saturday, February 15, 2025, Earl Killian
> <earl.k...@arilinc.com <mailto:earl.k...@arilinc.com>> wrote:
>
>
>
> On Thursday, February 13, 2025, BGB <cr8...@gmail.com
FWIW, x86-64 was fairly conservative in its changes:
REX prefix;
A few Mod/RM tweaks;
Dropping some rarely used legacy instructions.

Even if, yes, a 32-bit instruction word with 3 register encodings might
have been better in the long term.


But, yeah, personally I don't really think the idea of register windows
seems particularly worthwhile. Load/Store sequences seem to be pretty
effective.


Code-space savings are possible by essentially turning save/restore into
a sort of runtime call.

In RISC-V terms, say, prolog:
JAL X5, __abi_enter
Epilog:
JAL X0, __abi_exit

Which could then consolidate the prolog/epilog register saving and
restoring without needing any additional hardware complexity.


> Itanium was the culmination of software visible register windows as far
> as I can tell.   That was not particularly what led to its demise, but
> it probably did not help (especially the extremely weird spill/fill
> mechanism).
>

On nearly every front, IA-64 was an exercise in absurdity...
Probably not quite as bad as iAPX432 though...

Registers:
Too many.
But, split up into 4 groups of 32 (odd).

Instruction format:
Bulky and weird;
Lacks ability to deal with non-parallel code efficiently;
Wastes lots of space holding NOPs;
...
Predication:
Via 128 1-bit registers.



I currently favor 64 registers, as it is enough to deal with high
register pressure scenarios without being so many as to be entirely wasted.

Whereas, 128 registers:
Higher hardware cost;
Can't really fit 7 bit register fields effectively into a 32-bit
instruction word;
Software will not make effective use of them, so they go to waste.



Granted, one can argue that, 64 is already in diminishing returns area
as average-case code will not utilize it as effectively.

IOW:
Do you choose 32:
50-70% of functions can fit all their variables in registers.
Or, 64:
Where 92% can fit all their variables in registers.



Though, the average case performance-difference isn't that large.
Except when running functions in the hot path which have 100+ local
variables (and uses pretty much all of them in a loop), which isn't so
good on 32 registers, but holds up a fair bit better with 64.

I have a few functions like this in my OpenGL implementation in the
geometry-handling path. These functions tend to take a fairly severe
performance hit on RV64G, partly due to register pressure and partly due
to being written with SIMD in mind (for RV64G, there is no SIMD).

Basically, it is for an algorithm like (a C sketch follows below, after the vertex layout):
While stack not empty:
Take a triangle/quad off a stack;
Project vertices into screen space (as needed);
Figure if on-screen areas is below limits;
If yes, draw it.
Calculate midpoints along each edge;
Break triangle/quad into pieces;
Push the pieces back onto the stack.

Effectively, the backend rasterizer does affine projection, so it is
necessary to decompose geometry into smaller pieces as it is being drawn
to limit the amount of obvious warping effects. In the current
implementation, this process is done in world space (though, there are
some tricks to limit how much re-projection is done).


Where each vertex has coordinates:
XYZ, world-space
XYZW, screen space
RGBA
ST0 (Texture 0)
ST1 (Texture 1)
Mostly Binary32 floating point.
Along with a bunch of working variables.
When decomposing a quad, there are effectively 9 vertices.
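
A rough C sketch of the loop described above; all types, limits, and helper
functions here are hypothetical stand-ins, not the actual implementation:

```
/* project_to_screen, screen_area, raster_affine, and subdivide are
   invented names for the steps described in the text. */
typedef struct {
    float xyz[3];     /* world space */
    float xyzw[4];    /* screen space */
    float rgba[4];
    float st0[2], st1[2];
} Vertex;
typedef struct { Vertex v[4]; int nverts; } Prim;   /* tri or quad */

void project_to_screen(Prim *p);            /* fills xyzw as needed */
float screen_area(const Prim *p);
void raster_affine(const Prim *p);          /* affine backend rasterizer */
int subdivide(const Prim *p, Prim out[4]);  /* split via edge midpoints */

void draw_subdivided(Prim *stack, int sp, float area_limit)
{
    while (sp > 0) {
        Prim p = stack[--sp];
        project_to_screen(&p);
        if (screen_area(&p) <= area_limit) {
            raster_affine(&p);          /* small enough: draw it */
            continue;
        }
        Prim sub[4];
        int n = subdivide(&p, sub);     /* quad -> 9 vertices, 4 pieces */
        for (int i = 0; i < n; i++)
            stack[sp++] = sub[i];       /* assumes the stack has room */
    }
}
```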


In my implementation, both triangles and quads exist as primitives. The
reason for quads is that it is faster to handle quads directly than to
decompose each quad into 2 triangles (any more complex polygons are
decomposed though).

Currently, it supports 2-level multi-texture (more multitexture will be
broken up). No support for shaders yet.

...

Earl Killian

unread,
Feb 15, 2025, 6:11:50 PMFeb 15
to Vadím Sukhomlínov, BGB, isa...@groups.riscv.org


On Feb 14, 2025, at 11:57, 'Vadím Sukhomlínov' via RISC-V ISA Dev <isa...@groups.riscv.org> wrote:

And thank you for referencing  Mill's "Belt Architecture”


Yes, thank you. It reminds me a lot of some old VLIWs that used a FIFO position at the output of functional units as source operands to other functional units.

I explored something less tight than that but along similar lines in my SecureRISC ISA, where within a basic block the instruction number producing a result is the source operand of other instructions. SecureRISC is block-structured (Basic Blocks are part of the ISA, and not inferred by parsing the instruction stream—the BTB becomes a cache of a compiler-generated structure). There was a variant with following description:
Imagine the basic block descriptor included, in addition to what SecureRISC includes, the set of source registers used by the basic block, and the set of output registers of the basic block. Renaming could be done for the basic block as a whole, rather than on each instruction in the block. Within the basic block, instruction sources would either reference the Nth source register to the block or the result of the Nth instruction local to the basic block. Instructions would not need explicit destination register fields as a result (this would be in the basic block descriptor).
In the above, registers are only used for inter-BB communication. Someone might want to play with the above thought. I don’t I will have time to persue it.

-Earl

Ved Shanbhogue

unread,
Feb 15, 2025, 6:19:23 PMFeb 15
to Earl Killian, Vadím Sukhomlínov, BGB, isa...@groups.riscv.org
> I explored something less tight than that but along similar lines in my SecureRISC ISA, where within a basic block the instruction number producing a result is the source operand of other instructions. SecureRISC is block-structured (Basic Blocks are part of the ISA, and not inferred by parsing the instruction stream—the BTB becomes a cache of a compiler-generated structure). There was a variant with following description:
>
> Imagine the basic block descriptor included, in addition to what SecureRISC includes, the set of source registers used by the basic block, and the set of output registers of the basic block. Renaming could be done for the basic block as a whole, rather than on each instruction in the block. Within the basic block, instruction sources would either reference the Nth source register to the block or the result of the Nth instruction local to the basic block. Instructions would not need explicit destination register fields as a result (this would be in the basic block descriptor).
>
> In the above, registers are only used for inter-BB communication. Someone might want to play with the above thought. I don’t I will have time to persue it.

Would an interrupt/exception restart the BB and all stores be buffered
till the BB ends? Would instructions like AMOs be a BB of their own?

regards
ved

Earl Killian

unread,
Feb 15, 2025, 6:38:02 PMFeb 15
to Ved Shanbhogue, Vadím Sukhomlínov, BGB, isa...@groups.riscv.org
I thought of that, and was tempted, but the buffering seemed a bit much. But one of the nice things about the BB descriptor is that you can have option bits, and what you describe could be one, and perhaps such an option bit would limit the number of loads and stores to reduce the buffering? Then one could compose powerful AMOs, as you suggest.

Right now the SecureRISC BB descriptor has a prev field (what set of things can target the BB for CFI checking), a next field (e.g. branch, call, indirect jump, loop, etc.), the relative target bb descriptor (for calls, branches), and an instruction set bitmask (so variable length instruction decoding can be done in parallel without parsing the instruction stream), and some other things. One nice thing is that the BB Descriptor Cache (what was formerly the BTB) gets filled 512 bits in a single cycle on a miss, so you get the benefit of locality on populating the BTB. When statistics for this approach were simulated, it was one of the bigger performance benefits. Another performance benefit was that the dense index bits for branch prediction improved prediction results.

The motivation for block-structured ISAs for me was pre-TAGE (mid 1990s), where ILP was limited by instruction fetch more than anything else. That may no longer be true with TAGE. Here's what I wrote long ago: "The issue with instruction fetch is that it resembles linked list processing. Even worse than simple list processing, there is parsing and arithmetic at each node to find the next link. Linked list processing is particularly latency-sensitive and is often replaced with array processing in high-performance computing, where possible, because of its performance advantages. I explored ways to make instruction fetch behave more like array processing but didn’t succeed in fully realizing that goal. Instead, I focused on reducing the parsing required at each list node. I refer to this general approach as a block-structured instruction set.”

-Earl

Robert Finch

unread,
Feb 15, 2025, 10:13:09 PMFeb 15
to Earl Killian, Vadím Sukhomlínov, BGB, isa...@groups.riscv.org

SecureRISC’s basic block processing reminded me of this paper:

https://repositories.lib.utexas.edu/server/api/core/bitstreams/b4cb1521-894f-45ca-b102-a78f01c6da6c/content

Where code is broken up into “braids” which are executed in parallel.



Earl Killian

unread,
Feb 16, 2025, 5:15:59 PMFeb 16
to Robert Finch, Vadím Sukhomlínov, BGB, isa...@groups.riscv.org
Thank you for sharing this. It is thought-provoking. However, it would seem to have a problem efficiently exploiting pipelined functional units, such as is typical for floating-point. Tseng included FP benchmarks in his simulations, but didn’t mention the issue that I saw in my skim other than to say he simulated pipelined 4-cycle FP latencies. Do you remember anything from a more detailed reading?

Imagine a compiler that unrolls saxpy four times, so there are four parallel braids, each with load/load/multiply/add/store.  (Ignore the possibility of fused multiply/add to keep the argument simple.) These four braids would be sent to four BEUs, which might execute the two loads on successive cycles, then the multiply, then wait four cycles, then the add, then wait four cycles, then the store. The other four BEUs would be doing the same. So the pipelined multiply and pipelined add would perform no work for 3 of their 4 cycles. This seems inefficient.

One could do braid merging in this case, but Tseng did not go into that, as far as I noticed.

Thoughts?

-Earl

Allen Baum

unread,
Feb 21, 2025, 1:46:59 AMFeb 21
to BGB, Andrew Waterman, Vadím Sukhomlínov, isa...@groups.riscv.org
Note that the Mill approach only needs 2 belt source specifiers rather than 2 sources + 1 destination, so the encodings are compact (and they "borrow" sources from adjacent instructions, so 4-source-operand instructions are possible; a weird form of instruction fusion, in a way)


BGB

unread,
Feb 21, 2025, 4:06:12 AMFeb 21
to Allen Baum, Andrew Waterman, Vadím Sukhomlínov, isa...@groups.riscv.org
On 2/21/2025 12:46 AM, Allen Baum wrote:
> Note that the Mill approach only needs 2 belt source specifiers rather
> than 2 sources + 1 destination, so the encodings are compact (and they
> "borrow" sources from adjacent instructions, so 4-source-operand
> instructions are possible; a weird form of instruction fusion, in a way)
>

Yes, that is true.

It can help with code density, but in itself would not matter much for
performance; and it does not help that a Belt machine would likely end up
needing to be more complicated than an in-order register machine.

If the goal is the smallest possible binary, a Belt machine could have
some merit.




On the other topic (FPU stuff), can note from further testing:
There were in fact around 2.6% of FPU constants that fall outside the
Binary16 exponent range; there was just a bug in my stats gathering code
that caused them to fail to be counted.


It looks like, in terms of a statistical aggregation, in theory S.E6.M9
would have a slightly better hit rate than S.E5.M10, and both better
than S.E7.M8. Though, the difference would be small (around 1.2% or so).

Though, still, the majority of constants that miss are still due to the
limited number of mantissa bits.


This also seemingly has a "smell" that something "not bell curve shaped"
is probably going on.

Or:
Cohort 1: Constants that fit within 5 or 6 exponent bits and a small
mantissa ( ~ 72-74% );
Cohort 2: Constants that need ~ 16-23 mantissa bits ( ~ 16% );
Cohort 3: Constants that need full Binary64 (~ 12% ).
Of which, ~ 4% could be subject to a 2/3 || 4/5 scaling trick.
Leaving ~ 8% as other or irrational.

Cohort 2 would likely be values with around 4 to 6 significant digits
(which manage to somehow avoid resulting in a repeating bit pattern).

Though, whenever writing a free hand constant that does not follow
(Int/(2^n)) form, it it almost invariably needs too many bits, implying
possibly that (Int/(2^n)) constants are actually rather common (and
fully free-hand constants are less common). Decimal is not this well
behaved. May still require more investigation.


Note that under C rules, any FP constant without a suffix is assumed to
be Double, and any calculation that has a Double operand implicitly
promotes to Double (before getting narrowed back to Single).

My FPU strategy avoids excessive type conversions mostly by leaving
scalar values in Binary64 format in registers most of the time (in which
case, no conversion steps are necessary).


Though, IIRC, GCC had dealt with this scenario on some targets instead
by not promoting FPU values in cases where the result would need to be
immediately narrowed again, even if widening then immediately narrowing
would seem to be the correct behavior according to C language rules.

Say:
float x, y;
...
y=x*3.14;
Behaving as-if the user had written:
y=x*3.14f;
...


I guess, one other thing possibly worth looking at here is whether FMUL
by a power-of-2 constant is common enough to justify a special
instruction (could be done in a 2 cycle latency vs the usual 6 cycle
latency in my case).


Otherwise, recently I am mostly caught up in debugging.
Mostly, some stuff had worked in the emulator but not yet in my Verilog
implementation (and I had recently widened superscalar operation from
2-wide to 3-wide; the same as the explicitly parallel operation). But,
the 3-wide superscalar logic does noticeably increase LUT cost, ...

Also, can note that, for instruction scheduling of ALU ops, the minimum
distance before the result of an ALU op can be used without a stall:
2W1c: OP, No, Yes
3W1c: OP, No, No, Yes
2W2c: OP, No, No, No, Yes
3W2c: OP, No, No, No, No, No, Yes
For higher latency operations, it gets longer.

For my ISA, there was special compiler logic to try to shuffle
instructions around to try to make the most efficient use of the
pipeline. Though, this hasn't yet been adapted to the RV-based modes.

Wasn't super clever, just would check whether pairs of instructions
could be swapped (no register dependencies, neither has a reloc, ...),
and if doing so would improve latency. It would generally stop when no
more swaps can happen. Note that it would not swap instructions across
labels, and was also effected by things like register allocation
(generally works better when temporaries and similar are spread across a
range of registers). It could also check non-adjacent instructions as
well, for occasional non-local swaps (say, to allow swapping around
non-movable instructions so that they don't function as barriers).
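
A sketch of that swap pass in C; the Insn fields and pipeline_cost() here
are invented for illustration, not the actual BGBCC code:

```
typedef struct {
    unsigned reads, writes;      /* register bitmasks */
    int has_reloc, is_label;
} Insn;

extern int pipeline_cost(const Insn *w, int n);  /* hypothetical stall estimate */

static int can_swap(const Insn *a, const Insn *b)
{
    if (a->has_reloc || b->has_reloc || a->is_label || b->is_label)
        return 0;
    /* no write-after-write, read-after-write, or write-after-read overlap */
    return !(a->writes & (b->reads | b->writes)) &&
           !(b->writes &  a->reads);
}

void schedule_window(Insn *w, int n)
{
    int changed = 1;
    while (changed) {                 /* stop when no more swaps help */
        changed = 0;
        for (int i = 0; i + 1 < n; i++) {
            if (!can_swap(&w[i], &w[i + 1]))
                continue;
            int before = pipeline_cost(w, n);
            Insn t = w[i]; w[i] = w[i + 1]; w[i + 1] = t;
            if (pipeline_cost(w, n) < before)
                changed = 1;
            else {                    /* swap back if no improvement */
                t = w[i]; w[i] = w[i + 1]; w[i + 1] = t;
            }
        }
    }
}
```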

Was also limited in that it would generally operate over a sliding
window of up to 12 instruction words or so. Can't do as much with
shorter instruction sequences, and could miss cases where improving
things within the window could make them worse outside the window (the
alternative would be to always estimate cost based on the entire
basic-block rather than a sliding window).

Though, this is one place having a lot of registers can help. Whenever a
register gets reused, it means these instructions can't be swapped around.

...



MitchAlsup

unread,
Apr 10, 2025, 4:09:45 PMApr 10
to RISC-V ISA Dev, Vadím Sukhomlínov
While I am late to this party, there is a "thing" I would like you all to see::

On Wednesday, February 12, 2025 at 4:58:21 PM UTC-6 Vadím Sukhomlínov wrote:
Hi All,

When implementing big-number operations on RISC-V (e.g. for cryptography, scientific applications) lack of carry flag results in relative inefficiency of code compared to x86 where `adc` instruction exists. On RISC-V you need to explicitly get carry and add carry resulting in the code like for common primitive add_with_carry(a,b,carry)->(sum, carry):

```
// In: a0 - a, a1 -b a2 -carry
// Out: a0 = (a+b+carry), a1(carry) = (a+b+carry)>>XLEN

 add    a3, a2, a0
 sltu   a2, a3, a2
 add    a0, a3, a1
 sltu   a1, a0, a3
 add    a1, a1, a2
```

On x86 that would be a single `adc` instruction taking both operands and carry flag in CF.

My proposal is to introduce `addlo` and `addhi` instructions which would replace sequence above into:

In My 66000 ISA I chose not to provide a "bunch of semi-unique" instructions
in order to deal with multi-precision calculations. After much thought, I came
up with what is essentially a multi-instruction prefix. I call this instruction-modifier
CARRY. 

CARRY specifies a register that will be used to carry the extra precision from
instruction to instruction; CARRY also provides a 2-bit indicator as to whether:
{-} carry does not participate
{I} carry is consumed as another operand
{O} carry is generated as a result
{IO} carry is both consumed and generated.

256-bit addition is simply::

       CARRY   R16,{{O}{IO}{IO}{I}}
       ADD     R12,R4,R8              // carry Out only
       ADD     R13,R5,R9              // Carry In and Out
       ADD     R14,R6,R10             // Carry In and Out
       ADD     R15,R7,R11             // Carry In only
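
For reference, the same 256-bit addition in portable C (a sketch; the CARRY
form above expresses the whole carry chain with one modifier):

```
#include <stdint.h>

/* 256-bit add r = a + b, least-significant word first, returning the
   final carry out (which the {I}-only last ADD above discards). */
unsigned add256(uint64_t r[4], const uint64_t a[4], const uint64_t b[4])
{
    unsigned carry = 0;
    for (int i = 0; i < 4; i++) {
        uint64_t s  = a[i] + b[i];
        unsigned c1 = s < a[i];          /* carry from a+b */
        r[i] = s + carry;
        carry = c1 | (r[i] < s);         /* at most one of these is set */
    }
    return carry;
}
```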

CARRY can be applied to Shifts, Extracts, Inserts, ADD, SUB, MUL, DIV; over
on the FP side of the ISA it provides access to exact floating point calculations
and Kahan-Babuška Summation.

So, the addition of a single instruction modifier provides access to 30-odd
multi-precision calculations.

The important thing is that the carry result contains more than 1-bit !!

Mitch 

Allen Baum

unread,
Apr 11, 2025, 2:24:16 AMApr 11
to MitchAlsup, RISC-V ISA Dev, Vadím Sukhomlínov
Carry needs to be a CSR so it can be saved and restored during traps and context switches.
If used with LI or LUI, it could be used to generate long immediates too, transputer style.


Bruce Hoult

unread,
Apr 11, 2025, 7:20:35 AMApr 11
to Allen Baum, MitchAlsup, RISC-V ISA Dev, Vadím Sukhomlínov
Recently I made some tests on different SBCs I own and found that at
least the U74 and P550 perform better in GNU bignum library's own
benchmark than the A53 and A72, respectively.

Despite the widely-cited criticism of RISC-V by a GMP author in
September 2001, and needing five instructions to fully replicate ADC,
lack of a carry flag does not seem to be a significant issue for
overall performance.

https://old.reddit.com/r/RISCV/comments/1jsnbdr/gnu_mp_bignum_library_test_riscv_vs_arm/


MitchAlsup

unread,
Apr 11, 2025, 2:53:16 PMApr 11
to RISC-V ISA Dev, Allen Baum, RISC-V ISA Dev, Vadím Sukhomlínov, MitchAlsup
On Friday, April 11, 2025 at 1:24:16 AM UTC-5 Allen Baum wrote:
Carry needs to be a CSR so it can be saved and restored during traps, context switches.I

The CARRY instruction donates a GPR as the carry-register (R10 below)
a) that gets saved and restored with context switches.
b) used any way the application chooses.

       CARRY   R10,{{O}{IO}}
       MUL     R8,R16,R20 //  64 × 64 ≡ 128        {R10, R8}
       DIV     R9,R8,R21  // 128 / 64 ≡ 64 + 64    {R10, R8}

If used with LI or LUI, it could be used to generate long immediates too, transputer style.

My ISA already has long constants (immediates and displacements) without
wasting instructions pasting constants together::

         DIV      R7,#1234567890,R6
or
          STD    #0x123456789ABCDEF,[IP,R7<<3,#.LBB12345-.]

Where both constants are 64-bits. 

Allen Baum

unread,
Apr 12, 2025, 5:41:51 PMApr 12
to MitchAlsup, RISC-V ISA Dev, Vadím Sukhomlínov
Ooh, interesting way to do this, but there still has to be a CSR that indicates which GPR has been designated as the carry-in/out, else that gets lost on a context switch.
Don’t recall if that was already part of the proposal.

MitchAlsup

unread,
Apr 12, 2025, 9:45:38 PMApr 12
to RISC-V ISA Dev, Allen Baum, RISC-V ISA Dev, Vadím Sukhomlínov, MitchAlsup
Yes, indeed,  one needs both the CARRY shadow (16-bits) and the CARRY register (5-bits) to
retain semantics across interruption. Shadow is a shift register that strips off 2-bits per
instruction in DECODE.

Allen Baum

unread,
Apr 15, 2025, 8:08:25 PMApr 15
to MitchAlsup, RISC-V ISA Dev, Vadím Sukhomlínov
If you need the register anyway, there’s no point to using a shadow GPR along with a tag. Just make it a CSR, which will simplify both hardware and save/restore software.

MitchAlsup

unread,
Apr 15, 2025, 9:28:38 PMApr 15
to RISC-V ISA Dev, Allen Baum, RISC-V ISA Dev, Vadím Sukhomlínov, MitchAlsup
On Tuesday, April 15, 2025 at 7:08:25 PM UTC-5 Allen Baum wrote:
If you need the register anyway, there’s no point to using a shadow GPR along with a tag. Just make it a CSR, which will simplify both hardware and save/restore software.
 
Yes, that would be the standard implementation using a "sea of control registers" architectural
model. The rest of my architecture does not use such a model, so heading in that direction would 
be more than a "bit painful". However, since RISC-V does have that kind of CR model, feel free to
borrow or steal the notion; with my blessing.

------------just skip if you don't want anything but RISC-V in your head-------------------------------------------------
But on to some assumptions you seem to have made that are not well founded in my architecture.

------------instruction modification-----------
Instruction-modification (basically a prefix) concatenates bits from one donating instruction to
a handful of consuming instructions to modify their behavior. CARRY combines a sequence of
regular instructions to perform multi-precision integer arithmetic, double-wide bit-fields, or
exact FP arithmetic, with a single instruction added to ISA.

Predication (PRED) allows the then-clause and the else-clause to be issued into the execution
window. This is a good strategy any time FETCH will get to the join point before the condition
has evaluated in EXECUTE.

A third form of shadowing provides access to both SIMD and Vectorization with the addition of
exactly 2 instructions into ISA--I was quoted ARM as having ~1300 SIMD instructions as a reference.

All told, with CARRY, PRED, VEC and LOOP, the ISA only contains 63 instructions (a notoriously 
hard thing to count with any degree of accuracy.)

-----------thread state---------------------------------
HW considers "thread-state" to be a write-back cache in the core, this includes 8 doublewords of 
header {Root pointer, ASID, modes, enablements, modes, ...} and 32 doublewords of registers.

Thus, layer-switch is HW only--that is; when control arrives at *.handler the core is already
reentrant and various stuff can be assumed by SW to be saved in various places--in HW it is
more likely that save and restore are still in progress when control arrives. SW does not have
to save or restore state, thinning the SW path-length on all excursions between layers of the
SW-stack. The reverse path has a similar reduction in SW state management.

Being done in HW means the entirety of thread-state is coherent immediately--there is not
a tortuous path in changing ROOT, adding ASID, setting privilege, setting priority, or for that
matter disabling interrupts. Bang, a cache line of data arrives and is shoved into the core
CRs en masse. One cycle it was all the previous state, the next cycle it is all the new state.

So, I have a fundamental disagreement that doing it the CSR way does anything but
add to the complexity SW sees. Complexity that may already be solved--but complexity
nonetheless.

-----------designer freedom-----------------
On the other hand, the length of the shadow is small enough (8 or fewer instructions) that HW 
designers have the option of simply backing up to CARRY, and ignore any pending state. Low
end machines will likely take this route. Higher end machines will likely have several thread-
states loaded so that transitions between layers of the SW-stack are seamless. In all cases, when
control arrives, thread-state is present to SW in that layer. SW performs no saving nor restoring
of the register file across privilege switches.

Lobbing "all these things into a sea of CRs" prevents this kind of designer freedom as all the
instructions have to be present, and the SW already has to execute all those instructions
in each privilege switch.
-----------end of just skip-----------------------------


Allen Baum

unread,
Apr 15, 2025, 11:26:36 PMApr 15
to MitchAlsup, RISC-V ISA Dev, Vadím Sukhomlínov
If you're doing something from scratch, sure, that might make sense, but this did start on a RISC-V thread, so I was thinking from that perspective. The Mill architecture has explicit register/belt port borrowing behavior for things like this, which you probably know. 

There are other problems with the CSR approach, of course (having to rename it in a wide OOO implementation), 
and effectively switching to another bank of registers is a mass renaming on top of reserving extra rename register file that can't(?) be used during non-trapping operations (I haven't thought about this as much as you, so that could be all wrong.)


Robert Finch

unread,
Apr 15, 2025, 11:49:39 PMApr 15
to RISC-V ISA Dev, Allen Baum, RISC-V ISA Dev, Vadím Sukhomlínov, MitchAlsup
I may be wrong, but it is not necessary to rename the CSRs if one is willing to live with some performance loss. For my core I do not have the CSRs renamed; way I see it is they are rarely used registers. They are updated only at the end of the pipeline. In software the CSR access is surrounded with fence (sync) instructions. With potentially thousands of CSRs it is not practical to rename them all. 

I borrowed the carry idea from the My66000. But have only three dedicated carry registers. The core saves the instruction in a CSR on exceptions, to allow the state to be restored.

What happens if there is a branch out of the carry shadow? Do instructions in a called routine get affected by the carry?

MitchAlsup

unread,
Apr 16, 2025, 12:46:42 PMApr 16
to RISC-V ISA Dev, Allen Baum, RISC-V ISA Dev, Vadím Sukhomlínov, MitchAlsup
On Tuesday, April 15, 2025 at 10:26:36 PM UTC-5 Allen Baum wrote:
If you're doing something from scratch, sure, that might make sense, but this did start on a RISC-V thread, so I was thinking from that perspective. The Mill architecture has explicit register/belt port borrowing behavior for things like this, which you probably know. 

Consider multi-precision integer multiply--the inner loop is going to need a CARRY register for the high order bits of
the multiply and a second CARRY register for propagating 1-bit carries while adding up all the multiply terms.  So, 
you need more than one CARRY register to "get anywhere". And doing it by having CARRY donate the register puts
selection control in the compiler--{rather than some brain dead washed up architect}
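
For concreteness, a C sketch of such an inner loop, showing the two
independent carry chains (the unsigned __int128 product stands in for what
the {O}/{IO} carry register provides):

```
#include <stdint.h>

/* One row of schoolbook multi-precision multiply: acc += a * b. */
void mul_addrow(uint64_t *acc, const uint64_t *a, int n, uint64_t b)
{
    uint64_t mul_carry = 0;   /* high half of the previous product */
    uint64_t add_carry = 0;   /* 1-bit carry from the accumulate */
    for (int i = 0; i < n; i++) {
        unsigned __int128 p = (unsigned __int128)a[i] * b + mul_carry;
        uint64_t lo = (uint64_t)p;
        mul_carry = (uint64_t)(p >> 64);
        uint64_t s = acc[i] + lo;
        uint64_t c = s < lo;
        acc[i] = s + add_carry;
        add_carry = c | (acc[i] < s);
    }
    acc[n] += mul_carry + add_carry;   /* assumes acc[] has a top word free */
}
```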

There are other problems with the CSR approach, of course (having to rename it in a wide OOO implementation), 
and effectively switching to another bank of registers is a mass renaming on top of reserving extra rename register file that can't(?) be used during non-trapping operations (I haven't thought about this as much as you, so that could be all wrong.)

Basically, you add 1-bit to the rename stuff and get on with life. 


MitchAlsup

unread,
Apr 16, 2025, 12:48:47 PMApr 16
to RISC-V ISA Dev, Robert Finch, Allen Baum, RISC-V ISA Dev, Vadím Sukhomlínov, MitchAlsup
On Tuesday, April 15, 2025 at 10:49:39 PM UTC-5 Robert Finch wrote:
I may be wrong, but it is not necessary to rename the CSRs if one is willing to live with some performance loss. For my core I do not have the CSRs renamed; way I see it is they are rarely used registers. They are updated only at the end of the pipeline. In software the CSR access is surrounded with fence (sync) instructions. With potentially thousands of CSRs it is not practical to rename them all. 

I borrowed the carry idea from the My66000. But have only three dedicated carry registers. The core saves the instruction in a CSR on exceptions, to allow the state to be restored.

What happens if there is a branch out of the carry shadow? Do instructions in a called routine get affected by the carry?
 
Any taken branch clears out the shadow. {It seemed the only sane thing to do.}


Allen Baum

unread,
Apr 17, 2025, 12:29:35 PMApr 17
to MitchAlsup, RISC-V ISA Dev, Vadím Sukhomlínov
Ah, I see your point; for that particular approach, you want multiple simultaneous carries.
I'd argue that RISC-V already solves that problem using the MULH instruction.
That's a separate op, but can be fused to have the same effect, and can use existing rename logic.
So, I guess I'm still not convinced that a CSR isn't the right approach (for non-multiprecision-multiply cases) in general.


Vadím Sukhomlínov

unread,
Apr 17, 2025, 12:48:14 PMApr 17
to Allen Baum, MitchAlsup, RISC-V ISA Dev
My original approach with "addhi" and "addlo" allows you to have multiple carries in different registers, doesn't require new CSRs and is transparent for the architecture state. If you have fusion, then addlo and addhi with the same inputs can be fused and avoid duplicated computations, same as rem/div with same sources.
However, inconvenience comes from the 3-operand nature - I'm using the destination as a source too, so might need to preserve the original value - ideally such operations need 3 inputs and 1 or 2 outputs, leading to a new instruction format. The overhead of extra operations (configuring CSRs, copying registers) may only be worthwhile for many-word operations. 
One can just use hard-coded CSRs for carry (-ies), similar to x86's flags register and specify which flag to use in instruction encoding. 

MitchAlsup

unread,
Apr 17, 2025, 2:44:28 PMApr 17
to RISC-V ISA Dev, Allen Baum, RISC-V ISA Dev, Vadím Sukhomlínov, MitchAlsup
On Thursday, April 17, 2025 at 11:29:35 AM UTC-5 Allen Baum wrote:
Ah, I see your point; for that particular approach, you want multiple simultaneous carries.
I'd argue that RISC-V already solves that problem using the MULH instruction.

On Thursday, April 17, 2025 at 11:48:14 AM UTC-5 Vadím Sukhomlínov wrote:
My original approach with "addhi" and "addlo" allows you to have multiple carries in different registers,

Adding instructions to ISA IS the difference. I added 1 instruction which effectively concatenates
2 bits per instruction at DECODE instead of inventing various instructions which squirrel those
2-bits elsewhere. So, my whole multi-precision extension adds 1 instruction to ISA and can be
used for {signed, unsigned, shifts, bit fields, float, double} data types.

In addition, since CARRY supplies a temporary register, one changes IMUL into IMAC without
needing that instruction in ISA. And the way the temp register is used allows for the data-
path to be built 3-operand 2-result while the rest of ISA is 2-operand 1-result. This, for example,
allows for insertion of a bit-field spanning a word boundary into a 128-bit container as a single calculation
instruction and an instruction-modifier.  {containerHI, containerLO} = field:9 << 59
 

Allen Baum

unread,
Apr 18, 2025, 11:14:33 AMApr 18
to MitchAlsup, RISC-V ISA Dev, Vadím Sukhomlínov
Both approaches enable a 3 source, 2 destination datapath, yet only require the actual registerfile to be 2 source, 1 dest.
That isn't the issue. I haven't had time to think about some of those details, but your example of spanning could work with either method. If you have multiple live carries, then you effectively need to store the tagged temporary and load the new one, which sounds a lot like a CSRRW op to me - that doesn't need a new operation. You do need to identify the CSR as being "live" for input and output - which is what your new ops do also.


Vadím Sukhomlínov

unread,
Apr 18, 2025, 12:39:45 PMApr 18
to Allen Baum, MitchAlsup, RISC-V ISA Dev
Yes, there are many design choices. My primary concern with CSRs is potential impact on code size and a need to store content on context switching. I'm less familiar with implementation costs in HW. 
To compare approaches I'd try to compare what 2 XLEN, 4 XLEN, 8 XLEN, 16 XLEN, etc. operations will look like with different approaches in terms of number of instructions, performance and probably energy consumption (actual ALU operations, etc).
I'd think that dedicated CSRs serving as both carry input and carry destination can indeed be efficient. You'd need 1 CSRW to set the incoming carry to zero, each `addc` computes a 3-way addition, and the final high bit/carry would be read by CSRR. And you can add several CSRs of this kind as well as `addcX` to select which one to use. This would also open a path to an integer FMA.
Practically, if you have a way to define different 3-source 2-destination operations, you can construct useful things, including an equivalent to x86's DIV using a 2*XLEN dividend and XLEN divisor to produce remainder and quotient - a nice improvement for big-number division.

MitchAlsup

unread,
Apr 18, 2025, 1:35:21 PMApr 18
to RISC-V ISA Dev, Allen Baum, RISC-V ISA Dev, Vadím Sukhomlínov, MitchAlsup
On Friday, April 18, 2025 at 10:14:33 AM UTC-5 Allen Baum wrote:
Both approaches enable a 3 source, 2 destination datapath, yet only require the actual registerfile to be 2 source, 1 dest.
That isn't the issue. I haven't had time to think about some of those details, but your example of spanning could work with either method. If you have multiple live carries, then you effectively need to store the tagged temporary and load the new one, which sounds a lot like a CSRRW op to me - that doesn't need a new operation. You do need to identify the CSR as being "live" for input and output - which is what your new ops do also.

The other way to do this is to have a "feedback/forwarding loop" for each function unit.
It is very rare to need a single FU having 2 carry-feedbacks at a time.



MitchAlsup

unread,
Apr 18, 2025, 1:53:08 PMApr 18
to RISC-V ISA Dev, Vadím Sukhomlínov, MitchAlsup, RISC-V ISA Dev, Allen Baum
On Friday, April 18, 2025 at 11:39:45 AM UTC-5 Vadím Sukhomlínov wrote:
Yes, there are many design choices. My primary concern with CSRs is potential impact on code size and a need to store content on context switching. I'm less familiar with implementation costs in HW. 
To compare approaches I'd try to compare what 2 XLEN, 4 XLEN, 8 XLEN, 16 XLEN, etc. operations will look like with different approaches in terms of number of instructions, performance and probably energy consumption (actual ALU operations, etc).
I'd think that dedicated CSRs serving as both carry input and carry destination can indeed be efficient. You'd need 1 CSRW to set the incoming carry to zero, each `addc` computes a 3-way addition, and the final high bit/carry would be read by CSRR. And you can add several CSRs of this kind as well as `addcX` to select which one to use. This would also open a path to an integer FMA.
Practically, if you have a way to define different 3-source 2-destination operations, you can construct useful things, including an equivalent to x86's DIV using a 2*XLEN dividend and XLEN divisor to produce remainder and quotient - a nice improvement for big-number division.

256-bit by 64-bit IDIV

       CARRY   R8,{{O}{IO}{IO}{I}}
       DIV     R12,R8,R21 //    R8  / R21
       DIV     R13,R9,R21 // {R8,R9} / R21
       DIV     R14,R10,R21 // {R8,R10}/ R21
       DIV     R15,R11,R21 // {R8,R11}/ R21 

Quotient {R12, R13, R14, R15}, final remainder {R8}
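
A hedged C model of what that DIV chain computes, one 128/64 step per limb with the most-significant limb first (assumes GCC/Clang's unsigned __int128; this sketches the semantics, not My 66000 itself):

```
#include <stdint.h>

// 256/64 schoolbook division; q[] receives the quotient limbs, and the
// final remainder is returned. Requires d != 0. Each step's quotient
// fits in 64 bits because the running remainder is always < d.
uint64_t div256by64(uint64_t q[4], const uint64_t n[4], uint64_t d)
{
    uint64_t r = 0;                          // remainder threads the chain
    for (int i = 0; i < 4; i++) {
        unsigned __int128 t = ((unsigned __int128)r << 64) | n[i];
        q[i] = (uint64_t)(t / d);            // DIV consumes {r, n[i]}
        r    = (uint64_t)(t % d);            // and carries r into the next DIV
    }
    return r;
}
```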

And we also have the Overflow preventing MUL/DIV sequence::

       CARRY   R10,{{O}{IO}}
       MUL       R8,R16,R20 //  64 × 64 ≡ 128 -- can't overflow

       DIV         R9,R8,R21 // 128 / 64 ≡ 64 + 64

Or the (thought to be) somewhat difficult bit-field insertion into a 128-bit container::

       CARRY   R10,{IO}
       INS         R9,R9,R20,#17,#56 //  {R10,R9}[72..56] = R20[16..0]


On Fri, Apr 18, 2025 at 8:14 AM Allen Baum <allen...@esperantotech.com> wrote:
snip

Allen Baum

unread,
Apr 18, 2025, 4:18:52 PMApr 18
to MitchAlsup, RISC-V ISA Dev, Vadím Sukhomlínov
Nice examples; the only issue is really whether the carry state is mapped to a GPR or a CSR.

The carry control (the bit-pairs that specify o,i,io, ~) has to be mapped to a CSR; 
I'd expect the tag (which register is used as carryin/out) to be in that CSR.
That does reduce the number of ops that can fit into that CSR, but that's pretty irrelevant.
You could cap the number of ops that are dependent at (say) 3 or 4 and have multiple tags. The implementation would be a bit tricky, as a tag change requires saving the previous carry value and loading another - effectively inserting an op into the istream vs. making it explicit in the istream.

Conversely, if you don't need a tag, then you can use the immediate form of the CSR instructions to load the common cases with a single CSR op instead of a LI, CARRY sequence.

In any case, save/restore should be easier without a tag, because you know which register needs to be saved and restored (the CSR), vs. having to read the tag from the CARRY CSR and then having to case_branch to a routine to save that specific general register (assuming, of course, that only SREGs can use this facility) and do the same for restore.


MitchAlsup

unread,
Apr 18, 2025, 4:43:05 PMApr 18
to RISC-V ISA Dev, Allen Baum, RISC-V ISA Dev, Vadím Sukhomlínov, MitchAlsup
On Friday, April 18, 2025 at 3:18:52 PM UTC-5 Allen Baum wrote:
Nice examples; the only issue is really whether the carry state is mapped to a GPR or a CSR.

Thank you. 

The carry control (the bit-pairs that specify o,i,io, ~) has to be mapped to a CSR; 
I'd expect the tag (which register is used as carryin/out) to be in that CSR.

Yes, in effect, 21 bits from the CARRY instruction are copied into 21 bits of the Program-Status-Line.
 
That does reduce the number of ops that can fit into that CSR, but that's pretty irrelevant.

With 32 registers:
a) we only have 14 effective registers doing 128-bit calculations
b) we only have 7 effective registers doing 256-bit calculations
c) or just 3 effective registers doing 512-bit calculations

So, in practice, 8 instructions under a shadow is sufficient for the vast majority of use cases.
Also, in practice, 4 instructions under a shadow cover a healthy majority of use cases.
 
You could cap the number of ops that are dependent at (say) 3 or 4 and have multiple tags. The implementation would be a bit tricky, as a tag change requires saving the previous carry value and loading another - effectively inserting an op into the istream vs. making it explicit in the istream.

But when the carry-container is a GPR it can be referenced normally before and after its use in 
some kind of carry-chain, without executing any instructions "to get at it".

Conversely, if you don't need a tag, then you can use the immediate form of the CSR instructions to load the common cases with a single CSR op instead of a LI, CARRY sequence.

In an ISA with universal constants, you never have to even have a LI instruction.

     IDIV     R7,#0x123456789ABCDEF,R13      // is a single instruction occupying 3 words. 

In any case, save/restore should be easier without a tag, because you know which register needs to be saved and restored (the CSR), vs. having to read the tag from the CARRY CSR and then having to case_branch to a routine to save that specific general register (assuming, of course, that only SREGs can use this facility) and do the same for restore.

Two points here:
a) When the carry-container is a GPR, SW does not have to do anything special in identifying
and saving/restoring that register--it happens because the save/restore procedure
saves all the registers (subject to damage). 

b) My architecture has HW do the saving and restoring. So SW would be blissfully
unaware that special things need to happen, in any event.

On Fri, Apr 18, 2025 at 10:53 AM 'MitchAlsup' via RISC-V ISA Dev <isa...@groups.riscv.org> wrote:

snip

Allen Baum

unread,
Apr 21, 2025, 6:04:38 PMApr 21
to MitchAlsup, RISC-V ISA Dev, Vadím Sukhomlínov
I stand corrected; for the save/restore, you can be oblivious of the tag value because you will be saving/restoring the associated register, regardless.

I was assuming that the carry would be referenced only at the end of a sequence, not at an intermediate time. 
Referencing at an intermediate time is tricky because the instruction that references it is still subject to the 2 action bits, so it takes extra "--" action bits to use it.
It has to be treated as both an extra input, and as a bypass register - it really is a 2nd bypass register that you have to keep in flight.

The number of effective registers you list assumes that they are all in registers simultaneously, and you aren't preloading them from the cache while executing out of registers, i.e. the effective number of GPRs can be larger than you suggest. I don't know if that matters, in practice, 
since you need to have the prefetch ops use extra "--" encodings. I do agree that having 4 instructions under a shadow will cover the majority of cases.

Trying to get multiple, otherwise independent streams to operate in parallel seems possible also. 
Note that I am concerned only how this fits into a RISC-V implementation, so the universal constants and HW save/restore in your architecture isn't a thing.
I think HW save/restore will become a thing when CLIC gets finalized


MitchAlsup

unread,
Apr 21, 2025, 8:34:39 PMApr 21
to RISC-V ISA Dev, Allen Baum, RISC-V ISA Dev, Vadím Sukhomlínov, MitchAlsup
On Monday, April 21, 2025 at 5:04:38 PM UTC-5 Allen Baum wrote:
I stand corrected; for the save/restore, you can be oblivious of the tag value because you will be saving/restoring the associated register, regardless.

I was assuming that the carry would be referenced only at the end of a sequence, not at an intermediate time. 
Referencing at an intermediate time is tricky because the instruction that references it is still subject to the 2 action bits, so it takes extra "--" action bits to use it.
It has to be treated as both an extra input, and as a bypass register - it really is a 2nd bypass register that you have to keep in flight.

Yes, correct.  But in addition, each FU has this same forwarding loop back to itself, so IMUL 
and ADD can run simultaneously.

Also note: Higher end machines can completely "perform" CARRY instructions in DECODE.
Which is why I call them an instruction-modifier rather than an instruction.

The number of effective registers you list assumes that they are all in registers simultaneously, and you aren't preloading them from the cache while executing out of registers, i.e. the effective number of GPRs can be larger than you suggest. I don't know if that matters, in practice, 
since you need to have the prefetch ops use extra "--" encodings. I do agree that having 4 instructions under a shadow will cover the majority of cases.
 
This is my predilection: treat single values that occupy multiple registers as if they were a single register,
and a single register is always loaded prior to calculation on it.

Trying to get multiple, otherwise independent streams to operate in parallel seems possible also. 

Large multiplication uses 2 CARRYs: one feeding back to the IMUL, making it an IMAC; the other
converting ADD Rd,Rs1,Rs2 into ADD {Rcarry,Rd},Rs1,{Rcarry,Rs2}
 
Note that I am concerned only how this fits into a RISC-V implementation,
 
Yes, that is why I prefaced my extended comments as they don't pertain to RISC-V.

And while (eventually) I might be a competitor, I still want RISC-V to be as good as
it can be.

Allen Baum

unread,
Apr 22, 2025, 12:26:23 AMApr 22
to MitchAlsup, RISC-V ISA Dev, Vadím Sukhomlínov
inline

On Mon, Apr 21, 2025 at 5:34 PM 'MitchAlsup' via RISC-V ISA Dev <isa...@groups.riscv.org> wrote:
On Monday, April 21, 2025 at 5:04:38 PM UTC-5 Allen Baum wrote:
....
It has to be treated as both an extra input, and as a bypass register - it really is a 2nd bypass register that you have to keep in flight.

Yes, correct.  But in addition, each FU has this same forwarding loop back to itself, so IMUL 
and ADD can run simultaneously.

I'm not understanding this comment; could you give an example of how they would be used in parallel instead of serially?

Also note: Higher end machines can completely "perform" CARRY instructions in DECODE.
Which is why I call them an instruction-modifier rather than an instruction.

I'm interpreting that to mean it will expand a 2in, 1out add into a 3in, 2out add with the appropriate register numbers filled in.
Sometimes, the extra input will use a bypass reg; sometimes, though, you might need an extra cycle to fetch it (e.g. return from a context switch). That might come for free when the CARRY CSR gets restored (it would move the carry GPR into a bypass)
Cute.

The number of effective registers you list assumes that they are all in registers simultaneously, and you aren't preloading them from the cache while executing out of registers, i.e. the effective number of GPRs can be larger than you suggest. I don't know if that matters, in practice, 
since you need to have the prefetch ops use extra "--" encodings. I do agree that having 4 instructions under a shadow will cover the majority of cases.
 
This is my predilection: treat single values that occupy multiple registers as if they were a single register,
and a single register is always loaded prior to calculation on it.

Trying to get multiple, otherwise independent streams to operate in parallel seems possible also. 

Large multiplication uses 2 CARRYs: one feeding back to the IMUL, making it an IMAC; the other
converting ADD Rd,Rs1,Rs2 into ADD {Rcarry,Rd},Rs1,{Rcarry,Rs2}

Yes, that matches what I thought you meant.

Note that I am concerned only how this fits into a RISC-V implementation,
 
Yes, that is why I prefaced my extended comments as they don't pertain to RISC-V.

And while (eventually) I might be a competitor, I still want RISC-V to be as good as
it can be.
 
so the universal constants and HW save/restore in your architecture isn't a thing.
I think HW save/restore will become a thing when CLIC gets finalized

On Fri, Apr 18, 2025 at 1:43 PM 'MitchAlsup' via RISC-V ISA Dev <isa...@groups.riscv.org> wrote:
snip


BGB

unread,
Apr 22, 2025, 3:00:23 AMApr 22
to isa...@groups.riscv.org
FWIW, this is why in the past I had argued that ideally RISC-V should
also at least define a few major things:
Register Indexed Load/Store;
A scheme to encode larger immediate and displacement values;
Load/Store Pair (for 64-bit mode);
...

Namely, if RISC-V stands to become more widespread, it is better for it
not to have too many issues with arbitrary limitations negatively
affecting performance.

Well, and probably to limit mandating too many features which add
significant implementation cost while not providing all that much of a
performance gain.


Can note:
First case:
Only needs Base+Index*Scale, not Base+Index*Scale+Disp

Having both an index and a disp adds significantly more cost for a much
smaller gain (and is likely served "good enough" using, say, "SH3ADD+LD"
and similar).

A possible provision could be made for only having Indexed Load though,
which is the more common case, and avoids the need for an extra register
port. If encoded as normal 3 register ops, the encoding cost is modest.

I have ended up using part of the 64-bit encoding space for large
immediate values to good effect (with no impact on the existing 32-bit
encoding space). I ended up only using 64-bit encodings for extended
RISC-V, as the relative gains from 96-bit encodings were lesser, more so
when the only 96-bit instructions that could be encoded were 64-bit
constant loads (which, if one has a 64-bit SHORI encoding
("Rd=(Rd<<32))|Imm32u", it is still possible to load a 64-bit constant
in 2 logical instructions, which is mostly good enough; given that full
64-bit constants tend to be relatively infrequent, whereas 32/33 bit
values are far more common).

Granted, the majority of savings I saw here were mostly from the
larger Load/Store displacements (it is not too uncommon for the 12-bit
load/store displacement to be insufficient, but a 33 bit displacement is
nearly always sufficient).

Load/Store Pair was primarily a benefit for function prologs and
epilogs. I had reused the LDU/SDU encodings for this. In my
implementation, it could only access even-numbered register pairs, but
this is mostly sufficient for the use-case (another variant I have seen
has two independent registers with a much smaller displacement, but if
only that form were added, it is likely to be a worse option for
prologs/epilogs, which are best served if, much of the time, they can
directly access all of the stack frame; which is mostly true of Disp12
but not so much Disp5 or similar).




But, admittedly, I have gotten a little burnt out on a lot of this at
the moment.

Was recently mostly distracting myself by working on a 3D engine of
mine, and trying partly to build a retro-style text-adventure style
thing inside it (sorta like 1980s Sierra games, but in full 3D). I also
have some mixed results with trying to go to a 16-color palette (CGA/EGA
style) for the texture and sprite graphics (I suspect I may not have the
art skills to pull this off effectively; but at least trying to make an
effort at the moment). I don't expect it to amount to much (and, apart
from the "text parser" part, not much is new compared with past projects
of mine).

At least, with this sort of thing, I can do things and see results
within a reasonable level of effort. Like, I ended up writing a BASIC
interpreter for some types of scripting, and pulled off a usable
interpreter in mostly a single day (with a few more days of additional
fiddling, like ending up adding dynamic scoping because the "global
variables only" scoping of 80s style BASIC was limiting, and dynamic
scoping was compatible with GOSUB/RETURN style control-flow). Mostly 80s
style BASIC being a fairly simple language to implement (only a little
over 1000 lines of C for the interpreter ATM); and also avoids external
dependencies.

Well, and implemented some 3D model stuff (in case I want to use 3D
models in this). And then I was left half debating whether I should write a
SCAD (OpenSCAD) interpreter (thus far, I had been using OpenSCAD to some
extent for 3D modeling; though mostly in the context of making models
for 3D printed parts; but as-is, OpenSCAD doesn't support
texture-mapping or animation; though it is possible to project textures
onto STL models after the fact, but this seems crufty vs, say, if one
could specify texturing in the SCAD code; but, also possible could be to
do CSG stuff in BASIC, but this also seems crufty). But, then again,
maybe one doesn't need textures (a case could be made for just using
flat colors as "poor man's cel shading", already going "low-poly" for
everything, and trying to adhere to a 16-color palette for base colors,
but allowing for more colors via lighting effects; but, would still need
textures for dithering effects, which is borderline mandatory with
16-color; ...). I had just decided to look into 3D models more as some
things aren't really "working" as billboard sprites.

Not yet come up with a plot, and still working on basic mechanics. A lot
of this does at least give more of a feeling of "getting something done"
even if not likely all that relevant (and it probably still sucks and no one
will care).



Maybe will get back to working on CPU stuff, but at this point it seems
like endless time and effort for relatively little gain. Had started
working on trying to support Verilog in my compiler (running it in
software), but this seems a bit niche (and, would be an uphill battle
mostly to try to have an alternate way to try to hunt down some bugs
that have thus far eluded me in my Verilog code).


For whatever reason, my extended RV encoding for:
ADDI Xd, X3, Imm33
seems to be unstable, but I can't seem to find anything for why it would
be such (when pretty much everything else seems to be working). Looking
at the decoder output, pipeline states, etc, everything seems to check
out, but none the less, emitting this encoding from my compiler causes
programs to break. Note, issue seems specific to X3 (GP), which is the
primary base register used for accessing global variables in my compiler
output (and an Imm33 here would mostly apply to loading the address of
arrays in the ".bss" section).

I started wanting to be able to do line by line stepping and looking at
variables individually, which one can do with C in something like Visual
Studio. Can't really inspect the Verilator output in GDB in the same
way, mostly because Verilator mangles things a little too much. But,
writing my own Verilog compiler (and better debugger) has run into the
"too much effort" needed issue. Though, did at least come up with a few
ISA and compiler features that can make Verilog style code less inefficient.

But, spending months hunting down an obscure bug is kinda demoralizing
(and possibly part of the burn out in this case).

Well, along with the issue of, while my CPU runs my own stuff OK, it
doesn't run a mainline OS (such as Linux) in RV64G mode really at all,
and making this work also seems like an uphill battle (as is trying to
make my kernel able to run binaries from the Linux userland).


Might just be better if someone else implemented some similar
extensions, and see whether a different implementation could also get a
similar roughly 30% speedup in some programs (mostly things like Doom
and similar).

Though, at this point, it seems like, even if a 30 to 40% speedup (for
plain C, plain integer code) is possible via ISA and compiler tuning, it
might not be enough to matter.

...






MitchAlsup

unread,
Apr 22, 2025, 1:17:46 PMApr 22
to RISC-V ISA Dev, Allen Baum, RISC-V ISA Dev, Vadím Sukhomlínov, MitchAlsup
On Monday, April 21, 2025 at 11:26:23 PM UTC-5 Allen Baum wrote:
inline

On Mon, Apr 21, 2025 at 5:34 PM 'MitchAlsup' via RISC-V ISA Dev <isa...@groups.riscv.org> wrote:
On Monday, April 21, 2025 at 5:04:38 PM UTC-5 Allen Baum wrote:
....
It has to be treated as both an extra input, and as a bypass register - it really is a 2nd bypass register that you have to keep in flight.

Yes, correct.  But in addition, each FU has this same forwarding loop back to itself, so IMUL 
and ADD can run simultaneously.

I'm not understanding this comment; could you give an example of how they would be used in parallel instead of serially?
 
void Long_multiplication( uint64_t multiplicand[],
                          uint64_t multiplier[],
                          uint64_t sum[],
                          uint64_t ilength, uint64_t jlength )
{
     for( uint64_t i = 0; i < (ilength + jlength); i++ )
          sum[i] = 0;

     for( uint64_t acarry = 0, j = 0; j < jlength; j++ )
     {
          for( uint64_t mcarry = 0, i = 0; i < ilength; i++ )
          {
               // pair-assignment pseudocode: {hi,lo} denotes a 2*XLEN value
               {mcarry, product} = multiplicand[i]*multiplier[j] + mcarry;
               {acarry,sum[i+j]} = {sum[i+j],acarry} + product;
          }
     }
}

mcarry and acarry are used in parallel.
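
For readers who want something that compiles: the same loop in portable C, with the pair assignments lowered to unsigned __int128 temporaries (a GCC/Clang extension; this is my rendering, not Mitch's code). Both carries are reset per row here, and the two trailing carries are folded into the next limb, which the pseudocode above elides:

```
#include <stdint.h>

void long_multiplication(const uint64_t *multiplicand, const uint64_t *multiplier,
                         uint64_t *sum, uint64_t ilength, uint64_t jlength)
{
    for (uint64_t i = 0; i < ilength + jlength; i++)
        sum[i] = 0;

    for (uint64_t j = 0; j < jlength; j++) {
        uint64_t mcarry = 0, acarry = 0;
        for (uint64_t i = 0; i < ilength; i++) {
            // {mcarry, product} = multiplicand[i]*multiplier[j] + mcarry
            unsigned __int128 p = (unsigned __int128)multiplicand[i] * multiplier[j] + mcarry;
            mcarry = (uint64_t)(p >> 64);
            // {acarry, sum[i+j]} = sum[i+j] + product + acarry
            unsigned __int128 s = (unsigned __int128)sum[i + j] + (uint64_t)p + acarry;
            sum[i + j] = (uint64_t)s;
            acarry = (uint64_t)(s >> 64);
        }
        sum[ilength + j] = mcarry + acarry;  // fold both trailing carries; fits in one limb
    }
}
```

The two recurrences are the same split that x86's ADX extension (adcx/adox) keeps in two separate flag bits, precisely so the multiply chain and the add chain can be interleaved.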

Also note: Higher end machines can completely "perform" CARRY instructions in DECODE.
Which is why I call them an instruction-modifier rather than an instruction.

I'm interpreting that to mean it will expand a 2in, 1out add into a 3in, 2out add with the appropriate register numbers filled in.
Sometimes, the extra input will use a bypass reg; sometimes, though, you might need an extra cycle to fetch it (e.g. return from a context switch). That might come for free when the CARRY CSR gets restored (it would move the carry GPR into a bypass)
Cute.

DECODE does not "really" convert from 2-operand 1-result into 3-operand 2-result.
DECODE concatenates 2 bits from the shift register to each instruction in decode.
It is the data-path that is pre-wired to be 3-operand, 2-result but is used as 2-operand, 1-result most of the time. 
 

On Fri, Apr 18, 2025 at 1:43 PM 'MitchAlsup' via RISC-V ISA Dev <isa...@groups.riscv.org> wrote:
snip


Allen Baum

unread,
Apr 22, 2025, 5:31:37 PMApr 22
to MitchAlsup, RISC-V ISA Dev, Vadím Sukhomlínov

There is an explicit dependency on the product variable that prevents them from being done in parallel (for the first and last iterations).
But even if you do, both acarry and mcarry exist simultaneously - and that means you have to be able to have 2 carry registers active at the same time,
and there is no way to distinguish them in the operations. You would need to expand your 2bit io field to a 3bit field and have two tags, at the very least.

Whether you put the 2bit field into the opcode or expand it to a register number is a distinction without a difference, most likely.
You'd really have to go into the weeds to figure out which is the right way to do it.

MitchAlsup

unread,
Apr 22, 2025, 8:36:20 PMApr 22
to RISC-V ISA Dev, Allen Baum, RISC-V ISA Dev, Vadím Sukhomlínov, MitchAlsup
On Tuesday, April 22, 2025 at 4:31:37 PM UTC-5 Allen Baum wrote:

There is an explicit dependency on the product variable that prevents them from being done in parallel (for the first and last iterations).

Could you try again to state the point you are trying to make above.

In the case of IMUL, under the shadow of CARRY it becomes IMAC: {Rcarry,Rd} = Rs1×Rs2 + Rcarry,
with all 128 bits of the multiplier product contributing to the result.

While, as we showed, ADD under the shadow of CARRY becomes ADD: {Rcarry,Rd} = Rs1 + Rs2 + Rcarry.
 
But even if you do, both acarry and mcarry exist simultaneously - and that means you have to be able to have 2 carry registers active at the same time,

Sure, chosen by the compiler via standard register allocation, ...
 
and there is no way to distinguish them in the operations. You would need to expand your 2bit io field to a 3bit field and have two tags, at the very least.

2 bits from the shift register are concatenated onto the instructions in the CARRY shadow,
along with the 5-bit register donated by the CARRY instruction itself (something the DECODEr
will have readily at hand); henceforth the instructions are simply executed. The DECODEr
has to track whether the first instruction in the shadow "gets" a carry: {I} and {IO} do,
{O} and {} do not. If Rcarry is not consumed by the first instruction, Rcarry does not
have to be read from the RF. Also, if Rcarry is not produced by the last instruction, Rcarry
does not have to be written into the RF, either.

One way to look at this is that the register donated by CARRY is an accumulator that is
re-used across every instruction under that CARRY's shadow.

Whether you put the 2bit field into the opcode or expand it to a register number is a distinction without a difference, most likely.
You'd really have to go into the weeds to figure out which is the right way to do it.
 
I would assert that this is the proper way to design EVERY instruction in an ISA:
a) invent a new instruction
b) teach compiler to use instruction when appropriate
c) if instruction can be used enough  
... or instruction solves a particular nasty issue (CARRY)
d) and it does no serious damage to pipeline length or cycle time
e) then go ahead and put it in.

Snip

BGB

unread,
Apr 23, 2025, 1:19:57 AMApr 23
to isa...@groups.riscv.org
Yes, agreed.

My process was highly driven by compiler twiddling (using a Verilog/FPGA
implementation to evaluate cost tradeoffs).



This is how I went in my case as well:
Compiler first;
Then an implementation in Verilog;
Then a lot of fine-tuning.


A lot of details ended up converging towards those in RISC-V, apart from
non-functional aspects (such as ASM syntax; my compiler and ISA's syntax
being more similar to M68K or VAX style).

There were lots of things that were in the original ancestor ISA that
were dropped:
Branch delay slots;
Auto-increment addressing;
Various instructions with niche addressing modes
(such as "(PC+Disp*4)&(~3)");
...

Well, and wacky stuff that I just straight up didn't bother with:
MAC.W @Rm+, @Rn+
Which loaded two signed 16 bit values from memory, multiplied them
(producing a 32 bit value), and then added the result to a 64-bit accumulator
(MACH:MACL).

I was generally against anything that violated the Load/Store model
(whereas, neither SuperH nor MSP430 were exactly free of this). Though,
ironically, despite originally evolving out of SH and being influenced
by MSP430 (and TMS320), a lot of things didn't survive all that long...



As noted, there were a few points where things ended up different:
I went for 64 GPRs in a GPR-only ISA;
As opposed to X/F, or X/F/V.
It internally uses Rb+Ri*Sc as the core addressing mode.
Internally, Disp addressing is just the indexed addressing.
Just using an immediate for the corresponding register port.

At least as far as I can see, a single register space seems preferable
in terms of cost, complexity, and code generation efficiency, if
compared with having separate GPR and FPU spaces. The only obvious
sticking point being that FPU operations tend to have higher latency
than integer ops. Admittedly, I don't have an ideal solution for this.

But, at least to me, a discrepancy in terms of average case latency is a
smaller issue than that of having multiple register files and pipelines,
and having to duplicate a bunch of machinery as a result.

Similarly, I am not positively inclined towards implementing the 'V'
extension as it seems weak in terms of benefit relative to
implementation cost (as in, 'V' would be fairly expensive to implement;
and unclear how much cleverness is needed from the compiler to use it
effectively).

Though, admittedly, I can note that my implementation tends to have
higher latency than is typical in RISC-V land:
Most instructions (including ALU instructions) have a 2 cycle latency;
Load operations generally have a 3 cycle latency.

I have noted before that the relative performance advantage of some of
my own ISA features is reduced if one assumes 1 cycle ALU operations and
2 cycle memory load (but, lower latency adds cost on the FPGA side of
things). But, this seems to depend a lot on ISA + compiler.


In general, there is no separate logic internally for the register and
immediate forms of instructions; I saw little benefit in having separate
logic or function units for the two forms.

There is another difference in terms of architectural state:
My ISA had S and T bits, with the T bit mainly used for conditional
instructions. There were ADDC(ADC) and SUBC(SBB) instructions (that use
T as a Carry bit), but their use is comparably infrequent.

Conditional instructions do have some benefit, but the difference is
small enough in many cases that I can mostly ignore them. Similarly,
they don't map over well to RISC-V.

I guess, like Mitch's ISA:
It is possible to encode large constants inline,
though not quite as fancy.
In my case, prefixes are used to make immediate fields bigger.
GPRs only (just 64 vs 32).


Unlike Mitch's ISA:
No direct equivalent of CARRY
I handle predication in a different way
Directly encoded in instruction, more like 32-bit ARM.
All save/restore was handled in software.

A few of these cases were because the added implementation complexity
didn't seem worth the cost.


So, for features like CARRY or ENTER/EXIT, I am not personally convinced
that they bring enough benefit to justify the associated hardware cost.

I put them in a similar category as PUSH/POP instructions and
auto-increment addressing, which while there is some potential merit,
their existence in an ISA brings up some thorny issues that may leave
their presence as a "net negative".



Only certain ISA variants allow for full 64-bit immediate values.
XG1: No
Imm64 limited to "MOV Imm64, Rn" and "ADD Imm64,Rn"
XG2: Yes, all 3RI instructions.
XG3: Yes, all 3RI instructions.
RISC-V + JumboPrefix: No
Limited to "LI Xd, Imm64" (J+J+LUI).
I decided to leave support for this as optional.

Decided to leave it shorter, but:
In all 4 cases, Imm33 encodings are available to all normal Imm/Disp
cases (though XG1 imposed a few extra edge-case restrictions).

It is possible to use the immediate prefix to create an immediate on an
arbitrary 3 register instruction, though:
XG1 and XG2: Limited to 29 bits.
XG3: Can also synthesize a 33 bit immediate;
RV+JX: Limited to 26 and 17 bits.



I can note that 64-bit immediate values are comparably rare in the
compiler output. A case could almost be made for assuming 32 or 33 bits
as a hard limit for direct immediate values.

Say:
y=x+0x123456789ABCDEFLL;
Will need 3 instructions rather than 1 (and 5 instruction words rather
than 3), if a 33 bit hard limit were assumed. Counter argument being
that, if it only happens a few hundred times in a program binary, and is
not in the hot path, it may not matter all that much.


Had noted before, when doing a race in terms of Doom startup times based
on ISA variants, speed ranking was:
XG3 (best), XG2, RV+Jx, RV64G (worst)

With a ".text" size ranking:
XG2 (smallest), XG3, RV+Jx, RV64G (biggest)

If XG1 were included, it is the smallest, but isn't all that high on a
speed ranking (XG1, RV64G, and RV64GC being on the slower end).

RV64GC, while smaller than RV64G, is still on the bulkier side of this
ranking (generates bigger binaries than XG2 and XG3, neither of which
have 16-bit instructions). Ironically, RV64G+Jx also generates smaller
binaries than RV64GC, but this is a lot closer to a tie.

XG1 is the closest to RV64GC in terms of features (primarily 16/32 bit
encodings, with 5-bit register fields), and generates the smallest
binaries (of the group), just is not so great on the performance side of
things (though, in implementation choices, both are towards the slower
end; as the 16-bit ops effectively come with performance penalties).


Had noted:
Recently, XG3 has seemingly edged out XG2 in terms of performance.
RV+Jx is faster than RV64G, but was not enough to fully close the gap
with XG2 and XG3.

I guess, I can note, working ISA listing for XG3:
https://github.com/cr88192/bgbtech_btsr1arch/blob/master/docs/2024-10-22_XG3RV.txt

But, still relevant to understanding how it works:
https://github.com/cr88192/bgbtech_btsr1arch/blob/master/docs/2020-04-30_BJX2D.txt

https://github.com/cr88192/bgbtech_btsr1arch/blob/master/docs/2020-04-30_BJX2_IsaDescD.txt

The IsaDescD spec mostly lists things in terms of XG1 encoding rules,
but isn't too hard to figure out.



If I were to do a reboot project, I might consider using a mix of RISC-V
and XG3 as the primary ISAs (mostly, as if XG1 and XG2 were dropped,
likely some cost-savings could be possible).


Either way, probably to be "much more than a toy" would need to be able
to run something more of a "real OS and userland" on the thing.

This poses an issue on multiple fronts:
Run an RV Linux kernel:
Hardware interfaces are too much different;
Kinda defeats the point vs just using a more generic RV core (*1).
Port Linux kernel:
Likely not gonna happen without GCC support;
Linux kernel is very tightly tied to assumption of using GCC;
More so as parts of it are now written in Rust, ...
Port BSD kernel and userland:
At least possible.
Run RV64G ELF PIE binaries on my kernel.
Thus far GLIBC immediately explodes with my ELF loader;
Also still need to implement a lot of the various Linux syscalls.
This route allows still using my ISA for the OpenGL backend, etc.
Using my compiler:
Only fairly trivial programs work via "./configure" scripts;
Need to "port the whole jungle" to bootstrap the userland;
Like, this stuff is an ever expanding tangle of dependencies.
Doesn't take very long for "cracks" to become visible.


*1: Say, hypothetically if one just tried taking a generic RV64G Linux
build, and then running it on a CPU with the XG3 ISA available, and had
the applications switching to XG3 mode themselves. There is little to
say this just wouldn't immediately explode. It is likely that using an
off-the-shelf Linux would also limit things to only running RISC-V code
(and rendering any support for other ISA modes as irrelevant).


Some of my original use-cases, like using it as a CNC or robot
controller, haven't really materialized (vs the more practical option of
"just use a RasPi or something"; but then one has less reason to care
what ISA it is running, as everything still looks mostly similar from C
land).


Well, and the status of, well:
It runs Doom and similar pretty well;
It runs Quake, but it is slow at 50MHz;
It runs Quake 3, but even slower;
It sorta has an almost comically bad experimental GUI;
"Hey check out this slow and very limited Minecraft clone";
Well, also ports of Heretic, Hexen, ROTT, ...
Has a video player which can mostly just play "Bad Apple"
Too slow to use generic MPEG style codecs.
Had to specially transcode any videos to a viable codec.
...

Well, something, but kinda lame, still mostly "proof of concept" or a toy.

It is seemingly not doing "that badly" in terms of performance, as I
guess (contrary to my childhood memories), Quake doesn't run so well on
a 486DX2-66 either (and, given both are limited to single digits, not
"that" much worse).

GLQuake and Quake3 sorta need my ISA though (in some form) to be usable,
as trying to run pure RV64G builds are painfully slow (not so much
"single digit fps" so much as "seconds per frame"). A lot of this case
likely has to do with SIMD and various helper ops.

Software Quake is a better showing at least for RISC-V in that it isn't
significantly slower.


Note that using RV64G or similar for most programs could still make
sense, as most are not overly performance sensitive.

...



In my case, the core ISA has mostly stabilized, and I am not finding
that many new features to add that can help much with generic C performance.


The most recent ISA additions (BITMOV and BITMOVS) actually
had relatively little to do (directly) with C performance, but rather to
try to make Verilog style code more practical. But, can at least offer
some performance gains for "extreme bit twiddling" if using language
extensions.

Where BITMOV basically combines a shift and masked MUX, to move a
bitfield from one register to another. And, BITMOVS combines this with
XOR'ing the non-masked region with the sign-extension of the copied
bit-field. This can be used for signed-bitfield extract,
bit-extension, and some niche operations like zigzag sign-folding.
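
As a C model of that shift-plus-masked-MUX (my reading of the description; the operand and field placement here are illustrative, not BGB's actual encoding):

```
#include <stdint.h>

// Move a 'width'-bit field from bit srcpos of src to bit dstpos of dst.
static inline uint64_t bitmov(uint64_t dst, uint64_t src,
                              unsigned srcpos, unsigned dstpos, unsigned width)
{
    uint64_t mask  = ((width >= 64) ? ~(uint64_t)0
                                    : (((uint64_t)1 << width) - 1)) << dstpos;
    uint64_t field = (src >> srcpos) << dstpos;   // the shift stage
    return (dst & ~mask) | (field & mask);        // the masked-MUX stage
}
```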

Could maybe be applied more to generic C (inferring from shift-and-mask
code), but the compiler inference needed would be a little advanced.


But, generally, everything that is left is quickly turning into uphill
battles. If months go by with relatively little outwardly visible signs
of progress, this gets demoralizing.



Allen Baum

unread,
Apr 23, 2025, 2:44:58 AMApr 23
to MitchAlsup, RISC-V ISA Dev, Vadím Sukhomlínov
I must be misunderstanding the argument to the carry instruction then. I thought there were "N" 2bit fields, and a single register tag.
Is there instead a single 2bit field per 5bit tag? (so you could store at most 4 sets in a 32b register)

Otherwise, you need to switch back and forth between the mcarry and the acarry between each instruction that you're trying to run in parallel.
What am I missing here?
Maybe I should just expand the code as assembler instructions...

MitchAlsup

unread,
Apr 23, 2025, 12:26:01 PMApr 23
to RISC-V ISA Dev, Allen Baum, RISC-V ISA Dev, Vadím Sukhomlínov, MitchAlsup
On Wednesday, April 23, 2025 at 1:44:58 AM UTC-5 Allen Baum wrote:
I must be misunderstanding the argument to the carry instruction then. I thought there were "N" 2bit fields, and a single register tag.
Is there instead a single 2bit field per 5bit tag? (so you could store at most 4 sets in a 32b register)

The instruction occupies a "slot" where a 16-bit immediate is available. 
The 16-bit field is sliced into 8 containers, and concatenated to instructions 2-bits at a time.
The Destination register designates the CARRY register concatenated to each instruction where the 2-bit field is not 00.
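
In C terms, as I read that description (the bit order and action-code assignments are assumed, purely for illustration):

```
#include <stdint.h>

// Slice the 16-bit CARRY immediate into eight 2-bit action fields, one per
// instruction in the shadow; 00 would mean "not under the shadow".
void slice_carry_actions(uint16_t imm16, uint8_t action[8])
{
    for (int k = 0; k < 8; k++)
        action[k] = (imm16 >> (2 * k)) & 0x3;    // e.g. {}, {I}, {O}, {IO}
}
```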

Otherwise, you need to switch back and forth between the mcarry and the acarry between each instruction that you're trying to run in parallel.
What am I missing here?

The large multiply loop has 2 CARRY instructions in it, each with a shadow of 1 instruction.

Allen Baum

unread,
Apr 24, 2025, 3:43:10 AMApr 24
to MitchAlsup, RISC-V ISA Dev, Vadím Sukhomlínov
not helping. The carry instruction has a single "destination" register (which is the tag) and a 16bit "immediate" (which isn't used as a typical immediate).
So, in order to have parallel execution, I think you're saying you have to execute 4 instructions in parallel: 
     Carry that applies to the mul with mcarry as the dest and a single non-zero 2 bit field
     Mul which is affected by the previous Carry op
    Carry that applies to the add with acarry  as the dest and a single non-zero 2 bit field
    Add which is affected by the previous Carry op
That sounds inefficient and painful 
 

Robert Finch

unread,
Apr 24, 2025, 7:37:22 AMApr 24
to RISC-V ISA Dev, Allen Baum, RISC-V ISA Dev, Vadím Sukhomlínov, MitchAlsup

Modifiers remind me of the ATNI (add-to-next-instruction) of the Apollo guidance computer. They are really just expanding the bits available to the following instructions. One could just make the instructions wider, but that would waste bits for cases where the wider instructions are not needed.

I suppose if a variable-length instruction were used, one could just have an:

ADD Rco,Rd,Rs1,Rs2,Rci

instruction. Maybe a 48-bit instruction format would work. I recall reading a post that there was a movement towards 3-read, 2-write port instructions for some newer designs, at least for their internal workings. The extra ports are hidden by the ISA unless needed. 

To get five register specs in a 32-bit ISA, it might be better to have only 16 regs (five 4-bit fields use 20 of the 32 bits, leaving 12 for the opcode, where five 5-bit fields would leave only 7). It would cost a little bit in performance, but may be more humanly understandable.

MitchAlsup

unread,
Apr 24, 2025, 1:19:32 PMApr 24
to RISC-V ISA Dev, Allen Baum, RISC-V ISA Dev, Vadím Sukhomlínov, MitchAlsup
On Thursday, April 24, 2025 at 2:43:10 AM UTC-5 Allen Baum wrote:
not helping. The carry instruction has a single "destination" register (which is the tag) and a 16bit "immediate" (which isn't used as a typical immediate).

I had a ~50-line response written--when it occurred to me that this would be a good
place to request what the inner loop of a large multiply function* would look like for
RISC-V at both a C level and an ASM level, so we can compare more apples to apples.

(*) similar to what I illustrated a couple of days ago, but optimized for RISC-V.

It only seems fair.
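
In the meantime, a hedged sketch of that inner-loop limb step on today's RV64IM, written so each C statement maps onto one instruction (shown in the comments; register assignments are illustrative, and it assumes GCC/Clang's unsigned __int128 for the mulhu):

```
#include <stdint.h>

// One limb step: {mcarry, product} = a*b + mcarry;
//                {acarry, *s}      = *s + product + acarry;
static inline void limb_step(uint64_t a, uint64_t b, uint64_t *s,
                             uint64_t *mcarry, uint64_t *acarry)
{
    uint64_t hi = (uint64_t)(((unsigned __int128)a * b) >> 64); // mulhu t0,a0,a1
    uint64_t lo = a * b;                                        // mul   t1,a0,a1
    lo += *mcarry;                                              // add   t1,t1,a2
    uint64_t c0 = lo < *mcarry;                                 // sltu  t2,t1,a2
    *mcarry = hi + c0;                                          // add   a2,t0,t2 (cannot wrap)
    uint64_t t = *s + lo;                                       // add   t3,a3,t1
    uint64_t c1 = t < lo;                                       // sltu  t4,t3,t1
    t += *acarry;                                               // add   t3,t3,a4
    uint64_t c2 = t < *acarry;                                  // sltu  t5,t3,a4
    *s = t;                                                     // (a3 updated)
    *acarry = c1 + c2;                                          // add   a4,t4,t5 (0 or 1)
}
```

Ten ALU/MUL instructions per limb before loads, stores, and index arithmetic; under the two CARRY shadows the same step is an IMAC plus an ADD.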

snip 

Allen Baum

unread,
Apr 24, 2025, 1:45:35 PMApr 24
to Robert Finch, RISC-V ISA Dev, Vadím Sukhomlínov, MitchAlsup
Mitch has described a prefix instruction; there are many in x86, only one has been proposed (but not ratified I think) for RISC-V (so far...)
There are other commercial architectures that have prefix instructions (e.g. Transputer, RPM40).
What makes Mitch's idea stand out is that his prefix 
 - covers multiple following instructions (not just one) and 
 - effectively adds new data input/outputs (i.e. register ports).
Issues with prefix instructions are, first, how to deal with exceptions and interrupts 
  (usually trivially solved by requiring extra state to be saved without using any prefix instructions,
   or by only defining prefixes that don't change the architectural semantics of their shadows, like RISC-V "hint" ops)
and, of course, ensuring they get used by compilers.

BGB

unread,
Apr 24, 2025, 3:07:25 PMApr 24
to isa...@groups.riscv.org
On 4/24/2025 12:45 PM, 'Allen Baum' via RISC-V ISA Dev wrote:
> Mitch has described a prefix instruction; there are many in x86, only
> one has been proposed (but not ratified I think) for RISC-V (so far...)
> There are other commercial architectures that have prefix instructions
> (e.g. Transputer , RPM40).
> What makes Mitch's idea stand out is that his prefix
>  - covers multiple following instructions (not just one) and
>  - effectively adds new data input/outputs (i.e. register ports).
> Issues with prefix instructions are, first, how to deal with exceptions
> and interrupts
>   (usually trivially solved by requiring extra state to be saved
> without using any prefix instructions,
>    or by only defining prefixes that don't change the architectural
> semantics of its shadows, like RISC-V "hint" ops)
> and, of course, ensuring they get used by compilers.
>

I had been arguing for adding a "Jumbo Prefix" to RISC-V, i.e. one that "mostly
makes immediate fields bigger, maybe does other things depending on the
target instruction"; unless this prefix is the one being mentioned here
(as proposed), or there is some more formal process to this I am not
aware of.

I guess, I can note that my initial encoding would have overloaded the
space formerly occupied by BitManip's dropped ADDIWU, but I had since
relocated it to the 64-bit encoding space:
* tiiiiii-iiiii-jjjjj-100-kkkkk-01-11111 J22

TBD if it still counts as a prefix in this case, my decoder still treats
it as a prefix.

My thinking here is:
No additional architectural state is associated with the prefix;
An interrupt may only occur during the prefix, not before or after;
For sake of PC relative calculations (Disp33 branches), the PC before
the prefix is used.

Well, unless someone else has proposed a prefix that basically does the
same things. In this case, either option will work.



But, yeah, I am also in the camp of being skeptical of the general
implementation viability of CARRY prefixes due to the needed
architectural state and other "thorny" issues they would raise.

Well, and on naive implementations, there would likely be no way to
handle them other than a less desirable "trap and emulate" strategy.


I would also be skeptical of a PRED style prefix, but at least one merit
of this one is that it could be defined in a way that it is essentially
equivalent to a short forward conditional branch on implementations
which are too naive to deal with "Mask off next N instructions in the
pipeline" (just, maybe, a short forward branch with an unusually short
displacement).


Though, the "pseudo option" (that would exist in XG3) is that the T bit
is aliased into the high order bits of PC (along with the ISA mode), and
use more traditional "add-with-carry" or "subtract-with-borrow" type
instructions. But, I don't expect this to be a particularly popular
option (and does come at the cost of reducing the maximum usable size of
the address space, at least with respect to where one can put code at;
other option being to put it in a CSR or similar).

For more complex instructions, like multiply/divide, "there be dragons
here". My preference was to have "widening multiply" instructions
instead (which RV currently lacks). Narrowing divide could also be
possible in theory.

Wide shift basically turns into a funnel shift operator. IIRC, these
already exist in the B extension (would need to look more to be certain
if B's funnel shift operators work in the way I think they do; but if
they do, wide shift should already be addressed in theory).
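
For reference, a C sketch of the double-width (funnel) right shift in question, independent of whatever encodings the B drafts used; note the s == 0 case must be special-cased, since a 64-bit shift by 64 is undefined in C:

```
#include <stdint.h>

// Low word of (hi:lo) >> s, for 0 < s < 64.
static inline uint64_t funnel_shr(uint64_t hi, uint64_t lo, unsigned s)
{
    return (lo >> s) | (hi << (64 - s));
}
```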

...

MitchAlsup

unread,
Apr 24, 2025, 3:35:47 PMApr 24
to RISC-V ISA Dev, Allen Baum, RISC-V ISA Dev, Vadím Sukhomlínov, MitchAlsup, Robert Finch
On Thursday, April 24, 2025 at 12:45:35 PM UTC-5 Allen Baum wrote:
Mitch has described a prefix instruction; there are many in x86, only one has been proposed (but not ratified I think) for RISC-V (so far...)
There are other commercial architectures that have prefix instructions (e.g. Transputer, RPM40).
What makes Mitch's idea stand out is that his prefix 
 - covers multiple following instructions (not just one) and 
 - effectively adds new data input/outputs (i.e. register ports).

I tried to address this twice already--it seems a third attempt is in order::

If you use an instruction-modifier which covers multiple subsequent instructions,
a) you add 0 or 1 register read at the beginning of the sequence.
b) you use a (?new?) forwarding path between all instructions under the shadow.
c) you add 0 or 1 register writes after the end of the sequence.
d) you add state (as below) to deal with interruptions.

{{There may be subtle modifications to the standard data-path not mentioned above.}}

You do not "necessarily" add a register read and a register write to each instruction under the shadow,
although you could.

If the instruction at the beginning of the sequence does not receive a carry-in
then DECODE does not have to arrange that that register is read, only that a
0x0 arrives as an operand.

If the instruction at the end of the sequence does not produce a carry-out then
the write-back stage of the pipeline does not write the result in the renamed
target register.

All of the dense multi-precision sequences start with {O} and end with {I} meeting
both requirements above.

----------------My 66000 details--------------------
I have been trying to avoid spill-over from CARRY to My 66000 vectorization in my
descriptions. My 66000 Vectorization, effectively, adds an index count to each
renamed operand register in the reservation station and concatenates said index
when said instruction is launched for the [index]^th time. This means that a vectorized
loop can rename registers once and then be performed any number of times one
desires. So, FETCH-through-DECODE remains quiet when performing vector loops.

CARRY makes use of this index in connecting the dots for instructions under the
shadow. So a single renamed register can be used for all instructions under the
shadow without losing the ability to keep individual instances separate (and
DECODE sane).

Since these uses are necessarily in order per CARRY, the index simply prevents
launching too early. The indexes are independent between different CARRYs.
-----------------------------------------------------------------

Allen Baum

unread,
Apr 25, 2025, 4:04:10 AMApr 25
to MitchAlsup, RISC-V ISA Dev, Vadím Sukhomlínov, Robert Finch
To be clear, I said "effectively adds", not "necessarily". How you implement it is different; you don't need to actually add a (full) register port, and I understand that. 
It could all just live in the bypass logic and never go to the physical register file. Or not...
The fact that you could implement them as an actual read from a register file (and you'd have to sometimes) implies that that is the effect.
So we agree there. 

And, I just figured out how 2 carry registers can be active at the same time. They alias to real registers, so a second CARRY op with a different Rd will
 (*effectively*...) just write the previous carry back into its associated register, and read the new one from the register file.
So, a little renaming magic lets you keep more balls up in the air.

This is the inner loop you showed
             {mcarry, product} = multiplicand[i]*multiplier[j] + mcarry;
             {acarry,sum[i+j]} = {sum[i+j],acarry} + product;
 
You can't have the above execute in parallel. But you can pipeline them:
             {mcarry, product} = multiplicand[i]*multiplier[j] + mcarry;
             {acarry,sum[i+    j]} = {sum[i+    j],acarry} + product;     {mcarry, product} = multiplicand[i+1]*multiplier[j] + mcarry;
             {acarry,sum[i+1+j]} = {sum[i+1+j],acarry} + product;     {mcarry, product} = multiplicand[i+2]*multiplier[j] + mcarry

So, I think I see and agree how this works now; I was interpreting your wording a bit too literally.
The renaming logic to do this sounds a bit formidable, but maybe the 3rd pseudo-register port is just attached to a small register file (1..4 registers).
They have to be plumbed into the regular ALU ports as well as the carry input port of the functional units, but that may not be too onerous.


MitchAlsup

unread,
Apr 25, 2025, 2:26:14 PMApr 25
to RISC-V ISA Dev, Allen Baum, RISC-V ISA Dev, Vadím Sukhomlínov, Robert Finch, MitchAlsup
On Friday, April 25, 2025 at 3:04:10 AM UTC-5 Allen Baum wrote:
To be clear, I said "effectively adds", not "necessarily". How you implement it is different; you don't need to actually add a (full) register port, and I understand that. 
It could all just live in the bypass logic and never go to the physical register file. Or not...
The fact that you could implement them as an actual read from a register file (and you'd have to sometimes) implies that that is the effect.
So we agree there. 

And, I just figured out how 2 carry registers can be active at the same time. They alias to real registers, so a second CARRY op with a different Rd will
 (*effectively*...) just write the previous carry back into its associated register, and read the new one from the register file.
So, a little renaming magic lets you keep more balls up in the air.

This is the inner loop you showed
             {mcarry, product} = multiplicand[i]*multiplier[j] + mcarry;
             {acarry,sum[i+j]} = {sum[i+j],acarry} + product;
 
You can't have the above execute in parallel. But you can pipeline them:
             {mcarry, product} = multiplicand[i]*multiplier[j] + mcarry;
             {acarry,sum[i+    j]} = {sum[i+    j],acarry} + product;     {mcarry, product} = multiplicand[i+1]*multiplier[j] + mcarry;
             {acarry,sum[i+1+j]} = {sum[i+1+j],acarry} + product;     {mcarry, product} = multiplicand[i+2]*multiplier[j] + mcarry

So, I think I see and agree how this works now;
 
Ok, great. My work here is now complete. 

I was interpreting your wording a bit too literally.
 
Many do--it is part of my Cassandra affliction {Always tell the truth, no one ever believes.} 

MitchAlsup

unread,
Apr 26, 2025, 1:52:53 PMApr 26
to RISC-V ISA Dev, MitchAlsup, Allen Baum, RISC-V ISA Dev, Vadím Sukhomlínov, Robert Finch
I wanted to add, then summarize, and then abstract this discussion:
----------------Addition------------------
But first I need to introduce a different but related way of attaching additional
information to an instruction. For this we have a short diversion into my DBLE
instruction for the Samsung GPU.

DBLE is an instruction that informs the subsequent instruction of 2 things::
a) perform twice as wide a calculation
b) the DBLE instruction provides the other register specifiers.

So,

          DBLE          R7,R19,R3
          ADD64       R9,R7,R22

is performed as::

          ADD128    {R7,R9},{R19,R7},{R3,R22}

This completely isolates register allocation from double-wide register
operands and results. The sequencing of the data path should be obvious.

-----------Abstraction---------------

So, there are ways one instruction can append additional OpCode bits
to a future instruction. CARRY is one such embodiment that made 
sense for the My 66000 ISA and goals of keeping the ISA small while
still providing means to express seldom used but intricate instruction
encoding. Depending on how YOUR ISA is formatted, it may be better
to use 3 bits rather than 2 bits, to have a shadow closer to 4 than 8, 
to provide more than 1 register specifier as the accumulator[s], and/or 
to allow additional registers to be attached using something akin to DBLE.
This is YOUR choice-space !

It extends your ISA giving the ability to compute twice-as-wide rather
efficiently, without consuming "lots" of OpCode space.
CARRY for double-wide accumulator style.
DBLE for double-wide random register style.

Invariably, HW function units will have to be able to digest said encodings,
so the entirety of the savings is in main ISA and the name-space of the
instructions themselves.

------------Justification--------------

But these attachment/appending means allow one to extend the ISA
without the dramatic consumption of OpCode space that would otherwise
be necessary. This kind of extension makes LITTLE sense when the
OpCode is used often (>3%) but is completely reasonable for OpCode
sequences that are seldom (<1%) used.

Instructions used less than 0.1% of the time should occupy no more than 1%
of ISA OpCode space. {My preference; to Handwaving accuracy}

On a 64-bit machine, having the capability of performing 128-bit calculations 
(integer and/or FP, and maybe memory) is an advantage that one is not willing 
to throw "that much" OpCode space at due to low (current) use rates.


Earl Killian

unread,
Apr 26, 2025, 4:02:28 PMApr 26
to RISC-V ISA Dev, MitchAlsup, Allen Baum, RISC-V ISA Dev, Vadím Sukhomlínov, Robert Finch
It is an interesting approach given your goals, but they aren't the goals I would think most important. For me, performance and lack of complexity would be more important than minimizing opcode usage. Complexity is detrimental to performance in some cases and also tends to lead to bugs, which is even more worrisome.

I suspect that the primary performance driver for integer multiply is crypto, but PQC is making RSA obsolete soon. I'm curious what other drivers of multiword integer multiplication performance people can identify?

If there are still important uses of multiword integer multiplication, we should look at how to do it in as few cycles as possible. I think the primary issue with multiword integer multiply performance is the latency of the multiply itself. My estimate is that a straightforward 256×256+512→512 integer multiply takes about 38 cycles for the 16 multiplies, with a 3-cycle 64×64+64+64→128 operation latency (i.e., an average of 2.375 cycles per multiply) in 50 instructions. This includes the loads (2 cycles latency) and stores. A lot of this time is waiting on the multiplication results. A small bit is just not being able to issue more than two multiplies per cycle. An interesting question is how far off an OoO RISC-V processor is from this using the existing ISA.
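
For reference, the operation under discussion written naively in C++ (my own sketch; the 16 multiplies in the inner loop and the carry propagation around them are what the cycle estimates here are scheduling):

```
#include <cstdint>

// 256x256+512 -> 512: p[0..7] += a[0..3] * b[0..3], schoolbook order.
static void mul256_acc(uint64_t p[8], const uint64_t a[4], const uint64_t b[4])
{
    for (int j = 0; j < 4; j++) {
        uint64_t mcarry = 0;
        for (int i = 0; i < 4; i++) {       // the 16 multiplies
            __uint128_t t = (__uint128_t)a[i] * b[j] + p[i + j] + mcarry;
            p[i + j] = (uint64_t)t;
            mcarry = (uint64_t)(t >> 64);
        }
        for (int k = j + 4; k < 8 && mcarry; k++) {  // propagate row carry
            __uint128_t t = (__uint128_t)p[k] + mcarry;
            p[k] = (uint64_t)t;
            mcarry = (uint64_t)(t >> 64);
        }
    }
}
```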

MitchAlsup

unread,
Apr 26, 2025, 7:30:03 PMApr 26
to RISC-V ISA Dev, Earl Killian, MitchAlsup, Allen Baum, RISC-V ISA Dev, Vadím Sukhomlínov, Robert Finch
On Saturday, April 26, 2025 at 3:02:28 PM UTC-5 Earl Killian wrote:
It is an interesting approach given your goals, but they aren't the goals I would think most important. For me, performance and lack of complexity would be more important than minimizing opcode usage. Complexity is detrimental to performance in some cases and also tends to lead to bugs, which is even more worrisome.

Fair enough. 

I suspect that the primary performance driver for integer multiply is crypto, but PQC is making RSA obsolete soon. I'm curious what other drivers of multiword integer multiplication performance people can identify?

Bignums, exact FP, ...

But these use-cases seem to have declined as the std. register width increased from 32 bits to 64 bits. 

If there are still important uses of multiword integer multiplication, we should look at how to do it in as few cycles as possible. I think the primary issue with multiword integer multiply performance is the latency of the multiply itself. My estimate is that a straightforward 256×256+512→512 integer multiply takes about 38 cycles for the 16 multiplies, with a 3-cycle 64×64+64+64→128 operation latency (i.e., an average of 2.375 cycles per multiply) in 50 instructions. This includes the loads (2 cycles latency) and stores. A lot of this time is waiting on the multiplication results. A small bit is just not being able to issue more than two multiplies per cycle. An interesting question is how far off an OoO RISC-V processor is from this using the existing ISA.

A couple of points: all the 5GHz machines have 4-5 cycle LDs, and no less than 4-cycle integer multiply. 
Other than contention for AGEN cycles, STs tend to vanish in the pipeline. {{None of this pertains to
integer multipliers that cannot start a new multiply every cycle}}

The integer multiplier should be fully pipelined--just like the FP multiplier. So, if we start multiply 1 in
cycle 1, we have the final multiply result available at cycle 20 of a single IMUL FU.

Following the 2D carry propagation and restricting ourselves to 64-bit additions but with carry-in
and carry-out controls:: I count 22 ADDs over the 16 double-wide multiplier products. Without carry-out
controls we need to clear carry-in 3 times.

Giving ourselves ADDC {Rdhi,Rdlo}, {Rs1hi,Rs1lo}, Rs2 {{add 64 bits to 128 bits with carry}}, the ADDs
decrease to 15: 12 adds of 64 bits into 128 bits plus 3 64-bit adds with carry.
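
In C++ terms the proposed primitive would behave like this sketch (my own modeling; hi/lo stand in for the paired destination registers):

```
#include <cstdint>

// ADDC {Rdhi,Rdlo} = {Rs1hi,Rs1lo} + Rs2: add 64 bits into 128 bits,
// returning the carry out of the top so chains can continue.
static inline uint64_t addc128(uint64_t& hi, uint64_t& lo, uint64_t rs2)
{
    lo += rs2;
    uint64_t c = lo < rs2;   // carry out of the low half
    hi += c;
    return hi < c;           // carry out of the high half
}
```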

With a 6-wide machine having 3 MEM units (4-cycle LDs), 1 IMUL unit (5-cycle latency), and 1 ADDC unit
(1-cycle latency), I count 26 cycles on a GBOoO machine-style core (assuming cache hits).

Were I to use the interior of IMUL to perform 256×256 directly, the multiply would take 25 cycles
((4 cycle LD latency, 16 multiplies and 5 cycle latency gives 21 cycles)). This leads me to believe 
that supporting SW in this large multiply endeavor (26 cycles compared to 25) is completely
reasonable. So what we need is something of the form "ADD 64 bits to 128 bits with carry in and 
out" as an instruction. {{But WARNING:: I am assuming that I can start the 256×256 multiply after
the first LD result is delivered and before the last has been delivered; the FU capturing them in
some reasonable order for subsequent multiplication.}}

The integrated 256×256 multiply would be lower power and allow other FUs to be utilized in
parallel with the multiplication--and this would reduce register reads, writes, and uses of the
forwarding path--so it could be justified by a power argument; but only if you use this a lot.

Contrary to the above: I found no problem launching all the LDs in a pipeline-timely fashion,
or pipelining the STs back to cache. {{My assumed pipeline separates STs into the AGEN
activity and a much later capture STdata activity--so STs can AGEN earlier giving the GBOoO
time to solve ST-to-LD forwarding and memory aliasing issues.}}

Doing the code blindly, you will run out of your 32 integer registers, and you need almost all
of them for a good pipeline schedule. So prologue and epilogue code will save and restore
all of the preserved registers, adding overhead not counted above for a subroutine implementation.
---

Earl Killian

unread,
Apr 26, 2025, 9:41:13 PMApr 26
to RISC-V ISA Dev, MitchAlsup, Earl Killian, Allen Baum, RISC-V ISA Dev, Vadím Sukhomlínov, Robert Finch
On Saturday, April 26, 2025 at 4:30:03 PM UTC-7 MitchAlsup wrote:
On Saturday, April 26, 2025 at 3:02:28 PM UTC-5 Earl Killian wrote:
I suspect that the primary performance driver for integer multiply is crypto, but PQC is making RSA obsolete soon. I'm curious what other drivers of multiword integer multiplication performance people can identify?

Bignums, exact FP, ...

Yes, but what are the applications for bignums and exact FP that show a performance sensitivity to this? I.e., what is the justification for spending effort on this?

FWIW, I think you're suggesting your cycle counts were based on ignoring the loads and stores ("starting multiply 1 in cycle 1"), and taking 16+22=38 cycles or 16+15=31 cycles. However, I think it takes 24 total cycles using the appropriate primitives (an A*B+C+CARRYIN multiply).

carry1a = dmul(a0, b0, p0, 0); // 1 (p0)
carry2a = dmul(a0, b1, p1, 0); // 2 (p1)
carry4a = dmul(a0, b3, p3, 0); // 4 (p3 b3)
carry2b = dmul(a1, b0, p1, carry1a); // 5 (p1)
carry3a = dmul(a0, b2, p2, carry2a); // 6 (dmul)
carry5a = dmul(a1, b3, p4, carry4a); // 7 (carry4a)
carry3b = dmul(a1, b1, p2, carry2b); // 9 (p2)
carry4b = dmul(a1, b2, p3, carry3a); // 10 (dmul)
carry6a = dmul(a2, b3, p5, carry5a); // 11 (dmul)
carry3c = dmul(a2, b0, p2, 0); // 12 (p2)
carry4c = dmul(a2, b1, p3, carry3b); // 13 (p3)
carry5b = dmul(a2, b2, p4, carry4b); // 14 (dmul)
carry7a = dmul(a3, b3, p6, carry6a); // 15 (dmul)
carry4d = dmul(a3, b0, p3, carry3c); // 16 (p3)
carry5c = dmul(a3, b1, p4, carry4c); // 17 (p4)
carry6b = dmul(a3, b2, p5, carry5b); // 18 (dmul)
carry8a = addc(p7, carry7a); // 18 (carry7a)
carry5d = addc(p4, carry4d); // 20 (p4)
carry6c = addc(p5, carry5c); // 21 (p5)
carry7b = addc(p6, carry6b); // 21 (carry6b)
carry6d = addc(p5, carry5d); // 22 (p5)
carry7c = addc(p6, carry6c); // 22 (p6 carry6c)
carry8b = addc(p7, carry7b); // 22 (carry7b)
carry7d = addc(p6, carry6d); // 23 (p6 carry6d)
carry8c = addc(p7, carry7c); // 23 (p7 carry7c)
carry8d = addc(p7, carry7d); // 24 (p7 carry7d)

where dmul and addc are defined (using C++ notation):

#include <cstdint>   // uint64_t; __uint128_t is a GCC/Clang extension

static inline uint64_t
dmul (uint64_t m, uint64_t n, uint64_t& a, uint64_t b)
{
  __uint128_t t = __uint128_t(m) * n + a + b;
  a = t;            // low 64 bits of the 128-bit result
  return t >> 64;   // high 64 bits: the carry into the next limb
} // dmul

static inline uint64_t
addc (uint64_t& a, uint64_t b)
{
  uint64_t t = a + b;
  b = t < b;        // carry out of the 64-bit add
  a = t;
  return b;
} // addc
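
A quick sanity check of those two primitives (my own test; assumes a compiler with __uint128_t and the definitions above in scope):

```
#include <cassert>
#include <cstdint>

int main()
{
    // dmul: (2^64-1)*2 + 7 + 1 = 2*2^64 + 6, so high word 2, low word 6
    uint64_t acc = 7;
    uint64_t hi = dmul(0xFFFFFFFFFFFFFFFFull, 2, acc, 1);
    assert(hi == 2 && acc == 6);

    // addc: adding 1 to 2^64-1 wraps to 0 with carry out 1
    uint64_t x = 0xFFFFFFFFFFFFFFFFull;
    uint64_t c = addc(x, 1);
    assert(c == 1 && x == 0);
    return 0;
}
```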

MitchAlsup

unread,
Apr 26, 2025, 10:17:25 PMApr 26
to RISC-V ISA Dev, Earl Killian, MitchAlsup, Allen Baum, RISC-V ISA Dev, Vadím Sukhomlínov, Robert Finch
I have a partial timing diagram that looks like::
{Note the ADD sequence is NOT COMPLETE}

............MEM[0]........MEM[1]........MEM[2]......IMUL..........................ALU
Cycle 1: LD mpcand[0], LD mpcand[1], LD mplier[0],
Cycle 2: LD mpcand[2], LD mpcand[3], LD mplier[1],
Cycle 3: LD mplier[2], LD mplier[3], LD p[0]
Cycle 4: LD p[1],      LD p[2],      LD p[3]
Cycle 5: LD p[4],      LD p[5],      LD p[6],      IMUL[0] mpcand[0],mplier[0]
Cycle 6: LD p[7]       STagen[0],    STagen[1]     IMUL[1]
Cycle 7: STagen[3],    STagen[4],    STagen[5]     IMUL[2]
Cycle 8: STagen[6],    STagen[7],                  IMUL[3]
Cycle 9:                                           IMUL[4]                     , MOV   p[0] = IMUL[0].lo
Cycle 10-18:                        STdata p[0]    IMUL[*]                     , ADDCo {p[2],p[1]} = {IMUL[1].lo,IMUL[0].hi} + IMUL[1].lo
Cycle 19:                           STdata p[1]    IMUL[15]                    , ADDC  p[2] = IMUL[1].hi + IMUL[2].lo + C
Cycle 20:                           STdata p[2]                                , ADDC  p[3] = IMUL[1].hi + IMUL[2].lo + C
Cycle 21:                           STdata[2]                                  , ADDC  p[3] +=
Cycle 22:                           STdata[3]                                  
Cycle 23:                           STdata[4]                                  
Cycle 24:                           STdata[5]                                  
Cycle 25:                           STdata[6]                                  
Cycle 26:                           STdata[7]

{{After the white space eater makes it completely unreadable}}

Still working on the order for the ADDs. Notice IMUL[0] starts in cycle 5.

Given 15 ADDs starting in cycle 10, 26 still looks reasonable. I have a timing figure (blocks and arrows)
indicating that I am in the right ballpark, but the ADD sequence I have not yet completed.

If you label the multiplies IMUL[0] = mplier[0] × mpcand[0] through IMUL[15] = mplier[3]×mpcand[3]

...........................IMUL[3],IMUL[2],IMUL[1],IMUL[0]      // horizontal first
..................IMUL[7],IMUL[6],IMUL[5],IMUL[4]               // vertical second
.........IMUL[11],IMUL[10],IMUL[9],IMUL[8]
IMUL[15],IMUL[14],IMUL[13],IMUL[12]

you want to perform the multiplies in the following order::

IMUL[0] first
IMUL[1] second
IMUL[4] third
IMUL[2] fourth
IMUL[5] fifth
...

0,1,4,2,5,8,3,6,9,12,7,10,13,11,14,15

So that the lower-order product terms arrive in time for the ADDCs to use as few registers
as possible, and this causes you to schedule the LDs appropriately.
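
That order is just the anti-diagonals of the 4×4 product grid; a few lines of C++ reproduce it (my own sketch, numbering IMUL[i+4*j] for mpcand[i]×mplier[j]):

```
#include <cstdio>

int main()
{
    // Visit the 4x4 partial-product grid by anti-diagonal (constant i+j),
    // printing 0 1 4 2 5 8 3 6 9 12 7 10 13 11 14 15.
    for (int d = 0; d <= 6; d++)
        for (int j = 0; j < 4; j++) {
            int i = d - j;
            if (i >= 0 && i < 4)
                std::printf("%d ", i + 4 * j);
        }
    std::printf("\n");
    return 0;
}
```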

Maybe tomorrow I will have a complete schedule.

Michael Chapman

unread,
Apr 27, 2025, 1:59:04 PMApr 27
to MitchAlsup, RISC-V ISA Dev, Earl Killian, Allen Baum, Vadím Sukhomlínov, Robert Finch

What would fit better into the RISC-V ISA way of doing things would be new add and subtract instructions which, instead of producing a sum, just produce the carry out of the result.
No extra machine state needs to be used.
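
A sketch of how such instructions would compose for multiword addition (my own C++ modeling, assuming a three-input form along the lines of the ADD/ADDGC pair described earlier in the thread):

```
#include <cstdint>

// Hypothetical "add, but write only the carry" -- no flags state needed.
static inline uint64_t add_gen_carry(uint64_t a, uint64_t b, uint64_t cin)
{
    __uint128_t t = (__uint128_t)a + b + cin;
    return (uint64_t)(t >> 64);  // 0 or 1 when cin is a carry bit
}

// n-limb addition: one ordinary (three-input) add for the sum limb plus one
// carry-generating add per limb; the running carry lives in a plain register.
static void addn(uint64_t r[], const uint64_t a[], const uint64_t b[], int n)
{
    uint64_t c = 0;
    for (int i = 0; i < n; i++) {
        uint64_t nc = add_gen_carry(a[i], b[i], c);
        r[i] = a[i] + b[i] + c;
        c = nc;
    }
}
```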

Most machines will be able to start at least one integer multiply per cycle. Latencies are typically 3 cycles except for low clock rate implementations.
For high performance implementations there are plenty of opportunities for parallel execution to reduce cycle latencies.

But if 256×256+512 multiply-adds are really important for your application, then maybe you should spend the transistors doing exactly that. You could use the registers in the vector unit and implement a standalone multiply/add unit for that, which would have a latency of about 14 cycles and could be fully pipelined for a throughput of one multiply-add (256×256+512) per cycle. The implementation (RTL and physical layout) is straightforward and largely regular.


MitchAlsup

unread,
Apr 27, 2025, 4:23:27 PMApr 27
to RISC-V ISA Dev, Earl Killian, MitchAlsup, Allen Baum, RISC-V ISA Dev, Vadím Sukhomlínov, Robert Finch
The cycle schedule for the 256×256→512 multiply using a double-wide IMUL and a single-wide ADD with carry:

............MEM[0]........MEM[1]........MEM[2]......IMUL..........................ALU
Cycle 1: LD mpcand[0], LD mpcand[1], LD mplier[0],
Cycle 2: LD mpcand[2], LD mpcand[3], LD mplier[1],
Cycle 3: LD mplier[2], LD mplier[3], STagen[0],
Cycle 4: STagen[1],    STagen[2],    STagen[3],
Cycle 5: STagen[4],    STagen[5],    STagen[6],    IMUL[0]  mpcand[0],mplier[0]
Cycle 6: STagen[7],                                IMUL[1]  mpcand[1],mplier[0]
Cycle 7:                                           IMUL[4]  mpcand[0],mplier[1]
Cycle 8:                                           IMUL[2]  mpcand[2],mplier[0]
Cycle 9:                                           IMUL[5]  mpcand[1],mplier[1] , MOV    p[0] =               IMUL[0].lo
Cycle 10:              STdata p[0]                 IMUL[8]  mpcand[0],mplier[2] , ADDco  t    = IMUL[0].hi  + IMUL[1].lo
Cycle 11:                                          IMUL[3]  mpcand[3],mplier[0] , ADDcio p[1] = t           + IMUL[4].lo
Cycle 12:              STdata p[1]                 IMUL[6]  mpcand[2],mplier[1] , ADDcio t    = IMUL[1].hi  + IMUL[3].lo
Cycle 13:                                          IMUL[9]  mpcand[1],mplier[2] , ADDcio t    = IMUL[2].hi  + t
Cycle 14:                                          IMUL[7]  mpcand[0],mplier[3] , ADDcio t    = t           + IMUL[4].lo
Cycle 15:                                          IMUL[10] mpcand[1],mplier[3] , ADDcio p[2] = t           + IMUL[5].lo
Cycle 16:              STdata p[2]                 IMUL[13] mpcand[2],mplier[2] , ADDcio t    = IMUL[3].hi  + IMUL[6].lo
Cycle 17:                                          IMUL[8]  mpcand[3],mplier[1] , ADDcio t    = IMUL[4].hi  + t
Cycle 18:                                          IMUL[12] mpcand[2],mplier[3] , ADDcio t    = t           + IMUL[7].lo
Cycle 19:                                          IMUL[14] mpcand[3],mplier[2] , ADDcio t    = IMUL[5].hi  + t
Cycle 20:                                          IMUL[15] mpcand[3],mplier[3] , ADDcio t    = t           + IMUL[8].lo
Cycle 21:                                                                       , ADDcio p[3] = t           + IMUL[9].lo
Cycle 22:              STdata p[3]                                              , ADDcio t    = IMUL[6].hi  + IMUL[10].lo
Cycle 23:                                                                       , ADDcio t    = IMUL[7].hi  + t
Cycle 24:                                                                       , ADDcio t    = t           + IMUL[11].lo
Cycle 25:                                                                       , ADDcio t    = IMUL[8].hi  + t
Cycle 26:                                                                       , ADDcio t    = IMUL[9].hi  + t
Cycle 27:                                                                       , ADDcio p[4] = t           + IMUL[12].lo
Cycle 28:              STdata p[4]                                              , ADDcio t    = IMUL[10].hi + IMUL[13].lo
Cycle 29:                                                                       , ADDcio t    = IMUL[11].hi + t
Cycle 30:                                                                       , ADDcio t    = IMUL[12].hi + t
Cycle 31:                                                                       , ADDcio p[5] = t           + IMUL[14].lo
Cycle 32:              STdata p[5]                                              , ADDcio t    = IMUL[13].hi + IMUL[15].lo
Cycle 33:                                                                       , ADDcio p[6] = IMUL[14].hi + t
Cycle 34:              STdata p[6]                                              , ADDci  p[7] = IMUL[15].hi + #0
Cycle 35:              STdata p[7]
Cycle 36:

I count only 35 cycles, 34 if we are only worried about when the final bits of the product 
are available. {Now if we could only eliminate the white space eater}

MitchAlsup

unread,
Apr 28, 2025, 3:28:19 PMApr 28
to RISC-V ISA Dev, MitchAlsup, Earl Killian, Allen Baum, RISC-V ISA Dev, Vadím Sukhomlínov, Robert Finch
I should mention that several of the carries are performed improperly.