In my Q+ architecture I handle extended precision using add instructions with three source operands.
ADD (a,b,c) => sum
And
ADDGC(a,b,c) => carry (add generate carry, two bits)
Both add instructions can be performed simultaneously on a machine more than one wide.
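A minimal sketch of how the ADD/ADDGC pair could behave, written as a Python model rather than Q+ code. The 64-bit register width, the helper names and the little-endian limb layout are assumptions made for illustration; only the two operations themselves come from the description above.

```python
XLEN = 64                      # assumed register width
MASK = (1 << XLEN) - 1

def add3(a, b, c):
    """Model of ADD (a,b,c) => sum: the low XLEN bits of a + b + c."""
    return (a + b + c) & MASK

def addgc(a, b, c):
    """Model of ADDGC(a,b,c) => carry: the bits above XLEN (0, 1 or 2)."""
    return (a + b + c) >> XLEN

def add_multiword(x, y):
    """Add two equal-length little-endian limb lists; returns (limbs, carry_out)."""
    carry = 0
    out = []
    for a, b in zip(x, y):
        # Both results depend only on (a, b, carry), so a machine more than
        # one wide could issue the two operations in the same cycle.
        out.append(add3(a, b, carry))
        carry = addgc(a, b, carry)
    return out, carry
```

The carry needs two bits because three full-width operands can sum to just under 3 * 2^XLEN.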
Something similar could be done in RISC-V using wider opcodes (48/64-bit).
But a simple add-with-carry instruction would be more compact. Adding a carry flag to the architecture would be a PITA, but maybe it could be done for the special cases of crypto / scientific extended precision. It is additional state that would need to be managed. If the extended-precision addition could be performed atomically, then the flag could be made temporary and its state would not need to be managed.
In Q+ there is an ATOM instruction modifier that makes a short sequence of instructions atomic, disabling interrupts between them. Therefore, extended precision arithmetic could be performed as an atom.
ATOM 3
ADD r1,r3,r5
ADC r2,r4,r6
MOV r7,C ; if C is needed later
Alternately the C flag could be resident in a CSR somewhere.
I think having an explicit carry flag would not be too much burden for an OoO machine supporting register renaming. The C flag would just need its own register designation to be renamed.
To unsubscribe from this group and stop receiving emails from it, send an email to isa-dev+u...@groups.riscv.org.
To view this discussion visit https://groups.google.com/a/groups.riscv.org/d/msgid/isa-dev/0d35c290-f337-4c48-a323-068045fea9e6%40gmail.com.
BGB, thank you for the very detailed analysis of the pros and cons of various designs. This is extremely insightful. And thank you for referencing Mill's "Belt Architecture". I had never heard of it before, but when I researched it I found it very close to the idea I came up with. The "Belt" representation (I referred to it as SSA, static single assignment) and direct addressing of instruction results seem like a very nice way to express parallelism in the architecture; the implementation can then decide how many different ALU blocks are needed for various purposes. Ivan seems to have made some prototypes to demonstrate the advantages, and I'm not sure why this approach is not being explored further (or maybe I'm wrong and it is).
Floating point constants: I didn't look into this area and don't have data on what constants are commonly used in computations.
Register Windows:
Unless the hardware has access to special facilities that the pipeline lacks, doing it in hardware tends to be worse for performance than having software do it. Load/Store Pair or dual-issue Load/Store seems to be better both for hardware cost and performance.
On Feb 14, 2025, at 11:57, 'Vadím Sukhomlínov' via RISC-V ISA Dev <isa...@groups.riscv.org> wrote:
And thank you for referencing Mill's "Belt Architecture”
Imagine the basic block descriptor included, in addition to what SecureRISC includes, the set of source registers used by the basic block, and the set of output registers of the basic block. Renaming could be done for the basic block as a whole, rather than on each instruction in the block. Within the basic block, instruction sources would either reference the Nth source register to the block or the result of the Nth instruction local to the basic block. Instructions would not need explicit destination register fields as a result (this would be in the basic block descriptor).
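A toy model of that idea, purely as a sketch: the descriptor layout, the `("src", n)` / `("res", n)` operand notation and the mapping from output registers to instruction indices are all hypothetical, chosen only to show that instructions can run without destination fields.

```python
import operator

OPS = {"add": operator.add, "sub": operator.sub, "mul": operator.mul}

def run_block(block, regfile):
    """Execute one basic block whose operands name block inputs or local results."""
    # Source registers are read once, as a set, when the block is entered.
    inputs = [regfile[r] for r in block["sources"]]
    results = []

    def fetch(ref):
        kind, n = ref
        return inputs[n] if kind == "src" else results[n]

    # Instructions carry no destination field; result N is simply instruction N.
    for op, a, b in block["instructions"]:
        results.append(OPS[op](fetch(a), fetch(b)))

    # Only the descriptor says which results leave the block, and where they go.
    for reg, n in block["outputs"]:
        regfile[reg] = results[n]
    return regfile

block = {
    "sources": ["x5", "x6"],
    "instructions": [
        ("add", ("src", 0), ("src", 1)),   # result 0 = x5 + x6
        ("mul", ("res", 0), ("src", 0)),   # result 1 = result 0 * x5
    ],
    "outputs": [("x7", 1)],
}
```

In this model only the registers named in `sources` and `outputs` would need renaming, once per block.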
SecureRISC’s basic block processing reminded me of this paper:
Where code is broken up into “braids” which are executed in parallel.
--
You received this message because you are subscribed to the Google Groups "RISC-V ISA Dev" group.
Hi All,
When implementing big-number operations on RISC-V (e.g. for cryptography or scientific applications), the lack of a carry flag makes the code relatively inefficient compared to x86, where the `adc` instruction exists. On RISC-V you need to compute the carry explicitly and add it in separately, which results in code like this for the common primitive add_with_carry(a,b,carry)->(sum, carry):
```
// In:  a0 = a, a1 = b, a2 = carry
// Out: a0 = (a+b+carry), a1 (carry) = (a+b+carry)>>XLEN
add a3, a2, a0
sltu a2, a3, a2
add a0, a3, a1
sltu a1, a0, a3
add a1, a1, a2
```
On x86 that would be a single `adc` instruction taking both operands and carry flag in CF.
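For readers who want to check the RISC-V sequence, here is a small Python model of it, one line per instruction. XLEN = 64 is an assumption; the Python variable names are illustrative stand-ins for the registers named in the comments.

```python
XLEN = 64                      # assumed register width
MASK = (1 << XLEN) - 1

def add_with_carry(a, b, carry):
    """Mirror of the five-instruction sequence; returns (sum, carry_out)."""
    a3 = (carry + a) & MASK    # add  a3, a2, a0
    c1 = int(a3 < carry)       # sltu a2, a3, a2  (wrapped on the first add?)
    s = (a3 + b) & MASK        # add  a0, a3, a1
    c2 = int(s < a3)           # sltu a1, a0, a3  (wrapped on the second add?)
    return s, c1 + c2          # add  a1, a1, a2
```

With a carry-in of 0 or 1, at most one of the two additions can wrap, so the returned carry is also 0 or 1.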
My proposal is to introduce `addlo` and `addhi` instructions, which would replace the five-instruction RISC-V sequence above with:
CARRY R16,{{O}{IO}{IO}{I}}
ADD R12,R4,R8 // carry Out only
ADD R13,R5,R9 // Carry In and Out
ADD R14,R6,R10 // Carry In and Out
ADD R15,R7,R11 // Carry In only
CARRY can be applied to shifts, extracts, inserts, ADD, SUB, MUL and DIV. Over
on the FP side of the ISA it provides access to exact floating point calculations
and Kahan-Babuška summation.
So the addition of a single instruction-modifier provides access to 30-odd
multi-precision calculations.
The important thing is that the carry result contains more than 1 bit!
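A rough Python model of the four-ADD example above, to show how the per-instruction {O}/{IO}/{I} annotations steer the carry register. This is only a sketch: the string form of the annotations, the function name and the dictionary register file are assumptions, and the real encoding of the CARRY modifier is not modeled.

```python
XLEN = 64                      # assumed register width
MASK = (1 << XLEN) - 1

def run_carry_shadow(regs, carry_reg, actions, adds):
    """Apply ADDs under a CARRY shadow.

    regs      : dict of register name -> value (updated in place)
    carry_reg : register named by the CARRY modifier (R16 in the example)
    actions   : one of "O", "IO", "I" or "" per following instruction
    adds      : list of (rd, rs1, rs2) tuples
    """
    for action, (rd, rs1, rs2) in zip(actions, adds):
        cin = regs[carry_reg] if "I" in action else 0
        total = regs[rs1] + regs[rs2] + cin
        regs[rd] = total & MASK
        if "O" in action:
            regs[carry_reg] = total >> XLEN   # for ADD this is at most 1
    return regs

actions = ["O", "IO", "IO", "I"]
adds = [("R12", "R4", "R8"), ("R13", "R5", "R9"),
        ("R14", "R6", "R10"), ("R15", "R7", "R11")]
```

For ADD the carry register only ever holds 0 or 1 in this model; the wider carry mentioned above matters for operations such as MUL and shifts, which this sketch leaves out.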
Carry needs to be a CSR so it can be saved and restored during traps and context switches.
If used with LI or LUI, it could also be used to generate long immediates, Transputer style.
If you need the register anyway, there’s no point to using a shadow GPR along with a tag. Just make it a CSR, which will simplify both hardware and save/restore software.
Yes, indeed, one needs both the CARRY shadow (16 bits) and the CARRY register (5 bits) to retain semantics across interruption. The shadow is a shift register that strips off 2 bits per instruction in DECODE.
If you're doing something from scratch, sure, that might make sense, but this did start on a RISC-V thread, so I was thinking from that perspective. The Mill architecture has explicit register/belt port borrowing behavior for things like this, which you probably know.
There are other problems with the CSR approach, of course (having to rename it in a wide OoO implementation), and effectively switching to another bank of registers is a mass renaming on top of reserving an extra rename register file that can't(?) be used during non-trapping operations. (I haven't thought about this as much as you, so that could be all wrong.)
On Tue, Apr 15, 2025 at 6:28 PM 'MitchAlsup' via RISC-V ISA Dev <isa...@groups.riscv.org> wrote:
snip
I may be wrong, but it is not necessary to rename the CSRs if one is willing to live with some performance loss. For my core I do not have the CSRs renamed; the way I see it, they are rarely used registers. They are updated only at the end of the pipeline. In software, CSR accesses are surrounded with fence (sync) instructions. With potentially thousands of CSRs it is not practical to rename them all.
I borrowed the carry idea from the My66000, but have only three dedicated carry registers. The core saves the instruction in a CSR on exceptions, to allow the state to be restored.
What happens if there is a branch out of the carry shadow? Do instructions in a called routine get affected by the carry?
snip
Ah, I see your point; for that particular approach, you want multiple simultaneous carries.
I'd argue that RISC-V already solves that problem using the MULH instruction.
Both approaches enable a 3-source, 2-destination datapath, yet only require the actual register file to be 2-source, 1-destination.
That isn't the issue. I haven't had time to think about some of those details, but your example of spanning could work with either method. If you have multiple live carries, then you effectively need to store the tagged temporary and load the new one, which sounds a lot like a CSRRW op to me, and that doesn't need a new operation. You do need to identify the CSR as being "live" for input and output, which is what your new ops do also.
snip
Yes, there are many design choices. My primary concern with CSRs is the potential impact on code size and the need to store their content on context switches. I'm less familiar with implementation costs in HW.
To compare approaches, I'd look at how 2 XLEN, 4 XLEN, 8 XLEN, 16 XLEN, etc. operations would look under each approach in terms of number of instructions, performance and probably energy consumption (actual ALU operations, etc.).
I'd think that dedicated CSRs serving as both carry input and carry destination can indeed be efficient. You'd need one CSRW to set the incoming carry to zero, each `addc` computes a 3-way addition, and the final high bit/carry would be read by a CSRR. And you could add several CSRs of this kind, as well as `addcX` variants to select which one to use. This would also open a path to an integer FMA.
Practically, if you have a way to define different 3-source, 2-destination operations, you can construct useful things, including an equivalent to x86's DIV, which takes a 2XLEN dividend and an XLEN divisor to produce the remainder and quotient: a nice improvement for big number division.
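The CSRW / `addc` / CSRR pattern described above can be sketched as a Python model. The class and function names are hypothetical, XLEN = 64 is assumed, and `addc` here stands for the proposed instruction, not an existing RISC-V one.

```python
XLEN = 64                      # assumed register width
MASK = (1 << XLEN) - 1

class CarryCsr:
    """Hypothetical carry CSR that is both carry input and carry destination."""
    def __init__(self):
        self.value = 0

    def csrw(self, v):         # models a CSRW to the carry CSR
        self.value = v & MASK

    def csrr(self):            # models a CSRR from the carry CSR
        return self.value

def addc(csr, a, b):
    """Proposed addc: 3-way add of a, b and the CSR; the high part goes back to the CSR."""
    total = a + b + csr.value
    csr.value = total >> XLEN
    return total & MASK

def add_nlimb(x, y):
    """n-limb addition: one CSRW, n addc, one CSRR."""
    csr = CarryCsr()
    csr.csrw(0)
    out = [addc(csr, a, b) for a, b in zip(x, y)]
    return out, csr.csrr()
```

Not counting loads and stores, an n-limb add in this model is n + 2 instructions, against up to 5 per limb for the base-ISA sequence from the original post.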
On Fri, Apr 18, 2025 at 8:14 AM Allen Baum <allen...@esperantotech.com> wrote:
snip
Nice examples; the only issue is really whether the carry state is mapped to a GPR or a CSR.
The carry control (the bit-pairs that specify o, i, io, ~) has to be mapped to a CSR; I'd expect the tag (which register is used as carry in/out) to be in that CSR.
That does reduce the number of ops that can fit into that CSR, but that's pretty irrelevant.
You could maximize the number of ops that are dependent to (say) 3 or 4 and have multiple tags. The implementation would be a bit tricky, as the tag change requires saving the previous carry value and loading another - effectively inserting an op into the istream vs. making it explicit in the istream.
Conversely, if you don't need a tag, then you can use the immediate form of the CSR instructions to load the common cases with a single CSR op instead of an LI, CARRY sequence.
In any case, save/restore should be easier without a tag, because you know which register needs to be saved and restored (the CSR), vs. having to read the tag from the CARRY CSR and then having to case_branch to a routine to save that specific general register (assuming, of course, that only SREGs can use this facility) and do the same for restore.
On Fri, Apr 18, 2025 at 10:53 AM 'MitchAlsup' via RISC-V ISA Dev <isa...@groups.riscv.org> wrote:
snip
I stand corrected; for the save/restore, you can be oblivious of the tag value because you will be saving/restoring the associated register regardless.
I was assuming that the carry would be referenced only at the end of a sequence, not at an intermediate time.
Referencing it at an intermediate time is tricky because the instruction that references it is still subject to the 2 action bits, so it takes extra "--" action bits to use it.
It has to be treated as both an extra input and as a bypass register; it really is a 2nd bypass register that you have to keep in flight.
The number of effective registers you list assumes that they are all in registers simultaneously, and that you aren't preloading them from the cache while executing out of registers, i.e. the effective number of GPRs can be larger than you suggest. I don't know if that matters in practice, since you need to have the prefetch ops use extra "--" encodings. I do agree that having 4 instructions under a shadow will cover the majority of cases.
Trying to get multiple, otherwise independent streams to operate in parallel seems possible also.
Note that I am concerned only how this fits into a RISC-V implementation,
On Monday, April 21, 2025 at 5:04:38 PM UTC-5 Allen Baum wrote:
....
It has to be treated as both an extra input and as a bypass register; it really is a 2nd bypass register that you have to keep in flight.
Yes, correct. But in addition, each FU has this same forwarding loop back to itself, so IMUL and ADD can run simultaneously.
Also note: higher-end machines can completely "perform" CARRY instructions in DECODE, which is why I call them an instruction-modifier rather than an instruction.
The number of effective registers you list assumes that they are all in registers simultaneously, and you aren't preloading them from the cache while executing out of registers. i.e. the effective number of GPRs can be larger than you suggest. I don't know if that matters, in practice, since you need to have the prefetch ops use extra "--" encodings. I do agree that having 4-instruction under a shadow will cover the majority of cases.
This is my predilection: treat single values that occupy multiple registers as if they were a single register, and a single register is always loaded prior to calculation on it.
Trying to get multiple, otherwise independent streams to operate in parallel seems possible also,
Large multiplication uses 2 CARRYs, one feeding back to the IMUL, making it an IMAC, the other converting ADD Rd,Rs1,Rs2 into ADD {Rcarry,Rd},Rs1,{Rcarry,Rs2}.
Note that I am concerned only how this fits into a RISC-V implementation,
Yes, that is why I prefaced my extended comments as they don't pertain to RISC-V. And while (eventually) I might be a competitor, I still want RISC-V to be as good as it can be.
so the universal constants and HW save/restore in your architecture isn't a thing.
I think HW save/restore will become a thing when CLIC gets finalized.
On Fri, Apr 18, 2025 at 1:43 PM 'MitchAlsup' via RISC-V ISA Dev <isa...@groups.riscv.org> wrote:
snip
inline
On Mon, Apr 21, 2025 at 5:34 PM 'MitchAlsup' via RISC-V ISA Dev <isa...@groups.riscv.org> wrote:
On Monday, April 21, 2025 at 5:04:38 PM UTC-5 Allen Baum wrote:
....
It has to be treated as both an extra input and as a bypass register; it really is a 2nd bypass register that you have to keep in flight.
Yes, correct. But in addition, each FU has this same forwarding loop back to itself, so IMUL and ADD can run simultaneously.
I'm not understanding this comment; could you give an example of how they would be used in parallel instead of serially?
Also note: higher-end machines can completely "perform" CARRY instructions in DECODE, which is why I call them an instruction-modifier rather than an instruction.
I'm interpreting that to mean it will expand a 2-in, 1-out add into a 3-in, 2-out add with the appropriate register numbers filled in. Sometimes the extra input will use a bypass reg; sometimes, though, you might need an extra cycle to fetch it (e.g. on return from a context switch). That might come for free when the CARRY CSR gets restored (it would move the carry GPR into a bypass).
Cute.
On Fri, Apr 18, 2025 at 1:43 PM 'MitchAlsup' via RISC-V ISA Dev <isa...@groups.riscv.org> wrote:
snip
There is an explicit dependency on the product variable that prevents them from being done in parallel (for the first and last iterations).
But even if you do, both acarry and mcarry exist simultaneously, and that means you have to be able to have 2 carry registers active at the same time, and there is no way to distinguish them in the operations. You would need to expand your 2-bit io field to a 3-bit field and have two tags, at the very least.
Whether you put the 2-bit field into the opcode or expand it to a register number is most likely a distinction without a difference.
You'd really have to go into the weeds to figure out which is the right way to do it.
I must be misunderstanding the argument to the carry instruction then. I thought there were "N" 2-bit fields and a single register tag. Is there instead a single 2-bit field per 5-bit tag? (So you could store at most 4 sets in a 32b register.)
Otherwise, you need to switch back and forth between the mcarry and the carry between each instruction that you're trying to run in parallel. What am I missing here?
Modifiers remind me of the ATNI (add-to-next instruction) of the Apollo guidance computer. They are really just expanding the bits available of the following instructions. One could just make the instructions wider, but that would waste bits for cases where the wider instructions are not needed.
I suppose if a variable length instruction were used, one could just have a:
ADD Rco,Rd,Rs1,Rs2,Rci
instruction. Maybe a 48-bit instruction format would work. I recall reading a post saying that there was a movement towards 3-read, 2-write port instructions for some newer designs, at least for their internal workings, with the extra ports hidden by the ISA unless needed.
To get five register specs in a 32-bit ISA, it might be better to have only 16 regs. It would cost a little bit in performance, but may be more humanly understandable.
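The bit budget behind that remark can be checked with a few lines of Python. The helper name is made up for illustration; the only outside fact used is that RISC-V's 32-bit formats spend 7 bits on the major opcode.

```python
INSTRUCTION_BITS = 32
MAJOR_OPCODE_BITS = 7          # RISC-V 32-bit formats use a 7-bit major opcode

def register_field_bits(num_regs, num_fields):
    """Total bits needed for num_fields register specifiers."""
    bits_per_field = (num_regs - 1).bit_length()
    return bits_per_field * num_fields

def spare_bits(num_regs, num_fields=5):
    """Bits left for function codes after the opcode and register fields."""
    used = MAJOR_OPCODE_BITS + register_field_bits(num_regs, num_fields)
    return INSTRUCTION_BITS - used
```

With 32 registers, five specifiers take 25 bits and leave nothing after the major opcode; with 16 registers they take 20 bits and leave 5.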
not helping. The carry instruction has a single "destination" register (which is the tag) and a 16-bit "immediate" (which isn't used as a typical immediate).
Mitch has described a prefix instruction; there are many in x86, and only one has been proposed (but not ratified, I think) for RISC-V (so far...).
There are other commercial architectures that have prefix instructions (e.g. Transputer, RPM40).
What makes Mitch's idea stand out is that his prefix
- covers multiple following instructions (not just one) and
- effectively adds new data inputs/outputs (i.e. register ports).
To be clear, I said "effectively adds", not "necessarily". How you implement it is different; you don't need to actually add a (full) register port, and I understand that. It could all just live in the bypass logic and never go to the physical register file. Or not... The fact that you could implement them as an actual read from a register file (and you'd have to sometimes) implies that that is the effect. So we agree there.
And I just figured out how 2 carry registers can be active at the same time. They alias to real registers, so a second CARRY op with a different Rd will (*effectively*...) just write the previous carry back into its associated register, and read the new one from the register file. So, a little renaming magic lets you keep more balls up in the air.
This is the inner loop you showed:
{mcarry, product} = multiplicand[i]*multiplier[j] + mcarry;
{acarry, sum[i+j]} = {sum[i+j], acarry} + product;
You can't have the above execute in parallel. But you can pipeline them:
{mcarry, product} = multiplicand[i]*multiplier[j] + mcarry;
{acarry, sum[i+j]} = {sum[i+j], acarry} + product;      {mcarry, product} = multiplicand[i+1]*multiplier[j] + mcarry;
{acarry, sum[i+1+j]} = {sum[i+1+j], acarry} + product;  {mcarry, product} = multiplicand[i+2]*multiplier[j] + mcarry
So, I think I see and agree how this works now; I was interpreting your wording a bit too literally.
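The inner loop with its two live carries can be written out as a Python model to check that it really computes a multiword product. This is a sketch under assumptions: 64-bit limbs, little-endian limb lists, and `total` standing in for the `sum` array so as not to shadow Python's built-in.

```python
W = 64                         # assumed limb width
MASK = (1 << W) - 1

def multiword_mul(multiplicand, multiplier):
    """Schoolbook multiply keeping mcarry and acarry as two separate carries."""
    n, m = len(multiplicand), len(multiplier)
    total = [0] * (n + m)
    for j in range(m):
        mcarry = 0             # carry fed back into the multiply (IMUL acting as IMAC)
        acarry = 0             # carry fed back into the add
        for i in range(n):
            # {mcarry, product} = multiplicand[i]*multiplier[j] + mcarry
            t = multiplicand[i] * multiplier[j] + mcarry
            product, mcarry = t & MASK, t >> W
            # {acarry, sum[i+j]} = {sum[i+j], acarry} + product
            s = total[i + j] + product + acarry
            total[i + j], acarry = s & MASK, s >> W
        # Limb n+j has not been written yet; both leftover carries land there.
        total[n + j] = mcarry + acarry
    return total

def limbs_to_int(limbs):
    return sum(v << (W * k) for k, v in enumerate(limbs))
```

Within one row, the multiply for limb i+1 depends only on mcarry while the add for limb i depends only on acarry and product, which is the overlap the pipelined schedule above exploits.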
It is an interesting approach given your goals, but they aren't the goals I would think most important. For me, performance and lack of complexity would be more important than minimizing opcode usage. Complexity is detrimental to performance in some cases and also tends to lead to bugs, which is even more worrisome.
I suspect that the primary performance driver for integer multiply is crypto, but PQC is making RSA obsolete soon. I'm curious what other drivers of multiword integer multiplication performance people can identify?
If there are still important uses of multiword integer multiplication, we should look at how to do it in as few cycles as possible. I think the primary issue with multiword integer multiply performance is the latency of the multiply itself. My estimate is that a straightforward 256×256+512→512 integer multiply takes about 38 cycles for the 16 multiplies, with a 3-cycle 64×64+64+64→128 operation latency (i.e., an average of 2.375 cycles per multiply) in 50 instructions. This includes the loads (2 cycles latency) and stores. A lot of this time is waiting on the multiplication results. A small bit is just not being able to issue more than two multiplies per cycle. An interesting question is how far off an OoO RISC-V processor is from this using the existing ISA.
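To make the operation count concrete, here is a Python sketch of a 256×256+512→512 multiply-add built from a fused 64×64+64+64→128 primitive. The function names are illustrative, the result is taken modulo 2^512, and the model says nothing about cycle counts; it only shows that 16 fused operations cover the 4×4 limb products.

```python
MASK64 = (1 << 64) - 1

def mac64(a, b, c, d):
    """64x64+64+64 -> 128: returns (low, high); the sum always fits in 128 bits."""
    t = a * b + c + d
    return t & MASK64, t >> 64

def to_limbs(x, n):
    return [(x >> (64 * k)) & MASK64 for k in range(n)]

def from_limbs(limbs):
    return sum(v << (64 * k) for k, v in enumerate(limbs))

def mul256_add512(a, b, c):
    """Return ((a*b + c) mod 2**512, number of fused multiply-adds used)."""
    A, B, acc = to_limbs(a, 4), to_limbs(b, 4), to_limbs(c, 8)
    macs = 0
    for j in range(4):
        carry = 0
        for i in range(4):
            acc[i + j], carry = mac64(A[i], B[j], acc[i + j], carry)
            macs += 1
        # Fold the row's final carry into the upper limbs (plain adds, no multiplies).
        k = j + 4
        while carry and k < 8:
            s = acc[k] + carry
            acc[k], carry = s & MASK64, s >> 64
            k += 1
    return from_limbs(acc), macs
```

The fused form cannot overflow because (2^64 - 1)^2 + 2(2^64 - 1) = 2^128 - 1, and 38 cycles over 16 multiplies is the 2.375 cycles per multiply quoted above.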
On Saturday, April 26, 2025 at 3:02:28 PM UTC-5 Earl Killian wrote:
I suspect that the primary performance driver for integer multiply is crypto, but PQC is making RSA obsolete soon. I'm curious what other drivers of multiword integer multiplication performance people can identify?
Bignums, exact FP, ...
What would fit better into the RISC-V ISA way of doing things
would be new add and subtract instructions which instead of
producing a result just produce the carry from the result.
No extra machine state needs to be used.
Most machines will be able to start at least one integer multiply
per cycle. Latencies are typically 3 cycles except for low clock
rate implementations.
For high performance implementations there are plenty of
opportunities for parallel execution to reduce cycle latencies.
But if 256x256+512 multiply-adds are really important for your
application, then maybe you should spend the transistors doing
exactly that. You could use the registers in the vector unit and
implement a standalone multiply/add unit for that, which would have
a latency of about 14 cycles and could be fully pipelined for a
throughput of one multiply-add (256*256+512) per cycle. The
implementation (RTL and physical layout) is straightforward and
largely regular.