Multi-word addition with a strict Ra = Rb op Rc ISA

Thomas Koenig

27.12.2021, 08:24:52

Triggered by this post

https://gmplib.org/list-archives/gmp-devel/2021-September/006013.html

(where somebody discusses, in rather stringent terms, the performance
of RISC-V for multiword addition in GMP), I wonder:

RISC-V has (from that post) the instruction sequence

add t0, a4, a6 // add low words
sltu t6, t0, a4 // compute carry-out from low add

so one would need to wait for the result of the add to compare
to generate the carry.
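In C terms, the two-instruction idiom can be sketched roughly as follows. The names add_result and add_with_carry_out are made up for illustration and are not from the GMP post; the point is only that the comparison consumes the sum, which is the dependency in question.

```c
#include <stdint.h>

/* Result of adding two 64-bit words: the low 64 bits and the carry-out. */
typedef struct {
    uint64_t sum;
    uint64_t carry; /* 0 or 1 */
} add_result;

/* Mirrors the add/sltu pair: the carry is derived from the sum,
   so the comparison cannot start until the addition has finished. */
static add_result add_with_carry_out(uint64_t a, uint64_t b)
{
    add_result r;
    r.sum = a + b;        /* add  t0, a4, a6 (wraps modulo 2^64) */
    r.carry = r.sum < a;  /* sltu t6, t0, a4 */
    return r;
}
```

The unsigned wrap-around is what makes the comparison work: the sum is smaller than an operand exactly when the addition overflowed.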

Would it not be better to have an "add and generate the carry only"
instruction instead, so that one could write, hypothetically,

add t0, a4, a6
addc t6, a4, a6

so there would be no interdependency of the instructions (and it
would be an obvious candidate for instruction fusion)?

Or am I missing something obvious here?
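A minimal C model of what such a carry-only instruction would compute, assuming it returns just the carry-out of the unsigned addition. The name addc_model is hypothetical. The sketch derives the carry from the two operands alone, so nothing in it depends on the sum.

```c
#include <stdint.h>

/* Hypothetical "addc": returns only the carry-out of a + b.
   a + b overflows 64 bits exactly when a > UINT64_MAX - b, and
   UINT64_MAX - b equals ~b, so the sum itself is never needed. */
static uint64_t addc_model(uint64_t a, uint64_t b)
{
    return a > (uint64_t)~b;
}
```

The same comparison can be expressed with existing instructions (a bitwise NOT followed by sltu), which also removes the dependency on the add, but at the cost of one more instruction.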

robf...@gmail.com

27.12.2021, 09:12:03

Not missing anything, except that SLTU is an existing instruction and
needs no additional hardware or opcode, which may be important when
trying to keep the implementation small. Personally, I think RISC-V tries too
hard to be small, leaving out some common features of other architectures.
But RISC-V can always be extended using custom instructions.
I would argue for adding ADDC as a custom instruction. It is reminiscent of
the MULH instruction, which returns the high-order bits. So maybe ADDH
rather than ADDC: when I see ADDC I think of add-with-carry. A SUBH
instruction could be added as well.

Anton Ertl

27.12.2021, 13:51:20

Thomas Koenig <tko...@netcologne.de> writes:
>Triggered by this post
>
>https://gmplib.org/list-archives/gmp-devel/2021-September/006013.html
>
>(where somebody discusses, in rather stringent terms, the performance
>of RISC-V for multiword addition in GMP), I wonder:
>
>RISC-V has (from that post) the instruction sequence
>
>
> add t0, a4, a6 // add low words
> sltu t6, t0, a4 // compute carry-out from low add

That's only if you don't have a carry-in (i.e., a half-adder). For a
full-adder, it's AFAIK a four-instruction idiom (hmm, the link you
give uses five instructions for the full addition that follows the
half addition at the start).
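For illustration, here is one way to write the carry-in case in C. This particular formulation uses five operations per limb (two adds, two compares, one OR). It is a sketch, not the sequence from the linked post, and shorter idioms may exist. The names add_limb and add_n are invented here.

```c
#include <stddef.h>
#include <stdint.h>

/* One limb of a multi-word add: returns a + b + carry_in and updates
   *carry with the carry-out. *carry must be 0 or 1 on entry. */
static uint64_t add_limb(uint64_t a, uint64_t b, uint64_t *carry)
{
    uint64_t s = a + b;       /* add */
    uint64_t c1 = s < a;      /* sltu: carry from a + b */
    uint64_t r = s + *carry;  /* add */
    uint64_t c2 = r < s;      /* sltu: carry from adding carry_in */
    *carry = c1 | c2;         /* or: at most one of c1, c2 can be set */
    return r;
}

/* dst = x + y over n limbs, least significant limb first.
   Returns the final carry-out. */
static uint64_t add_n(uint64_t *dst, const uint64_t *x,
                      const uint64_t *y, size_t n)
{
    uint64_t carry = 0;
    for (size_t i = 0; i < n; i++)
        dst[i] = add_limb(x[i], y[i], &carry);
    return carry;
}
```

Each limb has to wait for the carry of the previous one, so the per-limb overhead sits directly on the critical path.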

The original MIPS and Alpha also needed these sequences; AFAIK
recently MIPS has added a carry flag to make this stuff more
efficient.

>so one would need to wait for the result of the add to compare
>to generate the carry.
>
>Would it not be better to have an "add and generate the carry only"
>instruction instead, so that one could write, hypothetically,
>
> add t0, a4, a6
> addc t6, a4, a6

That helps a little bit. But I think the RISC-V answer is that
instead of adding that, they fuse the add and the sltu to produce the
two results in one cycle, without needing to add a new instruction.

The full-adder overhead is more worrisome; one could also do that with
instruction fusion, but are they going to do it, and how well will it
work?

In the thread starting at
<2021Mar1...@mips.complang.tuwien.ac.at> we have discussed adding
an extra bit to each register to allow an efficient add-with-carry
(and other things), without adding a special-purpose flag register to
the architecture (which is against MIPS-style architectural
principles).

- anton
--
'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
Mitch Alsup, <c17fcd89-f024-40e7...@googlegroups.com>

MitchAlsup

27.12.2021, 14:09:26

Still not as efficient as My 66000:
<
CARRY R16,{{I}{IO}{IO}{O}}
ADD R12,R4,R8 // carry Out only
ADD R13,R5,R9 // Carry In and Out
ADD R14,R6,R10 // Carry In and Out
ADD R15,R7,R11 // Carry In only

Thomas Koenig

27.12.2021, 17:46:25

Anton Ertl <an...@mips.complang.tuwien.ac.at> wrote:

> In the thread starting at
><2021Mar1...@mips.complang.tuwien.ac.at> we have discussed adding
> an extra bit to each register to allow an efficient add-with-carry
> (and other things), without adding a special-purpose flag register to
> the architecture (which is against MIPS-style architectural
> principles).

I remember that discussion, and that is probably the best way.
So, the add with carry would be something like

add t0, r1, r2
add t1, r3, r4
addcf t1, t1, t0

where "addcf" would add the carry flag from t0 to t1.
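A small C model of that sequence, treating each register as a 64-bit value plus one extra carry bit. The type xreg and the functions xadd and addcf are assumptions made for this sketch. In particular, how addcf sets the carry bit of its own result is a guess: here it keeps a carry if either the earlier add or the increment overflowed.

```c
#include <stdint.h>

/* A general-purpose register widened by one carry bit (assumed layout). */
typedef struct {
    uint64_t value;
    unsigned carry; /* 0 or 1 */
} xreg;

/* add: plain addition that also records the carry-out in the result. */
static xreg xadd(xreg a, xreg b)
{
    xreg r;
    r.value = a.value + b.value;
    r.carry = r.value < a.value;
    return r;
}

/* addcf: add the carry bit of src to acc. Assumption: the result keeps
   a carry if either the earlier add or this increment overflowed. */
static xreg addcf(xreg acc, xreg src)
{
    xreg r;
    r.value = acc.value + src.carry;
    r.carry = acc.carry | (r.value < acc.value);
    return r;
}
```

With this model, the three-instruction example becomes t0 = xadd(r1, r2); t1 = xadd(r3, r4); t1 = addcf(t1, t0);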

From a compiler perspective, it would probably be best to
model this as a separate condition register with one bit.

(There is historical precedent, the accumulator on the 704
had something like that).

BGB

27.12.2021, 21:45:30

High on my list:
Scaled-Index Load/Store;
Encodings to allow for efficiently using large immediate values.

Say (large-immediate encodings):
zzzz-IIII_(xzzz-zztt-ttts-ssss_0yyy-dddd-d011-1111) (1)
IIII-IIII_(xzzz-zztt-ttts-ssss_1yyy-dddd-d011-1111) (2)
zzzz-IIII_IIII-IIII_(zzzz-zztt-ttts-ssss_0001-dddd-d111-1111) (3)
IIII-IIII_IIII-IIII_(zzzz-zztt-ttts-ssss_1001-dddd-d111-1111) (4)

1: Represents a 64-bit encoding space with 16/17-bit immediates.
2: Represents a 64-bit encoding space with 32/33-bit immediates.
3: Represents a 96-bit encoding space with 48-bit immediates.
4: Represents a 96-bit encoding space with 64-bit immediates.

Block 2 could have ALU operations and larger-displacement Load/Store.
Say, yyy:
000: Load
001: OP_Imm
010: Store
011: -
100: -
101: -
110: Long Branch
111: -

Say (Load/Store):
Lxx Rd, Rs, Rt*Sc, Disp33s //Rd=[Rs+Rt*Sc+Disp33s]
Sxx Rs, Rt*Sc, Disp33s, Rd //[Rs+Rt*Sc+Disp33s]=Rd
xmmm-cc
x: Sign Extension (Bit 31)
mmm: Ld/St Type (B/H/W/D/BU/HU/WU/X)
cc: Scale (1/2/4/8)

For OP_Imm:
OP Rd, Rs, Imm33s //Rd=Rs OP Imm33s
With ~ 10 bits for Opcode.

Long branch would combine Bxx, JAL, and JALR, but with a 32 bit
displacement.

Block 4 could have some large-immediate ALU operations and similar.

Probably:
OR Rd, Zero, Imm64

Or similar, could be used as a 64-bit constant load.

Anton Ertl

28.12.2021, 05:49:03

Thomas Koenig <tko...@netcologne.de> writes:
>Anton Ertl <an...@mips.complang.tuwien.ac.at> wrote:
>
>> In the thread starting at
>><2021Mar1...@mips.complang.tuwien.ac.at> we have discussed adding
>> an extra bit to each register to allow an efficient add-with-carry
>> (and other things), without adding a special-purpose flag register to
>> the architecture (which is against MIPS-style architectural
>> principles).
>
>I remember that discussion, and that is probably the best way.
>So, the add with carry would be something like
>
> add t0, r1, r2
> add t1, r3, r4
> addcf t1, t1, t0
>
>where "addcf" would add the carry flag from t0 to t1.

Yes, that's a good way to do it if you want to stay with
two-source-register instructions. The add for the next pair of words
can be done in parallel with the addcf, so the overall latency is the
latency of the first add plus n times the latency of addcf.
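The latency argument can be made concrete with a two-phase sketch in C. The function name add_n_extra_bit and the carry[] array (standing in for the per-register extra bits) are assumptions for illustration, as is the rule that a limb keeps its carry bit if either its add or its increment overflowed.

```c
#include <stddef.h>
#include <stdint.h>

/* dst = x + y over n limbs (least significant first); returns carry-out.
   carry[] stands in for the per-register extra bits; it needs n entries. */
static unsigned add_n_extra_bit(uint64_t *dst, unsigned *carry,
                                const uint64_t *x, const uint64_t *y,
                                size_t n)
{
    /* Phase 1: n independent adds; no limb waits on another. */
    for (size_t i = 0; i < n; i++) {
        dst[i] = x[i] + y[i];
        carry[i] = dst[i] < x[i];
    }
    /* Phase 2: serial addcf chain; limb i consumes the carry of limb i-1. */
    for (size_t i = 1; i < n; i++) {
        uint64_t before = dst[i];
        dst[i] = before + carry[i - 1];
        carry[i] |= dst[i] < before;
    }
    return n ? carry[n - 1] : 0;
}
```

Only the second loop is inherently serial, so on a wide enough machine the critical path is one add followed by the chain of addcf operations.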

>From a compiler perspective, it would probably be best to
>model this as a separate condition register with one bit.

Why do you think that is best? Compilers are horrible at modeling
single condition-code registers (which is IMO a big reason why
MIPS-style architectures shun condition codes). I think the best part
of the extra-bits idea is that they are part of GPRs, and in most
contexts compilers don't need to model them separately.

Compilers do need to know whether the extra bits are needed across
calls or spilling, because they are not preserved in callee-saved
registers, and spilling the extra bits costs extra. For typical
bigint code the extra bits are not needed across calls.

Thomas Koenig

28.12.2021, 17:16:24

Anton Ertl <an...@mips.complang.tuwien.ac.at> wrote:

> Thomas Koenig <tko...@netcologne.de> writes:
>>Anton Ertl <an...@mips.complang.tuwien.ac.at> wrote:
>>
>>> In the thread starting at
>>><2021Mar1...@mips.complang.tuwien.ac.at> we have discussed adding
>>> an extra bit to each register to allow an efficient add-with-carry
>>> (and other things), without adding a special-purpose flag register to
>>> the architecture (which is against MIPS-style architectural
>>> principles).
>>
>>I remember that discussion, and that is probably the best way.
>>So, the add with carry would be something like
>>
>> add t0, r1, r2
>> add t1, r3, r4
>> addcf t1, t1, t0
>>
>>where "addcf" would add the carry flag from t0 to t1.
>
> Yes, that's a good way to do it if you want to stay with
> two-source-register instructions. The add for the next pair of words
> can be done in parallel with the addcf, so the overall latency is the
> latency of the first add plus n times the latency of addcf.

What else would there be to fix with RISC-V if one
wanted to maintain strict two-source-registers?

An indexed store actually accesses three registers, so
if one wanted to remain pure there, array accesses
could be done via a LEA instruction which does

Ra = Rb + Rc << scale

followed by a load or store, leading to two instructions for an
array access to anything other than a byte instead of three.
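As a rough C sketch of that split, with a hypothetical lea helper followed by a plain load. Note that the intended meaning is Rb + (Rc << scale); in C the parentheses are required because + binds tighter than <<. Converting the integer back to a pointer is implementation-defined, so this is an illustration for common flat-address platforms only.

```c
#include <stdint.h>

/* Hypothetical LEA: Ra = Rb + (Rc << scale). */
static uintptr_t lea(uintptr_t base, uintptr_t index, unsigned scale)
{
    return base + (index << scale);
}

/* Load a[i] for 64-bit elements: one LEA (scale 3 means a factor of 8),
   then a plain register-indirect load. */
static uint64_t load_u64_indexed(const uint64_t *a, uintptr_t i)
{
    uintptr_t addr = lea((uintptr_t)a, i, 3);
    return *(const uint64_t *)addr;
}
```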

MitchAlsup

28.12.2021, 17:48:12

On Tuesday, December 28, 2021 at 4:16:24 PM UTC-6, Thomas Koenig wrote:
> Anton Ertl <an...@mips.complang.tuwien.ac.at> wrote:
> > Thomas Koenig <tko...@netcologne.de> writes:
> >>Anton Ertl <an...@mips.complang.tuwien.ac.at> wrote:
> >>
> >>> In the thread starting at
> >>><2021Mar1...@mips.complang.tuwien.ac.at> we have discussed adding
> >>> an extra bit to each register to allow an efficient add-with-carry
> >>> (and other things), without adding a special-purpose flag register to
> >>> the architecture (which is against MIPS-style architectural
> >>> principles).
> >>
> >>I remember that discussion, and that is probably the best way.
> >>So, the add with carry would be something like
> >>
> >> add t0, r1, r2
> >> add t1, r3, r4
> >> addcf t1, t1, t0
> >>
> >>where "addcf" would add the carry flag from t0 to t1.
> >
> > Yes, that's a good way to do it if you want to stay with
> > two-source-register instructions. The add for the next pair of words
> > can be done in parallel with the addcf, so the overall latency is the
> > latency of the first add plus n times the latency of addcf.
> What else would there be to fix with RISC-V if one
> wanted to maintain strict two-source-registers?
<

Solve the FMAC problem using only 2 operand register specifiers!

>
> An indexed store actually accesses three registers, so
<

Err, not what you think:
<
An indexed store is allowed to access the register containing data to
be stored AFTER LD-Align (nominally the WRITE-Back stage). This
alleviates register read pressure, eliminates large numbers of flip-
flops in the pipeline, and simplifies pipeline design. I, personally,
have never had problems "reading the write slot" that is reading the
register file in the clock period where one would normally be writing
the result register. Done this way, there is NO FORWARDING needed
for the read of the ST.data.
<
I, for one, do not consider reading the register data to be stored
"a problem" worthy of an ISA-level "solution". Let the HW guys tell
you what to do here. HP even got a patent on ST pipeline design
(circa 1986). Follow (or at least READ) this patent before you do ISA
design.
<
There are a lot of things I don't like about RISC-V, but STs and LDs
should be symmetrical {and OpCodes should be at the most significant
parts of the container.}

<
> if one wanted to remain pure there, array accesses
> could be done via a LEA instruction which does
>
> Ra = Rb + Rc << scale
<

My 66000 has:
<
Rd = Rb + Ri<<scale + Disp // +'s are unsigned here; Disp optional
<
as its LEA instruction.

Thomas Koenig

29.12.2021, 04:17:52

MitchAlsup <Mitch...@aol.com> wrote:

> On Tuesday, December 28, 2021 at 4:16:24 PM UTC-6, Thomas Koenig wrote:

>> An indexed store actually accesses three registers, so
><
> Err, not what you think:
><
> An indexed store is allowed to access the register containing data to
> be stored AFTER LD-Align (nominally the WRITE-Back stage). This
> alleviates register read pressure, eliminates large numbers of flip-
> flops in the pipeline, and simplifies pipeline design. I, personally,
> have never had problems "reading the write slot" that is reading the
> register file in the clock period where one would normally be writing
> the result register. Done this way, there is NO FORWARDING needed
> for the read of the ST.data.

Thanks for the explanation.

> I, for one, do not consider reading the register data to be stored
> "a problem" worthy of an ISA-level "solution". Let the HW guys tell
> you what to do here. HP even got a patent on ST pipeline design
> (circa 1986). Follow (or at least READ) this patent before you do ISA
> design.

I would love to.

Do you have any more information that would make it easier to find,
for example a number, inventor's name, or... ?

Marcus

30.12.2021, 06:30:28

What are the interrupt semantics here? Is the entire carry chain treated
as an uninterruptible sequence?

MitchAlsup

30.12.2021, 13:13:17

Interrupts are precise.

<
> Is the entire carry chain treated
> as an uninterruptible sequence?
<

No.