
MADD instruction (integer multiply and add)


Marcus

Oct 31, 2021, 9:10:04 AM
Hello group!

I just did some analysis of some code that included integer
multiplications, and concluded that having a fused multiply-and-add
(a.k.a. MAC) instruction would eliminate many ADD instructions, since
a multiplication operation is more often than not followed by an
addition operation.
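
For example, code like the following (a generic illustration, not from
the code I analyzed) reduces to MADD-style sequences:

/* Address arithmetic: row*cols + col maps onto a single MADD. */
int elem(const int *m, int cols, int row, int col)
{
    return m[row * cols + col];
}

/* Accumulation: s += a[i]*b[i] maps onto one MADD per iteration. */
long dot(const int *a, const int *b, int n)
{
    long s = 0;
    for (int i = 0; i < n; i++)
        s += (long)a[i] * b[i];
    return s;
}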

Furthermore, the instruction eliminates the (potential) data-dependency
latency between the MUL instruction and the ADD instruction.

So, I went ahead and added a simple MADD instruction to my ISA, of the
form:

MADD R1, R2, R3 ; R1 <- R1 + R2 * R3

This was a very simple addition to my hardware implementation, and in
my FPGA design it didn't really add any extra delay (the addition is
performed in the pipeline stage directly after the multiplication,
concurrently with some other result fixup operations). Thus the cost of
the ADD instruction could be completely eliminated (both in code size
and in instruction cycle count).

I also noticed that ARMv8 has a similar (but more flexible 4-operand)
instruction. In fact, in ARMv8 the MUL instruction is an /alias/ for the
MADD instruction (the addend is set to zero).
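
In AArch64 assembly the aliasing looks like this:

madd x0, x1, x2, x3   // x0 = x3 + x1*x2
mul  x0, x1, x2       // encoded as: madd x0, x1, x2, xzr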

However, many other ISAs (apart from DSP ISAs) seem to lack this
instruction (x86, RISC-V, ARMv7, POWER?, My 66000?). How come?

It seems to me that it's a very useful instruction, and that it has
a relatively small hardware cost.

/Marcus

Marcus

Oct 31, 2021, 9:19:44 AM
On 2021-10-31, Marcus wrote:
> However, many other ISA:s (apart from DSP ISA:s) seem to be lacking this
> instruction (x86, RISC-V, ARMv7, POWER?, My 66000?). How come?

Correction: The POWER ISA has MADDLD

/Marcus

anti...@math.uni.wroc.pl

Oct 31, 2021, 10:27:34 AM
x86 has it in its vector extensions.

> It seems to me that it's a very useful instruction, and that it has
> a relatively small hardware cost.

It is a very useful instruction, in particular when correctly
rounded. However, it is not clear to me that the cost is small.
Namely, consider

pr1 = x*y
pr2 = madd(x, y, -pr1)

where madd means hardware madd. When correctly rounded,
pr2 will deliver the low-order bits of the product, which means
that your multiplier has to produce all bits of the product.
IIUC a "normal" FP multiplier can skip most of the low-order
bits, just setting a few flags to be able to round correctly.
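
For floating point this is the classic exact-product split; a minimal
C99 sketch, assuming fma() is correctly rounded:

#include <math.h>

/* hi + lo == x*y exactly, when fma() is correctly rounded. */
void two_product(double x, double y, double *hi, double *lo)
{
    *hi = x * y;             /* rounded product                  */
    *lo = fma(x, y, -*hi);   /* exact rounding error of the product */
}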

--
Waldek Hebisch

Marcus

Oct 31, 2021, 11:26:36 AM
Which instruction is that? I know about FMA3 for floating-point, but how
about integer?

However, for the scalar integer ISA they don't have it, which is what I
was thinking about (things like memory address calculations for matrix
indices etc. use integer MUL + ADD). See:

x86: https://godbolt.org/z/dTEej5rP4 (imull + leal)

...but ARM does, for instance:

ARMv7: https://godbolt.org/z/K3Ecsavq1 (mla) <- Correction: ARMv7 has it
ARMv8: https://godbolt.org/z/EPczq6Wcc (madd)

>> It seems to me that it's a very useful instruction, and that it has
>> a relatively small hardware cost.
>
> It is very useful instruction, in particular when correctly
> rounded. However, it is not clear for me that cost is small.
> Namely, consider
>
> pr1 = x*y
> pr2 = madd(x, y, -pr1)
>
> where madd means hardware madd. When correctly rounded
> pr2 will deliver low order bits of product. Which means
> that your multiplier has to produce all bits of product.
> IIUC "normal" FP multiplier can skip most of low order
> bits, just setting few flags to be able to round correctly.

Are you talking about integer or floating-point?

/Marcus

Thomas Koenig

Oct 31, 2021, 12:29:02 PM
Marcus <m.de...@this.bitsnbites.eu> schrieb:

> However for the scalar integer ISA they don't have it, which is what I
> was thinking about (things like memory address calculations for matrix
> indices etc use integer MUL + ADD).

Because multiplication used to be so expensive, strength reduction,
i.e. replacing multiplication by repeated addition, has received
considerable attention in compiler optimization. IIRC this is
something the very first optimizing compiler, for FORTRAN, did.

And if strength reduction does not work:

Even if an ISA does not have multiply + add as an instruction,
many of them support

MUL R1,R2,R3 // R1=R2*R3
L R4, R5(R1) // R4 = Mem(R5+R1)

so the addition is done in the load/store instruction. Then,
there is less advantage for array operations because an indirect
load is just a load with a fixed offset of zero.
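
For illustration, the classic strength-reduction transformation (a
generic C sketch):

/* Before: one multiply per iteration. */
int sum_col(const int *a, int n, int stride)
{
    int s = 0;
    for (int i = 0; i < n; i++)
        s += a[i * stride];
    return s;
}

/* After: the multiply becomes a running pointer increment. */
int sum_col_sr(const int *a, int n, int stride)
{
    int s = 0;
    for (const int *p = a; n-- > 0; p += stride)
        s += *p;
    return s;
}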

MitchAlsup

Oct 31, 2021, 12:35:50 PM
On Sunday, October 31, 2021 at 8:10:04 AM UTC-5, Marcus wrote:
> Hello group!
>
> I just did some analysis of some code that included integer
> multiplications, and concluded that having a fused multiply-and-add
> (a.k.a MAC) instruction would eliminate many ADD instructions, since
> a multiplication operation more often than not comes together with a
> following addition operation.
<
My 66000 has dabbled with having an integer MADD instruction.
>
> Furthermore the instruction eliminates the (potential) data dependency
> latency between the MUL instruction and the ADD instruction.
<
For many integer multiplies, the end result is an index into an array/matrix
where one has access to a "free" ADD in the memory reference addressing
mode.
>
> So, I went ahead and added a simple MADD instruction to my ISA, on the
> form:
>
> MADD R1, R2, R3 ; R1 <- R1 + R2 * R3
>
> This was a very simple addition to my hardware implementation, and in
> my FPGA design it didn't really add any extra delay (the addition is
> performed in the pipeline stage directly after the multiplication,
> concurrently with some other result fixup operations). Thus the cost of
> the ADD instruction could be completely eliminated (both in code size
> and in instruction cycle count).
<
Done "properly" the multiply and the MADD both produce 2×n results
(at least optionally).
<
But the cost of the ADD should be 2 gates of delay (where a 2-input
carry-select adder is 11 gates of delay for 64-bit results), because
the ADD portion can be done before conversion from carry-save to
binary form.
>
> I also noticed that ARMv8 has a similar (but more flexible 4-operand)
> instruction. In fact, in ARMv8 the MUL instruction is an /alias/ for the
> MADD instruction (the addend is set to zero).
>
> However, many other ISA:s (apart from DSP ISA:s) seem to be lacking this
> instruction (x86, RISC-V, ARMv7, POWER?, My 66000?). How come?
<
Never made the "statistics" to be warranted always on the edge.
The other reason is that the 3-operand encoding field is nearly full.

BGB

Oct 31, 2021, 2:41:01 PM
Goes looking:
Yeah, it appears that this is a pretty common pattern.

Though, MUL isn't a super high priority op (in a few programs, several
orders of magnitude less common than the load/store ops; even vs ADD by
itself).

The debate, I guess, is whether it would save enough to be worthwhile.


Going by some of my clock-cycle usage stats, it would likely save around
0.01% of the total clock-cycles for programs like Doom and similar.

It does result in an increase in LUT cost, and timing gets a little
worse. I ended up shoving it into the adders which deal with high-order
products and sign-extension. This causes resource cost to increase by
around 2%.


But, there are cases where it could be useful.
Went and defined a DMAC extension, adding:
MACS.L Rm, Ro, Rn //Rn=SExt(Rn+(Rm*Ro))
MACU.L Rm, Ro, Rn //Rn=ZExt(Rn+(Rm*Ro))
DMACS.L Rm, Ro, Rn //Rn=Rn+(Rm*Ro) //Widening MAC
DMACU.L Rm, Ro, Rn //Rn=Rn+(Rm*Ro) //Widening MAC

Debate being mostly whether it is worth a 2% increase in LUT budget
(~1kLUT).

...

anti...@math.uni.wroc.pl

Oct 31, 2021, 2:56:27 PM
Well, I was thinking about floating-point. However, there is the old
PMADDWD instruction. It is limited to taking 4 packed 16-bit numbers,
and produces 2 dot products, a0*b0 + a1*b1 and a2*b2 + a3*b3. A 64-bit
version (128-bit dot products of pairs of 64-bit numbers) of this
would be nice...

> However for the scalar integer ISA they don't have it, which is what I
> was thinking about (things like memory address calculations for matrix
> indices etc use integer MUL + ADD). See:
>
> x86: https://godbolt.org/z/dTEej5rP4 (imull + leal)
>
> ...but ARM does, for instance:
>
> ARMv7: https://godbolt.org/z/K3Ecsavq1 (mla) <- Correction: ARMv7 has it
> ARMv8: https://godbolt.org/z/EPczq6Wcc (madd)
>
> >> It seems to me that it's a very useful instruction, and that it has
> >> a relatively small hardware cost.
> >
> > It is very useful instruction, in particular when correctly
> > rounded. However, it is not clear for me that cost is small.
> > Namely, consider
> >
> > pr1 = x*y
> > pr2 = madd(x, y, -pr1)
> >
> > where madd means hardware madd. When correctly rounded
> > pr2 will deliver low order bits of product. Which means
> > that your multiplier has to produce all bits of product.
> > IIUC "normal" FP multiplier can skip most of low order
> > bits, just setting few flags to be able to round correctly.
>
> Are you talking about integer or floating-point?

I was thinking about floating point. I would like to have an integer
version, but to be really useful it would also have to produce the high
bits of the result, which means 4 input registers and 2 output
registers. 32-bit ARM is doing this; I am not sure if they kept this
for 64-bit mode.

More generally, most CPU designers seem to skimp on integer arithmetic.
The effect is that for my needs I may be forced to use the FPU as a
"poor man's" integer unit: it delivers substandard results (53 bits
instead of 64 or 128) and eats memory bandwidth (as I need to have
64-bit containers even though only, say, 25 bits are significant),
but has higher throughput than the real integer unit...

--
Waldek Hebisch

Marcus

Nov 1, 2021, 4:43:34 AM
In my ISA I use destructive 3-operand (a = a + b x c), which isn't quite
as flexible as a full 4-operand version (a = b + c x d). OTOH it seems
to cover the most common cases without requiring any additional moves
(and I think that a pre-move is better than a post-add anyway).
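
For example, the 4-operand case a = b + c x d maps onto the destructive
form with a pre-move (illustrative MRISC32-style assembly):

MOV R1, R2        ; pre-move: a = b (independent of the product)
MADD R1, R3, R4   ; a += c * d

...whereas a post-add has to wait on the multiply result:

MUL R5, R3, R4    ; t = c * d
ADD R1, R2, R5    ; a = b + t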

A quick test on a few code bases with my MADD-enabled GCC (keep in mind
that my GCC machine description is far from perfect):

Quake:
# of MUL: 146
# of MADD: 162 (52%)

Doom:
# of MUL: 395
# of MADD: 152 (28%)

Sqlite:
# of MUL: 152
# of MADD: 77 (33%)

That's not a huge number of multiplications, but OTOH the share of
multiplications that can utilize MADD is 30% or better (and when I look
at the cases where MUL is being used, some of them should be possible
to transform to MADD too, even though GCC fails at doing so).

The code snippet that originally got me thinking about the issue was
a fixed point (16.16) bilinear sampler routine:

https://godbolt.org/z/xGaszj7qY
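
The routine looks roughly like this in C (a hypothetical sketch; the
actual source is behind the link above, and the argument conventions
and rounding here are illustrative; wx and wy are 16.16 fractional
weights in [0, 65536)):

unsigned sample_bilinear(const unsigned char *img, int stride,
                         int x, int y, int wx, int wy)
{
    const unsigned char *r0 = img + y * stride + x;  /* y*stride: MADD */
    const unsigned char *r1 = r0 + stride;
    int p00 = r0[0], p01 = r0[1];                    /* four adjacent  */
    int p10 = r1[0], p11 = r1[1];                    /* values         */
    int h0 = (p00 << 16) + (p01 - p00) * wx;  /* horizontal lerp, row 0 */
    int h1 = (p10 << 16) + (p11 - p10) * wx;  /* horizontal lerp, row 1 */
    long long v = (long long)h0 + (((long long)(h1 - h0) * wy) >> 16);
    return (unsigned)((v + 0x8000) >> 16);    /* round, drop 16.16 bits */
}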

After adding MADD to my ISA I get the following code (GCC):

sample_bilinear:
madd r3, r4, r2 ; Calculate row addresses
add r3, r1, r3
add r2, r3, r2
ldub r8, [r3, #0] ; Load four adjacent values
ldub r7, [r2, #0]
ldub r4, [r3, #1]
ldub r1, [r2, #1]
lsl r3, r8, #16
sub r1, r1, r7
sub r4, r4, r8
lsl r2, r7, #16
madd r3, r4, r5 ; Lerp horizontally, row 1
madd r2, r1, r5 ; Lerp horizontally, row 2
lsl r1, r3, #7
sub r2, r2, r3
lsr r2, r2, #9
add r1, r1, #0x400000 ; Round
madd r1, r2, r6 ; Lerp vertically
ebfu r1, r1, #<23:8>
ret

In this routine four instructions could be dropped (17%).

BTW, bilinear sampling is one of the things that is very hard to do
efficiently in SIMD (at least w/o gather load), so scalar
implementations are quite common.

/Marcus

Marcus

Nov 1, 2021, 5:23:00 AM
Yes, I saw a ~1% LUT increase & ~4% register increase in my CPU core
(the registers are probably for the pipelined passing of the addend to
the final multiplier stage). Though, granted, I have not been very
restrictive with the LUT budget. I'm still under 50% logic utilization
for my complete SoC computer design. Lack of BRAM is a bigger issue for
me.

The single MADD instruction was just a PoC. I'm going to investigate if
MSUB makes sense too, and then I think I'm going to rearrange the
opcodes so that I can use an immediate value for one of the
multiplicands (and the divisor for division), which could make for a
nice improvement in some situations.

/Marcus

Thomas Koenig

Nov 1, 2021, 2:23:05 PM
Marcus <m.de...@this.bitsnbites.eu> schrieb:

> A quick test on a few code bases with my MADD-enabled GCC (keep in mind
> that my GCC machine description is far from perfect):
>
> Quake:
> # of MUL: 146
> # of MADD: 162 (52%)

Here are a few numbers for POWER binaries (the POWER ISA has the madd*
instructions):

[tkoenig@gcc135 lib64]$ objdump --disassemble libgfortran.a | grep -P "\tmadd" | wc -l
1066
[tkoenig@gcc135 lib64]$ objdump --disassemble libgfortran.a | grep -P "\tmul" | wc -l
6350
[tkoenig@gcc135 lib64]$ objdump --disassemble libgfortran.a | wc -l
400530

[tkoenig@gcc135 bin]$ objdump --disassemble troff | grep -P "\tmadd" | wc -l
0
[tkoenig@gcc135 bin]$ objdump --disassemble troff | grep -P "\tmul" | wc -l
318
[tkoenig@gcc135 bin]$ objdump --disassemble troff | wc -l
106736

So, even in very matrix-oriented code like libgfortran, the number
of integer multiply-adds is rather low - 0.25%.

BGB

Nov 1, 2021, 5:00:57 PM
This is closer to what I was seeing:
Around 1/4 to 1/8 of MUL can be conjoined with an ADD.

Combined with relatively few of these cases being in the hot path, the
average case savings are pretty small (in terms of clock cycles).


It is a promising operation for things like IDCT in JPEG or MPEG, but if
one has a machine where much of the bottleneck for JPEG decoding is in
the Huffman stage, the savings from a slightly faster IDCT are small.


At least for BJX2, there would likely be a bigger advantage for a
JPEG-style format using Rice coding or similar rather than Huffman.

Granted, throwing Rice coding at problems seems to be anathema to many
compression people.


I have added it experimentally in BGBCC (this was a hassle; I ended up
needing to add support for ternary operators to multiple compiler
stages). Even so, it is likely that it would be more efficient to
spread it out using "+=" operators.

So, rather than writing, say:
i0=a00*c00 + a01*c01 + a02*c02 + a03*c03;
i1=a10*c10 + a11*c11 + a12*c12 + a13*c13;
i2=a20*c20 + a21*c21 + a22*c22 + a23*c23;
i3=a30*c30 + a31*c31 + a32*c32 + a33*c33;
It would work out more efficient to write it more like:
i0 =a00*c00; i1 =a10*c10; i2 =a20*c20; i3 =a30*c30;
i0+=a01*c01; i1+=a11*c11; i2+=a21*c21; i3+=a31*c31;
i0+=a02*c02; i1+=a12*c12; i2+=a22*c22; i3+=a32*c32;
i0+=a03*c03; i1+=a13*c13; i2+=a23*c23; i3+=a33*c33;

Mostly because the former would stumble on interlocks, and thus cost ~ 3
cycles per MAC, whereas the latter would average ~ 1 cycle per MAC, at
least, subject to the whims of the register allocator and similar.


Available encoding options would be:
MAC with reg coeff;
MAC with Imm5u coeff (32-bit encoding);
MAC with Imm29s coeff (64-bit encoding).

With this operation (if supported) being limited to Lane 1 (not subject
to WEX).



It is possible I could add MACS.W variants (as a special case of
MULS.W), depending mostly on whether I want to spend the encoding space
on it. These could be combined with WEX (allowed in Lanes 1 or 2 with a
1-cycle latency). These could give a performance advantage if the input
values are 'char' or 'short' (*1).

...

*1:
const short c00=15, c01=7, ...;
short a00, a01, a02, ...;
int i0, i1, i2, i3;
...
i0=a00*c00 + a01*c01 + a02*c02 + a03*c03;
...

The tradeoff here is mostly that relatively little "normal" code goes
out of its way to use 'short' for multiplier inputs rather than 'int'
(given 'int' is pretty much the default type for pretty much everything,
even when technically overkill). So, it is likely that a "MACS.W"
instruction would end up hardly ever being used (well, outside of
things like audio/video codecs or similar).

Part of the reason this case can work OK is that this operation would
map fairly directly onto a DSP48 element.



I still have encoding space available, but given I have already used up
"most of it" (*2), this is starting to be cause for concern (though, the
rate of expansion has slowed considerably).

*2: I wanted to leave the F3 and F9 blocks as "user blocks", but if I
run out of space in the F0 block, that would mean I would start needing
to add new instructions mostly in the Op64 space, which would kinda suck
(any instructions added here can't currently be WEXified).


Quick survey:
I have enough remaining 3R space (in 32-bit land) for around 68 3R
encodings (56 generic, 12 Ld/St).

This would be out of a starting space of 167 (192 - 24 - 1; with 24 for
the 2R spaces, and 1 for the 1R/0R space).

Actually, the starting space would have been significantly larger,
apart from the ISA design cutting off a big chunk of it for WEX.
The Predicate and XGPR spaces, meanwhile, were reclaimed from 16-bit
Land (16-bit land is basically already full; and XGPR ate the encoding
space that was previously occupied by the Op24 encodings; with XGPR and
Op24 assumed to be mutually exclusive).


There is enough remaining encoding space in Op64 land for around 1
million 3R encodings. Op48 space also theoretically exists, but is
incompatible with the WEXifier (which is built around the assumption of
being able to work with machine-code in terms of 32-bit units).


Where, currently:
F0nm_0eoZ Ld/St (Full)
F0nm_1eoZ ALU, 2R space (Mostly full; only 2R remains)
F0nm_2eoZ SIMD and FPU ops (Full)
F0nm_3eoZ 0R/1R/2R spaces, ALUX ops (Mostly only 1R and 2R remains)
F0nm_4eoZ Ld/St (MOV.X, MOV.C)
F0nm_5eoZ SIMD, FPU, DMULx, More ALU, ... (Full)
F0nm_6eoZ MAC, UTX/UAB ops, FPU (Mostly full)
F0nm_7eoZ Mostly Unused, more 2R space
F0nm_8eoZ Ld/St (XMOV; Mostly Full)
F0nm_9eoZ Unused (Reclaimed from original FPU)
F0nm_AeoZ Unused
F0nm_BeoZ Unused
F0dd_Cddd BRA Disp20
F0dd_Dddd BSR Disp20
F0dd_Eddd BT Disp20
F0dd_Fddd BF Disp20

The F1 block is now full.
The F2 block only really has Imm10 slots left.

The F3 block is reserved for user extensions; these are intended for
implementation-specific instructions (if one were to use BJX2 as part
of a GPU or neural-net processor, those instructions would probably go
here). Similar for F9.

Can't really claim the remaining space in F8 as more 3R space, as the
block layout of the F8 block is incompatible with the F0 block layout.


Note that the 2R spaces are limited to 2R encodings, which are currently
things like:
OP Rm, Rn
OP Imm6, Rn

There is still plenty of 2R space left.

It could be possible to reclaim some encoding space, but with the usual
drawback that doing so tends to break binary compatibility with existing
code (preferably avoided).

...

BGB

Nov 1, 2021, 6:13:21 PM
I had to give up on doing dual-core, so I am currently stuck with a
single CPU core on the XC7A100 with the current feature-set (though, a
whole lot of LUTs are being spent on having 64B cache lines in the L2
cache).

I had considered SMT as a possible cheaper alternative, but this is
still pretty far from being fully implemented.


Currently using ~ 70% (44000 / 63400 LUTs), and 118 / 135 BRAM, ...

As noted, this is with the current major configuration:
WEX-3W (3 Execute Lanes)
ALUX (128-bit ADD/SUB/Shift)
FP-SIMD (128-bit packed Floating-Point SIMD)
Also 64-bit packed Half
Also 64-bit 3x FP21 (S.E5.M15)
Some FP8 Ops ( 32-bit, 4x FP8; E4.M4 / S.E4.M3 )
Block-Texture and Block-Audio Decoders (UTX/UAB)
RISC-V Mode (Secondary Decoders)
XMOV (96-bit virtual address space)
...
64B L2 cache lines
256K L2 Cache
16K + 32K L1 Caches
256x 4-way TLB (2-way when using 96-bit VAs)
...


Without the MAC instructions, it is ~ 68% of the LUT budget.
Using 16B L2 cache lines would save ~ 8%, but somewhat reduces memory
bandwidth (the 64B lines being basically required for the DRAM-backed
framebuffer to work effectively).


Currently not enabled (but exist):
FPUX / Long-Double (Hardware support for a 96-bit / S.E15.M80 format);
BLINT / BLERP (Hardware Bilinear Interpolator).


These features are both fairly expensive and also fairly niche.
The absence of FPUX can be partly offset by ALUX being able to implement
FPU emulation via 128-bit integer operations. Well, except FMUL is still
kinda slow when one only has a 32x32->64 multiplier (eg: int128
multiply is kinda slow).

The software emulation does the S.E15.M112 format (Quad Precision).

In practice, neither __float128 nor "long double" are commonly used, so
it isn't too much of an issue at present.


While the hardware bilinear interpolator did work, it didn't offer that
big of an advantage over doing it in software (to offset its cost).

Though, partly for TKRA-GL, I did end up with some of the bilinear
paths using a slightly cheaper but approximate interpolator (namely, a
cost-reduced 3-point interpolation algo partly "inspired" by the
Nintendo 64).


> The single MADD instruction was just a PoC. I'm going to investigate if
> MSUB makes sense too, and then I think I'm going to rearrange the
> opcodes so that I can use an immediate value for one of the
> multiplicands (and the divisor for division), which could make for a
> nice improvement in some situations.
>

In my case, I ended up fiddling with it more, and adding some Imm5u
encodings as well (since Imm5 fits into 3R space).

There was an Imm9 space, but none of this remains (it was mostly used
for ALU ops).


> /Marcus

Marcus

Nov 2, 2021, 3:52:16 AM
Thanks for the stats!

Unfortunately I do not yet have access to a lot of larger code bases
(e.g. my GCC toolchain is missing OS/posix functions), so I have to use
a few "simple" ones when I try out new features (e.g. Quake).

I wonder if your findings would translate to MRISC32 (apples to
apples?). Also, was troff compiled for POWER9?

Goes checking GNU troff...

Ok, I was able to build a bunch of MRISC32 flavored .o files from the
GNU troff source (including stuff going into libgroff.a), but program
linking failed because I do not have support for shared linking (or
something to that effect, I did not bother to try to get it to work at
this point).

I added all the .o files to a static library, libstuff.a, and got:

$ mrisc32-elf-objdump -d libstuff.a | grep '\smadd\s' | wc -l
35
$ mrisc32-elf-objdump -d libstuff.a | grep '\smul\s' | wc -l
42
$ mrisc32-elf-objdump -d libstuff.a | wc -l
32903

...so ~45% of the multiplications transform to MADD for MRISC32. I guess
there could be differences in the POWER ISA that allow it to do other
transformations?

/Marcus

Marcus

Nov 2, 2021, 6:03:04 AM
I ended up ditching MSUB - it was almost never used. I now have the
following variants, which cover most of my needs (things are also
simplified by the fact that my ISA is 32-bit, so I don't have to worry
about widening combinations etc.):

MADD R1,R2,R3 ; R1 += R2 * R3
MADD R1,R2,#imm ; R1 += R2 * imm

...where imm uses my 15-bit "I15HL" format, which is a 14-bit immediate
that can either be placed in bits <13:0> (sign extended to 32 bits) or
in bits <31:18> (LSB extended to bits <17:0>). I've found that this
particular immediate format is quite useful for most arithmetic and
bitwise logic instructions.
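
A decode sketch of the format as described (the position of the H/L
select bit within the 15-bit field is my assumption):

#include <stdint.h>

uint32_t decode_i15hl(uint32_t imm15)  /* H/L bit + 14-bit payload */
{
    uint32_t v = imm15 & 0x3FFF;               /* 14-bit payload */
    if (imm15 & 0x4000)                        /* "high" form: <31:18> */
        return (v << 18) | ((v & 1) ? 0x3FFFF : 0);  /* LSB-extend <17:0> */
    else                                       /* "low" form: <13:0> */
        return (uint32_t)((int32_t)(v << 18) >> 18); /* sign-extend */
}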

As usual, this also works with vector registers and packed data types,
so e.g:

MADD.H V1,V2,R3 ; V1[k] += V2[k] * R3 (packed half-word)
MADD V1,V2,#imm ; V1[k] += V2[k] * imm

I still have the option to add another MADD version that changes the
order of the operands in order to avoid MOV:s. For instance, it could
work like this:

MADD213 R1,R2,R3 ; R1 = R2 + R1 * R3

...but I think that at that point we're getting into diminishing
returns.

/Marcus

Thomas Koenig

Nov 2, 2021, 7:55:12 AM
Marcus <m.de...@this.bitsnbites.eu> schrieb:

> Also, was troff compiled for POWER9?

No, it wasn't.

Here is some data for libgsl, built with recent trunk with -mcpu=power9:

[tkoenig@gcc135 .libs]$ objdump --disassemble libgsl.a | grep -P "\tmadd" | wc -l
1045
[tkoenig@gcc135 .libs]$ objdump --disassemble libgsl.a | grep -P "\tmul" | wc -l
3416
[tkoenig@gcc135 .libs]$ objdump --disassemble libgsl.a | wc -l
633746

> Goes checking GNU troff...

Recent gcc trunk and the most recent groff gives me, with -mcpu=power9:

[tkoenig@gcc135 groff-1.22.4]$ objdump --disassemble troff | grep -P "\tmadd" | wc -l
79
[tkoenig@gcc135 groff-1.22.4]$ objdump --disassemble troff | grep -P "\tmul" | wc -l
265
[tkoenig@gcc135 groff-1.22.4]$ objdump --disassemble troff | wc -l
118783

[...]

> ...so ~45% of the multiplications transform to MADD for MRISC32. I guess
> there could be differences in the POWER ISA that allows it to do other
> transformations?

Possibly, I have not looked into this in that detail.

What would be interesting for your ISA is the effect that this
has on bignum multiplication, especially with vectorization
(important for crypto, of course).

EricP

Nov 2, 2021, 9:30:51 AM
Marcus wrote:
>
> I ended up ditching MSUB - it was almost never used. I now have the
> following variants, which covers most of my needs (things are also
> simplified by the fact that my ISA is 32-bit, so I don't have to worry
> about widening combinations etc):
>
> MADD R1,R2,R3 ; R1 += R2 * R3
> MADD R1,R2,#imm ; R1 += R2 * imm
>
> ....where imm uses my 15-bit "I15HL" format, which is a 14-bit immediate
> that can either be placed in bits <13:0> (sign extended to 32 bits) or
> in bits <31:18> (LSB extended to bits <17:0>). I've found that this
> particular immediate format is quite useful for most arithmetic and
> bitwise logic instructions.

What do you do about immediate bits [17:14]?


Marcus

Nov 2, 2021, 10:41:05 AM
That's another instruction (or possibly even two more instructions).
I have a LoaD Immediate (LDI) instruction that accepts any 32-bit value,
and actually expands it to 2 instructions (LDI + OR) if needed. It's a
classic RISC solution, but at least slightly better than the naive
solution thanks to the extra Hi/Lo flag...

I have an I21HL 21-bit encoding that's used by the LDI instruction, that
places the 20 bits in <19:0> or <31:13>.

So, with the smaller I15HL format you can express:

* Signed integers in the range [-8192, 8191].
* Bit masks & values such as:
- 0xFF000000
- 0xFFFFF000
- 0x1FFFFFFF
- 0x7F800000 (exponent mask for IEEE 754 binary-32)
- 0x41A80000 (21.0F in IEEE 754 binary-32)
- etc.

...and with the I21HL format you can express even wider ranges, e.g.
integers in the range [-524288, 524287], or 32-bit floating-point
values with up to 11 fractional bits (e.g. -1020.5F).

It's all in the ISA manual [1] (section 1.5: Immediate value encoding).

I have found that with these encodings I get away with a single 32-bit
instruction word for most common operations, e.g.

MIN R2, R1, #5000

XOR R2, R1, #0x07F00000

...and two 32-bit instruction words suffice in many of the situations
where one is not enough, e.g:

LDI R2, #500000
ADD R2, R1, R2

And finally there's the three-instruction fall-back for rare cases:

LDI R2, #0x12344000
OR R2, R2, #0x00001678 ; R2 = 0x12345678
ADD R2, R1, R2

In the latter case you can just write "LDI R2, #0x12345678" and the
assembler will expand it for you (and I'd expect advanced CPU front-
ends to be able to fuse the instruction pair into a single instruction).
Also, I've found that most of these large constants will be preloaded
into registers outside of loops (thanks to LICM and a sufficient number
of architectural registers).

/Marcus


[1] https://mrisc32.bitsnbites.eu/doc/mrisc32-instruction-set-manual.pdf

MitchAlsup

Nov 2, 2021, 4:01:44 PM
On Tuesday, November 2, 2021 at 9:41:05 AM UTC-5, Marcus wrote:
> On 2021-11-02 14:30, EricP wrote:
> > Marcus wrote:
> >>
> >> I ended up ditching MSUB - it was almost never used. I now have the
> >> following variants, which covers most of my needs (things are also
> >> simplified by the fact that my ISA is 32-bit, so I don't have to worry
> >> about widening combinations etc):
> >>
> >> MADD R1,R2,R3 ; R1 += R2 * R3
> >> MADD R1,R2,#imm ; R1 += R2 * imm
> >>
> >> ....where imm uses my 15-bit "I15HL" format, which is a 14-bit immediate
> >> that can either be placed in bits <13:0> (sign extended to 32 bits) or
> >> in bits <31:18> (LSB extended to bits <17:0>). I've found that this
> >> particular immediate format is quite useful for most arithmetic and
> >> bitwise logic instructions.
> >
> > What do you do about immediate bits [17:14]?
<
> That's another instruction (or possibly even two more instructions).
> I have a LoaD Immediate (LDI) instruction that accepts any 32-bit value,
> and actually expands it to 2 instructions (LDI + OR) if needed. It's a
> classic RISC solution, but at least slightly better than the naive
> solution thanks to the extra Hi/Lo flag...
<
But not as nice as full length immediates available in the instruction set.
<
MUL R7,#123456789101112,-R9
<
So, while this takes 3 words of instruction space (64-bit immediate) it
takes only 1 pipelined cycle of execution (4 cycles of latency) and can
be performed even in lower-end implementations as 1 instruction.

>
> /Marcus
>
>
> [1] https://mrisc32.bitsnbites.eu/doc/mrisc32-instruction-set-manual.pdf

BGB

Nov 2, 2021, 4:22:10 PM
From my listing:
* F0nm_6go0  MACS.L   Rm, Ro, Rn
* F0nm_6Go0  MACS.L   Rm, Imm5u, Rn
* F0nm_6go1  MACU.L   Rm, Ro, Rn
* F0nm_6Go1  MACU.L   Rm, Imm5u, Rn
* F0nm_6go2  DMACS.L  Rm, Ro, Rn
* F0nm_6Go2  DMACS.L  Rm, Imm5u, Rn
* F0nm_6go3  DMACU.L  Rm, Ro, Rn
* F0nm_6Go3  DMACU.L  Rm, Imm5u, Rn

These basically do:
Rn=Rn+Rm*Ro
Rn=Rn+Rm*Imm5


These also have Jumbo and Op64 variants:
* Jumbo extents the Imm5u to Imm29s;
FEii_iiii-F0nm_6Gi2 DMACS.L Rm, Imm29s, Rn //Rn=Rn + Rm*Imm29s
* Op64 can turn it into a 3RI or 4R encoding, eg:
FFw0_0Vpp-F0nm_6go2 DMACS.L Rm, Ro, Rp, Rn // Rn=Rp+Rm*Ro
FFw0_0vii-F0nm_6Gi2 DMACS.L Rm, Imm17s, Rn // Rn=Rn+Rm*Imm17s
FFw0_0Vii-F0nm_6Go2 DMACS.L Rm, Imm11s, Ro, Rn // Rn=Ro+Rm*Imm11s

The 'v' field being defined as (F0 block):
v: 0iii //3R or 3RI (Imm17s)
V: 1iii //4R / 4RI (Imm11s)

The difference between the Imm29s and Imm17s encodings being:
Imm29s encoding is limited to R0..R31;
Imm17s encoding applies to R0..R63.


The V field currently only exists for 3R/3RI encodings within the F0 block.

In theory, I could also use the 'v' field to encode Load/Store
operations with both an index and displacement, or with an explicit
scale, but I don't currently plan to do so (these would not be good for
timing). Though, this could allow addressing modes more directly
analogous to those in x86 and ARM, eg:
(Rm, Ro) //1a Scaled by Element Size (V.v=0)
(Rm, Ro*Sc) //2a, Explicit Scale (V.v=1)
(Rm, Disp17s) //3, Scaled by Element Size (V.v=0)
(Rm, Disp17s*1) //4, Unscaled (V.v=1)
(Rm, Ro, Disp11s) //1b, Scaled by Element Size (V.v=0)
(Rm, Ro*Sc, Disp9s) //2b, Scaled by Element Size (V.v=1)


Though, 1a/1b and 2a/2b would be the same encoding, with the rule being
that if Op64 is used with an Ld/St op:
(Rm, Ro) -> 1/2
(Rm, Disp5) -> 3/4

I could define these, but leave them unimplemented (reserved as a future
extension). In which case, the displacement field for these cases will
be required to be left as zero.

There will not be any proper 4R Ld/St ops, as 4R does not make sense for
Ld/St (and would also make an awful mess of the register ports in the
case of Store operations).



I had noted that the most common source of MUL (and MUL + ADD) in these
tests tends to actually be from arrays of structs, rather than from
arithmetic operations proper.

The best case I noted where a MAC instruction "pays off" is Dhrystone,
where the DMACS.L instruction (in the context of array addressing) gets
the score from ~ 67k to ~ 69k.


It plays a much lesser role in either Doom or Quake though (neither of
these cases hit IMUL all that hard). Quake seems to hit FMUL a lot
harder than it hits IMUL (not sure if because FMUL is more common in the
hot-path, or because FMUL takes a lot more clock cycles on-average vs a
pipelined IMUL).


> I still have the option to add another MADD version that changes the
> order of the operands in order to avoid MOV:s. For instance, it could
> work like this:
>
>   MADD213 R1,R2,R3  ; R1 = R2 + R1 * R3
>
> ...but I think that at that point we're getting into diminishing
> returns.
>

Yeah.

This is partly why I went and defined a few 4R encodings, even if they
are Op64 encodings (and thus can't be WEXified).

robf...@gmail.com

Nov 2, 2021, 7:19:26 PM
On Tuesday, November 2, 2021 at 9:41:05 AM UTC-5, Marcus wrote:
> On 2021-11-02 14:30, EricP wrote:
> > Marcus wrote:
> >>
> >> I ended up ditching MSUB - it was almost never used. I now have the
> >> following variants, which covers most of my needs (things are also
> >> simplified by the fact that my ISA is 32-bit, so I don't have to worry
> >> about widening combinations etc):
> >>
> >> MADD R1,R2,R3 ; R1 += R2 * R3
> >> MADD R1,R2,#imm ; R1 += R2 * imm
> >>
> >> ....where imm uses my 15-bit "I15HL" format, which is a 14-bit immediate
> >> that can either be placed in bits <13:0> (sign extended to 32 bits) or
> >> in bits <31:18> (LSB extended to bits <17:0>). I've found that this
> >> particular immediate format is quite useful for most arithmetic and
> >> bitwise logic instructions.
> >
> > What do you do about immediate bits [17:14]?
<
> That's another instruction (or possibly even two more instructions).
> I have a LoaD Immediate (LDI) instruction that accepts any 32-bit value,
> and actually expands it to 2 instructions (LDI + OR) if needed. It's a
> classic RISC solution, but at least slightly better than the naive
> solution thanks to the extra Hi/Lo flag...
<

> But not as nice as full length immediates available in the instruction set.
<
> MUL R7,#123456789101112,-R9
<
> So, while this takes 3 words of instruction space (64-bit immediate) it
> takes only 1 pipelined cycle of execution (4 cycles of latency) and can
> be performed even in lower end implementations as 1 instruction

>
> /Marcus
>
>
I like to use prefix instructions to extend constants because they can be applied to
extending displacements as well as immediates. The prefix and following instruction
can be fetched and processed as a single unit if desired. For Thor there are short
11-bit immediate instructions that do not accept a prefix and longer 23-bit immediate
instructions that can be extended to 64 bits using just one prefix. Prefixes can add 7,
23, 41 or 55 bits. The number of bits for the instruction can then be tailored in 16-bit
parcel sizes. Rather than have whole bunches of instructions that process immediates
piece-meal for a 64-bit machine, prefixes are used. Shifting constants around does
not work for all instructions, like divide, but prefixes do. Using prefixes also does not
require the use of intermediate registers. So, full-length immediates and displacements
are supported for most instructions in Thor. Gotta have those full-length constants!

Thor has a fast multiply-and-add instruction, single cycle hopefully, in addition to the
regular multiply-and-add. As the popular instruction format is 3R, I use all three source
register slots for many instructions. Logic operations have 3R forms (Ra & Rb & Rc) and
the compiler will support them.

MitchAlsup

Nov 2, 2021, 8:58:32 PM
In My 66000, there is a group of OpCodes that contain 16-bit immediates 0b1xxxxx
and another set of groups that can attach 32-bit or 64-bit immediates 0b00xxxx. The
second set of groups also contains all the reg-reg OpCodes and the shifts with 12-bit
immediates.
<
I use a 4-bit encoding that also deals with all of the signs, immediates, displacements
and whether the immediate is operand[1] or operand[2]. One can argue that I have
lost entropy using such an encoding, but the pattern extends to the 1-operand and
3-operand formats without change in the 4-bit encoding.
<
> For Thor there are short
> 11-bit immediate instructions that do not accept a prefix and longer 23-bit immediate
> instructions that can be extended to 64 bits using just one prefix. Prefixes can add 7,
> 23, 41 or 55 bits. The number of bits for the instruction can then be tailored in 16-bit
> parcel sizes.
<
My 66000 is getting code density comparable to x86-64 without finding any need for
16-bit parcels. When I looked at adding 16-bit parcels, it screwed too many things
up for what looked to be a minor improvement in OpCode density; essentially eating
the vast majority of expansion room in the OpCode space, which did not look worthwhile
as a long-term plan.
<
> Rather than have whole bunches of instructions that process immediates
> piece-meal for a 64-bit machine, prefixes are used.
<
Rather than having a whole bunch of instructions that process immediates, I gave EVERY
instruction access to every size of immediate (within reason).
<
> Shifting constants around does
> not work for all instructions, like divide, but prefixes do. Using prefixes also does not
> require the use of intermediate registers. So, full length immediates and displacements
> are supported for most instructions in Thor. Gotta have those full- length constants!
>
> Thor has a fast multiply and add instruction, single cycle hopefully, in addition to regular
> multiply-and add.
<
It may be fully pipelined, but it is not going to be 1 cycle of latency. I haven't seen an
integer multiplier faster than 3 cycles, and most of them are 4 cycles.
<
Each layer in the 4-2 compressor tree (which performs the actual multiplication) is
2 gates of delay, and you need at least 5 layers of these (plus wire delay, lots
of wire delay); following the tree, you need an 11-gate-delay adder. Add in 3 gates
for Booth recoding and 1 gate for sign/unsign control, and you have a good 26 gates
of delay. You might be able to route this and make a 2-cycle pipeline, but I have not
seen it done in an otherwise high-perf processor.
<
> As the popular instruction format is 3R, I use all three source register
> slots for many instructions. Logic operations have 3R forms Ra & Rb & Rc and the
> compiler will support them.
<
But can you do :: Ra &~ Rb | Rc in a single shot ?

BGB

Nov 3, 2021, 3:49:44 AM
( misplaced response, going to assume Google Groups or something... )

> I like to use prefix instructions to extend constants because they can be applied to
> extending displacements as well as immediates. The prefix and following instruction
> can be fetched and processed as a single unit if desired. For Thor there are short
> 11-bit immediate instructions that do not accept a prefix and longer 23-bit immediate
> instructions that can be extended to 64 bits using just one prefix. Prefixes can add 7,
> 23, 41 or 55 bits. The number of bits for the instruction can then be tailored in 16-bit
> parcel sizes. Rather than have whole bunches of instructions that process immediates
> piece-meal for a 64-bit machine, prefixes are used. Shifting constants around does
> not work for all instructions, like divide, but prefixes do. Using prefixes also does not
> require the use of intermediate registers. So, full length immediates and displacements
> are supported for most instructions in Thor. Gotta have those full- length constants!
>
> Thor has a fast multiply and add instruction, single cycle hopefully, in addition to regular
> multiply-and add. As the popular instruction format is 3R, I use all three source register
> slots for many instructions. Logic operations have 3R forms Ra & Rb & Rc and the
> compiler will support them.
>

The multiply op is 1-3 cycles, depending on interlocks (mostly as with
Load, not using the result for at least 2 cycles makes it faster).

The experimental MAC / DMAC form does not add any additional latency
(but does have a non-zero LUT cost and similar).



But, yeah, in BJX2, the encodings for Jumbo and Op64 forms are in-effect
built from the use of prefixes.

This seems to be fairly cheap and easy to sort out in a decoder, even if
the actual encoding of the constants is a bit of a bit-twiddly mess.

Decided to skip listing out all of the immediate encodings in BJX2.

Then again, bit-twiddly encodings for immediate values is hardly an
issue unique to BJX2 (look at the organization of the bits in the branch
displacement in RISC-V's JAL instruction, for comparison).


The recent addition of 4R/4RI encodings for the MAC and DMAC
instructions was extended to allow encoding a few new address modes (as
an experiment).

Namely:
(Rm, Disp17s) (*1)
(Rm, Ro, Disp11)
(Rm, Ro*Sc, Disp9)

Seems able to pass timing as well, though supporting both an index and
displacement at the same time is debatable. Ended up making the
displacements unscaled in this case. Sc is 1/2/4/8 for now.

These don't use any additional encoding space in terms of 32-bit ops.

*1: Though, the simple case can also be encoded with a Jumbo prefix for
Disp33s.


A lot of these encodings are still untested though, and I still don't
really know if they are "sane".

Terje Mathisen

Nov 3, 2021, 6:35:54 AM
On x86 you get the full form (hi,lo = a + (b*c) + carry) in 7-8 cycles,
that is probably fast enough, and it provides the most general building
block:

mov rax,b
mul c          ;; 4 or 5 cycles
add rax,a      ;; 1 cycle
adc rdx,0
add rax,carry  ;; 1 cycle
adc rdx,0      ;; 1 cycle

Doing it in hardware is as you note almost free, zero or one additional
cycle over the basic 64x64->128 MUL, but as shown above, you only save 2
or 3 cycles and it is hard to fit a 4-input/2-output instruction in a
general CPU, even if you cheat and make both outputs implied. :-(

Terje

--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"

Thomas Koenig

Nov 3, 2021, 7:06:14 AM
Terje Mathisen <terje.m...@tmsw.no> schrieb:

> On x86 you get the full form (hi,lo = a + (b*c) + carry) in 7-8 cycles,
> that is probably fast enough, and it provides the most general building
> block:
>
> mov rax,b
> mul c ;; 4 or 5 cycles
>
> add rax,a ;; 1 cycle
>
> adc rdx,0
> add rax,carry ;; 1 cycle
>
> adc rdx,0 ;; 1 cycle
>
> Doing it in hardware is as you note almost free, zero or one additional
> cycle over the basic 64x64->128 MUL, but as shown above, you only save 2
> or 3 cycles and it is hard to fit a 4-input/2-output instruction in a
> general CPU, even if you cheat and make both outputs implied. :-(

Digging through the POWER ISA... in 3.0B aka POWER9 you can do
(RA, RB, RC, RT1 and RT2 refer to suitably defined registers)

addze RC, RC ! RC = RC + Carry
maddhdu RT1, RA, RB, RC ! RT1 = high(RA*RB + RC)
maddld RT2, RA, RB, RC ! RT2 = low(RA*RB + RC)

Terje Mathisen

Nov 3, 2021, 7:25:09 AM
That is pretty nice as long as the maddhdu and maddld can overlap for
all or all_minus_1 cycles.

I used the wrong term for 'carry' above here, it is another full 64-bit
input, so the code above would not work the same way.

The idea is of course that 64+(64*64)+64 can maximally result in a
128-bit result with all bits set, i.e. no overflow is possible.

The mul-add-add is however a near-perfect building block for
crypto-class intermediate length (256-4096 bits) bigint math.

Thomas Koenig

Nov 3, 2021, 8:16:42 AM
Terje Mathisen <terje.m...@tmsw.no> schrieb:
> Thomas Koenig wrote:
>> Terje Mathisen <terje.m...@tmsw.no> schrieb:
>>
>>> On x86 you get the full form (hi,lo = a + (b*c) + carry) in 7-8 cycles,
>>> that is probably fast enough, and it provides the most general building
>>> block:
>>>
>>> mov rax,b
>>> mul c ;; 4 or 5 cycles
>>>
>>> add rax,a ;; 1 cycle
>>>
>>> adc rdx,0
>>> add rax,carry ;; 1 cycle
>>>
>>> adc rdx,0 ;; 1 cycle
>>>
>>> Doing it in hardware is as you note almost free, zero or one additional
>>> cycle over the basic 64x64->128 MUL, but as shown above, you only save 2
>>> or 3 cycles and it is hard to fit a 4-input/2-output instruction in a
>>> general CPU, even if you cheat and make both outputs implied. :-(
>>
>> Digging through the POWER ISA... in 3.0B aka POWER9 you can do
>> (RA, RB, RC, RT1 and RT2 refer to suitably defined registers)
>>
>> addic RC, RC, 0 ! RC = RC + Carry
>> maddhdu RT1, RA, RB, RC ! RT1 = high(RA*RB + RC)
>> maddld RT2, RA, RB, RC ! RT2 = low(RA*RB + RC)
>>
> That is pretty nice as long as the maddhdu and maddld can overlap for
> all or all_minus_1 cycles.

There is no reason why these two instructions could not run in
parallel, though - there is no dependency between them.

> I used the wrong term for 'carry' above here, it is another full 64-bit
> input, so the code above would not work the same way.

Hm... so the question is when to add the carry. Doing this
at the end (probably the sane way) would lead to

maddld RT_low, RA, RB, RC
maddhdu RT_high, RA, RB, RC
addc RT_low, RT_low, R_Carry
addze RT_high, RT_high

An alternative might be

addc RC, RC, R_Carry
maddld RT_low, RA, RB, RC
maddhdu RT_high, RA, RB, RC
addze RT_high, RT_high

but I am not sure which version would be better, if there is
a difference at all.

> The idea is of course that 64+(64*64)+64 can maximally result in a
> 128-bit result with all bits set, i.e. no overflow is possible.
>
> The mul-add-add is however a near-perfect building block for
> crypto-class intermediate length (256-4096 bits) bigint math.

Which is why IBM added it in POWER9 (although they called it
blockchain or something like that - marketing).

Theo

Nov 4, 2021, 5:33:18 PM
Marcus <m.de...@this.bitsnbites.eu> wrote:
> I also noticed that ARMv8 has a similar (but more flexible 4-operand)
> instruction. In fact, in ARMv8 the MUL instruction is an /alias/ for the
> MADD instruction (the addend is set to zero).
>
> However, many other ISA:s (apart from DSP ISA:s) seem to be lacking this
> instruction (x86, RISC-V, ARMv7, POWER?, My 66000?). How come?

32-bit ARM has a MLA instruction. This came in in ARM2, which added a
hardware multiply after ARM1 went without. The ARM2 datasheet says both MUL
and MLA take 'up to' 16 S-cycles (sequential cycles, ie without wait states,
bearing in mind there's no cache so it depends on DRAM for instruction
fetch), although it implies they can terminate early. I think the add
therefore comes 'free' in terms of cycles.

So it appeared in ARM architecture v2 and has stuck around since. I think I
remember Sophie Wilson saying it was added to aid in graphics, although
somebody else suggests sound.

> It seems to me that it's a very useful instruction, and that it has
> a relatively small hardware cost.

Indeed, it can cut down the instruction overhead of DSP things (and sound
and graphics) by quite a bit.

Theo

Terje Mathisen

Nov 5, 2021, 2:31:21 AM
That's not true, at least not for the standard single-wide version: The
latency of the MUL is typically 4-5 cycles (or even more) while an ADD
is 1 cycle, so that's the only saving in this case.

Adding one or two register-size chunks to an N*N->2N MUL is however
quite useful, even more so on architectures without carry flag(s).

a...@littlepinkcloud.invalid

Nov 5, 2021, 5:11:59 AM
anti...@math.uni.wroc.pl wrote:
> Marcus <m.de...@this.bitsnbites.eu> wrote:
>> Which instruction is that? I know about FMA3 for floating-point, but how
>> about integer?
>
> Well, I was tinking about floating-point. However, there is old
> PMADDWD instruction. It is limited to take 4 packed 16-bit numbers
> and produces 2 dot products a0*b0 + a1*b1 and a2*b2 + a3*b3. 64-bit
> version (128-bit dot products of pairs of 64-bit numbers) of this
> would be nice...

VPMADD52LUQ is the useful one. Packed Multiply of Unsigned 52-bit
Integers and Add the Low 52-bit Products to Qword
Accumulators. AVX-512, I'm afraid.

Andrew.

MitchAlsup

Nov 5, 2021, 1:25:24 PM
Why stop there::
<
void Long_multiplication( uint64_t multiplicand[],
                          uint64_t multiplier[],
                          uint64_t sum[],
                          uint64_t ilength, uint64_t jlength )
{
     for( uint64_t i = 0; i < ilength + jlength; i++ )
          sum[i] = 0;

     for( uint64_t j = 0; j < jlength; j++ )
     {
          uint64_t product, mcarry = 0, acarry = 0;
          for( uint64_t i = 0; i < ilength; i++ )
          {
               // {hi, lo} pairs denote a double-width result
               {mcarry, product} = multiplicand[i]*multiplier[j]
                                 + mcarry;
               {acarry, sum[i+j]} = sum[i+j] + acarry + product;
          }
          sum[ilength+j] = mcarry + acarry; // fold the final carries
     }
}
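
(The same loop in plain C, for anyone who wants to run it; this assumes
a compiler with unsigned __int128 in place of the {hi, lo} pair
notation:)

#include <stdint.h>

void long_multiplication_c(const uint64_t multiplicand[],
                           const uint64_t multiplier[],
                           uint64_t sum[], int ilength, int jlength)
{
    for (int i = 0; i < ilength + jlength; i++)
        sum[i] = 0;

    for (int j = 0; j < jlength; j++) {
        uint64_t mcarry = 0, acarry = 0;
        for (int i = 0; i < ilength; i++) {
            unsigned __int128 p =
                (unsigned __int128)multiplicand[i] * multiplier[j] + mcarry;
            mcarry = (uint64_t)(p >> 64);    /* high half of the product */
            unsigned __int128 s =
                (unsigned __int128)sum[i + j] + acarry + (uint64_t)p;
            acarry = (uint64_t)(s >> 64);    /* carry out of the add */
            sum[i + j] = (uint64_t)s;
        }
        sum[ilength + j] = mcarry + acarry;  /* fold final carries */
    }
}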

BGB

unread,
Nov 5, 2021, 5:33:48 PM11/5/21
to
Such is the issue in my case:
MUL is 1-3 cycles in my case (32x32=>64);
MUL is also relatively infrequent in my tests;
MAC saves typically 1 cycle over MUL+ADD;
...

I didn't see much gain from adding it, because there are generally just
not enough integer multiplies to begin with.

Most of the integer MUL+ADD were in the form of "arrays of structs" and
similar, which benefited primarily from an "Rn=Rp+Rm*Imm" instruction.

However, even with this instruction able to fairly effectively handle
this use-case, there wasn't enough of them in the general case to make
much visible impact on performance.


One could argue that it could be "better" to have a full 64-bit hardware
multiply and not "fake it" with 32-bit multiply ops, however, the logic
for "faking it" fits into the pipeline moderately well. The most obvious
addition would be if there were a DMULS.L variant which could take the
high halves of a register (and avoid the need for shift ops).

Say (64x64=>128):
MOV 0, R17 | DMULUH.L R4, R5, R16 //1c, Lo(R4) * Hi(R5)
MOV 0, R19 | DMULUH.L R5, R4, R18 //1c, Lo(R5) * Hi(R4)
DMULSHH.L R4, R5, R3 //1c, Hi(R4) * Hi(R5)
DMULU.L R4, R5, R2 //1c, Lo(R4) * Lo(R5)
ADDX R16, R18, R20 //2c (interlock)
SHLDX R20, 32, R20 //1c (*2)
ADDX R2, R20, R2 //1c
RTS //2c (predicted)

Time: ~ 10 clock cycles (excluding function-call overheads).

Otherwise, as-is, one would need to spend an extra clock cycle on a pair
of "SHLD.Q" operations or similar.

*2: Newly added encoding.



The closest exception to this (DMACS.L being "kinda pointless") was
Dhrystone, which has some multidimensional arrays, which were used
highly enough to have a visible effect, but even then it was still
fairly modest.

Dhrystone score is still pretty weak though (~ 69k ATM; 0.79 DMIPS/MHz).

Though, this would seem to still be pretty solid at least if going by
"vintage stats" (eg; results for this benchmark as posted back in the
1990s).


Testing on my PC, there is actually a fairly large difference between
compilers and between optimization settings when it comes to this
benchmark (and both GCC and Clang seem to give notably higher numbers
than MSVC on this test; need to compare "-Os" vs "/O2" or similar to
give MSVC much hope of winning this one).

Still not really gotten "RISC-V Mode" working well enough to test an
RV64I build of Dhrystone to see how it compares; with both versions
running on the same hardware (could confirm or deny the "GCC is using
arcane magic on this benchmark" hypothesis).

Though, probably doesn't help that BGBCC kinda sucks even vs MSVC when
it comes to things like register allocation (despite x86-64 having half
as many registers, MSVC still somewhat beats BGBCC at the "not spilling
registers to memory all the time" game).



I actually got a lot more performance gains more recently, mostly by
some fiddling with the register allocation logic (and was, for the first
time in a while, actually able to get a visible improvement in terms of
Doom framerate, *).

*: Doom is now ~ 15-25 (for the most part), now occasionally hitting the
30 fps limiter, and no longer dropping into single-digit territory.

This was a tweak that mostly eliminated a certain amount of "double
loading" (loading the same value from memory into multiple registers
using multiple memory loads). As well as some "spill value to memory,
immediately reload into another register" cases, ...

Mostly this was by adding logic to check whether a given variable would
be referenced again within the same basic-block, preferentially loading
a value into a register and working on the in-register version if this
was the case, or preferentially spilling to or operating on memory (via
scratch registers if needed) if this variable will not be used again
within the same basic block (the previous register allocation logic did
not use any sort of "forward looking" behavior).

This was along with recently running into some code (while working on
DMACS.L), which was handling scaled-addressing by acting as if it were
still generating code for SuperH, namely trying to build index-scale
from fixed-shift operators and ADD operations, and seemingly unaware
that the ISA now has 3R and 3RI operations (*3).

Could still be better here.

*3: Like, BJX2 is well past the stage of needing to do things like:
MOV RsrcA, Rdst
ADD RsrcB, Rdst
...

Or, trying to implement things like Rd=Rs*1280 as:
MOV Rs, Rd
SHLL2 Rd
ADD Rs, Rd
SHLL8 Rd

Just sorta used "#if 0" on a lot of this, since a 3RI MUL is now the
faster option (but, there are still some amount of "dark corners" like
this in the codegen I guess). Granted, the ISA had been enough of a
moving target that much of the codegen is sort of like a mass of layers
partly divided along different parts of the ISAs development.

So, some high-level parts of the codegen still pretend they are
targeting SuperH, then emit instructions with encodings for a much
earlier version of the ISA, which are then bit-twiddled into their newer
encodings. Some amount of it should probably be rewritten, but endless
"quick and dirty hacks" was an easier prospect than "just go and rewrite
all this from a clean slate".

Well, and if I did this, almost may as well go and finally banish the
"XML Demon" from the compiler frontend (because, well, using DOM as the
basis for ones' C compiler AST system was not such a great idea in
retrospect; have spent over a decade dealing with the fallout from this
decision, but it never being quite bad enough to justify "throw it out
and rewrite the whole thing from the ground up").

Though, my younger self was from an era when XML and SQL and similar
were hyped as the silver bullets to end all of one's woes (well, also
Java; but my younger self was put off enough by how painful and awkward
it was, and at the time its performance was kinda trash which didn't
exactly help matters). Actually, this seems to be a common theme with
this era, most of the "silver bullet" technologies were about like
asking someone to build a house with an oversized lead mallet.


I guess this also slightly reduces the ranking position of "MOV Rm, Rn",
but there is still a lot more 'MOV' than there probably should be.

Probably still need to work on things, eg:
Trivial functions referenced via function pointers should probably not
create stack frames and save/restore GBR;
...


> Adding one or two register-size chunks to an N*N->2N MUL is however
> quite useful, even more so on architectures without carry flag(s).
>

The widening MUL is the default in my case; the narrow MUL is actually
implemented in hardware by taking the widening version and then sign or
zero extending the result.


Note that sign- and zero-extending arithmetic operators are fairly
useful in terms of keeping old C code behaving as expected. Some amount
of the code I am working with is prone to misbehave if integer values go
"out of range" rather than implementing modulo-2^32 wrapping behavior;
which in turn means that any operations prone to produce out-of-range
results either need sign/zero-extended versions, or need explicit sign
or zero extensions inserted.
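
A tiny generic example of the kind of code in question; without 32-bit
(or sign/zero-extending) ops, the 64-bit multiply-add result has to be
explicitly re-truncated to keep the mod-2^32 wrap:

#include <stdint.h>

/* Relies on modulo-2^32 wrapping of the 32-bit accumulator. */
uint32_t hash_str(const char *s)
{
    uint32_t h = 0;
    while (*s)
        h = h * 31 + (uint8_t)*s++;  /* must wrap mod 2^32 */
    return h;
}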

I suspect this may also be a reason why seemingly ARM64 and RV64I also
include 32-bit subsets of the various integer operations (even if
arguably it would be both simpler and "more RISC" to simply do
everything using 64-bit ALU operations).

But, then, the split seems to be between, eg:
x86-64 and ARM64: providing a nearly full redundant set of 32b and 64b
operations, with 32b ops typically being zero-extended;
the BJX2 and RV64I route: providing a more limited set of sign/zero-
extended operators.


> Terje
>

Terje Mathisen

Nov 6, 2021, 7:04:20 AM
That looks a lot like a crypto-size bigint MUL; it can of course be
synthesized very easily if you have that MUL-ADD-ADD intrinsic, or even
better, the My66000 CARRY feature.

BTW, I assume you would normally have code at the end to detect
overflow, i.e. a non-null final mcarry?

MitchAlsup

Nov 6, 2021, 1:13:44 PM
As written it is a large×large multiply in unsigned with a large+large
product:: does not overflow.
<
Could easily be changed to signed and detect OVERFLOW into the
highest significant container.

anti...@math.uni.wroc.pl

Nov 6, 2021, 5:03:02 PM
Terje Mathisen <terje.m...@tmsw.no> wrote:
> Theo wrote:
> > Marcus <m.de...@this.bitsnbites.eu> wrote:
> >> I also noticed that ARMv8 has a similar (but more flexible 4-operand)
> >> instruction. In fact, in ARMv8 the MUL instruction is an /alias/ for the
> >> MADD instruction (the addend is set to zero).
> >>
> >> However, many other ISA:s (apart from DSP ISA:s) seem to be lacking this
> >> instruction (x86, RISC-V, ARMv7, POWER?, My 66000?). How come?
> >
> > 32-bit ARM has a MLA instruction. This came in in ARM2, which added a
> > hardware multiply after ARM1 went without. The ARM2 datasheet says both MUL
> > and MLA take 'up to' 16 S-cycles (sequential cycles, ie without wait states,
> > bearing in mind there's no cache so it depends on DRAM for instruction
> > fetch), although it implies they can terminate early. I think the add
> > therefore comes 'free' in terms of cycles.
> >
> > So it appeared in ARM architecture v2 and has stuck around since. I think I
> > remember Sophie Wilson saying it was added to aid in graphics, although
> > somebody else suggests sound.
> >
> >> It seems to me that it's a very useful instruction, and that it has
> >> a relatively small hardware cost.
> >
> > Indeed, it can cut down the instruction overhead of DSP things (and sound
> > and graphics) by quite a bit.
>
> That's not true, at least not for the standard single-wide version: The
> latency of the MUL is typically 4-5 cycles (or even more) while an ADD
> is 1 cycle, so that's the only saving in this case.

Typical applications of muladd are accumulating dot-product-like things.
In such cases, what matters is the throughput of the multiply and the
latency of the add. In fact, by using multiple accumulators one can
compensate for the latency of the add. If muladd has higher throughput,
then it is highly useful. I would hope for a muladd that has 1 cycle of
latency with respect to the added argument, so that one could execute
chained muladds one per clock, but maybe this is too much...
--
Waldek Hebisch
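
To make the multiple-accumulator point concrete, a minimal sketch (the
function and the choice of four accumulators are illustrative only, not
from any of the posters' code):

    #include <stddef.h>
    #include <stdint.h>

    /* Four independent dependency chains hide a multi-cycle
       multiply-add latency: each accumulator is touched only every
       4th iteration. Assumes n is a multiple of 4 for brevity. */
    int64_t dot4(const int32_t *a, const int32_t *b, size_t n)
    {
        int64_t acc0 = 0, acc1 = 0, acc2 = 0, acc3 = 0;
        for (size_t i = 0; i < n; i += 4) {
            acc0 += (int64_t)a[i + 0] * b[i + 0];
            acc1 += (int64_t)a[i + 1] * b[i + 1];
            acc2 += (int64_t)a[i + 2] * b[i + 2];
            acc3 += (int64_t)a[i + 3] * b[i + 3];
        }
        return (acc0 + acc1) + (acc2 + acc3);
    }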

MitchAlsup

unread,
Nov 6, 2021, 5:35:43 PM11/6/21
to
Any reasonable 2-wide machine can already do this (in-order machines
merely need the code scheduled; out-of-order machines just need a
big enough window to cover the MUL+ADD latency). There is not a lot
of gain in making it a single instruction, nor is there a large HW cost
to making it a single instruction.
<
Today's designers should treat this as a free variable.
> --
> Waldek Hebisch

MitchAlsup

unread,
Nov 6, 2021, 5:37:16 PM11/6/21
to
I should also add, making integer MUL+ADD a single instruction is a
LOT easier with a combined register file than separate integer and FP files.
{The 3-operand data path is already present for FMAC/FMAD.}
> > --
> > Waldek Hebisch

Marcus

unread,
Nov 9, 2021, 1:48:37 AM11/9/21
to
That's actually one of the reasons I went this route: I reasoned that I
needed fused multiply-add for floating-point anyway. Same thing with the
bitwise SELect (IIRC it's called MIX in My 66000).

/Marcus

Marcus

unread,
Nov 9, 2021, 2:32:44 AM11/9/21
to
I think that DSP:s commonly use a dedicated accumulator register that
works like a low-latency feedback into the MAC adder, thus enabling
1 MAC / cycle (per MAC unit) without the need for insn scheduling or
interlocks.

In my design it would be possible to achieve the same thing if I
implemented "late forwarding". I.e. identify that we're using the output
from a previous MADD operation as an addend to the next MADD operation,
and rather than waiting for that result to be ready before starting the
multiplication, go ahead and start the operation immediately and forward
the MADD result into the adder.

(Let's see if this ASCII art works...)

Here's how the MADD pipeline works now (output->input latency = +2):

  MUL1 : MUL2 : MUL3/ADD :
  ^                      |
  +----------------------+


...and with late forwarding (no extra output->input latency):

  MUL1 : MUL2 : MUL3/ADD :
                ^        |
                +--------+

I still have not implemented late forwarding, though, as it feels like
there are a number of things that you can easily get wrong.

/Marcus

EricP

unread,
Nov 9, 2021, 8:11:11 AM11/9/21
to
A base-indexed store needs 3 register operand paths,
and 4 operand buses if you allow an immediate too,
if it doesn't use your trick of reading the register to be
stored at the Write-Back stage instead of at Register Read.

It can make multi-issue operand buses a bit of a rat's nest.


Stefan Monnier

unread,
Nov 9, 2021, 8:30:35 AM11/9/21
to
> I should also add, making integer MUL+ADD a single instruction is a
> LOT easier with a combined register file than separate integer and FP files.
> {The 3-operand data path is already present for FMAC/FMAD.}

Reminds me of a question: IIUC, the MADD instruction will typically have
a latency of a few cycles (let's say 3) but the latency of the signal
is not the same for all 3 inputs. More specifically, IIUC the 3rd
("carry") input can arrive as late as the last cycle without impacting
the overall latency.

This means that in theory we can have a sequence of MADD instructions
all accumulating into the same register at a rate of 1 per cycle
(assuming the unit is pipelined), but if all inputs are "read" at
the same time, then the rate goes down to 1/3 per cycle.

How do OoO cores avoid this throughput problem?
Do they schedule the MADD by predicting that the carry will be available
2 cycles later? How do they do this prediction? What if it fails?

AFAIK the same kind of problem affects stores since the input registers
affecting the address are needed earlier than the input register
containing the to-be-stored data, but IIUC these are solved more easily
because the two parts affect different units so the stores can
conceptually be split into 2 sub-operations and go back to waiting in
the ROB between the two.


Stefan

EricP

unread,
Nov 9, 2021, 10:15:30 AM11/9/21
to
The store data isn't needed until later in the LSQ pipeline,
but not so late that it blocks a potential store-load forwarding.

This is from the design of my simulator's Load-Store Queue...

Some addresses don't require calculation or reading a register,
for example immediate absolutes.
RIP-relative can be calculated in Decode because it has
an adder for calculating RIP-rel branch destinations.
Such an Effective Address (EA) can issue straight from the front-end
uOp pipeline to the Load-Store Queue.

Others require reading a register but no calculation, e.g. [reg].
That register may be ready, in which case the EA can go straight to
the LSQ, or be in-flight and require waiting for the EA to be forwarded.

Others require a trip through AGEN for EA calculation,
possibly syncing with in-flight operands being forwarded.

Once the EA is available then comes virtual translate by state
machine(s) which could have multiple page table walkers at once,
generating multiple outstanding L1 or L2 misses.
A load or store's data might straddle a page boundary, requiring
multiple EA translates and producing multiple Physical Addresses (PAs).
(These page table walker(s) multiplex their requests with other
LSQ operations talking to the D$L1 cache.)

After the above is complete and we have 1 or 2 PAs, address
disambiguation looks at older and younger LSQ entries
to see if any are for the same cache line and subject to
store ordering (for stores) or store-load forwarding (for loads).
And there can be 1 or 2 of those for data that straddles cache lines.

After all of that the store data is actually needed in the LSQ
in case younger load(s) require store-load forwarding
or the store instruction is next to retire.
That store data may come from a register or from in-flight forwarding.

The younger load(s) that are candidates for S-L-forwarding may already be
waiting in the LSQ, held up by an unresolved older store address or data,
or may arrive later.
Either of the store or load data may straddle cache lines.

Then the LSQ talks to the D$L1 cache which may itself allow
multiple hit-under-miss operations.
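
Reading the above as an entry lifecycle may help; the following is only
a paraphrase of the stages listed, with invented names, not code from
the simulator itself:

    /* One possible phrasing of the LSQ entry lifecycle described
       above; all names are invented for illustration. */
    typedef enum {
        LSQ_WAIT_EA,       /* EA needs AGEN and/or an in-flight operand  */
        LSQ_TRANSLATE,     /* page walk(s); may straddle a page boundary */
        LSQ_DISAMBIGUATE,  /* check older/younger entries for same line  */
        LSQ_WAIT_DATA,     /* stores: data may arrive by forwarding      */
        LSQ_CACHE_ACCESS,  /* talk to D$L1; hit-under-miss allowed       */
        LSQ_RETIRED
    } lsq_state;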


MitchAlsup

unread,
Nov 9, 2021, 11:11:08 AM11/9/21
to
On Tuesday, November 9, 2021 at 7:30:35 AM UTC-6, Stefan Monnier wrote:
> > I should also add, making integer MUL+ADD a single instruction is a
> > LOT easier with a combined register file than separate integer and FP files.
> > {The 3-operand data path is already present for FMAC/FMAD.}
> Reminds me of a question: IIUC, the MADD instruction will typically have
> a latency of a few cycles (let's say 3) but the latency of the signal
> is not the same for all 3 inputs. More specifically, IIUC the 3rd
> ("carry") input can arrive as late as the last cycle without impacting
> the overall latency.
>
> This means that in theory we can have a sequence of MADD instructions
> all accumulating into the same register at a rate of 1 per cycle
> (assuming the unit is pipelined), but if all inputs are "read" at
> the same time, then the rate goes down to 1/3 per cycle.
>
> How do OoO cores avoid this throughput problem?
<
We build the function unit to accept all 3 operands on the same clock.
Then we don't use the 3rd operand until "later"
<
> Do they schedule the MADD by predicting that the carry will be available
> 2 cycles later? How do they do this prediction? What if it fails?
>
> AFAIK the same kind of problem affects stores since the input registers
> affecting the address are needed earlier than the input register
> containing the to-be-stored data, but IIUC these are solved more easily
<
In fact you do not need the data until you have verified permissions
(and possibly cache hit.)

MitchAlsup

unread,
Nov 9, 2021, 11:13:44 AM11/9/21
to
At this point, it is generally easier to perform the calculation in AGEN
and just have the data path route the bits there over the operand
buses. This gives uniform timing, which makes pipelining easier.

Stefan Monnier

unread,
Nov 9, 2021, 2:21:29 PM11/9/21
to
MitchAlsup [2021-11-09 08:11:06] wrote:
> On Tuesday, November 9, 2021 at 7:30:35 AM UTC-6, Stefan Monnier wrote:
>> Reminds me of a question: IIUC, the MADD instruction will typically have
>> a latency of a few cycles (let's say 3) but the latency of the signal
>> is not the same for all 3 inputs. More specifically, IIUC the 3rd
>> ("carry") input can arrive as late as the last cycle without impacting
>> the overall latency.
>>
>> This means that in theory we can have a sequence of MADD instructions
>> all accumulating into the same register at a rate of 1 per cycle
>> (assuming the unit is pipelined), but if all inputs are "read" at
>> the same time, then the rate goes down to 1/3 per cycle.
>>
>> How do OoO cores avoid this throughput problem?
> We build the function unit to accept all 3 operands on the same clock.
> Then we don't use the 3rd operand until "later"

IOW you don't avoid this throughput problem?
I mean:

accum1 = MADD(x1, x2, accum);
accum2 = MADD(x3, x4, accum1);

ends up with a latency of 2*N cycles instead of N+1, right?
Because we can't start the second MADD before the first is over :-(

For integers, we can replace the code with;

x12 = MUL(x1, x2);
x34 = MUL(x3, x4);
accum1 = ADD(x12, accum);
accum2 = ADD(x34, accum1);

with latency N+2, which is likely better than 2*N.


Stefan

MitchAlsup

unread,
Nov 9, 2021, 4:26:01 PM11/9/21
to
On Tuesday, November 9, 2021 at 1:21:29 PM UTC-6, Stefan Monnier wrote:
> MitchAlsup [2021-11-09 08:11:06] wrote:
> > On Tuesday, November 9, 2021 at 7:30:35 AM UTC-6, Stefan Monnier wrote:
> >> Reminds me of a question: IIUC, the MADD instruction will typically have
> >> a latency of a few cycles (let's say 3) but the latency of the signal
> >> is not the same for all 3 inputs. More specifically, IIUC the 3rd
> >> ("carry") input can arrive as late as the last cycle without impacting
> >> the overall latency.
> >>
> >> This means that in theory we can have a sequence of MADD instructions
> >> all accumulating into the same register at a rate of 1 per cycle
> >> (assuming the unit is pipelined), but if all inputs are "read" at
> >> the same time, then the rate goes down to 1/3 per cycle.
> >>
> >> How do OoO cores avoid this throughput problem?
> > We build the function unit to accept all 3 operands on the same clock.
> > Then we don't use the 3rd operand until "later"
> IOW you don't avoid this throughput problem?
> I mean:
>
> accum1 = MADD(x1, x2, accum);
> accum2 = MADD(x3, x4, accum1);
>
> ends up with a latency of 2*N cycles instead of N+1, right?
<
Yes,
<
> Because we can't start the second MADD before the first is over :-(
<
Yes,
<
But as long as you do not exceed the size of the execution window,
it all works; you can put new instructions into the window every
cycle, you can retire instructions from the window every cycle,
and each function units can start a calculation every cycle. All
without SW having to schedule the code or to apply any Herculean
effort in code selection.
<
>
> For integers, we can replace the code with;
>
> x12 = MUL(x1, x2);
> x34 = MUL(x3, x4);
> accum1 = ADD(x12, accum);
> accum2 = ADD(x34, accum1);
>
> with latency N+2, which is likely better than 2*N.
<
Given that IMUL takes 3 cycles, the count is N+3;
the accum1 ADD cannot begin until the x12 MUL completes.
>
>
> Stefan

Stefan Monnier

unread,
Nov 9, 2021, 5:04:05 PM11/9/21
to
>> For integers, we can replace the code with;
>>
>> x12 = MUL(x1, x2);
>> x34 = MUL(x3, x4);
>> accum1 = ADD(x12, accum);
>> accum2 = ADD(x34, accum1);
>>
>> with latency N+2, which is likely better than 2*N.
> <
> Given the IMUL takes 3 cycles, the count is N+3;
> accum1 = cannot begin until x12 = MUL completes.

I don't understand. My "N" was the latency of MUL (and MADD).
So with your N=3 it means the MADD version takes 6 cycles while the
MUL+ADD version only takes 5.


Stefan

MitchAlsup

unread,
Nov 9, 2021, 5:30:34 PM11/9/21
to
+----------+----------+----------+----------+----------+----------+
| cycle 1  | cycle 2  | cycle 3  | cycle 4  | cycle 5  | cycle 6  |
+----------+----------+----------+----------+----------+----------+
Cycle 1: MUL x12 begins
Cycle 2: MUL x34 begins
Cycle 3: no instruction gets launched
Cycle 4: 1st ADD begins; MUL x12 forwards to ADD
Cycle 5: 2nd ADD begins; 1st ADD ends; MUL x34 forwards to ADD
Cycle 6: 2nd ADD ends
>
>
> Stefan

BGB

unread,
Nov 9, 2021, 5:37:53 PM11/9/21
to
Side note:
For reasons like this, the BJX2 effectively drops down to 2-lane when
doing a memory store, because the memory store eats Lane 3 to supply the
3rd register port.

A partial result is that 3-wide bundles end up being rare, because:
* It is hard to find enough usable ILP;
* They are limited to "ALU|ALU|ALU" or "ALU|ALU|LD".

Also, a lot of the time:
* Load and Store ops tend to be the dominant operations.
* Other non Ld/St ops are frequently run in parallel with a Ld/St op.
...


Or, if expressed as 4R operations, the ports would look like (with a
6R/3W regfile, labeled as Rs/Rt/Ru/Rv/Rx/Ry -> Rm/Rn/Ro):
  1-Wide:
    Sc: Rs, Rt, Rx, Rm
    W : Ru:Rs, Rv:Rt, Ry:Rx, Rm:Rn
  2-Wide:
    Ru, Rv, Ry, Rn | Rs, Rt, Rx, Rm
  3-Wide:
    Rx, Ry, ZR, Ro | Ru, Rv, ZR, Rn | Rs, Rt, ZR, Rm


For a few operations, there is an implicit 4th 'Imm' value field, which
is normally forwarded into one of the other ports if an Immed form of an
instruction is used, but serves its own role in a few cases:
As an optional additional displacement for Load/Store operations (if
enabled, *);
Giving the Rounding-Mode for Op64 encodings of FPU operations (this is
decoded as zero for the 32-bit encodings, and treated as constant RNE
rounding).

There is another encoding which encodes a dynamic rounding mode, which
may be used if "FENV_ACCESS" is enabled for a given function.

*: Namely, one of:
  (Rm, Ro*Sc, Disp9)  // Explicit Scale
  (Rm, Ro, Disp11)    // Implicit Scale (Element Size)
with this 'extra' displacement being implicitly unscaled.

It isn't currently used, and it looks like the compiler would need to be
able to merge several accesses together to be able to use it
effectively, e.g.:
foo.bar.baz[index];
which could, in theory, be merged into a single operation, rather than
several LEA operations followed by a Load operation (namely, the latter
is what BGBCC currently generates here).




Otherwise, the past few days were mostly taken up with partly redesigning
BGBCC's AST system. I had hoped to be able to "escape" from it being
in effect built on top of XML (or, at least, an abstract model
resembling that of XML), but (sadly) this would also have required a
near-complete rewrite of the compiler front-end.

In effect, I was able to eliminate its use of linked lists.

So, now the AST nodes look like, roughly:
  Radix-8 key/value mapping (fewer than 8 keys = single node):
    0-7 keys: single node;
    8-15 keys: three nodes (splits B-Tree style);
    16-63 keys: adds 1 node for every 8 keys;
    64-511 keys: uses a 3-level tree.
  Radix-16 sub-node list:
    0-3 children: encoded using keys;
    4-15 children: encoded as a single list node;
    16-31 children: encoded as 3 list nodes;
    32-255 children: adds a list node for every 16 children;
    256-4095 children: uses a 3-level tree;
    ...

The keys may be either node attributes, or child nodes in some cases
(most "simple" cases fit within a single node structure).
The key field is generally stored as a 16-bit number holding a 4-bit type
tag and a 12-bit symbol (*1). Keys are stored sorted by index, so that
binary search can be used above a certain minimum (for 1-4 keys,
linear search is faster); for non-leaf nodes, the key holds the
value of the lowest-numbered subkey.

The value field is 64 bits, with a type interpreted based on the key's
type tag.


*1: Interned index numbers are used rather than strings because this is
both more compact and significantly faster. Numeric types are expressed
directly as integer or floating-point values (int128 or float128 values
are split into high and low halves). String values (including most
symbols from the source program, as well as "gensym" names) are
generally interned into a (larger) string table. One can probably guess
from this where some early bottlenecks were. The ASTs make no
provision for XML namespaces or similar, as these were mostly N/A for a
compiler (so were dropped early on).
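
A rough sketch of what such a leaf node might look like (field names,
widths, and the helper are guesses for illustration, not BGBCC's actual
declarations):

    #include <stdint.h>

    #define AST_KEYS_PER_NODE 8   /* 0-7 keys fit in a single node */

    /* Keys are (4-bit type tag << 12) | 12-bit interned symbol,
       kept sorted; values are 64 bits, interpreted via the tag. */
    typedef struct {
        uint8_t  nkeys;
        uint16_t key[AST_KEYS_PER_NODE];
        uint64_t val[AST_KEYS_PER_NODE];
    } ast_node;

    /* Linear search; with only a handful of keys this beats
       binary search. */
    static int ast_find_key(const ast_node *n, uint16_t k)
    {
        for (int i = 0; i < n->nkeys; i++)
            if (n->key[i] == k)
                return i;
        return -1;
    }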


I had initially considered storing everything as keys, however:
This does not scale very well;
This would put a hard limit on the number of child nodes per parent node
(and the limit was small enough to be blown out by things like top-level
function prototypes and similar).

This replaces the use of linked lists. It slightly reduces the size of
each node (despite me increasing the number of keys per node), while at
the same time eliminating most of the need for mass tree cloning
when doing expression-reduction operations (this was wasteful both in
terms of clock cycles and memory footprint, but was needed because nodes
with internal linked-list fields can't be shared between multiple trees;
that restriction does not apply to nodes held in an array).


The main (significant) change needed in the rest of the compiler was
mostly rewriting bunches of logic to replace linked-list walks with
indexing into a node (and changing out logic in a lot of places to deal
with the side effects of lists of nodes now needing another node to hold
them; this change wasn't exactly transparent in some places).


Well, this is one checklist item; another would be getting around to
moving the compiler's bytecode IR stage over to using TLV packaging (and
also properly splitting the AST->RIL and RIL->3AC stages into two
separate stages, and better nailing down the specification and semantics
for the IR).

...

Ivan Godard

unread,
Nov 9, 2021, 6:05:03 PM11/9/21
to
Except for reduction and other inter-instruction data dependencies. Then
you pay the full latency.

MitchAlsup

unread,
Nov 9, 2021, 6:26:26 PM11/9/21
to
So does essentially everybody.
<
It was not until the invention of the quire that reduction arithmetic
became modern.

Ivan Godard

unread,
Nov 9, 2021, 6:58:33 PM11/9/21
to
Please forgive my ignorance, but what's a "quire" in this context?

MitchAlsup

unread,
Nov 9, 2021, 8:03:30 PM11/9/21
to
The quire is the error-free accumulator designed into posits. It is as long
as required to be able to accumulate any result calculated and not
lose a single bit of precision (2048 bits for the posit effective double).
<
We (they, actually) could do something similar for 754..........costing
2100 bits in IEEE double (since you have to account for denorms).

Terje Mathisen

unread,
Nov 10, 2021, 3:48:34 AM11/10/21
to
We have discussed this many times here, afair always using the
SuperAccumulator term?

The key is that with carry-save/redundant storage, it becomes trivial to
accept one or more updates every cycle: Each full adder is just 2 or 3
gate delays, right?

If I were targeting just one update/cycle I would probably work in
multi-bit (8?) chunks with additional carry storage only at the
boundaries.

Actually extracting and rounding the final result requires an FF1
circuit that covers the entire array, a way to extract ~55/114 bits from
that point on and a circuit to OR together all trailing bits.

If the accumulator is stored as bytes, then you can just grab the 8 or
16 bytes that encompass the target mantissa and shift them down by 0-7
bits; that seems very doable, while the tail-end zero/non-zero detector
can be OR'ed into the bottom bit.
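
A hedged C sketch of that extraction path, assuming the accumulator is
a little-endian byte array (all names here are illustrative):

    #include <stdint.h>
    #include <string.h>

    /* Find the highest non-zero byte (a byte-granular FF1), copy the
       8 bytes covering the target mantissa, and OR all trailing
       bytes into a sticky flag. The caller still shifts by 0-7 bits
       and rounds. Assumes nbytes >= 8. */
    uint64_t grab_mantissa(const uint8_t *acc, int nbytes, int *sticky)
    {
        int top = nbytes - 1;
        while (top > 7 && acc[top] == 0)
            top--;

        uint64_t window;
        memcpy(&window, &acc[top - 7], 8);  /* little-endian host */

        *sticky = 0;
        for (int i = 0; i < top - 7; i++)
            *sticky |= (acc[i] != 0);

        return window;
    }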

Stefan Monnier

unread,
Nov 10, 2021, 8:31:56 AM11/10/21
to
So we agree: the MUL+ADD version has lower latency than the MADD version?


Stefan

Stefan Monnier

unread,
Nov 10, 2021, 10:18:36 AM11/10/21
to
> So we agree: the MUL+ADD version has lower latency than the MADD version?

I find this disappointing: I thought non-legacy architectures wouldn't
suffer from such cases where a "complex" instruction results in slower
code than its decomposition into its constituent simpler operations.

I suspect that for FPMADD the problem doesn't occur because FPADD isn't
cheap, which means both that the decomposed code will be slower, and
also that even if we tried to be super-extra careful about scheduling,
FPMADD wouldn't be able to delay reading its third argument as much as
MADD would (it's not just a matter of stuffing the bits into a carry-save
adder; you also need to first shift/align them according to the
exponent).


Stefan

MitchAlsup

unread,
Nov 10, 2021, 2:39:03 PM11/10/21
to
A single 3-2 counter is 1 gate of delay.
A single 4-2 compressor is 2 gates of delay.
So, basically, you can add as many things into the accumulator as you
can afford to route to the accumulator. Certainly 8 lanes per cycle is
reasonable.
>
> If I was targeting just one update/cycle I would probably work in
> multi-bit (8?) bit chunks with additional carry storage only at the
> boundaries.
>
> Actually extracting and rounding the final result requires an FF1
> circuit that covers the entire array, a way to extract ~55/114 bits from
> that point on and a circuit to OR together all trailing bits.
<
In general, it is fairly easy to remember where the HoB is.
In general, assembling the sticky bit is only a few gates of delay (5-ish)
>
> If the accumulator is stored as bytes, then you can just grab the 8 or
> 16 bytes that encompasses the target mantissa and shift it down by 0-7
> bits, that seems very doable, while the tail end zero/non-zero detector
> can be OR'ed into the bottom bit.
<
The posit guys suggest a model where one accesses the quire as an array in
memory, accessing 2 quadwords per accumulation based on the exponent
of the intermediate augend.

MitchAlsup

unread,
Nov 10, 2021, 2:41:08 PM11/10/21
to
Integer MAC has the property that no bits have been lost.
Floating MAC has the property that both the multiply and the add are performed
prior to any rounding (rounding = loss of precision).
<
So while IMAC can be done as 2 instructions, FMAC cannot.
>
>
> Stefan

Terje Mathisen

unread,
Nov 11, 2021, 5:17:44 AM11/11/21
to
MitchAlsup wrote:
> On Wednesday, November 10, 2021 at 2:48:34 AM UTC-6, Terje Mathisen wrote:
>> Actually extracting and rounding the final result requires an FF1
>> circuit that covers the entire array, a way to extract ~55/114 bits from
>> that point on and a circuit to OR together all trailing bits.
> <
> In general, it is fairly easy to remember where the HoB is.
> In general, assembling the sticky bit is only a few gates of delay (5-ish)
>>
Sticky is just a wide zero/non-zero detector, right?

>> If the accumulator is stored as bytes, then you can just grab the 8 or
>> 16 bytes that encompasses the target mantissa and shift it down by 0-7
>> bits, that seems very doable, while the tail end zero/non-zero detector
>> can be OR'ed into the bottom bit.
> <
> The posit guys suggest a model where one accesses the quire as an array in
> memory, accessing 2 quadwords per accumulation based on the exponent
> of the intermediate augend.

If I had to implement it in SW, I'm guessing that an approach which
allows delayed carry updates would be nice.

I.e. something like 48 bits stored in each 64-bit block, each addition
splits the input into two parts based on the exponent, and we can
blindly add up to 65535 entries before we have to propagate the carries,
i.e. usually just a single iteration of this at the end?

Using 50% storage (32 bits in each 64-bit word) would make the alignment
slightly faster.

It would also be possible to let each block be either positive or
negative, with just a very small increase in the final carry/overflow
propagation.
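
A minimal C sketch of this delayed-carry scheme, under the stated
assumption of 48 payload bits per 64-bit block (the limb count, bias,
and names are my own illustration):

    #include <stdint.h>

    #define SA_LIMBS 64                    /* illustrative size        */
    #define SA_MASK  0xFFFFFFFFFFFFull     /* 48 payload bits per limb */

    typedef struct { int64_t limb[SA_LIMBS]; } superacc;

    /* Add a 53-bit mantissa whose LSB sits at 'bitpos' within the
       accumulator (already biased non-negative). The 16 spare bits
       per limb absorb up to 65535 additions before carries must be
       propagated. */
    static void sa_add(superacc *s, uint64_t mant53, int bitpos)
    {
        int idx   = bitpos / 48;
        int shift = bitpos % 48;
        unsigned __int128 m = (unsigned __int128)mant53 << shift;
        s->limb[idx]     += (int64_t)(m & SA_MASK);
        s->limb[idx + 1] += (int64_t)((m >> 48) & SA_MASK);
        s->limb[idx + 2] += (int64_t)(m >> 96); /* 53+47 bits span 3 limbs */
    }

    /* The single carry-propagation pass, usually run once at the end. */
    static void sa_normalize(superacc *s)
    {
        int64_t carry = 0;
        for (int i = 0; i < SA_LIMBS; i++) {
            int64_t t = s->limb[i] + carry;
            s->limb[i] = t & SA_MASK;
            carry = t >> 48;  /* arithmetic shift handles negative blocks */
        }
    }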

Thomas Koenig

unread,
Nov 12, 2021, 1:53:10 AM11/12/21
to
Stefan Monnier <mon...@iro.umontreal.ca> schrieb:
> MitchAlsup [2021-11-09 08:11:06] wrote:
>> On Tuesday, November 9, 2021 at 7:30:35 AM UTC-6, Stefan Monnier wrote:
>>> Reminds me of a question: IIUC, the MADD instruction will typically have
>>> a latency of a few cycles (let's say 3) but the latency of the signal
>>> is not the same for all 3 inputs. More specifically, IIUC the 3rd
>>> ("carry") input can arrive as late as the last cycle without impacting
>>> the overall latency.
>>>
>>> This means that in theory we can have a sequence of MADD instructions
>>> all accumulating into the same register at a rate of 1 per cycle
>>> (assuming the unit is pipelined), but if all inputs are "read" at
>>> the same time, then the rate goes down to 1/3 per cycle.
>>>
>>> How do OoO cores avoid this throughput problem?
>> We build the function unit to accept all 3 operands on the same clock.
>> Then we don't use the 3rd operand until "later"
>
> IOW you don't avoid this throughput problem?
> I mean:
>
> accum1 = MADD(x1, x2, accum);
> accum2 = MADD(x3, x4, accum1);
>
> ends up with a latency of 2*N cycles instead of N+1, right?
> Because we can't start the second MADD before the first is over :-(

Depends, I think.

The most likely use case is going to be multiplication of
long integers for crypto. For this, you will need access
to the low and high part of the result, and you will
need another add. Let's also assume that the
architecture has separate high and low multiply + adds.

So, looking at the pseudocode that Mitch recently posted:

{mcarry, product} = multiplicand[i]*multiplier[j]
+ mcarry;
{acarry,sum[i+j]} = {sum[i+j]+acarry} + product;

with i the inner loop index and j held in a register. This would
then translate to (roughly):

LD    R_mi, R_multiplicand + i << 3
LD    Rs_ij, R_sum
MADD  R_product, R_mi, R_mcarry
MADDH R_mcarry, R_mi, R_mcarry
ADD   R_temp, Rs_ij, R_acarry   ! Sets a carry
ADDC  R_acarry, #0              ! Increment if carry set
ADD   R_sum, R_temp, R_product  ! Sets a carry
ST    Rs_ij, R_sum
ADDC  R_acarry, #0              ! Increment if carry set

The summation is a bit awkward (if I have that right), but there
is an advantage in using the multiply+add, as the latency is used
up by other instructions.
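
For comparison, a portable C sketch of the same inner loop, with a
128-bit intermediate standing in for the MADD/MADDH pair and 'acarry'
playing the role of the ADD/ADDC chain (function and variable names
are illustrative):

    #include <stddef.h>
    #include <stdint.h>

    /* One row of a schoolbook bigint multiply:
       sum[j .. n+j] += multiplicand[0 .. n-1] * mj */
    void mul_row(uint64_t *sum, const uint64_t *multiplicand,
                 uint64_t mj, size_t n, size_t j)
    {
        uint64_t mcarry = 0, acarry = 0;
        for (size_t i = 0; i < n; i++) {
            unsigned __int128 p =
                (unsigned __int128)multiplicand[i] * mj + mcarry;
            mcarry = (uint64_t)(p >> 64);   /* the MADDH result */
            uint64_t lo = (uint64_t)p;      /* the MADD result  */

            uint64_t t = sum[i + j] + acarry;
            acarry = (t < acarry);          /* carry out of 1st add */
            uint64_t s = t + lo;
            acarry += (s < t);              /* carry out of 2nd add */
            sum[i + j] = s;
        }
        sum[n + j] += mcarry + acarry;      /* fits: partial sum is bounded */
    }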

MitchAlsup

unread,
Nov 12, 2021, 12:42:49 PM11/12/21
to
< then translate to (exactly)
ADD   Rij,Ri,Rj           // you seem to have missed this
LDD   Rmc,[Rmpc+Ri<<3]    // multiplicand[i]
LDD   Rsm,[Rsmp+Rij<<3]   // sum[i+j]
CARRY Rcm,{IO}
MUL   Rpr,Rmp,Rmc         // MAC {Rcm,Rpr},{Rmp,Rcm},Rmc
CARRY Rca,{IO}
ADD   Rsum,Rsm,Rpr        // ADC {Rca,Rsum},{Rsm,Rca},Rpr
STD   Rsum,[Rsmp+Rij<<3]  // sum[i+j]