[rfc] multi-read/write dependency matrices and associated computation unit


lkcl
Apr 13, 2020, 10:01:55 AM
apologies that this is best done using images rather than text.
i'm doing a redesign of the (augmented) 6600 engine because there
are a couple of design criteria/assumptions that do not fit our
requirements:

1. operations are only 2-in, 1-out
2. simultaneous register port read (and write) availability is guaranteed.

we require:

1. operations with up to *four* in and up to *three* out
2. sporadic availability of far less than 4 Reg-Read and 3 Reg-Write ports

here are the two associated diagrams which describe the *original*
6600 computational unit and FU-to-Regs Dependency Cell:

1. comp unit https://libre-soc.org/3d_gpu/comp_unit_req_rel.jpg
2. dep cell https://libre-soc.org/3d_gpu/dependence_cell_pending.jpg

as described here https://libre-soc.org/3d_gpu/architecture/6600scoreboard/
we found a signal missing from Mitch's book chapters, and tracked it down
from the original Thornton "Design of a Computer": Read_Release. this
is a synchronisation / acknowledgement signal for Go_Read which is directly
analogous to Req_Rel for Go_Write.

also in the dependency cell, we found that it is necessary to OR the
two "Read" Oper1 and Oper2 signals together and to AND that with the
Write_Pending Latch (top latch in diagram 2.) as shown in the wonderfully
hand-drawn orange OR gate.

thus, Read-After-Write hazard occurs if there is a Write_Pending *AND*
any Read (oper1 *OR* oper2) is requested.
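as a sanity check, that dep-cell condition can be written as a one-line boolean (plain python, purely behavioural - signal names are mine, not the diagram's labels):

```python
def raw_hazard(write_pending: bool, rd_oper1: bool, rd_oper2: bool) -> bool:
    """Dependence-cell RAW condition: the Write_Pending latch ANDed with
    the OR of the two operand-read requests (the hand-drawn orange OR gate)."""
    return write_pending and (rd_oper1 or rd_oper2)
```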


now onto the additional modifications.

3. comp unit https://libre-soc.org/3d_gpu/compunit_multi_rw.jpg
4. dep cell https://libre-soc.org/3d_gpu/dependence_cell_multi_pending.jpg

firstly, the computation unit modifications:

* multiple Go_Read signals are present, GoRD1-3
* multiple incoming operands are present, Op1-3
* multiple Go_Write signals are present, GoWR1-3
* multiple outgoing results are present, Out1-2

note that these are *NOT* necessarily 64-bit registers: they are in fact
Carry Flags because we are implementing POWER9. however (as mentioned
yesterday in the huge 250+ message discussion), as far as the Dep Matrices are
concerned you still have to treat Carry-In and Carry-Out as Read/Write
Hazard-protected *actual* Registers.

in the original 6600 comp unit diagram (1), because the "Go_Read" assumes
that *both* registers will be read (and supplied) simultaneously from
the Register File, the sequence - the Finite State Machine - is real
simple:

* ISSUE -> BUSY (latched)
* RD-REQ -> GO_RD
* WR-REQ -> GO_WR
* repeat

[aside: there is a protective "revolving door" loop where the SR latch for
each state in the FSM is guaranteed stable (never reaches "unknown") ]

in *this* diagram (3), we instead need:

* ISSUE -> BUSY (latched)
* RD-REQ1 -> GO_RD1 (may occur independent of RD2/3)
* RD-REQ2 -> GO_RD2 (may occur independent of RD1/3)
* RD-REQ3 -> GO_RD3 (may occur independent of RD1/2)
* when all 3 of GO_RD1-3 have been asserted,
ONLY THEN raise WR-REQ1-2
* WR-REQ1 -> GO_WR1 (may occur independent of WR2)
* WR-REQ2 -> GO_WR2 (may occur independent of WR1)
* when all (2) of GO_WR1-2 have been asserted,
ONLY THEN reset back to the beginning.

note the crucial difference is that read request and acknowledge (GO_RD)
are *all independent* and may occur:

* in any order
* in any combination
* all at the same time

likewise for write-request/go-write.
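a behavioural python sketch of that sequence (not the actual RTL - the class and method names are made up for illustration): read and write acks arrive individually, in any order and combination, and the write requests are raised only once every read has been acknowledged:

```python
class MultiRWCompUnit:
    """behavioural sketch (not RTL) of the modified sequencing: per-operand
    GO_RD / GO_WR acks may arrive in any order and any combination;
    WR-REQs are raised only after ALL reads are acknowledged."""
    def __init__(self, n_rd=3, n_wr=2):
        self.n_rd, self.n_wr = n_rd, n_wr
        self.busy = False
        self.rd_pending = set()   # outstanding RD-REQs
        self.wr_pending = set()   # outstanding WR-REQs

    def issue(self):
        assert not self.busy, "revolving door: no re-issue while busy"
        self.busy = True
        self.rd_pending = set(range(self.n_rd))

    def go_rd(self, i):           # GO_RDi: ack one read, any order
        self.rd_pending.discard(i)
        if self.busy and not self.rd_pending and not self.wr_pending:
            # all GO_RD1-3 seen: ONLY NOW raise WR-REQ1-2
            self.wr_pending = set(range(self.n_wr))

    def go_wr(self, i):           # GO_WRi: ack one write, any order
        if i in self.wr_pending:
            self.wr_pending.discard(i)
            if not self.wr_pending:
                self.busy = False  # all writes done: back to the beginning
```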

thus, if there is only one spare READ Register File port available
(because this particular Computation Unit is a low priority, but
the other operations need only two Regfile Ports and the Regfile
happens to be 3R1W), at least one of OP1-3 may get its operand.

thus, if we have three 2-operand operations and a 3R1W regfile:

* clock cycle 1: the first may grab 2 ports and the second grabs 1 (Oper1)
* clock cycle 2: the second grabs one more (Oper2) and the third grabs 2

compare this to the *original* 6600: if there are three 2-operand
operations outstanding, they MUST go:

* clock cycle 1: the first may grab 2 ports, NEITHER the 2nd nor 3rd proceed
* clock cycle 2: the second may grab 2 ports, 3rd may NOT proceed
* clock cycle 3: the 3rd grabs 2 ports

this is because the Comp Unit - and associated Dependency Matrices - *FORCE*
the Comp Unit to only proceed when *ALL* necessary Register Read Ports
are available (because there is only the one Go_Read signal).
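the cycle-count difference can be checked with a tiny scheduling model (an illustrative sketch, assuming greedy per-cycle port allocation):

```python
def cycles_needed(ops, n_ports, per_operand):
    """cycles to satisfy all outstanding operand reads.

    ops: per-instruction outstanding read counts, e.g. [2, 2, 2].
    per_operand=True : any spare port may serve any single operand read
                       (the multi-GoRD scheme).
    per_operand=False: an instruction proceeds only when ALL its reads
                       fit simultaneously (single Go_Read, as original)."""
    ops = list(ops)
    cycles = 0
    while any(ops):
        cycles += 1
        ports = n_ports
        for i, need in enumerate(ops):
            if need == 0:
                continue
            # grant partially (per-operand) or all-or-nothing
            if per_operand:
                grant = min(need, ports)
            else:
                grant = need if need <= ports else 0
            ops[i] -= grant
            ports -= grant
    return cycles
```

with three 2-operand ops and a 3R regfile this reproduces the sequences above: 2 cycles with per-operand grants, 3 cycles all-or-nothing.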


so my questions are:

* does the above look reasonable? both in terms of the DM changes
and CompUnit changes.

* the use of the three SR latches looks a little weird to me
(bottom right corner of (3), which is a rewrite of the middle
of the page).

it looks a little weird to have an SR Latch looped back
"onto itself". namely that when the inversion of both
WR_REQ1 and WR_REQ2 going low triggers that AND gate
(the one with the input from Q of an SR Latch), it *resets*
that very same SR-Latch, which will cause a mini "blip"
on Reset, doesn't it?

argh. that doesn't feel right. what should it be replaced with?

tia,

l.

lkcl
Apr 13, 2020, 10:26:53 AM
at 35 minutes long, please, only if you are *really* bored senseless:
https://www.youtube.com/watch?v=ohHbWRLDCfs

lkcl
Apr 13, 2020, 2:22:59 PM
(corrections...)

On Monday, April 13, 2020 at 3:01:55 PM UTC+1, lkcl wrote:

> it looks a little weird to have an SR Latch looped back
> "onto itself". namely that when the inversion of both
> WR_REQ1 and WR_REQ2 going low triggers that AND gate
> (the one with the input from Q of an SR Latch), it *resets*
> that very same SR-Latch, which will cause a mini "blip"
> on Reset, doesn't it?
>
> argh. that doesn't feel right. what should it be replaced with?

updated 3. comp unit https://libre-soc.org/3d_gpu/compunit_multi_rw.jpg

the sequence is now as follows:

* all of Req_Rd_1-3 must have gone HI followed by all of them going LOW
* the BUSY signal must be asserted HI
* this will trigger the "Write" SR Latches (middle-lower of updated
diagram) to all flip to "Q HI".

at that point, as long as shadowing is off:

* *both* Req_Wr1 and 2 will go HI
* the 3rd SR Latch will be waiting for Go_Wr1 and Go_Wr2 to be HI
* with Req_Wr1 and 2 both being HI, one (or both) of Go_Wr1 and 2 will assert
* this will *disable* (deassert) their associated SR Latch, thereby
capturing Out1 or Out2 (or both) accordingly.

however in addition to that:

* GoWrAny (which is the ORing of Go_Wr1 with Go_Wr2) will be HI
* if both GoWrite Latches go LOW then the corresponding Req_Wr1/2
signals also go LOW
* as long as the 3rd SR Latch is also "active", this in turn
activates the AND gate at the bottom, which fires the "reset"
signal right at the top, allowing the CompUnit to capture the
next incoming Operand.

note however that the 3rd SR Latch is *only* reset when the
*ISSUE* signal comes in. this fixes the problem described
above (clockless "blipping" of the "reset") because the
GoWrAny signal is not permitted to be on at the same time
as the Issue signal.
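a behavioural sketch of those three latches (python, names invented; the point being only that the reset term never fires from GoWrAny alone - the 3rd latch is cleared solely by ISSUE):

```python
class WritePhase:
    """sketch of the three SR latches: two per-write latches deassert on
    their Go_Wr (capturing Out1/Out2); the third is set once all reads
    complete and reset ONLY by ISSUE, avoiding the clockless 'blip'."""
    def __init__(self):
        self.wr_latch = [False, False]  # Q HI = write request pending
        self.phase = False              # the 3rd SR latch

    def all_reads_done(self):           # BUSY + all Req_Rd cycled: set all
        self.wr_latch = [True, True]
        self.phase = True

    def go_wr(self, i):
        self.wr_latch[i] = False        # deassert: output i captured

    @property
    def reset(self):                    # the AND gate at the bottom
        return self.phase and not any(self.wr_latch)

    def issue(self):                    # only ISSUE clears the 3rd latch
        self.phase = False
```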

this looks ok to me - did i miss anything?

l.

MitchAlsup
Apr 13, 2020, 7:53:48 PM
I suggest this is not necessary and is unleashing a quadratic term into
play. In a typical Scoreboard with a <say> condition code <or 2 or 3>,
One would wait for all the registers and all of the dependent condition
codes to be ready before launching the instruction. Leaving but a single
Go_Read per function unit.

I see that you are trying to interleave not-necessarily-sufficient read
porting at the RF and still start operand delivery. But I don't see
your chosen route as the "efficient" one.

Consider it a 2 stage problem::
a) has every operand been written into storage
b) has every operand been delivered into calculation
And I think you are trying to do both and maybe getting confused in the
process.

Once several instructions have been picked, some kind of register port
and condition code port arbiters should take control of operand delivery.
This very well may add a cycle in the pipeline.

> * multiple incoming operands are present, Op1-3
> * multiple Go_Write signals are present, GoWR1-3
> * multiple outgoing results are present, Out1-2
>
> note that these are *NOT* necessarily 64-bit registers: they are in fact
> Carry Flags because we are implementing POWER9. however (as mentioned
> yesterday in the huge 250+ discussion, as far as the Dep Matrices are
> concerned you still have to treat Carry-In and Carry-out as Read/Write
> Hazard-protected *actual* Registers)
>
> in the original 6600 comp unit diagram (1), because the "Go_Read" assumes
> that *both* registers will be read (and supplied) simultaneously from
> the Register File, the sequence - the Finite State Machine - is real
> simple:

CDC 6600 always had sufficient read and write ports as these are what the
pickers arbitrated! The fact that waiting instructions got released at
the same time was a side effect. CDC 6600 pickers could not pick anything
that did not already have sufficient resources presently available.

lkcl
Apr 14, 2020, 4:46:18 AM
On Tuesday, April 14, 2020 at 12:53:48 AM UTC+1, MitchAlsup wrote:

> I suggest this is not necessary and is unleashing a quadratic term into
> play.

i've not spotted one (yet...) this not being a guarantee that there isn't one :)

* the increase in GoRD and GoWr signals is in one dimension of the FU-Regs Matrix
* likewise in the FU-FU the GoRd and GoWr are in 1 dimension only
* read and write pending remain single bit per register

in the matrices i think we get away with it.

> In a typical Scoreboard with a <say> condition code <or 2 or 3>,
> One would wait for all the registers and all of the dependent condition
> codes to be ready before launching the instruction. Leaving but a single
> Go_Read per function unit.

i don't believe anyone tried dropping a vector frontend onto a multi-issue scoreboard though :)

> I see that you are trying to interleave not-necessarily-sufficient read
> porting at the RF and still start operand delivery.

yes, because we are seriously constrained by availability of any kind of high port count regfile SRAM design (as in: we don't have one, and aren't going to design one in time for the oct 2020 tapeout)

i was therefore going to use a 2R1W and add at least a single forwarder bus to make up to a 3rd port, this being just barely sufficient to cover the typical case of MAC or FMAC having the dest reg be one of the src regs on a sequence of (F)MACs.

the Condition Codes regfile of course being completely separate and in *addition* to those ports.


> But I don't see
> your chosen route as the "efficient" one.

one primary reason for doing this is because of the dual regfiles (operands and CCs). i am having difficulty visualising how to use the same GORD / GOWR signals to cover both regfiles.

that, and the overlaps allowing operations to "sneak in" by trickle-feeding operands using spare unused ports.

with 3R1W and three 2-operand instructions, you simply cannot start them in 2 cycles *unless* you allow one of them to grab op1 in the 1st cycle and op2 in the 2nd.

> Consider it a 2 stage problem::
> a) has every operand been written into storage
> b) has every operand been delivered into calculation
> And I think you are trying to do both and maybe getting confused in the
> process.

no i am simply separating the GORD and GOWR into per-operand signals rather than per-FunctionUnit ones.

thus each GORD (and GOWR) looks after one single Register Port rather than guarding multiple ports.

in this way, ConditionCodes gets its own dedicated GORD and GOWR signals.


> Once several instructions have been picked, some kind of register port
> and condition code port arbiters should take control of operand delivery.

the priority pickers. this is where i feel more comfortable having more fine grained port access: one GORD/GOWR per port rather than one GORD/GOWR for a *group* of ports.

more on this below

> This very well may add a cycle in the pipeline.

fine with that.

> CDC 6600 always had sufficient read and write ports as these are what the
> pickers arbitrated! The fact that waiting instructions got released at
> the same time was a side effect. CDC 6600 pickers could not pick anything
> that did not already have sufficient resources presently available.

i remember, it has actual separate regfiles (3 of them?) - A, B, X - and i think A had 5 ports, 3 read 2 write or something?

there were no 3 operand instructions: all 2 operand instrs for A used the 1st 2 ports, and all 1 operand instrs for A used the 3rd port, this being hardwired.

thus, if you have 3 1-operand instructions being dropped into a multi-issue queue in a single cycle they *must* take 3 cycles to start (not one), because they are only allowed to use that one (3rd) port.

whereas i would like to allow 2 (or 3) such operations to start in the same cycle, because each independent GORD/GOWR can route to any one of the 2 (or 3) regfile ports (3 if including the forwarding bus port)

lkcl
Apr 14, 2020, 4:47:41 AM
nuts sorry phone typing, hit send by mistake without completing the usual proofread and review.
l.

lkcl
Apr 14, 2020, 6:41:04 AM
a clarification of the context:

1. POWER has a batch of branch/condition registers and also a FP CSR.

* LR - link register
* CTR - counter
* CR0 - 3-bit condition register
* XER - i forget exactly what this is atm; i think it contains carry bits.

2. because of the vectorisation and also the reduced regfile port count we are STRATIFYING the regfile into

* HI 32 ODD reg#
* LO 32 ODD
* HI 32 EVEN
* LO 32 EVEN

and therefore 64 bit operations comprise *two* "operands" for every one in "normal" processors.

thus we get double the performance for 32 bit vector operations which is how we like it.
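a sketch of the bank selection implied by that stratification (python; the bank naming, and picking LO for the 32-bit case, are my assumptions for illustration):

```python
def bank_lanes(reg: int, width: int = 64):
    """which stratified 32-bit banks a register access touches.

    the four banks are (HI|LO) x (ODD|EVEN reg#).  a 64-bit access
    therefore counts as *two* operands for port/hazard purposes,
    a 32-bit access as one."""
    parity = 'ODD' if reg & 1 else 'EVEN'
    if width == 32:
        return [('LO', parity)]                  # one operand, one port
    return [('HI', parity), ('LO', parity)]      # 64-bit = two operands
```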


having a single GORD / GOWR cover these at the CompUnit whilst also having the same GORD/W covering multiple CondRegs is i believe just not practical or sane.

firstly, a branch operation which uses LR may use up a port that another operation needed for reading Carry. that is fine if there are actual separate GORD/GOWR signals for the CondRegfile, but in the case of a single GORD/GOWR the only way i can see how all scenarios could be covered is if the CondRegfile is massively overported (4R or even 6R, 4W minimum), and we simply cannot do that right now unless we use DFFs instead of SRAM.

even on 64 bit operations, the fact that there are separate odd-even 2R1W (maybe 3R1W) ports for the Int/FP regfile but this same trick *cannot* be done for the CondRegfile (because two operations writing to one odd and one even reg# will still write to the same ONE CarryReg for example) has me concerned.

i am not trying to convince myself that this is a "nice optimisation", i believe (so far) that it is a necessary design requirement.

on top of *that* we have VL - the Vector Length Register - which can *also* be set (and read) and thus must also be protected by GO RD/WR signalling.


Mitch, i remember vaguely that in the 66000 Computational Unit you showed me (last year?) you had 2 separate GO_READ signals (1 and 2). what was that about?

l.

lkcl
Apr 14, 2020, 9:14:53 AM
(just checked your book chapter 10, mitch: section 10.4.8, 10.4.9 and 10.4.10)
* A regfile had one read, one write
* B regfile had 4 read ports, 3 write
* X regfile had 5 read ports, 4 write.

i am only slightly freaked out (mostly in awe) that an SR NOR latch was
used to store bits in the regfile, both S and R being driven low, and that "set to 1" persists for one gate delay longer due to an AND gate being inline with S but not R. modern ASIC designers would run away screaming in terror - but only because they'd need to pay USD $100 million in full custom ASIC NREs...

:)

anyway, the point is: we simply can't have that many regfile ports (if using SRAM - DFFs would be "ok" but the power consumption would be atrocious).

hence my concern about trying to synchronise simultaneous access to both INT/FP and CondCode regfiles.

i mean, we _could_ just have one GORD/GOWR for the standard 64-bit regfile
(covering all operands regardless of whether they're 1-3 per CompUnit)
and one GORD/GOWR for the ConditionCode regfile (regardless of whether
1 or 2 are required - Link, CTR, CR0, don't mind which).

l.

lkcl
Apr 14, 2020, 8:29:05 PM
https://libre-soc.org/3d_gpu/fu_dep_cell_multi_6600.jpg

modifications to the FunctionUnit Dependency Cell, which tracks inter-FU result dependency order, require duplication of the WRWAIT cell, one each for GORD1 and GORD2, then linking the outputs with an OR gate, and likewise for GOWR1 and 2.

the output vectors (RD WAIT and WR WAIT) remain the same bit width.

thus, and this makes sense logically, we have to wait for *both* GORD signals to fire, resetting *both* SR Latches, and then and only then is the FU-FU write wait condition dropped.

previously, there was only one GORD, that indicated that *both* operands had (simultaneously) been read from the regfile.

so i believe this is just a linear increase in gate count, not quadratic (whew).
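the duplicated-WRWAIT behaviour can be modelled in a few lines (behavioural python sketch, not RTL):

```python
class WrWaitPair:
    """sketch of the duplicated WRWAIT cell: one SR latch per GO_RD
    signal, outputs ORed, so the FU-FU write-wait only drops once
    *both* reads have fired (previously one GO_RD covered both)."""
    def __init__(self):
        self.latch = [True, True]   # set on issue: both reads outstanding

    def go_rd(self, i):
        self.latch[i] = False       # GO_RDi resets its own latch only

    @property
    def wr_wait(self):
        return self.latch[0] or self.latch[1]   # the OR gate
```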

l.

lkcl
Apr 15, 2020, 9:50:34 AM
https://libre-soc.org/3d_gpu/compunit_multi_rw.jpg

argh! this is really tricky. _another_ bug found, this one ironically
making the logic much more similar to the original.

the ISSUE-GORD-GOWR "cycle" is effectively left unaltered, however
it is augmented to:

ISSUE-ACKRDs-GOALLRD-ACKWRs-GOALLWR

the "ack-all-rds" and "ack-all-wrs" phases are inserted *in between*.
previously i thought that a suite of "ack-all-rds" latches could replace
the GO-RD latch.

however just as with the GO-WR latch needing to be augmented by a suite
of "ack-all-wrs" latches, this turns out to not be the case.

l.

MitchAlsup
Apr 15, 2020, 5:42:59 PM
On Tuesday, April 14, 2020 at 8:14:53 AM UTC-5, lkcl wrote:
> (just checked your book chapter 10, mitch: section 10.4.8, 10.4.9 and 10.4.10)
> * A regfile had one read, one write
> * B regfile had 4 read ports, 3 write
> * X regfile had 5 read ports, 4 write.
>
> i am only slightly freaked out (mostly in awe) that an SR NOR latch was
> used to store bits in the regfile, both S and R being driven low, and that "set to 1" persists for one gate delay longer due to an AND gate being inline with S but not R. modern ASIC designers would run away screaming in terror - but only because they'd need to pay USD $100 million in full custom ASIC NREs...
>
> :)
>
> anyway, the point is: we simply can't have that many regfile ports (if using SRAM - DFFs would be "ok" but the power consumption would be atrocious).

Only because of having to deliver clock to each DFF.
This is what we used to do with "select" lines--remove the clock power from the DFFs which are not participating in the cycle of the moment.

MitchAlsup
Apr 15, 2020, 5:44:14 PM
On Wednesday, April 15, 2020 at 8:50:34 AM UTC-5, lkcl wrote:
> https://libre-soc.org/3d_gpu/compunit_multi_rw.jpg

You are effectively using the input latch (closing) to signal that the
calculation is ready to begin.
>
> argh! this is really tricky. _another_ bug found, this one ironically
> making the logic much more similar to the original.

First principles at all times, first principles.....

lkcl
Apr 17, 2020, 6:52:16 AM
On Wednesday, April 15, 2020 at 10:44:14 PM UTC+1, MitchAlsup wrote:
> On Wednesday, April 15, 2020 at 8:50:34 AM UTC-5, lkcl wrote:
> > https://libre-soc.org/3d_gpu/compunit_multi_rw.jpg
>
> You are effectively using the input latch (closing) to signal that the
> calculation is ready to begin.

yes. no :) using the detection of all read-requests (plural) having
(in any order at any time) been matched by corresponding go-reads (plural)
*and* that the operation input latch has been closed.

> >
> > argh! this is really tricky. _another_ bug found, this one ironically
> > making the logic much more similar to the original.
>
> First principles at all times, first principles.....

:)

i found that i couldn't precisely match the diagram. i had to cheat and
use the valid/ready signal coming from the pipelined ALU as the "progression"
indicator separating the (multi-) request_read phase and the (multi-)
request_release phase.

without that, the "read completed" phase kept re-requesting the ALU to
accept input that it had already accepted, because that latch which is
supposed to go "off" on GoWrAny simply... doesn't... because there *are*
no writes... yet.

i think these differences are down to the fact that i'm doing a
Concurrent (now multi-) Computation Unit, rather than the single-ALU-per-CU
from the original 6600.

l.

lkcl
Apr 25, 2020, 6:49:28 AM
just making sure to document various discoveries as they occur.

in the middle of a rewrite of the LDST Computation Unit, made complex by the fact that in a multi-signal version, operand 2 changes from a RD to a WR, i noticed something interesting:

* in both LD and ST, operand 1 always has the immediate added to it (AGEN)

* that there is the one GORD+RDREQ pair dedicated to that operand

* with the immediate arriving as part of the opcode, AGEN may proceed *immediately and independently* of operand 2

so if operand 2 is blocked by RaW or WaR hazards this does not prevent AGEN from proceeding, where normally, if there is a single GO_READ covering both operands, STORE is potentially held up.

if AGEN were to be computed in the same cycle (reasonable for very low clock rate systems) this would not be an advantage; however for higher clock rate systems where AGEN takes 1 clock cycle, the fine-grained control over read/write ports has an advantage.

fascinating and unexpected.

l.

MitchAlsup
Apr 25, 2020, 12:16:05 PM
Expected--once one realizes that LDs are a 3-step process it almost falls
out for free: {AGEN, D$read, LoadAlign}

lkcl
Apr 28, 2020, 2:41:47 PM
https://libre-soc.org/3d_gpu/ld_st_comp_unit.jpg

*yet another* update! this time because it took me a while to notice
that the POWER ISA can request that the AGEN'd Effective Address
be "updated" - outputted for writing to RA.

this gives four separate and distinct modes, each of which the LD/ST
Computation Unit has to cope with:

* one *or* two register-reads
* one *or* two register-writes

despite this it is relatively straightforward. the only major complexity
is the "reset" condition, which must cope with:

* optional asynchronous write to reg1 (on update)
* optional asynchronous write to reg2 (on ST)
* optional write to memory (on ST)

hence the very weird 5-input AND-gate at the bottom (WR_reset)
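a behavioural sketch of that reset term (python; each optional write contributes "done OR not-requested" to the wide AND - the grouping of signals here is my assumption, not the diagram's exact labels):

```python
def wr_reset(writes):
    """LD/ST comp-unit reset condition sketch.

    writes: (requested, done) pairs, one per optional write
    (reg1 update, reg2, memory).  reset fires only once every
    *requested* write has completed - the wide AND gate."""
    return all(done or not requested for requested, done in writes)
```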

l.

MitchAlsup
Apr 28, 2020, 4:56:18 PM
On Tuesday, April 28, 2020 at 1:41:47 PM UTC-5, lkcl wrote:
> https://libre-soc.org/3d_gpu/ld_st_comp_unit.jpg
>
> *yet another* update! this time because it took me a while to notice
> that the POWER ISA can request that the AGEN'd Effective Address can
> be "updated" - outputted for writing to RA.
>
> this gives four separate and distinct modes, each of which the LD/ST
> Computation Unit has to cope with:
>
> * one *or* two register-reads
> * one *or* two register-writes

Looks like, sooner or later, you are going to want some kind of register
file port arbiter. Probably sooner.....

lkcl
Apr 28, 2020, 6:16:36 PM
On Tuesday, April 28, 2020 at 9:56:18 PM UTC+1, MitchAlsup wrote:
> On Tuesday, April 28, 2020 at 1:41:47 PM UTC-5, lkcl wrote:
> > https://libre-soc.org/3d_gpu/ld_st_comp_unit.jpg
> >
> > *yet another* update! this time because it took me a while to notice
> > that the POWER ISA can request that the AGEN'd Effective Address can
> > be "updated" - outputted for writing to RA.
> >
> > this gives four separate and distinct modes, each of which the LD/ST
> > Computation Unit has to cope with:
> >
> > * one *or* two register-reads
> > * one *or* two register-writes
>
> Looks like, sooner or later, you are going to want some kind of register
> file port arbiter. Probably sooner.....

yep. at the moment it's just "anything marked op1 contends for rd port 1", "anything marked op2 contends for rd port 2" etc.

this keeps it very simple: no multiplexing across FU op1/2/3 to port 1/2/3

at some point that will have to change so that:

* one op1 in any FU may access rd port 1
* one op2 in any FU may access rd port 2, however
may also access rd port 1 if no op1 needs it
* one op3 in any FU may access rd port 3, however
may also access rd port 1 or 2 if no op1 or op2 needs it

this on the basis that a full suite of crossbars on every single FU is probably a bit much.

an alternative would be to make it more like the original 6600, and have say 4R or 6R and allow some of the FUs exclusive (one-to-one) access to 2-3 of the regfile read ports, whilst other FUs have exclusive access to the others:

* MUL FUs op1 contends for reg read port 1
* MUL FUs op2 contends for reg read port 2
* MUL FUs op3 contends for reg read port 3
* ADD FUs op1 contends for reg read port 4
* ADD FUs op2 contends for reg read port 5

something along those lines.
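the fallback scheme can be sketched as a per-cycle arbiter (illustrative python, not a hardware priority-picker; iteration order stands in for the picker's priority):

```python
def arbitrate_reads(requests, n_ports=3):
    """one-cycle sketch of the fallback arbitration above.

    requests: set of (fu, operand_number) read requests, operands 1..3.
    operand k prefers read port k but may fall back to a lower-numbered
    port its owner left idle.  returns {port_index: (fu, operand)}."""
    grants = {}
    for opnum in range(1, n_ports + 1):       # op1 first: it has no fallback
        for fu, op in sorted(requests):
            if op != opnum:
                continue
            # preferred port, then any lower-numbered idle port
            for port in [opnum - 1] + list(range(opnum - 1)):
                if port not in grants:
                    grants[port] = (fu, op)
                    break
    return grants
```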

i don't quite know where to start, to begin analysing what would be the best
arrangement, here, despite reading the analysis methodology in your book
chapters (the ones where you did a statistical analysis of operations
most commonly used, then targeted the port allocation deliberately at that).

l.

MitchAlsup
Apr 28, 2020, 7:04:42 PM
On Tuesday, April 28, 2020 at 5:16:36 PM UTC-5, lkcl wrote:
> On Tuesday, April 28, 2020 at 9:56:18 PM UTC+1, MitchAlsup wrote:
> > On Tuesday, April 28, 2020 at 1:41:47 PM UTC-5, lkcl wrote:
> > > https://libre-soc.org/3d_gpu/ld_st_comp_unit.jpg
> > >
> > > *yet another* update! this time because it took me a while to notice
> > > that the POWER ISA can request that the AGEN'd Effective Address can
> > > be "updated" - outputted for writing to RA.
> > >
> > > this gives four separate and distinct modes, each of which the LD/ST
> > > Computation Unit has to cope with:
> > >
> > > * one *or* two register-reads
> > > * one *or* two register-writes
> >
> > Looks like, sooner or later, you are going to want some kind of register
> > file port arbiter. Probably sooner.....
>
> yep. at the moment it's just "anything marked op1 contends for rd port 1", "anything marked op2 contents for rd port 2" etc.
>
> this keeps it very simple: no multiplexing across FU op1/2/3 to port 1/2/3
>
> at some point that will have to change so that:
>
> * one op1 in any FU may access rd port 1
> * one op2 in any FU may access rd port 2, however
> may also access rd port 1 if no op1 needs it
> * one op3 in any FU may access rd port 3, however
> may also access rd port 1 or 2 if no op1 or op2 needs it

Looks like you need some register use statistics to build a heuristic.
>
> this on the basis that a full suite of crossbars on every single FU is probably a bit much.

You should be able to get by with approximately SQRT( crossbar )
>
> an alternative would be to make it more like the original 6600, and have say 4R or 6R and allow some of the FUs exclusive (one-to-one) access to 2-3 of the regfile read ports, whilst other FUs have exclusive access to the others:

That is the "brute force" approach--sometimes the best solution.
>
> * MUL FUs op1 contends for reg read port 1
> * MUL FUs op2 contends for reg read port 2
> * MUL FUs op3 contends for reg read port 3

I suspect this port is lightly used relative to the others. You could let
STs borrow this port if it is lightly enough used.

> * ADD FUs op1 contends for reg read port 4
> * ADD FUs op2 contends for reg read port 5
>
> something along those lines.
>
> i don't quite know where to start, to begin analysing what would be the best
> arrangement, here, despite reading the analysis methodology in your book
> chapters (the ones where you did a statistical analysis of operations
> most commonly used, then targetted the port allocation deliberately at that).

Do you have ISA statistics you can get your hands upon?

>
> l.

lkcl
Apr 29, 2020, 4:54:13 AM
On Wednesday, April 29, 2020 at 12:04:42 AM UTC+1, MitchAlsup wrote:

> > this on the basis that a full suite of crossbars on every single FU is probably a bit much.
>
> You should be able to get by with approximately SQRT( crossbar )

ok. appreciated.

> > * MUL FUs op1 contends for reg read port 1
> > * MUL FUs op2 contends for reg read port 2
> > * MUL FUs op3 contends for reg read port 3
>
> I suspect this port is lightly used relative to the others. You could let
> STs borrow this port if it is lightly enough used.

i remember the analysis you described last year where a careful study
of FMAC pipeline overlaps allowed for exactly that: one spare slot
existed for ST to get a word in.

> > i don't quite know where to start, to begin analysing what would be the best
> > arrangement, here, despite reading the analysis methodology in your book
> > chapters (the ones where you did a statistical analysis of operations
> > most commonly used, then targetted the port allocation deliberately at that).
>
> Do you have ISA statistics you can get your hands upon?

i'll ask around. we have an intelligent team member (hi jacob) who knows
the 3D core loops and statistics from Vulkan.

l.

Bruce Hoult
Apr 29, 2020, 8:08:52 AM
On Wednesday, April 29, 2020 at 6:41:47 AM UTC+12, lkcl wrote:
> https://libre-soc.org/3d_gpu/ld_st_comp_unit.jpg
>
> *yet another* update! this time because it took me a while to notice
> that the POWER ISA can request that the AGEN'd Effective Address can
> be "updated" - outputted for writing to RA.

This is something generally noticed in the first ten minutes of looking at the
Power/PowerPC ISA :-)


I've been preparing a proposal to add EA writeback to stores as an extension in
the RISC-V ISA, and rS1+rS2 indexing (without writeback) for loads. Both fit
into a 2-read, 1-write pipeline and they actually complement each other very
well because many loops read several operands from memory and write back one --
so you can use the store address as the index register for the loads and update
*everything* just by writing back the store address.

current:

saxpy: ; float *dstp, *ap, *xp, *yp; int count
beqz count,2f
1:
flw a,0(ap)
flw x,0(xp)
flw y,0(yp)
fmadd.s dst,a,x,y
fsw dst,0(dstp)
addi ap,ap,4
addi xp,xp,4
addi yp,yp,4
addi dstp,dstp,4
addi count,count,-1
bnez count,1b
2: ret

proposed:

saxpy: ; float *dstp, *ap, *xp, *yp; int count
beqz count,2f
addi dstp,dstp,-4
sub ap,ap,dstp
sub xp,xp,dstp
sub yp,yp,dstp
slli count,count,2
add limit,dstp,count
1:
flw a,(ap,dstp)
flw x,(xp,dstp)
flw y,(yp,dstp)
fmadd.s dst,a,x,y
fsw dst,4(dstp)!
bne dstp,limit,1b
2: ret

It's not less code, but it's less code in the loop.

(yes I know saxpy generally has a as a constant .. this is for illustration)
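the pointer-biasing trick can be mirrored in plain python (word-indexed rather than byte-addressed, so the 4s become 1s - a sketch of the transformation, not a literal translation):

```python
def saxpy_relative(mem, dstp, ap, xp, yp, count):
    """sketch of the transformed loop: bias the three source addresses
    by -dstp once, then the loop advances only dstp (the store-with-
    update), with loads using (base, dstp) indexed addressing."""
    ap -= dstp                        # the three 'sub's before the loop
    xp -= dstp
    yp -= dstp
    limit = dstp + count              # 'add limit,dstp,count'
    while dstp != limit:
        a = mem[ap + dstp]            # flw a,(ap,dstp)
        x = mem[xp + dstp]            # flw x,(xp,dstp)
        y = mem[yp + dstp]            # flw y,(yp,dstp)
        mem[dstp] = a * x + y         # fmadd.s + fsw with update...
        dstp += 1                     # ...dstp is the only induction var
    return mem
```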

lkcl
Apr 29, 2020, 10:39:45 AM
On Wednesday, April 29, 2020 at 1:08:52 PM UTC+1, Bruce Hoult wrote:
> On Wednesday, April 29, 2020 at 6:41:47 AM UTC+12, lkcl wrote:
> > https://libre-soc.org/3d_gpu/ld_st_comp_unit.jpg
> >
> > *yet another* update! this time because it took me a while to notice
> > that the POWER ISA can request that the AGEN'd Effective Address can
> > be "updated" - outputted for writing to RA.
>
> This is something generally noticed in the first ten minutes of looking at the
> Power/PowerPC ISA :-)

sigh :) honestly, i don't know what i'm looking at until i'm half-way
through it. it's the same when i was in a choir. two hours to read
two pages of music, finger pointing to each note just like any 5-year-old
reading their first book. so i memorised the tunes, instead, and still
remember them, today.

i set a goal, despite being, how to put it diplomatically,
"not really the best person to achieve it". where most people would spot
something like this immediately, i simply... don't. with persistence and
patience i'll get there in the end. *14* rewrites - so far - of the
LDST Computation Unit diagrams (!)


> I've been preparing a proposal to add EA writeback to stores as an extension in
> the RISC-V ISA, and rS1+rS2 indexing (without writeback) for loads. Both fit
> into a 2-read, 1-write pipeline and they actually complement each other very
> well because many loops read several operands from memory and write back one --
> so you can use the store address as the index register for the loads and update
> *everything* just by writing back the store address.

interesting.

> current:
>
> saxpy: ; float *dstp, *ap, *xp, *yp; int count
> addi ap,4
> addi xp,4
> addi yp,4
> addi dstp,4

ok so this version explicitly increments all 4 pointers
(and these explicit increments become time-consuming in the
loop)

> proposed:
>
> saxpy: ; float *dstp, *ap, *xp, *yp; int count
> sub ap,ap,dstp
> sub xp,xp,dstp
> sub yp,yp,dstp

that's interesting. *subtract* the destination pointer
from the source (therefore making ap, xp and yp *relative*
to dstp)...

> flw a,(ap,dstp)
> flw x,(xp,dstp)
> flw y,(yp,dstp)

...then add dstp *back* on when doing the LDs...

> fmadd.s dst,a,x,y
> fsw dst,4(dstp)!

... and use the "ST-with-update" to increment
dstp, so that on the next loop iteration the
updated dstp is used.

neat.

hm.

if we add up the actual number of adder circuits involved inside the
loop, i believe the ST-with-update version comes out more
power-efficient, which is a good justification for using it even though
more instructions are needed overall.

first version:

* 3x flw - 3 adders involved in EA computation
* 1x fsw - 1 adder involved in EA computation
* 5x - *five* - adders involved in incrementing pointers (and counter)

a total of 9 adders

second version (loop only)

* 3x flw - 3 adders involved in indexed EA computation
* 1x fsw - 1 adder in EA-with-update computation
* semi-equivalent-to-adder in the BNE

a total of 4 (5-ish) adders.

when looking only at the ISA, it is easy to forget that the LD/ST has
to perform an ADD. Mitch's book chapters, describing an augmented OoO
6600-style engine, actually suggest (as i remember) *using* the ADD ALU
within the LD/ST Computational Unit for *actual* ADD operations, making
it dual-purpose. it's there, so why not?
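That dual-purpose idea can be sketched in C (all names here are invented for illustration -- this is a hedged model of the concept, not the actual Libre-SOC or book design): a single shared adder circuit serves both as the AGEN for loads/stores and as the ALU for plain ADD operations routed through the same Computation Unit.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical LD/ST Computation Unit with one time-shared adder. */
enum cu_op { CU_LOAD, CU_STORE, CU_ADD };

typedef struct { uint8_t mem[256]; } machine_t;

/* The single shared adder: AGEN and integer ADD are the same circuit. */
static uint64_t shared_adder(uint64_t a, uint64_t b) { return a + b; }

static uint64_t ldst_compunit(machine_t *m, enum cu_op op,
                              uint64_t ra, uint64_t imm_or_rb,
                              uint64_t store_data) {
    uint64_t sum = shared_adder(ra, imm_or_rb); /* EA *or* ALU result */
    switch (op) {
    case CU_LOAD:  return m->mem[sum % 256];    /* sum used as the EA  */
    case CU_STORE: m->mem[sum % 256] = (uint8_t)store_data;
                   return sum;                  /* EA also available for
                                                 * writeback ("update") */
    case CU_ADD:   return sum;                  /* sum *is* the result */
    }
    return 0;
}
```

The point being made in the thread: the adder must exist for EA computation anyway, so routing ADD through it costs only the mux/control, not another adder.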

l.

MitchAlsup

unread,
Apr 29, 2020, 1:39:48 PM4/29/20
to
Now, if you had properly indexed memory you could have written:
...
> 1:
> flw a,(ap+%i<<2)
> flw x,(xp+%i<<2)
> flw y,(yp+%i<<2)
> fmadd.s dst,a,x,y
> fsw dst,(dstp+%i<<2)
add %i,1
> cmp %i,count
> bnlt %i,1b
.
.
.

Bruce Hoult

unread,
Apr 29, 2020, 9:13:32 PM4/29/20
to
That's the utterly conventional way to do it, as on everything from VAX to x86
to ARM, yes. It is more convenient for assembly language programmers and needs
less setup code before the loop.

Having loads scale the index by the size of the load (only) is absolutely an
option I've been considering. It's cheap to do and doesn't need any extra
encoding space.

The problem is that adding an index in a store is NOT under consideration
because it adds an expensive new feature to the ISA in needing three source
registers. Ok, not so much a problem for FP stores, but definitely for integer
ones.

The point is that adding an index for loads (but not stores) is very cheap. And
adding EA writeback for stores (but not loads) is very cheap. And the asymmetry
between those two has unexpected synergy -- as shown in my code example -- IFF
the index for the loads is *not* scaled.

Of course you can easily argue that ...

saxpy: ; float *dstp, *ap, *xp, *yp; int count
beqz count,2f
addi dstp,dstp,-4
li i,0
1:
flw a,(ap,i<<2)
flw x,(xp,i<<2)
flw y,(yp,i<<2)
fmadd.s dst,a,x,y
fsw dst,4(dstp)!
addi i,i,1
bne i,count,1b
2: ret

.. is better than ..

saxpy: ; float *dstp, *ap, *xp, *yp; int count
beqz count,2f
addi dstp,dstp,-4
sub ap,ap,dstp
sub xp,xp,dstp
sub yp,yp,dstp
slli count,count,2
add limit,dstp,count
1:
flw a,(ap,dstp)
flw x,(xp,dstp)
flw y,(yp,dstp)
fmadd.s dst,a,x,y
fsw dst,4(dstp)!
bne dstp,limit,1b
2: ret

... because of the smaller setup code, despite the one extra instruction in the
loop. It's certainly much better than the version with five bump instructions
in the loop.

It also makes it possible for the dst, a, x, y arrays to have all completely
different element sizes.


PowerPC is an interesting case. It has both indexed loads and stores (though
not scaled), and (as Luke just discovered) EA writeback for both loads and
stores.

Given the C code:

void saxpy(float *dstp, float *ap, float *xp, float *yp, int count) {
for (int i = 0; i<count; ++i){
float a = ap[i];
float x = xp[i];
float y = yp[i];
float dst = a * x + y;
dstp[i] = dst;
}
}

With -O2 PowerPC gcc 7.5.0 produces:

00000000 <saxpy>:
0: 94 21 ff f0 stwu r1,-16(r1)
4: 2c 07 00 00 cmpwi r7,0
8: 40 81 00 50 ble 58 <saxpy+0x58>
c: 54 e9 10 3a rlwinm r9,r7,2,0,29
10: 38 84 ff fc addi r4,r4,-4
14: 39 29 ff fc addi r9,r9,-4
18: 38 a5 ff fc addi r5,r5,-4
1c: 55 29 f0 be rlwinm r9,r9,30,2,31
20: 38 c6 ff fc addi r6,r6,-4
24: 39 29 00 01 addi r9,r9,1
28: 38 63 ff fc addi r3,r3,-4
2c: 7d 29 03 a6 mtctr r9
30: 60 00 00 00 nop
34: 60 00 00 00 nop
38: 60 00 00 00 nop
3c: 60 00 00 00 nop
40: c4 04 00 04 lfsu f0,4(r4)
44: c5 65 00 04 lfsu f11,4(r5)
48: c5 86 00 04 lfsu f12,4(r6)
4c: ec 00 62 fa fmadds f0,f0,f11,f12
50: d4 03 00 04 stfsu f0,4(r3)
54: 42 00 ff ec bdnz 40 <saxpy+0x40>
58: 38 21 00 10 addi r1,r1,16
5c: 4e 80 00 20 blr

For speed, gcc prefers the EA-writeback form, despite the huge loop setup --
and even NOP-pads the loop for alignment.

With -Os:

00000000 <saxpy>:
0: 94 21 ff f0 stwu r1,-16(r1)
4: 2f 87 00 00 cmpwi cr7,r7,0
8: 39 20 00 00 li r9,0
c: 39 47 00 01 addi r10,r7,1
10: 40 9c 00 08 bge cr7,18 <saxpy+0x18>
14: 39 40 00 01 li r10,1
18: 2c 0a 00 01 cmpwi r10,1
1c: 39 4a ff ff addi r10,r10,-1
20: 40 82 00 0c bne 2c <saxpy+0x2c>
24: 38 21 00 10 addi r1,r1,16
28: 4e 80 00 20 blr
2c: 7c 04 4c 2e lfsx f0,r4,r9
30: 7d 65 4c 2e lfsx f11,r5,r9
34: 7d 86 4c 2e lfsx f12,r6,r9
38: ec 00 62 fa fmadds f0,f0,f11,f12
3c: 7c 03 4d 2e stfsx f0,r3,r9
40: 39 29 00 04 addi r9,r9,4
44: 4b ff ff d4 b 18 <saxpy+0x18>

Size optimization prefers the indexed form. I don't know why speed optimization
doesn't as well. (ignoring the speed pessimising loop structure here)


RISC-V gcc 9.2.0 makes fake indexed instructions using an add&load and bumping
a single counter rather than bumping multiple pointers. It doesn't make any
difference to the overall speed or code size.

0000000000000000 <saxpy>:
0: 02e05a63 blez a4,34 <.L1>
4: 00271313 slli t1,a4,0x2
8: 4781 li a5,0

000000000000000a <.L3>:
a: 00f68733 add a4,a3,a5
e: 00f588b3 add a7,a1,a5
12: 00f60833 add a6,a2,a5
16: 00072707 flw fa4,0(a4)
1a: 0008a787 flw fa5,0(a7)
1e: 00082687 flw fa3,0(a6)
22: 00f50733 add a4,a0,a5
26: 0791 addi a5,a5,4
28: 70d7f7c3 fmadd.s fa5,fa5,fa3,fa4
2c: 00f72027 fsw fa5,0(a4)
30: fcf31de3 bne t1,a5,a <.L3>

0000000000000034 <.L1>:
34: 8082 ret

MitchAlsup

unread,
Apr 29, 2020, 9:46:51 PM4/29/20
to
a) the loads will smell a lot more like stores (i.e., easier for the
compiler).

b) reading the store data port can be done at the point in the pipeline
where you would have been writing the loaded result. So the port is
there, you just have to figure out how to change it from WRITE to READ.

c) this is all explained in the HP patent on their store pipeline
(circa 1986-ish) and is already public domain.

lkcl

unread,
Apr 30, 2020, 6:52:43 AM4/30/20
to
On Thursday, April 30, 2020 at 2:13:32 AM UTC+1, Bruce Hoult wrote:

> The problem is that adding an index in a store is NOT under consideration
> because it adds an expensive new feature to the ISA in needing three source
> registers.

examining the opcodes here https://libre-soc.org/openpower/isatables/
POWER dedicates *fourteen* of the precious 6-bit major opcodes to
"16-bit immediate" versions, however all the 3-source ones are minor
opcodes under 31-major. there are also another 8 major opcodes for
FP load-with-immediate, and again, all the 3-source ones are under
31-major.

they get away with it because (as Mitch mentioned), one of the sources
is also a destination (two dests for the LDs). consequently, it's not
necessary to use a 4-operand (3-source, 1-dest) I-format.

RISC-V i believe has *one* free, unused major 32-bit opcode left reserved
for official extensions that could be used for 4-operand instructions?
had more major 32-bit opcodes been consumed, it would have put pressure
on the rest of the design, making other assembly-code areas sub-optimal.

seems to be a hell of a tricky balance.

i don't know if it's something that's unique to scoreboard-style OoO
engines, or whether it's down to the fact that we're splitting out individual
{go_read+read_req} / {go_write+write_req} signals for *each* src/dest,
allocating them a separate and distinct port rather than being forced
to wait until all 3 read-ports are simultaneously free, or forced to wait
until all 2 write-ports are free [that's if you *have* two write-ports
on the regfile].

with the multi-arbitration signals, we're fine with a 3-source 2-dest
instruction, even when one of those sources is also used as a destination.
the Dependency Matrices will go "ok you want RS to be a read *and* write
hazard, that's fine", and if there is contention for the regfile read-ports
and write-ports (only one write port available despite the op being 2-dest)
that's fine too: completion will take 2 cycles rather than 1 before
de-asserting the CompUnit's BUSY signal.

that all drops out "for free" from that decision to go with multi-arbitration.
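The "completion takes 2 cycles rather than 1" behaviour can be sketched in C (a hedged toy model, names invented -- not the actual Dependency Matrix logic): each destination has its own write-request latch, and with W regfile write ports and D pending destinations, BUSY de-asserts after ceil(D/W) grant cycles instead of the unit stalling until all ports are simultaneously free.

```c
#include <assert.h>

/* Toy model of per-destination write arbitration in a CompUnit:
 * each cycle up to write_ports pending destinations are granted a
 * port and drop their request latches; the unit stays BUSY until
 * every destination has been written. */
static int cycles_to_retire(int pending_dests, int write_ports) {
    int cycles = 0;
    while (pending_dests > 0) {
        int granted = pending_dests < write_ports ? pending_dests
                                                  : write_ports;
        pending_dests -= granted; /* granted dests drop their latches */
        cycles++;
    }
    return cycles; /* BUSY de-asserts after the final grant */
}
```

So a 2-dest op with only one write port retires in 2 cycles, and with two ports in 1 -- no stall waiting for both ports at once.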

ISA opcode pressure aside, and aside from spotting the above (needing two
simultaneous regfile write ports) why is this not challenging? what is it
about other microarchitectures that makes 3-source 2-dest operations
difficult to consider including?

l.

Bruce Hoult

unread,
Apr 30, 2020, 8:53:04 AM4/30/20
to
On Thursday, April 30, 2020 at 10:52:43 PM UTC+12, lkcl wrote:
> i don't know if it's something that's unique to scoreboard-style OoO
> engines, or whether it's down to the fact that we're splitting out individual
> {go_read+read_req} / {go_write+write_req} signals for *each* src/dest,
> allocating them a separate and distinct port rather than being forced
> to wait until all 3 read-ports are simultaneously free, or forced to wait
> until all 2 write-ports are free [that's if you *have* two write-ports
> on the regfile].
>
> with the multi-arbitration signals, we're fine with a 3-source 2-dest
> instruction, even when one of those sources is also used as a destination.
> the Dependency Matrices will go "ok you want RS to be a read *and* write
> hazard, that's fine", and if there is contention for the regfile read-ports
> and write-ports (only one write port available despite the op being 2-dest)
> that's fine too: completion will take 2 cycles rather than 1 before
> de-asserting the CompUnit's BUSY signal.
>
> that all drops out "for free" from that decision to go with multi-arbitration.
>
> ISA opcode pressure aside, and aside from spotting the above (needing two
> simultaneous regfile write ports) why is this not challenging? what is it
> about other microarchitectures that makes 3-source 2-dest operations
> difficult to consider including?

Sure, if every implementation of the ISA is OoO then you can grab access to
ports when a free slot comes up. That doesn't work if you also want to support
simple in-order implementations and don't want them to stall a lot. Or you can
provide enough expensive ports for the hungriest instruction and watch them go
to waste most of the time.

FP is a special case because FMA makes up such a high proportion of the
instructions that the extra port is worth providing.

And there really are simpler implementations that are popular. Both SiFive and
PULP offer 2-stage pipeline cores without caches or branch prediction that are
1 CPI except taken branches are 2. The "Bumblebee" core in the mass market
Gigadevice GD32VF103 chip (seen in e.g. the $4.90 "Longan Nano", complete with
plastic case and LCD screen) is also cacheless and without branch predictor and
eats 3 cycles for taken branches.

And then there is "Rudi-RV32" done by a friend of mine in Christchurch just
noodling around over the Xmas/NY break. It has *one* pipe stage -- asynchronous
logic from PC to opcode fetch to decode to register access to ALU with new PC
value, new register value, memory write all sitting ready to latch at the next
clock cycle. It does 85 MHz in an Artix-7 FPGA! Most five-stage 1 CPI cores
don't do any better than 100 MHz, 125 tops, on those FPGAs.

Ivan Godard

unread,
Apr 30, 2020, 9:06:58 AM4/30/20
to
I'm impressed. Ivan Sutherland has been bugging me for several years
about an asynch Mill - would your friend be interested? The way the belt
works is a natural for asynch.

Bruce Hoult

unread,
Apr 30, 2020, 9:54:46 AM4/30/20
to
On Friday, May 1, 2020 at 1:06:58 AM UTC+12, Ivan Godard wrote:
> I'm impressed. Ivan Sutherland has been bugging me for several years
> about an asynch Mill - would your friend be interested? The way the belt
> works is a natural for asynch.

Michael has a job (last I heard anyway) but I'm sure everyone can be interested
by a sufficiently interesting opportunity. I happen to know relocation is out of
the question for him, for family reasons.

Niklas Holsti

unread,
Apr 30, 2020, 11:07:06 AM4/30/20
to
On 2020-04-30 16:06, Ivan Godard wrote:

> I'm impressed. Ivan Sutherland has been bugging me for several years
> about an asynch Mill - would your friend be interested? The way the belt
> works is a natural for asynch.

In asynch, won't the faster FUs have to wait for the slowest FU to
finish, before the overall system can declare a new cycle and new belt
numbering?

"Wideness" in the Mill sense seems to diminish the potential speed
benefit of asynch, it seems to me. But then I'm very ignorant of the issues.

--
Niklas Holsti
Tidorum Ltd
niklas holsti tidorum fi
. @ .

MitchAlsup

unread,
Apr 30, 2020, 12:00:34 PM4/30/20
to
On Thursday, April 30, 2020 at 5:52:43 AM UTC-5, lkcl wrote:
> On Thursday, April 30, 2020 at 2:13:32 AM UTC+1, Bruce Hoult wrote:
>
> > The problem is that adding an index in a store is NOT under consideration
> > because it adds an expensive new feature to the ISA in needing three source
> > registers.
>
> examining the opcodes here https://libre-soc.org/openpower/isatables/
> POWER dedicates *fourteen* of the precious 6-bit major opcodes to
> "16-bit immediate" versions, however all the 3-source ones are minor
> opcodes under 31-major. there are also another 8 major opcodes for
> FP load-with-immediate, and again, all the 3-source ones are under
> 31-major.
>
> they get away with it because (as Mitch mentioned), one of the sources
> is also a destination (two dests for the LDs). consequently, it's not
> necessary to use a 4-operand (3-source, 1-dest) I-format.

I would simply call this 3-operand, 1-result rather than 4-operand.
>
> RISC-V i believe has *one* free, unused major 32-bit opcode left reserved
> for official extensions that could be used for 4-operand?

This is going to hurt later.

> if there had
> been more major 32-bit opcodes available, this would have put pressure
> on the rest of the design, making other assembly-code areas sub-optimal.

My 66000 has 14 major opcodes left and 6 reserved in perpetuity.
>
> seems to be a hell of a tricky balance.

It is, and it takes years of balancing these features to achieve a good one.
>
> i don't know if it's something that's unique to scoreboard-style OoO
> engines, or whether it's down to the fact that we're splitting out individual
> {go_read+read_req} / {go_write+write_req} signals for *each* src/dest,
> allocating them a separate and distinct port rather than being forced
> to wait until all 3 read-ports are simultaneously free, or forced to wait
> until all 2 write-ports are free [that's if you *have* two write-ports
> on the regfile].

Since you separated the signals, can I suggest placing a hopper between
reading register specifiers out of FUs and the RF ports. Each cycle a
number of specifiers are dropped into the hopper, and each cycle as many
RF read ports as you have are read and then delivered to instructions
at the calculation units. Then each calculation unit picks instructions
into execution.
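The hopper can be sketched in C (a hedged toy model, all names invented): FU read requests are dropped into a FIFO of register specifiers, and each clock at most NPORTS of them are serviced by the regfile's read ports and delivered back.

```c
#include <assert.h>

#define NPORTS 2      /* regfile read ports available per cycle */
#define HOPPER_MAX 16 /* FIFO depth (toy value)                 */

/* FIFO of register specifiers between the FUs and the RF ports. */
typedef struct { int regno[HOPPER_MAX]; int head, tail; } hopper_t;

/* An FU drops a register specifier into the hopper. */
static void hopper_drop(hopper_t *h, int regno) {
    h->regno[h->tail++ % HOPPER_MAX] = regno;
}

/* One clock: read up to NPORTS specifiers' registers out of the
 * regfile rf[] and deliver the values; returns how many were read. */
static int hopper_cycle(hopper_t *h, const int rf[32], int out[NPORTS]) {
    int n = 0;
    while (n < NPORTS && h->head != h->tail)
        out[n++] = rf[h->regno[h->head++ % HOPPER_MAX]];
    return n;
}
```

The effect: operand fetch is decoupled from issue, so an instruction needing three reads simply occupies hopper slots for however many cycles it takes, rather than stalling until three ports are simultaneously free.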

MitchAlsup

unread,
Apr 30, 2020, 12:04:46 PM4/30/20
to
On Thursday, April 30, 2020 at 10:07:06 AM UTC-5, Niklas Holsti wrote:
> On 2020-04-30 16:06, Ivan Godard wrote:
>
> > I'm impressed. Ivan Sutherland has been bugging me for several years
> > about an asynch Mill - would your friend be interested? The way the belt
> > works is a natural for asynch.
>
> In asynch, won't the faster FUs have to wait for the slowest FU to
> finish, before the overall system can declare a new cycle and new belt
> numbering?

The problem with asynch was testing! What do you do if one CPU has its
adder faster than another die on the same chip, such that the timing of some
canned test case takes different numbers of pipelined cycles? Is it
OK-enough to "get the right answer", or does it require not only
getting the right answer but also taking the "right amount of time"?
Synch CPUs are always the latter.

Ivan Godard

unread,
Apr 30, 2020, 5:04:03 PM4/30/20