
[rfc] multi-read/write dependency matrices and associated computation unit


lkcl

Apr 13, 2020, 10:01:55 AM
apologies that this is best done using images rather than text.
i'm doing a redesign of the (augmented) 6600 engine because there
are a couple of design criteria/assumptions that do not fit our
requirements:

1. operations are only 2-in, 1-out
2. simultaneous register port read (and write) availability is guaranteed.

we require:

1. operations with up to *four* in and up to *three* out
2. sporadic availability of far fewer than 4 Reg-Read ports and 3 Reg-Write ports

here are the two associated diagrams which describe the *original*
6600 computational unit and FU-to-Regs Dependency Cell:

1. comp unit https://libre-soc.org/3d_gpu/comp_unit_req_rel.jpg
2. dep cell https://libre-soc.org/3d_gpu/dependence_cell_pending.jpg

as described here https://libre-soc.org/3d_gpu/architecture/6600scoreboard/
we found a signal missing from Mitch's book chapters, and tracked it down
from the original Thornton "Design of a Computer": Read_Release. this
is a synchronisation / acknowledgement signal for Go_Read which is directly
analogous to Req_Rel for Go_Write.

also in the dependency cell, we found that it is necessary to OR the
two "Read" Oper1 and Oper2 signals together and to AND that with the
Write_Pending Latch (top latch in diagram 2.) as shown in the wonderfully
hand-drawn orange OR gate.

thus, Read-After-Write hazard occurs if there is a Write_Pending *AND*
any Read (oper1 *OR* oper2) is requested.
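
(as a quick sanity check, that hazard term reduces to a couple of gates - a
python sketch, with signal names mine rather than taken from the diagram:)

    def raw_hazard(write_pending, read_oper1, read_oper2):
        # the hand-drawn orange OR gate: either operand wanting a read...
        any_read = read_oper1 or read_oper2
        # ...ANDed with the Write_Pending latch gives the Read-After-Write hazard
        return write_pending and any_read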


now onto the additional modifications.

3. comp unit https://libre-soc.org/3d_gpu/compunit_multi_rw.jpg
4. dep cell https://libre-soc.org/3d_gpu/dependence_cell_multi_pending.jpg

firstly, the computation unit modifications:

* multiple Go_Read signals are present, GoRD1-3
* multiple incoming operands are present, Op1-3
* multiple Go_Write signals are present, GoWR1-3
* multiple outgoing results are present, Out1-2

note that these are *NOT* necessarily 64-bit registers: they are in fact
Carry Flags because we are implementing POWER9. however (as mentioned
yesterday in the huge 250+ message discussion) as far as the Dep Matrices are
concerned you still have to treat Carry-In and Carry-Out as Read/Write
Hazard-protected *actual* Registers.

in the original 6600 comp unit diagram (1), because the "Go_Read" assumes
that *both* registers will be read (and supplied) simultaneously from
the Register File, the sequence - the Finite State Machine - is real
simple:

* ISSUE -> BUSY (latched)
* RD-REQ -> GO_RD
* WR-REQ -> GO_WR
* repeat

[aside: there is a protective "revolving door" loop where the SR latch for
each state in the FSM is guaranteed stable (never reaches "unknown") ]

in *this* diagram (3), we instead need:

* ISSUE -> BUSY (latched)
* RD-REQ1 -> GO_RD1 (may occur independent of RD2/3)
* RD-REQ2 -> GO_RD2 (may occur independent of RD1/3)
* RD-REQ3 -> GO_RD3 (may occur independent of RD1/2)
* when all 3 of GO_RD1-3 have been asserted,
ONLY THEN raise WR-REQ1-2
* WR-REQ1 -> GO_WR1 (may occur independent of WR2)
* WR-REQ2 -> GO_WR2 (may occur independent of WR1)
* when all (2) of GO_WR1-2 have been asserted,
ONLY THEN reset back to the beginning.

note the crucial difference is that read request and acknowledge (GO_RD)
are *all independent* and may occur:

* in any order
* in any combination
* all at the same time

likewise for write-request/go-write.
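
a minimal python sketch of that per-operand handshaking follows - names are
mine, and the real hardware is SR latches rather than booleans; this just
captures the "all reads individually acked before any write request" rule:

    class MultiRWCompUnit:
        # one "GO_RD seen" flag per operand, one "GO_WR seen" flag per result
        def __init__(self, n_rd=3, n_wr=2):
            self.rd_done = [False] * n_rd
            self.wr_done = [False] * n_wr

        def go_rd(self, i):       # GO_RDi may arrive in any order/combination
            self.rd_done[i] = True

        def wr_req(self):         # WR-REQ1..n raised only once ALL reads acked
            return all(self.rd_done)

        def go_wr(self, i):       # GO_WRi likewise independent
            self.wr_done[i] = True

        def done(self):           # reset to idle only once ALL writes acked
            return all(self.wr_done)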

thus, if there is only one spare READ Register File port available
(because this particular Computation Unit is low priority, but
the other operations need only two Regfile Ports and the Regfile
happens to be 3R1W), at least one of OP1-3 may still get its operand.

thus, if we have three 2-operand operations and a 3R1W regfile:

* clock cycle 1: the first may grab 2 ports and the second grabs 1 (Oper1)
* clock cycle 2: the second grabs one more (Oper2) and the third grabs 2

compare this to the *original* 6600: if there are three 2-operand
operations outstanding, they MUST go:

* clock cycle 1: the first may grab 2 ports, NEITHER the 2nd nor 3rd proceed
* clock cycle 2: the second may grab 2 ports, 3rd may NOT proceed
* clock cycle 3: the 3rd grabs 2 ports

this because the Comp Unit - and associated Dependency Matrices - *FORCE*
the Comp Unit to only proceed when *ALL* necessary Register Read Ports
are available (because there is only the one Go_Read signal).
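
(a throwaway python check that reproduces the 2-vs-3 cycle counts above, for
three 2-operand ops against 3 read ports - purely illustrative:)

    def read_cycles(ops, ports=3, all_or_nothing=True):
        # ops: number of outstanding register reads per operation
        remaining = list(ops)
        n = 0
        while any(remaining):
            free = ports
            for i, need in enumerate(remaining):
                if need == 0:
                    continue
                if all_or_nothing:
                    # original 6600: single Go_Read, all operands or nothing
                    if free >= need:
                        free -= need
                        remaining[i] = 0
                else:
                    # per-operand Go_Read: grab whatever ports are spare
                    take = min(free, need)
                    free -= take
                    remaining[i] -= take
            n += 1
        return n

    print(read_cycles([2, 2, 2], all_or_nothing=True))   # 3 cycles
    print(read_cycles([2, 2, 2], all_or_nothing=False))  # 2 cycles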


so my questions are:

* does the above look reasonable? both in terms of the DM changes
and CompUnit changes.

* the use of the three SR latches looks a little weird to me
(bottom right corner of (3), which is a rewrite of the middle
of the page).

it looks a little weird to have an SR Latch looped back
"onto itself". namely that when the inversion of both
WR_REQ1 and WR_REQ2 going low triggers that AND gate
(the one with the input from Q of an SR Latch), it *resets*
that very same SR-Latch, which will cause a mini "blip"
on Reset, doesn't it?

argh. that doesn't feel right. what should it be replaced with?

tia,

l.

lkcl

Apr 13, 2020, 10:26:53 AM
at 35 minutes long, please, only if you are *really* bored senseless:
https://www.youtube.com/watch?v=ohHbWRLDCfs

lkcl

Apr 13, 2020, 2:22:59 PM
(corrections...)

On Monday, April 13, 2020 at 3:01:55 PM UTC+1, lkcl wrote:

> it looks a little weird to have an SR Latch looped back
> "onto itself". namely that when the inversion of both
> WR_REQ1 and WR_REQ2 going low triggers that AND gate
> (the one with the input from Q of an SR Latch), it *resets*
> that very same SR-Latch, which will cause a mini "blip"
> on Reset, doesn't it?
>
> argh. that doesn't feel right. what should it be replaced with?

updated 3. comp unit https://libre-soc.org/3d_gpu/compunit_multi_rw.jpg

the sequence is now as follows:

* all of Req_Rd_1-3 must have gone HI followed by all of them going LOW
* the BUSY signal must be asserted HI
* this will trigger the "Write" SR Latches (middle-lower of updated
diagram) to all flip to "Q HI".

at that point, as long as shadowing is off:

* *both* Req_Wr1 and 2 will go HI
* the 3rd SR Latch will be waiting for Go_Wr1 and Go_Wr2 to be HI
* with Req_Wr1 and 2 both being HI, one (or both) of Go_Wr1 and 2 will assert
* this will *disable* (deassert) their associated SR Latch, thereby
capturing Out1 or Out2 (or both) accordingly.

however in addition to that:

* GoWrAny (which is the ORing of Go_Wr1 with Go_Wr2) will be HI
* if both GoWrite Latches go LOW then the corresponding Req_Wr1/2
signals also go LOW
* as long as the 3rd SR Latch is also "active", this in turn
activates the AND gate at the bottom, which fires the "reset"
signal right at the top, allowing the CompUnit to capture the
next incoming Operand.

note however that the 3rd SR Latch is *only* reset when the
*ISSUE* signal comes in. this fixes the problem described
above (clockless "blipping" of the "reset") because the
GoWrAny signal is not permitted to be on at the same time
as the Issue signal.
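
(in pseudo-python terms, the essence of the fix - a plain SR latch model plus
the mutual-exclusion argument; the set/reset wiring here is simplified, see
the diagram for the real netlist:)

    class SRLatch:
        # tiny behavioural SR latch (set wins over reset, purely for illustration)
        def __init__(self):
            self.q = False

        def step(self, s, r):
            if s:
                self.q = True
            elif r:
                self.q = False
            return self.q

    # before: the latch's own Q fed the AND gate that generated its reset, so
    # the moment the write-requests dropped it reset itself - a clockless "blip".
    # after: reset comes only from ISSUE, and ISSUE is never permitted to be on
    # at the same time as GoWrAny, so Q holds steady until the next instruction.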

this looks ok to me - did i miss anything?

l.

MitchAlsup

Apr 13, 2020, 7:53:48 PM
I suggest this is not necessary and is unleashing a quadratic term into
play. In a typical Scoreboard with a <say> condition code <or 2 or 3>,
One would wait for all the registers and all of the dependent condition
codes to be ready before launching the instruction. Leaving but a single
Go_Read per function unit.

I see that you are trying to interleave not-necessarily-sufficient read
porting at the RF and still start operand delivery. But I don't see
your chosen route as the "efficient" one.

Consider it a 2 stage problem::
a) has every operand been written into storage
b) has every operand been delivered into calculation
And I think you are trying to do both and maybe getting confused in the
process.

Once several instructions have been picked, some kind of register port
and condition code port arbiters should take control of operand delivery.
This very well may add a cycle in the pipeline.

> * multiple incoming operands are present, Op1-3
> * multiple Go_Write signals are present, GoWR1-3
> * multiple outgoing results are present, Out1-2
>
> note that these are *NOT* necessarily 64-bit registers: they are in fact
> Carry Flags because we are implementing POWER9. however (as mentioned
> yesterday in the huge 250+ discussion, as far as the Dep Matrices are
> concerned you still have to treat Carry-In and Carry-out as Read/Write
> Hazard-protected *actual* Registers)
>
> in the original 6600 comp unit diagram (1), because the "Go_Read" assumes
> that *both* registers will be read (and supplied) simultaneously from
> the Register File, the sequence - the Finite State Machine - is real
> simple:

CDC 6600 always had sufficient read and write ports as these are what the
pickers arbitrated ! The fact that waiting instructions got released at
the same time was a side effect. CDC 6600 pickers could not pick anything
that did not already have sufficient resources presently available.

lkcl

Apr 14, 2020, 4:46:18 AM
On Tuesday, April 14, 2020 at 12:53:48 AM UTC+1, MitchAlsup wrote:

> I suggest this is not necessary and is unleashing a quadratic term into
> play.

i've not spotted one (yet...) this not being a guarantee that there isn't one :)

* the increase in GoRD and GoWr signals is in one dimension of the FU-Regs Matrix
* likewise in the FU-FU the GoRd and GoWr are in 1 dimension only
* read and write pending remain single bit per register

in the matrices i think we get away with it.

> In a typical Scoreboard with a <say> condition code <or 2 or 3>,
> One would wait for all the registers and all of the dependent condition
> codes to be ready before launching the instruction. Leaving but a single
> Go_Read per function unit.

i don't believe anyone tried dropping a vector frontend onto a multi-issue scoreboard though :)

> I see that you are trying to interleave not-necessarily-sufficient read
> porting at the RF and still start operand delivery.

yes, because we are seriously constrained by availability of any kind of high port count regfile SRAM design (as in: we don't have one, and aren't going to design one in time for the oct 2020 tapeout)

i was therefore going to use a 2R1W and add at least a single forwarder bus to make up to a 3rd port, this being just barely sufficient to cover the typical case of MAC or FMAC having the dest reg be one of the src regs on a sequence of (F)MACs.

the Condition Codes regfile of course being completely separate and in *addition* to those ports.


> But I don't see
> your chosen route as the "efficient" one.

one primary reason for doing this is because of the dual regfiles (operands and CCs). i am having difficulty visualising how to use the same GORD / GOWR signals to cover both regfiles.

that, and the overlaps allowing operations to "sneak in" by trickle-feeding operands using spare unused ports.

with a 3R1W regfile and 3 2-operand instructions, you simply cannot start those in 2 cycles *unless* you allow one of them to grab op1 in the 1st cycle and op2 in the 2nd.

> Consider it a 2 stage problem::
> a) has every operand been written into storage
> b) has every operand been delivered into calculation
> And I think you are trying to do both and maybe getting confused in the
> process.

no i am simply separating the GORD and GOWR into per-operand signals rather than per-FunctionUnit ones.

thus each GORD (and GOWR) looks after one single Register Port rather than guards multiple ports.

in this way, ConditionCodes gets its own dedicated GORD and GOWR signals.


> Once several instructions have been picked, some kind of register port
> and condition code port arbiters should take control of operand delivery.

the priority pickers. this is where i feel more comfortable having more fine grained port access: one GORD/GOWR per port rather than one GORD/GOWR for a *group* of ports.

more on this below

> This very well may add a cycle in the pipeline.

fine with that.

> CDC 6600 always had sufficient read and write ports as these are what the
> pickers arbitrated ! The fact that waiting instructions got released at
> the same time was a side effect. CDC 6600 pikers could not pick anything
> that did not already have sufficient resources presently available.

i remember, it has actual separate regfiles (3 of them?) A B X and i think A had 5 ports, 3 read 2 write or something?

there were no 3 operand instructions: all 2 operand instrs for A used the 1st 2 ports, and all 1 operand instrs for A used the 3rd port, this being hardwired.

thus, if you have 3 1-operand instructions being dropped into a multi-issue queue in a single cycle they *must* take 3 cycles to start (not one) because they are only allowed to use that one (3rd) port.

whereas i would like to allow 2 (or 3) such operations to start in the same cycle, because each independent GORD/GOWR can route to any one of the 2 (or 3) regfile ports (3 if including the forwarding bus port)

lkcl

Apr 14, 2020, 4:47:41 AM
nuts sorry phone typing, hit send by mistake without completing the usual proofread and review.
l.

lkcl

Apr 14, 2020, 6:41:04 AM
a clarification of the context:

1. POWER has a batch of branch/condition registers and also a FP CSR.

* LR link register
* CTR counter
* CR0 3 bit condition register
* XER i forget what this is atm i think
it contains carry bits.

2. because of the vectorisation and also the reduced regfile port count we are STRATIFYING the regfile into

* HI 32 ODD reg#
* LO 32 ODD
* HI 32 EVEN
* LO 32 EVEN

and therefore 64 bit operations comprise *two* "operands" for every one in "normal" processors.

thus we get double the performance for 32 bit vector operations which is how we like it.
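
(roughly, in python - the exact bank numbering is a guess; the point is just
that one 64-bit access becomes two 32-bit bank accesses, while a 32-bit vector
element needs only one:)

    def banks_for(regnum, width=64):
        # map an architectural register access onto the HI/LO x ODD/EVEN banks
        parity = "EVEN" if regnum % 2 == 0 else "ODD"
        if width == 64:
            return [("LO", parity), ("HI", parity)]  # two "operands" per access
        return [("LO", parity)]                      # 32-bit: a single bank

    print(banks_for(5))       # [('LO', 'ODD'), ('HI', 'ODD')]
    print(banks_for(6, 32))   # [('LO', 'EVEN')]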


having a single GORD / GOWR cover these at the CompUnit whilst also having the same GORD/W covering multiple CondRegs is i believe just not practical or sane.

firstly, a branch operation which uses LR may use up a port that another operation needed for reading Carry, which is fine if there are actual separate GORD/GOWR signals for the CondRegfile, but in the case of a single GORD/GOWR the only way i can see how all scenarios could be covered is if the CondRegfile is massively overported (4R or even 6R, 4W minimum) - and we simply cannot do that right now unless we use DFFs instead of SRAM.

even on 64 bit operations, the fact that there are separate odd-even 2R1W (maybe 3R1W) ports for the Int/FP regfile but this same trick *cannot* be done for the CondRegfile (because two operations writing to one odd and one even reg# will still write to the same ONE CarryReg for example) has me concerned.

i am not trying to convince myself that this is a "nice optimisation", i believe (so far) that it is a necessary design requirement.

on top of *that* we have VL - the Vector Length Register - which can *also* be set (and read) and thus must also be protected by GO RD/WR signalling.


Mitch I remember vaguely that in the 66000 Computational Unit you showed me (last year?) you had 2 separate GO_READ signals (1 and 2). what was that about?

l.

lkcl

Apr 14, 2020, 9:14:53 AM
(just checked your book chapter 10, mitch: section 10.4.8, 10.4.9 and 10.4.10)
* A regfile had one read, one write
* B regfile had 4 read ports, 3 write
* X regfile had 5 read ports, 4 write.

i am only slightly freaked out (mostly in awe) that an SR NOR latch was
used to store bits in the regfile, both S and R being driven low, and that "set to 1" persists for one gate delay longer due to an AND gate being inline with S but not R. modern ASIC designers would run away screaming in terror - but only because they'd need to pay USD $100 million in full custom ASIC NREs...

:)

anyway, the point is: we simply can't have that many regfile ports (if using SRAM - DFFs would be "ok" but the power consumption would be atrocious).

hence my concern about trying to synchronise simultaneous access to both INT/FP and CondCode regfiles.

i mean, we _could_ just have one GORD/GOWR for the standard 64-bit regfile
(covering all operands regardless of whether they're 1-3 per CompUnit)
and one GORD/GOWR for the ConditionCode regfile (regardless of whether
1 or 2 are required - Link, CTR, CR0, don't mind which).

l.

lkcl

Apr 14, 2020, 8:29:05 PM
https://libre-soc.org/3d_gpu/fu_dep_cell_multi_6600.jpg

modifications to the FunctionUnit Dependency Cell, which tracks inter-FU result dependency order, require duplication of the WRWAIT cell, one each for GORD1 and GORD2, then linking up the outputs with an OR gate, and likewise for GOWR1 and 2.

the output vectors (RD WAIT and WR WAIT) remain the same bit width.

thus, and this makes sense logically, we have to wait for *both* GORD signals to fire, resetting *both* SR Latches, and then and only then is the FU-FU write wait condition dropped.

previously, there was only one GORD, that indicated that *both* operands had (simultaneously) been read from the regfile.

so i believe this is just a linear increase in gate count, not quadratic (whew).

l.

lkcl

Apr 15, 2020, 9:50:34 AM
https://libre-soc.org/3d_gpu/compunit_multi_rw.jpg

argh! this is really tricky. _another_ bug found, this one ironically
making the logic much more similar to the original.

the ISSUE-GORD-GOWR "cycle" is effectively left unaltered, however
it is augmented to:

ISSUE-ACKRDs-GOALLRD-ACKWRs-GOALLWR

the "ack-all-rds" and "ack-all-wrs" phases are inserted *in between*.
previously i thought that a suite of "ack-all-rds" latches could replace
the GO-RD latch.

however just as with the GO-WR latch needing to be augmented by a suite
of "ack-all-wrs" latches, this turns out to not be the case.

l.

MitchAlsup

Apr 15, 2020, 5:42:59 PM
On Tuesday, April 14, 2020 at 8:14:53 AM UTC-5, lkcl wrote:
> (just checked your book chapter 10, mitch: section 10.4.8, 10.4.9 and 10.4.10)
> * A regfile had one read, one write
> * B regfile had 4 read ports, 3 write
> * X regfile had 5 read ports, 4 write.
>
> i am only slightly freaked out (mostly in awe) that an SR NOR latch was
> used to store bits in the regfile, both S and R being driven low, and that "set to 1" persists for one gate delay longer due to an AND gate being inline with S but not R. modern ASIC designers would run away screaming in terror - but only because they'd need to pay USD $100 million in full custom ASIC NREs...
>
> :)
>
> anyway, the point is: we simply can't have that many regfile ports (if using SRAM - DFFs would be "ok" but the power consumption would be atrocious).

Only because of having to deliver clock to each DFF.
This is what we used to do with "select" lines -- remove the clock power from the DFFs which are not participating in the cycle of the moment.

MitchAlsup

Apr 15, 2020, 5:44:14 PM
On Wednesday, April 15, 2020 at 8:50:34 AM UTC-5, lkcl wrote:
> https://libre-soc.org/3d_gpu/compunit_multi_rw.jpg

You are effectively using the input latch (closing) to signal that the
calculation is ready to begin.
>
> argh! this is really tricky. _another_ bug found, this one ironically
> making the logic much more similar to the original.

First principles at all times, first principles.....

lkcl

Apr 17, 2020, 6:52:16 AM
On Wednesday, April 15, 2020 at 10:44:14 PM UTC+1, MitchAlsup wrote:
> On Wednesday, April 15, 2020 at 8:50:34 AM UTC-5, lkcl wrote:
> > https://libre-soc.org/3d_gpu/compunit_multi_rw.jpg
>
> You are effectively using the input latch (closing) to signal that the
> calculation is ready to begin.

yes. no :) using the detection of all read-requests (plural) having
(in any order at any time) been matched by corresponding go-reads (plural)
*and* that the operation input latch has been closed.

> >
> > argh! this is really tricky. _another_ bug found, this one ironically
> > making the logic much more similar to the original.
>
> First principles at all times, first principles.....

:)

i found that i couldn't precisely match the diagram. i had to cheat and
use the valid/ready signal coming from the pipelined ALU as the "progression"
indicator separating the (multi-) request_read phase and the (multi-)
request_release phase.

without that, the "read completed" phase kept re-requesting the ALU to
accept input that it had already accepted, because that latch which is
supposed to go "off" on GoWrAny simply... doesn't... because there *are*
no writes... yet.

i think these differences are down to the fact that i'm doing a
Concurrent (now multi-) Computation Unit, rather than the single-ALU-per-CU
from the original 6600.

l.

lkcl

Apr 25, 2020, 6:49:28 AM
just making sure to document various discoveries as they occur.

in the middle of a rewrite of the LDST Computation Unit, made complex by the fact that in a multi-signal version, operand 2 changes from a RD to a WR, i noticed something interesting:

* in both LD and ST, operand 1 always has the immediate added to it (AGEN)

* that there is the one GORD+RDREQ pair dedicated to that operand

* with the immediate arriving as part of the opcode, AGEN may proceed *immediately and independently* of operand 2

so if operand 2 is blocked by RaW or WaR hazards this does not prevent AGEN from proceeding, where normally, if there is a single GO_READ covering both operands, STORE is potentially held up.

if AGEN were to be computed in the same cycle (reasonable for very low clock rate systems) this would not be an advantage, however for higher clock rate systems where AGEN takes 1 clock cycle, the fine-grained control over read/write ports has an advantage.

fascinating and unexpected.

l.

MitchAlsup

Apr 25, 2020, 12:16:05 PM
Expected -- once one realizes that LDs are a 3-step process it almost falls
out for free. {AGEN, D$read, LoadAlign}
>
> l.

lkcl

Apr 28, 2020, 2:41:47 PM
https://libre-soc.org/3d_gpu/ld_st_comp_unit.jpg

*yet another* update! this time because it took me a while to notice
that the POWER ISA can request that the AGEN'd Effective Address can
be "updated" - outputted for writing to RA.

this gives four separate and distinct modes, each of which the LD/ST
Computation Unit has to cope with:

* one *or* two register-reads
* one *or* two register-writes

despite this it is relatively straightforward. the only major complexity
is the "reset" condition, which must cope with:

* optional asynchronous write to reg1 (on update)
* optional asynchronous write to reg2 (on ST)
* optional write to memory (on ST)

hence the very weird 5-input AND-gate at the bottom (WR_reset)
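
(the shape of that WR_reset term, as a hedged python sketch - the exact five
inputs are in the diagram, this just shows the pattern of "every optional
write path is either not-needed or already done":)

    def wr_reset(needed, done):
        # needed/done: dicts keyed by write path, e.g. 'reg1_update', 'reg2', 'memory'
        # reset may fire only when every potential write is skipped or completed
        return all((not needed[p]) or done[p] for p in needed)

    # e.g. a mode needing the reg1 update and the memory write, memory not yet done:
    print(wr_reset({'reg1_update': True, 'reg2': False, 'memory': True},
                   {'reg1_update': True, 'reg2': False, 'memory': False}))  # False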

l.

MitchAlsup

Apr 28, 2020, 4:56:18 PM
On Tuesday, April 28, 2020 at 1:41:47 PM UTC-5, lkcl wrote:
> https://libre-soc.org/3d_gpu/ld_st_comp_unit.jpg
>
> *yet another* update! this time because it took me a while to notice
> that the POWER ISA can request that the AGEN'd Effective Address can
> be "updated" - outputted for writing to RA.
>
> this gives four separate and distinct modes, each of which the LD/ST
> Computation Unit has to cope with:
>
> * one *or* two register-reads
> * one *or* two register-writes

Looks like, sooner or later, you are going to want some kind of register
file port arbiter. Probably sooner.....

lkcl

Apr 28, 2020, 6:16:36 PM
On Tuesday, April 28, 2020 at 9:56:18 PM UTC+1, MitchAlsup wrote:
> On Tuesday, April 28, 2020 at 1:41:47 PM UTC-5, lkcl wrote:
> > https://libre-soc.org/3d_gpu/ld_st_comp_unit.jpg
> >
> > *yet another* update! this time because it took me a while to notice
> > that the POWER ISA can request that the AGEN'd Effective Address can
> > be "updated" - outputted for writing to RA.
> >
> > this gives four separate and distinct modes, each of which the LD/ST
> > Computation Unit has to cope with:
> >
> > * one *or* two register-reads
> > * one *or* two register-writes
>
> Looks like, sooner or later, you are going to want some kind of register
> file port arbiter. Probably sooner.....

yep. at the moment it's just "anything marked op1 contends for rd port 1", "anything marked op2 contends for rd port 2" etc.

this keeps it very simple: no multiplexing across FU op1/2/3 to port 1/2/3

at some point that will have to change so that:

* one op1 in any FU may access rd port 1
* one op2 in any FU may access rd port 2, however
may also access rd port 1 if no op1 needs it
* one op3 in any FU may access rd port 3, however
may also access rd port 1 or 2 if no op1 or op2 needs it

this on the basis that a full suite of crossbars on every single FU is probably a bit much.
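
(a rough python picture of that fallback - the real thing would be priority
pickers, not a software loop, and the port/operand numbering is assumed:)

    def allocate(requests, n_ports=3):
        # requests: (fu_name, operand_number) pairs wanting a read this cycle.
        # opN prefers its own port N but may fall back to a lower-numbered port
        # that no lower-numbered operand has claimed.
        used = [None] * n_ports
        grants = []
        for fu, opnum in sorted(requests, key=lambda r: r[1]):   # op1 claims first
            for port in range(min(opnum, n_ports) - 1, -1, -1):  # own port, then lower
                if used[port] is None:
                    used[port] = fu
                    grants.append((fu, opnum, port + 1))
                    break
        return grants

    print(allocate([("MUL0", 1), ("ADD0", 2), ("ADD1", 2)]))
    # [('MUL0', 1, 1), ('ADD0', 2, 2)]  (ADD1 must wait a cycle)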

an alternative would be to make it more like the original 6600, and have say 4R or 6R and allow some of the FUs exclusive (one-to-one) access to 2-3 of the regfile read ports, whilst other FUs have exclusive access to the others:

* MUL FUs op1 contends for reg read port 1
* MUL FUs op2 contends for reg read port 2
* MUL FUs op3 contends for reg read port 3
* ADD FUs op1 contends for reg read port 4
* ADD FUs op2 contends for reg read port 5

something along those lines.

i don't quite know where to start, to begin analysing what would be the best
arrangement, here, despite reading the analysis methodology in your book
chapters (the ones where you did a statistical analysis of operations
most commonly used, then targetted the port allocation deliberately at that).

l.

MitchAlsup

Apr 28, 2020, 7:04:42 PM
On Tuesday, April 28, 2020 at 5:16:36 PM UTC-5, lkcl wrote:
> On Tuesday, April 28, 2020 at 9:56:18 PM UTC+1, MitchAlsup wrote:
> > On Tuesday, April 28, 2020 at 1:41:47 PM UTC-5, lkcl wrote:
> > > https://libre-soc.org/3d_gpu/ld_st_comp_unit.jpg
> > >
> > > *yet another* update! this time because it took me a while to notice
> > > that the POWER ISA can request that the AGEN'd Effective Address can
> > > be "updated" - outputted for writing to RA.
> > >
> > > this gives four separate and distinct modes, each of which the LD/ST
> > > Computation Unit has to cope with:
> > >
> > > * one *or* two register-reads
> > > * one *or* two register-writes
> >
> > Looks like, sooner or later, you are going to want some kind of register
> > file port arbiter. Probably sooner.....
>
> yep. at the moment it's just "anything marked op1 contends for rd port 1", "anything marked op2 contends for rd port 2" etc.
>
> this keeps it very simple: no multiplexing across FU op1/2/3 to port 1/2/3
>
> at some point that will have to change so that:
>
> * one op1 in any FU may access rd port 1
> * one op2 in any FU may access rd port 2, however
> may also access rd port 1 if no op1 needs it
> * one op3 in any FU may access rd port 3, however
> may also access rd port 1 or 2 if no op1 or op2 needs it

Looks like you need some register use statistics to build a heuristic.
>
> this on the basis that a full suite of crossbars on every single FU is probably a bit much.

You should be able to get by with approximately SQRT( crossbar )
>
> an alternative would be to make it more like the original 6600, and have say 4R or 6R and allow some of the FUs exclusive (one-to-one) access to 2-3 of the regfile read ports, whilst other FUs have exclusive access to the others:

That is the "brute force" approach--sometimes the best solution.
>
> * MUL FUs op1 contends for reg read port 1
> * MUL FUs op2 contends for reg read port 2
> * MUL FUs op3 contends for reg read port 3

I suspect this port is lightly used relative to the others. You could let
STs borrow this port if it is lightly enough used.

> * ADD FUs op1 contends for reg read port 4
> * ADD FUs op2 contends for reg read port 5
>
> something along those lines.
>
> i don't quite know where to start, to begin analysing what would be the best
> arrangement, here, despite reading the analysis methodology in your book
> chapters (the ones where you did a statistical analysis of operations
> most commonly used, then targetted the port allocation deliberately at that).

Do you have ISA statistics you can get your hands upon?

>
> l.

lkcl

Apr 29, 2020, 4:54:13 AM
On Wednesday, April 29, 2020 at 12:04:42 AM UTC+1, MitchAlsup wrote:

> > this on the basis that a full suite of crossbars on every single FU is probably a bit much.
>
> You should be able to get by with approximately SQRT( crossbar )

ok. appreciated.

> > * MUL FUs op1 contends for reg read port 1
> > * MUL FUs op2 contends for reg read port 2
> > * MUL FUs op3 contends for reg read port 3
>
> I suspect this port is lightly used relative to the others. You could let
> STs borrow this port if it is lightly enough used.

i remember the analysis you described last year where a careful study
of FMAC pipeline overlaps allowed for exactly that: one spare slot
existed for ST to get a word in.

> > i don't quite know where to start, to begin analysing what would be the best
> > arrangement, here, despite reading the analysis methodology in your book
> > chapters (the ones where you did a statistical analysis of operations
> > most commonly used, then targetted the port allocation deliberately at that).
>
> Do you have ISA statistics you can get your hands upon?

i'll ask around. we have an intelligent team member (hii jacob) who knows
the 3D core loops and statistics from Vulkan.

l.

Bruce Hoult

Apr 29, 2020, 8:08:52 AM
On Wednesday, April 29, 2020 at 6:41:47 AM UTC+12, lkcl wrote:
> https://libre-soc.org/3d_gpu/ld_st_comp_unit.jpg
>
> *yet another* update! this time because it took me a while to notice
> that the POWER ISA can request that the AGEN'd Effective Address can
> be "updated" - outputted for writing to RA.

This is something generally noticed in the first ten minutes of looking at the
Power/PowerPC ISA :-)


I've been preparing a proposal to add EA writeback to stores as an extension in
the RISC-V ISA, and rS1+rS2 indexing (without writeback) for loads. Both fit
into a 2-read, 1-write pipeline and they actually complement each other very
well because many loops read several operands from memory and write back one --
so you can use the store address as the index register for the loads and update
*everything* just by writing back the store address.

current:

saxpy: ; float *dstp, *ap, *xp, *yp; int count
beqz count,2f
1:
flw a,(ap)
flw x,(xp)
flw y,(yp)
fmadd.s dst,a,x,y
fsw dst,(dstp)
addi ap,4
addi yp,4
addi xp,4
addi dstp,4
addi count,-1
bnez count,1b
2: ret

proposed:

saxpy: ; float *dstp, *ap, *xp, *yp; int count
beqz count,2f
addi dstp,dstp,-4
sub ap,ap,dstp
sub xp,xp,dstp
sub yp,yp,dstp
slli count,count,2
add limit,dstp,count
1:
flw a,(ap,dstp)
flw x,(xp,dstp)
flw y,(yp,dstp)
fmadd.s dst,a,x,y
fsw dst,4(dstp)!
bne dstp,limit,1b
2: ret

It's not less code, but it's less code in the loop.

(yes I know saxpy generally has a as a constant .. this is for illustration)

lkcl

Apr 29, 2020, 10:39:45 AM
On Wednesday, April 29, 2020 at 1:08:52 PM UTC+1, Bruce Hoult wrote:
> On Wednesday, April 29, 2020 at 6:41:47 AM UTC+12, lkcl wrote:
> > https://libre-soc.org/3d_gpu/ld_st_comp_unit.jpg
> >
> > *yet another* update! this time because it took me a while to notice
> > that the POWER ISA can request that the AGEN'd Effective Address can
> > be "updated" - outputted for writing to RA.
>
> This is something generally noticed in the first ten minutes of looking at the
> Power/PowerPC ISA :-)

sigh :) honestly, i don't know what i'm looking at until i'm half-way
through it. it's the same when i was in a choir. two hours to read
two pages of music, finger pointing to each note just like any 5-year-old
reading their first book. so i memorised the tunes, instead, and still
remember them, today.

i set a goal, despite being, how to put it diplomatically,
"not really the best person to achieve it". where most people would spot
something like this immediately, i simply... don't. with persistence and
patience i'll get there in the end. *14* rewrites - so far - of the
LDST Computation Unit diagrams (!)


> I've been preparing a proposal to add EA writeback to stores as an extension in
> the RISC-V ISA, and rS1+rS2 indexing (without writeback) for loads. Both fit
> into a 2-read, 1-write pipeline and they actually complement each other very
> well because many loops read several operands from memory and write back one --
> so you can use the store address as the index register for the loads and update
> *everything* just by writing back the store address.

interesting.

> current:
>
> saxpy: ; float *dstp, *ap, *xp, *yp; int count
> addi ap,4
> addi yp,4
> addi xp,4
> addi dstp,4

ok so this version explicitly increments all 4 pointers
(and these explicit increments become time-consuming in the
loop)

> proposed:
>
> saxpy: ; float *dstp, *ap, *xp, *yp; int count
> sub ap,ap,dstp
> sub xp,xp,dstp
> sub yp,yp,dstp

that's interesting. *subtract* the destination pointer
from the source (therefore making ap, xp and yp *relative*
to dstp)...

> flw a,(ap,dstp)
> flw x,(xp,dstp)
> flw y,(yp,dstp)

...then add dstp *back* on when doing the LDs...

> fmadd.s dst,a,x,y
> fsw dst,4(dstp)!

... and use the "ST-with-update" to increment
dstp, so that on the next loop iteration the
new dstp is used.

neat.

hm.

if adding up the actual number of adder circuits involved inside the
loop, i believe the ST_with-update version comes out as more
power-efficient, which is a good justification for using it even though
more instructions are needed overall.

first version:

* 3x flw - 3 adders involved in EA computation
* 1x fsw - 1 adder involved in EA computation
* 5x - *five* - adders involved in incrementing pointers (and counter)

a total of 9 adders

second version (loop only)

* 3x flw - 3 adders involved in indexed EA computation
* 1x fsw - 1 adder in EA-with-update computation
* semi-equivalent-to-adder in the BNE

a total of 4 (5-ish) adders.

when looking only at the ISA, it may be easy to forget that the LD/ST has
to perform an ADD. Mitch's book chapters, describing an augmented OoO
6600-style engine, i remember actually suggest *using* the ADD ALU within
the LD/ST Computational Unit for *actual* ADD operations, making them
dual-purpose. it's there, so why not?

l.

MitchAlsup

Apr 29, 2020, 1:39:48 PM
Now, if you had properly indexed memory you could have written:
...
> 1:
> flw a,(ap+%i<<2)
> flw x,(xp+%i<<2)
> flw y,(yp+%i<<2)
> fmadd.s dst,a,x,y
> fsw dst,(dstp+%i<<2)
add %i,1
> cmp %i,count
> bnlt %i,1b
.
.
.

Bruce Hoult

Apr 29, 2020, 9:13:32 PM
That's the utterly conventional way to do it, as on everything from VAX to x86
to ARM, yes. It is more convenient for assembly language programmers and needs
less setup code before the loop.

Having loads scale the index by the size of the load (only) is absolutely an
option I've been considering. It's cheap to do and doesn't need any extra
encoding space.

The problem is that adding an index in a store is NOT under consideration
because it adds an expensive new feature to the ISA in needing three source
registers. Ok, not so much a problem for FP stores, but definitely for integer
ones.

The point is that adding an index for loads (but not stores) is very cheap. And
adding EA writeback for stores (but not loads) is very cheap. And asymmetry
between those two have unexpected synergy -- as shown in my code example -- IFF
the index for the loads is *not* scaled.

Of course you can easily argue that ...

saxpy: ; float *dstp, *ap, *xp, *yp; int count
beqz count,2f
addi dstp,dstp,-4
li i,0
1:
flw a,(ap,i<<2)
flw x,(xp,i<<2)
flw y,(yp,i<<2)
fmadd.s dst,a,x,y
fsw dst,4(dstp)!
addi i,i,1
bne i,count,1b
2: ret

.. is better than ..

saxpy: ; float *dstp, *ap, *xp, *yp; int count
beqz count,2f
addi dstp,dstp,-4
sub ap,ap,dstp
sub xp,xp,dstp
sub yp,yp,dstp
slli count,count,2
add limit,dstp,count
1:
flw a,(ap,dstp)
flw x,(xp,dstp)
flw y,(yp,dstp)
fmadd.s dst,a,x,y
fsw dst,4(dstp)!
bne dstp,limit,1b
2: ret

... because of the smaller setup code, despite the one extra instruction in the
loop. It's certainly much better than the version with five bump instructions
in the loop.

It also makes it possible for the dst, a, x, y arrays to have all completely
different element sizes.


PowerPC is an interesting case. It has both indexed loads and stores (though
not scaled), and (as Luke just discovered) EA writeback for both loads and
stores.

Given the C code:

void saxpy(float *dstp, float *ap, float *xp, float *yp, int count) {
    for (int i = 0; i < count; ++i) {
        float a = ap[i];
        float x = xp[i];
        float y = yp[i];
        float dst = a * x + y;
        dstp[i] = dst;
    }
}

With -O2 PowerPC gcc 7.5.0 produces:

00000000 <saxpy>:
0: 94 21 ff f0 stwu r1,-16(r1)
4: 2c 07 00 00 cmpwi r7,0
8: 40 81 00 50 ble 58 <saxpy+0x58>
c: 54 e9 10 3a rlwinm r9,r7,2,0,29
10: 38 84 ff fc addi r4,r4,-4
14: 39 29 ff fc addi r9,r9,-4
18: 38 a5 ff fc addi r5,r5,-4
1c: 55 29 f0 be rlwinm r9,r9,30,2,31
20: 38 c6 ff fc addi r6,r6,-4
24: 39 29 00 01 addi r9,r9,1
28: 38 63 ff fc addi r3,r3,-4
2c: 7d 29 03 a6 mtctr r9
30: 60 00 00 00 nop
34: 60 00 00 00 nop
38: 60 00 00 00 nop
3c: 60 00 00 00 nop
40: c4 04 00 04 lfsu f0,4(r4)
44: c5 65 00 04 lfsu f11,4(r5)
48: c5 86 00 04 lfsu f12,4(r6)
4c: ec 00 62 fa fmadds f0,f0,f11,f12
50: d4 03 00 04 stfsu f0,4(r3)
54: 42 00 ff ec bdnz 40 <saxpy+0x40>
58: 38 21 00 10 addi r1,r1,16
5c: 4e 80 00 20 blr

For speed, gcc prefers the EA writeback form, despite the huge loop setup --
and even NOP pads it for alignment.

With -Os:

00000000 <saxpy>:
0: 94 21 ff f0 stwu r1,-16(r1)
4: 2f 87 00 00 cmpwi cr7,r7,0
8: 39 20 00 00 li r9,0
c: 39 47 00 01 addi r10,r7,1
10: 40 9c 00 08 bge cr7,18 <saxpy+0x18>
14: 39 40 00 01 li r10,1
18: 2c 0a 00 01 cmpwi r10,1
1c: 39 4a ff ff addi r10,r10,-1
20: 40 82 00 0c bne 2c <saxpy+0x2c>
24: 38 21 00 10 addi r1,r1,16
28: 4e 80 00 20 blr
2c: 7c 04 4c 2e lfsx f0,r4,r9
30: 7d 65 4c 2e lfsx f11,r5,r9
34: 7d 86 4c 2e lfsx f12,r6,r9
38: ec 00 62 fa fmadds f0,f0,f11,f12
3c: 7c 03 4d 2e stfsx f0,r3,r9
40: 39 29 00 04 addi r9,r9,4
44: 4b ff ff d4 b 18 <saxpy+0x18>

Size optimization prefers the indexed form. I don't know why speed optimization
doesn't as well. (ignoring the speed pessimising loop structure here)


RISC-V gcc 9.2.0 makes fake indexed instructions using an add&load and bumping
a single counter rather than bumping multiple pointers. It doesn't make any
difference to the overall speed or code size.

0000000000000000 <saxpy>:
0: 02e05a63 blez a4,34 <.L1>
4: 00271313 slli t1,a4,0x2
8: 4781 li a5,0

000000000000000a <.L3>:
a: 00f68733 add a4,a3,a5
e: 00f588b3 add a7,a1,a5
12: 00f60833 add a6,a2,a5
16: 00072707 flw fa4,0(a4)
1a: 0008a787 flw fa5,0(a7)
1e: 00082687 flw fa3,0(a6)
22: 00f50733 add a4,a0,a5
26: 0791 addi a5,a5,4
28: 70d7f7c3 fmadd.s fa5,fa5,fa3,fa4
2c: 00f72027 fsw fa5,0(a4)
30: fcf31de3 bne t1,a5,a <.L3>

0000000000000034 <.L1>:
34: 8082 ret

MitchAlsup

Apr 29, 2020, 9:46:51 PM
a) the loads will smell a lot more like stores (i.e., easier for the
compiler).

b) reading the store data port can be done at the point in the pipeline
where you would have been writing the loaded result. So the port is
there, you just have to figure out how to change it from WRITE to READ.

c) this is all explained in the HP patent on their store pipeline
(circa 1986-ish) and is already public domain.

lkcl

Apr 30, 2020, 6:52:43 AM
On Thursday, April 30, 2020 at 2:13:32 AM UTC+1, Bruce Hoult wrote:

> The problem is that adding an index in a store is NOT under consideration
> because it adds an expensive new feature to the ISA in needing three source
> registers.

examining the opcodes here https://libre-soc.org/openpower/isatables/
POWER dedicates *fourteen* of the precious 6-bit major opcodes to
"16-bit immediate" versions, however all the 3-source ones are minor
opcodes under 31-major. there are also another 8 major opcodes for
FP load-with-immediate, and again, all the 3-source ones are under
31-major.

they get away with it because (as Mitch mentioned), one of the sources
is also a destination (two dests for the LDs). consequently, it's not
necessary to use a 4-operand (3-source, 1-dest) I-format.

RISC-V i believe has *one* free, unused major 32-bit opcode left reserved
for official extensions that could be used for 4-operand? if there had
been more major 32-bit opcodes available, this would have put pressure
on the rest of the design, making other assembly-code areas sub-optimal.

seems to be a hell of a tricky balance.

i don't know if it's something that's unique to scoreboard-style OoO
engines, or whether it's down to the fact that we're splitting out individual
{go_read+read_req} / {go_write+write_req} signals for *each* src/dest,
allocating them a separate and distinct port rather than being forced
to wait until all 3 read-ports are simultaneously free, or forced to wait
until all 2 write-ports are free [that's if you *have* two write-ports
on the regfile].

with the multi-arbitration signals, we're fine with a 3-source 2-dest
instruction, even when one of those sources is also used as a destination.
the Dependency Matrices will go "ok you want RS to be a read *and* write
hazard, that's fine", and if there is contention for the regfile read-ports
and write-ports (only one write port available despite the op being 2-dest)
that's fine too: completion will take 2 cycles rather than 1 before
de-asserting the CompUnit's BUSY signal.

that all drops out "for free" from that decision to go with multi-arbitration.

ISA opcode pressure aside, and aside from spotting the above (needing two
simultaneous regfile write ports) why is this not challenging? what is it
about other microarchitectures that makes 3-source 2-dest operations
difficult to consider including?

l.

Bruce Hoult

Apr 30, 2020, 8:53:04 AM
On Thursday, April 30, 2020 at 10:52:43 PM UTC+12, lkcl wrote:
> i don't know if it's something that's unique to scoreboard-style OoO
> engines, or whether it's down to the fact that we're splitting out individual
> {go_read+read_req} / {go_write+write_req} signals for *each* src/dest,
> allocating them a separate and distinct port rather than being forced
> to wait until all 3 read-ports are simultaneously free, or forced to wait
> until all 2 write-ports are free [that's if you *have* two write-ports
> on the regfile].
>
> with the multi-arbitration signals, we're fine with a 3-source 2-dest
> instruction, even when one of those sources is also used as a destination.
> the Dependency Matrices will go "ok you want RS to be a read *and* write
> hazard, that's fine", and if there is contention for the regfile read-ports
> and write-ports (only one write port available despite the op being 2-dest)
> that's fine too: completion will take 2 cycles rather than 1 before
> de-asserting the CompUnit's BUSY signal.
>
> that all drops out "for free" from that decision to go with multi-arbitration.
>
> ISA opcode pressure aside, and aside from spotting the above (needing two
> simultaneous regfile write ports) why is this not challenging? what is it
> about other microarchitectures that makes 3-source 2-dest operations
> difficult to consider including?

Sure, if every implementation of the ISA is OoO then you can grab access to
ports when a free slot comes up. That doesn't work if you also want to support
simple in-order implementations and don't want them to stall a lot. Or you can
provide enough expensive ports for the hungriest instruction and watch them go
to waste most of the time.

FP is a special case because FMA makes up such a high proportion of the
instructions that the extra port is worth providing.

And there really are simpler implementations that are popular. Both SiFive and
PULP offer 2-stage pipeline cores without caches or branch prediction that are
1 CPI except taken branches are 2. The "Bumblebee" core in the mass market
Gigadevice GD32VF103 chip (seen in e.g. the $4.90 "Longan Nano" (complete with
plastic case and LCD screen) is also cacheless and without branch predictor and
eats 3 cycles for taken branches.

And then there is "Rudi-RV32" done by a friend of mine in Christchurch just
noodling around over the Xmas/NY break. It has *one* pipe stage -- asynchronous
logic from PC to opcode fetch to decode to register access to ALU with new PC
value, new register value, memory write all sitting ready to latch at the next
clock cycle. It does 85 MHz in Artix-7 FPGA! Most five stage 1 CPI cores don't
do any better than 100 MHz, 125 tops on those FPGAs.

Ivan Godard

Apr 30, 2020, 9:06:58 AM
I'm impressed. Ivan Sutherland has been bugging me for several years
about an asynch Mill - would your friend be interested? The way the belt
works is a natural for asynch.

Bruce Hoult

Apr 30, 2020, 9:54:46 AM
On Friday, May 1, 2020 at 1:06:58 AM UTC+12, Ivan Godard wrote:
> I'm impressed. Ivan Sutherland has been bugging me for several years
> about an asynch Mill - would your friend be interested? The way the belt
> works is a natural for asynch.

Michael has a job (last I heard anyway) but I'm sure everyone can be interested
by a sufficiently interesting opportunity. I happen to know relocation is out of
the question for him, for family reasons.

Niklas Holsti

Apr 30, 2020, 11:07:06 AM
On 2020-04-30 16:06, Ivan Godard wrote:

> I'm impressed. Ivan Sutherland has been bugging me for several years
> about an asynch Mill - would your friend be interested? The way the belt
> works is a natural for asynch.

In asynch, won't the faster FUs have to wait for the slowest FU to
finish, before the overall system can declare a new cycle and new belt
numbering?

"Wideness" in the Mill sense seems to diminish the potential speed
benefit of asynch, it seems to me. But then I'm very ignorant of the issues.

--
Niklas Holsti
Tidorum Ltd
niklas holsti tidorum fi
. @ .

MitchAlsup

Apr 30, 2020, 12:00:34 PM
On Thursday, April 30, 2020 at 5:52:43 AM UTC-5, lkcl wrote:
> On Thursday, April 30, 2020 at 2:13:32 AM UTC+1, Bruce Hoult wrote:
>
> > The problem is that adding an index in a store is NOT under consideration
> > because it adds an expensive new feature to the ISA in needing three source
> > registers.
>
> examining the opcodes here https://libre-soc.org/openpower/isatables/
> POWER dedicates *fourteen* of the precious 6-bit major opcodes to
> "16-bit immediate" versions, however all the 3-source ones are minor
> opcodes under 31-major. there are also another 8 major opcodes for
> FP load-with-immediate, and again, all the 3-source ones are under
> 31-major.
>
> they get away with it because (as Mitch mentioned), one of the sources
> is also a destination (two dests for the LDs). consequently, it's not
> necessary to use a 4-operand (3-source, 1-dest) I-format.

I would simply call this 3-operand, 1-result rather than 4-operand.
>
> RISC-V i believe has *one* free, unused major 32-bit opcode left reserved
> for official extensions that could be used for 4-operand?

This is going to hurt later.

> if there had
> been more major 32-bit opcodes available, this would have put pressure
> on the rest of the design, making other assembly-code areas sub-optimal.

My 66000 has 14 major opcodes left and 6 reserved in perpetuity.
>
> seems to be a hell of a tricky balance.

It is, and it takes years of balancing these features to achieve a good one.
>
> i don't know if it's something that's unique to scoreboard-style OoO
> engines, or whether it's down to the fact that we're splitting out individual
> {go_read+read_req} / {go_write+write_req} signals for *each* src/dest,
> allocating them a separate and distinct port rather than being forced
> to wait until all 3 read-ports are simultaneously free, or forced to wait
> until all 2 write-ports are free [that's if you *have* two write-ports
> on the regfile].

Since you separated the signals, can I suggest placing a hopper between
reading register specifiers out of FUs and the RF ports. Each cycle a
number of specifiers are dropped into the hopper, and each cycle as many
RF read ports as you have are read and then delivered to instructions
at the calculation units. Then each calculation unit picks instructions
into execution.

MitchAlsup

Apr 30, 2020, 12:04:46 PM
On Thursday, April 30, 2020 at 10:07:06 AM UTC-5, Niklas Holsti wrote:
> On 2020-04-30 16:06, Ivan Godard wrote:
>
> > I'm impressed. Ivan Sutherland has been bugging me for several years
> > about an asynch Mill - would your friend be interested? The way the belt
> > works is a natural for asynch.
>
> In asynch, won't the faster FUs have to wait for the slowest FU to
> finish, before the overall system can declare a new cycle and new belt
> numbering?

The problem with asynch was testing! What do you do if one CPU has its
adder faster than another die on the same chip, such that the timing of some
canned test case takes different numbers of pipelined cycles? Is it
OK-enough to "get the right answer", or does it require not only
"getting the right answer" but also taking the "right amount of time"?
Synch CPUs are always of the latter.

Ivan Godard

Apr 30, 2020, 5:04:03 PM
On 4/30/2020 8:07 AM, Niklas Holsti wrote:
> On 2020-04-30 16:06, Ivan Godard wrote:
>
>> I'm impressed. Ivan Sutherland has been bugging me for several years
>> about an asynch Mill - would your friend be interested? The way the
>> belt works is a natural for asynch.
>
> In asynch, won't the faster FUs have to wait for the slowest FU to
> finish, before the overall system can declare a new cycle and new belt
> numbering?
>
> "Wideness" in the Mill sense seems to diminish the potential speed
> benefit of asynch, it seems to me. But then I'm very ignorant of the
> issues.
>

Me too, but Sutherland is an expert and that's his opinion. I can see
how it fits in our decode, but I don't really understand how it removes
setup time in the crossbar like he says.

The real problem is EDM tools apparently, or rather their absence.

Bruce Hoult

Apr 30, 2020, 5:38:01 PM
I think there's some confusion. I wasn't talking about an asynchronous CPU, but about a synchronous clocked CPU that has just 1 pipeline stage not 3 or 5 or 10.

MitchAlsup

Apr 30, 2020, 5:43:39 PM
On Thursday, April 30, 2020 at 4:04:03 PM UTC-5, Ivan Godard wrote:
> On 4/30/2020 8:07 AM, Niklas Holsti wrote:
> > On 2020-04-30 16:06, Ivan Godard wrote:
> >
> >> I'm impressed. Ivan Sutherland has been bugging me for several years
> >> about an asynch Mill - would your friend be interested? The way the
> >> belt works is a natural for asynch.
> >
> > In asynch, won't the faster FUs have to wait for the slowest FU to
> > finish, before the overall system can declare a new cycle and new belt
> > numbering?
> >
> > "Wideness" in the Mill sense seems to diminish the potential speed
> > benefit of asynch, it seems to me. But then I'm very ignorant of the
> > issues.
> >

Wideness may in and of itself minimize any gain from asynch.

If you have a 1-wide pipeline and allow each stage to self clock,
you end up operating at actual gate speed rather than worst case gate
speed. PDP-6 and the first PDP-10 short clocked their ALU, the rest
of the PDP-10s went back to synch.
>
> Me too, but Sutherland is an expert and that's his opinion. I can see
> how it fits in our decode, but I don't really understand how it removes
> setup time in the crossbar like he says.

The problem with asynch is that it must suffer the later of any data
which can arrive by more than one path and this is the (THE) major
(non testing) problem with asynch. This was known as the concordance
problem.

Mill is wide enough (16+) that the concordance problem (above) will
almost never allow the machine to operate as fast as the gates, and
will operate as slow as the slowest path (anyway). So, I think asynch
will be at best problematic.
>
> The real problem is EDM tools apparently, or rather their absence.

Which does not help, either.

lkcl

Apr 30, 2020, 10:05:38 PM
On Thursday, April 30, 2020 at 5:00:34 PM UTC+1, MitchAlsup wrote:

> I would simply call this 3-operand, 1-result rather than 4-operand.

the distinction being important in that 4-operand is actually 4 separate
operands in the actual opcode (FMAC rd <- a, b, c)

> > i don't know if it's something that's unique to scoreboard-style OoO
> > engines, or whether it's down to the fact that we're splitting out individual
> > {go_read+read_req} / {go_write+write_req} signals for *each* src/dest,
> > allocating them a separate and distinct port rather than being forced
> > to wait until all 3 read-ports are simultaneously free, or forced to wait
> > until all 2 write-ports are free [that's if you *have* two write-ports
> > on the regfile].
>
> Since you separated the signals, can I suggest placing a hopper between
> reading register specifiers out of FUs and the RF ports. Each cycle a
> number of specifiers are dropped into the hopper, and each cycle as many
> RF read ports as you have are read and then delivered to instructions
> at the calculation units. Then each calculation unit picks instructions
> into execution.

*sigh* unfortunately we are a little bit hamstrung by having lost the register
binary indices, using instead the Dependency Matrix unary encoding to
directly enable the regfile row. i mean, we _could_ use the unary encoding
as the register specifier...

by "hopper" do i take it that you mean a cyclic shift register where each
bucket contains the register specifier, type of operation (R/W), and, say, the FU
number?

or, actually, if the register specifier is in unary...

oo i have an idea. it may well be the same idea.

* for all registers being requested by all FUs, merge the unary register indices
into a single "these registers need to be read" vector. i think, in fact, that
the Dependency Matrices might actually have this already (Read Vector
and Write Vector Pending)

* for reads, throw that vector at a regfile "fetcher", which, using the read vector,
simply fetches as many registers as there are available regfile ports

let's say that there are 4R regfile ports

* the 4 read results, along with a *single* bit unary encoding of their corresponding
reg number, are dropped into a cyclic 4-wide buffer (cyclic 4-wide shift reg)

* all FUs "GoRead1" are connected to port 1 of the cyclic buffer
* all FUs "GoRead2" are connected to port 2...
* ............GoRead3 ..... 3

* entries *remain* in the cyclic buffer for as long as the corresponding bit in the
Read Vector remains set, continuing to cycle through, moving from port 1 to
2, 2 to 3, 3 to 4, 4 back to 1, on each clock

* on any given clock cycle, *all* FUs may read that "broadcasted" entry from their
corresponding cyclic buffer "hopper", as long as the unary reg index matches up

* when the reg index matches up, GO_RD1/2/3 is fired at that FU.

* when the FU has read from that port, it may drop the REQ_RD1/2/3

* this results in the (usual) dropping of the unary bit from the Read Vector back at
the Dependency Matrix

* this in turn also triggers the cyclic buffer to drop the reg value, freeing up that
port for a refill opportunity.

it's by no means perfect: the cyclic buffer moving only one value at a time means
that if an FU has 3 GO_RD/REQ_RD lines, it could have to wait for up to 3 cycles
for the data to cycle round the shift-buffer... *per read-register*.

if however a concerted effort is made to ensure that a REQ_RD1 always tries to drop
the read-request into cyclic-buffer entry 1, REQ_RD2 always tries to drop into entry 2
and so on, then the reads should just pass straight through.

this does have the advantage that if there are multiple FUs waiting for the same
register, then as a "broadcast" bus system, they can in fact get the same value
simultaneously.

writes are very similar, and there is the additional advantage that the written value
can be read from the cyclic buffer by any FU that "sees" it, prior to the value landing
at the actual regfile.

thus the cyclic buffer also happens to serve double-duty as a forwarding bus. or,
more to the point, the role and distinction between "register bus" and "forwarding
bus" is now decoupled / moot.
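
(to check i'm not fooling myself, a very rough behavioural toy of that cyclic
buffer in python - one row, read side only, names invented:)

    class Hopper:
        # each slot holds (reg_unary, value) or None; slots rotate one position
        # per clock; an entry stays resident until its Read-Vector bit drops.
        def __init__(self, width=4):
            self.slots = [None] * width

        def fill(self, reads):
            # drop freshly-read (reg_unary, value) pairs into any empty slots
            for i, s in enumerate(self.slots):
                if s is None and reads:
                    self.slots[i] = reads.pop(0)

        def broadcast(self):
            # what each FU sees on its column (GO_RD1 watches slot 0, etc.)
            return list(self.slots)

        def rotate(self, read_vector):
            # clear slots whose register is no longer wanted, then shift by one
            self.slots = [s if (s and (read_vector & s[0])) else None
                          for s in self.slots]
            self.slots = self.slots[-1:] + self.slots[:-1]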

was that sort-of what you meant by "hopper"? :)

l.

lkcl

Apr 30, 2020, 10:18:43 PM
On Friday, May 1, 2020 at 3:05:38 AM UTC+1, lkcl wrote:

> * the 4 read results, along with a *single* bit unary encoding of their corresponding
> reg number, are dropped into a cyclic 4-wide buffer (cyclic 4-wide shift reg)

hypothetically, we could have multiple such buffers, in rows. the FUs have "column"
access (GO_RD1 has access to column 1, GO_RD2 access to column 2 etc.)

if one row is full, from 4R on the regfile, and other FUs need all 4 regs so no entries
in that row may be dropped on that cycle, the port-reader may fill the *second* row.

in this way there would be no lost opportunities to read from the regfile just because
some of the FUs hadn't had the value cyclically delivered to them yet.
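
(sketching that in the same python style as before: a second row behind the
first, so the regfile port-reader always has somewhere to drop what it just
fetched. NROWS / NPORTS / refill_rows are invented names, not real code:)

NROWS, NPORTS = 2, 4

def refill_rows(rows, read_pending, regfile):
    # fetch up to NPORTS pending registers per cycle; prefer row 0, but if
    # its slots are still occupied (other FUs haven't consumed them yet),
    # spill into row 1 rather than waste the regfile read opportunity
    circulating = {e[0] for row in rows for e in row if e is not None}
    fetched = 0
    for reg in sorted(read_pending - circulating):
        if fetched == NPORTS:
            break
        for row in rows:
            if None in row:
                row[row.index(None)] = (reg, regfile[reg])
                fetched += 1
                break

rows = [[None] * NPORTS for _ in range(NROWS)]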

l.

MitchAlsup

unread,
Apr 30, 2020, 10:35:01 PM4/30/20
to
On Thursday, April 30, 2020 at 9:05:38 PM UTC-5, lkcl wrote:
> On Thursday, April 30, 2020 at 5:00:34 PM UTC+1, MitchAlsup wrote:
>
> > I would simply call this 3-operand, 1-result rather than 4-operand.
>
> the distinction being important in that 4-operand is actually 4 separate
> operands in the actual opcode (FMAC rd <- a, b, c)

F
M
A
C

Rd result (writes RF)
<-
a operand (reads RF)
,
b operand (reads RF)
,
c operand (reads RF)

see the difference?

STs would have the Rd specifier read the RF and actually become an operand.

With this nomenclature there are 2 words in use, one signifying RF read
the other signifying RF write. With yours you have 3 words, one simply
being an RF access without specifying which direction. I suggest, kindly,
that more specificity using fewer words will work better.

lkcl

unread,
May 1, 2020, 6:22:45 AM5/1/20
to
On Friday, May 1, 2020 at 3:35:01 AM UTC+1, MitchAlsup wrote:

> see the difference?

apologies, i meant that in the ISA there's not enough space to specify
4 separate operands, not that the underlying hardware has a total of
4 accesses (3 read, 1 write) to the register file.

in POWER9 the "update" variants only have *three* registers specified
in the instruction, yet they perform four regfile accesses:

* ST-update: 3-read, 1-write
* LD-update: 2-read, 2-write

this by re-using one of the SRC-specified regs also as a DEST, which FMAC
does not do.

> With this nomenclature there are 2 words in use one signifying RF read
> the other signifying RF write. With your's you have 3 words, one simply
> being a RF access without specifying which direction. I suggest, kindly,
> that more specificity using fewer words will work better.

apologies. doing my best.

l.

Stephen Fuld

unread,
May 1, 2020, 1:07:52 PM5/1/20
to
On 5/1/2020 3:22 AM, lkcl wrote:
> On Friday, May 1, 2020 at 3:35:01 AM UTC+1, MitchAlsup wrote:
>
>> see the difference?
>
> apologies, i meant that in the ISA there's not enough space to specify
> 4 separate operands, not that the underlying hardware has a total of
> 4 accesses (3 read, 1 write) to the register file.
>
> in POWER9 the "update" variants only have *three* registers specified
> in the instruction, yet they perform four regfile accesses:
>
> * ST-update: 3-read, 1-write
> * LD-update: 2-read, 2-write
>
> this by re-using one of the SRC-specified regs also as a DEST, which FMAC
> does not do.

Many people have said that having one of the sources also be the
destination is "bad". Presumably this is because sometimes you want to
preserve the contents of the source register, and that requires another
instruction prior to the one in question to copy it to another register.
But in modern CPUs, isn't that extra copy actually done in the
renaming stage, and so the cost is much less?

Also, how about a compromise where the instruction has a "destination"
field of perhaps two bits that are added to one of the source register
fields to generate the destination. Since the destination is needed
later in time than the sources, this might not slow things down very
much. Various people have said this is hard for the compiler to deal
with, and I am not a compiler guy, so I would appreciate other's comments.


--
- Stephen Fuld
(e-mail address disguised to prevent spam)

MitchAlsup

unread,
May 1, 2020, 1:21:21 PM5/1/20
to
On Friday, May 1, 2020 at 12:07:52 PM UTC-5, Stephen Fuld wrote:
> On 5/1/2020 3:22 AM, lkcl wrote:
> > On Friday, May 1, 2020 at 3:35:01 AM UTC+1, MitchAlsup wrote:
> >
> >> see the difference?
> >
> > apologies, i meant that in the ISA there's not enough space to specify
> > 4 separate operands, not that the underlying hardware has a total of
> > 4 accesses (3 read, 1 write) to the register file.
> >
> > in POWER9 the "update" variants only have *three* registers specified
> > in the instruction, yet they perform four regfile accesses:
> >
> > * ST-update: 3-read, 1-write
> > * LD-update: 2-read, 2-write
> >
> > this by re-using one of the SRC-specified regs also as a DEST, which FMAC
> > does not do.
>
> Many people have said that having one of the sources also be the
> destination is "bad".

But certainly not those pushing for 16-bit subset of ISA embedded within
a rather std 32-bit RISC.

> Presumably this is because sometimes you want to
> preserve the contents of the source register, and that requires another
> instruction prior to the one in question to copy it to another register.

The reuse is rather high (Rd == Rs1) and this is what prevents x86
instructions from s**king so badly. In K9 we even went out of our way
to detect::

MOV Rx,Ry
OP Rx,<operand>

and would recode this in the Icache as:

OP Rx,Ry,<operand>

> But in modern CPUs, isn't that extra copy actually done in the
> renaming stage, and so the cost is much less?

It can be done a host of places.
>
> Also, how about a compromise where the instruction has a "destination"
> field of perhaps two bits that are added to one of the source register
> fields to generate the destination. Since the destination is needed
> later in time than the sources, this might not slow things down very
> much. Various people have said this is hard for the compiler to deal
> with, and I am not a compiler guy, so I would appreciate other's comments.

The register allocation guys in compilers will not enjoy this restriction.

lkcl

unread,
May 1, 2020, 1:25:39 PM5/1/20
to
On Thursday, April 30, 2020 at 5:00:34 PM UTC+1, MitchAlsup wrote:

> Since you separated the signals, can I suggest placing a hopper between
> reading register specifiers out of FUs and the RF ports. Each cycle a
> number of specifiers are dropped into the hopper, and each cycle as many
> RF read ports as you have are read and then delivered to instructions
> at the calculation units. Then each calculation unit picks instructions
> into execution.

i started drawing a cyclic hopper out, and i really like it (although i am
still guessing-in-the-dark without those statistics: one to solve).

* FMAC FUs will be 3-read, 1-write
* LD/ST FUs will be 3-read, 2-write (2R2W for LD, 3R1W for ST)
* ADD/SUB/etc will be 3-read, 2-write (2R1W for data regfile, 1R1W for *condition* regfile)

however we have to design our own regfile based around a 1R-or-1W SRAM cell,
grouping them together, and it would not be sensible to try to make a 2W or 3W.
therefore we have to tolerate a 3R1W or 4R1W regfile.

we decided last year that the priority was 32-bit vector operations, and to assume
that they would be in sequentially-numbered registers. therefore we can "stripe"
the regfile into four separate banks:

* HI32-Odd
* LO32-Odd
* HI32-Even
* LO32-Even

this gives us, under ideal conditions (sequential operations), effectively 12/16R4W
for 32-bit operations, 8/6R2W for 64-bit operations (2 requests in pairs to
construct a 64-bit R/W), and worst-case we have 3/4R1W, which is "tolerable" given
that we avoid being overwhelmed with multiplexers.
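
(a quick python sketch of the bank selection, just to make the idea concrete.
it assumes the even/odd split is on the register number and hi/lo on the
32-bit half; bank_for / accesses_64bit are made-up names:)

def bank_for(regnum, hi32):
    # even/odd register number picks the stripe, hi/lo picks the 32-bit half
    parity = "Odd" if (regnum & 1) else "Even"
    half = "HI32" if hi32 else "LO32"
    return f"{half}-{parity}"

def accesses_64bit(regnum):
    # a 64-bit R/W is issued as a *pair* of 32-bit requests to the same stripe
    return [bank_for(regnum, False), bank_for(regnum, True)]

print(bank_for(4, False))     # LO32-Even
print(accesses_64bit(5))      # ['LO32-Odd', 'HI32-Odd']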

regarding FunctionUnit allocation:

* Even-numbered FUs will be connected to a separate LO32 regfile bus
* Odd-numbered FUs will be connected to a separate HI32 regfile bus

64-bit operations therefore make *two* requests, require *two* Dependency Matrix
Hazard allocations, but pass the HI-LO data to the *ONE* 64-bit pipeline.

that's background.

the reason i like the cyclic register buffer/cache/hopper idea is because sometimes
you actually need registers to "cross over" between the HI and LO and the ODD and
EVEN boundary. this can be handled *at the hopper* by allowing entries at the end
of the hopper to "cross" to one of the other buffers, if GLOBAL_RD_PENDING_VECTOR
(or GLOBAL_WR_PENDING_VECTOR) indicates that one of the "lanes" needs that
register value (or needs to store it).

ideally however, to avoid the cross-over, you simply would not actually allocate an operation
to a "lane" where you knew, in advance, that the result, once created, was going to
need to cross over at the cyclic buffer. _ideally_ you would allocate the operation
to a FunctionUnit where the result would go straight through to its quadrant of the
regfile.

what is particularly nice about the cyclic buffer concept, then, is that the cross-over
between writes and reads (aka "operand forwarding"), and the cross-over between
lanes (quadrants HI/LO x ODD/EVEN) can all be handled in the same way:

allow the register data to be copied, at the end, between "lanes".

with each cell (hopper) in every part of the register cache/buffer receiving the
broadcasted GLOBAL_READ_PEND_VEC and corresponding write vector, there's
no risk of the cyclic buffer permanently ending up with a value going round and
round forever, because those pending bits are held to indicate *that* the data is
in fact wanted!

it's very much like how Token Ring used to operate. i really like it.

l.

MitchAlsup

unread,
May 1, 2020, 1:50:48 PM5/1/20
to
On Friday, May 1, 2020 at 12:25:39 PM UTC-5, lkcl wrote:
> On Thursday, April 30, 2020 at 5:00:34 PM UTC+1, MitchAlsup wrote:
>
> > Since you separated the signals, can I suggest placing a hopper between
> > reading register specifiers out of FUs and the RF ports. Each cycle a
> > number of specifiers are dropped into the hopper, and each cycle as many
> > RF read ports as you have are read and then delivered to instructions
> > at the calculation units. Then each calculation unit picks instructions
> > into execution.
>
> i started drawing a cyclic hopper out, and i really like it (although i am
> still guessing-in-the-dark without those statistics: one to solve).
>
> * FMAC FUs will be 3-read, 1-write
> * LD/ST FUs will be 3-read, 2-write (2R2W for LD, 3R1W for ST)
> * ADD/SUB/etc will be 3-read, 2-write (2R1W for data regfile, 1R1W for *condition* regfile)
>
> however we have to design our own regfile based around a 1R-or-1W SRAM cell,
> grouping them together, and it would not be sensible to try to make a 2W or 3W.
> therefore we have to tolerate a 3R1W or 4R1W regfile.

One can interleave in space or in time.

Using 3 SRAM macros to provide 3R1W is an interleave in space.

One can also interleave in time::

Say you have a GPU and in this GPU each instruction performs 32 units of
work and you allocate 8 lanes of calculation, so the 32 units of work take
4 beats to pipeline through the CUs. You can assign 3 of these beats to
reading the RF and 1 beat to writing the RF and you have avoided
duplicating the RF bit storage.

Stefan Monnier

unread,
May 1, 2020, 1:57:24 PM5/1/20
to
>> Also, how about a compromise where the instruction has a "destination"
>> field of perhaps two bits that are added to one of the source register
>> fields to generate the destination. Since the destination is needed
>> later in time than the sources, this might not slow things down very
>> much. Various people have said this is hard for the compiler to deal
>> with, and I am not a compiler guy, so I would appreciate other's comments.
>
> The register allocation guys in compilers will not enjoy this restriction.

Indeed.

For your typical graph-coloring register allocator, it's easy to add
extra *static* constraints saying that such and such instructions can
only use particular subsets of the register file on some (or all) of its
inputs or outputs.

But it seems non-trivial to add constraints that depend on the choice of
register for some other operand. There might be some clever way to
handle it, and maybe it will even be obvious in hindsight, but at least
it's not obvious to me in "frontsight"(?).


Stefan

lkcl

unread,
May 1, 2020, 2:16:50 PM5/1/20
to
i did really like this, until i realised that it requires 16-element vector operations
to be really effective. if the computational load is not a 16-wide vector (no
dependencies between any of those elements), then the ports *still* have to
be read/written to on single-cycles, and thus you have sub-optimal performance:
each FMAC taking 7/8 cycles to complete (3RD, 3/4pipeline, 1WR), that's
unavoidable.

given that a dedicated GPU is highly unlikely to be faced with non-vector
operations, it doesn't matter (for a dedicated GPU).

sadly we are developing a hybrid processor that must have some semblance
of capability when faced with traditional CPU workloads. hence some weirdness
and unfortunately not being able to use this particularly efficient technique.

l.

Stephen Fuld

unread,
May 1, 2020, 3:06:00 PM5/1/20
to
OK, thanks for your insight. I had thought it would be a variant of the
code used to handle say 64 bit floating point on 32 bit systems by using
two consecutive registers, or similarly, widening multiply.

MitchAlsup

unread,
May 1, 2020, 3:18:47 PM5/1/20
to
Two 4-element FMACs back to back

|-RD-|-RD-|-RD-| |-RD-|-RD-|-RD-|
|FMAC|FMAC|FMAC|FMAC|FMAC|FMAC|FMAC|FMAC|
|-WT-| |-WT-|

lkcl

unread,
May 1, 2020, 4:29:32 PM5/1/20
to
On Friday, May 1, 2020 at 8:18:47 PM UTC+1, MitchAlsup wrote:
> On Friday, May 1, 2020 at 1:16:50 PM UTC-5, lkcl wrote:
> > i did really like this, until i realised that it requires 16-element vector operations
> > to be really effective. if the computational load is not a 16-wide vector (no
> > dependencies between any of those elements), then the ports *still* have to
> > be read/written to on single-cycles, and thus you have sub-optimal performance:
> > each FMAC taking 7/8 cycles to complete (3RD, 3/4pipeline, 1WR), that's
> > unavoidable.
>
> Two 4-element FMACs back to back
>
> |-RD-|-RD-|-RD-| |-RD-|-RD-|-RD-|
> |FMAC|FMAC|FMAC|FMAC|FMAC|FMAC|FMAC|FMAC|
> |-WT-| |-WT-|

hmm... the analysis i did was 12-15 months ago. what the heck was it that made
it problematic?

it _might_ have been the assumption that each 1-ported (1R-or-1W) regfile was striped
into 4 lanes, with no multiplexers to allow cross-over between lanes.

thus, yes, it's perfectly fine to time-slice two 4-element FMACs back-to-back:
however if you *don't* have them issued continuously, utilisation becomes very
poor.

in a GPU workload it is perfectly reasonable to assume a massive continuous parallel
workload, no inter-relation between them, such that the time-streaming produces
much higher FMACs per gate area and power.

which is why i really wish we could use it.

l.

MitchAlsup

unread,
May 1, 2020, 4:39:17 PM5/1/20
to
On Friday, May 1, 2020 at 3:29:32 PM UTC-5, lkcl wrote:
> On Friday, May 1, 2020 at 8:18:47 PM UTC+1, MitchAlsup wrote:
> > On Friday, May 1, 2020 at 1:16:50 PM UTC-5, lkcl wrote:
> > > i did really like this, until i realised that it requires 16-element vector operations
> > > to be really effective. if the computational load is not a 16-wide vector (no
> > > dependencies between any of those elements), then the ports *still* have to
> > > be read/written to on single-cycles, and thus you have sub-optimal performance:
> > > each FMAC taking 7/8 cycles to complete (3RD, 3/4pipeline, 1WR), that's
> > > unavoidable.
> >
> > Two 4-element FMACs back to back
> >
> > |-RD-|-RD-|-RD-| |-RD-|-RD-|-RD-|
> > |FMAC|FMAC|FMAC|FMAC|FMAC|FMAC|FMAC|FMAC|
> > |-WT-| |-WT-|
>
> hmm... the analysis i did was 12-15 months ago. what the heck was it that made
> it problematic?

The First RD has to obtain Rs1 for the next 4 cycles
The second RD has to obtain Rs2 for the next 4 cycles
The third RD has to obtain Rs3 for the next 4 cycles
The fourth WT has to deposit Rd for the previous 4 cycles
>
> it _might_ have been the assumption that each 1-ported (1R-or-1W) regfile was striped
> into 4 lanes, with no multiplexers to allow cross-over between lanes.
>
> thus, yes, it's perfectly fine to time-slice two 4-element FMACs back-to-back:
> however if you *don't* have them issued continuously, utilisation becomes very
> poor.

The above example illustrated two (2) FMAC instructions of 4 elements
each (not 8 individual FMAC instructions)

It works for GPU, it works for vector, it does not work for scalar.

>
> in a GPU workload it is perfectly reasonable to assume a massive continuous parallel
> workload, no inter-relation between them, such that the time-streaming produces
> much higher FMACs per gate area and power.
>
> which is why i really wish we could use it.

A tool to remember.
>
> l.

Stefan Monnier

unread,
May 1, 2020, 5:55:45 PM5/1/20
to
> OK, thanks for your insight. I had thought it would be a variant of the
> code used to handle say 64 bit floating point on 32 bit systems by using
> two consecutive registers, or similarly, widening multiply.

I don't know what trick is used to handle such register pairs.


Stefan

EricP

unread,
May 1, 2020, 5:58:48 PM5/1/20
to
The VAX used register pairs for 64 bit floats, quads for 128 bit.
There might be some old compiler papers about.


EricP

unread,
May 1, 2020, 6:16:07 PM5/1/20
to

lkcl

unread,
May 1, 2020, 7:32:10 PM5/1/20
to
ok interesting, that's a compiler technique, which we will kinda need at the 32 bit level but not the 64 bit one.

to the standard scalar compiler the hardware will look like 64 bit hardware, compliant with POWER9. switch on the vector looping mode and each register becomes effectively a pointer into an SRAM that is typecast to a union of bytearray, hwordarray, word and dword array.

so at that point the colour graph becomes a rainbow :)

l.

MitchAlsup

unread,
May 1, 2020, 9:00:53 PM5/1/20
to
If you have a scalar register file for the scalar stuff, you can still use
the interleave in time philosophy for the vector stuff.

First Read clock< 0>: Rs1 clocks< 4, 5, 6, 7>
second Read clock< 1>: Rs2 clocks< 4, 5, 6, 7>
Third Read clock< 2>: Rs3 clocks< 4, 5, 6, 7>
Write clock< 3>: Rd clocks<-8,-7,-6,-5> FMAC<0> starts
First Read clock< 4>: Rs1 clocks< 4, 5, 6, 7> FMAC<1>
second Read clock< 5>: Rs2 clocks< 4, 5, 6, 7> FMAC<2>
Third Read clock< 6>: Rs3 clocks< 4, 5, 6, 7> FMAC<3>
Write clock< 7>: Rd clocks<-4,-3,-2,-1> FMAC<4> FMAC<0> finishes
First Read clock< 8>: Rs1 clocks< 4, 5, 6, 7> FMAC<5> FMAC<1> finishes
second Read clock< 9>: Rs2 clocks< 4, 5, 6, 7> FMAC<6> FMAC<2> finishes
Third Read clock<10>: Rs3 clocks< 4, 5, 6, 7> FMAC<7> FMAC<3> finishes
Write clock<11>: Rd clocks< 0, 1, 2, 3> FMAC<4> finishes
...
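
(Illustrated as a trivial python loop, purely to show the 3:1 beat allocation
on a single 1R-or-1W port; the element numbering in the table above is
deliberately not reproduced, and port_beat is an invented name.)

def port_beat(clk):
    # one 1R-or-1W register port, time-sliced 3:1 between reads and the write
    beat = clk % 4
    return f"read Rs{beat + 1} (feeds the next 4 FMAC beats)" if beat < 3 \
        else "write Rd (drains a group that finished earlier)"

for clk in range(12):
    print(f"clock<{clk:2}>: {port_beat(clk)}")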

lkcl

unread,
May 1, 2020, 9:57:15 PM5/1/20
to
On Saturday, May 2, 2020 at 2:00:53 AM UTC+1, MitchAlsup wrote:
> > to the standard scalar compiler the hardware will look like 64 bit hardware, compliant with POWER9. switch on the vector looping mode and each register becomes effectively a pointer into an SRAM that is typecast to a union of bytearray, hwordarray, word and dword array.
>
> If you have a scalar register file for the scalar stuff, you can still use
> the interleave in time philosophy for the vector stuff.

so the key characteristic of this technique is that there is a single regfile port per ALU. what we haven't mentioned explicitly is that for the technique to work, the operands being read from that single port will need to be farmed out to the incoming operand latches of the ALU.

a reasonable technique for achieving this: a shift register.

ta-daaaa.

by providing each FunctionUnit with its own shift register, and making the Common Data Buses between the Function Unit Operand Latches and the Arbiters "general" channels rather than specifically dedicated to Op1-and-RegreadPort1, Op2-and-RegreadPort2 etc, the bus and port contention issue goes away.

If three out of four of the Common Data Buses are 100% occupied, an FMAC FunctionUnit can *still* read its 3 operands in sequence over that one remaining bus, performing the shifting *at the Function Unit* to get the operands into the Op123 Latches.

this would not take any longer than if Buses 1, 2 and 3 each happened to be free for one cycle, assuming they were directly wired only to FU op latches 1, 2 and 3, because for 2 of the reads over a single Bus there would be time to perform 2 shifts anyway.

whether the shift register is cyclic or bi-directional is down to how much in the way of routing/muxing can be tolerated at the FunctionUnit. it is however localised rather than being a massive inter-FU multi-bus crossbar.
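
(a rough python model of that per-FU shift register, names invented: an
operand dropped in from whichever CDB happens to be free lands in that bus's
slot and is cyclically shifted op1 -> op2 -> op3 -> op1 until it sits in its
target operand latch:)

NOPS = 3                      # e.g. an FMAC FU: 3 operand latches

class OperandShifter:
    def __init__(self):
        # one slot per CDB / operand latch: (target_latch, value) or None
        self.slots = [None] * NOPS

    def drop(self, bus, target, value):
        # an operand destined for latch `target` arrives over CDB `bus`
        assert self.slots[bus] is None
        self.slots[bus] = (target, value)

    def tick(self):
        # capture anything already sitting in its own latch position...
        ready = {}
        for i, s in enumerate(self.slots):
            if s is not None and s[0] == i:
                ready[i] = s[1]
                self.slots[i] = None
        # ...then shift the rest one place round (op1 -> op2 -> op3 -> op1)
        self.slots = self.slots[-1:] + self.slots[:-1]
        return ready

fu = OperandShifter()
fu.drop(bus=1, target=0, value=0x1234)   # op1 forced over CDB2: 2 extra shifts
for cycle in range(4):
    print(cycle, fu.tick(), fu.slots)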

in front of the regfile the previous idea (a single cyclic hopper / buffer) becomes no longer necessary.

instead, one Common Data Broadcast Bus per regfile port, whether read or write, can be provided.

now, clearly, it is preferable to still have read port 1 prioritise FU OP1 latch reads, to avoid the need to run OP1 through two cyclic shifts (from FU OP2 hopper then to OP3 hopper and finally to OP1).

my only concern is that it is quite a lot of extra register latches, with there being one cyclic shift buffer *per FunctionUnit* rather than one global one.

l.

Ivan Godard

unread,
May 2, 2020, 12:52:59 AM5/2/20
to
On 5/1/2020 6:57 PM, lkcl wrote:
> On Saturday, May 2, 2020 at 2:00:53 AM UTC+1, MitchAlsup wrote:
>>> to the standard scalar compiler the hardware will look like 64 bit hardware, compliant with POWER9. switch on the vector looping mode and each register becomes effectively a pointer into an SRAM that is typecast to a union of bytearray, hwordarray, word and dword array.
>>
>> If you have a scalar register file for the scalar stuff, you can still use
>> the interleave in time philosophy for the vector stuff.
>
> so the key characteristic of this technique is that there is a single regfile port per ALU. what we haven't mentioned explicitly is that for the technique to work, the operands being read from that single port will need to be farmmed out to the incoming operand latches of the ALU.
>
> a reasonable technique for achieving this: a shift register.
>
> ta-daaaa.
>
> by providing each FunctionUnit with its own shift register, and making the Common Data Buses between the Function Unit Operand Latches and the Arbiters "general" channels rather than specifically dedicated to Op1-and-RegreadPort1, Op2-and-RegreadPort2 etc, the bus and port contention issue goes away.
>
> If three out of four of the Common Data Buses are 100% occupied, an FMAC FunctionUnit can *still* read its 3 operands in sequence over that one remaining bus, perform the shifting *at the Function Unit* to get the operands into the Op123 Latches.
>
> this would not take any longer than if Bus1 2 and 3 happened to be free for one cycle each, assuming they were directly wired only to FU op latch 1 2 and 3 because for 2 of the reads over a single Bus there would be time to perform 2 shifts anyway.
>
> whether the shift register is cyclic or bi-directional is down to how much in the way of routing/muxing can be tolerated at the FunctionUnit. it is however localised rather than being a massive inter-FU multi-bus crossbar.

Mill is of course built around a crossbar, and we early on hit the size
constraint that has you worried, and much worse than you would be seeing
due to the Mill's width. Our solution was the cascaded crossbar
described in the talks. I'd offer it to you, but you couldn't accept
anyway because it's patented.

Your shift register idea appears at first sight to be only usable in OOO
with dynamic arbiting because the operand latency would vary based on
how many of the arguments had to use the shift register. But actually I
think you could use this in a static schedule design too, because the
compiler could model the state of the shift registers, and know when the
required operands got to the FU, and from that when the results would
come out, i.e. the actual latency, and from that could build a schedule.

In such a design you would still have to deal with making the dataflows
congruent at control join points, which is a bugaboo of any statically
schedule system. But I don't see any reason you couldn't do it with your
own flavor of conforming ops such as the Mill uses.

lkcl

unread,
May 2, 2020, 2:08:08 AM5/2/20
to
On Saturday, May 2, 2020 at 5:52:59 AM UTC+1, Ivan Godard wrote:
> On 5/1/2020 6:57 PM, lkcl wrote:
> > whether the shift register is cyclic or bi-directional is down to how much in the way of routing/muxing can be tolerated at the FunctionUnit. it is however localised rather than being a massive inter-FU multi-bus crossbar.
>
> Mill is of course built around a crossbar, and we early on hit the size
> constraint that has you worried, and much worse than you would be seeing
> due to the Mill's width. Our solution was the cascaded crossbar
> described in the talks. I'd offer it to you, but you couldn't accept
> anyway because it's patented.

in 1991 my 3rd year project was to improve the ALICE Transputer network. it was a half butterfly using 8-in 8-out crossbar chips designed by PLESSEY and it sucked. random-routed packet throughput fell off exponentially due to blocking. these blocking rules did however provide guaranteed delivery times.

i therefore doubled it up (now called a Bezer Network or something) aka a butterfly network, and noted that there were always two paths to a destination.

by analysing each bit of the address i realised that all permutations (no conflicting delivery) could be solved in linear time with 100% throughput.

so i appreciate the offer... however i have a potential solution if needed, on which there is prior art (by me) dating back to a period such that, even if anyone did patent it back then (without my knowledge or permission), it would have long since expired.

> Your shift register idea appears at first sight to be only usable in OOO
> with dynamic arbiting because the operand latency would vary based on
> how many of the arguments had to use the shift register. But actually I
> think you could use this in a static schedule design too, because the
> compiler could model the state of the shift registers,

and some static guarantees built-in to them, by having fixed delivery times and no bypass mechanisms that shorten the arrival under certain dynamic conditions (particularly data-dependent ones).

so, hmm, that's a good point: all pipelines should have fixed (or known, calculable) completion times.

which for LD/ST may be challenging.

and for predication, where we will be cancelling operations in-flight, based on bits in a mask register that's delayed.

data-dependent "fail on first" might also get tricky.

> and know when the
> required operands got to the FU, and from that when the results would
> come out, i.e. the actual latency, and from that could build a schedule.

that's a neat idea.

i wonder if the Common Data Buses have any interaction consequences... they... hmmm... the prioritisation rules about which Bus to use in any given cycle would also have to be mirrored in the compiler.

those rules, let's say that Regfile Port 1 CDB (Common Data Bus), by convention always connected to Operand 1 Latch on every Function Unit, happened to be busy.

which alternative Bus do you pick?

Bus 2 is not the logical choice to deliver Operand 1 because on a 3 operand FunctionUnit, it will be 2 shifts, 2 extra cycles, to get the data into Op1 position inside the FU.

On the other hand, if the FU doesn't *have* a 3op or even a 2op (a NEG or INV FU) then delivery can be done over any of the 3 CDBs and the operand delivered on that same clock. assuming that all 3 CDBs are in fact connected *to* that 1 operand FU.

these all seem ok though.

it just leaves data-dependent operations (vecror ffirst, vector predication, branch speculation) as needing some more thought.

however at least in the Mill you have goid solutions for those.

> In such a design you would still have to deal with making the dataflows
> congruent at control join points, which is a bugaboo of any statically
> schedule system. But I don't see any reason you couldn't do it with your
> own flavor of conforming ops such as the Mill uses.

AMD used to always be penalised on performance for Windows because Microsoft specifically used Intel's optimising compiler.

baked into that proprietary compiler was secret knowledge of the best ways to schedule instructions, based on detailed internal scheduling knowledge of Intel products.

it does concern me somewhat, how will our design react to assembly code specifically statically optimised, even by gcc, for IBM POWER9 Cores?

we just have to see, for now, i will be celebrating when it runs anything at all :)

l.

Ivan Godard

unread,
May 2, 2020, 5:45:53 AM5/2/20
to
On 5/1/2020 11:08 PM, lkcl wrote:
> On Saturday, May 2, 2020 at 5:52:59 AM UTC+1, Ivan Godard wrote:
>> On 5/1/2020 6:57 PM, lkcl wrote:
>>> whether the shift register is cyclic or bi-directional is down to how much in the way of routing/muxing can be tolerated at the FunctionUnit. it is however localised rather than being a massive inter-FU multi-bus crossbar.
>>
>> Mill is of course built around a crossbar, and we early on hit the size
>> constraint that has you worried, and much worse than you would be seeing
>> due to the Mill's width. Our solution was the cascaded crossbar
>> described in the talks. I'd offer it to you, but you couldn't accept
>> anyway because it's patented.
>
> in 1991 my 3rd year project was to improve the ALICE Transputer network. it was a half butterfly using 8-in 8-out crossbar chips designed by PLESSEY and it sucked. random-routed packet throughput fell off exponentially due to blocking. these blocking rules did however provide guaranteed delivery times.
>
> i therefore doubled it up (now called a Bezer Network or something) aka a butterfly network, and noted that there were always two paths to a destination.
>
> by analysing each bit of the address i realised that all permutations (no conflicting delivery) could be solved in linear time with 100% throughput.
>
> so i appreciate the offer... however i have a potential solution if needed, on which there is prior art (by me) dating back to a period which if anyone did patent it back then (without my kniwledge or permission), it would have long expired since.

Sounds like you built a real crossbar, i.e. a routing network that is
N*M. Though we call what we have a "crossbar" for the benefit of casual
readers, it is not really one in the usual sense. Instead it is a
brute-force all-to-all mux-mess like a conventional bypass network. The
cascade patent turns that into an M-to-3 and a 3-to-N. On a typical
config, M is ~40 and N is ~12, which is big but not in the range of
modern RFs. 40-to-12 was right out.

>> Your shift register idea appears at first sight to be only usable in OOO
>> with dynamic arbiting because the operand latency would vary based on
>> how many of the arguments had to use the shift register. But actually I
>> think you could use this in a static schedule design too, because the
>> compiler could model the state of the shift registers,
>
> and some static guarantees built-in to them, by having fixed delivery times and no bypass mechanisms that shorten the arrival under certain dynamic conditions (particularly data-dependent ones).
>
> so, hmm, that's a good point: all pipelines should have fixed (or known, calculable) completion times.
>
> which for LD/ST may be challenging.
>
> and for predication, where we will be cancelling operations in-flight, based on bits in a mask register that's delayed.
>
> data-dependent "fail on first" might also get tricky.
>
>> and know when the
>> required operands got to the FU, and from that when the results would
>> come out, i.e. the actual latency, and from that could build a schedule.
>
> that's a neat idea.
>
> i wonder if the Common Data Buses have any interaction consequences... they... hmmm... the prioritisation rules about which Bus to use in any given cycle would also have to be mirrored in the compiler.
>
> those rules, let's say that Regfile Port 1 CDB (Common Data Bus), by convention always connected to Operand 1 Latch on every Function Unit, happened to be busy.
>
> which alternative Bus do you pick?

Doesn't matter so long as it is deterministic and the compiler and
hardware agree.

> Bus 2 is not the logical choice to deliver Operand 1 because on a 3 operand FunctionUnit, it will be 2 shifts, 2 extra cycles, to get the data into Op1 position inside the FU.
>
> On the other hand, if the FU doesn't *have* a 3op or even a 2op (a NEG or INV FU) then delivery can be done over any of the 3 CDBs and the operand delivered on that same clock. assuming that all 3 CDBs are in fact connected *to* that 1 operand FU.
>
> these all seem ok though.
>
> it just leaves data-dependent operations (vecror ffirst, vector predication, branch speculation) as needing some more thought.
>
> however at least in the Mill you have goid solutions for those.

Most of our goids boiled down to ignoring the situation. The only op
where dynamic latency matters is LOAD. All the rest of the early outs
are just hardware complication and benefits in the noise. YMMV.

>> In such a design you would still have to deal with making the dataflows
>> congruent at control join points, which is a bugaboo of any statically
>> schedule system. But I don't see any reason you couldn't do it with your
>> own flavor of conforming ops such as the Mill uses.
>
> AMD used to always be penalised on performance for Windows because Microsoft specifically used Intel's optimising compiler.
>
> baked into that proprietary compiler was secret knowledge of the best ways to schedule instructions, based on detailed internal scheduling knowledge of Intel products.
>
> it does concern me somewhat, how will our design react to assembly code specifically statically optimised, even by gcc, for IBM POWER9 Cores?
>
> we just have to see, for now, i will be celebrating when it runs anything at all :)

Yep :-)

lkcl

unread,
May 2, 2020, 7:09:19 AM5/2/20
to
On Saturday, May 2, 2020 at 10:45:53 AM UTC+1, Ivan Godard wrote:
> On 5/1/2020 11:08 PM, lkcl wrote:
> > On Saturday, May 2, 2020 at 5:52:59 AM UTC+1, Ivan Godard wrote:
> >> On 5/1/2020 6:57 PM, lkcl wrote:
> >>> whether the shift register is cyclic or bi-directional is down to how much in the way of routing/muxing can be tolerated at the FunctionUnit. it is however localised rather than being a massive inter-FU multi-bus crossbar.
> >>
> >> Mill is of course built around a crossbar, and we early on hit the size
> >> constraint that has you worried, and much worse than you would be seeing
> >> due to the Mill's width. Our solution was the cascaded crossbar
> >> described in the talks. I'd offer it to you, but you couldn't accept
> >> anyway because it's patented.
> >
> > in 1991 my 3rd year project was to improve the ALICE Transputer network. it was a half butterfly using 8-in 8-out crossbar chips designed by PLESSEY and it sucked. random-routed packet throughput fell off exponentially due to blocking. these blocking rules did however provide guaranteed delivery times.
> >
> > i therefore doubled it up (now called a Bezer Network or something) aka a butterfly network, and noted that there were always two paths to a destination.
> >
> > by analysing each bit of the address i realised that all permutations (no conflicting delivery) could be solved in linear time with 100% throughput.
> >
> > so i appreciate the offer... however i have a potential solution if needed, on which there is prior art (by me) dating back to a period which if anyone did patent it back then (without my kniwledge or permission), it would have long expired since.
>
> Sounds like you built a real crossbar, i.e. a routing network that is
> N*M.

N x LOG (M) or something like that. diagram 2, here:

http://homepages.inf.ed.ac.uk/cgi/rni/comp-arch.pl?Networks/benes.html,Networks/benes-f.html,Networks/menu-dyn.html

"Benes Network" is the quotes modern quotes name for this much-rediscovered
concept-with-many-names :)


> Though we call what we have a "crossbar" for the benefit of casual
> readers, it is not really one in the usual sense. Instead it a
> brute-force all-to-all mux-mess like a conventional bypass network. The
> cascade patent turns that into a M-to-3 and a 3-to-N.

that's not like what i came up with at all. now i'm interested and curious.

> On a typical
> config, M is ~40 and N is ~12, which is big but not in the range of
> modern RFs. 40-to-12 was right out.

40 Function Units, 12 read/write ports, any-to-any brute-force Muxing:
yeah that's Bad :)

even the Benes Network still does not provide a reduction in gate count
that could be considered "reasonable", except only in the most ultra
high performance designs.

> > those rules, let's say that Regfile Port 1 CDB (Common Data Bus), by convention always connected to Operand 1 Latch on every Function Unit, happened to be busy.
> >
> > which alternative Bus do you pick?
>
> Doesn't matter so long as it is deterministic and the compiler and
> hardware agree.

understood. i mention it only because the wrong hardware choice would result in
consistent performance degradation.


> > it just leaves data-dependent operations (vecror ffirst, vector predication, branch speculation) as needing some more thought.
> >
> > however at least in the Mill you have goid solutions for those.
>
> Most of our goids boiled down to ignoring the situation.

i see you noticed i was typing on a phone... :)

"ignore the situation". oh, you mean: "yes, you know that (potentially)
static compiler scheduling could hypothetically help, but because of
the data-dependency just ignore it, as it's not really critical, it's
an optimisation".

> The only op
> where dynamic latency matters is LOAD. All the rest of the early outs
> are just hardware complication and benefits in the noise. YMMV.

appreciated the insights.

l.

lkcl

unread,
May 2, 2020, 9:48:34 AM5/2/20
to
On Saturday, May 2, 2020 at 10:45:53 AM UTC+1, Ivan Godard wrote:

> readers, it is not really one in the usual sense. Instead it a
> brute-force all-to-all mux-mess like a conventional bypass network. The
> cascade patent turns that into a M-to-3 and a 3-to-N.

ok the significance of this sunk in overnight. is it the case that there is "broadcast"
capability via those -3- pathways? or are they just dedicated exclusive routes,
requiring multiple assertions of the same register value, multiple times over
the same -3- pathways, for delivery from N sources (regfile ports) to M FUs?

* the "3" is the "3 Common / Broadcast Data Buses" (RD1, RD2, RD3) that have been
percolating in my mind.

* the "M" is the per-Function-Unit cyclic buffer(s).

* the "N" is the *global* cyclic buffer(s) which allow for decoupling of in-flight
values from Number-of-Buses (3) from regfile ports.

the advantage of having Broadcast Bus(es) for read is that *multiple* Function Units
may, just like the CDB (Common Data Bus) of a Tomasulo Algorithm, receive the
value in multiple places.

for the broadcasting idea to work, however: unlike the original 6600, it requires that
the ID of the register be transmitted (broadcast) alongside the actual data.

in the Tomasulo Algorithm that was a binary ID, meaning that every Reservation Station
on *every* Function Unit required an XOR comparator array. 40 FUs, 3 operands per
FU, 5-bit IDs for register numbers is a total of 600 XOR gates - every single one of
which will fire and use power on every cycle. yuk.

honestly i do not know if (wires-wise) using a unary ID would be any better, although
if i think about it real fast, i think you *might* get away with a *single* unary "requestor"
(merge all READ_REQs into a single unary vector) per Function Unit, as long as you don't
mind losing the information about which FU Operand Port (1/2/3) is requesting which
register.

if you _do_ mind (because if the value arrives over CDB 2 and needs to be cycle-shifted
to port 1 that's 2 extra clock cycles) then that's a hell of a lot of wires! 6,400 wires
if there are 32 bit unary register IDs, 40 Function Units, and 5 ports (3 RD, 2WR) over
which those are transmitted!

we're aiming for around 20 FUs which is still 3,200 wires. dang.

if using binary-encoding, that reduces down to 500 (plus another 500 to get the ID *to* the
Function Unit). it's still a hell of a lot, but is more manageable.


the other aspect we need to consider: whilst the behaviour of the GO_READ/REQ_READ
does not change, it is absolutely imperative that the Function Unit *not* receive a
"GO_WRITE" signal until the result has been committed to the regfile... *even* if it's
sent to other units before that happens, via (any) Forwarding Bus(es).

the reason why "GO_WRITE" must not be triggered is because that signals to the
Dependency Matrices, "there is no longer a write hazard at the regfile" and in the
Forwarding / Bypass case, and when values are in any of the cyclic buffers, that
condition has *NOT* been met.

that means that GO_READ is near-trivial, whilst GO_WRITE becomes a lot more involved.

needs thought.

l.

Ivan Godard

unread,
May 2, 2020, 10:22:52 AM5/2/20
to
The erstwhile crossbar has a fast path and a slower path. N is the
number of consumers across all slots; some may not have operations in a
particular cycle and some that have operations may not use use all the
input paths.

There is a mux path from all slots that have single-cycle results from
the previous cycle (roughly N/2) to each of the N (12) inputs. This is
effectively a direct back-to-back forwarding network. This is the fast path.

Operations from previous cycles that were not single-cycle (say FMUL for
example) do not have paths from the relevant FUs' output latches to the M
consumers the way single-cycle results do. Instead there is a mux path
from each source to one of three latches associated with the M inputs
(two of those are full width and one is 1 bit wide; don't ask). This is
the slow path.

In the following cycle, the values from the three latches are presented
to the FU inputs along with the N/2 from the fast path. In the worst
case, a natural three-cycle op can take three cycles in the FU and one
in the slow path. However, the muxes (34-to-3) of the slow path don't
really take a whole cycle, so the FU can steal the excess by pushing the
three latches into the slow path cycle and using the rest itself.

There are three latches because the PICK op (C's ?:) is actually not an
FU; the select is done in the crossbar during transit of the data in the
muxes (that's why one of the triad is single-bit; it's the true/false that
passes one of the two other values to the FU inputs).

So if M is 40 and N is 12 (values pulled out of the air) the FU inputs
are being driven by N/2+3 sources, while the latch triads are being
driven by 40-N/2 FU (etc) latches, but have a whole additional cycle for
the mux trees.

The physical layout of all this puts the fast path muxes in a row, with
the FUs butted right up against it so the mux tree output is adjacent to
the FU input. The mux trees are actually folded over, so the inputs and
outputs are on top of each other (not physically of course, but
interdigitated). This makes the wire lengths pretty close to nil.

Each FU in the row of adjacent FUs is itself folded so that the
layout of slow path ops leaves the latch triad also adjacent to the fast
path. If there are four pipe stages it looks like
muxes | FU input | FU stage 1 | FU stage 2 V
muxes |
muxes | triads | FU stage 4 | FU stage 3 <



It's a fire hose. Does that help?

>> On a typical
>> config, M is ~40 and N is ~12, which is big but not in the range of
>> modern RFs. 40-to-12 was right out.
>
> 40 Function Units, 12 read/write ports, any-to-any brute-force Muxing:
> yeah that's Bad :)
>
> even the Benes Network still does not provide a reduction in gate count
> that could be considered "reasonable", except only in the most ultra
> high performance designs.
>
>>> those rules, let's say that Regfile Port 1 CDB (Common Data Bus), by convention always connected to Operand 1 Latch on every Function Unit, happened to be busy.
>>>
>>> which alternative Bus do you pick?
>>
>> Doesn't matter so long as it is deterministic and the compiler and
>> hardware agree.
>
> understood. i mention it only because the wrong hardware choice would result in
> consistent performance degradation.
>
>
>>> it just leaves data-dependent operations (vecror ffirst, vector predication, branch speculation) as needing some more thought.
>>>
>>> however at least in the Mill you have goid solutions for those.
>>
>> Most of our goids boiled down to ignoring the situation.
>
> i see you noticed i was typing on a phone... :)
>
> "ignore the situation". oh, you mean: "yes, you know that (potentially)
> static compiler scheduling could hypothetically help, but because of
> the data-dependency just ignore it, as it's not really critical, it's
> an optimisation".

Not really. Early-outs other than LOAD, such as X*0->0, are rare enough
that even if you gave them zero latency (which is possible in some
systems where the substitution happens in register rename) you would not
see measurable gain in real whole programs with real data.

Early-out for load make a very big difference, one well worth adapting
the architecture to. Mill does it by splitting the issue and retire of a
load into two distinct ops; the semantics are that load consumers see
memory as of the retire op, not as of the issue op. As a result the load
op has no retire dependencies and can be issued as soon as its arguments
are available, while the retire can be much later, just before its
consumers, thus eating as much of the variable latency as possible.

The hardware for that has a set of (logically) per frame retire
stations, filled by the issue op, that watch the store stream much like
the load buffers you are wrestling Mitch about. When the retire op
executes, the RS buffer drops its value if the load has completed, and
stalls if it hasn't.

Ivan Godard

unread,
May 2, 2020, 10:40:41 AM5/2/20
to
On 5/2/2020 6:48 AM, lkcl wrote:
> On Saturday, May 2, 2020 at 10:45:53 AM UTC+1, Ivan Godard wrote:
>
>> readers, it is not really one in the usual sense. Instead it a
>> brute-force all-to-all mux-mess like a conventional bypass network. The
>> cascade patent turns that into a M-to-3 and a 3-to-N.
>
> ok the significance of this sunk in overnight. is it the case that there is "broadcast"
> capability via those -3- pathways? or are they just dedicated exclusive routes,
> requiring multiple assertions of the same register value, multiple times over
> the same -3- pathways, for delivery from N sources (regfile ports) to M FUs?

I hope my previous posting may have cleared some of up.

> * the "3" is the "3 Common / Broadcast Data Buses" (RD1, RD2, RD3) that have been
> percolating in my mind.

Mill has no registers. Each slot can have as many inputs as the most
demanding op supported by that slot requires. Most are two-input; the
obvious suspects are three input. For four or more we use ganging to
spread the op across two slots to keep the fan-in reasonable.

> * the "M" is the per-Function-Unit cyclic buffer(s).

Input latches; no cyclic, no buffer, no FFs.

> * the "N" is the *global* cyclic buffer(s) which allow for decoupling of in-flight
> values from Number-of-Buses (3) from regfile ports.
>
> the advantage of having Broadcast Bus(es) for read is that *multiple* Function Units
> may, just like the CDB (Common Data Bus) of a Tomasulo Algorithm, receive the
> value in multiple places.

It's all-to-all in each part of the cascade. There are no busses.

> for the broadcasting idea to work, however: unlike the original 6600, it requires that
> the ID of the register be transmitted (broadcast) alongside the actual data.

There are a couple of ways that you can identify belt operands (the
crossbar is really the belt when you think it through). Each FU input
has to get an operand from particular some output latch; decode decides
which output should be used. One way is to add an id to each output (set
when the FU was goosed and passed through), and then have each input
broadcast the id of the value it wants; makes the whole thing act like a
CAM. Another way is to have decode emit a mux-control mask for each
input that needs a value and each path through the muxes uses that mask
to route.

IANAHG, and frankly I don't understand the details; I expect we will do
both to try out.

> in the Tomasulo Algorithm that was a binary ID, meaning that every Reservation Station
> on *every* Function Unit required an XOR comparator array. 40 FUs, 3 operands per
> FU, 5-bit IDs for register numbers is a total of 600 XOR gates - every single one of
> which will fire and use power on every cycle. yuk.
>
> honestly i do not know if (wires-wise) using an unary ID would be any better, although
> if i think about it real fast, i think you *might* get away with a *single* unary "requestor"
> (merge all READ_REQs into a single unary vector) per Function Unit, as long as you don't
> mind losing the information about which FU Operand Port (1/2/3) is requesting which
> register.
>
> if you _do_ mind (because if the value arrives over CDB 2 and needs to be cycle-shifted
> to port 1 that's 2 extra clock cycles) then that's a hell of a lot of wires! 6,400 wires
> if there are 32 bit unary register IDs, 40 Function Units, and 5 ports (3 RD, 2WR) over
> which those are transmitted!
>
> we're aiming for around 20 FUs which is still 3,200 wires. dang.

We pass more bits than that, say ~75 bits (counting metadata) by 9
inputs, but there are no long wires because everything is latches and
immediately adjacent. It's a bypass - the area and power are like a
bypass, not like RFs or RSs.

> if using binary-encoding, that reduces down to 500 (plus another 500 to get the ID *to* the
> Function Unit). it's still a hell of a lot, but is more manageable.
>
>
> the other aspect we need to consider: whilst the behaviour of the GO_READ/REQ_READ
> does not change, it is absolutely imperative that the Function Unit *not* receive a
> "GO_WRITE" signal until the result has been committed to the regfile... *even* if it's
> sent to other units before that happens, via (any) Forwarding Bus(es).
>
> the reason why "GO_WRITE" must not be triggered is because that signals to the
> Dependency Matrices, "there is no longer a write hazard at the regfile" and in the
> Forwarding / Bypass case, and when values are in any of the cyclic buffers, that
> condition has *NOT* been met.
>
> that means that GO_READ is near-trivial, whilst GO_WRITE becomes a lot more involved.
>
> needs thought.

With static scheduling you don't need either GO_READ or GO_WRITE. Why
make your life complicated :-)

lkcl

unread,
May 2, 2020, 1:11:15 PM5/2/20
to
On Saturday, May 2, 2020 at 3:40:41 PM UTC+1, Ivan Godard wrote:

> I hope my previous posting may have cleared some of up.

yes - raised a lot of questions too :)

> There are a couple of ways that you can identify belt operands (the
> crossbar is really the belt when you think it through).

that makes sense.

> Each FU input
> has to get an operand from particular some output latch; decode decides
> which output should be used. One way is to add an id to each output (set
> when the FU was goosed and passed through), and then have each input
> broadcast the id of the value it wants; makes the whole thing act like a
> CAM. Another way is to have decode emit a mux-control mask for each
> input that needs a value and each path through the muxes uses that mask
> to route.
>
> IANAHG, and frankly I don't understand the details; I expect we will do
> both to try out.

it's still evolving; we're on the clock and this is a crucial piece of the puzzle. another idea occurred to me which may reduce gate/wire count (below)

> > we're aiming for around 20 FUs which is still 3,200 wires. dang.
>
> We pass more bits than that, say ~75 bits (counting metadata) by 9
> inputs, but there are no long wires because everything is latches and
> immediately adjacent.

niiice.

> With static scheduling you don't need either GO_READ nor GO_WRITE. Why
> make your life complicated :-)

sigh because we don't have the time to invest in compilers as well :) your team is quite lucky in being able to do both.

so the latest idea recognises the following:

* Each FU has a row in the FU-REGS DM that captures the critical information about what regs the FU needs.
* more than that, it captures the FU *Opcode* port order/usage
* the idea is to have a cyclic buffer at each FU of the same width as the FU Operands.
* and to have the same number of CDB Broadcast Buses

so all those numbers match up: same size cyclic buffer, same number of CDBs, all one to one captured in the FUREGS DM.

now, what is *normally* considered is: broadcast the reg ID (Belt ID) and match those up using CAM / XOR gate comparators, or, if you can afford the wires, a unary ID which only needs an AND gate.

neither option is attractive :)

the alternative: for each FU, calculate the difference between

* the index of the FU Operand Number and
* the index of the CDB Bus Number

then transmit those differences along with the value, and, on receipt, use it as a countdown of the number of cyclic shifts to be performed at the FU!
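
(a rough python sketch of that, building on the shift-register sketch further
up the thread: the sender works out the shift count, the FU just counts it
down as the value rotates round. shift_count / receive / tick are invented
names, this is illustration only:)

NOPS = 3

def shift_count(operand_idx, bus_idx):
    # cyclic shifts (op1 -> op2 -> op3 -> op1) needed for a value dropped on
    # CDB `bus_idx` to land in operand latch `operand_idx`: never more than
    # 2, so it fits in 2 bits (1 bit for a 2-operand FU)
    return (operand_idx - bus_idx) % NOPS

def receive(slots, bus_idx, value, count):
    # the FU never sees a register/belt ID: just the value and the countdown
    slots[bus_idx] = (count, value)

def tick(slots):
    # capture anything whose countdown hit zero, then rotate and decrement
    ready = {i: s[1] for i, s in enumerate(slots) if s is not None and s[0] == 0}
    kept = [None if (s is None or s[0] == 0) else s for s in slots]
    rotated = kept[-1:] + kept[:-1]
    slots[:] = [None if s is None else (s[0] - 1, s[1]) for s in rotated]
    return ready

slots = [None] * NOPS
receive(slots, bus_idx=1, value=0xCAFE, count=shift_count(0, 1))  # 2 shifts
for cyc in range(4):
    print(cyc, tick(slots), slots)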

the really *really* cool thing is, those differences are only going to be 2 bits! only 1 bit if the FU has only 2 input operands!

so there would only be 2 bits for each FU READ side (3 input latches), 1 bit for LDST with Update (2 write latches).

times 20 for the number of FUs, times 5 for the number of Common Data Buses (3 read, 2 write).

ta-daaa, only 600 wires! and just as with the original 6600, you don't even need to send the Reg ID (Belt ID) to the FU because the destination is encoded back at the DM.

i haven't yet been able to mull it over to think through any major issues.

l.