
tomasulo algorithm and reorder buffers for parallel vector engines / SIMD


lk...@lkcl.net

Dec 5, 2018, 3:31:09 AM
with thanks to mitch alsup for alerting me to the tomasulo
algorithm and reservation stations, i began investigating
how to leverage this algorithm for use in SimpleV. here
is a really good video explaining reorder buffers and how
to do speculative execution:

https://www.youtube.com/watch?v=OU3jI8j7Ozw

what i particularly like about reorder buffers is that they are
an exceptionally clean solution which pretty much solves
absolutely every hazard and potential issue
(LD/ST WAW/WAR, register WAW/WAR), can be used to
implement multi-issue and speculation, clears up branch
mis-prediction, and deals with exceptions as well, simply
by rolling back (erasing) the ROB table.

it's effectively conceptually identical to database
"transactions", where the only items that are allowed to be
"committed" are those at the head of the buffer, once all data
is in. not only that: extending tomasulo to do multi-issue
is a simple matter of allowing more than one instruction
to be entered into the ROB on every clock cycle, and ensuring
that all data paths are wide enough to cope.
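
to make the "commit only at the head" rule concrete, here is a toy
model (python; class and field names invented, behaviour only, no
attempt at modelling gates):

from collections import deque

class ROBEntry:
    def __init__(self, itype, dest):
        self.itype = itype     # branch / store / ALU op / exception
        self.dest = dest       # register number or memory address
        self.value = None
        self.ready = False     # "all data is in"

class ROB:
    def __init__(self, size=32):
        self.size = size
        self.entries = deque() # entries[0] is the head

    def allocate(self, itype, dest):
        # multi-issue is just calling this more than once per cycle
        # (and making the data paths wide enough to cope)
        if len(self.entries) >= self.size:
            return None        # ROB full: issue stalls
        e = ROBEntry(itype, dest)
        self.entries.append(e)
        return e

    def complete(self, entry, value):
        entry.value = value    # results may arrive out of order
        entry.ready = True

    def commit(self, regfile):
        # only the head may commit, and only once its data is in
        while self.entries and self.entries[0].ready:
            e = self.entries.popleft()
            regfile[e.dest] = e.value

    def flush(self):
        # branch mispredict / exception: erase the whole table
        self.entries.clear()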

scoreboarding on the other hand is an absolute nuisance:

http://home.deib.polimi.it/silvano/FilePDF/AAC/Lesson_4_ILP_PartII_Scoreboard.pdf

scoreboarding only works properly when register renaming
is deployed; it doesn't seem to lend itself to multi-issue,
it doesn't solve exception roll-backs, and in fact *causes*
exception problems, needing the concept of "imprecise"
exceptions (p16).

so my first question is: what am i missing about scoreboards?
they're apparently much simpler than the tomasulo algorithm
and reorder buffers, and also i understand that a CAM is not
needed, which will save power.

the second question is: are there known ways to reduce power
consumption of the Reorder Buffer? the four fields of a ROB
table entry are:

* instruction type (branch / store / ALU op / exception)
* destination (register number or memory address)
* value
* ready / completed status

the destination is what needs to be a CAM (key-value store),
as on every cycle it is necessary to check if the source
operands are depending on *any one* of the entries in the
ROB table which have outstanding execution to be carried out.

the one i thought of is to have a bitfield representing every
available register (in the case of SV, that's 128 bits). if
a given register has a bit clear, it indicates that there
does not exist a ROB entry with that register as a destination,
indicating that it is possible to cut out the CAM lookup step.

so... would it be unreasonable to simply maintain a table
of register-to-ROB# entries? if there are 128 destination
registers, and the ROB table is 32 entries, would a table
128 deep and 5 bits wide do, as a way to not have a CAM?
is it really that simple?
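
to make that concrete, a toy model of the non-CAM lookup (python;
names invented, behaviour only):

NREGS = 128    # SV register count
ROBSIZE = 32   # so the table is 128 deep, 5 bits wide, plus a valid bit

class RegToROB:
    def __init__(self):
        self.robnum = [0] * NREGS      # 5-bit ROB# per register
        self.valid = [False] * NREGS   # the 128-bit "has a ROB entry" bitfield

    def issue_write(self, rd, rob_index):
        # a later in-flight write to the same rd just overwrites the
        # mapping: new readers only care about the *most recent* writer
        assert 0 <= rob_index < ROBSIZE
        self.robnum[rd] = rob_index
        self.valid[rd] = True

    def lookup_src(self, rs):
        # valid bit clear => no outstanding ROB entry writes rs, so the
        # value comes straight from the register file (no CAM lookup)
        if self.valid[rs]:
            return ('rob', self.robnum[rs])
        return ('regfile', rs)

    def commit(self, rd, rob_index):
        # clear the mapping only if this ROB entry is still the latest
        # writer of rd (a younger write may have taken the slot over)
        if self.valid[rd] and self.robnum[rd] == rob_index:
            self.valid[rd] = False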


thirdly - this may take some explaining - i'm looking to map
predicated SIMD down onto the Reorder Buffer, which will be
an important aspect of SV, and all other implementation
schemes are potentially problematic.

SV has a register "redirection" table, where a 5-bit standard
*scalar* operation may be "redirected" to a 7-bit *actual*
register. this ensures that the standard RISC-V scalar
operations need not be modified at all.

however, there is no restriction on the redirection: x5 could
be redirected to x112, *and x30 could be as well*.

more than that, it is possible to request that when for example
using x5, the operand bitwidth is changed to e.g. 16-bit,
and when using for example x30, the operand bitwidth is changed
to e.g. 32-bit *and is offset to target the UPPER half of
register x112*.

thus, it is not the *full* width of the register that's being
used, and, consequently, if byte-masking is passed all the way
down to the register file (byte-write-enable as opposed to just
reg-write-enable), there may be an opportunity to even pass
that byte-masking into SIMD-style ALUs:

https://libre-riscv.org/3d_gpu/rob_bytemask.jpg

in the above image, there is the possibility of having *three*
entries for the exact same destination register, because all
three byte-masks do not overlap.

in this way it would be possible to use SIMD-style ALUs to
carry out multiple lower-bitwidth operations.
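
the non-overlap check itself is trivial; a sketch (python,
illustrative only):

def masks_disjoint(existing_masks, new_mask):
    # existing_masks: byte-enable masks of ROB entries already targeting
    # this destination register; new_mask: the candidate new entry's mask
    combined = 0
    for m in existing_masks:
        combined |= m
    return (combined & new_mask) == 0

# e.g. byte-masks for bytes 0-1, 2-3 and 4-7 of the same 64-bit register
# can all coexist as separate ROB entries:
assert masks_disjoint([0b00000011, 0b00001100], 0b11110000)
# overlapping masks cannot:
assert not masks_disjoint([0b00000011], 0b00000010)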

before going ahead with an experimental implementation, i thought
it might be a really really good idea to see if anyone has any
insights or caveats that would make this byte-masking impractical
or unworkable.

thoughts appreciated.

l.

MitchAlsup

Dec 5, 2018, 12:17:12 PM
Renaming and a ROB fixes this set of issues.
>
> so my first question is: what am i missing about scoreboards?

The entire scoreboard for the CDC 6600 was smaller than one single
reservation station entry in the IBM 360/91!

> they're apparently much simpler than the tomasulo algorithm
> and reorder buffers, and also i understand that a CAM is not
> needed, which will save power.

Scoreboards rename the dataflow into the function units while reservation
stations rename into the reorder buffer (or physical register file.)
Since a function unit knows its own "name", it can broadcast the "tag"
with a single wire which can go to an AND gate per scoreboard entry,
and if both are set, this operand has become ready.

You can do conceptually the same thing with a <really narrow> CAM.
>
> the second question is: are there known ways to reduce power
> consumption of the Reorder Buffer? the four fields of a ROB
> table entry are:
>
> * instruction type (branch / store / ALU op / exception)
> * destination (register number or memory address)
> * value
> * ready / completed status

There is a way to build the register file called "Physical Register
File" where the architectural registers are merged with the in-flight
registers into one single rename pool. This gets rid of moving the
ROB entry to the RF when the instruction has been committed.
>
> the destination is what needs to be a CAM (key-value store),
> as on every cycle it is necessary to check if the source
> operands are depending on *any one* of the entries in the
> ROB table which have outstanding execution to be carried out.
>
> the one i thought of is to have a bitfield representing every
> available register (in the case of SV, that's 128 bits). if
> a given register has a bit clear, it indicates that there
> does not exist a ROB entry with that register as a destination,
> indicating that it is possible to cut out the CAM lookup step.
>
> so... would it be unreasonable to simply maintain a table
> of register-to-ROB# entries? if there are 128 destination
> registers, and the ROB table is 32 entries, would a table
> 128 deep and 5 bits wide do, as a way to not have a CAM?
> is it really that simple?
>
This is computer architecture--it is never that simple. But it is
not far from being simple.
ALUs (including SIMD ALUs) are not expensive (in area or power).
What is expensive is routing of the data to them and routing it back.
>
> before going ahead with an experimental implementation, i thought
> it might be a really really good idea to see if anyone has any
> insights or caveats that would make this byte-masking impractical
> or unworkable.
>
> thoughts appreciated.

The inherent simplicity of the scoreboard saves power and area. The
CDC 6600 scoreboard controls the latches used to route data around
the machine. That is correct--latches, not flip-flops. Since a latch
can be as small as 3 gates and a flip-flop is generally 10 gates;
building with latches is a lot smaller than building with flip-flops.
It is a shame the current tools do not support that kind of design.

Forwarding can be added to the scoreboard--eliminating this delay
(compared to a CDC 6600 scoreboard.)

1-cycle integer can also be added to the scoreboard for back to back
calculations.

Branch prediction/recovery can be added to the scoreboard.

And finally, reservation stations can be made "value-free", reading
the register file after the instruction fires into execution.
>
> l.

EricP

Dec 5, 2018, 5:58:44 PM
Not quite. A scoreboard doesn't need a renamer.

The cause of imprecise exceptions and interrupts is that
the scoreboard allows out-of-order (OoO) writeback and commit.
It is hypothetically possible to build a scoreboard that stalled
writeback until all prior instructions had completed and thus have
precise interrupts, but that would very much limit the parallel execution
(or require a huge number of function units to hold finished results
until writeback was allowed).

> so my first question is: what am i missing about scoreboards?
> they're apparently much simpler than the tomasulo algorithm
> and reorder buffers, and also i understand that a CAM is not
> needed, which will save power.

A scoreboard is a kind of OoO scheduler that operates
by tracking RAW, WAR and WAW dependencies.
RAW are prevented by stalling the read until prior writeback is performed.
WAR are prevented by stalling the writeback until prior read is performed.
WAW are prevented by stalling the issue until prior writeback is performed.

A scoreboard is limited to looking only at the decoded instruction
and the state of the outstanding registers and function units.
If an instruction stalls for WAW then all instructions after it stall.
If an instruction stalls for RAW then it waits in the function unit,
preventing that FU from being used. If a subsequent instruction
needs the same FU then it and all subsequent instructions stall.
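
As a rough illustration of those three rules (a Python toy model with
invented names, nothing like the real 6600 logic):

def can_issue(dest, pending_write):
    # WAW: stall issue until the prior write to dest has been performed
    return dest not in pending_write

def can_read_operands(srcs, pending_write):
    # RAW: stall the operand read until every prior write to the source
    # registers has been performed
    return all(s not in pending_write for s in srcs)

def can_writeback(dest, pending_read):
    # WAR: stall the writeback until every prior read of dest has happened
    return dest not in pending_read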

(Compare the scoreboard scheduler to a full OoO scheduler described below)

The renamer + physical register file (PRF) eliminates the need to
check for WAR and WAW dependencies because it allocates a new
physical register for each write. It is also one of the two
components that allows out-of-order execution with precise interrupts,
the other component being the Instruction Queue (IQ) aka ROB.

You could have a simple scoreboard scheduler with a simple renamer and
PRF but you would still have imprecise exceptions. It would have
slightly better parallelism than the original scoreboard
because it would not stall for WAR and WAW hazards.
But without an IQ/ROB it would have to park RAW pending instructions
in the FU, and that could stall subsequent instruction decoding.

It is when you also have an IQ/ROB with commit logic that
updates the renamer map that you get precise interrupts.
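
A minimal sketch of that renamer + PRF + commit interplay (Python,
illustrative only; free-list recovery and other details are glossed over):

class Renamer:
    def __init__(self, nregs=32, nphys=64):
        self.spec_map = list(range(nregs))     # front-end (speculative) map
        self.commit_map = list(range(nregs))   # precise architectural map
        self.free = list(range(nregs, nphys))  # unallocated physical regs

    def rename(self, rd, srcs):
        phys_srcs = [self.spec_map[s] for s in srcs]
        new_pd = self.free.pop(0)      # fresh phys reg per write:
        old_pd = self.spec_map[rd]     # no WAR or WAW possible
        self.spec_map[rd] = new_pd
        return phys_srcs, new_pd, old_pd

    def commit(self, rd, new_pd, old_pd):
        # called by the IQ/ROB, strictly in program order: this is what
        # makes the interrupts precise
        self.commit_map[rd] = new_pd
        self.free.append(old_pd)       # prior mapping can be recycled

    def recover(self):
        # exception / mispredict: throw away speculative mappings
        # (free-list repair omitted in this toy)
        self.spec_map = list(self.commit_map)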

I draw a distinction between an IQ and a ROB in that a ROB is
an IQ with extra features, typically the scheduling information
and sometimes it holds uncommitted result values.
The IQ is the part that is necessary to track the instruction order
and drive the commit logic giving precise interrupts.

A full OoO scheduler is not limited to looking at just the decoded
instruction. The decoded instructions are transferred to the ROB
and then the OoO scheduler looks at the whole ROB at once.
Only when all an instruction's operands are ready does it dispatch
it to the FU. Since it does not use a FU to wait for RAW dependencies
(as the scoreboard does), it can use the FU in any order.
But that makes the OoO scheduler more complex.
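
The select step of such a scheduler, as a Python toy (invented names):

def select(waiting, ready_tags, free_fus):
    # waiting: list of dicts {'srcs': [...], 'fu_kind': ...} in the ROB;
    # ready_tags: set of operand tags already produced;
    # free_fus: dict of fu_kind -> number of free units this cycle
    dispatched = []
    for entry in list(waiting):
        if all(s in ready_tags for s in entry['srcs']) \
                and free_fus.get(entry['fu_kind'], 0) > 0:
            free_fus[entry['fu_kind']] -= 1
            waiting.remove(entry)      # dispatch in any order, any free FU
            dispatched.append(entry)
    return dispatched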

> the second question is: are there known ways to reduce power
> consumption of the Reorder Buffer? the four fields of a ROB
> table entry are:
>
> * instruction type (branch / store / ALU op / exception)
> * destination (register number or memory address)
> * value
> * ready / completed status
>
> the destination is what needs to be a CAM (key-value store),
> as on every cycle it is necessary to check if the source
> operands are depending on *any one* of the entries in the
> ROB table which have outstanding execution to be carried out.

There are lots of papers on these subjects online
exploring various design options.

This one walks through a whole OoO x86 design in detail:

A Superscalar Out-of-Order x86 Soft Processor for FPGA 2017
http://www.stuffedcow.net/files/henry-thesis-phd.pdf

Here is a list of other papers I found on schedulers, ROB, etc.
they may or may not be helpful.
You'll have to hunt them down on Google yourself.
I also find it interesting to check the citation links Google tracks
as that often leads to related papers on the same subject.
--------------------------------------------------------------

The Design Space of Register Renaming Techniques, Sima 2000

Implementing Precise Interrupts in Pipelined Processors 1988

Complexity and Correctness of a Super-Pipelined Processor 2005

Complexity and Correctness of Computer Architectures 1996

Complexity Effective Superscalar Processors 1997

Quantifying the Complexity of Superscalar Processors 1997

Super-Scalar Processor Design CSL-TR-89-383 1989

The Microarchitecture of Superscalar Processors 1995

A high-speed dynamic instruction scheduling scheme
for superscalar processors 2001

A Low-Complexity Issue Logic 2000

A Reorder Buffer Design for High Performance Processors 2012

An oldest-first selection logic implementation for non-compacting
issue queues 2002

Circuits for wide-window superscalar processors 2000

Delay Evaluation of Issue Queue in Superscalar Processors 2012

Design and Evaluation of a RISC Processor
with a Tomasulo Scheduler 1999

Direct instruction wakeup for out-of-order processors 2004

Efficient Dynamic Scheduling through Tag Elimination 2002

Instruction Issue Logic for High-performance, Interruptable
Pipelined Processors 1987

Instruction Issue Logic in Pipelined Supercomputers 1984

Issue logic for a 600-MHz Out-of-Order execution microprocessor 1998

Low-complexity reorder buffer architecture 2002

Macro-op Scheduling Relaxing Scheduling Loop Constraints 2003

Making the Original Scoreboard Mechanism Deadlock Free 1996

Matrix Scheduler Reloaded 2007

On Pipelining Dynamic Instruction Scheduling Logic 2000

Patent US20120278593 Low complexity out-of-order issue
logic using static circuits 2012

Patent US6557095 Scheduling operations using a dependency matrix 1999

Patent US6988185 Select-free dynamic instruction scheduling 2006

Reducing the complexity of the issue logic 2001

Select Free Instruction Scheduling Logic 2005

Simplifying instruction issue logic in superscalar processors 2002

The Impact of Hardware Scheduling Mechanisms
on the Performance and Cost of Processor Designs 1999

Understanding Scheduling Replay Schemes 2004



Ivan Godard

Dec 5, 2018, 7:58:03 PM
On 12/5/2018 2:58 PM, EricP wrote:

<an excellent explanation of scoreboard vs. OOO snipped>

Another way to get precise exceptions without OOO overhead is to use
result replay rather than issue replay. Issue replay is the usual way:
track which operations have been issued but not yet in-order retired by
keeping an instruction pointer to the last fully-retired operation. On
exception, clear the machine state and reset the IP to the tracking
address, thereby discarding all in-flight operations, even those that
have completed but not yet retired, and reissue them.

In result replay, all operations in-flight at the exception are allowed
to complete (but not retire) and their results are saved. The hardware
tracks which operations have not yet issued; issue is not necessarily
in-order. When the exception handler returns only those operations not
already issued before the exception are issued; there is no re-issue of
in-flights. However, results are replayed, so that the results of both
issue-before and issue-after operations are interleaved so that the
ordering is the same as if the exception had not happened at all.

Result replay saves the pipeline drain and restart inherent in issue
replay, and obviates the IQ, rename registers, and ROB required by OOO,
but requires the ability to buffer and save the results of in-flight
ops. It also requires special handling for ops (like loads) which cannot
complete until the exception has been handled, and to deal with when
several in-flight ops independently get exceptions (also a problem for OOO).

The Mill is the only architecture I know of using result replay
throughout. Quite a few ISAs use it for some special ops. For example,
restartable string ops can be seen as an instance of result replay.

lk...@lkcl.net

Dec 6, 2018, 3:52:05 AM
On Wednesday, December 5, 2018 at 10:58:44 PM UTC, EricP wrote:
> lk...@lkcl.net wrote:
> > scoreboarding only works properly when register renaming
> > is deployed; it doesn't seem to lend itself to multi-issue,
> > it doesn't solve exception roll-backs, and in fact *causes*
> > exception problems, needing the concept of "imprecise"
> > exceptions (p16).
>
> Not quite. A scoreboard doesn't need a renamer.
>
> The cause of imprecise exceptions and interrupts is that
> the scoreboard allows out-of-order (OoO) writeback and commit.

understood.

> It is hypothetically possible to build a scoreboard that stalled
> writeback until all prior instructions had completed and thus have
> precise interrupts, but that would very much limit the parallel execution
> (or require a huge number of function units to hold finished results
> until writeback was allowed).

[or a reorder buffer to hold them... which introduces extra latency.
is that worth it, for the simplicity that a ROB brings? quite likely!]

in the LibreRISCV SoC/GPU, the general plan is to have a lot of function
units anyway, as it is a multi-issue design. can we make simpler FUs
(taking less silicon) that take longer to complete, and have more of them?
maybe: that means a trade-off if doing operand-forwarding, as the routing
between FUs for op-forwarding would need to be increased.

why is this so hard to evaluate! :)

> The renamer + physical register file (PRF) eliminates the need to
> check for WAR and WAW dependencies because it allocates a new
> physical register for each write. It is also one of the two
> components that allows out-of-order execution with precise interrupts,
> the other component being the Instruction Queue (IQ) aka ROB.

i think this is a very important requirement. i found a video
online which describes a scheme to do register-renames:
https://www.youtube.com/watch?v=p4SdrUhZrBM

i took a screenshot of the "Register Aliasing Table", here:
https://libre-riscv.org/3d_gpu/rat_table.png

basically it contains two tables. the first is the RAT table:

Valid | Tag | Value
----- | ------- | ------------
Y/N | FUname | Reg contents

and the second looks remarkably similar to a Reservation Station,
and is associated with the Functional Unit:

<---- Src1 ---> | <---- Src2 ------> |
Reg# | Tag | Value | Reg# | Tag | Value |

so... um... if the RAT basically introduces the Reservation Station
concept in a different form, and a ROB is needed to do precise exceptions
and rollback, then... um... what's the difference again? :)

> I draw a distinction between an IQ and a ROB in that a ROB is
> an IQ with extra features, typically the scheduling information
> and sometimes it holds uncommitted result values.

understood.

> There are lots of papers on these subjects online
> exploring various design options.
>
> This one walks through a whole OoO x86 design in detail:
>
> A Superscalar Out-of-Order x86 Soft Processor for FPGA 2017
> http://www.stuffedcow.net/files/henry-thesis-phd.pdf

wowww, figure 5.1 - 7R3W regfile. zowee!
very valuable resource, thank you eric.

> Here is a list of other papers I found on schedulers, ROB, etc.
> they may or may not be helpful.
> You'll have to hunt them down on Google yourself.
> I also find it interesting to check the citation links Google tracks
> as that often leads to related papers on the same subject.

appreciated. i'm currently using the technique of doing google
image searches for phrases "register rename RAT table" and so on,
as i find that produces more in the way of visual diagrams, which
seem to sink in better than words, for me.

the list you gave is an extremely valuable alternative starting
point to explore, so thank you.

l.

lk...@lkcl.net

Dec 6, 2018, 4:36:59 AM
this seems to be the consensus. and, if that youtube video
is correct, adding those basically gives pretty much identical
functionality to Tomasulo+ROB.

> >
> > so my first question is: what am i missing about scoreboards?
>
> The entire scoreboard for the CDC 6600 was smaller than one single
> reservation station entry in the IBM 360/91!

woooow. well, thinking about it, that's no surprise: a reservation
station is effectively holding all the operand values so that
register-renaming is provided / mitigated.

so, if that's missed out, then yes.

> > they're apparently much simpler than the tomasulo algorithm
> > and reorder buffers, and also i understand that a CAM is not
> > needed, which will save power.
>
> Scoreboards rename the dataflow into the function units while reservation
> stations rename into the reorder buffer (or physical register file.)
> Since a function unit knows its own "name", it can broadcast the "tag"
> with a single wire which can go to an AND gate per scoreboard entry,
> and if both are set, this operand has become ready.

ahh of course: the difference being, the Tomasulo Algorithm uses
the ROB# and a broadcast bus. so, TSA+ROB goes like this:

* Dest# goes into ROB
* ROB# goes into RSs (plural) and LOADQ
* Result gets broadcast and goes straight into *all* RS's and Qs

where SBd goes:

* FUs wait for SRCregs
* FUs put Dest *BACK* into RegFile (or RAT table)
* RegFile values updated in SBd entries (plural)
* repeat

the step "FUs put Dest back into RegFile" is where "forwarding"
is often brought in, to reduce the latency associated with that,
which implies an absolutely massive array of bypass routing
between absolutely every single FU's outputs and absolutely
every single FU's input stages.


> You can do conceptually the same thing with a <really narrow> CAM.

1 bit, or 1 entry?

iinteresting... i wonder if it would be possible to
augment TSA+ROB to include references to the Functional Units?

so, the ROB table entry also contains scoreboard back-references
(Qj, Qk) to the Functional Units that will produce the result
they depend on? no - it's the Reservation Station, isn't it.
and that would make ROB# *equal* to the FU# if and only if
there are separate and distinct Reservation Stations associated
with each and every FU. i.e. each FU had its own RS.

hm i'm going to have to draw that out, see if it works.

> >
> > the second question is: are there known ways to reduce power
> > consumption of the Reorder Buffer? the four fields of a ROB
> > table entry are:
> >
> > * instruction type (branch / store / ALU op / exception)
> > * destination (register number or memory address)
> > * value
> > * ready / completed status
>
> There is a way to build the register file called "Physical Register
> File" where the architectural registers are merged with the in-flight
> registers into one single rename pool. This gets rid of moving the
> ROB entry to the RF when the instruction has been committed.

iiinterestiing... and it would require the number of ROB entries to
be equal/greater than the number of architectural registers.

> > > so... would it be unreasonable to simply maintain a table
> > of register-to-ROB# entries? if there are 128 destination
> > registers, and the ROB table is 32 entries, would a table
> > 128 deep and 5 bits wide do, as a way to not have a CAM?
> > is it really that simple?
> >
> This is computer architecture--it is never that simple. But it is
> not far from being simple.

:)

i realised that the proposed table (register-to-ROB#) may not
be a one-to-one and onto (i.e. a permutation table). or, more
to the point, i have not yet determined *if* it is a one-to-one
and onto relationship.

in other words, what concerns me is where two instructions are
issued which write to the same destination register. now, whoops:
actually you need register-to-ROB+ROB(+ROB)?

have to think that one through.

> > in this way it would be possible to use SIMD-style ALUs to
> > carry out multiple lower-bitwidth operations.
>
> ALUs (including SIMD ALUs) are not expensive (in area or power).
> What is expensive is routing of the data to them and routing it back.

yeah, am finding that out...

> > before going ahead with an experimental implementation, i thought
> > it might be a really really good idea to see if anyone has any
> > insights or caveats that would make this byte-masking impractical
> > or unworkable.
> >
> > thoughts appreciated.
>
> The inherent simplicity of the scoreboard saves power and area. The
> CDC 6600 scoreboard controls the latches used to route data around
> the machine. That is correct--latches, not flip-flops. Since a latch
> can be as small as 3 gates and a flip-flop is generally 10 gates;
> building with latches is a lot smaller than building with flip-flops.

that's a *really* compelling argument to continue investigating

> It is a shame the current tools do not support that kind of design.

interestingly it turns out that yosys had support added for
detection and generation of d-latches, in 2015:
https://github.com/YosysHQ/yosys/commit/554a8df5e2e7c750b76021821bdf2e07970b9dbf

yosys is a libre tool that turns verilog into BLIF. one of the
command-line options is *specifically* to disable d-latch creation.

> Forwarding can be added to the scoreboard--eliminating this delay
> (compared to a CDC 6600 scoreboard.)
>
> 1-cycle integer can also be added to the scoreboard for back to back
> calculations.
>
> Branch prediction/recovery can be added to the scoreboard.
>
> And finally, reservation stations can be made "value-free", reading
> the register file after the instruction fires into execution.

ah of course. hm... how would it be ensured that the value being
read is the correct one? one of the key benefits of Tomasulo is
that reg-renaming is avoided (or, "provided") precisely because
the relevant Reservation Station(s) contain *copies* of the (old)
value.

l.

adaptive...@gmail.com

Dec 6, 2018, 5:14:19 AM
Let us see a course example from 1998:
https://people.eecs.berkeley.edu/~pattrsn/252S98/Lec04-tomosulo.pdf

Best,
S.Takano

lk...@lkcl.net

Dec 6, 2018, 7:12:02 AM
ah! good find. very useful comparison of both algorithms.

slide 18: ah. the 6600 scoreboard algorithm cannot do
multiple LOAD operations. that's a show-stopper for the
SimpleV SoC: it is absolutely critical that multiple LOAD
(and STORE) operations be issued and carried out.

does anyone know if, since the 6600, this has been enhanced /
augmented, to make multiple LOAD operations possible?

l.

MitchAlsup

Dec 6, 2018, 12:02:39 PM
On Wednesday, December 5, 2018 at 6:58:03 PM UTC-6, Ivan Godard wrote:
> On 12/5/2018 2:58 PM, EricP wrote:
>
> <an excellent explanation of scoreboard vs. OOO snipped>
>
> Another way to get precise exceptions without OOO overhead is to use
> result replay rather than issue replay. Issue replay is the usual way:
> track which operations have been issued but not yet in-order retired by
> keeping an instruction pointer to the last fully-retired operation. On
> exception, clear the machine state and reset the IP to the tracking
> address, thereby discarding all in-flight operations, even those that
> have completed but not yet retired, and reissue them.

There is another way to get precise ordering of the writes in a scoreboard.
First, one has to implement forwarding in the scoreboard.
Second, the function units need an output queue <of say 4 registers>
Now, one can launch an instruction and pick up its operand either
from the RF or from the function unit output while the result sits
in the function unit waiting for its GO_Write signal.

Thus the launching of instructions is not delayed due to hazards
but the results are delivered to the RF in program order.

This looks surprisingly like a 'belt' at the end of the function unit.
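
A toy model of the in-order delivery part (Python, purely illustrative;
not the actual GO_Write logic):

def drain_in_order(fu_queues, next_seq, regfile):
    # fu_queues: dict fu_name -> list of (seq, reg, value), oldest first,
    # where seq is the program-order number of the producing instruction.
    # Results only leave their FU output queue when it is their turn.
    progressed = True
    while progressed:
        progressed = False
        for q in fu_queues.values():
            if q and q[0][0] == next_seq:      # its GO_Write turn
                _, reg, value = q.pop(0)
                regfile[reg] = value
                next_seq += 1
                progressed = True
    return next_seq

def forward(fu_queues, reg):
    # a dependent instruction may pick up the youngest queued value for
    # 'reg' before it has been written back (returns None: read the RF)
    best = None
    for q in fu_queues.values():
        for seq, r, v in q:
            if r == reg and (best is None or seq > best[0]):
                best = (seq, v)
    return None if best is None else best[1]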

MitchAlsup

Dec 6, 2018, 12:08:49 PM
On Thursday, December 6, 2018 at 2:52:05 AM UTC-6, lk...@lkcl.net wrote:
> On Wednesday, December 5, 2018 at 10:58:44 PM UTC, EricP wrote:
> > lk...@lkcl.net wrote:
> > > scoreboarding only works properly when register renaming
> > > is deployed; it doesn't seem to lend itself to multi-issue,
> > > it doesn't solve exception roll-backs, and in fact *causes*
> > > exception problems, needing the concept of "imprecise"
> > > exceptions (p16).
> >
> > Not quite. A scoreboard doesn't need a renamer.
> >
> > The cause of imprecise exceptions and interrupts is that
> > the scoreboard allows out-of-order (OoO) writeback and commit.
>
> understood.
>
> > It is hypothetically possible to build a scoreboard that stalled
> > writeback until all prior instructions had completed and thus have
> > precise interrupts, but that would very much limit the parallel execution
> > (or require a huge number of function units to hold finished results
> > until writeback was allowed).
>
> [or a reorder buffer to hold them... which introduces extra latency.
> is that worth it, for the simplicity that a ROB brings? quite likely!]
>
> in the LibreRISCV SoC/GPU, the general plan is to have a lot of function
> units anyway, as it is a multi-issue design. can we make simpler FUs
> (taking less silicon) that take longer to complete, and have more of them?
> maybe: that means a trade-off if doing operand-forwarding, as the routing
> between FUs for op-forwarding would need to be increased.

If you can get the tools to allow you to deposit FU results into a latch
array, it will be profitable to have the FU hold several <prior> results.
If you have to do this with real flip-flops, the area will not be conducive.
>
> why is this so hard to evaluate! :)

Because the old guys that know how to do this have been driven away from
this NG due to the signal to noise ratio (which lately has been pretty
good.)
>
> > The renamer + physical register file (PRF) eliminates the need to
> > check for WAR and WAW dependencies because it allocates a new

Err, no. Renamer+PRF does not get rid of the need to check--it gets rid
of the hazard itself! {A not so subtle difference}

> > physical register for each write. It is also one of the two
> > components that allows out-of-order execution with precise interrupts,
> > the other component being the Instruction Queue (IQ) aka ROB.
>
> i think this is a very important requirement. i found a video
> online which describes a scheme to do register-renames:
> https://www.youtube.com/watch?v=p4SdrUhZrBM
>
> i took a screenshot of the "Register Aliasing Table", here:
> https://libre-riscv.org/3d_gpu/rat_table.png
>
> basically it contains two tables. the first is the RAT table:
>
> Valid | Tag | Value
> ----- | ------- | ------------
> Y/N | FUname | Reg contents
>
> and the second looks remarkably similar to a Reservation Station,
> and is associated with the Functional Unit:
>
> <---- Src1 ---> | <---- Src2 ------> |
> Reg# | Tag | Value | Reg# | Tag | Value |
>
> so... um... if the RAT basically introduces the Reservation Station
> concept in a different form, and a ROB is needed to do precise exceptions
> and rollback, then... um... what's the difference again? :)
>
> > I draw a distinction between an IQ and a ROB in that a ROB is
> > an IQ with extra features, typically the scheduling information
> > and sometimes it holds uncommitted result values.

I have a different, and arguably better PRF+renamer, which takes zero
cycles to backup.

MitchAlsup

Dec 6, 2018, 12:19:04 PM
At much smaller gate costs.
The thing about renaming into FUs is that it is straightforward to add
an order (count) to the FU, whereas it is not easy to do that with a renamer.
So let's say the Multiplier was FU=5 and had 4 results. Each time an
instruction is put in the queue towards FU=5 the count is incremented
and dependent instructions get the fact that X7 is going to come from
FU=5;cnt=2.
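
In toy form (Python, invented names, just the bookkeeping):

fu_counts = {}   # fu_name -> results queued so far
producer = {}    # architectural reg -> (fu_name, count) tag

def enqueue(fu_name, dest_reg):
    cnt = fu_counts.get(fu_name, 0)
    fu_counts[fu_name] = cnt + 1
    producer[dest_reg] = (fu_name, cnt)   # e.g. X7 <- (FU=5, cnt=2)
    return (fu_name, cnt)

def source_tag(src_reg):
    # a dependent instruction waits on this tag instead of a ROB number
    return producer.get(src_reg)          # None => value is in the RF
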
>
>
> > You can do conceptually the same thing with a <really narrow> CAM.
>
> 1 bit, or 1 entry?
>
> iinteresting... i wonder if it would be possible to
> augment TSA+ROB to include references to the Functional Units?

The conversion only works for 1,2, and 3-bits. At 4-bits it becomes
less expensive to use the CAM.
>
> so, the ROB table entry also contains scoreboard back-references
> (Qj, Qk) to the Functional Units that will produce the result
> they depend on? no - it's the Reservation Station, isn't it.
> and that would make ROB# *equal* to the FU# if and only if
> there are separate and distinct Reservation Stations associated
> with each and every FU. i.e. each FU had its own RS.
>
> hm i'm going to have to draw that out, see if it works.

I have a 40-page chapter in a book I wrote explaining the CDC 6600
scoreboard and how to bring it into the modern world.
>
> > >
> > > the second question is: are there known ways to reduce power
> > > consumption of the Reorder Buffer? the four fields of a ROB
> > > table entry are:
> > >
> > > * instruction type (branch / store / ALU op / exception)
> > > * destination (register number or memory address)
> > > * value
> > > * ready / completed status
> >
> > There is a way to build the register file called "Physical Register
> > File" where the architectural registers are merged with the in-flight
> > registers into one single rename pool. This gets rid of moving the
> > ROB entry to the RF when the instruction has been committed.
>
> iiinterestiing... and it would require the number of ROB entries to
> be equal/greater than the number of architectural registers.

Entries = architectural+dynamic.

Where dynamic is essentially the count in the ROB.
>
> > > so... would it be unreasonable to simply maintain a table
> > > of register-to-ROB# entries? if there are 128 destination
> > > registers, and the ROB table is 32 entries, would a table
> > > 128 deep and 5 bits wide do, as a way to not have a CAM?
> > > is it really that simple?
> > >
> > This is computer architecture--it is never that simple. But it is
> > not far from being simple.
>
> :)
>
> i realised that the proposed table (register-to-ROB#) may not
> be a one-to-one and onto (i.e. a permutation table). or, more
> to the point, i have not yet determined *if* it is a one-to-one
> and onto relationship.
>
> in other words, what concerns me is where two instructions are
> issued which write to the same destination register. now, whoops:
> actually you need register-to-ROB+ROB(+ROB)?
>
> have to think that one through.

The only issue is that they have to be delivered in order to the RF.

Which reminds me of the RF in the 88110. Each RF entry had 2 flip-flops
and control logic allowed 1-level of OoO-ness. For a 2-wide machine
this worked rather well.
>
> > > in this way it would be possible to use SIMD-style ALUs to
> > > carry out multiple lower-bitwidth operations.
> >
> > ALUs (including SIMD ALUs) are not expensive (in area or power).
> > What is expensive is routing of the data to them and routing it back.
>
> yeah, am finding that out...
>
> > > before going ahead with an experimental implementation, i thought
> > > it might be a really really good idea to see if anyone has any
> > > insights or caveats that would make this byte-masking impractical
> > > or unworkable.
> > >
> > > thoughts appreciated.
> >
> > The inherent simplicity of the scoreboard saves power and area. The
> > CDC 6600 scoreboard controls the latches used to route data around
> > the machine. That is correct--latches, not flip-flops. Since a latch
> > can be as small as 3 gates and a flip-flop is generally 10 gates;
> > building with latches is a lot smaller than building with flip-flops.
>
> that's a *really* compelling argument to continue investigating
>
> > It is a shame the current tools do not support that kind of design.
>
> interestingly it turns out that yosys had support added for
> detection and generation of d-latches, in 2015:
> https://github.com/YosysHQ/yosys/commit/554a8df5e2e7c750b76021821bdf2e07970b9dbf
>
Good to know.

> yosys is a libre tool that turns verilog into BLIF. one of the
> command-line options is *specifically* to disable d-latch creation.
>
> > Forwarding can be added to the scoreboard--eliminating this delay
> > (compared to a CDC 6600 scoreboard.)
> >
> > 1-cycle integer can also be added to the scoreboard for back to back
> > calculations.
> >
> > Branch prediction/recovery can be added to the scoreboard.
> >
> > And finally, reservation stations can be made "value-free", reading
> > the register file after the instruction fires into execution.
>
> ah of course. hm... how would it be ensured that the value being
> read is the correct one? one of the key benefits of Tomasulo is
> that reg-renaming is avoided (or, "provided") precisely because
> the relevant Reservation Station(s) contain *copies* of the (old)
> value.

Perhaps you should read the chapter I wrote.
>
> l.

MitchAlsup

Dec 6, 2018, 12:21:19 PM
This is a paper that, in my opinion, is overly critical of the Scoreboard
and does nothing to ameliorate its deficiencies.

Whereas, they are perfectly happy to augment the Tomasulo mechanism with
multiple <common> data busses.

That is one of the reasons I wrote that chapter, which debunks a large
portion of that paper.

MitchAlsup

Dec 6, 2018, 12:25:00 PM
On Thursday, December 6, 2018 at 6:12:02 AM UTC-6, lk...@lkcl.net wrote:
> On Thursday, December 6, 2018 at 10:14:19 AM UTC, adaptive...@gmail.com wrote:
> > Let us see course example @1998,
> > https://people.eecs.berkeley.edu/~pattrsn/252S98/Lec04-tomosulo.pdf
> >
> > Best,
> > S.Takano
>
> ah! good find. very useful comparison of both algorithms.
>
> slide 18: ah. the 6600 scoreboard algorithm cannot do
> multiple LOAD operations. that's a show-stopper for the
> SimpleV SoC: it is absolutely critical that multiple LOAD
> (and STORE) operations be issued and carried out.

CDC 6600 scoreboard could have 2 memory references in the pre AGEN
stage. After AGEN, there were 5 LD address buffers and 2 ST address
buffers. It is all there in the Thornton book in great detail.

The inability to AGEN was the only limitation, not the passage
of the mem refs through the memory system.
>
> does anyone know if, since the 6600, this has been enhanced /
> augmented, to make multiple LOAD operations possible?

I probably know as much or more about scoreboards than anyone alive.
Including the actual gate count of the CDC 6600 Scoreboard.
>
> l.

lk...@lkcl.net

Dec 6, 2018, 1:48:27 PM
On Thursday, December 6, 2018 at 5:19:04 PM UTC, MitchAlsup wrote:

> > ah of course. hm... how would it be ensured that the value being
> > read is the correct one? one of the key benefits of Tomasulo is
> > that reg-renaming is avoided (or, "provided") precisely because
> > the relevant Reservation Station(s) contain *copies* of the (old)
> > value.
>
> Perhaps you should read the chapter I wrote.

i have a responsibility to make sure i evaluate things properly,
so that's probably a good idea (understatement... :) is it
available somewhere, or would it be ok to email it to me
(luke.l...@gmail.com) ?

l.

Ivan Godard

Dec 6, 2018, 6:34:30 PM
On 12/6/2018 9:02 AM, MitchAlsup wrote:
> On Wednesday, December 5, 2018 at 6:58:03 PM UTC-6, Ivan Godard wrote:
>> On 12/5/2018 2:58 PM, EricP wrote:
>>
>> <an excellent explanation of scoreboard vs. OOO snipped>
>>
>> Another way to get precise exceptions without OOO overhead is to use
>> result replay rather than issue replay. Issue replay is the usual way:
>> track which operations have been issued but not yet in-order retired by
>> keeping an instruction pointer to the last fully-retired operation. On
>> exception, clear the machine state and reset the IP to the tracking
>> address, thereby discarding all in-flight operations, even those that
>> have completed but not yet retired, and reissue them.
>
> There is another way to get precise ordering of the writes in a scoreboard.
> First, one has to implement forwarding in the scoreboard.
> Second, the function units need an output queue <of say 4 registers>
> Now, one can launch an instruction and pick up its operand either
> from the RF or from the function unit output while the result sits
> in the function unit waiting for its GO_Write signal.
>
> Thus the launching of instructions is not delayed due to hazards
> but the results are delivered to the RF in program order.
>
> This looks surprisingly like a 'belt' at the end of the function unit.

The drawback to this is that there are a lot of FUs as you widen up the
machine, and due to limits on decode, issue, or retire rates it's not
common for them to have overlapping lifetimes. We take advantage of this
with the "slot" notion common in wide-issue machines, where a given
encoding position can reach only a subset of the FUs and only one of
those. Consequently the queue is per slot, not per FU, sharply reducing
the number of queues.

Per-slot result queues could be used by a dynamic scheduled wide machine
like the Itanium. They are even more advantageous in a statically
scheduled machine, because the slot's queue can be composed from the
per-latency latches of the FUs attached to that slot. Because only one
op can issue per slot per cycle, the queue can be a simple daisy chain
of latency latches and needs no interlocks or allocation/free logic.

Of course, a result in the queue in one slot is likely to be an argument
to an FU in a different slot. This is the same problem as getting a
datum between the int and FP regsets in a split architecture, or between
sides of a TI-style split VLIW. We once looked into having per-slot
belts with explicit inter-slot transfer ops (pretty much as you describe
for your multi-scoreboard, albeit static). However, having a
machine-wide (rather than slot-wide) operand namespace gave better
encoding once we had bit-aligned variable width encoding. We call it the
Belt (as opposed to the 'belt') :-)

MitchAlsup

Dec 6, 2018, 7:32:29 PM
Thank you for reminding me of this subtlety.

lk...@lkcl.net

Dec 7, 2018, 6:40:32 AM
On Thursday, December 6, 2018 at 12:58:03 AM UTC, Ivan Godard wrote:

> Result replay saves the pipeline drain and restart inherent in issue
> replay, and obviates the IQ, rename registers, and ROB required by OOO,

it's fascinating, isn't it? it's the next "logical step", to save on
resources: you already calculated the results, why throw them out??
the devil is in the details: *how* are the results stored, and guaranteed
to be unique / replayed?

l.

lk...@lkcl.net

Dec 7, 2018, 9:34:36 AM
so, i'm chewing through the chapters that mitch very kindly
sent me (off-ng). there's tons of gate diagrams! yay!
i understand gates.

one thing that caught my eye was the design of the 6600
register file. it's capable of doing a "pass-through",
i.e. the register is written... oh and the write may be
"passed through" directly to the read port (if there is
one).

this would allow "operand forwarding" automatically,
as part of the design, because any result register would
be "passed through" (forwarded) on to a src operand
*in the same clock cycle*.

this got me thinking that, functionality-wise, the 6600
register file may be directly equivalent to a "Common
Data Bus".

or, the other way round: the Common Data Bus of the Tomasulo
Algorithm *is* the operand-forwarding mechanism associated
with and augmenting/improving Scoreboard systems.
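
a toy model of that pass-through behaviour (python; invented names,
behavioural only, nothing to do with how the 6600 actually gates it):

class PassThroughRegFile:
    def __init__(self, nregs=128):
        self.regs = [0] * nregs

    def cycle(self, write=None, reads=()):
        # write: (regnum, value) or None; reads: regnums read this cycle.
        # a read of the register being written gets the *incoming* value,
        # i.e. the write is forwarded to the read port in the same cycle.
        results = []
        for r in reads:
            if write is not None and write[0] == r:
                results.append(write[1])   # pass-through / forwarding
            else:
                results.append(self.regs[r])
        if write is not None:
            self.regs[write[0]] = write[1] # the write still lands in the RF
        return results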

also, it seems to me that adding Reorder Buffers provides
register renaming, rollback and precise exceptions, even for
Scoreboard systems, and that register renaming which needs
to be added to Scoreboards results in a CAM plus src operand
buffers that look remarkably similar to Reservation Stations
and ROB dest tables.

at this point i am genuinely puzzled and confused as to
whether there is any difference - at all - between
Scoreboards+ROBs and Tomasulo+ROBs :)

at present, the unique feature of SV, which is the exclusion
of SIMD and its replacement with polymorphic bit-widths
and variable-length vectors that map down onto the "real"
register file(s) as opposed to having separate Vector Pipelines
and RFs, is throwing a massive spanner in the works.

where most architectures would have SIMD instructions and
be done with it, after seeing this
https://www.sigarch.org/simd-instructions-considered-harmful/
i never want to be involved with SIMD, ever again :)

as mentioned before, what i'd therefore like to do is to be able
to "merge" SIMD operations at the Reservation Station phase, by
spotting patterns in the bitmasks:

ROB#-dest | dest-bitmask | ROB#-src1/val | ROB#-src2/val
--------- | ------------ | ------------- | -----------
ROB5 | 0b11000000 | 0x59000000 | 0x12000000
ROB4 | 0b00110000 | 0x00130000 | 0x00540000

this to an 8-bit SIMD "Add" Functional Unit.

the RS would have built-in "mask" detection logic that noted that the
dest byte masks are non-overlapping, and would pass the *masked* src1
and src2 of *BOTH* rows of the RS down into the ALU.

on completion of the operation, the destination would be noted as
being different, and it would be necessary to issue *two* writes:
one to ROB4 and the other to ROB5. if however the ROB#s were the
same, only one write would be needed, and even the bitmasks merged.
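
roughly, the merge step i have in mind (python sketch; field names
invented, and the SIMD add is only modelled, not real partitioned-carry
hardware):

def byte_mask_to_bits(mask, nbytes=8):
    # 0b11000000 (byte-enables) -> 0xFFFF000000000000 (bit mask), 64-bit regs
    bits = 0
    for byte in range(nbytes):
        if mask & (1 << byte):
            bits |= 0xFF << (8 * byte)
    return bits

def simd_add_8bit(a, b):
    # stand-in for an 8-bit-lane SIMD adder: no carry between lanes
    out = 0
    for byte in range(8):
        lane = ((a >> (8 * byte)) + (b >> (8 * byte))) & 0xFF
        out |= lane << (8 * byte)
    return out

def try_merge(row_a, row_b):
    # rows are dicts: {'rob': n, 'mask': bytemask, 'src1': v, 'src2': v}
    if row_a['mask'] & row_b['mask']:
        return None                        # overlapping masks: cannot merge
    bits_a = byte_mask_to_bits(row_a['mask'])
    bits_b = byte_mask_to_bits(row_b['mask'])
    src1 = (row_a['src1'] & bits_a) | (row_b['src1'] & bits_b)
    src2 = (row_a['src2'] & bits_a) | (row_b['src2'] & bits_b)
    result = simd_add_8bit(src1, src2)     # one pass through the SIMD ALU
    # two writes back (or one, if the ROB#s happened to be the same)
    return [(row_a['rob'], result & bits_a),
            (row_b['rob'], result & bits_b)]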

detection of this merging at instruction-issue time may not be
possible, as it could hypothetically be *completely different*
instructions that supplied the different bytemasks.

in fact, i'm counting on precisely that, as a way to not need
8-bit-wide register routing and multiplexing etc. etc.: it would
be insane to have byte-wide crossbars all over the place.

so, as i can't think of any other architectural way to achieve the
same ability to merge non-overlapping SIMD ops, doing some voodoo
on the RS rows *requires* Reservation Stations, that in turn requires
ROB#s, and it all sort-of hangs together.

oh, the other thing: a 3D GPU + VPU needs an insane 256 (total)
64-bit registers. 128 for INT, 128 for FP. this makes it possible
to fit pixel "tiles" (as they're called) into the regfile, without
the performance / power hit of transferring data through the L1/L2
cache barrier (using LD/STs).

if we were to deploy the "usual" algorithm(s) associated with
Scoreboard register-renaming, it would require multiplying that
256-entry register file by say 2, 3 or even 4.

by contrast, the Reservation Stations (whether deployed for SB+ROB
or for TS+ROB) simply house the in-flight data, job done. oh,
and we would get the opportunity to pull that SIMD-style data-merging
trick.

the other thing: we considered it insane to have 2 128-entry
register files with 10R4W (or greater) porting, to cope with
the vectorised loads. instead, we came up with the idea to
"stripe" the register file (4 blocks), use much simpler 2R1W
standard cells, and to add muxers to allow all functional units
[eventual, non-simultaneous] access to all registers.

thinking this through a bit further, we could also "stripe" the
Functional Units, as well. i.e. to lay down 4 sets of ADDs,
4 sets of MULs, effectively breaking the register files into
4 blocks (or lanes).

extending this thought-experiment to its logical extreme, we
considered doing the same thing *to the Reorder Buffer*. i.e.
in effect, we place a hard requirement that the bottom 2 bits of the
ROB# *must* match and be equal to the Destination Register #'s bottom
2 bits.

even if we had 32 Reorder Buffer rows, this would reduce the ROB CAM
from its present (insane) 8-bit width (1 bit for INT/FP, 7 bits for
0-127 on the dest reg), down to a slightly-less mad 6 bits.
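
to pin the numbers down (python sketch; invented names):

NLANES = 4

def lane_of(reg):
    return reg & (NLANES - 1)   # bottom 2 bits of the dest reg pick the stripe

def stripe_cam_key(is_fp, reg):
    # within a stripe the bottom 2 bits of the reg # are implied by which
    # stripe the entry lives in, so the per-stripe "CAM" key is only
    # 1 (INT/FP) + 5 bits = 6 bits instead of 8
    return (int(is_fp) << 5) | (reg >> 2)

# e.g. an instruction writing x6 may only be allocated into stripe 2;
# if stripe 2's next entry is occupied, the other stripes get "holes"
# (marked !done), which is the cost of the scheme.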

the down-side: sequentially-issued instructions that by coincidence
happened to match in their destination register numbers modulo 4
(bottom 2 bits) would result in 3 "holes" in the Reorder Buffer.
given that there's flags "done/!done" already, that's not such a
big deal.

what we *hope* is, that SimpleV's design feature of issuing
sequentially-numbered sets of instructions would end up filling
the striped ROBs.

and, what's really *really* nice: 4-wide striping would effectively
mean 4-wide instruction issue. also, there isn't any connectivity,
no data paths, no dependencies between each of the ROB "stripes".

now, my only big concern here is: how do you deal with lane-crossing,
when a result generated for a ROB# (dest) in one of these stripes
(dest modulo 4 == 2) has to get over to a Reservation Station that
is in lane 1 (src reg modulo 4 == 1), for example?

and here is where i think the insight that "Common Data Bus
Equals Same As Pass-through / Register File Forwarding" comes in
to play.

recall that we plan to put in 4-way Muxes on the banks of the Register
File. this being an acceptable bottleneck under the circumstances.
if the pass-through mechanism *is* the CDB, then the write from one
"Lane" will go through the Mux, into the other Lane's Register File,
and from there it would need to be broadcast onto that Lane's CDB,
in order to update the Reservation Stations there.

this is however where i run into difficulties, as i've not yet
thought through the synchronisation or how to handle the case where
there happens *already* to be some reads taking place on the 2nd
Lane.

do we queue the new update so that it's handled cleanly on the next
cycle? that sounds like a recipe for disaster, resulting in
ROB# entries getting out of sync.

have a signal which stalls the lane-crossing write, until the
Read-Reg Bus of the target lane is free? that sounds like a
good way to create pipeline "Concertina" effects
(https://en.wikipedia.org/wiki/Accordion_effect)

conundrum!

thoughts appreciated.

MitchAlsup

Dec 7, 2018, 11:31:44 AM
On Friday, December 7, 2018 at 8:34:36 AM UTC-6, lk...@lkcl.net wrote:
> so, i'm chewing through the chapters that mitch very kindly
> sent me (off-ng). there's tons of gate diagrams! yay!
> i understand gates.
>
> one thing that caught my eye was the design of the 6600
> register file. it's capable of doing a "pass-through",
> i.e. the register is written... oh and the write may be
> "passed through" directly to the read port (if there is
> one).

ECL register file blocks also have this property.
>
> this would allow "operand forwarding" automatically,
> as part of the design, because any result register would
> be "passed through" (forwarded) on to a src operand
> *in the same clock cycle*.
>
> this got me thinking that, functionality-wise, the 6600
> register file may be directly equivalent to a "Common
> Data Bus".
>
> or, the other way round: the Common Data Bus of the Tomasulo
> Algorithm *is* the operand-forwarding mechanism associated
> with and augmenting/improving Scoreboard systems.
>
> also, it seems to me that adding Reorder Buffers provides
> register renaming, rollback and precise exceptions, even for
> Scoreboard systems, and that register renaming which needs
> to be added to Scoreboards results in a CAM plus src operand
> buffers that look remarkably similar to Reservation Stations
> and ROB dest tables.
>
> at this point i am genuinely puzzled and confused as to
> whether there is any difference - at all - between
> Scoreboards+ROBs and Tomasulo+ROBs :)

The only practical difference left is that the Scoreboard is a
centralized control block, while the ReservationStation is a
distributed logic block.
>
> at present, the unique feature of SV, which is the exclusion
> of SIMD and its replacement with polymorphic bit-widths
> and variable-length vectors that map down onto the "real"
> register file(s) as opposed to having separate Vector Pipelines
> and RFs, is throwing a massive spanner in the works.
>
> where most architectures would have SIMD instructions and
> be done with it, after seeing this
> https://www.sigarch.org/simd-instructions-considered-harmful/
> i never want to be involved with SIMD, ever again :)

I figured this out about 2003.
>
> as mentioned before, what i'd therefore like to do is to be able
> to "merge" SIMD operations at the Reservation Station phase, by
> spotting patterns in the bitmasks:
>
> ROB#-dest | dest-bitmask | ROB#-src1/val | ROB#-src2/val
> --------- | ------------ | ------------- | -----------
> ROB5 | 0b11000000 | 0x59000000 | 0x12000000
> ROB4 | 0b00110000 | 0x00130000 | 0x00540000
>
> this to an 8-bit SIMD "Add" Functional Unit.
>
> the RS would have built-in "mask" detection logic that noted that the dest byte masks are non-overlapping, and would pass the *masked* src1 and src2 of *BOTH*
> rows of the RS down into the ALU.
>
> on completion of the operation, the destination would be noted as
> being different, and it would be necessary to issue *two* writes:
> one to ROB4 and the other to ROB5. if however the ROB#s were the
> same, only one write would be needed, and even the bitmasks merged.
>
> detection of this merging at instruction-issue time may not be
> possible, as it could hypothetically be *completely different*
> instructions that issued the instructions with different bytemasks.

You said above::
> i never want to be involved with SIMD, ever again :)

And yet you are in effect doing SIMD to your register file!
>
> in fact, i'm counting on precisely that, as a way to not need
> 8-bit-wide register routing and multiplexing etc. etc., which
> would be insane to have byte-wide crossbars all over the place.
>
> so, as i can't think of any other architectural way to achieve the
> same ability to merge non-overlapping SIMD ops, doing some voodoo
> on the RS rows *requires* Reservation Stations, that in turn requires
> ROB#s, and it all sort-of hangs together.
>
> oh, the other thing: a 3D GPU + VPU needs an insane 256 (total)
> 64-bit registers. 128 for INT, 128 for FP. this makes it possible
> to fit pixel "tiles" (as they're called) into the regfile, without
> the performance / power hit of transferring data through the L1/L2
> cache barrier (using LD/STs).

A common register file with both int and float in the same file gets rid
of 1/2 of that insanity.
>
> if we were to deploy the "usual" algorithm(s) associated with
> Scoreboard register-renaming, it would require multiplying that
> 256-entry register file by say 2, 3 or even 4.

Or 32 in order to absorb memory latency.
>
> by contrast, the Reservation Stations (whether deployed for SB+ROB
> or for TS+ROB) simply house the in-flight data, job done. oh,
> and we would get the opportunity to pull that SIMD-style data-merging
> trick.

As I do now and again, I will warn you about the difficulties of
doing SIMD to your register file.
More like carborundum.
>
> thoughts appreciated.

Ivan Godard

Dec 7, 2018, 4:02:38 PM
How: You can't put them in any architecturally visible place, by
definition. So you have to put them in reserved memory. But that is way
too expensive, so you need some buffering, essentially a cache but
addressed using some kind of temporal addressing so you get them in the
right order at replay. So after an operand has worked its way up the
per-slot daisy chain and it's still alive (i.e. is still mapped by the
decoders) the chain essentially continues into spill buffers in front of
a configurable amount of SRAM. Add a FSM at the front of the SRAM to
merge the slots into the SRAM, and a rather normal looking cache
spill/fill FSM to connect the SRAM to the memory controllers. The spill
buffers look just like the inputs to an ordinary FU, so the operand
movement needs no special handling or hardware in the bypass crossbar.

At this point you can save and correctly replay over an exception. If
you mark the saved operands with a call-frame number too, then you can
handle nested exception. But exceptions are just "calls by other means",
so you can use the same machinery for ordinary call/return too. Of
course, the replay is after the call returns, and may replay results
that were in-flight (not yet completed) when the call happened, just as
you can with regular OOO except without all the OOO overhead (but with
the SRAM).

Of course, result replay requires knowing when a given operand is to be
replayed. So you either need reorder and dependency hardware during the
replay, or you must statically know the latency of every operation: lots
of hardware or no early outs, choose one. As it turns out, the only
early out that matters is load. The Mill is a hybrid: issue replay for
loads, and result replay for everything else. The load retire stations
are pretty much the same as the load queue on a conventional, except for
the ability to re-request the load they are holding, so there's no added
hardware.

Uniqueness: that's a question of namespace and how dependencies are
represented. I've not looked very hard at dependency representation for
dynamic result replay, and the ideas I've had would need a whiteboard,
so I'll leave that up to you. You've a free hand; we have no patents in
the area.

In static result replay, the compiler knows the static latencies of
everything and dependency order is instruction stream order.
Consequently intermediate operands can be addressed as creation order
ordinal. This is temporal addressing (when, not where) rather than
spatial addressing (register number). In the decoder, recent results can
be addressed by indexing into the history; there's a logi-temporal to
physi-spatial latchset mapper in the decoder. The architectural view of
that mapping is the Belt. The actual mapping doesn't matter so long as
the compiler uses the same one when placing operations and assigning
belt numbers to argument references.
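
A toy model of temporal addressing, purely to make the idea concrete
(the class, the fixed belt length and the method names are invented
for this sketch, not the real machinery):

    from collections import deque

    class Belt:
        # results are addressed by how recently they were produced,
        # not by a register number; the oldest value simply falls off
        def __init__(self, length=8):
            self.slots = deque(maxlen=length)   # newest result at index 0

        def drop(self, value):
            # every operation drops its result at the front of the belt
            self.slots.appendleft(value)

        def ref(self, age):
            # operands are referenced by age: 0 = most recent result
            return self.slots[age]

    belt = Belt()
    belt.drop(5)                            # some op's result -> b0
    belt.drop(7)                            # next result -> b0; the 5 is now b1
    belt.drop(belt.ref(0) + belt.ref(1))    # "add b0, b1" drops 12
    assert belt.ref(0) == 12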

Uniqueness turned out to be one of the bigger issues given that we
wanted to be able to migrate tasks between cores, so a call saved on one
core would be replayed on another. We think we have it solved, but we
aren't doing multicore yet and may have to revisit the issue.

lk...@lkcl.net
Dec 7, 2018, 6:29:39 PM
On Friday, December 7, 2018 at 4:31:44 PM UTC, MitchAlsup wrote:

> > at this point i am genuinely puzzled and confused as to
> > whether there is any difference - at all - between
> > Scoreboards+ROBs and Tomasulo+ROBs :)
>
> The only practical difference left is that the Scoreboard is a
> centralized control block, while the ReservationStation is a
> distributed logic block.

i found in one of those links you sent, a reference to a paper
that described how intel implemented the ReservationStations:
as a central block, with routing going out from there to ALUs.

after seeing the pages in the book chapters you sent, about
the 6600 having dependency resolution matrices (FU-to-FU
and others), putting the RS logic in one place on the chip
would make sense.

> > https://www.sigarch.org/simd-instructions-considered-harmful/
> > i never want to be involved with SIMD, ever again :)
>
> I figured this out about 2003.

sadly the case for SIMD is seductively compelling. "add another
opcode" boooys, only an O(N^6) opcode proliferation, noo problem
there at aaaaall.

dropping in another ALU, let the compiler sort it out, it's very
very "simple", in hardware terms, unfortunately.

> You said above::
> > i never want to be involved with SIMD, ever again :)
>
> And yet you are in effect doing SIMD to your register file!

:) i'll do a separate reply (change of subject), summary:
the external ISA looks like a vectorisation system, it's
just the internals where SIMD is the architectural compromise.

> > oh, the other thing: a 3D GPU + VPU needs an insane 256 (total)
> > 64-bit registers. 128 for INT, 128 for FP. this makes it possible
> > to fit pixel "tiles" (as they're called) into the regfile, without
> > the performance / power hit of transferring data through the L1/L2
> > cache barrier (using LD/STs).
>
> A common register file with both int and float in the same file gets rid
> of 1/2 of that insanity.

yeahh unfortunately, the requirements are to cover simultaneous
GPU *and* VPU workloads, where the int regfile is likely to be
heavily used for VPU, and the FP one for GPU.

if you've ever seen demos by nvidia of spinning cubes with videos
being played on them, that's pretty much what web browsers now do,
except flat (2D).

what they do is, they have the VPU do the actual decode, however
the handling of the *displaying* part is passed to an OpenGL
call, as some sort of "texture", i don't know the full details.

this can be "solved" with a massive amount of work, modifying
standard software releases of Firefox and Chromium. i don't want
to get involved with that: got enough to do.

more than that: RISC-V (scalar RISC-V) doesn't have shared INT/FP
regfiles: i did however see it suggested for the embedded space.

compiler-wise, shared register files make me extremely nervous.

> > if we were to deploy the "usual" algorithm(s) associated with
> > Scoreboard register-renaming, it would require multiplying that
> > 256-entry register file by say 2, 3 or even 4.
>
> Or 32 in order to absorb memory latency.

i really really hope not - target clock rate's only 800mhz,
and only a 32-bit-wide DDR3/4 800mhz memory interface
(this is standard fare for $4 300/400-pin FBGA SoCs)

> >
> > by contrast, the Reservation Stations (whether deployed for SB+ROB
> > or for TS+ROB) simply house the in-flight data, job done. oh,
> > and we would get the opportunity to pull that SIMD-style data-merging
> > trick.
>
> As I do now and again, I will warn you about the difficulties of
> doing SIMD to your register file.

i believe i have a way to fake it (the bytemask is part of it).
i'll keep it as a separate thread. we *need* massive amounts
of 8-bit and 16-bit operations, for video decoding, as well as
for handling the RGB pixels on the 3D side.

l.

MitchAlsup
Dec 7, 2018, 7:24:13 PM
On Friday, December 7, 2018 at 5:29:39 PM UTC-6, lk...@lkcl.net wrote:
> On Friday, December 7, 2018 at 4:31:44 PM UTC, MitchAlsup wrote:
>
> > > at this point i am genuinely puzzled and confused as to
> > > whether there is any difference - at all - between
> > > Scoreboards+ROBs and Tomasulo+ROBs :)
> >
> > The only practical difference left is that the Scoreboard is a
> > centralized control block, while the ReservationStation is a
> > distributed logic block.
>
> i found in one of those links you sent, a reference to a paper
> that described how intel implemented the ReservationStations:
> as a central block, with routing going out from there to ALUs.

The dispatch stack approach can be done either with SB or RS and a
couple of other design points.
>
> after seeing the pages in the book chapters you sent, about
> the 6600 having dependency resolution matrices (FU-to-FU
> and others), putting the RS logic in one place on the chip
> would make sense.
>
> > > https://www.sigarch.org/simd-instructions-considered-harmful/
> > > i never want to be involved with SIMD, ever again :)
> >
> > I figured this out about 2003.
>
> sadly the case for SIMD is seductively compelling. "add another
> opcode" boooys, only an O(N^6) opcode proliferation, noo problem
> there at aaaaall.

In the case of Intel, it was: consume another prefix to enable a whole
new subgroup of instructions.

> compiler-wise, shared register files make me extremely nervous.

Why? They also solve a lot of problems:
a) one set of LD/ST instructions
b) INT(F), REAL(I) have no cross routing
c) compare instructions deliver condition to the easiest consumable place
d) FP compare and INT compare are almost the same logic! Why duplicate?
e) FP & MASK is easy to do--surprisingly useful in Transcendental codes.

My K9 design at AMD had both the INT and FP register file in one place,
with slightly different tag and data timing for the FP and INT sections.
The dynamic instruction window was much easier to construct with the
common PRF.
>
> > > if we were to deploy the "usual" algorithm(s) associated with
> > > Scoreboard register-renaming, it would require multiplying that
> > > 256-entry register file by say 2, 3 or even 4.
> >
> > Or 32 in order to absorb memory latency.
>
> i really really hope not - target clock rate's only 800mhz,
> and only a 32-bit-wide DDR3/4 800mhz memory interface
> (this is standard fare for $4 300/400-pin FBGA SoCs)
>
> > >
> > > by contrast, the Reservation Stations (whether deployed for SB+ROB
> > > or for TS+ROB) simply house the in-flight data, job done. oh,
> > > and we would get the opportunity to pull that SIMD-style data-merging
> > > trick.
> >
> > As I do now and again, I will warn you about the difficulties of
> > doing SIMD to your register file.
>
> i believe i have a way to fake it (the bytemask is part of it).
> i'll keep it as a separate thread. we *need* massive amounts
> of 8-bit and 16-bit operations, for video decoding, as well as
> for handling the RGB pixels on the 3D side.

SIMD in my SAMSUNG GPU/PE took essentially 1/2 of the architectural
effort, due to all the silly ramifications (along with a bit of
intragroup squabbling.)
>
> l.

Bruce Hoult
Dec 7, 2018, 9:04:29 PM
On Friday, December 7, 2018 at 4:24:13 PM UTC-8, MitchAlsup wrote:
> SIMD in my SAMSUNG GPU/PE took essentially 1/2 of the architectural
> effort, due to all the silly ramifications (along with a bit of
> intragroup squabbling.)

Oh gawd. "Braiding". Don't talk to me about braiding. Kill it with fire!!! Awful.

Last I heard, it might not even get used.

lk...@lkcl.net
Dec 8, 2018, 2:55:34 AM
On Friday, December 7, 2018 at 4:31:44 PM UTC, MitchAlsup wrote:

> You said above::
> > i never want to be involved with SIMD, ever again :)
>
> And yet you are in effect doing SIMD to your register file!

ok, so changing the topic: one of the requirements is to
be able to do massive amounts of 8-bit and 16-bit INT (incl.
IEEE754 FP16) computations.

given that the data width is so low, the cost of moving
(only) 8-bit / 16-bit values around, per clock, far exceeds
the amount of logic needed to move it.

so what i decided to do is:

* place a vectorisation front-end on the ISA
* allow "type-casting" of the register file from the
default down to 32-bit, 16-bit or 8-bit *as if*
it was a memory-addressable RAM
* behind the scenes (as part of the instruction issue
phase) have as many of those "elements" jammed into
an operation as will fit into the 64-bit width, and
mark them with a "byte-mask" if they won't fit

the alternatives are pretty much flat-out insane, and include
things like having a separate 8-bit sequential/scalar set of FUs
plus ROB plus Reservation Stations, and yet another separate
16-bit set *and* a 32-bit one... except now how on earth do you
sync those both up?

another alternative: just pass scalar operations in, jamming up
the Reorder Buffer with rows consuming dozens of bits, bearing
in mind each ROB entry is an increase in power consumption due
to the DestReg CAM, now it's no longer a quad-issue system it's
a 32-issue system and that's just insane.

so the compromise is to rely on the sequential nature of the
SV "hardware macro-loop-unrolling", which will spawn multiple
sequential operations (sequentially-incrementing "element"
numbers) that, in the 8-bit case, will *happen* to be the next
byte of a register (until the end of that register is reached,
in which case, we move on to the *next register*).

where the Vector Length is not a multiple of 8 (in the
case of 8-bit elements) or a multiple of 4 (in the case
of 16-bit elements), obviously if the entire register were
allowed to perform parallel computations, the MSBs would
become corrupted.

this is where the byte-mask comes in. the byte-mask will
not only tell the ALU which 8/16-bit operands it is to
calculate, it will also tell the Register File which bytes
of the DestReg are to be written to and which are to be left
alone.

so it's not *actually* a SIMD ISA (that's insane), it's just
predicated-SIMD ALUs, where the decoding is done in phases,
even right down to that optimisation idea of merging non-overlapping
byte-masks within the Reservation Stations.
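
a rough python sketch of the issue-phase packing, purely to illustrate
the idea (the function name, the fixed 64-bit width and the packing
policy here are invented for the sketch, not the actual issue logic):

    def pack_elements(vl, elwidth_bits, start_el=0):
        # split a vector op of vl elements into 64-bit "predicated SIMD"
        # issues, each carrying a byte-mask saying which byte lanes of
        # the destination may be written
        lanes_per_op = 64 // elwidth_bits       # e.g. 8 lanes of 8-bit
        bytes_per_el = elwidth_bits // 8
        issues = []
        el = start_el
        while el < vl:
            n = min(lanes_per_op, vl - el)      # last issue may be partial
            bytemask = 0
            for lane in range(n):
                for b in range(bytes_per_el):
                    bytemask |= 1 << (lane * bytes_per_el + b)
            issues.append((el, n, bytemask))    # (first element, count, mask)
            el += n
        return issues

    # VL=11 of 8-bit elements: one full issue (mask 0xff) plus a partial
    # issue covering only the bottom 3 bytes (mask 0b00000111)
    print(pack_elements(11, 8))    # [(0, 8, 255), (8, 3, 7)]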


there are some additional complications / wrinkles at play, here:

(1) the 1D/2D/3D "REMAP" system may add *offsets* onto the
vector-element indices. so, element 0 may come from
byte *THREE* of Register 5.

(2) the 1D/2D/3D "REMAP" system may again completely change
the element indices, according to a regular pattern

(3) there is polymorphic bit-width conversion available, on
*all* src operands *and* the destination.

here what i am planning to do is to utilise the Reorder Buffer
to effectively add in additional micro-code operations that will
pre-process and post-process the operands:

* in the case of the "REMAP", that's dead simple: the operands
go through an xBitManip phase (pre and post). the vectorised
instruction issue phase "gets" registers as standard 64-bit
blocks (element indices &ed with ~0x7), and the xBitManip
micro-phase will take care of the rest (element indices &ed 0x7)

* in the case of the bit-width conversion, that's also simple,
(and can be done in parallel): pass in 8-bit elements, out
comes 16/32/64-bit sign- or zero- extended elements.

the pre-processing and post-processing may in some cases actually
be done i believe by updating the Reorder Buffer entry, as opposed
to occupying multiple ROB entries. in particular this would
be destination register pre/post processing.

other phases (particularly bit-width source register reduction)
may actually be done with a filter on the RegFile.
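
a sketch of the two micro-phases, again with invented names and with
the 64-bit register width hard-coded, just for illustration:

    def element_to_reg(base_reg, el_index, elwidth_bits):
        # which 64-bit register, and which byte inside it, holds element
        # el_index (the "element indices &ed with ~0x7 / &ed 0x7" split
        # mentioned above, generalised to other element widths)
        els_per_reg = 64 // elwidth_bits
        reg = base_reg + el_index // els_per_reg
        byte_offset = (el_index % els_per_reg) * (elwidth_bits // 8)
        return reg, byte_offset

    def widen(value, src_bits, dst_bits, signed=True):
        # polymorphic width conversion: sign- or zero-extend an element
        value &= (1 << src_bits) - 1
        if signed and value & (1 << (src_bits - 1)):
            value -= 1 << src_bits
        return value & ((1 << dst_bits) - 1)

    print(element_to_reg(5, 11, 8))   # (6, 3): byte 3 of register 6
    print(hex(widen(0xFF, 8, 16)))    # 0xffff: -1 sign-extended to 16-bit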


so, i believe that this is a workable strategy for shoe-horning
polymorphic vectorisation into a multi-issue micro-architecture.
it's by no means perfect, as certain workloads will be optimal
and others will be rubbish (missed opportunities for byte-mask
merges).

what we will have to do is, just ensure that the assembly code
reasonably matches the optimal performance. it's a reasonable
compromise for a first effort.

l.

lk...@lkcl.net
Dec 8, 2018, 7:42:03 AM
On Saturday, December 8, 2018 at 12:24:13 AM UTC, MitchAlsup wrote:

> > i found in one of those links you sent, a reference to a paper
> > that described how intel implemented the ReservationStations:
> > as a central block, with routing going out from there to ALUs.
>
> The dispatch stack means can be done wither with SB or RS and a
> couple of other design points.

i've yet to work out how instruction order is preserved, i was
fascinated to note in chapter 11 that you sent me, that branch
speculation is achieved by a combination of schroedinger-marking
the instruction (not yet dead/alive) and using that to mask out
the "write" flag.

simple: if the instruction can't "write", its results can't
do anything, they just sit there.

likewise, imprecise exceptions are made "precise" by the same
fundamental technique: the instruction with exception "holds" a
line that prevents the write from proceeding.

i am finally i *think* starting to "Get It", and it's down to
information that is completely missing from the 1967 patent:
the "Dependency Matrices" are an absolutely critical, critical
part that's *not* mentioned anywhere in any of the online
lectures.

the "Dependency Matrices" are basically a bit-version of the
Reorder Buffer, which will do things such as those mentioned
above. these DMs are from Function-Unit along X *and* Function-
Unit along Y, with dead-simple wires and some tiny sets of
logic gates implementing things like "If instruction is
schroedinger'd, disable the Go_Write signal".

* the Reorder Buffer basically moves a 1D "head of queue" in
a cyclic buffer.
* the Dependency Matrices do sort-of the same thing except
that the effects from X ripple along Y.

so... in effect, the lack of the 2D matrices is why the ROB
has the CAM in the Dest Reg... and *that* is why you said,
earlier, mitch, that SBds are like a "1-bit CAM"! but it's
not the CAM that's important, it's the *matrices*!
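
a crude bit-matrix sketch of that, to check my own understanding (the
sizes, names and the exact release rule are invented, and this is
nowhere near 6600 gate-level detail):

    N_FU = 8    # invented size, purely for illustration

    # waits_on[i][j] == 1 means FU i must not write until FU j has written
    waits_on = [[0] * N_FU for _ in range(N_FU)]
    shadowed = [0] * N_FU   # "schroedinger" bit: result may not commit yet

    def issue(fu, depends_on):
        for j in depends_on:
            waits_on[fu][j] = 1

    def fu_completed(fu):
        # when FU 'fu' writes, clear its column: everything that was
        # waiting on it gets released
        for i in range(N_FU):
            waits_on[i][fu] = 0

    def go_write(fu):
        # Go_Write is just a wide NOR of the row, masked by the shadow bit
        return not shadowed[fu] and not any(waits_on[fu])

    issue(2, depends_on=[0, 1])
    print(go_write(2))              # False: still waiting on FUs 0 and 1
    fu_completed(0); fu_completed(1)
    print(go_write(2))              # True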


i haven't yet been able to completely absorb it, i only skim-read
chapter 11 last night, i see you added a means to do concurrent
LOAD/STOREs that *also* are based on a (sparse) matrix
(and then also link in to the other matrices). the LD/ST sparse
matrix then feeds into other Dependency Matrices.

the key for LD/ST is that LOADs can stop stores, and STOREs can
stop loads.
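
roughly, the rule that populates that LD/ST matrix (the exact policy
details here are my guess for the sketch, not taken from the chapter):

    def must_wait(older_op, younger_op):
        # a younger STORE is held by every older LOAD or STORE;
        # a younger LOAD is held only by older STOREs
        if younger_op == "ST":
            return older_op in ("LD", "ST")
        if younger_op == "LD":
            return older_op == "ST"
        return False

    print(must_wait("LD", "ST"))   # True : a LOAD stops a later STORE
    print(must_wait("ST", "LD"))   # True : a STORE stops a later LOAD
    print(must_wait("LD", "LD"))   # False: loads may pass each other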

there's *no mention* of these simple Matrices *anywhere* in the
academic literature!

this missing information was what was really preventing me from
"accepting" Scoreboards as a viable method of doing OoO superscalar
execution.

utterly, utterly cool.

> > sadly the case for SIMD is seductively compelling. "add another
> > opcode" boooys, only an O(N^6) opcode proliferation, noo problem
> > there at aaaaall.
>
> In the case of Intel, it was consume another prefix to enable a whole
> new sugGroup of instructions.

that's slightly better than i imagined it would be. if that prefix
"persisted" (in some form of CSR state), that's basically what SV
does (and RVV). it means though that a mechanism to store/restore
that state is needed.

> > more than that: RISC-V (scalar RISC-V) doesn't have shared INT/FP
> > regfiles: i did however see it suggested for the embedded space.
> >
> > compiler-wise, shared register files make me extremely nervous.
>
> Why?

i don't really know. or, more specifically: i have a series of
below-conscious-level intuitive hunches that i haven't yet been
able to explore / express, all of which are screaming for my
attention with big neon flashing signs... it's just that they're
about 50 miles away (over the horizon).

these hunches tell me that *if* i was to explore each of them
in huge excruciating detail, then some time in the next... maybe...
6 to 60 months i would have come up with the *one* show-stopping
concrete reason that would allow me to answer...

... oh and i would be absolutely kicking myself for having spent
6 to 60 months on it.


> They also solve a lot of problems:
> a) one set of LD/ST instructions

well.. that's easy enough to have the Functional Unit do the same
task, if the Reservation Station has a bit which marks it as being
either Int or FP.

likewise for (d)

> b) INT(F), REAL(I) have no cross routing
> c) compare instructions deliver condition to the easiest consumable place
> d) FP compare and INT compare are almost the same logic! Why duplicate?
> e) FP & MASK is easy to do--surprisingly useful in Transcendental codes.
>
> My K9 design at AMD had both the INT and FP register file in one place,
> with slightly different tag and data timing for the FP and INT sections.
> The dynamic instruction window was much easier to construct with the
> common PRF.

that seems fairly obvious / logical to me.

> SIMD in my SAMSUNG GPU/PE took essentially 1/2 of the architectural
> effort, due to all the silly ramifications (along with a bit of
> intragroup squabbling.)

Samsung? interesting! was that the internal group, or the external /
3rd party one (Nexell)?

l.

lk...@lkcl.net
Dec 8, 2018, 7:47:42 AM
On Saturday, December 8, 2018 at 2:04:29 AM UTC, Bruce Hoult wrote:
> On Friday, December 7, 2018 at 4:24:13 PM UTC-8, MitchAlsup wrote:
> > SIMD in my SAMSUNG GPU/PE took essentially 1/2 of the architectural
> > effort, due to all the silly ramifications (along with a bit of
> > intragroup squabbling.)
>
> Oh gawd. "Braiding". Don't talk to me about braiding.

you do _not_ wanna touch it...
https://www.youtube.com/watch?v=54o8aYDhnos

> Kill it with fire!!! Awful.

luckily, patents are obtuse and unreadable these days...
https://patents.justia.com/patent/10061591

l.

Bruce Hoult
Dec 8, 2018, 8:28:38 AM
On Saturday, December 8, 2018 at 4:42:03 AM UTC-8, lk...@lkcl.net wrote:
> On Saturday, December 8, 2018 at 12:24:13 AM UTC, MitchAlsup wrote:
>
> > > i found in one of those links you sent, a reference to a paper
> > > that described how intel implemented the ReservationStations:
> > > as a central block, with routing going out from there to ALUs.
> >
> > The dispatch stack means can be done wither with SB or RS and a
> > couple of other design points.
>
> i've yet to work out how instruction order is preserved, i was
> fascinated to note in chapter 11 that you sent me, that branch
> speculation is achieved by a combination of schroedinger-marking
> the instruction (not yet dead/alive) and using that to mask out
> the "write" flag.
>
> simple: if the instruction can't "write", its results can't
> do anything, they just sit there.

Don't forget that in these days of Spectre and Meltdown, merely preventing dead instruction results from being written to registers or memory is NOT ENOUGH. You also need to prevent load instructions from altering cache and branch instructions from altering branch prediction state. For example.

Bruce Hoult
Dec 8, 2018, 8:32:01 AM
That's the one. Interesting to see Mitch is not listed, but I personally know three of the six listed inventors, and worked with them on the compiler targeting this.

To put it simply, the problem is threads in a warp are predicated, but braids in a thread are not. The compiler needs to prove that braids will never want to diverge.

MitchAlsup
Dec 8, 2018, 12:07:01 PM
For those out of the loop:
in SIMT, each lane in the GPU is a different thread.
Braiding placed data from different threads across a single register.
Since each thread executes the same instruction,
on the face of it this looks clean and straightforward.
It is only later that you realize you have just shot out both your feet.

But I agree with Bruce:: Kill the beast using any means possible.

MitchAlsup
Dec 8, 2018, 12:17:55 PM
On Saturday, December 8, 2018 at 6:42:03 AM UTC-6, lk...@lkcl.net wrote:
> On Saturday, December 8, 2018 at 12:24:13 AM UTC, MitchAlsup wrote:
>
> > > i found in one of those links you sent, a reference to a paper
> > > that described how intel implemented the ReservationStations:
> > > as a central block, with routing going out from there to ALUs.
> >
> > The dispatch stack means can be done wither with SB or RS and a
> > couple of other design points.
>
> i've yet to work out how instruction order is preserved, i was
> fascinated to note in chapter 11 that you sent me, that branch
> speculation is achieved by a combination of schroedinger-marking
> the instruction (not yet dead/alive) and using that to mask out
> the "write" flag.
>
> simple: if the instruction can't "write", its results can't
> do anything, they just sit there.
>
> likewise, imprecise exceptions are made "precise" by the same
> fundamental technique: the instruction with exception "holds" a
> line that prevents the write from proceeding.
>
> i am finally i *think* starting to "Get It", and it's down to
> information that is completely missing from the 1967 patent:
> the "Dependency Matrices" are an absolutely critical, critical
> part that's *not* mentioned anywhere in any of the online
> lectures.

The online lectures don't want their pupils to actually "get" what scoreboards
can do at low expense. They want everyone to buy in on reservation stations.
>
> the "Dependency Matrices" are basically a bit-version of the
> Reorder Buffer, which will do things such as those mentioned
> above. these DMs are from Function-Unit along X *and* Function-
> Unit along Y, with dead-simple wires and some tiny sets of
> logic gates implementing things like "If instruction is
> schroedinger'd, disable the Go_Write signal".

You actually are starting to get it.
>
> * the Reorder Buffer basically moves a 1D "head of queue" in
> a cyclic buffer.

The RoB is essentially the dependency matrix plus storage and an
access means (post delivery forwarding).

> * the Dependency Matrices do sort-of the same thing except
> that the effects from X ripple along Y.
>
> so... in effect, the lack of the 2D matrices is why the ROB
> has the CAM in the Dest Reg... and *that* is why you said,
> earlier, mitch, that SBds are like a "1-bit CAM"! but it's
> not the CAM that's important, it's the *matrices*!

Bingo!
>
>
> i haven't yet been able to completely absorb it, i only skim-read
> chapter 11 last night, i see you added a means to do concurrent
> LOAD/STOREs that *also* are based on a (sparse) matrix
> (and then also link in to the other matrices). the LD/ST sparse
> matrix then feeds into other Dependency Matrices.

I have also done a DM in support of an OoO memory system putting order
back together, but that is for another day.
>
> the key for LD/ST is that LOADs can stop stores, and STOREs can
> stop loads.
>
> there's *no mention* of these simple Matrices *anywhere* in the
> academic literature!

"Design of a Computer" Thornton has it all in great detail, except you
have to translate from pre-TTL gates into modern gate technologies.
If you have not read this, you need to. It is available online.
>
> well.. that's easy enough to have the Functional Unit do the same
> task, if the Reservation Station has a bit which marks it as being
> either Int or FP.

You are forgetting the precious nature of your OpCode space. The lower
the demands on OpCode space, the better the encoding can be, and the more
space you leave for future iterations.
>
> likewise for (d)
>
> > b) INT(F), REAL(I) have no cross routing
> > c) compare instructions deliver condition to the easiest consumable place
> > d) FP compare and INT compare are almost the same logic! Why duplicate?
> > e) FP & MASK is easy to do--surprisingly useful in Transcendental codes.
> >
> > My K9 design at AMD had both the INT and FP register file in one place,
> > with slightly different tag and data timing for the FP and INT sections.
> > The dynamic instruction window was much easier to construct with the
> > common PRF.
>
> that seems fairly obvious / logical to me.
>
> > SIMD in my SAMSUNG GPU/PE took essentially 1/2 of the architectural
> > effort, due to all the silly ramifications (along with a bit of
> > intragroup squabbling.)
>
> Samsung? interesting! was that the internal group, or the external /
> 3rd party one (Nexell)

Internal.
>
> l.

MitchAlsup
Dec 8, 2018, 12:21:37 PM
On Saturday, December 8, 2018 at 7:32:01 AM UTC-6, Bruce Hoult wrote:
> On Saturday, December 8, 2018 at 4:47:42 AM UTC-8, lk...@lkcl.net wrote:
> > On Saturday, December 8, 2018 at 2:04:29 AM UTC, Bruce Hoult wrote:
> > > On Friday, December 7, 2018 at 4:24:13 PM UTC-8, MitchAlsup wrote:
> > > > SIMD in my SAMSUNG GPU/PE took essentially 1/2 of the architectural
> > > > effort, due to all the silly ramifications (along with a bit of
> > > > intragroup squabbling.)
> > >
> > > Oh gawd. "Braiding". Don't talk to me about braiding.
> >
> > you do _not_ wanna touch it...
> > https://www.youtube.com/watch?v=54o8aYDhnos
> >
> > > Kill it with fire!!! Awful.
> >
> > luckily, patents are obtuse and unreadable these days...
> > https://patents.justia.com/patent/10061591
>
> That's the one. Interesting to see Mitch is not listed, but I personally know and worked on the compiler targeting this with three of the six inventors listed.

John was the principal contributor, and Max thought that through the
compiler it would be easy to get braiding worked out.
>
> To put it simply, the problem is threads in a warp are predicated, but braids in a thread are not. The compiler needs to prove that braids will never want to diverge.

At the time I left, there was a scheme by which even diverged threads still worked with braiding. I spent months figuring out the encodings and logic
to pull it off (being fought most of the way.)

MitchAlsup
Dec 8, 2018, 12:30:47 PM
Which, oddly enough, creates the need to be able to consume multiple
containers from the cache miss buffers, and that, in turn, is a crucial
mechanism in making the Virtual Vector Method work.

In the past, one would forward the demand container to the waiting memref
and then write the whole line into the cache. S&M (Spectre & Meltdown)
means you have to forward multiple times from the miss buffers and avoid
damaging the cache until the instruction retires. VVM uses this to avoid
having a vector strip-mine the data cache.
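
Roughly, as a sketch only (the structure and the names are invented for
illustration, not the actual design):

    class MissBuffer:
        # a fill buffer that can forward to any number of waiting memrefs
        # but does not touch the cache until the owning instruction is
        # known to retire
        def __init__(self, line_addr, data):
            self.line_addr = line_addr
            self.data = data             # the returned cache line

        def forward(self, offset, size):
            # may be called repeatedly, by any waiting memref
            return self.data[offset:offset + size]

        def retire(self, cache):
            # only now is the cache allowed to be "damaged"
            cache[self.line_addr] = self.data

    cache = {}
    mb = MissBuffer(0x1000, bytes(range(64)))
    print(mb.forward(8, 4))      # speculative consumers get their bytes...
    print(0x1000 in cache)       # ...but the cache line is untouched: False
    mb.retire(cache)
    print(0x1000 in cache)       # True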

lk...@lkcl.net
Dec 8, 2018, 9:04:38 PM
*thinks*... i suspect it's more that the patents were what everyone
had access to, and those simply did not emphasise enough the
importance of the FU dependency matrices.

i saw somewhere, "modern computing has all been about how to get
rid of the CAMs"... well... errr... the 6600 managed it, in the 1960s!

> > * the Reorder Buffer basically moves a 1D "head of queue" in
> > a cyclic buffer.
>
> The RoB is essentially the dependency matrix plus storage and an
> access means (post delivery forwarding).

specifying "to" (via the dest-reg) rather than "from" (via the src-regs),
hence the CAM, as the relationship "to" is one-to-many.

> > * the Dependency Matrices do sort-of the same thing except
> > that the effects from X ripple along Y.
> >
> > so... in effect, the lack of the 2D matrices is why the ROB
> > has the CAM in the Dest Reg... and *that* is why you said,
> > earlier, mitch, that SBds are like a "1-bit CAM"! but it's
> > not the CAM that's important, it's the *matrices*!
>
> Bingo!

:)

what that means is: the restrictions i envisaged due to the ROB#
Dest-Reg CAM (trying to keep the size of the ROB down to reduce
power usage) are now gone.

so instead of messing about trying to shoe-horn 2x 32-bit operations
into 64-bit-wide registers, 4x 16-bit ops into the same, and 8x 8-bit
ops likewise, it *might* be practical to consider putting in large
job-lots of 32-bit ALUs, possibly even 16-bit ones.

i'll have to think this through carefully, because just going to
32-bit-wide ALUs would literally double (actually, triple) the number
of FUs, which with an O(N^2) increase of the Dependency Matrix might
be a bit much.
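
to put rough (purely illustrative) numbers on that: 16 FUs needs a
16x16 = 256-cell matrix, and tripling that to 48 FUs needs 48x48 =
2,304 cells, a 9x blow-up.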

so, keeping the SIMD-like bytemasks might still be a good idea...
hmm....

> >
> >
> > i haven't yet been able to completely absorb it, i only skim-read
> > chapter 11 last night, i see you added a means to do concurrent
> > LOAD/STOREs that *also* are based on a (sparse) matrix
> > (and then also link in to the other matrices). the LD/ST sparse
> > matrix then feeds into other Dependency Matrices.
>
> I have also done a DM in support of OoO memory system putting order
> back together, but that is for another day.

indeed.

> >
> > the key for LD/ST is that LOADs can stop stores, and STOREs can
> > stop loads.
> >
> > there's *no mention* of these simple Matrices *anywhere* in the
> > academic literature!
>
> "Design of a Computer" Thornton has it all in great detail, except you
> have to translate from pre-TTL gates into modern gate technologies.
> If you have not read this, you need to. It is available online.

found it. wooow, it's a scanned PDF... and there is a note
at the back with permission from James Thornton, written by
his wife, to make it available online.

wow.

oh, that's incredible: core memory and boot-loader bank switches!
coool! this reminds me, i remember someone telling me of a bit-encoded
instruction set that was so efficient that its bootstrap loader
required only 17 switches.

the diagrams and photos are absolutely fascinating. they were doing
the ALUs on PCBs with transistors?!?! no wonder there was a focus
on using D-Latches instead of Flip-flops, as that's a hell of a lot
of extra circuitry.

dang.

> > well.. that's easy enough to have the Functional Unit do the same
> > task, if the Reservation Station has a bit which marks it as being
> > either Int or FP.
>
> You are forgetting the precious nature of your OpCode space. The lower
> the demands on OpCode space, the better the encoding can be, and the more
> space you leave for future iterations.

*thinks*... the 1 extra bit in every register field, from doubling the
size of the regfile (assuming you want the same total register space),
would lose that advantage... however the mv instructions (from fp to
int reg) would be gone, at various bitwidth permutations. there's
probably some other obscure
savings, too.

l.

lk...@lkcl.net
Dec 9, 2018, 9:18:30 AM
so, i'm still completely blown away at how amazing the
6600 design is.

* no need for a Reorder Buffer because the forward-chain
of the Go-Read and Go-Write signals through the matrices
preserves the instruction order dependencies whilst still
allowing out-of-order execution

* no need for operand-forwarding as that's an integral
part of the register file, through the "pass-through"
capability.

* the role of the Common Data Bus is also absorbed partly
by the register file's "pass-through" capability, and
the broadcasting of the Dest Reg on the CDB is replaced
by the "single-wire" sort-of CAMs of the Function Units

* as outlined in Mitch's book chapters, if one of those
Function Units is a "Branch" operation, associated
row entries in the Dependency Matrix that hook into
the "Go_Write" of down-stream Functional Units can be
used to allow speculative branch execution to *begin*
but not yet *complete* (commit)...

* also mentioned in the same: if speculation beyond more
than one branch is desired, just add another Branch
Function Unit!

* the book chapters also outline a multi-issue LOAD/STORE
mechanism that also works similarly, with a
secondary (augmentation) "memory-dependency" matrix.
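
to make the branch bullet above concrete to myself, a toy sketch (all
of the detail here is invented; it is just the shape of the idea):

    class BranchShadow:
        # a Branch FU "shadows" everything issued speculatively under it;
        # those FUs may execute but their Go_Write is held until the
        # branch resolves
        def __init__(self):
            self.shadowed_fus = set()

        def add(self, fu):
            self.shadowed_fus.add(fu)    # issued under this branch

        def resolve(self, prediction_correct):
            if prediction_correct:
                released, cancelled = self.shadowed_fus, set()
            else:
                released, cancelled = set(), self.shadowed_fus
            self.shadowed_fus = set()
            return released, cancelled   # release Go_Write, or let results die

    br = BranchShadow()
    br.add(3); br.add(4)
    print(br.resolve(prediction_correct=False))   # (set(), {3, 4})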



some things which i've not yet been able to fully grasp:

* i believe the same logic proposed for the Speculative Branching
may be applied to Exceptions:
speculatively executed exceptions may be similarly
implemented, i imagine by having an Exception "Functional
Unit", in a similar analogous way to how a Reorder Buffer
typically stores speculative exceptions, ready for when
(or if) they hit the head of the ROB queue.

* whilst i Get how with the Tomasulo Algorithm, simply marking a
ROB Dest Reg table entry as empty or not is enough to provide
full register-renaming, because the *values* are stored in the
Reservation Stations (several copies of them, waiting in the RS
Function Unit buffers), i'm no longer clear on exactly how to
replace that ROB Dest Reg field in a scoreboard-with-RS context.

is this diagram with a "Register Aliasing Table" relevant and
correct?
https://libre-riscv.org/3d_gpu/rat_table.png
the table is indexed by register number, and contains a valid
bit, tag, and a "value" (presumably that value is the regfile
register value).

the "tag" apparently is just the "Function Unit"
number (a fixed quantity), and *hypothetically* i can kinda
imagine that it would just be a bitfield (one unique bit per
Function Unit). except, with say 50 Function Units and
256 actual registers, that's now a hell of a lot of bits.
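
back-of-the-envelope, taking those numbers at face value: one bit per
FU would be 256 regs x 50 bits = 12,800 bits of RAT storage, whereas
binary-encoding the FU number needs only 6 bits (enough for up to 64
FUs), i.e. 256 x 6 = 1,536 bits plus a valid bit per row. so the tag
encoding matters a lot more than the concept itself.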

i'm almost at the point where i have a full understanding, enough
to be able to commit to a simulation / implementation. lack of
a full understanding of the role of the Register Aliasing Table is
currently holding that back.

l.

MitchAlsup
Dec 9, 2018, 12:00:46 PM
On Sunday, December 9, 2018 at 8:18:30 AM UTC-6, lk...@lkcl.net wrote:
> so, i'm still completely blown away at how amazing the
> 6600 design is.
>
> * no need for a Reorder Buffer because the forward-chain
> of the Go-Read and Go-Write signals through the matrices
> preserves the instruction order dependencies whilst still
> allowing out-of-order execution
>
> * no need for operand-forwarding as that's an integral
> part of the register file, through the "pass-through"
> capability.
>
> * the role of the Common Data Bus is also absorbed partly
> by the register file's "pass-through" capability, and
> the broadcasting of the Dest Reg on the CDB is replaced
> by the "single-wire" sort-of CAMs of the Function Units
>
> * as outlined in Mitch's book chapters, if one of those
> Function Units is a "Branch" operation, associated
> row entries in the Dependency Matrix that hook into
> the "Go_Write" of down-stream Functional Units can be
> used to allow speculative branch execution to *begin*
> but not yet *complete* (commit)...
>
> * also mentioned in the same: if speculation beyond more
> than one branch is desired, just add another Branch
> Function Unit!

They are small, why not get 2.....
>
> * the book chapters also outline a multi-issue LOAD/STORE
> mechanism that also works similarly, with a
> secondary (augmentation) "memory-dependency" matrix.
>
>
>
> some things which i've not yet been able to fully grasp:
>
> * i believe the same logic proposed for the Speculative Branching
> may be applied to Exceptions:
> speculatively executed exceptions may be similarly
> implemented, i imagine by having an Exception "Functional
> Unit", in a similar analogous way to how a Reorder Buffer
> typically stores speculative exceptions, ready for when
> (or if) they hit the head of the ROB queue.

I designed it to cover both cases. It has the singular drawback of
taking interrupts at what appears to be branch boundaries most of
the time.
>
> * whilst i Get how with the Tomasulo Algorithm, simply marking a
> ROB Dest Reg table entry as empty or not is enough to provide
> full register-renaming, because the *values* are stored in the
> Reservation Stations (several copies of them, waiting in the RS
> Function Unit buffers), i'm no longer clear on exactly how to
> replace that ROB Dest Reg field in a scoreboard-with-RS context.
>
> is this diagram with a "Register Aliasing Table" relevant and
> correct?
> https://libre-riscv.org/3d_gpu/rat_table.png
> the table is indexed by register number, and contains a valid
> bit, tag, and a "value" (presumably that value is the regfile
> register value).
>
> the "tag" apparently is just the "Function Unit"
> number (a fixed quantity), and *hypothetically* i can kinda
> imagine that it would just be a bitfield (one unique bit per
> Function Unit). except, with say 50 Function Units and
> 256 actual registers, that's now a hell of a lot of bits.

The whole thing has a quadratic term in the expansion. So SBs have
an upper limit on the registers times FUs that are supportable.

Ivan Godard
Dec 9, 2018, 12:25:31 PM
On 12/9/2018 9:00 AM, MitchAlsup wrote:

>
> The whole thing has a quadratic term in the expansion. So SBs have
> an upper limit on the registers times FUs that are supportable.

Can't you use the same group-FUs-into-slots that we did, to reduce the
size of the FU quadratic coefficient?

lk...@lkcl.net
Dec 9, 2018, 12:34:28 PM
On Sunday, December 9, 2018 at 5:00:46 PM UTC, MitchAlsup wrote:

> > * also mentioned in the same: if speculation beyond more
> > than one branch is desired, just add another Branch
> > Function Unit!
>
> They are small why not get 2.....

buy one get one free...

> > the "tag" apparently is just the "Function Unit"
> > number (a fixed quantity), and *hypothetically* i can kinda
> > imagine that it would just be a bitfield (one unique bit per
> > Function Unit). except, with say 50 Function Units and
> > 256 actual registers, that's now a hell of a lot of bits.
>
> The whole thing has a quadratic term in the expansion. So SBs have
> an upper limit on the registers times FUs that are supportable.

after reviewing this video https://youtu.be/p4SdrUhZrBM?t=494
and noting the problem there of two results being ready at
the same time, both of which target the same register r1
(one from the MUL unit, one from an ADD), i think i might have
a workable scalable solution that doesn't blow up.

* each reservation station is given its own scoreboard src1/src2
ready flags. about 2 entries would be just about tolerated
because this would double the number of Functional Unit wires,
thus quadrupling the size of the Dependency Matrices.

* the src reg number(s) are shared (identical) so it's not as
horrible as it seems.

* writes are selected based on the age of the instruction.
this being directly equivalent to the "head of Reorder Buffer
gets commit rights"

(question: does this mechanism already exist?)

* on "Go Write", the index of the Reservation Station, which
is a single bit (there being lots of those), is used to create
a pseudo-encoding that's directly equivalent to the
"Reorder Buffer Number".

* this pseudo-encoding is passed through the Register File
and out the other side, whereupon it is expanded *BACK*
out to single-bit signals that indicate to the Reservation
Stations (with a single bit), "data's arrived".

this is critically dependent on register writes being done in
the instructions' originally-issued order. i am happy to admit
to having no clue how that's done :)
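
the encode/expand step i have in mind, as a trivial sketch (the widths
and function names are invented; it's just a one-hot to binary to
one-hot round trip):

    def rs_to_tag(rs_onehot):
        # encode the (one-hot) Reservation Station wire into a small
        # binary tag that travels with the write through the Register File
        return rs_onehot.bit_length() - 1     # index of the single set bit

    def tag_to_rs(tag):
        # on the far side of the Register File, expand the tag back out
        # to a single "data's arrived" wire for exactly one RS
        return 1 << tag

    rs = 0b00100000             # RS #5 raised its hand
    tag = rs_to_tag(rs)         # 5 -- the pseudo "Reorder Buffer Number"
    print(tag, bin(tag_to_rs(tag)))    # 5 0b100000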

l.

MitchAlsup
Dec 9, 2018, 2:00:24 PM
One of the things that made the CDC 6600 SB so small is that it had
3 register files of 8 entries each. The A registers were only readable
by the INCrement unit and the Shifter, the B registers only readable by
the INCrement unit and the Shifter, and the X registers were only
readable by the MUL, DIV, ADDer, and Store units.

If one were to try a SB with 32 GPRs, where each GPR can go to any of 5-8
function units, the SB becomes too big for its timing to work.
Let us postulate that we have a fully general purpose SB for such a
RISC machine (Opteron size) and the SB has size X.

The SB for the partitioned CDC 6600 is about 24 times smaller. (IIRC).

MitchAlsup
Dec 9, 2018, 2:04:34 PM12/9/18