
tomasulo algorithm and reorder buffers for parallel vector engines / SIMD


lk...@lkcl.net

Dec 5, 2018, 3:31:09 AM
with thanks to mitch alsup for alerting me to the tomasulo
algorithm and reservation stations, i began investigating
how to leverage this algorithm for use in SimpleV. here
is a really good video explaining reorder buffers and how
to do speculative execution:

https://www.youtube.com/watch?v=OU3jI8j7Ozw

what i particularly like about reorder buffers is that they are
an exceptionally clean solution which pretty much solves
absolutely every hazard and potential issue
(LD/ST WAW/WAR, register WAW/WAR), can be used to
implement multi-issue and speculation, clears up branch
mis-prediction, and deals with exceptions as well, simply
by rolling back (erasing) the ROB table.

it's effectively conceptually identical to database
"transactions", where the only items that are allowed to be
"committed" are those at the head of the buffer, once all data
is in. not only that: extending tomasulo to do multi-issue
is a simple matter of allowing more than one instruction
to be entered into the ROB on every clock cycle, and ensuring
that all data paths are wide enough to cope.
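
to make the "commit only at the head" rule concrete, here is a toy
model (python; class and field names invented, behaviour only, no
attempt at modelling gates):

from collections import deque

class ROBEntry:
    def __init__(self, itype, dest):
        self.itype = itype     # branch / store / ALU op / exception
        self.dest = dest       # register number or memory address
        self.value = None
        self.ready = False     # "all data is in"

class ROB:
    def __init__(self, size=32):
        self.size = size
        self.entries = deque() # entries[0] is the head

    def allocate(self, itype, dest):
        # multi-issue is just calling this more than once per cycle
        # (and making the data paths wide enough to cope)
        if len(self.entries) >= self.size:
            return None        # ROB full: issue stalls
        e = ROBEntry(itype, dest)
        self.entries.append(e)
        return e

    def complete(self, entry, value):
        entry.value = value    # results may arrive out of order
        entry.ready = True

    def commit(self, regfile):
        # only the head may commit, and only once its data is in
        while self.entries and self.entries[0].ready:
            e = self.entries.popleft()
            regfile[e.dest] = e.value

    def flush(self):
        # branch mispredict / exception: erase the whole table
        self.entries.clear()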

scoreboarding on the other hand is an absolute nuisance:

http://home.deib.polimi.it/silvano/FilePDF/AAC/Lesson_4_ILP_PartII_Scoreboard.pdf

scoreboarding only works properly when register renaming
is deployed; it doesn't seem to lend itself to multi-issue,
it doesn't solve exception roll-backs, and in fact *causes*
exception problems, needing the concept of "imprecise"
exceptions (p16).

so my first question is: what am i missing about scoreboards?
they're apparently much simpler than the tomasulo algorithm
and reorder buffers, and also i understand that a CAM is not
needed, which will save power.

the second question is: are there known ways to reduce power
consumption of the Reorder Buffer? the four fields of a ROB
table entry are:

* instruction type (branch / store / ALU op / exception)
* destination (register number or memory address)
* value
* ready / completed status

the destination is what needs to be a CAM (key-value store),
as on every cycle it is necessary to check if the source
operands are depending on *any one* of the entries in the
ROB table which have outstanding execution to be carried out.

the one i thought of is to have a bitfield representing every
available register (in the case of SV, that's 128 bits). if
a given register has a bit clear, it indicates that there
does not exist a ROB entry with that register as a destination,
indicating that it is possible to cut out the CAM lookup step.

so... would it be unreasonable to simply maintain a table
of register-to-ROB# entries? if there are 128 destination
registers, and the ROB table is 32 entries, would a table
128 deep and 5 bits wide do, as a way to not have a CAM?
is it really that simple?
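
to make that concrete, a toy model of the non-CAM lookup (python;
names invented, behaviour only):

NREGS = 128    # SV register count
ROBSIZE = 32   # so the table is 128 deep, 5 bits wide, plus a valid bit

class RegToROB:
    def __init__(self):
        self.robnum = [0] * NREGS      # 5-bit ROB# per register
        self.valid = [False] * NREGS   # the 128-bit "has a ROB entry" bitfield

    def issue_write(self, rd, rob_index):
        # a later in-flight write to the same rd just overwrites the
        # mapping: new readers only care about the *most recent* writer
        assert 0 <= rob_index < ROBSIZE
        self.robnum[rd] = rob_index
        self.valid[rd] = True

    def lookup_src(self, rs):
        # valid bit clear => no outstanding ROB entry writes rs, so the
        # value comes straight from the register file (no CAM lookup)
        if self.valid[rs]:
            return ('rob', self.robnum[rs])
        return ('regfile', rs)

    def commit(self, rd, rob_index):
        # clear the mapping only if this ROB entry is still the latest
        # writer of rd (a younger write may have taken the slot over)
        if self.valid[rd] and self.robnum[rd] == rob_index:
            self.valid[rd] = False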


thirdly - this may take some explaining - i'm looking to map
predicated SIMD down onto the Reorder Buffer, which will be
an important aspect of SV, and all other implementation
schemes are potentially problematic.

SV has a register "redirection" table, where a 5-bit standard
*scalar* operation may be "redirected" to a 7-bit *actual*
register. this ensures that the standard RISC-V scalar
operations need not be modified at all.

however, there is no restriction on the redirection: x5 could
be redirected to x112, *and x30 could be as well*.

more than that, it is possible to request that when for example
using x5, the operand bitwidth is changed to e.g. 16-bit,
and when using for example x30, the operand bitwidth is changed
to e.g. 32-bit *and is offset to target the UPPER half of
register x112*.

thus, it is not the *full* width of the register that's being
used, and, consequently, if byte-masking is passed all the way
down to the register file (byte-write-enable as opposed to just
reg-write-enable), there may be an opportunity to even pass
that byte-masking into SIMD-style ALUs:

https://libre-riscv.org/3d_gpu/rob_bytemask.jpg

in the above image, there is the possibility of having *three*
entries for the exact same destination register, because all
three byte-masks do not overlap.

in this way it would be possible to use SIMD-style ALUs to
carry out multiple lower-bitwidth operations.
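
the non-overlap check itself is trivial; a sketch (python,
illustrative only):

def masks_disjoint(existing_masks, new_mask):
    # existing_masks: byte-enable masks of ROB entries already targeting
    # this destination register; new_mask: the candidate new entry's mask
    combined = 0
    for m in existing_masks:
        combined |= m
    return (combined & new_mask) == 0

# e.g. byte-masks for bytes 0-1, 2-3 and 4-7 of the same 64-bit register
# can all coexist as separate ROB entries:
assert masks_disjoint([0b00000011, 0b00001100], 0b11110000)
# overlapping masks cannot:
assert not masks_disjoint([0b00000011], 0b00000010)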

before going ahead with an experimental implementation, i thought
it might be a really really good idea to see if anyone has any
insights or caveats that would make this byte-masking impractical
or unworkable.

thoughts appreciated.

l.

MitchAlsup

Dec 5, 2018, 12:17:12 PM
Renaming and a ROB fixes this set of issues.
>
> so my first question is: what am i missing about scoreboards?

The entire scoreboard for the CDC 6600 was smaller than one single
reservation station entry in the IBM 360/91!

> they're apparently much simpler than the tomasulo algorithm
> and reorder buffers, and also i understand that a CAM is not
> needed, which will save power.

Scoreboards rename the dataflow into the function units while reservation
stations rename into the reorder buffer (or physical register file.)
Since a function unit knows its own "name", it can broadcast the "tag"
with a single wire which can go to an AND gate per scoreboard entry,
and if both are set, this operand has become ready.

You can do conceptually the same thing with a <really narrow> CAM.
>
> the second question is: are there known ways to reduce power
> consumption of the Reorder Buffer? the four fields of a ROB
> table entry are:
>
> * instruction type (branch / store / ALU op / exception)
> * destination (register number or memory address)
> * value
> * ready / completed status

There is a way to build the register file called "Physical Register
File" where the architectural registers are merged with the in-flight
registers into one single rename pool. This gets rid of moving the
ROB entry to the RF when the instruction has been committed.
>
> the destination is what needs to be a CAM (key-value store),
> as on every cycle it is necessary to check if the source
> operands are depending on *any one* of the entries in the
> ROB table which have outstanding execution to be carried out.
>
> the one i thought of is to have a bitfield representing every
> available register (in the case of SV, that's 128 bits). if
> a given register has a bit clear, it indicates that there
> does not exist a ROB entry with that register as a destination,
> indicating that it is possible to cut out the CAM lookup step.
>
> so... would it be unreasonable to simply maintain a table
> of register-to-ROB# entries? if there are 128 destination
> registers, and the ROB table is 32 entries, would a table
> 128 deep and 5 bits wide do, as a way to not have a CAM?
> is it really that simple?
>
This is computer architecture--it is never that simple. But it is
not far from being simple.
ALUs (including SIMD ALUs) are not expensive (in area or power).
What is expensive is routing of the data to them and routing it back.
>
> before going ahead with an experimental implementation, i thought
> it might be a really really good idea to see if anyone has any
> insights or caveats that would make this byte-masking impractical
> or unworkable.
>
> thoughts appreciated.

The inherent simplicity of the scoreboard saves power and area. The
CDC 6600 scoreboard controls the latches used to route data around
the machine. That is correct--latches, not flip-flops. Since a latch
can be as small as 3 gates and a flip-flop is generally 10 gates;
building with latches is a lot smaller than building with flip-flops.
It is a shame the current tools do not support that kind of design.

Forwarding can be added to the scoreboard--eliminating this delay
(compared to a CDC 6600 scoreboard.)

1-cycle integer can also be added to the scoreboard for back to back
calculations.

Branch prediction/recovery can be added to the scoreboard.

And finally, reservation stations can be made "value-free", reading
the register file after the instruction fires into execution.
>
> l.

EricP

Dec 5, 2018, 5:58:44 PM
Not quite. A scoreboard doesn't need a renamer.

The cause of imprecise exceptions and interrupts is that
the scoreboard allows out-of-order (OoO) writeback and commit.
It is hypothetically possible to build a scoreboard that stalled
writeback until all prior instructions had completed and thus have
precise interrupts, but that would very much limit the parallel execution
(or require a huge number of function units to hold finished results
until writeback was allowed).

> so my first question is: what am i missing about scoreboards?
> they're apparently much simpler than the tomasulo algorithm
> and reorder buffers, and also i understand that a CAM is not
> needed, which will save power.

A scoreboard is a kind of OoO scheduler that operates
by tracking RAW, WAR and WAW dependencies.
RAW are prevented by stalling the read until prior writeback is performed.
WAR are prevented by stalling the writeback until prior read is performed.
WAW are prevented by stalling the issue until prior writeback is performed.

A scoreboard is limited to looking only at the decoded instruction
and the state of the outstanding registers and function units.
If an instruction stalls for WAW then all instructions after it stall.
If an instruction stalls for RAW then it waits in the function unit,
preventing that FU from being used. If a subsequent instruction
needs the same FU then it and all subsequent instructions stall.
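
As a rough illustration of those three rules (a Python toy model with
invented names, nothing like the real 6600 logic):

def can_issue(dest, pending_write):
    # WAW: stall issue until the prior write to dest has been performed
    return dest not in pending_write

def can_read_operands(srcs, pending_write):
    # RAW: stall the operand read until every prior write to the source
    # registers has been performed
    return all(s not in pending_write for s in srcs)

def can_writeback(dest, pending_read):
    # WAR: stall the writeback until every prior read of dest has happened
    return dest not in pending_read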

(Compare the scoreboard scheduler to a full OoO scheduler described below)

The renamer + physical register file (PRF) eliminates the need to
check for WAR and WAW dependencies because it allocates a new
physical register for each write. It is also one of the two
components that allows out-of-order execution with precise interrupts,
the other component being the Instruction Queue (IQ) aka ROB.

You could have a simple scoreboard scheduler with a simple renamer and
PRF but you would still have imprecise exceptions. It would have
slightly better parallelism than the original scoreboard
because it would not stall for WAR and WAW hazards.
But without an IQ/ROB it would have to park RAW pending instructions
in the FU, and that could stall subsequent instruction decoding.

It is when you also have an IQ/ROB with commit logic that
updates the renamer map that you get precise interrupts.
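
A minimal sketch of that renamer + PRF + commit interplay (Python,
illustrative only; free-list recovery and other details are glossed over):

class Renamer:
    def __init__(self, nregs=32, nphys=64):
        self.spec_map = list(range(nregs))     # front-end (speculative) map
        self.commit_map = list(range(nregs))   # precise architectural map
        self.free = list(range(nregs, nphys))  # unallocated physical regs

    def rename(self, rd, srcs):
        phys_srcs = [self.spec_map[s] for s in srcs]
        new_pd = self.free.pop(0)      # fresh phys reg per write:
        old_pd = self.spec_map[rd]     # no WAR or WAW possible
        self.spec_map[rd] = new_pd
        return phys_srcs, new_pd, old_pd

    def commit(self, rd, new_pd, old_pd):
        # called by the IQ/ROB, strictly in program order: this is what
        # makes the interrupts precise
        self.commit_map[rd] = new_pd
        self.free.append(old_pd)       # prior mapping can be recycled

    def recover(self):
        # exception / mispredict: throw away speculative mappings
        # (free-list repair omitted in this toy)
        self.spec_map = list(self.commit_map)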

I draw a distinction between an IQ and a ROB in that a ROB is
an IQ with extra features, typically the scheduling information
and sometimes it holds uncommitted result values.
The IQ is the part that is necessary to track the instruction order
and drive the commit logic giving precise interrupts.

A full OoO scheduler is not limited to looking at just the decoded
instruction. The decoded instructions are transferred to the ROB
and then the OoO scheduler looks at the whole ROB at once.
Only when all an instruction's operands are ready does it dispatch
it to the FU. Since it does not use a FU to wait for RAW dependencies
(as the scoreboard does), it can use the FU in any order.
But that makes the OoO scheduler more complex.
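
The select step of such a scheduler, as a Python toy (invented names):

def select(waiting, ready_tags, free_fus):
    # waiting: list of dicts {'srcs': [...], 'fu_kind': ...} in the ROB;
    # ready_tags: set of operand tags already produced;
    # free_fus: dict of fu_kind -> number of free units this cycle
    dispatched = []
    for entry in list(waiting):
        if all(s in ready_tags for s in entry['srcs']) \
                and free_fus.get(entry['fu_kind'], 0) > 0:
            free_fus[entry['fu_kind']] -= 1
            waiting.remove(entry)      # dispatch in any order, any free FU
            dispatched.append(entry)
    return dispatched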

> the second question is: are there known ways to reduce power
> consumption of the Reorder Buffer? the four fields of a ROB
> table entry are:
>
> * instruction type (branch / store / ALU op / exception)
> * destination (register number or memory address)
> * value
> * ready / completed status
>
> the destination is what needs to be a CAM (key-value store),
> as on every cycle it is necessary to check if the source
> operands are depending on *any one* of the entries in the
> ROB table which have outstanding execution to be carried out.

There are lots of papers on these subjects online
exploring various design options.

This one walks through a whole OoO x86 design in detail:

A Superscalar Out-of-Order x86 Soft Processor for FPGA 2017
http://www.stuffedcow.net/files/henry-thesis-phd.pdf

Here is a list of other papers I found on schedulers, ROB, etc.
they may or may not be helpful.
You'll have to hunt them down on Google yourself.
I also find it interesting to check the citation links Google tracks
as that often leads to related papers on the same subject.
--------------------------------------------------------------

The Design Space of Register Renaming Techniques, Sima 2000

Implementing Precise Interrupts in Pipelined Processors 1988

Complexity and Correctness of a Super-Pipelined Processor 2005

Complexity and Correctness of Computer Architectures 1996

Complexity Effective Superscalar Processors 1997

Quantifying the Complexity of Superscalar Processors 1997

Super-Scalar Processor Design CSL-TR-89-383 1989

The Microarchitecture of Superscalar Processors 1995

A high-speed dynamic instruction scheduling scheme
for superscalar processors 2001

A Low-Complexity Issue Logic 2000

A Reorder Buffer Design for High Performance Processors 2012

An oldest-first selection logic implementation for non-compacting
issue queues 2002

Circuits for wide-window superscalar processors 2000

Delay Evaluation of Issue Queue in Superscalar Processors 2012

Design and Evaluation of a RISC Processor
with a Tomasulo Scheduler 1999

Direct instruction wakeup for out-of-order processors 2004

Efficient Dynamic Scheduling through Tag Elimination 2002

Instruction Issue Logic for High-performance, Interruptable
Pipelined Processors 1987

Instruction Issue Logic in Pipelined Supercomputers 1984

Issue logic for a 600-MHz Out-of-Order execution microprocessor 1998

Low-complexity reorder buffer architecture 2002

Macro-op Scheduling Relaxing Scheduling Loop Constraints 2003

Making the Original Scoreboard Mechanism Deadlock Free 1996

Matrix Scheduler Reloaded 2007

On Pipelining Dynamic Instruction Scheduling Logic 2000

Patent US20120278593 Low complexity out-of-order issue
logic using static circuits 2012

Patent US6557095 Scheduling operations using a dependency matrix 1999

Patent US6988185 Select-free dynamic instruction scheduling 2006

Reducing the complexity of the issue logic 2001

Select Free Instruction Scheduling Logic 2005

Simplifying instruction issue logic in superscalar processors 2002

The Impact of Hardware Scheduling Mechanisms
on the Performance and Cost of Processor Designs 1999

Understanding Scheduling Replay Schemes 2004



Ivan Godard

Dec 5, 2018, 7:58:03 PM
On 12/5/2018 2:58 PM, EricP wrote:

<an excellent explanation of scoreboard vs. OOO snipped>

Another way to get precise exceptions without OOO overhead is to use
result replay rather than issue replay. Issue replay is the usual way:
track which operations have been issued but not yet in-order retired by
keeping an instruction pointer to the last fully-retired operation. On
exception, clear the machine state and reset the IP to the tracking
address, thereby discarding all in-flight operations, even those that
have completed but not yet retired, and reissue them.

In result replay, all operations in-flight at the exception are allowed
to complete (but not retire) and their results are saved. The hardware
tracks which operations have not yet issued; issue is not necessarily
in-order. When the exception handler returns only those operations not
already issued before the exception are issued; there is no re-issue of
in-flights. However, results are replayed, so that the results of both
issue-before and issue-after operations are interleaved so that the
ordering is the same as if the exception had not happened at all.

Result replay saves the pipeline drain and restart inherent in issue
replay, and obviates the IQ, rename registers, and ROB required by OOO,
but requires the ability to buffer and save the results of in-flight
ops. It also requires special handling for ops (like loads) which cannot
complete until the exception has been handled, and to deal with when
several in-flight ops independently get exceptions (also a problem for OOO).

The Mill is the only architecture I know of using result replay
throughout. Quite a few ISAs use it for some special ops. For example,
restartable string ops can be seen as an instance of result replay.

lk...@lkcl.net

Dec 6, 2018, 3:52:05 AM
On Wednesday, December 5, 2018 at 10:58:44 PM UTC, EricP wrote:
> lk...@lkcl.net wrote:
> > scoreboarding only works properly when register renaming
> > is deployed; it doesn't seem to lend itself to multi-issue,
> > it doesn't solve exception roll-backs, and in fact *causes*
> > exception problems, needing the concept of "imprecise"
> > exceptions (p16).
>
> Not quite. A scoreboard doesn't need a renamer.
>
> The cause of imprecise exceptions and interrupts is that
> the scoreboard allows out-of-order (OoO) writeback and commit.

understood.

> It is hypothetically possible to build a scoreboard that stalled
> writeback until all prior instructions had completed and thus have
> precise interrupts, but that would very much limit the parallel execution
> (or require a huge number of function units to hold finished results
> until writeback was allowed).

[or a reorder buffer to hold them... which introduces extra latency.
is that worth it, for the simplicity that a ROB brings? quite likely!]

in the LibreRISCV SoC/GPU, the general plan is to have a lot of function
units anyway, as it is a multi-issue design. can we make simpler FUs
(taking less silicon) that take longer to complete, and have more of them?
maybe: that means a trade-off if doing operand-forwarding, as the routing
between FUs for op-forwarding would need to be increased.

why is this so hard to evaluate! :)

> The renamer + physical register file (PRF) eliminates the need to
> check for WAR and WAW dependencies because it allocates a new
> physical register for each write. It is also one of the two
> components that allows out-of-order execution with precise interrupts,
> the other component being the Instruction Queue (IQ) aka ROB.

i think this is a very important requirement. i found a video
online which describes a scheme to do register-renames:
https://www.youtube.com/watch?v=p4SdrUhZrBM

i took a screenshot of the "Register Aliasing Table", here:
https://libre-riscv.org/3d_gpu/rat_table.png

basically it contains two tables. the first is the RAT table:

Valid | Tag | Value
----- | ------- | ------------
Y/N | FUname | Reg contents

and the second looks remarkably similar to a Reservation Station,
and is associated with the Functional Unit:

<---- Src1 ---> | <---- Src2 ------> |
Reg# | Tag | Value | Reg# | Tag | Value |

so... um... if the RAT basically introduces the Reservation Station
concept in a different form, and a ROB is needed to do precise exceptions
and rollback, then... um... what's the difference again? :)

> I draw a distinction between an IQ and a ROB in that a ROB is
> an IQ with extra features, typically the scheduling information
> and sometimes it holds uncommitted result values.

understood.

> There are lots of papers on these subjects online
> exploring various design options.
>
> This one walks through a whole OoO x86 design in detail:
>
> A Superscalar Out-of-Order x86 Soft Processor for FPGA 2017
> http://www.stuffedcow.net/files/henry-thesis-phd.pdf

wowww, figure 5.1 - 7R3W regfile. zowee!
very valuable resource, thank you eric.

> Here is a list of other papers I found on schedulers, ROB, etc.
> they may or may not be helpful.
> You'll have to hunt them down on Google yourself.
> I also find it interesting to check the citation links Google tracks
> as that often leads to related papers on the same subject.

appreciated. i'm currently using the technique of doing google
image searches for phrases "register rename RAT table" and so on,
as i find that produces more in the way of visual diagrams, which
seem to sink in better than words, for me.

the list you gave is an extremely valuable alternative starting
point to explore, so thank you.

l.

lk...@lkcl.net

Dec 6, 2018, 4:36:59 AM
this seems to be the consensus. and, if that youtube video
is correct, adding those basically gives pretty much identical
functionality to Tomasulo+ROB.

> >
> > so my first question is: what am i missing about scoreboards?
>
> The entire scoreboard for the CDC 6600 was smaller than one single
> reservation station entry in the IBM 360/91!

woooow. well, thinking about it, that's no surprise: a reservation
station is effectively holding all the operand values so that
register-renaming is provided / mitigated.

so, if that's missed out, then yes.

> > they're apparently much simpler than the tomasulo algorithm
> > and reorder buffers, and also i understand that a CAM is not
> > needed, which will save power.
>
> Scoreboards rename the dataflow into the function units while reservation
> stations rename into the reorder buffer (or physical register file.)
> Since a function unit knows its own "name", it can broadcast the "tag"
> with a single wire which can go to an AND gate per scoreboard entry,
> and if both are set, this operand has become ready.

ahh of course: the difference being, the Tomasulo Algorithm uses
the ROB# and a broadcast bus. so, TSA+ROB goes like this:

* Dest# goes into ROB
* ROB# goes into RSs (plural) and LOADQ
* Result gets broadcast and goes straight into *all* RS's and Qs

where SBd goes:

* FUs wait for SRCregs
* FUs put Dest *BACK* into RegFile (or RAT table)
* RegFile values updated in SBd entries (plural)
* repeat

the step "FUs put Dest back into RegFile" is where "forwarding"
is often brought in, to reduce the latency associated with that,
which implies an absolutely massive array of bypass routing
between absolutely every single FU's outputs and absolutely
every single FU's input stages.


> You can do conceptually the same thing with a <really narrow> CAM.

1 bit, or 1 entry?

iinteresting... i wonder if it would be possible to
augment TSA+ROB to include references to the Functional Units?

so, the ROB table entry also contains scoreboard back-references
(Qj, Qk) to the Functional Units that will produce the result
they depend on? no - it's the Reservation Station, isn't it.
and that would make ROB# *equal* to the FU# if and only if
there are separate and distinct Reservation Stations associated
with each and every FU. i.e. each FU had its own RS.

hm i'm going to have to draw that out, see if it works.

> >
> > the second question is: are there known ways to reduce power
> > consumption of the Reorder Buffer? the four fields of a ROB
> > table entry are:
> >
> > * instruction type (branch / store / ALU op / exception)
> > * destination (register number or memory address)
> > * value
> > * ready / completed status
>
> There is a way to build the register file called "Physical Register
> File" where the architectural registers are merged with the in-flight
> registers into one single rename pool. This gets rid of moving the
> ROB entry to the RF when the instruction has been committed.

iiinterestiing... and it would require the number of ROB entries to
be equal/greater than the number of architectural registers.

> > > so... would it be unreasonable to simply maintain a table
> > of register-to-ROB# entries? if there are 128 destination
> > registers, and the ROB table is 32 entries, would a table
> > 128 deep and 5 bits wide do, as a way to not have a CAM?
> > is it really that simple?
> >
> This is computer architecture--it is never that simple. But it is
> not far from being simple.

:)

i realised that the proposed table (register-to-ROB#) may not
be a one-to-one and onto (i.e. a permutation table). or, more
to the point, i have not yet determined *if* it is a one-to-one
and onto relationship.

in other words, what concerns me is where two instructions are
issued which write to the same destination register. now, whoops:
actually you need register-to-ROB+ROB(+ROB)?

have to think that one through.

> > in this way it would be possible to use SIMD-style ALUs to
> > carry out multiple lower-bitwidth operations.
>
> ALUs (including SIMD ALUs) are not expensive (in area or power).
> What is expensive is routing of the data to them and routing it back.

yeah, am finding that out...

> > before going ahead with an experimental implementation, i thought
> > it might be a really really good idea to see if anyone has any
> > insights or caveats that would make this byte-masking impractical
> > or unworkable.
> >
> > thoughts appreciated.
>
> The inherent simplicity of the scoreboard saves power and area. The
> CDC 6600 scoreboard controls the latches used to route data around
> the machine. That is correct--latches, not flip-flops. Since a latch
> can be as small as 3 gates and a flip-flop is generally 10 gates;
> building with latches is a lot smaller than building with flip-flops.

that's a *really* compelling argument to continue investigating

> It is a shame the current tools do not support that kind of design.

interestingly it turns out that yosys had support added for
detection and generation of d-latches, in 2015:
https://github.com/YosysHQ/yosys/commit/554a8df5e2e7c750b76021821bdf2e07970b9dbf

yosys is a libre tool that turns verilog into BLIF. one of the
command-line options is *specifically* to disable d-latch creation.

> Forwarding can be added to the scoreboard--eliminating this delay
> (compared to a CDC 6600 scoreboard.)
>
> 1-cycle integer can also be added to the scoreboard for back to back
> calculations.
>
> Branch prediction/recovery can be added to the scoreboard.
>
> And finally, reservation stations can be made "value-free", reading
> the register file after the instruction fires into execution.

ah of course. hm... how would it be ensured that the value being
read is the correct one? one of the key benefits of Tomasulo is
that reg-renaming is avoided (or, "provided") precisely because
the relevant Reservation Station(s) contain *copies* of the (old)
value.

l.

adaptive...@gmail.com

Dec 6, 2018, 5:14:19 AM
Let us see a course example from 1998:
https://people.eecs.berkeley.edu/~pattrsn/252S98/Lec04-tomosulo.pdf

Best,
S.Takano

lk...@lkcl.net

Dec 6, 2018, 7:12:02 AM
ah! good find. very useful comparison of both algorithms.

slide 18: ah. the 6600 scoreboard algorithm cannot do
multiple LOAD operations. that's a show-stopper for the
SimpleV SoC: it is absolutely critical that multiple LOAD
(and STORE) operations be issued and carried out.

does anyone know if, since the 6600, this has been enhanced /
augmented, to make multiple LOAD operations possible?

l.

MitchAlsup

Dec 6, 2018, 12:02:39 PM
On Wednesday, December 5, 2018 at 6:58:03 PM UTC-6, Ivan Godard wrote:
> On 12/5/2018 2:58 PM, EricP wrote:
>
> <an excellent explanation of scoreboard vs. OOO snipped>
>
> Another way to get precise exceptions without OOO overhead is to use
> result replay rather than issue replay. Issue replay is the usual way:
> track which operations have been issued but not yet in-order retired by
> keeping an instruction pointer to the last fully-retired operation. On
> exception, clear the machine state and reset the IP to the tracking
> address, thereby discarding all in-flight operations, even those that
> have completed but not yet retired, and reissue them.

There is another way to get precise ordering of the writes in a scoreboard.
First, one has to implement forwarding in the scoreboard.
Second, the function units need an output queue <of say 4 registers>
Now, one can launch an instruction and pick up its operand either
from the RF or from the function unit output while the result sits
in the function unit waiting for its GO_Write signal.

Thus the launching of instructions is not delayed due to hazards
but the results are delivered to the RF in program order.

This looks surprisingly like a 'belt' at the end of the function unit.
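
A toy model of the in-order delivery part (Python, purely illustrative;
not the actual GO_Write logic):

def drain_in_order(fu_queues, next_seq, regfile):
    # fu_queues: dict fu_name -> list of (seq, reg, value), oldest first,
    # where seq is the program-order number of the producing instruction.
    # Results only leave their FU output queue when it is their turn.
    progressed = True
    while progressed:
        progressed = False
        for q in fu_queues.values():
            if q and q[0][0] == next_seq:      # its GO_Write turn
                _, reg, value = q.pop(0)
                regfile[reg] = value
                next_seq += 1
                progressed = True
    return next_seq

def forward(fu_queues, reg):
    # a dependent instruction may pick up the youngest queued value for
    # 'reg' before it has been written back (returns None: read the RF)
    best = None
    for q in fu_queues.values():
        for seq, r, v in q:
            if r == reg and (best is None or seq > best[0]):
                best = (seq, v)
    return None if best is None else best[1]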

MitchAlsup

Dec 6, 2018, 12:08:49 PM
On Thursday, December 6, 2018 at 2:52:05 AM UTC-6, lk...@lkcl.net wrote:
> On Wednesday, December 5, 2018 at 10:58:44 PM UTC, EricP wrote:
> > lk...@lkcl.net wrote:
> > > scoreboarding only works properly when register renaming
> > > is deployed; it doesn't seem to lend itself to multi-issue,
> > > it doesn't solve exception roll-backs, and in fact *causes*
> > > exception problems, needing the concept of "imprecise"
> > > exceptions (p16).
> >
> > Not quite. A scoreboard doesn't need a renamer.
> >
> > The cause of imprecise exceptions and interrupts is that
> > the scoreboard allows out-of-order (OoO) writeback and commit.
>
> understood.
>
> > It is hypothetically possible to build a scoreboard that stalled
> > writeback until all prior instructions had completed and thus have
> > precise interrupts, but that would very much limit the parallel execution
> > (or require a huge number of function units to hold finished results
> > until writeback was allowed).
>
> [or a reorder buffer to hold them... which introduces extra latency.
> is that worth it, for the simplicity that a ROB brings? quite likely!]
>
> in the LibreRISCV SoC/GPU, the general plan is to have a lot of function
> units anyway, as it is a multi-issue design. can we make simpler FUs
> (taking less silicon) that take longer to complete, and have more of them?
> maybe: that means a trade-off if doing operand-forwarding, as the routing
> between FUs for op-forwarding would need to be increased.

If you can get the tools to allow you to deposit FU results into a latch
array, it will be profitable to have the FU hold several <prior> results.
If you have to do this with real flip-flops, the area will not be conducive.
>
> why is this so hard to evaluate! :)

Because the old guys that know how to do this have been driven away from
this NG due to the signal to noise ratio (which lately has been pretty
good.)
>
> > The renamer + physical register file (PRF) eliminates the need to
> > check for WAR and WAW dependencies because it allocates a new

Err, no. Renamer+PRF does not get rid of the need to check--it gets rid
of the hazard itself! {A not so subtle difference}

> > physical register for each write. It is also one of the two
> > components that allows out-of-order execution with precise interrupts,
> > the other component being the Instruction Queue (IQ) aka ROB.
>
> i think this is a very important requirement. i found a video
> online which describes a scheme to do register-renames:
> https://www.youtube.com/watch?v=p4SdrUhZrBM
>
> i took a screenshot of the "Register Aliasing Table", here:
> https://libre-riscv.org/3d_gpu/rat_table.png
>
> basically it contains two tables. the first is the RAT table:
>
> Valid | Tag | Value
> ----- | ------- | ------------
> Y/N | FUname | Reg contents
>
> and the second looks remarkably similar to a Reservation Station,
> and is associated with the Functional Unit:
>
> <---- Src1 ---> | <---- Src2 ------> |
> Reg# | Tag | Value | Reg# | Tag | Value |
>
> so... um... if the RAT basically introduces the Reservation Station
> concept in a different form, and a ROB is needed to do precise exceptions
> and rollback, then... um... what's the difference again? :)
>
> > I draw a distinction between an IQ and a ROB in that a ROB is
> > an IQ with extra features, typically the scheduling information
> > and sometimes it holds uncommitted result values.

I have a different, and arguably better PRF+renamer, which takes zero
cycles to backup.

MitchAlsup

Dec 6, 2018, 12:19:04 PM
At much smaller gate costs.
The thing about renaming into FUs is that it is straightforward to add
an order (count) to the FU, whereas it is not easy to do that with a renamer.
So let's say the Multiplier was FU=5 and had 4 results. Each time an
instruction is put in the queue towards FU=5 the count is incremented
and dependent instructions get the fact that X7 is going to come from
FU=5;cnt=2.
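
In toy form (Python, invented names, just the bookkeeping):

fu_counts = {}   # fu_name -> results queued so far
producer = {}    # architectural reg -> (fu_name, count) tag

def enqueue(fu_name, dest_reg):
    cnt = fu_counts.get(fu_name, 0)
    fu_counts[fu_name] = cnt + 1
    producer[dest_reg] = (fu_name, cnt)   # e.g. X7 <- (FU=5, cnt=2)
    return (fu_name, cnt)

def source_tag(src_reg):
    # a dependent instruction waits on this tag instead of a ROB number
    return producer.get(src_reg)          # None => value is in the RF
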
>
>
> > You can do conceptually the same thing with a <really narrow> CAM.
>
> 1 bit, or 1 entry?
>
> iinteresting... i wonder if it would be possible to
> augment TSA+ROB to include references to the Functional Units?

The conversion only works for 1,2, and 3-bits. At 4-bits it becomes
less expensive to use the CAM.
>
> so, the ROB table entry also contains scoreboard back-references
> (Qj, Qk) to the Functional Units that will produce the result
> they depend on? no - it's the Reservation Station, isn't it.
> and that would make ROB# *equal* to the FU# if and only if
> there are separate and distinct Reservation Stations associated
> with each and every FU. i.e. each FU had its own RS.
>
> hm i'm going to have to draw that out, see if it works.

I have a 40-page chapter in a book I wrote explaining the CDC 6600
scoreboard and how to bring it into the modern world.
>
> > >
> > > the second question is: are there known ways to reduce power
> > > consumption of the Reorder Buffer? the four fields of a ROB
> > > table entry are:
> > >
> > > * instruction type (branch / store / ALU op / exception)
> > > * destination (register number or memory address)
> > > * value
> > > * ready / completed status
> >
> > There is a way to build the register file called "Physical Register
> > File" where the architectural registers are merged with the in-flight
> > registers into one single rename pool. This gets rid of moving the
> > ROB entry to the RF when the instruction has been committed.
>
> iiinterestiing... and it would require the number of ROB entries to
> be equal/greater than the number of architectural registers.

Entries = architectural+dynamic.

Where dynamic is essentially the count in the ROB.
>
> > > so... would it be unreasonable to simply maintain a table
> > > of register-to-ROB# entries? if there are 128 destination
> > > registers, and the ROB table is 32 entries, would a table
> > > 128 deep and 5 bits wide do, as a way to not have a CAM?
> > > is it really that simple?
> > >
> > This is computer architecture--it is never that simple. But it is
> > not far from being simple.
>
> :)
>
> i realised that the proposed table (register-to-ROB#) may not
> be a one-to-one and onto (i.e. a permutation table). or, more
> to the point, i have not yet determined *if* it is a one-to-one
> and onto relationship.
>
> in other words, what concerns me is where two instructions are
> issued which write to the same destination register. now, whoops:
> actually you need register-to-ROB+ROB(+ROB)?
>
> have to think that one through.

The only issue is that they have to be delivered in order to the RF.

Which reminds me of the RF in the 88110. Each RF entry had 2 flip-flops
and control logic allowed 1-level of OoO-ness. For a 2-wide machine
this worked rather well.
>
> > > in this way it would be possible to use SIMD-style ALUs to
> > > carry out multiple lower-bitwidth operations.
> >
> > ALUs (including SIMD ALUs) are not expensive (in area or power).
> > What is expensive is routing of the data to them and routing it back.
>
> yeah, am finding that out...
>
> > > before going ahead with an experimental implementation, i thought
> > > it might be a really really good idea to see if anyone has any
> > > insights or caveats that would make this byte-masking impractical
> > > or unworkable.
> > >
> > > thoughts appreciated.
> >
> > The inherent simplicity of the scoreboard saves power and area. The
> > CDC 6600 scoreboard controls the latches used to route data around
> > the machine. That is correct--latches, not flip-flops. Since a latch
> > can be as small as 3 gates and a flip-flop is generally 10 gates;
> > building with latches is a lot smaller than building with flip-flops.
>
> that's a *really* compelling argument to continue investigating
>
> > It is a shame the current tools do not support that kind of design.
>
> interestingly it turns out that yosys had support added for
> detection and generation of d-latches, in 2015:
> https://github.com/YosysHQ/yosys/commit/554a8df5e2e7c750b76021821bdf2e07970b9dbf
>
Good to know.

> yosys is a libre tool that turns verilog into BLIF. one of the
> command-line options is *specifically* to disable d-latch creation.
>
> > Forwarding can be added to the scoreboard--eliminating this delay
> > (compared to a CDC 6600 scoreboard.)
> >
> > 1-cycle integer can also be added to the scoreboard for back to back
> > calculations.
> >
> > Branch prediction/recovery can be added to the scoreboard.
> >
> > And finally, reservation stations can be made "value-free", reading
> > the register file after the instruction fires into execution.
>
> ah of course. hm... how would it be ensured that the value being
> read is the correct one? one of the key benefits of Tomasulo is
> that reg-renaming is avoided (or, "provided") precisely because
> the relevant Reservation Station(s) contain *copies* of the (old)
> value.

Perhaps you should read the chapter I wrote.
>
> l.

MitchAlsup

Dec 6, 2018, 12:21:19 PM
This is a paper that, in my opinion, is overly critical of the Scoreboard
and does nothing to ameliorate its deficiencies.

Whereas, they are perfectly happy to augment the Tomasulo mechanism with
multiple <common> data busses.

That is one of the reasons I wrote that chapter, which debunks a large
portion of that paper.

MitchAlsup

Dec 6, 2018, 12:25:00 PM
On Thursday, December 6, 2018 at 6:12:02 AM UTC-6, lk...@lkcl.net wrote:
> On Thursday, December 6, 2018 at 10:14:19 AM UTC, adaptive...@gmail.com wrote:
> > Let us see course example @1998,
> > https://people.eecs.berkeley.edu/~pattrsn/252S98/Lec04-tomosulo.pdf
> >
> > Best,
> > S.Takano
>
> ah! good find. very useful comparison of both algorithms.
>
> slide 18: ah. the 6600 scoreboard algorithm cannot do
> multiple LOAD operations. that's a show-stopper for the
> SimpleV SoC: it is absolutely critical that multiple LOAD
> (and STORE) operations be issued and carried out.

CDC 6600 scoreboard could have 2 memory references in the pre AGEN
stage. After AGEN, there were 5 LD address buffers and 2 ST address
buffers. It is all there in the Thornton book in great detail.

The inability to AGEN was the only limitation, not the passage
of the mem refs through the memory system.
>
> does anyone know if, since the 6600, this has been enhanced /
> augmented, to make multiple LOAD operations possible?

I probably know as much or more about scoreboards than anyone alive.
Including the actual gate count of the CDC 6600 Scoreboard.
>
> l.

lk...@lkcl.net

Dec 6, 2018, 1:48:27 PM
On Thursday, December 6, 2018 at 5:19:04 PM UTC, MitchAlsup wrote:

> > ah of course. hm... how would it be ensured that the value being
> > read is the correct one? one of the key benefits of Tomasulo is
> > that reg-renaming is avoided (or, "provided") precisely because
> > the relevant Reservation Station(s) contain *copies* of the (old)
> > value.
>
> Perhaps you should read the chapter I wrote.

i have a responsibility to make sure i evaluate things properly,
so that's probably a good idea (understatement... :) is it
available somewhere, or would it be ok to email it to me
(luke.l...@gmail.com) ?

l.

Ivan Godard

Dec 6, 2018, 6:34:30 PM
On 12/6/2018 9:02 AM, MitchAlsup wrote:
> On Wednesday, December 5, 2018 at 6:58:03 PM UTC-6, Ivan Godard wrote:
>> On 12/5/2018 2:58 PM, EricP wrote:
>>
>> <an excellent explanation of scoreboard vs. OOO snipped>
>>
>> Another way to get precise exceptions without OOO overhead is to use
>> result replay rather than issue replay. Issue replay is the usual way:
>> track which operations have been issued but not yet in-order retired by
>> keeping an instruction pointer to the last fully-retired operation. On
>> exception, clear the machine state and reset the IP to the tracking
>> address, thereby discarding all in-flight operations, even those that
>> have completed but not yet retired, and reissue them.
>
> There is another way to get precise ordering of the writes in a scoreboard.
> First, one has to implement forwarding in the scoreboard.
> Second, the function units need an output queue <of say 4 registers>
> Now, one can launch an instruction and pick up its operand either
> from the RF or from the function unit output while the result sits
> in the function unit waiting for its GO_Write signal.
>
> Thus the launching of instructions is not delayed due to hazards
> but the results are delivered to the RF in program order.
>
> This looks surprisingly like a 'belt' at the end of the function unit.

The drawback to this is that there are a lot of FUs as you widen up the
machine, and due to limits on decode, issue, or retire rates it's not
common for them to have overlapping lifetimes. We take advantage of this
with the "slot" notion common in wide-issue machines, where a given
encoding position can reach only a subset of the FUs and only one of
those. Consequently the queue is per slot, not per FU, sharply reducing
the number of queues.

Per-slot result queues could be used by a dynamic scheduled wide machine
like the Itanium. They are even more advantageous in a statically
scheduled machine, because the slot's queue can be composed from the
per-latency latches of the FUs attached to that slot. Because only one
op can issue per slot per cycle, the queue can be a simple daisy chain
of latency latches and needs no interlocks or allocation/free logic.

Of course, a result in the queue in one slot is likely to be an argument
to an FU in a different slot. This is the same problem as getting a
datum between the int and FP regsets in a split architecture, or between
sides of a TI-style split VLIW. We once looked into having per-slot
belts with explicit inter-slot transfer ops (pretty much as you describe
for your multi-scoreboard, albeit static). However, having a
machine-wide (rather than slot-wide) operand namespace gave better
encoding once we had bit-aligned variable width encoding. We call it the
Belt (as opposed to the 'belt') :-)

MitchAlsup

Dec 6, 2018, 7:32:29 PM
Thank you for reminding me of this subtlety.

lk...@lkcl.net

Dec 7, 2018, 6:40:32 AM
On Thursday, December 6, 2018 at 12:58:03 AM UTC, Ivan Godard wrote:

> Result replay saves the pipeline drain and restart inherent in issue
> replay, and obviates the IQ, rename registers, and ROB required by OOO,

it's fascinating, isn't it? it's the next "logical step", to save on
resources: you already calculated the results, why throw them out??
the devil is in the details: *how* are the results stored, and guaranteed
to be unique / replayed?

l.

lk...@lkcl.net

Dec 7, 2018, 9:34:36 AM
so, i'm chewing through the chapters that mitch very kindly
sent me (off-ng). there's tons of gate diagrams! yay!
i understand gates.

one thing that caught my eye was the design of the 6600
register file. it's capable of doing a "pass-through",
i.e. the register is written... oh and the write may be
"passed through" directly to the read port (if there is
one).

this would allow "operand forwarding" automatically,
as part of the design, because any result register would
be "passed through" (forwarded) on to a src operand
*in the same clock cycle*.

this got me thinking that, functionality-wise, the 6600
register file may be directly equivalent to a "Common
Data Bus".

or, the other way round: the Common Data Bus of the Tomasulo
Algorithm *is* the operand-forwarding mechanism associated
with and augmenting/improving Scoreboard systems.
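
a toy model of that pass-through behaviour (python; invented names,
behavioural only, nothing to do with how the 6600 actually gates it):

class PassThroughRegFile:
    def __init__(self, nregs=128):
        self.regs = [0] * nregs

    def cycle(self, write=None, reads=()):
        # write: (regnum, value) or None; reads: regnums read this cycle.
        # a read of the register being written gets the *incoming* value,
        # i.e. the write is forwarded to the read port in the same cycle.
        results = []
        for r in reads:
            if write is not None and write[0] == r:
                results.append(write[1])   # pass-through / forwarding
            else:
                results.append(self.regs[r])
        if write is not None:
            self.regs[write[0]] = write[1] # the write still lands in the RF
        return results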

also, it seems to me that adding Reorder Buffers provides
register renaming, rollback and precise exceptions, even for
Scoreboard systems, and that register renaming which needs
to be added to Scoreboards results in a CAM plus src operand
buffers that look remarkably similar to Reservation Stations
and ROB dest tables.

at this point i am genuinely puzzled and confused as to
whether there is any difference - at all - between
Scoreboards+ROBs and Tomasulo+ROBs :)

at present, the unique feature of SV, which is the exclusion
of SIMD and its replacement with polymorphic bit-widths
and variable-length vectors that map down onto the "real"
register file(s) as opposed to having separate Vector Pipelines
and RFs, is throwing a massive spanner in the works.

where most architectures would have SIMD instructions and
be done with it, after seeing this
https://www.sigarch.org/simd-instructions-considered-harmful/
i never want to be involved with SIMD, ever again :)

as mentioned before, what i'd therefore like to do is to be able
to "merge" SIMD operations at the Reservation Station phase, by
spotting patterns in the bitmasks:

ROB#-dest | dest-bitmask | ROB#-src1/val | ROB#-src2/val
--------- | ------------ | ------------- | -----------
ROB5 | 0b11000000 | 0x59000000 | 0x12000000
ROB4 | 0b00110000 | 0x00130000 | 0x00540000

this to an 8-bit SIMD "Add" Functional Unit.

the RS would have built-in "mask" detection logic that noted that the
dest byte masks are non-overlapping, and would pass the *masked* src1
and src2 of *BOTH* rows of the RS down into the ALU.

on completion of the operation, the destination would be noted as
being different, and it would be necessary to issue *two* writes:
one to ROB4 and the other to ROB5. if however the ROB#s were the
same, only one write would be needed, and even the bitmasks merged.
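
roughly, the merge step i have in mind (python sketch; field names
invented, and the SIMD add is only modelled, not real partitioned-carry
hardware):

def byte_mask_to_bits(mask, nbytes=8):
    # 0b11000000 (byte-enables) -> 0xFFFF000000000000 (bit mask), 64-bit regs
    bits = 0
    for byte in range(nbytes):
        if mask & (1 << byte):
            bits |= 0xFF << (8 * byte)
    return bits

def simd_add_8bit(a, b):
    # stand-in for an 8-bit-lane SIMD adder: no carry between lanes
    out = 0
    for byte in range(8):
        lane = ((a >> (8 * byte)) + (b >> (8 * byte))) & 0xFF
        out |= lane << (8 * byte)
    return out

def try_merge(row_a, row_b):
    # rows are dicts: {'rob': n, 'mask': bytemask, 'src1': v, 'src2': v}
    if row_a['mask'] & row_b['mask']:
        return None                        # overlapping masks: cannot merge
    bits_a = byte_mask_to_bits(row_a['mask'])
    bits_b = byte_mask_to_bits(row_b['mask'])
    src1 = (row_a['src1'] & bits_a) | (row_b['src1'] & bits_b)
    src2 = (row_a['src2'] & bits_a) | (row_b['src2'] & bits_b)
    result = simd_add_8bit(src1, src2)     # one pass through the SIMD ALU
    # two writes back (or one, if the ROB#s happened to be the same)
    return [(row_a['rob'], result & bits_a),
            (row_b['rob'], result & bits_b)]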

detection of this merging at instruction-issue time may not be
possible, as it could hypothetically be *completely different*
instructions that supplied the different bytemasks.

in fact, i'm counting on precisely that, as a way to not need
8-bit-wide register routing and multiplexing etc. etc.: it would
be insane to have byte-wide crossbars all over the place.

so, as i can't think of any other architectural way to achieve the
same ability to merge non-overlapping SIMD ops, doing some voodoo
on the RS rows *requires* Reservation Stations, that in turn requires
ROB#s, and it all sort-of hangs together.

oh, the other thing: a 3D GPU + VPU needs an insane 256 (total)
64-bit registers. 128 for INT, 128 for FP. this makes it possible
to fit pixel "tiles" (as they're called) into the regfile, without
the performance / power hit of transferring data through the L1/L2
cache barrier (using LD/STs).

if we were to deploy the "usual" algorithm(s) associated with
Scoreboard register-renaming, it would require multiplying that
256-entry register file by say 2, 3 or even 4.

by contrast, the Reservation Stations (whether deployed for SB+ROB
or for TS+ROB) simply house the in-flight data, job done. oh,
and we would get the opportunity to pull that SIMD-style data-merging
trick.

the other thing: we considered it insane to have 2 128-entry
register files with 10R4W (or greater) porting, to cope with
the vectorised loads. instead, we came up with the idea to
"stripe" the register file (4 blocks), use much simpler 2R1W
standard cells, and to add muxers to allow all functional units
[eventual, non-simultaneous] access to all registers.

thinking this through a bit further, we could also "stripe" the
Functional Units, as well. i.e. to lay down 4 sets of ADDs,
4 sets of MULs, effectively breaking the register files into
4 blocks (or lanes).

extending this thought-experiment to its logical extreme, we
considered doing the same thing *to the Reorder Buffer*. i.e.
in effect, we place a hard requirement that the bottom 2 bits of the
ROB# *must* match and be equal to the Destination Register #'s bottom
2 bits.

even if we had 32 Reorder Buffer rows, this would reduce the ROB CAM
from its present (insane) 8-bit width (1 bit for INT/FP, 7 bits for
0-127 on the dest reg), down to a slightly-less mad 6 bits.
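
to pin the numbers down (python sketch; invented names):

NLANES = 4

def lane_of(reg):
    return reg & (NLANES - 1)   # bottom 2 bits of the dest reg pick the stripe

def stripe_cam_key(is_fp, reg):
    # within a stripe the bottom 2 bits of the reg # are implied by which
    # stripe the entry lives in, so the per-stripe "CAM" key is only
    # 1 (INT/FP) + 5 bits = 6 bits instead of 8
    return (int(is_fp) << 5) | (reg >> 2)

# e.g. an instruction writing x6 may only be allocated into stripe 2;
# if stripe 2's next entry is occupied, the other stripes get "holes"
# (marked !done), which is the cost of the scheme.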

the down-side: sequentially-issued instructions that by coincidence
happened to match in their destination register numbers modulo 4
(bottom 2 bits) would result in 3 "holes" in the Reorder Buffer.
given that there's flags "done/!done" already, that's not such a
big deal.

what we *hope* is, that SimpleV's design feature of issuing
sequentially-numbered sets of instructions would end up filling
the striped ROBs.

and, what's really *really* nice: 4-wide striping would effectively
mean 4-wide instruction issue. also, there isn't any connectivity,
no data paths, no dependencies between each of the ROB "stripes".

now, my only big concern here is: how do you deal with lane-crossing,
when a result generated for a ROB# (dest) in one of these stripes
(dest modulo 4 == 2) has to get over to a Reservation Station that
is in lane 1 (src reg modulo 4 == 1), for example?

and here is where i think the insight that "Common Data Bus
Equals Same As Pass-through / Register File Forwarding" comes in
to play.

recall that we plan to put in 4-way Muxes on the banks of the Register
File. this being an acceptable bottleneck under the circumstances.
if the pass-through mechanism *is* the CDB, then the write from one
"Lane" will go through the Mux, into the other Lane's Register File,
and from there it would need to be broadcast onto that Lane's CDB,
in order to update the Reservation Stations there.

this is however where i run into difficulties, as i've not yet
thought through the synchronisation or how to handle the case where
there happens *already* to be some reads taking place on the 2nd
Lane.

do we queue the new update so that it's handled cleanly on the next
cycle? that sounds like a recipe for disaster, resulting in
ROB# entries getting out of sync.

have a signal which stalls the lane-crossing write, until the
Read-Reg Bus of the target lane is free? that sounds like a
good way to create pipeline "Concertina" effects
(https://en.wikipedia.org/wiki/Accordion_effect)

conundrum!

thoughts appreciated.

MitchAlsup

Dec 7, 2018, 11:31:44 AM
On Friday, December 7, 2018 at 8:34:36 AM UTC-6, lk...@lkcl.net wrote:
> so, i'm chewing through the chapters that mitch very kindly
> sent me (off-ng). there's tons of gate diagrams! yay!
> i understand gates.
>
> one thing that caught my eye was the design of the 6600
> register file. it's capable of doing a "pass-through",
> i.e. the register is written... oh and the write may be
> "passed through" directly to the read port (if there is
> one).

ECL register file blocks also have this property.
>
> this would allow "operand forwarding" automatically,
> as part of the design, because any result register would
> be "passed through" (forwarded) on to a src operand
> *in the same clock cycle*.
>
> this got me thinking that, functionality-wise, the 6600
> register file may be directly equivalent to a "Common
> Data Bus".
>
> or, the other way round: the Common Data Bus of the Tomasulo
> Algorithm *is* the operand-forwarding mechanism associated
> with and augmenting/improving Scoreboard systems.
>
> also, it seems to me that adding Reorder Buffers provides
> register renaming, rollback and precise exceptions, even for
> Scoreboard systems, and that register renaming which needs
> to be added to Scoreboards results in a CAM plus src operand
> buffers that look remarkably similar to Reservation Stations
> and ROB dest tables.
>
> at this point i am genuinely puzzled and confused as to
> whether there is any difference - at all - between
> Scoreboards+ROBs and Tomasulo+ROBs :)

The only practical difference left is that the Scoreboard is a
centralized control block, while the ReservationStation is a
distributed logic block.
>
> at present, the unique feature of SV, which is the exclusion
> of SIMD and its replacement with polymorphic bit-widths
> and variable-length vectors that map down onto the "real"
> register file(s) as opposed to having separate Vector Pipelines
> and RFs, is throwing a massive spanner in the works.
>
> where most architectures would have SIMD instructions and
> be done with it, after seeing this
> https://www.sigarch.org/simd-instructions-considered-harmful/
> i never want to be involved with SIMD, ever again :)

I figured this out about 2003.
>
> as mentioned before, what i'd therefore like to do is to be able
> to "merge" SIMD operations at the Reservation Station phase, by
> spotting patterns in the bitmasks:
>
> ROB#-dest | dest-bitmask | ROB#-src1/val | ROB#-src2/val
> --------- | ------------ | ------------- | -----------
> ROB5 | 0b11000000 | 0x59000000 | 0x12000000
> ROB4 | 0b00110000 | 0x00130000 | 0x00540000
>
> this to an 8-bit SIMD "Add" Functional Unit.
>
> the RS would have built-in "mask" detection logic that noted that the dest byte masks are non-overlapping, and would pass the *masked* src1 and src2 of *BOTH*
> rows of the RS down into the ALU.
>
> on completion of the operation, the destination would be noted as
> being different, and it would be necessary to issue *two* writes:
> one to ROB4 and the other to ROB5. if however the ROB#s were the
> same, only one write would be needed, and even the bitmasks merged.
>
> detection of this merging at instruction-issue time may not be
> possible, as it could hypothetically be *completely different*
> instructions that issued the instructions with different bytemasks.

You said above::
> i never want to be involved with SIMD, ever again :)

And yet you are in effect doing SIMD to your register file!
>
> in fact, i'm counting on precisely that, as a way to not need
> 8-bit-wide register routing and multiplexing etc. etc., which
> would be insane to have byte-wide crossbars all over the place.
>
> so, as i can't think of any other architectural way to achieve the
> same ability to merge non-overlapping SIMD ops, doing some voodoo
> on the RS rows *requires* Reservation Stations, that in turn requires
> ROB#s, and it all sort-of hangs together.
>
> oh, the other thing: a 3D GPU + VPU needs an insane 256 (total)
> 64-bit registers. 128 for INT, 128 for FP. this makes it possible
> to fit pixel "tiles" (as they're called) into the regfile, without
> the performance / power hit of transferring data through the L1/L2
> cache barrier (using LD/STs).

A common register file with both int and float in the same file gets rid
of 1/2 of that insanity.
>
> if we were to deploy the "usual" algorithm(s) associated with
> Scoreboard register-renaming, it would require multiplying that
> 256-entry register file by say 2, 3 or even 4.

Or 32 in order to absorb memory latency.
>
> by contrast, the Reservation Stations (whether deployed for SB+ROB
> or for TS+ROB) simply house the in-flight data, job done. oh,
> and we would get the opportunity to pull that SIMD-style data-merging
> trick.

As I do now and again, I will warn you about the difficulties of
doing SIMD to your register file.
More like carborundum.
>
> thoughts appreciated.

Ivan Godard

Dec 7, 2018, 4:02:38 PM
How: You can't put them in any architecturally visible place, by
definition. So you have to put them in reserved memory. But that is way
too expensive, so you need some buffering, essentially a cache but
addressed using some kind of temporal addressing so you get them in the
right order at replay. So after an operand has worked its way up the
per-slot daisy chain and it's still alive (i.e. is still mapped by the
decoders) the chain essentially continues into spill buffers in front of
a configurable amount of SRAM. Add a FSM at the front of the SRAM to
merge the slots into the SRAM, and a rather normal looking cache
spill/fill FSM to connect the SRAM to the memory controllers. The spill
buffers look just like the inputs to an ordinary FU, so the operand
movement needs no special handling or hardware in the bypass crossbar.

At this point you can save and correctly replay over an exception. If
you mark the saved operands with a call-frame number too, then you can
handle nested exception. But exceptions are just "calls by other means",
so you can use the same machinery for ordinary call/return too. Of
course, the replay is after the call returns, and may replay results
that were in-flight (not yet completed) when the call happened, just as
you can with regular OOO except without all the OOO overhead (but with
the SRAM).

Of course, result replay requires knowing when a given operand is to be
replayed. So you either need reorder and dependency hardware during the
replay, or you must statically know the latency of every operation: lots
of hardware or no early outs, choose one. As it turns out, the only
early out that matters is load. The Mill is a hybrid: issue replay for
loads, and result replay for everything else. The load retire stations
are pretty much the same as the load queue on a conventional, except for
the ability to re-request the load they are holding, so there's no added
hardware.

Uniqueness: that's a question of namespace and how dependencies are
represented. I've not looked very hard at dependency representation for
dynamic result replay, and the ideas I've had would need a whiteboard,
so I'll leave that up to you. You've a free hand; we have no patents in
the area.

In static result replay, the compiler knows the static latencies of
everything and dependency order is instruction stream order.
Consequently intermediate operands can be addressed as creation order
ordinal. This is temporal addressing (when, not where) rather than
spatial addressing (register number). In the decoder, recent results can
be addressed by indexing into the history; there's a logi-temporal to
physi-spatial latchset mapper in the decoder. The architectural view of
that mapping is the Belt. The actual mapping doesn't matter so long as
the compiler uses the same one when placing operations and assigning
belt numbers to argument references.
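
A toy model of temporal addressing, purely to make the idea concrete
(the class, the fixed belt length and the method names are invented
for this sketch, not the real machinery):

    from collections import deque

    class Belt:
        # results are addressed by how recently they were produced,
        # not by a register number; the oldest value simply falls off
        def __init__(self, length=8):
            self.slots = deque(maxlen=length)   # newest result at index 0

        def drop(self, value):
            # every operation drops its result at the front of the belt
            self.slots.appendleft(value)

        def ref(self, age):
            # operands are referenced by age: 0 = most recent result
            return self.slots[age]

    belt = Belt()
    belt.drop(5)                            # some op's result -> b0
    belt.drop(7)                            # next result -> b0; the 5 is now b1
    belt.drop(belt.ref(0) + belt.ref(1))    # "add b0, b1" drops 12
    assert belt.ref(0) == 12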

Uniqueness turned out to be one of the bigger issues given that we
wanted to be able to migrate tasks between cores, so a call saved on one
core would be replayed on another. We think we have it solved, but we
aren't doing multicore yet and may have to revisit the issue.

lk...@lkcl.net
Dec 7, 2018, 6:29:39 PM
On Friday, December 7, 2018 at 4:31:44 PM UTC, MitchAlsup wrote:

> > at this point i am genuinely puzzled and confused as to
> > whether there is any difference - at all - between
> > Scoreboards+ROBs and Tomasulo+ROBs :)
>
> The only practical difference left is that the Scoreboard is a
> centralized control block, while the ReservationStation is a
> distributed logic block.

i found in one of those links you sent, a reference to a paper
that described how intel implemented the ReservationStations:
as a central block, with routing going out from there to ALUs.

after seeing the pages in the book chapters you sent, about
the 6600 having dependency resolution matrices (FU-to-FU
and others), putting the RS logic in one place on the chip
would make sense.

> > https://www.sigarch.org/simd-instructions-considered-harmful/
> > i never want to be involved with SIMD, ever again :)
>
> I figured this out about 2003.

sadly the case for SIMD is seductively compelling. "add another
opcode" boooys, only an O(N^6) opcode proliferation, noo problem
there at aaaaall.

dropping in another ALU, let the compiler sort it out, it's very
very "simple", in hardware terms, unfortunately.

> You said above::
> > i never want to be involved with SIMD, ever again :)
>
> And yet you are in effect doing SIMD to your register file!

:) i'll do a separate reply (change of subject), summary:
the external ISA looks like a vectorisation system, it's
just the internals where SIMD is the architectural compromise.

> > oh, the other thing: a 3D GPU + VPU needs an insane 256 (total)
> > 64-bit registers. 128 for INT, 128 for FP. this makes it possible
> > to fit pixel "tiles" (as they're called) into the regfile, without
> > the performance / power hit of transferring data through the L1/L2
> > cache barrier (using LD/STs).
>
> A common register file with both int and float in the same file gets rid
> of 1/2 of that insanity.

yeahh unfortunately, the requirements are to cover simultaneous
GPU *and* VPU workloads, where the int regfile is likely to be
heavily used for VPU, and the FP one for GPU.

if you've ever seen demos by nvidia of spinning cubes with videos
being played on them, that's pretty much what web browsers now do,
except flat (2D).

what they do is, they have the VPU do the actual decode, however
the handling of the *displaying* part is passed to an OpenGL
call, as some sort of "texture", i don't know the full details.

this can be "solved" with a massive amount of work, modifying
standard software releases of Firefox and Chromium. i don't want
to get involved with that: got enough to do.

more than that: RISC-V (scalar RISC-V) doesn't have shared INT/FP
regfiles: i did however see it suggested for the embedded space.

compiler-wise, shared register files make me extremely nervous.

> > if we were to deploy the "usual" algorithm(s) associated with
> > Scoreboard register-renaming, it would require multiplying that
> > 256-entry register file by say 2, 3 or even 4.
>
> Or 32 in order to absorb memory latency.

i really really hope not - target clock rate's only 800mhz,
and only a 32-bit-wide DDR3/4 800mhz memory interface
(this is standard fare for $4 300/400-pin FBGA SoCs)

> >
> > by contrast, the Reservation Stations (whether deployed for SB+ROB
> > or for TS+ROB) simply house the in-flight data, job done. oh,
> > and we would get the opportunity to pull that SIMD-style data-merging
> > trick.
>
> As I do now and again, I will warn you about the difficulties of
> doing SIMD to your register file.

i believe i have a way to fake it (the bytemask is part of it).
i'll keep it as a separate thread. we *need* massive amounts
of 8-bit and 16-bit operations, for video decoding, as well as
for handling the RGB pixels on the 3D side.

l.

MitchAlsup
Dec 7, 2018, 7:24:13 PM
On Friday, December 7, 2018 at 5:29:39 PM UTC-6, lk...@lkcl.net wrote:
> On Friday, December 7, 2018 at 4:31:44 PM UTC, MitchAlsup wrote:
>
> > > at this point i am genuinely puzzled and confused as to
> > > whether there is any difference - at all - between
> > > Scoreboards+ROBs and Tomasulo+ROBs :)
> >
> > The only practical difference left is that the Scoreboard is a
> > centralized control block, while the ReservationStation is a
> > distributed logic block.
>
> i found in one of those links you sent, a reference to a paper
> that described how intel implemented the ReservationStations:
> as a central block, with routing going out from there to ALUs.

The dispatch stack approach can be done either with SB or RS and a
couple of other design points.
>
> after seeing the pages in the book chapters you sent, about
> the 6600 having dependency resolution matrices (FU-to-FU
> and others), putting the RS logic in one place on the chip
> would make sense.
>
> > > https://www.sigarch.org/simd-instructions-considered-harmful/
> > > i never want to be involved with SIMD, ever again :)
> >
> > I figured this out about 2003.
>
> sadly the case for SIMD is seductively compelling. "add another
> opcode" boooys, only an O(N^6) opcode proliferation, noo problem
> there at aaaaall.

In the case of Intel, it was: consume another prefix to enable a whole
new subgroup of instructions.

> compiler-wise, shared register files make me extremely nervous.

Why? They also solve a lot of problems:
a) one set of LD/ST instructions
b) INT(F), REAL(I) have no cross routing
c) compare instructions deliver condition to the easiest consumable place
d) FP compare and INT compare are almost the same logic! Why duplicate?
e) FP & MASK is easy to do--surprisingly useful in Transcendental codes.

My K9 design at AMD had both the INT and FP register file in one place,
with slightly different tag and data timing for the FP and INT sections.
The dynamic instruction window was much easier to construct with the
common PRF.
>
> > > if we were to deploy the "usual" algorithm(s) associated with
> > > Scoreboard register-renaming, it would require multiplying that
> > > 256-entry register file by say 2, 3 or even 4.
> >
> > Or 32 in order to absorb memory latency.
>
> i really really hope not - target clock rate's only 800mhz,
> and only a 32-bit-wide DDR3/4 800mhz memory interface
> (this is standard fare for $4 300/400-pin FBGA SoCs)
>
> > >
> > > by contrast, the Reservation Stations (whether deployed for SB+ROB
> > > or for TS+ROB) simply house the in-flight data, job done. oh,
> > > and we would get the opportunity to pull that SIMD-style data-merging
> > > trick.
> >
> > As I do now and again, I will warn you about the difficulties of
> > doing SIMD to your register file.
>
> i believe i have a way to fake it (the bytemask is part of it).
> i'll keep it as a separate thread. we *need* massive amounts
> of 8-bit and 16-bit operations, for video decoding, as well as
> for handling the RGB pixels on the 3D side.

SIMD in my SAMSUNG GPU/PE took essentially 1/2 of the architectural
effort, due to all the silly ramifications (along with a bit of
intragroup squabbling.)
>
> l.

Bruce Hoult
Dec 7, 2018, 9:04:29 PM
On Friday, December 7, 2018 at 4:24:13 PM UTC-8, MitchAlsup wrote:
> SIMD in my SAMSUNG GPU/PE took essentially 1/2 of the architectural
> effort, due to all the silly ramifications (along with a bit of
> intragroup squabbling.)

Oh gawd. "Braiding". Don't talk to me about braiding. Kill it with fire!!! Awful.

Last I heard, it might not even get used.

lk...@lkcl.net
Dec 8, 2018, 2:55:34 AM
On Friday, December 7, 2018 at 4:31:44 PM UTC, MitchAlsup wrote:

> You said above::
> > i never want to be involved with SIMD, ever again :)
>
> And yet you are in effect doing SIMD to your register file!

ok, so changing the topic: one of the requirements is to
be able to do massive amounts of 8-bit and 16-bit INT (incl.
IEEE754 FP16) computations.

given that the data width is so low, the cost of moving
(only) 8-bit / 16-bit values around, per clock, far exceeds
the amount of logic needed to move it.

so what i decided to do is:

* place a vectorisation front-end on the ISA
* allow "type-casting" of the register file from the
default down to 32-bit, 16-bit or 8-bit *as if*
it was a memory-addressable RAM
* behind the scenes (as part of the instruction issue
phase) have as many of those "elements" jammed into
an operation as will fit into the 64-bit width, and
mark them with a "byte-mask" if they won't fit

the alternatives are pretty much flat-out insane, and include
things like having a separate 8-bit sequential/scalar set of FUs
plus ROB plus Reservation Stations, and yet another separate
16-bit set *and* a 32-bit one... except now how on earth do you
sync those both up?

another alternative: just pass scalar operations in, jamming up
the Reorder Buffer with rows consuming dozens of bits, bearing
in mind each ROB entry is an increase in power consumption due
to the DestReg CAM, now it's no longer a quad-issue system it's
a 32-issue system and that's just insane.

so the compromise is to rely on the sequential nature of the
SV "hardware macro-loop-unrolling", which will spawn multiple
sequential operations (sequentially-incrementing "element"
numbers) that, in the 8-bit case, will *happen* to be the next
byte of a register (until the end of that register is reached,
in which case, we move on to the *next register*).

where the Vector Length is not a multiple of 8 (in the
case of 8-bit elements) or a multiple of 4 (in the case
of 16-bit elements), obviously if the entire register were
allowed to perform parallel computations, the MSBs would
become corrupted.

this is where the byte-mask comes in. the byte-mask will
not only tell the ALU which 8/16-bit operands it is to
calculate, it will also tell the Register File which bytes
of the DestReg are to be written to and which are to be left
alone.

so it's not *actually* a SIMD ISA (that's insane), it's just
predicated-SIMD ALUs, where the decoding is done in phases,
even right down to that optimisation idea of merging non-overlapping
byte-masks within the Reservation Stations.
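
a rough python sketch of the issue-phase packing, purely to illustrate
the idea (the function name, the fixed 64-bit width and the packing
policy here are invented for the sketch, not the actual issue logic):

    def pack_elements(vl, elwidth_bits, start_el=0):
        # split a vector op of vl elements into 64-bit "predicated SIMD"
        # issues, each carrying a byte-mask saying which byte lanes of
        # the destination may be written
        lanes_per_op = 64 // elwidth_bits       # e.g. 8 lanes of 8-bit
        bytes_per_el = elwidth_bits // 8
        issues = []
        el = start_el
        while el < vl:
            n = min(lanes_per_op, vl - el)      # last issue may be partial
            bytemask = 0
            for lane in range(n):
                for b in range(bytes_per_el):
                    bytemask |= 1 << (lane * bytes_per_el + b)
            issues.append((el, n, bytemask))    # (first element, count, mask)
            el += n
        return issues

    # VL=11 of 8-bit elements: one full issue (mask 0xff) plus a partial
    # issue covering only the bottom 3 bytes (mask 0b00000111)
    print(pack_elements(11, 8))    # [(0, 8, 255), (8, 3, 7)]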


there are some additional complications / wrinkles at play, here:

(1) the 1D/2D/3D "REMAP" system may add *offsets* onto the
vector-element indices. so, element 0 may come from
byte *THREE* of Register 5.

(2) the 1D/2D/3D "REMAP" system may again completely change
the element indices, according to a regular pattern

(3) there is polymorphic bit-width conversion available, on
*all* src operands *and* the destination.

here what i am planning to do is to utilise the Reorder Buffer
to effectively add in additional micro-code operations that will
pre-process and post-process the operands:

* in the case of the "REMAP", that's dead simple: the operands
go through an xBitManip phase (pre and post). the vectorised
instruction issue phase "gets" registers as standard 64-bit
blocks (element indices &ed with ~0x7), and the xBitManip
micro-phase will take care of the rest (element indices &ed 0x7)

* in the case of the bit-width conversion, that's also simple,
(and can be done in parallel): pass in 8-bit elements, out
comes 16/32/64-bit sign- or zero- extended elements.

the pre-processing and post-processing may in some cases actually
be done i believe by updating the Reorder Buffer entry, as opposed
to occupying multiple ROB entries. in particular this would
be destination register pre/post processing.

other phases (particularly bit-width source register reduction)
may actually be done with a filter on the RegFile.
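
a sketch of the two micro-phases, again with invented names and with
the 64-bit register width hard-coded, just for illustration:

    def element_to_reg(base_reg, el_index, elwidth_bits):
        # which 64-bit register, and which byte inside it, holds element
        # el_index (the "element indices &ed with ~0x7 / &ed 0x7" split
        # mentioned above, generalised to other element widths)
        els_per_reg = 64 // elwidth_bits
        reg = base_reg + el_index // els_per_reg
        byte_offset = (el_index % els_per_reg) * (elwidth_bits // 8)
        return reg, byte_offset

    def widen(value, src_bits, dst_bits, signed=True):
        # polymorphic width conversion: sign- or zero-extend an element
        value &= (1 << src_bits) - 1
        if signed and value & (1 << (src_bits - 1)):
            value -= 1 << src_bits
        return value & ((1 << dst_bits) - 1)

    print(element_to_reg(5, 11, 8))   # (6, 3): byte 3 of register 6
    print(hex(widen(0xFF, 8, 16)))    # 0xffff: -1 sign-extended to 16-bit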


so, i believe that this is a workable strategy for shoe-horning
polymorphic vectorisation into a multi-issue micro-architecture.
it's by no means perfect, as certain workloads will be optimal
and others will be rubbish (missed opportunities for byte-mask
merges).

what we will have to do is, just ensure that the assembly code
reasonably matches the optimal performance. it's a reasonable
compromise for a first effort.

l.

lk...@lkcl.net
Dec 8, 2018, 7:42:03 AM
On Saturday, December 8, 2018 at 12:24:13 AM UTC, MitchAlsup wrote:

> > i found in one of those links you sent, a reference to a paper
> > that described how intel implemented the ReservationStations:
> > as a central block, with routing going out from there to ALUs.
>
> The dispatch stack means can be done wither with SB or RS and a
> couple of other design points.

i've yet to work out how instruction order is preserved, i was
fascinated to note in chapter 11 that you sent me, that branch
speculation is achieved by a combination of schroedinger-marking
the instruction (not yet dead/alive) and using that to mask out
the "write" flag.

simple: if the instruction can't "write", its results can't
do anything, they just sit there.

likewise, imprecise exceptions are made "precise" by the same
fundamental technique: the instruction with exception "holds" a
line that prevents the write from proceeding.

i am finally i *think* starting to "Get It", and it's down to
information that is completely missing from the 1967 patent:
the "Dependency Matrices" are an absolutely critical, critical
part that's *not* mentioned anywhere in any of the online
lectures.

the "Dependency Matrices" are basically a bit-version of the
Reorder Buffer, which will do things such as those mentioned
above. these DMs are from Function-Unit along X *and* Function-
Unit along Y, with dead-simple wires and some tiny sets of
logic gates implementing things like "If instruction is
schroedinger'd, disable the Go_Write signal".

* the Reorder Buffer basically moves a 1D "head of queue" in
a cyclic buffer.
* the Dependency Matrices do sort-of the same thing except
that the effects from X ripple along Y.

so... in effect, the lack of the 2D matrices is why the ROB
has the CAM in the Dest Reg... and *that* is why you said,
earlier, mitch, that SBds are like a "1-bit CAM"! but it's
not the CAM that's important, it's the *matrices*!
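
a crude bit-matrix sketch of that, to check my own understanding (the
sizes, names and the exact release rule are invented, and this is
nowhere near 6600 gate-level detail):

    N_FU = 8    # invented size, purely for illustration

    # waits_on[i][j] == 1 means FU i must not write until FU j has written
    waits_on = [[0] * N_FU for _ in range(N_FU)]
    shadowed = [0] * N_FU   # "schroedinger" bit: result may not commit yet

    def issue(fu, depends_on):
        for j in depends_on:
            waits_on[fu][j] = 1

    def fu_completed(fu):
        # when FU 'fu' writes, clear its column: everything that was
        # waiting on it gets released
        for i in range(N_FU):
            waits_on[i][fu] = 0

    def go_write(fu):
        # Go_Write is just a wide NOR of the row, masked by the shadow bit
        return not shadowed[fu] and not any(waits_on[fu])

    issue(2, depends_on=[0, 1])
    print(go_write(2))              # False: still waiting on FUs 0 and 1
    fu_completed(0); fu_completed(1)
    print(go_write(2))              # True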


i haven't yet been able to completely absorb it, i only skim-read
chapter 11 last night, i see you added a means to do concurrent
LOAD/STOREs that *also* are based on a (sparse) matrix
(and then also link in to the other matrices). the LD/ST sparse
matrix then feeds into other Dependency Matrices.

the key for LD/ST is that LOADs can stop stores, and STOREs can
stop loads.
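
roughly, the rule that populates that LD/ST matrix (the exact policy
details here are my guess for the sketch, not taken from the chapter):

    def must_wait(older_op, younger_op):
        # a younger STORE is held by every older LOAD or STORE;
        # a younger LOAD is held only by older STOREs
        if younger_op == "ST":
            return older_op in ("LD", "ST")
        if younger_op == "LD":
            return older_op == "ST"
        return False

    print(must_wait("LD", "ST"))   # True : a LOAD stops a later STORE
    print(must_wait("ST", "LD"))   # True : a STORE stops a later LOAD
    print(must_wait("LD", "LD"))   # False: loads may pass each other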

there's *no mention* of these simple Matrices *anywhere* in the
academic literature!

this missing information was what was really preventing me from
"accepting" Scoreboards as a viable method of doing OoO superscalar
execution.

utterly, utterly cool.

> > sadly the case for SIMD is seductively compelling. "add another
> > opcode" boooys, only an O(N^6) opcode proliferation, noo problem
> > there at aaaaall.
>
> In the case of Intel, it was consume another prefix to enable a whole
> new sugGroup of instructions.

that's slightly better than i imagined it would be. if that prefix
"persisted" (in some form of CSR state), that's basically what SV
does (and RVV). it means though that a mechanism to store/restore
that state is needed.

> > more than that: RISC-V (scalar RISC-V) doesn't have shared INT/FP
> > regfiles: i did however see it suggested for the embedded space.
> >
> > compiler-wise, shared register files make me extremely nervous.
>
> Why?

i don't really know. or, more specifically: i have a series of
below-conscious-level intuitive hunches that i haven't yet been
able to explore / express, all of which are screaming for my
attention with big neon flashing signs... it's just that they're
about 50 miles away (over the horizon).

these hunches tell me that *if* i was to explore each of them
in huge excruciating detail, then some time in the next... maybe...
6 to 60 months i would have come up with the *one* show-stopping
concrete reason that would allow me to answer...

... oh and i would be absolutely kicking myself for having spent
6 to 60 months on it.


> They also solve a lot of problems:
> a) one set of LD/ST instructions

well.. that's easy enough to have the Functional Unit do the same
task, if the Reservation Station has a bit which marks it as being
either Int or FP.

likewise for (d)

> b) INT(F), REAL(I) have no cross routing
> c) compare instructions deliver condition to the easiest consumable place
> d) FP compare and INT compare are almost the same logic! Why duplicate?
> e) FP & MASK is easy to do--surprisingly useful in Transcendental codes.
>
> My K9 design at AMD had both the INT and FP register file in one place,
> with slightly different tag and data timing for the FP and INT sections.
> The dynamic instruction window was much easier to construct with the
> common PRF.

that seems fairly obvious / logical to me.

> SIMD in my SAMSUNG GPU/PE took essentially 1/2 of the architectural
> effort, due to all the silly ramifications (along with a bit of
> intragroup squabbling.)

Samsung? interesting! was that the internal group, or the external /
3rd party one (Nexell)?

l.

lk...@lkcl.net
Dec 8, 2018, 7:47:42 AM
On Saturday, December 8, 2018 at 2:04:29 AM UTC, Bruce Hoult wrote:
> On Friday, December 7, 2018 at 4:24:13 PM UTC-8, MitchAlsup wrote:
> > SIMD in my SAMSUNG GPU/PE took essentially 1/2 of the architectural
> > effort, due to all the silly ramifications (along with a bit of
> > intragroup squabbling.)
>
> Oh gawd. "Braiding". Don't talk to me about braiding.

you do _not_ wanna touch it...
https://www.youtube.com/watch?v=54o8aYDhnos

> Kill it with fire!!! Awful.

luckily, patents are obtuse and unreadable these days...
https://patents.justia.com/patent/10061591

l.

Bruce Hoult
Dec 8, 2018, 8:28:38 AM
On Saturday, December 8, 2018 at 4:42:03 AM UTC-8, lk...@lkcl.net wrote:
> On Saturday, December 8, 2018 at 12:24:13 AM UTC, MitchAlsup wrote:
>
> > > i found in one of those links you sent, a reference to a paper
> > > that described how intel implemented the ReservationStations:
> > > as a central block, with routing going out from there to ALUs.
> >
> > The dispatch stack means can be done wither with SB or RS and a
> > couple of other design points.
>
> i've yet to work out how instruction order is preserved, i was
> fascinated to note in chapter 11 that you sent me, that branch
> speculation is achieved by a combination of schroedinger-marking
> the instruction (not yet dead/alive) and using that to mask out
> the "write" flag.
>
> simple: if the instruction can't "write", its results can't
> do anything, they just sit there.

Don't forget that in these days of Spectre and Meltdown, merely preventing dead instruction results from being written to registers or memory is NOT ENOUGH. You also need to prevent load instructions from altering cache and branch instructions from altering branch prediction state. For example.

Bruce Hoult
Dec 8, 2018, 8:32:01 AM
That's the one. Interesting to see Mitch is not listed, but I personally know three of the six listed inventors, and worked with them on the compiler targeting this.

To put it simply, the problem is threads in a warp are predicated, but braids in a thread are not. The compiler needs to prove that braids will never want to diverge.

MitchAlsup
Dec 8, 2018, 12:07:01 PM
For those out of the loop:
in SIMT, each lane in the GPU is a different thread.
Braiding placed data from different threads across a single register.
Since each thread executes the same instruction,
on the face of it this looks clean and straightforward.
It is only later that you realize you have just shot out both your feet.

But I agree with Bruce:: Kill the beast using any means possible.

MitchAlsup
Dec 8, 2018, 12:17:55 PM
On Saturday, December 8, 2018 at 6:42:03 AM UTC-6, lk...@lkcl.net wrote:
> On Saturday, December 8, 2018 at 12:24:13 AM UTC, MitchAlsup wrote:
>
> > > i found in one of those links you sent, a reference to a paper
> > > that described how intel implemented the ReservationStations:
> > > as a central block, with routing going out from there to ALUs.
> >
> > The dispatch stack means can be done wither with SB or RS and a
> > couple of other design points.
>
> i've yet to work out how instruction order is preserved, i was
> fascinated to note in chapter 11 that you sent me, that branch
> speculation is achieved by a combination of schroedinger-marking
> the instruction (not yet dead/alive) and using that to mask out
> the "write" flag.
>
> simple: if the instruction can't "write", its results can't
> do anything, they just sit there.
>
> likewise, imprecise exceptions are made "precise" by the same
> fundamental technique: the instruction with exception "holds" a
> line that prevents the write from proceeding.
>
> i am finally i *think* starting to "Get It", and it's down to
> information that is completely missing from the 1967 patent:
> the "Dependency Matrices" are an absolutely critical, critical
> part that's *not* mentioned anywhere in any of the online
> lectures.

The online lectures don't want their pupils to actually "get" what scoreboards
can do at low expense. They want everyone to buy in on reservation stations.
>
> the "Dependency Matrices" are basically a bit-version of the
> Reorder Buffer, which will do things such as those mentioned
> above. these DMs are from Function-Unit along X *and* Function-
> Unit along Y, with dead-simple wires and some tiny sets of
> logic gates implementing things like "If instruction is
> schroedinger'd, disable the Go_Write signal".

You actually are starting to get it.
>
> * the Reorder Buffer basically moves a 1D "head of queue" in
> a cyclic buffer.

The RoB is essentially the dependency matrix plus storage and an
access means (post delivery forwarding).

> * the Dependency Matrices do sort-of the same thing except
> that the effects from X ripple along Y.
>
> so... in effect, the lack of the 2D matrices is why the ROB
> has the CAM in the Dest Reg... and *that* is why you said,
> earlier, mitch, that SBds are like a "1-bit CAM"! but it's
> not the CAM that's important, it's the *matrices*!

Bingo!
>
>
> i haven't yet been able to completely absorb it, i only skim-read
> chapter 11 last night, i see you added a means to do concurrent
> LOAD/STOREs that *also* are based on a (sparse) matrix
> (and then also link in to the other matrices). the LD/ST sparse
> matrix then feeds into other Dependency Matrices.

I have also done a DM in support of an OoO memory system putting order
back together, but that is for another day.
>
> the key for LD/ST is that LOADs can stop stores, and STOREs can
> stop loads.
>
> there's *no mention* of these simple Matrices *anywhere* in the
> academic literature!

"Design of a Computer" Thornton has it all in great detail, except you
have to translate from pre-TTL gates into modern gate technologies.
If you have not read this, you need to. It is available online.
>
> well.. that's easy enough to have the Functional Unit do the same
> task, if the Reservation Station has a bit which marks it as being
> either Int or FP.

You are forgetting the precious nature of your OpCode space. The lower
the demands on OpCode space, the better the encoding can be, and the more
space you leave for future iterations.
>
> likewise for (d)
>
> > b) INT(F), REAL(I) have no cross routing
> > c) compare instructions deliver condition to the easiest consumable place
> > d) FP compare and INT compare are almost the same logic! Why duplicate?
> > e) FP & MASK is easy to do--surprisingly useful in Transcendental codes.
> >
> > My K9 design at AMD had both the INT and FP register file in one place,
> > with slightly different tag and data timing for the FP and INT sections.
> > The dynamic instruction window was much easier to construct with the
> > common PRF.
>
> that seems fairly obvious / logical to me.
>
> > SIMD in my SAMSUNG GPU/PE took essentially 1/2 of the architectural
> > effort, due to all the silly ramifications (along with a bit of
> > intragroup squabbling.)
>
> Samsung? interesting! was that the internal group, or the external /
> 3rd party one (Nexell)

Internal.
>
> l.

MitchAlsup
Dec 8, 2018, 12:21:37 PM
On Saturday, December 8, 2018 at 7:32:01 AM UTC-6, Bruce Hoult wrote:
> On Saturday, December 8, 2018 at 4:47:42 AM UTC-8, lk...@lkcl.net wrote:
> > On Saturday, December 8, 2018 at 2:04:29 AM UTC, Bruce Hoult wrote:
> > > On Friday, December 7, 2018 at 4:24:13 PM UTC-8, MitchAlsup wrote:
> > > > SIMD in my SAMSUNG GPU/PE took essentially 1/2 of the architectural
> > > > effort, due to all the silly ramifications (along with a bit of
> > > > intragroup squabbling.)
> > >
> > > Oh gawd. "Braiding". Don't talk to me about braiding.
> >
> > you do _not_ wanna touch it...
> > https://www.youtube.com/watch?v=54o8aYDhnos
> >
> > > Kill it with fire!!! Awful.
> >
> > luckily, patents are obtuse and unreadable these days...
> > https://patents.justia.com/patent/10061591
>
> That's the one. Interesting to see Mitch is not listed, but I personally know and worked on the compiler targeting this with three of the six inventors listed.

John was the principal contributor, and Max thought that through the
compiler it would be easy to get braiding worked out.
>
> To put it simply, the problem is threads in a warp are predicated, but braids in a thread are not. The compiler needs to prove that braids will never want to diverge.

At the time I left, there was a scheme by which even diverged threads still worked with braiding. I spent months figuring out the encodings and logic
to pull it off (being fought most of the way.)

MitchAlsup
Dec 8, 2018, 12:30:47 PM
Which, oddly enough, creates the need to be able to consume multiple
containers from the cache miss buffers, and that, in turn, is a crucial
mechanism in making the Virtual Vector Method work.

In the past, one would forward the demand container to the waiting memref
and then write the whole line into the cache. S&M (Spectre & Meltdown)
means you have to forward multiple times from the miss buffers and avoid
damaging the cache until the instruction retires. VVM uses this to avoid
having a vector strip-mine the data cache.
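
Roughly, as a sketch only (the structure and the names are invented for
illustration, not the actual design):

    class MissBuffer:
        # a fill buffer that can forward to any number of waiting memrefs
        # but does not touch the cache until the owning instruction is
        # known to retire
        def __init__(self, line_addr, data):
            self.line_addr = line_addr
            self.data = data             # the returned cache line

        def forward(self, offset, size):
            # may be called repeatedly, by any waiting memref
            return self.data[offset:offset + size]

        def retire(self, cache):
            # only now is the cache allowed to be "damaged"
            cache[self.line_addr] = self.data

    cache = {}
    mb = MissBuffer(0x1000, bytes(range(64)))
    print(mb.forward(8, 4))      # speculative consumers get their bytes...
    print(0x1000 in cache)       # ...but the cache line is untouched: False
    mb.retire(cache)
    print(0x1000 in cache)       # True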

lk...@lkcl.net
Dec 8, 2018, 9:04:38 PM
*thinks*... i suspect it's more that the patents were what everyone
had access to, and those simply did not emphasise enough the
importance of the FU dependency matrices.

i saw somewhere, "modern computing has all been about how to get
rid of the CAMs"... well... errr... the 6600 managed it, in the 1960s!

> > * the Reorder Buffer basically moves a 1D "head of queue" in
> > a cyclic buffer.
>
> The RoB is essentially the dependency matrix plus storage and an
> access means (post delivery forwarding).

specifying "to" (via the dest-reg) rather than "from" (via the src-regs),
hence the CAM, as the relationship "to" is one-to-many.

> > * the Dependency Matrices do sort-of the same thing except
> > that the effects from X ripple along Y.
> >
> > so... in effect, the lack of the 2D matrices is why the ROB
> > has the CAM in the Dest Reg... and *that* is why you said,
> > earlier, mitch, that SBds are like a "1-bit CAM"! but it's
> > not the CAM that's important, it's the *matrices*!
>
> Bingo!

:)

what that means is: the restrictions i envisaged due to the ROB#
Dest-Reg CAM (trying to keep the size of the ROB down to reduce
power usage) are now gone.

so instead of messing about trying to shoe-horn 2x 32-bit operations
into 64-bit-wide registers, 4x 16-bit ops into the same, and 8x 8-bit
ops likewise, it *might* be practical to consider putting in large
job-lots of 32-bit ALUs, possibly even 16-bit ones.

i'll have to think this through carefully, because just going to
32-bit-wide ALUs would literally double (actually, triple) the number
of FUs, which with an O(N^2) increase of the Dependency Matrix might
be a bit much.
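
to put rough (purely illustrative) numbers on that: 16 FUs needs a
16x16 = 256-cell matrix, and tripling that to 48 FUs needs 48x48 =
2,304 cells, a 9x blow-up.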

so, keeping the SIMD-like bytemasks might still be a good idea...
hmm....

> >
> >
> > i haven't yet been able to completely absorb it, i only skim-read
> > chapter 11 last night, i see you added a means to do concurrent
> > LOAD/STOREs that *also* are based on a (sparse) matrix
> > (and then also link in to the other matrices). the LD/ST sparse
> > matrix then feeds into other Dependency Matrices.
>
> I have also done a DM in support of OoO memory system putting order
> back together, but that is for another day.

indeed.

> >
> > the key for LD/ST is that LOADs can stop stores, and STOREs can
> > stop loads.
> >
> > there's *no mention* of these simple Matrices *anywhere* in the
> > academic literature!
>
> "Design of a Computer" Thornton has it all in great detail, except you
> have to translate from pre-TTL gates into modern gate technologies.
> If you have not read this, you need to. It is available online.

found it. wooow, it's a scanned PDF... and there is a note
at the back with permission from James Thornton, written by
his wife, to make it available online.

wow.

oh, that's incredible: core memory and boot-loader bank switches!
coool! this reminds me, i remember someone telling me of a bit-encoded
instruction set that was so efficient that its bootstrap loader
required only 17 switches.

the diagrams and photos are absolutely fascinating. they were doing
the ALUs on PCBs with transistors?!?! no wonder there was a focus
on using D-Latches instead of Flip-flops, as that's a hell of a lot
of extra circuitry.

dang.

> > well.. that's easy enough to have the Functional Unit do the same
> > task, if the Reservation Station has a bit which marks it as being
> > either Int or FP.
>
> You are forgetting the precious nature of your OpCode space. The lower
> the demands on OpCode space, the better the encoding can be, and the more
> space you leave for future iterations.

*thinks*... the 1 extra bit in every register field, from doubling the
size of the regfile (assuming you want the same total register space),
would lose that advantage... however the mv instructions (from fp to
int reg) would be gone, at various bitwidth permutations. there's
probably some other obscure
savings, too.

l.

lk...@lkcl.net
Dec 9, 2018, 9:18:30 AM
so, i'm still completely blown away at how amazing the
6600 design is.

* no need for a Reorder Buffer because the forward-chain
of the Go-Read and Go-Write signals through the matrices
preserves the instruction order dependencies whilst still
allowing out-of-order execution

* no need for operand-forwarding as that's an integral
part of the register file, through the "pass-through"
capability.

* the role of the Common Data Bus is also absorbed partly
by the register file's "pass-through" capability, and
the broadcasting of the Dest Reg on the CDB is replaced
by the "single-wire" sort-of CAMs of the Function Units

* as outlined in Mitch's book chapters, if one of those
Function Units is a "Branch" operation, associated
row entries in the Dependency Matrix that hook into
the "Go_Write" of down-stream Functional Units can be
used to allow speculative branch execution to *begin*
but not yet *complete* (commit)...

* also mentioned in the same: if speculation beyond more
than one branch is desired, just add another Branch
Function Unit!

* the book chapters also outline a multi-issue LOAD/STORE
mechanism that also works similarly, with a
secondary (augmentation) "memory-dependency" matrix.
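
to make the branch bullet above concrete to myself, a toy sketch (all
of the detail here is invented; it is just the shape of the idea):

    class BranchShadow:
        # a Branch FU "shadows" everything issued speculatively under it;
        # those FUs may execute but their Go_Write is held until the
        # branch resolves
        def __init__(self):
            self.shadowed_fus = set()

        def add(self, fu):
            self.shadowed_fus.add(fu)    # issued under this branch

        def resolve(self, prediction_correct):
            if prediction_correct:
                released, cancelled = self.shadowed_fus, set()
            else:
                released, cancelled = set(), self.shadowed_fus
            self.shadowed_fus = set()
            return released, cancelled   # release Go_Write, or let results die

    br = BranchShadow()
    br.add(3); br.add(4)
    print(br.resolve(prediction_correct=False))   # (set(), {3, 4})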



some things which i've not yet been able to fully grasp:

* i believe the same logic proposed for the Speculative Branching
may be applied to Exceptions:
speculatively executed exceptions may be similarly
implemented, i imagine by having an Exception "Functional
Unit", in a similar analogous way to how a Reorder Buffer
typically stores speculative exceptions, ready for when
(or if) they hit the head of the ROB queue.

* whilst i Get how with the Tomasulo Algorithm, simply marking a
ROB Dest Reg table entry as empty or not is enough to provide
full register-renaming, because the *values* are stored in the
Reservation Stations (several copies of them, waiting in the RS
Function Unit buffers), i'm no longer clear on exactly how to
replace that ROB Dest Reg field in a scoreboard-with-RS context.

is this diagram with a "Register Aliasing Table" relevant and
correct?
https://libre-riscv.org/3d_gpu/rat_table.png
the table is indexed by register number, and contains a valid
bit, tag, and a "value" (presumably that value is the regfile
register value).

the "tag" apparently is just the "Function Unit"
number (a fixed quantity), and *hypothetically* i can kinda
imagine that it would just be a bitfield (one unique bit per
Function Unit). except, with say 50 Function Units and
256 actual registers, that's now a hell of a lot of bits.
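
back-of-the-envelope, taking those numbers at face value: one bit per
FU would be 256 regs x 50 bits = 12,800 bits of RAT storage, whereas
binary-encoding the FU number needs only 6 bits (enough for up to 64
FUs), i.e. 256 x 6 = 1,536 bits plus a valid bit per row. so the tag
encoding matters a lot more than the concept itself.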

i'm almost at the point where i have a full understanding, enough
to be able to commit to a simulation / implementation. lack of
a full understanding of the role of the Register Aliasing Table is
currently holding that back.

l.

MitchAlsup
Dec 9, 2018, 12:00:46 PM
On Sunday, December 9, 2018 at 8:18:30 AM UTC-6, lk...@lkcl.net wrote:
> so, i'm still completely blown away at how amazing the
> 6600 design is.
>
> * no need for a Reorder Buffer because the forward-chain
> of the Go-Read and Go-Write signals through the matrices
> preserves the instruction order dependencies whilst still
> allowing out-of-order execution
>
> * no need for operand-forwarding as that's an integral
> part of the register file, through the "pass-through"
> capability.
>
> * the role of the Common Data Bus is also absorbed partly
> by the register file's "pass-through" capability, and
> the broadcasting of the Dest Reg on the CDB is replaced
> by the "single-wire" sort-of CAMs of the Function Units
>
> * as outlined in Mitch's book chapters, if one of those
> Function Units is a "Branch" operation, associated
> row entries in the Dependency Matrix that hook into
> the "Go_Write" of down-stream Functional Units can be
> used to allow speculative branch execution to *begin*
> but not yet *complete* (commit)...
>
> * also mentioned in the same: if speculation beyond more
> than one branch is desired, just add another Branch
> Function Unit!

They are small, why not get 2.....
>
> * the book chapters also outline a multi-issue LOAD/STORE
> mechanism that also works similarly, with a
> secondary (augmentation) "memory-dependency" matrix.
>
>
>
> some things which i've not yet been able to fully grasp:
>
> * i believe the same logic proposed for the Speculative Branching
> may be applied to Exceptions:
> speculatively executed exceptions may be similarly
> implemented, i imagine by having an Exception "Functional
> Unit", in a similar analogous way to how a Reorder Buffer
> typically stores speculative exceptions, ready for when
> (or if) they hit the head of the ROB queue.

I designed it to cover both cases. It has the singular drawback of
taking interrupts at what appears to be branch boundaries most of
the time.
>
> * whilst i Get how with the Tomasulo Algorithm, simply marking a
> ROB Dest Reg table entry as empty or not is enough to provide
> full register-renaming, because the *values* are stored in the
> Reservation Stations (several copies of them, waiting in the RS
> Function Unit buffers), i'm no longer clear on exactly how to
> replace that ROB Dest Reg field in a scoreboard-with-RS context.
>
> is this diagram with a "Register Aliasing Table" relevant and
> correct?
> https://libre-riscv.org/3d_gpu/rat_table.png
> the table is indexed by register number, and contains a valid
> bit, tag, and a "value" (presumably that value is the regfile
> register value).
>
> the "tag" apparently is just the "Function Unit"
> number (a fixed quantity), and *hypothetically* i can kinda
> imagine that it would just be a bitfield (one unique bit per
> Function Unit). except, with say 50 Function Units and
> 256 actual registers, that's now a hell of a lot of bits.

The whole thing has a quadratic term in the expansion. So SBs have
an upper limit on the registers times FUs that are supportable.

Ivan Godard
Dec 9, 2018, 12:25:31 PM
On 12/9/2018 9:00 AM, MitchAlsup wrote:

>
> The whole thing has a quadratic term in the expansion. So SBs have
> an upper limit on the registers times FUs that are supportable.

Can't you use the same group-FUs-into-slots that we did, to reduce the
size of the FU quadratic coefficient?

lk...@lkcl.net
Dec 9, 2018, 12:34:28 PM
On Sunday, December 9, 2018 at 5:00:46 PM UTC, MitchAlsup wrote:

> > * also mentioned in the same: if speculation beyond more
> > than one branch is desired, just add another Branch
> > Function Unit!
>
> They are small why not get 2.....

buy one get one free...

> > the "tag" apparently is just the "Function Unit"
> > number (a fixed quantity), and *hypothetically* i can kinda
> > imagine that it would just be a bitfield (one unique bit per
> > Function Unit). except, with say 50 Function Units and
> > 256 actual registers, that's now a hell of a lot of bits.
>
> The whole thing has a quadratic term in the expansion. So SBs have
> an upper limit on the registers times FUs that are supportable.

after reviewing this video https://youtu.be/p4SdrUhZrBM?t=494
and noting the problem there of two results being ready at
the same time, both of which target the same register r1
(one from the MUL unit, one from an ADD), i think i might have
a workable scalable solution that doesn't blow up.

* each reservation station is given its own scoreboard src1/src2
ready flags. about 2 entries would be just about tolerated
because this would double the number of Functional Unit wires,
thus quadrupling the size of the Dependency Matrices.

* the src reg number(s) are shared (identical) so it's not as
horrible as it seems.

* writes are selected based on the age of the instruction.
this being directly equivalent to the "head of Reorder Buffer
gets commit rights"

(question: does this mechanism already exist?)

* on "Go Write", the index of the Reservation Station, which
is a single bit (there being lots of those), is used to create
a pseudo-encoding that's directly equivalent to the
"Reorder Buffer Number".

* this pseudo-encoding is passed through the Register File
and out the other side, whereupon it is expanded *BACK*
out to single-bit signals that indicate to the Reservation
Stations (with a single bit), "data's arrived".

this is critically dependent on register writes being done in
the instructions' originally-issued order. i am happy to admit
to having no clue how that's done :)
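
the encode/expand step i have in mind, as a trivial sketch (the widths
and function names are invented; it's just a one-hot to binary to
one-hot round trip):

    def rs_to_tag(rs_onehot):
        # encode the (one-hot) Reservation Station wire into a small
        # binary tag that travels with the write through the Register File
        return rs_onehot.bit_length() - 1     # index of the single set bit

    def tag_to_rs(tag):
        # on the far side of the Register File, expand the tag back out
        # to a single "data's arrived" wire for exactly one RS
        return 1 << tag

    rs = 0b00100000             # RS #5 raised its hand
    tag = rs_to_tag(rs)         # 5 -- the pseudo "Reorder Buffer Number"
    print(tag, bin(tag_to_rs(tag)))    # 5 0b100000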

l.

MitchAlsup
Dec 9, 2018, 2:00:24 PM
One of the things that made the CDC 6600 SB so small is that it had
3 register files of 8 entries each. The A registers were only readable
by the INCrement unit and the Shifter, the B registers only readable by
the INCrement unit and the Shifter, and the X registers were only
readable by the MUL, DIV, ADDer, and Store units.

If one were to try a SB with 32 GPRs, where each GPR can go to any of 5-8
function units, the SB becomes too big for its timing to work.
Let us postulate that we have a fully general purpose SB for such a
RISC machine (Opteron size) and the SB has size X.

The SB for the partitioned CDC 6600 is about 24 times smaller. (IIRC).

MitchAlsup
Dec 9, 2018, 2:04:34 PM12/9/18