On Saturday, April 28, 2018 at 8:54:17 AM UTC+1,
infinis...@gmail.com wrote:
> I was wondering if anybody/any company has tried to implementing
> scoreboarding with register renaming? It seems like a possibilty...
well! it's funny you should ask... :)
by now you probably noticed that i joined the conversations here
on comp.arch well after this question was asked, and i figured
it might be nice to go back through some threads, and hey look!
this question's near-identical to the discussion(s) of the past
month.
this is an awesome thread, read it all, however i figured it might
be nice to go right back to the beginning, and provide a direct
answer (which summarises as "yes!"), and provide a clear synopsis
of the discussions from december.
* the answer is yes... and you do not need a RAT, or CAMs,
or a Reorder Buffer, or a separate PRF-ARF pre-analysis
pre-allocation phase.
* start with a Register Alias Table (RAT Table). note that
it contains register number, tag, and "valid" bit.
* attach multiple-entry "Reservation Station Table" style
rows to each Function Unit of a *standard* scoreboard.
give *EACH ROW* a unique tag name (which is stored in the
RAT, above).
so far, we have a bog-standard RAT system that is commonly
used to give register-renaming to scoreboards... here's where
it gets different:
* note that the RAT table "tag" is effectively a reverse
lookup variant of a Tomasulo Reorder Buffer row number.
* where a Tomasulo Reservation Station stores a reference
to a ROB row number, in the *standard* RAT-scoreboard
style Reservation Station, the RAT tag points to the
RS *row*, i.e. in the reverse direction to a ROB#.
(note that the RAT tag points not to the RS *table*, it points
to the row *in* the table).
* now take the RAT-style Reservation Station tables and
FLATTEN them, *and* add one Function Unit *per RS entry*.
FU-add1 FU-mul1
RSadd1.1 RSmul1.1
RSadd1.2 RSmul1.2
becomes:
FU-add1 FU-add2 FU-mul1 FU-mul2
RSadd1.1 RSadd1.2 RSmul1.1 RSmul1.2
effectively we may view this as: the FUs *become* in effect
the RS's. or, another way to view it: FUs gain src register
latches (if they don't have them already)
* now, recall that the Reservation-Station-aka-Function-Unit
has, just as in any standard scoreboard, an FU-to-Register
Matrix and an FU-to-FU Matrix.
* recall that in the RAT system, the tag indicates where,
on a new instruction issue, the Function Unit allocated
to handle that instruction should look for its src
register.
* remember that this relationship is *exactly* an FU-to-Register
relationship...
* ... and that it is a unique one. it is NOT a CAM, in
other words.
* therefore, where we previously had a RAT "tag", we may
REPLACE that with a mutually-exclusive bit-array (bit
vector), of length EQUAL to the number of Function Units,
where ONLY ONE bit of that array is permitted to be ON,
and all other bits are required to be OFF.
(all bits "off" is also permitted, therefore we do not
need the "valid" bit of a standard RAT).
* note that this bit-array of length equal to FUs is in
a table of length equal to the number of registers.
* therefore we may drop this on top of the FU-to-Reg
Dependency Matrix! i.e. it becomes a simple (single-latch)
addition to each FU-Reg Cell
that's basically it. that's all there is to it. i deliberately
went through it like i did, above, because it shows how to
directly morph a pre-existing well-known *proven* technique
(Register Alias Tables as applied to scoreboards) into one which
completely does away with CAMs, RATs, PRFs, ARFs and ROBs.
to repeat the key characteristics again:
* multiply up (double or triple) the number of Function Units
* add RS-style src register "latches" to each FU [they should
already exist anyway]
* to the FU-to-Reg Dependency Matrix (FU being rows, Regs being
columns) add one extra flag-bit per cell, that, along each
row ONLY ONE of those bits is permitted to be set in any given
clock cycle.
in this way each FU has a ROB-style destination number record
(just "pointing" in the *opposite* direction to a ROB... this
is the primary reason why only a single bit is required,
as opposed to a CAM).
that really is all there is to it. whenever a new instruction
is issued, the mutually-exclusive (new) bit vector is checked
in that register's column.
if there is no bit set, then it indicates (just like with the
RAT table "valid" bit) that the value in the Register File may
be read as a source register.
so, the FU may happily read whatever source register(s) it needs,
as long as they are valid, and stores them in the FU's src
reg "latches" [that we earlier said were normally part of an
RS's row, which are now *part* of the FU if they aren't already].
if any of the src registers *do* have a bit set, it tells us a
number of things:
(1) (with a single bit) that there exists a Function Unit which
has NOT YET generated the result that we want in this
particular register (or, more specifically the dependent FU
has not stored the result back in the Reg File yet)
(2) with a *SINGLE BIT*, it tells us *EXACTLY* which Function
Unit we need to make this current instruction dependent
on, for that src register
therefore, instead of storing the value in the latch (because
it's not ready yet), we just... um... do exactly what is normally
done in a scoreboard: set up an absolutely standard FU-to-FU
RaW dependency link.
lastly, as part of the dependency-setup, the mutually-exclusive
FU bit is set in (row=FU col=DestReg) to indicate to future
instructions that they should look to *this* FU for this DestReg
if they in turn need it as a src reg.
if there happened to be a bit set previously, then its old state
is read *by* the instruction, and the new (changed) state is
updated for use on the *next* cycle *NOT* the current one.
it's real simple.
no CAMs. no more RAT. RS is merged into FU. the PRF *is* the
ARF, just as in the Tomasulo Algorithm... except there's no
ROB, either.
the only difficulty encountered with this scheme is: how to make
it multi-issue. for a "standard" architecture i have not thought
that through.
instead, what we decided to do is to stratify the entire register
file into banks, and for the number of banks to be *equal* to the
issue limit.
thus we allocate registers 0 4 8 12.... to bank 0, registers 1 5 9 13
to bank 1 etc.
and, then, the issue is one *per destination* register. thus, in
a 4-issue scheme, if we have the following instructions:
ADD r1, r1, 5 # bank1
ADD r2, r2, 5 # bank2
ADD r7, r7, 5 # bank3
ADD r4, r4, 5 # bank0
these are all on SEPARATE destination banks, and thus may be issued
all in the same cycle. however THIS may NOT:
ADD r1, r1, 5 # bank1
ADD r5, r2, 5 # bank1
ADD r9, r7, 5 # bank1
ADD r13, r4, 5 # bank1
these are ALL on bank 1, and thus are required to be issued
in FOUR separate cycles.
basically it comes down to the fact that if you go back to the
original RAT+scoreboard, only one tag may be checked and set
in one cycle, for each register.
so if you have these four instructions (and a 4-issue machine)
ADD r1, r2, 5
ADD r2, r1, 5
ADD r1, r2, 5
ADD r2, r1, 5
you have a bit of a problem, because you need to allocate
the first instruction to the first available FU, and set
dest r1 as its RAT, set R1 as the *source* dependency,
then grab a *third* FU with the *second* as its source
dependency (on R2 this time), *destroying* the tag entry
created for the *first* instruction... oh and then doing
the same thing for the fourth instruction, destroying
(overwriting) the tag for R2.
absolute awful mess!
if someone knows of a solution to *that* problem (how to make
*that* multi-issue) then *EXACTLY* the same solution may be
applied to the above scheme.
i simply cheated by entirely avoiding the problem, by stratifying
into four banks, such that, as long as there are no bank-overlaps,
four destination registers (and thus four of the bit-vectors
equivalent to RAT table entries) may be set, easily, with very
little logic, in one cycle.
l.