so, i'm chewing through the chapters that mitch very kindly
sent me (off-ng). there's tons of gate diagrams! yay!
i understand gates.
one thing that caught my eye was the design of the 6600
register file. it's capable of doing a "pass-through",
i.e. the register is written... oh and the write may be
"passed through" directly to the read port (if there is
one).
this would allow "operand forwarding" automatically,
as part of the design, because any result register would
be "passed through" (forwarded) on to a src operand
*in the same clock cycle*.
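to make the "pass-through" idea concrete, here is a minimal behavioural sketch. it is not a model of the actual 6600 gates: the class name `PassThroughRegFile`, the `pending` dict and the explicit `clock()` call are all my own assumptions, purely for illustration. all it shows is that a read in the same cycle as a write sees the new value, which is exactly what operand forwarding does.

```python
class PassThroughRegFile:
    """toy model: a write presented this cycle is visible to same-cycle reads."""

    def __init__(self, nregs=32):
        self.regs = [0] * nregs
        self.pending = {}  # writes presented this cycle: regnum -> value

    def write(self, regnum, value):
        self.pending[regnum] = value

    def read(self, regnum):
        # pass-through: a same-cycle write wins over the stored value
        if regnum in self.pending:
            return self.pending[regnum]
        return self.regs[regnum]

    def clock(self):
        # end of cycle: latch the pending writes into the storage array
        for regnum, value in self.pending.items():
            self.regs[regnum] = value
        self.pending.clear()


rf = PassThroughRegFile()
rf.write(5, 0x1234)
forwarded = rf.read(5)  # same cycle: value is forwarded from the write port
rf.clock()
stored = rf.read(5)     # next cycle: value comes from the array itself
```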
this got me thinking that, functionality-wise, the 6600
register file may be directly equivalent to a "Common
Data Bus".
or, the other way round: the Common Data Bus of the Tomasulo
Algorithm *is* the operand-forwarding mechanism associated
with and augmenting/improving Scoreboard systems.
also, it seems to me that adding Reorder Buffers provides
register renaming, rollback and precise exceptions, even for
Scoreboard systems, and that the register renaming which needs
to be added to Scoreboards results in a CAM plus src operand
buffers that look remarkably similar to Reservation Stations
and ROB dest tables.
at this point i am genuinely puzzled and confused as to
whether there is any difference - at all - between
Scoreboards+ROBs and Tomasulo+ROBs :)
at present, the unique feature of SV, which is the exclusion
of SIMD and its replacement with polymorphic bit-widths
and variable-length vectors that map down onto the "real"
register file(s) as opposed to having separate Vector Pipelines
and RFs, is throwing a massive spanner in the works.
where most architectures would have SIMD instructions and
be done with it, after seeing this
https://www.sigarch.org/simd-instructions-considered-harmful/
i never want to be involved with SIMD, ever again :)
as mentioned before, what i'd therefore like to do is to be able
to "merge" SIMD operations at the Reservation Station phase, by
spotting patterns in the bitmasks:
ROB#-dest | dest-bitmask | ROB#-src1/val | ROB#-src2/val
--------- | ------------ | ------------- | -----------
ROB5 | 0b11000000 | 0x59000000 | 0x12000000
ROB4 | 0b00110000 | 0x00130000 | 0x00540000
these two rows would be sent to an 8-bit SIMD "Add" Functional Unit.
the RS would have built-in "mask" detection logic that noted that
the dest byte masks are non-overlapping, and would pass the *masked*
src1 and src2 of *BOTH* rows of the RS down into the ALU.
on completion of the operation, the destination would be noted as
being different, and it would be necessary to issue *two* writes:
one to ROB4 and the other to ROB5. if however the ROB#s were the
same, only one write would be needed, and even the bitmasks merged.
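a rough sketch of that merge, using the two rows from the table above. loud assumptions: the table shows an 8-bit mask next to 32-bit values, so i am assuming here that each mask bit covers one 4-bit nibble (MSB first); a real design would presumably use one bit per byte of a 64-bit register. the names `expand_mask`, `simd_add8` and `merge_and_issue` are hypothetical, and the whole thing is a software illustration, not a proposal for the actual RS logic.

```python
def expand_mask(mask, nbits=8, unit=4):
    """turn a per-unit mask (MSB first) into a full-width bit mask."""
    full = 0
    for i in range(nbits):
        full <<= unit
        if mask & (1 << (nbits - 1 - i)):
            full |= (1 << unit) - 1
    return full


def simd_add8(a, b, width=32):
    """byte-wise add: each 8-bit lane wraps on its own, no carry between lanes."""
    result = 0
    for shift in range(0, width, 8):
        lane = ((a >> shift) & 0xFF) + ((b >> shift) & 0xFF)
        result |= (lane & 0xFF) << shift
    return result


def merge_and_issue(rows):
    """rows: list of (rob, mask, src1, src2). returns {rob: (mask, result)}."""
    seen = 0
    src1 = src2 = 0
    for rob, mask, a, b in rows:
        if seen & mask:
            raise ValueError("dest masks overlap: cannot merge")
        seen |= mask
        full = expand_mask(mask)
        src1 |= a & full  # only the *masked* part of each operand is passed on
        src2 |= b & full
    result = simd_add8(src1, src2)  # a single pass through the ALU
    writes = {}
    for rob, mask, _, _ in rows:
        prev_mask, prev_val = writes.get(rob, (0, 0))
        # rows sharing a ROB# collapse into one write with a merged mask
        writes[rob] = (prev_mask | mask, prev_val | (result & expand_mask(mask)))
    return writes


rows = [
    (5, 0b11000000, 0x59000000, 0x12000000),
    (4, 0b00110000, 0x00130000, 0x00540000),
]
writes = merge_and_issue(rows)          # two ROB#s, so two writes
same_rob = merge_and_issue([(5,) + r[1:] for r in rows])  # one ROB#, one write
```

with different ROB#s the result dict has two entries (two writes); with the same ROB# it has one entry whose mask is the OR of the two row masks.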
detection of this merging at instruction-issue time may not be
possible, as it could hypothetically be *completely different*
instructions that issued the operations with different bytemasks.
in fact, i'm counting on precisely that, as a way to not need
8-bit-wide register routing and multiplexing etc. etc., which
would be insane to have byte-wide crossbars all over the place.
so, as i can't think of any other architectural way to achieve the
same ability to merge non-overlapping SIMD ops, doing some voodoo
on the RS rows *requires* Reservation Stations, which in turn require
ROB#s, and it all sort-of hangs together.
oh, the other thing: a 3D GPU + VPU needs an insane 256 (total)
64-bit registers. 128 for INT, 128 for FP. this makes it possible
to fit pixel "tiles" (as they're called) into the regfile, without
the performance / power hit of transferring data through the L1/L2
cache barrier (using LD/STs).
if we were to deploy the "usual" algorithm(s) associated with
Scoreboard register-renaming, it would require multiplying that
256-entry register file by say 2, 3 or even 4.
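just to put numbers on that, a back-of-envelope calculation. the only inputs are the figures above (128 INT + 128 FP registers, 64 bits each, multiplied by 2, 3 or 4); the function name `regfile_bytes` is mine, and this counts raw storage only, not ports or area.

```python
ARCH_REGS = 128 + 128  # INT + FP, as above
REG_BITS = 64


def regfile_bytes(multiplier):
    """raw storage, in bytes, of a physical regfile 'multiplier' times the architectural one."""
    return ARCH_REGS * multiplier * REG_BITS // 8


sizes = {m: regfile_bytes(m) for m in (1, 2, 3, 4)}
for m, nbytes in sizes.items():
    print(f"x{m}: {ARCH_REGS * m} registers, {nbytes} bytes")
```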
by contrast, the Reservation Stations (whether deployed for SB+ROB
or for TS+ROB) simply house the in-flight data, job done. oh,
and we would get the opportunity to pull that SIMD-style data-merging
trick.
the other thing: we considered it insane to have 2 128-entry
register files with 10R4W (or greater) porting, to cope with
the vectorised loads. instead, we came up with the idea to
"stripe" the register file (4 blocks), use much simpler 2R1W
standard cells, and to add muxers to allow all functional units
[eventual, non-simultaneous] access to all registers.
thinking this through a bit further, we could also "stripe" the
Functional Units, as well. i.e. to lay down 4 sets of ADDs,
4 sets of MULs, effectively breaking the register files into
4 blocks (or lanes).
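a small sketch of the striping, assuming the lane is simply the register number modulo 4 (bottom 2 bits). it counts per-lane port demand for a set of operations and checks it against 2R1W. big simplification, labelled as such: reads and writes of all the ops are treated as landing in the same cycle. `lane_of`, `port_demand` and `fits_2r1w` are made-up names.

```python
from collections import Counter

NUM_LANES = 4


def lane_of(regnum):
    """lane (register-file block) a register lives in: bottom 2 bits."""
    return regnum % NUM_LANES


def port_demand(ops):
    """ops: list of (dest, src1, src2) register numbers for one cycle."""
    reads, writes = Counter(), Counter()
    for dest, src1, src2 in ops:
        reads[lane_of(src1)] += 1
        reads[lane_of(src2)] += 1
        writes[lane_of(dest)] += 1
    return reads, writes


def fits_2r1w(ops):
    """True if no lane needs more than 2 reads and 1 write."""
    reads, writes = port_demand(ops)
    return (all(r <= 2 for r in reads.values())
            and all(w <= 1 for w in writes.values()))


# a 4-element vector op on sequentially-numbered registers: one op per lane
vec = [(8 + i, 16 + i, 24 + i) for i in range(4)]
# two scalar ops whose registers all happen to sit in lane 0
clash = [(8, 16, 24), (12, 20, 28)]

vec_ok = fits_2r1w(vec)
clash_ok = fits_2r1w(clash)
```

the sequentially-numbered vector case spreads evenly over the four 2R1W blocks, whereas the two lane-0 ops would have to take turns.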
extending this thought-experiment to its logical extreme, we
considered doing the same thing *to the Reorder Buffer*. i.e.
in effect, we place a hard requirement that the bottom 2 bits of the
ROB# *must* equal the bottom 2 bits of the Destination Register #.
even if we had 32 Reorder Buffer rows, this would reduce the ROB CAM
from its present (insane) 8-bit width (1 bit for INT/FP, 7 bits for
0-127 on the dest reg), down to a slightly-less mad 6 bits.
the down-side: sequentially-issued instructions that by coincidence
happened to match in their destination register numbers modulo 4
(bottom 2 bits) would result in 3 "holes" in the Reorder Buffer.
given that there are "done/!done" flags already, that's not such a
big deal.
what we *hope* is, that SimpleV's design feature of issuing
sequentially-numbered sets of instructions would end up filling
the striped ROBs.
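here is a toy allocator illustrating both cases. assumptions, all mine: ROB#s are handed out in increasing order, the allocator skips forward to the next ROB# whose bottom 2 bits match the dest register, and wrap-around at the end of the (say 32-row) buffer is ignored. `allocate` is a hypothetical name.

```python
def allocate(dest_regs, lanes=4):
    """assign ROB#s in issue order. returns (rob_numbers, holes)."""
    next_free = 0
    robs, holes = [], 0
    for dest in dest_regs:
        # slots that must be skipped so that rob % lanes == dest % lanes
        skip = (dest - next_free) % lanes
        holes += skip
        rob = next_free + skip
        robs.append(rob)
        next_free = rob + 1
    return robs, holes


# SV-style sequentially-numbered dest registers: stripes fill with no gaps
seq_robs, seq_holes = allocate([8, 9, 10, 11, 12, 13, 14, 15])

# two back-to-back instructions whose dest regs match modulo 4
clash_robs, clash_holes = allocate([4, 8])
```

the sequential case uses ROB#s 0 to 7 with no holes; the clashing pair ends up at ROB#s 0 and 4, leaving the 3 holes mentioned above.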
and, what's really *really* nice: 4-wide striping would effectively
mean 4-wide instruction issue. also, there isn't any connectivity,
no data paths, no dependencies between each of the ROB "stripes".
now, my only big concern here is: how do you deal with lane-crossing,
from when a result is generated from a ROB# (dest) in one of these
stripes (dest modulo 4 == 2) over to a Reservation Station that
is in lane 1 (src reg modulo 4 == 1) for example?
and here is where i think the insight that "Common Data Bus
Equals Same As Pass-through / Register File Forwarding" comes in
to play.
recall that we plan to put in 4-way Muxes on the banks of the Register
File, this being an acceptable bottleneck under the circumstances.
if the pass-through mechanism *is* the CDB, then the write from one
"Lane" will go through the Mux, into the other Lane's Register File,
and from there it would need to be broadcast onto that Lane's CDB,
in order to update the Reservation Stations there.
this is however where i run into difficulties, as i've not yet
thought through the synchronisation or how to handle the case where
there happen *already* to be some reads taking place on the 2nd
Lane.
do we queue the new update so that it's handled cleanly on the next
cycle? that sounds like a recipe for disaster, resulting in
ROB# entries getting out of sync.
or do we have a signal which stalls the lane-crossing write, until the
Read-Reg Bus of the target lane is free? that sounds like a
good way to create pipeline "Concertina" effects
(https://en.wikipedia.org/wiki/Accordion_effect)
conundrum!
thoughts appreciated.