6600-style out-of-order scoreboard designs (ariane)


lk...@lkcl.net

Apr 16, 2019, 3:26:04 AM
to RISC-V HW Dev, Florian Zaruba, MitchAlsup, Dr Jonathan Kimmitt
hiya florian,

i'm cc'ing hw-dev, as the below may prove useful to other people implementing out-of-order designs.  i found the PDF overview for ariane, which shows that you're implementing (incompletely) a CDC 6600-style out-of-order architecture [1]

if implemented in full you will achieve precise exceptions, need no "rollback" mechanism or "physical-architectural-register-file" nonsense, and yet be extremely power-efficient.  this being a known goal of ariane, i thought you might appreciate some insights, below.

from the CDC 6600 patent (and the academic literature) most people understand that you need a Q-Table, which can be done either as a 1D binary array of register indices (array length equal to the number of FUs) *or* as a unary matrix of bits, N=FUs, M=NumRegs, where in the M dimension only one bit at a time is set [2]
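to make the two encodings concrete, here is a tiny python sketch (purely illustrative: the sizes and names are invented, it is not ariane code and not from mitch's chapters) holding the same Q-Table information both ways:

# illustrative only: two equivalent Q-Table encodings (sizes invented)
NUM_FUS, NUM_REGS = 4, 8

# encoding 1: one entry per FU holding a binary register index
# (None when the FU has no destination register allocated)
qtable_binary = [None] * NUM_FUS

# encoding 2: NUM_FUS x NUM_REGS unary (one-hot) matrix,
# at most one bit set per row
qtable_unary = [[0] * NUM_REGS for _ in range(NUM_FUS)]

def allocate(fu, dest_reg):
    """record that Function Unit `fu` will produce register `dest_reg`."""
    qtable_binary[fu] = dest_reg
    qtable_unary[fu] = [1 if r == dest_reg else 0 for r in range(NUM_REGS)]

def fu_writes_reg(fu, reg):
    """"active detection": a single AND per cell in the unary form."""
    return qtable_unary[fu][reg] == 1   # same answer as qtable_binary[fu] == reg

allocate(fu=2, dest_reg=5)
assert fu_writes_reg(2, 5) and not fu_writes_reg(2, 4)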

what most people do *not* understand is that you also need an N x N (N=FUs) "dependency" matrix as well, and that each cell of that N x N matrix contains *multiple* logic blocks handling the commit-blocking signals (write hazards) - usually designated as coming in from the "top" - and the usually-expected read hazards coming in from the side, meeting in each cell in an easy-to-lay-out fashion.

no disrespect intended to chris' team (i love the BOOM branch prediction algorithm, chris): listening to chris celio's design advice here will *not* help you, because his team implemented the TOMASULO algorithm, which involves a Reorder Buffer, which in turn means they need a CAM, and that is severely power-hungry.

although the Tomasulo algorithm is topologically equivalent to a (properly implemented) 6600-style algorithm, the topological morphing required [3] leaves very little that either design may use - or learn - from the other (without a full, comprehensive study of both).

the 6600-style algorithm is extremely power-efficient, requires far fewer gates, and *does not need CAMs*.  instead it uses unary or binary array encoding as a DIRECT substitute for a CAM, providing exactly the same end-result and needing only a single AND gate to indicate "active detection".

a CAM clearly requires *multiple* AND gates... *PER ENTRY* in the array.

thus it is clearly far more power-efficient to use 1D-binary-array or 2D-unary encoding.

each cell in the N x N Dependency Matrix basically combines, in an OR fashion, its write-hazard lines (commit-blockers); a minimal code sketch follows the list below.  those commit-blockers may be:

* exception blockers.  these also handle interrupts.
* branch speculation commit-blockers
* LD/ST blockers (LD/ST management is best done as its own OxO matrix that then feeds its write-hazards to the NxN one)
* "the usual" register-based result blockers (write hazards) which everyone thinks, from the academic literature (and the expired patents) on the 6600, is the only thing that scoreboards can be used for [hint: it's not].

* exception blockers basically stop all down-stream instructions from committing.  once the instruction that *MIGHT* have to throw an exception *KNOWS* that it does not need to throw an exception, it drops its write-hazard line.  if it *does* have to throw the exception, it instead throws the "Go-Die" switch on down-stream instructions.  in this way you get PRECISE exceptions.  ta-daaaa.

THIS PRECISE EXCEPTION CAPABILITY IS NOT ACKNOWLEDGED BY THE ACADEMIC LITERATURE ON THE 6600, and it is leading to designers creating extremely power-inefficient designs [or a design suffering unnecessarily from imprecise exceptions]

* handling interrupts *as* exceptions means that you *do not* need to do global masking.  commit-blocking on exceptions (interrupts) *is*, in effect, "masking" [selectively].  it's directly functionally equivalent; yet the "masking" idea that you described last week, Florian, is the "Nuclear Option", blasting away any and all possibility of *any* interrupts.

* branch speculation commit-blockers also hook into the "go-die" (instruction cancellation) capability that is needed for clearing out the Function Units if an exception occurs.  i nick-named it the "Schroedinger wire" :)  at the branch-speculation point, you very simply hold a commit-block on all down-stream instructions (in *both* paths, if you choose to do that), and when the branch outcome is known, you either drop the write-hazard (correct path) or pull the "go-die" (instruction-cancel) wire (wrong path) - a toy sketch follows these bullets.

it's surprisingly very simple!

* LD/ST blockers require their own separate OxO (very sparse) matrix, with their own sparse array of hazards (down the main diagonal): LDs block STs, and STs block LDs.  *both* the LD and ST write-commit-blocker signals drop down from above into the *NxN* Dependency Matrix.

* you have the "usual" register-based write-blocking, and also (which i really like), you have operand (result) forwarding automatically built-in.  however, you have added logic that detects whether an exception has occurred into the operand forwarding block... where, actually, as you can see above, exceptions *need* their own write-hazards, and once cleared, it will be SAFE to forward the operands.

in scoreboard.sv i am not seeing any evidence of the combining of those signals, meaning that you are running into the very difficulties with exceptions (interrupts) that you outlined last week on the list, and you will also be running into difficulties with LD and ST.

i did note that you have a per-FU instruction-in-flight counter, which is excellent.  the reason is that these counters can be used to turn this into a multi-issue design very, very easily.  you very simply:

* count up the number of FUs ready to "commit" (that have no hazards remaining) - using a popcount
* set a threshold of the number that are *ALLOWED* to commit
* if the per-FU instruction-in-flight counter is less than this threshold, allow commit!
* count up the number that were *actually* committed...
* subtract that global count from EVERY one of the per-FU instruction-in-flight counters.

then all you need do is extend the instruction decode and issue phase to drop more than one result into the system: ta-daaa, now you have turned a single-issue design into a multi-issue one :)
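the counting steps above, written out as a minimal python sketch (the FU record fields and the single shared threshold are assumptions made purely for illustration; this is not ariane code):

# illustrative sketch of the multi-issue commit-count scheme described above.
# `fus` is a list of dicts with invented fields: 'ready' (no hazards left)
# and 'inflight' (the per-FU instructions-in-flight counter).
def commit_cycle(fus, commit_threshold):
    ready = [fu for fu in fus if fu['ready']]          # popcount of ready FUs
    committed = 0
    for fu in ready:
        # only FUs whose in-flight count is below the threshold may commit
        if fu['inflight'] < commit_threshold and committed < commit_threshold:
            fu['ready'] = False
            committed += 1                             # actually committed
    # subtract the global committed count from EVERY per-FU counter
    for fu in fus:
        fu['inflight'] = max(0, fu['inflight'] - committed)
    return committed

fus = [{'ready': True, 'inflight': 0},
       {'ready': True, 'inflight': 1},
       {'ready': False, 'inflight': 2}]
print(commit_cycle(fus, commit_threshold=2))   # up to 2 commits this cycle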

of course the multi-porting on the operand forwarding will go up, and the multi-porting on the register file will go up as well: in the Libre RISC-V SoC we stratify the register file (and double/quadruple the number of FUs as well, to match) to avoid this.

this approach has the rather weird side-effect that at most one register result from each modulo-4 bank may be multi-issue committed in any given cycle.  by that i mean that if there are operations which can commit to r4, r8, r12 and r16, these MUST be done sequentially (over 4 cycles); however, if we have operations which can commit to r1, r6, r3 and r8 (modulo 4 those are 1, 2, 3 and 0), that's okay, because of the 4-bank stratification.

it's a little weird however it means that we can use 4x 32-deep banks of 3R1W SRAM instead of requiring a *COMPLETELY INSANE* 256-entry 64-bit 12R4W ported SRAM.
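a quick sketch of that modulo-4 rule (assuming a 4-bank split by register number; illustrative only):

# illustrative: with the register file split into 4 banks by (regnum % 4),
# only one result per bank may be committed in any one cycle.
def pick_committable(dest_regs, num_banks=4):
    picked, banks_used = [], set()
    for reg in dest_regs:
        bank = reg % num_banks
        if bank not in banks_used:        # one write port per bank per cycle
            banks_used.add(bank)
            picked.append(reg)
    return picked

print(pick_committable([4, 8, 12, 16]))   # [4]          -- all bank 0: serialised
print(pick_committable([1, 6, 3, 8]))     # [1, 6, 3, 8]  -- one per bank: all commit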

anyway, the above are some insights that will help you to avoid a *lot* of design pain and, if implemented, will give you a stonkingly-good power-performance ratio for very little effort, and without the kinds of compromises *normally* expected of a Tomasulo (ROB) algorithm *or* of what a 6600-style architecture is (incorrectly) *believed* to be only capable of.

btw if you're happy to listen, i'm happy to describe an algorithm that provides reversible "Q-Table history", allowing full and precise restoration of register names when operand-forwarding opportunities are detected and an exception occurs.

the "normal" way to deal with this situation is to trash the ENTIRE scoreboard (obliterating every single in-flight instruction), then place the system into a treacle-like "single-issue single-FU-execution" mode and to walk forward at a snail's pace until the exception hurdle has been cleared.

the scheme that i came up with in december (thanks to mitch alsup for the very inspiring discussions) called "Q-Table history" allows full detection of in-flight operand-forwarding opportunities AND reversibility and restoration on branch speculation and exceptions.

this would result in a significant reduction in the number of writes to the register file: depending on the history depth (properly allocated) all *and any* operand-forwarding opportunities will be detected and eliminated.

lastly, if anyone would like to receive copies of mitch alsup's book chapters that augment "Design of a Computer", the book written by James Thornton about the machine he and Seymour Cray designed, i have permission to send copies; you will need to give credit and acknowledgement to mitch.  these book chapters go into much better detail than the overview above, and include the gate-level diagrams needed to do a proper implementation of the capabilities mentioned briefly above.


[2] most implementors choose the 1D binary array because traditional register files are designed as SRAMs.  a unary matrix is actually much more efficient, because the unary array *is* the address, meaning that the SRAM's "traditional" address-decode block may be REMOVED (it is completely redundant)

[3] if you limit the number of rows in each Tomasulo Reservation Station to one and only one, it effectively becomes the direct equivalent of the "operand latches" present at the front of a 6600-style Function Unit.  this in turn allows the CAM of the ROB to be replaced with a unary matrix, as it is (was) only the *multi-entry* capability of Reservation Stations that required the introduction of a CAM in the first place.

lk...@lkcl.net

Apr 19, 2019, 9:22:10 AM
to RISC-V HW Dev, zar...@iis.ee.ethz.ch, mitch...@aol.com, jr...@cam.ac.uk
hiya florian, appreciate that you're busy with current tasks - i spotted a potential major design flaw that... how can i put this... "could be the cause of the very debugging and investigation tasks that i surmise may be preventing you from having the time to evaluate mitch alsup's book chapters and the information that i am providing on 6600-style scoreboard design" shall we say.

register-renaming (and built-in operand forwarding) is automatically achieved with the 6600-style scoreboard design by way of the Q-Table working along-side the NxN Function Unit Dependency Matrix.  operand forwarding in the 6600 was achieved by using same-cycle (actually, falling-edge) write-through capability on the Register File.  in modern processors, "write-through" of SRAMs can be used for this exact same purpose. this *can* be augmented by a separate additional operand-forwarding bus, very similar to an SRAM except... without the SRAM.

also, pipelining in the 6600 was sort-of achieved by a "revolving door" created by the three-way interaction between Go_Read (register file read), Go_Write (Function Unit latch write) and Go_Commit [write to regfile, i think: mitch can you confirm?].  only two of these signals were permitted to be raised at any one time, and only when one of them was HI could the next-in-the-chain go HI, at which point the previous one would drop, hilariously a bit like a very short caterpillar going round in circles, forever.  operand-forwarding was achieved when two of these lines (one of them Go_Read) were HI for example.
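purely as a toy model of that rotation (using the signal names above and modelling only what the paragraph says, not the actual 6600 gates), something like this:

# toy model of the Go_Read -> Go_Write -> Go_Commit "revolving door":
# the next signal in the chain may only rise while its predecessor is HI,
# and the predecessor then drops, so at most two are ever HI at once.
ORDER = ["Go_Read", "Go_Write", "Go_Commit"]
state = {"Go_Read": 1, "Go_Write": 0, "Go_Commit": 0}

def rotate(nxt):
    prev = ORDER[(ORDER.index(nxt) - 1) % len(ORDER)]
    if state[prev]:                 # next may only rise while its predecessor is HI
        state[nxt] = 1              # briefly two HI (the forwarding window)
        assert sum(state.values()) <= 2
        state[prev] = 0             # ...at which point the previous one drops

rotate("Go_Write"); rotate("Go_Commit"); rotate("Go_Read")
print(state)                        # the short "caterpillar" is back where it started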

the important thing is that actual pipelining was not introduced until the 7600.  not many details are known about the 7600; however, the Function Unit input-latches and the corresponding associated output-latch had this "revolving door" 3-way MUTEX on each of them *even in the 7600*.

in the creation of the ariane scoreboard.sv, however, there is a problem, that is masked / hidden by the fact that the integer ALU operations are all single-cycle.

* on the div unit, you *do* have a form of MUTEX that blocks the Function Unit input operands from being used whilst the DIV operation is in progress.  you will therefore not experience any problems with DIV results.

* on the integer FUs, these are single-cycle, so the problem is *HIDDEN*.  note however there will also be a performance ceiling (64-bit MUL in a single cycle) due to gate latency, that, if fixed by creating a multi-stage integer-mul, *WILL* result in problems.

* on the LD/ST FU, you *may* see problems.  i haven't investigated in depth, because the design is deviating from the 6600 by not splitting out LD/ST into its own separate sparse-array matrix (see section 10.7)

* on the FPU FUs, for which i see no evidence of a MUTEX, and which i assume are multi-stage, you *WILL* experience problems.

the problem is that, without MUTEXes blocking the Function Unit from re-use until the output is generated, the Q-Table will become corrupted.  this will show up *ONLY* during exceptions (and possibly branch-speculation cancellation), because it is on exceptions that rollback is initiated.

when a destination (result) register number is transferred through the Q-Table to a Function Unit, it does so by assuming that there is a commit-block on that register which "preserves" the name of that destination register.  this is sort-of done by preventing its destruction using the write-hazard infrastructure until such time as its committing can "make it disappear" safely.

thus: only when that result is *actually available* is it SAFE to DESTROY (retire) that result, because at the point at which the result is stored, all "commits" - all write hazards - have been cleared.

by allowing the Q-Table to proceed to a new entry without allowing the write-hazards to be cleared, you are DESTROYING absolutely CRITICAL information.

i repeat:

by allowing the Q-Table to tell the Function Unit that the write-hazards do not matter, rollback is NO LONGER SAFELY POSSIBLE.

the solution is outlined in Section 11.4.9.2 of mitch's book chapters.  reproduced with kind permission from mitch alsup, an image for people who may not have these chapters:


you can see there that there are *four* "apparent" Function Units, all with src1, src2 operand latches, and associated corresponding result latches, so it APPEARS as far as the NxN Function Unit Dependency Matrix that there are FOUR adders (or four FPUs).

there are NOT four FPUs.

there is only ONE (pipelined) FPU.

that FPU however has *FOUR* sets of src1-src2-result latches.

the absolutely critical insight here is to note that the number of FU latch-sets *must* exceed or be equal to the pipeline depth.

* if it is less, the consequence will be that the pipeline will be underutilised.
* if it is greater, there exists the increased ability of the design to undergo register renaming, however bear in mind that the FU NxN Dependency Matrix is, clearly, O(N^2).

one solution in common usage is to merge multiple functions into the Computation Unit, funnily enough exactly as has already been done in both the ariane FPU Function Unit and the ariane integer ALU.

you will be able to check that this is the case by temporarily creating a global "de-pipelining" mutex that only permits a single operation to be carried out at any one time.  only one of an FPU operation, LD/ST operation, Branch operation or Integer operation may be permitted at any one time, *NO PIPELINING  PERMITTED AT ALL*...

... and at that point the problems that i anticipate you to be experiencing (based on an examination of this design) on exceptions and branch prediction should "puzzlingly and mysteriously disappear for no apparent reason".

we have an implementation of a multi-in, multi-out "fan" system in nmigen (if you're able and happy to read python HDL), at least the comments are useful:


the "mid" - multiplexer id - is what is passed in down the pipeline, just as ordinary boring unmodified data, only to be used on *exit* from the pipe to identify which associated fan-out latch the result is to go to.

that mux-id is used here for example:

you can see later at lines 202, 203, 204 and 208, the mid indexes which of the "next stages" to route the incoming data to.  i really like nmigen :)
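for anyone who does not want to read nmigen, here is the same mux-id idea as a plain-python sketch (invented names, fixed depth, illustrative only; the real code is in the links above):

# plain-python sketch of the mux-id ("mid") fan idea: the id rides along with
# the data, untouched, and selects the fan-out result latch only on pipe exit.
from collections import deque

PIPE_DEPTH = 4
pipeline = deque([None] * PIPE_DEPTH)   # fixed-depth pipe, for illustration only
result_latches = {}                     # one fan-out result latch per mux-id

def pipe_step(new_entry=None):
    """advance the pipe one cycle; route whatever falls out by its mux-id."""
    pipeline.appendleft(new_entry)
    out = pipeline.pop()
    if out is not None:
        mid, result = out
        result_latches[mid] = result    # mid used ONLY here, on exit from the pipe

# issue four operations, each tagged with the id of its latch-set
for mid, operand in enumerate([10, 20, 30, 40]):
    pipe_step((mid, operand + 1))       # "+ 1" stands in for the real computation
for _ in range(PIPE_DEPTH):
    pipe_step()                         # drain
print(result_latches)                   # {0: 11, 1: 21, 2: 31, 3: 41}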


in the diagram in Mitch Alsup's book you can see that this is replaced with a FIFO (just to the side of the Concurrent Unit aka pipeline).  however that particular design strategy works for a fixed-length pipeline, i.e. it *prevents* early-out and it *prevents* the amalgamation of multiple pipelines (with different lengths) behind a common "ALU" API.

by passing the multiplexer id down through the data, early-out, reordering pipeline layouts, *and* FSMs can all be combined and the multi-in multi-out Concurrent Unit doesn't give a damn :)

early-out pipelines (such as FPU "special cases" for NaN, zero and INF being handled very early in the pipeline) allow less work to be done (less power utilised), however it would be anticipated that this would require dual-porting on the result stage (into the multiplexer).  luckily however we *guarantee* that only *one* of the array of result stages is ever going to be active at any one time, so, bizarrely, ORing of the two possible paths may be deployed as opposed to requiring higher gate count MUXes.

two fan-outs will still be required (one for the early-out path on the FPU pipeline, one for the "normal" path on the FPU pipeline), it's just that the fanned-out outputs from each may be safely ORed together, given the *guarantee* that there will be only one mid in use at any one time.
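a sketch of that OR-merge, relying on the stated guarantee that only one path is ever valid in a cycle (illustrative python, invented names):

# illustrative: because only ONE of the two result paths is ever valid in a
# given cycle, the fanned-out outputs can be ORed instead of MUXed.
def merge_paths(early_valid, early_data, normal_valid, normal_data):
    assert not (early_valid and normal_valid)   # the one-at-a-time guarantee
    gated_early = early_data if early_valid else 0
    gated_normal = normal_data if normal_valid else 0
    return gated_early | gated_normal           # plain OR, no mux needed

print(hex(merge_paths(True, 0x7FC00000, False, 0)))   # early-out (NaN) result wins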

so... i think that covers it.  summary: you're missing critical MUTEXes on the Function Unit src1-src2-result latches, without which data corruption will occur (guaranteed) on any form of rollback.  fix that, and you'll have an absolutely fantastic design.

l.



Florian Zaruba

Apr 19, 2019, 9:49:22 AM
to Luke Kenneth Casson Leighton, RISC-V HW Dev, zarubaf, mitch...@aol.com, Dr Jonathan Kimmitt
Dear Luke,

thanks for all the suggestions and the book chapter.

On Fri, Apr 19, 2019 at 3:22 PM <lk...@lkcl.net> wrote:
hiya florian, appreciate that you're busy with current tasks - i spotted a potential major design flaw that... how can i put this... "could be the cause of the very debugging and investigation tasks that i surmise may be preventing you from having the time to evaluate mitch alsup's book chapters and the information that i am providing on 6600-style scoreboard design" shall we say.
Unfortunately, this is not the reason as I still have to pursue a PhD degree (and you won't get that solely with engineering an in-order core) ;-). I appreciate your input and I have the two book chapters as my easter lecture. I hope they will be insightful. 

* on the integer FUs, these are single-cycle, so the problem is *HIDDEN*.  note however there will also be a performance ceiling (64-bit MUL in a single cycle) due to gate latency, that, if fixed by creating a multi-stage integer-mul, *WILL* result in problems.
The multiplier is pipelined in my design. 

the problem is that, without MUTEXes blocking the Function Unit from re-use until the output is generated, the Q-Table will become corrupted.  this will show up *ONLY* during exceptions (and possibly branch-speculation cancellation), because it is on exceptions that rollback is initiated.
In an in-order, single-issue processor, with a single cycle ALU you don't have much branch shadow (zero). So that should not be a problem for the moment. This can be improved once going super-scalar.  

... and at that point the problems that i anticipate you to be experiencing (based on an examination of this design) on exceptions and branch prediction should "puzzlingly and mysteriously disappear for no apparent reason".
I am actually not experiencing any more problem: The "design flaw" which produced the buggy behavior of non-idempotent reads was associating interrupts during commit. The observation to do that during decode helped to eliminate that problem. We are happily booting Debian Linux on the FPGA and multi-core SMP Linux on the OpenPiton platform (with PLIC).

in the diagram in Mitch Alsup's book you can see that this is replaced with a FIFO (just to the side of the Concurrent Unit aka pipeline).  however that particular design strategy works for a fixed-length pipeline, i.e. it *prevents* early-out and it *prevents* the amalgamation of multiple pipelines (with different lengths) behind a common "ALU" API.
Our FPU has early out.  

early-out pipelines (such as FPU "special cases" for NaN, zero and INF being handled very early in the pipeline) allow less work to be done (less power utilised), however it would be anticipated that this would require dual-porting on the result stage (into the multiplexer).  luckily however we *guarantee* that only *one* of the array of result stages is ever going to be active at any one time, so, bizarrely, ORing of the two possible paths may be deployed as opposed to requiring higher gate count MUXes.
Our FPU manages all that. 

so... i think that covers it.  summary: you're missing critical MUTEXes on the Function Unit src1-src2-result latches, without which data corruption will occur (guaranteed) on any form of rollback.  fix that, and you'll have an absolutely fantastic design.
One "problem" which I have, is this merged scoreboard/rob structure (as it is quite area and timing critical). I have the possibility to rollback the entire state speculative state so no data corruption can occur ;-) But I am certain that the current design point can be further improved. I am hoping that the lecture you send me is giving further insights.

Best,
Florian 


--
Florian Zaruba
PhD Student
Integrated Systems Laboratory, ETH Zurich
Skype: florianzaruba

lk...@lkcl.net

Apr 20, 2019, 3:33:47 AM
to RISC-V HW Dev, lk...@lkcl.net, zar...@iis.ee.ethz.ch, mitch...@aol.com, jr...@cam.ac.uk


On Friday, April 19, 2019 at 2:49:22 PM UTC+1, Florian Zaruba wrote:
Unfortunately, this is not the reason as I still have to pursue a PhD degree (and you won't get that solely with engineering an in-order core) ;-).

nice!  okok will try to keep it short
 
I appreciate your input and I have the two book chapters as my easter lecture. I hope they will be insightful.

the shakti team implemented the same approach, independently: Professor Kamakoti re-derived the 6600-style Q-Table concept.  so there is working source code to examine, for parallels.
 
I am actually not experiencing any more problem: The "design flaw" which produced the buggy behavior of non-idempotent reads was associating interrupts during commit. The observation to do that during decode helped to eliminate that problem. We are happily booting Debian Linux on the FPGA and multi-core SMP Linux on the OpenPiton platform (with PLIC).

 fantastic!

early-out pipelines (such as FPU "special cases" for NaN, zero and INF being handled very early in the pipeline) allow less work to be done (less power utilised), however it would be anticipated that this would require dual-porting on the result stage (into the multiplexer).  luckily however we *guarantee* that only *one* of the array of result stages is ever going to be active at any one time, so, bizarrely, ORing of the two possible paths may be deployed as opposed to requiring higher gate count MUXes.
Our FPU manages all that. 

that's very cool, i'll take a look.
 
so... i think that covers it.  summary: you're missing critical MUTEXes on the Function Unit src1-src2-result latches, without which data corruption will occur (guaranteed) on any form of rollback.  fix that, and you'll have an absolutely fantastic design.
One "problem" which I have, is this merged scoreboard/rob structure (as it is quite area and timing critical).

it's very, very important to remember that only the combination of a separate Q-Table plus a correctly-implemented Dependency Matrix, with associated single-row Reservation Stations (aka Function Units) with those MUTEXes on them, is directly, functionally, one-for-one equivalent to everything that Tomasulo and a ROB provide.

and that it's extremely efficient in terms of power and gates.

* on the RS/FU src1-src2-result latches, D-Latches (3 gates) can replace flip-flops (10 gates each)
* there's no CAMs (single-bit AND-gate testing of unary matrices replaces the CAM)
* combining all of the write-hazards (and read-hazards) that block commit is just... an OR-gate of single-bit inputs.

phrase such as, "surely it has to be more complex than that" and "surely it can't possibly replace the need for separate PRF-ARFs" and "surely it can't obviate the need for a ROB" tends to spring to mind quite often.


I have the possibility to roll back the entire speculative state, so no data corruption can occur ;-)


yes, rollback is the "absolutely terrible" solution, aka "history / snapshots" - it requires detection of the problem, rollback of the full state (CSRs, regfile, everything), a full reset / destruction of *all* in-flight data, then putting the processor into "Seriously Slow Crawl" Mode, and guessing at how long the processor can run like that before being allowed to be de-throttled.

and it's totally unnecessary!

But I am certain that the current design point can be further improved. I am hoping that the lecture you sent me gives further insights.


let's hope it benefits other people as well.

l.

k...@dspia.com

Apr 25, 2019, 2:43:24 AM
to RISC-V HW Dev, zar...@iis.ee.ethz.ch, mitch...@aol.com, jr...@cam.ac.uk
Can someone share the book chapters being discussed in these posts?

Thanks

-K

lk...@lkcl.net

Apr 26, 2019, 9:36:11 AM
to RISC-V HW Dev, zar...@iis.ee.ethz.ch, mitch...@aol.com, jr...@cam.ac.uk
(with thanks to Florian for kindly agreeing to publish what was formerly a private discussion, this may be of benefit to others in the future.  i'm also attaching two key screenshots, reproduced with kind permission from Mitch Alsup.  discussion edited slightly).

On Tuesday, April 23, 2019, Florian Zaruba <zar...@iis.ee.ethz.ch> wrote:
Hi Luke,

so I've taken the time and read the pages you have sent me, thanks again. It contains some neat tricks.

Seymour Cray was a genius, well ahead of his time. Mitch learned from that. He is one of the only other people I know who thinks at the gate level and is no longer constrained by NDAs.
 
He explained to me that it is only Intel and AMD that still do massive gate level designs.

Everyone else has moved to HDLs, and it is causing significant design inefficiencies, and causing foundries to drop support for certain kinds of cells (T-gates, for example).


I am still a bit confused on a couple of things:

1. The paper mentions an 8R/4W register file. That seems quite unnecessarily big for a single-issue core.


That's because the context is not a single issue core. The chapters discuss *modernising* 6600.

Also remember it can be used for vector processing: a single issued instruction can still hit the regfile massively (e.g. SIMD).

Mitch was responsible for the design of AMD's 5ghz K9 architecture.

He was also a key architect behind AMD's GPU.

IIRC that section is explaining how to allocate resources correctly.

Analyse the workload (number of FMACs, ratio of LDs), then make sure the pipelines match it, then make sure the regfile matches that.
 
 You won't retire much more than one instruction and definitely not read more than two operands (let us concentrate on the integer part). Even for a dual issue approach, you read approximately 1.5 registers per instruction. So three ports should be sufficient. 

Not for a multi issue design it's not. That's the missing context.
 

Not a big deal; I think that can be circumvented by some read/write port handshaking on the regfile.

Yes. Remember to use write-thru SRAM, otherwise you get a clock's worth of unnecessary delay.

Then the regfile effectively becomes an "operand forwarding bus" as well!
 
2. Unfortunately, the drawings are quite hard to read as they are blurry pixel graphics

Moo? Hm perhaps you have a poor quality PDF viewer. Try xpdf. Old, boring, and effective. I get v hi res images on my 3000x1800 laptop LCD out of this PDF, with xpdf.

Debian: apt-get install xpdf
 

and the style is not very consistent or self-explanatory; I am sure I missed a couple of points there.

The context is that the chapters are an extension of the original book by J Thornton, "Design of a Computer". Google it: the PDF is online. James' wife says he gave permission for the scans to be put online; it was quite touching, as J Thornton was clearly very old at the time the correspondence seeking permission took place.

It also helped me to be on comp.arch for nearly 3 months solid, talking with Mitch direct, to fill in the missing gaps.

Happy to do the same for you; I do prefer it to be a public discussion though, so others can benefit.

Can we fwd to hw-dev or libre-riscv-dev? Would like my team to be able to see info below, as well as students in future, studying our design.


Also, there seem to be transmission gates in the drawings, which I think are meant to be some kind of storage (flip-flops).

Quite possibly.... although in talking with Mitch he explained that only D latches are needed (3 gates). Flip flops (10 gates) are usually used by non-gate-level designers because they're "safe" (resettable).

And have a high cost.
 
3. The text talks about continuous scoreboard and dependency matrix scoreboard. 


Yes. They are definitely separate.
 
The former seems to be a distributed version which takes all the information from the reservation stations/FUs and generates global signaling. Does it require extra storage somewhere centralized, or can everything be computed combinatorially?

It looks that way! It really is much simpler than we have been led to believe.
 
You can confirm this by examining the original circuit diagrams from Thornton's book.

Yes, the full gate level design is in that book. It's drawn using ECL as they literally hand built the entire machine using PCBs stuffed with 3 leg transistors you can buy from RSOnline, today!

 Although not entirely clear from the text it seems that the continuous scoreboard is preferable.

Yes.
 
4. What information exactly does the reservation station/FU need to contain?

Nothing! It's a D Latch bank! That's it! Latches for the src ops, latch for the result, and ... errr... that's it.
 
Mitch and I had a debate about this (details), happy to relate when you have time.

 I assume from the drawings that it also needs to capture the write data from the computation unit. 

Yes.
 
5. I am desperately missing a high-level diagram on how the different things fall in place. 

Believe it or not, it is so much simpler than you may have been led to believe: the diagram on page 32 of chapter 10 *really is everything that is needed*.

Or p28 or p30. Different highlighting on different sections.


6. The text talks a lot about latches. I am not sure whether latches in the circuit sense are meant.


Yes D latches. Really D latches. Not flip flops.

If so I would recommend you to take a different approach. Some text also indicates that you are making use of the transparent phases of latches. This all seems rather dangerous.

Mitch has been doing gate level design for over 45 years. His design input is why AMD has had such high performance designs at significantly lower clock rates than Intel.

Remember in the 90s and early 2000s how AMD had to quote "fake dash equivalent" clock rates compared to Intel cores?

Now you know why. The designs used the 6600 approach where Intel used Tomasulo; therefore Intel's was an inferior, less power-efficient design, end of discussion.

He did have one design failure however, said it was a bitch to debug, chip went into random state after reset. I forget how he said he learned from that. It is on comp.arch somewhere.

Basically due to his serious amounts of experience at gate level design, if Mitch says only latches are needed, I trust him.

 
7. One part of the text (11.1.1) talks about how instructions can enter the FUs even if they have WAW hazards; an instruction just needs to make sure it doesn't write its result back until the WAW hazard has been cleared.


Yes. Commit phase definitely cannot proceed until all hazards - read and write - are cleared.

Instructions can be ALLOCATED to FUs, they just sit there doing nothing, just like Tomasulo RS's.

That allocation is IMPORTANT: it preserves the temporal relationship, pending availability of resources, basically.


The rest of the text talks about issue stalling if a WAW hazard has been detected. 

Yes. This is equivalent to when the Tomasulo algorithm drops a ROB# into a Reservation Station instead of an actual register value.

So yes it really is the instruction issue.

8. What does "issue" actually mean?

Exactly what it says.

Instruction, from decode phase, goes through Q Table processing and simultaneous FU allocation, all in 1 clock.


Does it mean putting it in the reservation station/FU?


Yes. And doing the Q-Table update, which MUST BE simultaneous.
 
Q Table values are latched btw.  Not combinatorial.


Or does it mean bringing it to the next stage aka read operands?

 No not exactly.  Read is controlled by READ hazard lines.

This MAY occur on the same cycle even for the current issued instruction (hence why everything is a combinatorial block)


 The image in 11.3.1 seems to indicate the former.  Do you really have to keep a separate queue of unissued instructions?

Yes, this is the FIFO that is discussed on hw-dev occasionally. The buffer that allows 16/32-bit instruction lengths to be dealt with.

 Bear in mind though that the 6600 is like the 68000 in that it has separate address registers, so you kinda have to remember that not everything described will fit RV exactly.


 What prevents you from putting them in the reservation station and marking them as "not issued"?

It would require an extremely comprehensive analysis to work out the consequences. Beyond my time and ability, however I am pretty confident that it would be counterproductive or turn out to be unnecessary.

My only advice then would be, "don't go there" :)


9. I am not really getting the placement of read reservations. If an instruction is being issued it places read and write reservations. 


I think of them as "blockers". Most people call them hazards.

Read hazards (blockers) are like the OTHER SIDE of the regfile which had the WRITE hazard.

So... in effect... read hazards are blockers on instructions BEFORE the current one. Basically the required reg value hasn't been written yet, plus maybe, yes, there are other FUs reading the regfile (not enough ports) so you can't read yet.

And write hazards are blockers which say, "if you go ahead with this write, it will be impossible to undo, or there is another instruction that has to write to the regfile BEFORE you do (in the same reg#), so don't do it! ..... Yet".

If the instruction unconditionally places a read reservation you are throwing away the temporal relationship? 


If the Q table is not respected: yes.

This took me a LOT of watching online videos to understand what the hell the Q Table does.

 It is quite amazing how simply keeping track of the last written register is sufficient to do register renaming.  However it is ONLY possible by respecting the read and write hazards and doing those FU Reservations.

Basically once the FU is reserved, it becomes a marker that represents a virtual (future) result. Just like the ROB# of Tomasulo.

Mess with that at peril.

How do you maintain order on the read operands if you issue out of order (e.g. if you defer issuance of one instruction which would be vital for the correct dependency)? In general how do you keep a "temporal order"? 

Through those read and write hazards. Through the Q table.

That's it.

So, respecting the read and write hazards creates a linked list (using a bitmatrix) out of the FUs.

A ROB is therefore no longer needed, because the write dependencies - as bits in the FU Dep Matrix - ALREADY PRESERVE INSTRUCTION ORDER!
 
This is what proponents of Tomasulo fail to understand. They see no ROB Queue and think, "omg, the instruction order has been destroyed".

This is blatantly false.  Each instruction has a write dependency to block all future instructions from proceeding to the commit phase.  AT NO TIME can an instruction get COMMITTED out of order.

Done.

Simple.

Instr order preserved.

So this is really crucial to understand.  The results can be GENERATED out of order, however the write hazards make ABSOLUTELY certain that they are COMMITTED in order.
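A tiny sketch may help here (Python, invented field names, purely illustrative, not from the chapters): results arrive in any order, but a FU may only commit once no older FU still blocks it.

# illustrative: write-hazard bits preserve program order at commit time.
# each FU entry records which OLDER FUs still block it (a row of the bitmatrix).
fus = {}          # fu_id -> {'blocked_by': set of older fu_ids, 'done': bool}

def issue(fu_id):
    fus[fu_id] = {'blocked_by': set(fus.keys()), 'done': False}  # block on all older

def result_arrives(fu_id):          # may happen in ANY order
    fus[fu_id]['done'] = True

def try_commit():
    for fu_id in list(fus):
        fu = fus[fu_id]
        if fu['done'] and not fu['blocked_by']:        # no older blockers remain
            for other in fus.values():
                other['blocked_by'].discard(fu_id)     # drop this write hazard
            del fus[fu_id]
            return fu_id
    return None

issue(0); issue(1)
result_arrives(1)                   # younger result generated first...
assert try_commit() is None         # ...but it cannot commit out of order
result_arrives(0)
assert try_commit() == 0 and try_commit() == 1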

Once that is understood, the beauty, simplicity and elegance of this design starts to fall into place.


Maybe if I knew the exact fields of the scoreboard, that would help me understand it.

Q table. That is all.

This aspect *is* well understood from the patent and the academic literature. Almost any online video on youtube about scoreboards will do.

What the online material does NOT do is explain how write hazards can be successfully applied to LD/ST, branch speculation and exceptions.

The academic literature also fails to highlight the significance of the Dependency Matrix, focussing on the Q Table and glossing over the FU NxN bitlevel combinatorial block because Q table was what was in the 6600's primary patent.

l.

2019-04-26_14-28.png
2019-04-26_14-29.png

高野茂幸

Apr 26, 2019, 10:37:34 AM
to lk...@lkcl.net, RISC-V HW Dev, jr...@cam.ac.uk, mitch...@aol.com, zar...@iis.ee.ethz.ch
Hi,

J Thornton-san's approach is alternatively called a scoreboard; this can be used for googling.
And the publications from the University of Wisconsin-Madison, by Gurindar Sohi-san and J Smith-san, are good material, especially for superscalar study.

Best,
S.Takano



Dr Jonathan Kimmitt

Apr 26, 2019, 11:04:20 AM
to lk...@lkcl.net, RISC-V HW Dev, zar...@iis.ee.ethz.ch, mitch...@aol.com

See comments inline, Luke, and best of luck with your implementation. I would recommend that you turn those schematics into a Verilog simulation model (Verilog supports T-latches, which are called tranif0/1 devices; depending on the usage this may result in so-called registered nets that have an impact on simulation speed).

so the reason why T-gates are not much used in UDSM designs is that it is difficult to represent their behaviour as a digital timing library due to the bi-directional nature and the lack of any gain in the pass transistors. It's not a problem if you are happy to verify your design at transistor level complete with layout parasitics. D-latches are widely used and in fact Ariane has a D-latch version of its register file. But they do compromise the performance of ATPG which is another automatic process which is tedious to perform manually.

If so I would recommend you to take a different approach. Some text also indicates that you are making use of the transparent phases of latches. This all seems rather dangerous.
So with the D-latches we have to conduct further analysis to check for pulses which could occur due to clock skew in the transparent period. The D-flipflop, which is just two D-latches in series on opposite clocks, makes that analysis easier. Again there is an impact on testability, because you cannot capture the signal state in the transparent phase, so traditional fault simulation is needed to check testability.

Samuel Falvo II

Apr 26, 2019, 11:26:50 AM
to Luke Kenneth Casson Leighton, RISC-V HW Dev, Florian Zaruba, mitch...@aol.com, Dr Jonathan Kimmitt
On Fri, Apr 26, 2019 at 6:36 AM <lk...@lkcl.net> wrote:
(with thanks to Florian for kindly agreeing to publish what was formerly a private discussion, this may be of benefit to others in the future.  i'm also attaching two key screenshots, reproduced with kind permission from Mitch Alsup.  discussion edited slightly).

This whole thread is over my head at the moment, but thanks to everyone for making it available for eventual absorption.

--
Samuel A. Falvo II

Samuel Falvo II

Apr 26, 2019, 12:40:19 PM
to Luke Kenneth Casson Leighton, RISC-V HW Dev, Florian Zaruba, mitch...@aol.com, Dr Jonathan Kimmitt
On Fri, Apr 26, 2019 at 9:32 AM Luke Kenneth Casson Leighton <lk...@lkcl.net> wrote:
... sounds all very obvious so far, right? :) now the tricky bit:

Unfortunately, no.  :)  Remember, my processor design experience starts with Forth/stack CPUs [1] and ends with 6502-style PLA instruction decoders[2].  I have a lot to learn.

[1]: https://github.com/KestrelComputer/kestrel/tree/master/cores/S16X4A

lk...@lkcl.net

Apr 27, 2019, 7:11:00 PM
to RISC-V HW Dev
that's odd... neither my message nor mitch's made it through to hw-dev, even after 2 days.




Luke Kenneth Casson Leighton <lk...@lkcl.net>
Apr 26, 2019, 5:32 PM
to Samuel, RISC-V, Florian, mitch...@aol.com, Dr




On Friday, April 26, 2019, Samuel Falvo II <sam....@gmail.com> wrote:

This whole thread is over my head at the moment, but thanks to everyone for making it available for eventual absorption.

There is, I believe, both a mental and a practical way to turn a single-issue in-order design into a degenerate case of a scoreboard-style OoO one:

* split out the pipeline and add a bank of incoming src latches and outgoing dest latches, with the number in the array/bank equal to the depth of the pipeline
* add muxids to the src/dest latches so that results can be associated with the src ops properly on exit from the pipeline

* if there is operand forwarding built-in to the pipeline infrastructure, REMOVE it (and use a writethru regfile instead)

* if there is a PRF/ARF (physical / architectural regfile) arrangement, remove that too.

... sounds all very obvious so far, right? :) now the tricky bit:

* where previously there is "stalling due to LD/ST conflicts", split out the "read blocking" and "write blocking" and "commit now" wires, and route them instead to a matrix.

* where previously there is "stalling because the INT pipeline has to wait for a register result because the next instruction needs it", do the same thing.

* where the DIV FSM (if there was one) complicates the design and causes all sorts of stall messes all over the place, do the same thing.

... you see where that's going? Same thing for exceptions / interrupts, same thing for branches.

Now all you do is add a Q Table, which is nothing more complex than a 1D array of latches containing a Register Number, and... errr... you're done.

Congratulations, an in-order design has been turned into a degenerate OoO one.

And, interestingly, much of the awkwardness associated with having to propagate pipeline "stall" mechanisms throughout the design is gone: it is now managed cleanly in one place.

[Note, there was, is, and shall be NO MENTION of "speculation" in the above. Speculation is NOT a hard absolute design requirement in an OoO design, it is just an optimisation that happens to be really easy to add *to* an OoO design. This appears to be a very common misconception: because the performance gains are so high, and it is so easy to add, nobody skips it. Still, correlation != causation and all...]

---

Now, if you want some parallelism, all you do is crank up the repeat button on the number of pipelines, extend the size of the FU Matrix to match, and add some more ports on the regfile to cope.

Making it multi-issue and keeping it a precise engine is a little trickier; the basis: keep a count of the number of in-flight instructions not yet committed.

Each cycle, subtract the total number committed in that cycle from the counter of EVERY FU not yet committed.

On the next cycle, if any FU has a count less than the number PERMITTED to commit, it MAY commit.

It is not quite that simple: some FUs may try to write to the same regnum, so there is a little bit of futzing to sort out there; again, that "commit count" can be used to work out who has priority.

---

So, in summary: keeping single-issue and yet using Dependency Matrices actually simplifies an in-order design, the Q Table provides reg renaming "for free", and as long as speculation and multi-issue are not added as well, it *remains* simple.

lk...@lkcl.net

Apr 27, 2019, 7:13:31 PM
to RISC-V HW Dev
again, mitch's response, which can clearly be seen to be sent to hw-dev, is nowhere to be seen on groups.google.com hw-dev archives.  forwarding.

Mitch Alsup

Apr 26, 2019, 10:33 PM
to jrrk2, lkcl, hw-dev, zarubaf


Mitch Alsup


There are two points to be made here:

1) when using an array of latches, one can build the scan path at the boundary of the array using the std read-write gates. This is effectively how one tests SRAMs, and how one should test latch-based RFs.

2) when using latches as data capture points in a pipeline, one needs to verify 2 points:
2.a) the latch goes closed before the inbound data crosses out of its hold time stable points
2.b) nobody (NOBODY) is "looking" at any data from any latch that is transparent.

Do this and the latch based pipeline is race free.

The SB does this inherently.

The latch timing is created in the SB, timed by the pickers in the SB, broadcast from the SB, and results in another latch somewhere transitioning from open->closed. Given property 2 above, one can prove the timing. (Or one can eat the area and use flip-flops.)

lk...@lkcl.net

May 1, 2019, 9:27:17 AM
to RISC-V HW Dev

The latch timing is created in the SB, timed by the pickers in the SB, broadcast from the SB, and results in another latch somewhere transitioning from open->closed. Given property 2 above, one can prove the timing. (Or one can eat the area and use flip-flops.)

chomp.

an idea occurred to me: if D-latches are serving the same purpose as write-through SRAMs, and SRAMs are a well-understood "thing" that has an accepted verification process, then would it be reasonable to substitute single-address SRAMs *for* D-latches?

l.

Dr Jonathan Kimmitt

unread,
May 1, 2019, 9:47:19 AM5/1/19
to lk...@lkcl.net, RISC-V HW Dev

A write-through RAM is an optimised bit-cell (equivalent to a D-latch) surrounded by sense amplifiers. Logically it is equivalent to a latch, but it would be in no way competitive with a latch in area or power consumption. Also in UDSM technologies RAMs may require a threshold margin optimisation control as well as optional BIST circuitry, which I guess is the part that is interesting for you.


Dan Petrisko

unread,
May 1, 2019, 10:29:10 AM5/1/19
to lk...@lkcl.net, RISC-V HW Dev
Hi Luke --

There are a few reasons you would not want to use a single-address SRAM, mostly for physical design.

There are two types of SRAMs: asynchronous SRAMs, which output their result in the same cycle as the request, and synchronous SRAMs, which output their result in the cycle after the request.  In order to insert a 'hardened', or highly circuit-optimized, SRAM into your design, one uses a memory compiler provided by the foundry of choice. Asynchronous SRAMs are generally synthesized out of flip-flops (or latches if they are write-through). A typical memory compiler will just refuse to give you a single-address SRAM, which is a non-starter. If they do support it, it will most likely be generated as a latch surrounded by SRAM-y addressing overhead (and possibly mandatory BIST circuitry).

So, you might think to convert your design to use synchronous SRAMs.  SRAMs become more and more area efficient the larger they are; conversely, tiny SRAMs are very area-inefficient.  The control logic starts to overtake the data storage at (rule of thumb) 1 kB.  You might think to combine your "SRAM D-latches" into a larger SRAM with multiple simultaneous reads/writes.  However, the area of an SRAM scales with the number of ports squared.  So anything more than a few ports becomes untenable.
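
As a rough illustration of that port-squared scaling (a toy model only; real memory-compiler numbers vary widely by node and cell type):

# Toy back-of-envelope model: relative bit-cell area vs. port count,
# assuming area grows with the square of the number of ports.
def relative_bitcell_area(ports):
    return ports ** 2

for ports in (1, 2, 4, 6, 10):
    print(f"{ports:2d} ports -> ~{relative_bitcell_area(ports):3d}x the 1-port cell area")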

For very small, simple structures, logic elements are vastly more efficient than SRAMs.

Best,
Dan Petrisko



lk...@lkcl.net

unread,
May 1, 2019, 7:40:27 PM5/1/19
to RISC-V HW Dev, lk...@lkcl.net


On Wednesday, May 1, 2019 at 3:29:10 PM UTC+1, Dan Petrisko wrote:

For very small, simple structures, logic elements are vastly more efficient than SRAMs.

dan (and jonathan), thank you for the insights.  i was wondering about address-less (1-element) SRAMs: whether to use them (at all), given that a 6600-like design has the register "address" already decoded into mutually-exclusive 1-bit (unary) arrays of length N (N=num(regs)).

adding a binary re-encoder just to get back to "standard" SRAM addressing when the SRAM has to break that binary encoding *back out to unary* seems... well... redundant.
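
a trivial illustration of that round-trip (plain python, purely to show the redundancy):

def unary_to_binary(onehot):
    """Recover the register index from a one-hot (unary) select."""
    assert onehot != 0 and (onehot & (onehot - 1)) == 0, "not one-hot"
    return onehot.bit_length() - 1

def binary_to_unary(index):
    """What an SRAM's address decoder does internally anyway."""
    return 1 << index

sel = 1 << 5                         # what the scoreboard already holds
addr = unary_to_binary(sel)          # re-encode to feed a "standard" SRAM...
assert binary_to_unary(addr) == sel  # ...which immediately decodes it back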

initially i thought it would be ok to put down a suite of address-less SRAMs; however, from what you're saying, that's inadvisable.

l.


Dr Jonathan Kimmitt

unread,
May 2, 2019, 4:18:36 AM5/2/19
to lk...@lkcl.net, RISC-V HW Dev

SRAMs will have a word line which goes to all relevant content (i.e. the corresponding bits in the word), but if N is less than some magic (but quite large) threshold, then using an optimised bit cell is simply not worth it. Now the more interesting question is what size of register file would be needed to justify a dedicated register-file IP. This will of course depend on the number of simultaneous ports in use. A large number of ports will degrade timing because of the number of word lines hanging off each cell. For more details refer to (for example) Chapter 9 of "Fundamentals of Modern VLSI Devices" by Yuan Taur and Tak H. Ning. My understanding is that you are planning to have a register file with a large number of ports.


lk...@lkcl.net

unread,
May 2, 2019, 7:17:24 AM5/2/19
to RISC-V HW Dev, lk...@lkcl.net


On Thursday, May 2, 2019 at 9:18:36 AM UTC+1, Dr Jonathan Kimmitt wrote:

SRAMs will have a word line which goes to all relevant content (i.e. the corresponding bits in the word), but if N is less than some magic (but quite large) threshold, then using an optimised bit cell is simply not worth it. Now the more interesting question is what size of register file would be needed to justify a dedicated register-file IP. This will of course depend on the number of simultaneous ports in use. A large number of ports will degrade timing because of the number of word lines hanging off each cell. For more details refer to (for example) Chapter 9 of "Fundamentals of Modern VLSI Devices" by Yuan Taur and Tak H. Ning. My understanding is that you are planning to have a register file with a large number of ports.


ok, so a digression from the topic at hand: this becomes specific to the Libre RISC-V CPU/VPU/GPU, not the 6600 scoreboard in general.

we're planning:

* 4x 64-entry stratified banks of
* 32-bit-wide 3R1W write-thru with
* byte-level write-enable lines and
* separate operand-forwarding bypass with "Q-Table history / name-restoring" (nothing to do with the regfile itself)

yes, really, 32-bit-wide register files on an RV64GC/SV system, yes, really, byte-level write-enable.

this gives a total of 128 64-bit registers, where *PAIRS* of banks are required to be read/written in order to provide 64-bit read/write.
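
purely as an illustration of the arithmetic (this is *not* the actual stratification scheme, just one hypothetical mapping that gives 4 x 64 x 32-bit = 128 x 64-bit):

def reg64_to_bank_slots(regnum):
    """One hypothetical mapping of a 64-bit register onto a pair of
    32-bit banks: returns (bank, index) for the low and high halves."""
    assert 0 <= regnum < 128
    bank_pair = (regnum // 64) * 2    # banks (0,1) or banks (2,3)
    index = regnum % 64
    return (bank_pair, index), (bank_pair + 1, index)

lo, hi = reg64_to_bank_slots(70)      # -> ((2, 6), (3, 6))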

due to the individual byte-level write-enable lines, 8 and 16-bit SIMD may be carried out *WITHOUT* needing an extra read cycle.

it is extremely weird.  it means that 64-bit operations may be (up to) dual-issue, yet 32-bit vectorised operations may be (up to) quad-issue.

it does, however, mean that we do not have insane 10R3W or 12R4W register-file porting, yet still have quite decent theoretical maximum performance, and it tackles one of the long-standing intractable problems in processor design: sharing a SIMD regfile.

the dynamic "name restoring" - precise name-restoring provided by the Q-Table "history" innovation - will allow us to detect operand forwarding opportunities that are normally only done by extremely advanced processors, making the operand forwarding bus much more important than it normally would be, reducing the *need* for significant register porting.

other innovations include having separate Function Units (crucially, with their own Reservation Stations) for 8 and 16-bit operations of non-power-of-two length, that raise *HIERARCHICAL* hazards on the register(s) that the 8/16-bit operation is embedded in, in an upstream cascade.

this is *only* possible by augmenting the 6600 scoreboard system, and it represents a significant innovation in its own right, solving what has otherwise been an intractable problem associated with SIMD (SIMD on top of standard regfiles, that is) for many, many decades.

the hierarchical cascading generation of hazards basically says, "if you need to reserve byte 3 of register r4, please ensure that you also reserve the *WORD* that is used *AND* reserve the *DWORD* of register r4 as well".

it's really that simple [and only possible with a scoreboard].

clearly the read-reservation (read hazard) is on the whole of the register, even if only a part of it is required, whereas the write hazard can go down to 32-bit, 16-bit and 8-bit fragments.
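
a small sketch of how the upward cascade could be expressed (a hypothetical encoding, purely to illustrate the idea, not the actual hazard-vector format):

def cascade_write_hazards(byte_index):
    """Hazard bits raised, per granularity level, by a write to one byte
    of a 64-bit register (hypothetical sketch of the cascade)."""
    assert 0 <= byte_index < 8
    return {
        "byte":  1 << byte_index,        # the 8-bit fragment actually written
        "half":  1 << (byte_index // 2), # enclosing 16-bit fragment
        "word":  1 << (byte_index // 4), # enclosing 32-bit fragment
        "dword": 1,                      # the whole 64-bit register
    }

# writing byte 3 of r4 also reserves half 1, word 0 and the full register
assert cascade_write_hazards(3) == {"byte": 8, "half": 2, "word": 1, "dword": 1}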

with that structure in mind, the subdivision into 4 *32-bit* banks, plus byte-level write-enable lines, starts to make a bit more sense.

l.

lk...@lkcl.net

unread,
May 3, 2019, 8:52:17 PM5/3/19
to RISC-V HW Dev, lk...@lkcl.net, zar...@iis.ee.ethz.ch, mitch...@aol.com, jr...@cam.ac.uk
florian, hi,

re-reading mitch's book chapters (again...) i recall that you asked: "what is the "Issue" signal? is it really instruction issue, or... what?"  the answer is that there is a *global* "issue" flag which is only ASSERTed when there is no Write-after-Write hazard and the relevant FU for the current instruction is not in the "busy" state.  that global flag then creates a suite of "Issue_{insert_FU_name}" signals, each individually AND-enabled from the global "issue" flag with pattern-matching from the instruction decode, on a per-opcode basis.

in subsequent chapters, where you then see Function Unit Dependency Cells with the word "Issue" associated with them, i *believe* the word "Issue" is (unfortunately) an abbreviation for the appropriate "Issue_{insert_FU_name}" signal associated with that particular FU-related Cell.
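
as a sanity-check on my own understanding, a minimal nmigen sketch of that gating (signal names are mine; treat it as an interpretation of the text, not a transcription of the book's schematics):

from nmigen import Module, Signal, Repl

def build_issue_gating(m, waw_hazard, fu_decode, fu_busy):
    """Global "issue" flag ANDed per-FU with the one-hot decode."""
    n_fu = len(fu_decode)
    issue_global = Signal(name="issue_global")
    issue_fu = Signal(n_fu, name="issue_fu")
    # global issue: no write-after-write hazard, and the decoded FU is free
    m.d.comb += issue_global.eq(~waw_hazard & ((fu_decode & ~fu_busy) != 0))
    # per-FU issue: replicate the global flag and gate with the decode
    m.d.comb += issue_fu.eq(Repl(issue_global, n_fu) & fu_decode)
    return issue_fu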

i didn't previously spot the name-reuse, apologies.

l.

2019-05-04_01-38.png

lk...@lkcl.net

unread,
May 3, 2019, 9:30:40 PM5/3/19
to RISC-V HW Dev, lk...@lkcl.net, zar...@iis.ee.ethz.ch, mitch...@aol.com, jr...@cam.ac.uk
just got some additional insights from mitch. btw "Design of a Computer" (Thornton) is here:
http://ygdes.com/CDC/DesignOfAComputer_CDC6600.pdf
https://archive.org/details/cdc.6600.thornton.design_of_a_computer_the_control_data_6600.1970.102630394_201802
------------------


In the CDC 6600: The Global issue signal causes the instruction fetch/decode pipeline to advance.
The Global issue signal, ANDed with the <empty> FU decode (image), causes the issued instruction to be latched in the Function Unit and in the Computation Unit.


Also note: in the CDC 6600, due to the use of latches, if the instruction is not issued, the fetch/decode pipeline advances but recognises that the instruction was not issued. The advanced instruction remains undecoded, and a mux re-decodes the advanced instruction until issue is successful.
See Thornton page 122.
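
a minimal nmigen sketch of that hold-and-re-decode mux, as i understand it (names are mine, not Thornton's):

from nmigen import Elaboratable, Module, Signal, Mux

class RedecodeMux(Elaboratable):
    """If the previous instruction did not issue, re-present it to decode."""
    def __init__(self, width=32):
        self.fetched_i = Signal(width)   # newly advanced instruction
        self.issued_i = Signal()         # did last cycle's decode issue?
        self.to_decode_o = Signal(width) # what the decoder sees this cycle
        self._held = Signal(width)       # the un-issued instruction, held

    def elaborate(self, platform):
        m = Module()
        m.d.comb += self.to_decode_o.eq(
            Mux(self.issued_i, self.fetched_i, self._held))
        m.d.sync += self._held.eq(self.to_decode_o)
        return m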

lk...@lkcl.net

unread,
May 4, 2019, 12:57:52 AM5/4/19
to RISC-V HW Dev, lk...@lkcl.net, zar...@iis.ee.ethz.ch, mitch...@aol.com, jr...@cam.ac.uk
On Sat, May 4, 2019 at 2:41 AM Mitchalsup <mitch...@aol.com> wrote:
> Note: I was describing how CDC 6600 did it.

noticed.  in the 2nd book chapter, where you describe Concurrent Pipeline Units (an array of latches designated "Function Units" that happen to feed to the same pipeline / ALU), "busy" does not exactly have the same meaning.

however, i believe i am correct in thinking that for an FSM-based div unit (for example), "busy" *would* have the same meaning as in the 6600, i.e. once the FSM is active, it's really important not to "issue" more instructions to that "FU".

whereas with a Concurrent Unit there are (for example) 4 "Function Units" on the same pipeline, and as long as the pipeline never stalls and is 4 stages or less, *at least one* of those 4 Function Units (entry-points to the same pipeline) will always be free of the "busy" signal.

so "busy" has the same meaning... only an ordinary (ALU) pipeline never *will* be busy... all quite odd :)


> There may (hint: MAY) be better ways with today's technology (clocked flip-flops,

finding it very awkward to create an SRLatch, almost certainly going to have to do a flip-flop.
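
for reference, this is roughly what the flip-flop version ends up looking like in nmigen (set-dominant; a sketch of what i'm doing, not a drop-in for the book's SR-latch cell):

from nmigen import Elaboratable, Module, Signal

class SRLatch(Elaboratable):
    """SR 'latch' modelled with a clocked flip-flop (set wins over reset)."""
    def __init__(self):
        self.s = Signal()   # set
        self.r = Signal()   # reset
        self.q = Signal()   # registered output

    def elaborate(self, platform):
        m = Module()
        with m.If(self.s):
            m.d.sync += self.q.eq(1)
        with m.Elif(self.r):
            m.d.sync += self.q.eq(0)
        return m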

> Verilog design)

nmigen!  python!  modern OO programming!  classes n random stuff that's actually human-readable!

i'll be updating the document strings (and comments etc.) if/as appropriate.

l.

lk...@lkcl.net

unread,
May 4, 2019, 12:49:45 PM5/4/19
to RISC-V HW Dev, lk...@lkcl.net, zar...@iis.ee.ethz.ch, mitch...@aol.com, jr...@cam.ac.uk
On Sat, May 4, 2019 at 2:48 PM Mitchalsup <mitch...@aol.com> wrote:

> > nmigen!  python!  modern OO programming!  classes n random stuff that's actually human-readable!
> Some of us find gate level schematics readable.

(attached) - yep, it's one of the reasons i'm happy with nmigen.  one of its options is to generate yosys "intermediate language" (ilang) files, which are barely above gate level (cell "block" level, to be more precise).

running various optimisation and transformation commands will get you even further down the rabbit-hole, getting closer to an actual ASIC netlist whilst losing information such as human-readable net names as a side-effect.

attached "graphviz" screenshots (yosys "show" command) are at the "still-readable-phase", for the FU-Reg Dependency Cell and also the FU-FU Cell, showing that issue input (issue_i) is gating op1, op2 and dest, as well as wr_pending and rd_pending

go_read and go_write are connected to the "reset" side of the sr-latches.  clock is *not* included in any of these... because it's used synchronously inside the sr-latch cell instead.  so... not exactly the original, yet close enough.

-----

unfortunately, it's not perfect: there's no control over how graphviz automatically lays out the connections.  it is, however, the most convenient way to get an approximation of a full gate-level design without having to write any software to do so, whilst still being able to stick to an actual HDL.

the general development technique of writing code, saving, running a tool and viewing its output visually is, i feel, significantly underappreciated; anyone familiar with texstudio or openscad uses it without realising quite how useful it is.

currently i am training the team: "if the graph is not readable or otherwise understandable, it's a bug".  that way the whole design gets split into manageable, reviewable chunks.

l.

2019-05-04_17-13.png
2019-05-04_17-14.png

高野茂幸

unread,
May 4, 2019, 8:53:15 PM5/4/19
to lk...@lkcl.net, RISC-V HW Dev, jr...@cam.ac.uk, mitch...@aol.com, zar...@iis.ee.ethz.ch
Hi Luke-san,

Off topic:
I have doubts about whether open-source tools can offer maintainability and serviceability.
But then, even if a company discontinues its product, who maintains it? You face the same problem.
A competitor may take over the work, but the price will rise because there are fewer competitors.
Rather than a company keeping a closed code base, an open-source tool retains the opportunity to succeed after its owner leaves; that is healthy.

I use Chisel 3.0 for my project; it generates so many intermediate wires that trace-checking the generated code takes unnecessary time. I will try nmigen, thank you for the info.

Even now, using commercial tools is common sense. That is fine while the company operates: they have a responsibility to users under the licence. Ordinary users do not worry about earthquakes; they just go about their day. This makes it difficult to explain the shift to open source, especially to companies that are potential users.

Best,
S.Takano

On Sun, May 5, 2019 at 1:49 AM <lk...@lkcl.net> wrote:

lk...@lkcl.net

unread,
May 4, 2019, 10:24:06 PM5/4/19
to RISC-V HW Dev, lk...@lkcl.net, jr...@cam.ac.uk, mitch...@aol.com, zar...@iis.ee.ethz.ch


On Sunday, May 5, 2019 at 1:53:15 AM UTC+1, adaptiveprocessor wrote:
Hi Luke-san,

Off topic:
Rather than a company keeping a closed code base, an open-source tool retains the opportunity to succeed after its owner leaves; that is healthy.


indeed.
 
I use Chisel 3.0 for my project; it generates so many intermediate wires that trace-checking the generated code takes unnecessary time.

nmigen is not perfect: some auto-generated intermediate wires do exist; however, as long as you keep the module to a reasonable size, it is possible to read the auto-generated code and still follow it.

however... i honestly found that outputting to yosys "ilang" format, then using yosys "read_ilang {filename.il}; show top" was far more productive and useful than reading the actual (auto-generated) verilog.
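
for anyone wanting to try it, this is roughly the flow (a sketch: the nmigen rtlil back-end call is from memory, so check it against your nmigen version):

from nmigen import Elaboratable, Module, Signal
from nmigen.back import rtlil

class Blinky(Elaboratable):
    """Tiny example module, just to have something to export."""
    def __init__(self):
        self.led = Signal()

    def elaborate(self, platform):
        m = Module()
        m.d.sync += self.led.eq(~self.led)
        return m

top = Blinky()
with open("top.il", "w") as f:
    f.write(rtlil.convert(top, ports=[top.led]))

# then, in a shell:
#   yosys -p "read_ilang top.il; show top"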

 
I will try nmigen, thank you for the info.

please do not think that it is a perfect solution by any means: our team chose it "on balance".

* code readability was considered extremely important [i *genuinely* cannot understand scala code, and i have been programming for 40 years]

* python, by contrast, is now in the top 3 world-wide programming languages.

* using python to *generate* Verilog has huge advantages: it brings the OO capabilities of the entire python world to hardware design.

* MyHDL is great... except it actually transforms python *syntax* into verilog.  hence, if there are concepts that do not exist in verilog, they are *NOT* possible to do in MyHDL.  class-based OO for example is an absolute nuisance in MyHDL.

pyrtl (another python-based HDL) just does not have the same (large) community.  litex (by enjoy-digital.fr) and minerva (an rv32 core in nmigen), by contrast, are not small code bases.


the caveats:

* if you do not know what you are doing in python, you can get yourself into an awful, awful mess.

* you absolutely MUST follow best standard SOFTWARE development practices for python.  pep8, docstrings, unit tests, and more.

* unit tests are ABSOLUTELY ESSENTIAL.  if you cannot accept this (think you know better, "i'm a good programmer, i don't need to write unit tests"), you will take ten times as long as anyone else... that's if you succeed at all.

i have to be diplomatic here: the lead developer of nmigen is brilliant and conscientious in the design of nmigen, focussing on maintainability and stability of the API.  the regression test suite, for example, is exemplary - a *really* good example of how a really good python project should be developed.  at the same time... it has to be pointed out that, sadly, they are fundamentally disrespectful of standard python development practices that were long established well before nmigen existed... *and blame python for it*

this latter has made it extremely difficult for our team to fully deploy python OO techniques in the development of our codebase, to the point where we may actually have to fork nmigen in order to deal with some of the issues.

the majority of nmigen users are *hardware* engineers, *NOT* experienced python *SOFTWARE* engineers, and, as a result, those engineers are being taught *extremely* bad python development practices because this is often literally their first major python project (massive predominant use of python wildcard imports, for example, has been a "No" in the python world for 2 decades)

perhaps if there were larger adoption of nmigen by experienced python software engineers, there would be more people speaking up.  whitequark _does_ listen to reason [in some instances].


Even now, using commercial tools is common sense. That is fine while the company operates: they have a responsibility to users under the licence. Ordinary users do not worry about earthquakes; they just go about their day. This makes it difficult to explain the shift to open source, especially to companies that are potential users.


yes.  this has been a common recurring theme for a good couple of decades in the "business" world.  i first heard words like this back in 1999.  they are an excuse.  there is nothing that can be done, you just have to let those businesses succeed / fail based on market competition.

l.

高野茂幸

unread,
May 5, 2019, 2:33:17 AM5/5/19
to lk...@lkcl.net, RISC-V HW Dev, jr...@cam.ac.uk, mitch...@aol.com, zar...@iis.ee.ethz.ch
Dear Luke-san,

No problem: my recent languages are Japanese, English, Verilog, and Python :)
And my background is processor development, from specification decisions through to back-annotation.

Thank you for your detailed advice.

Anyway, is there a homepage listing high-level-language-to-Verilog translation tools, both closed and open source?

Thanks and Regards,
S.Takano


On Sun, May 5, 2019 at 11:24 AM <lk...@lkcl.net> wrote:

Samuel Falvo II

unread,
May 5, 2019, 11:44:27 AM5/5/19
to Luke Kenneth Casson Leighton, RISC-V HW Dev, Dr Jonathan Kimmitt, mitch...@aol.com, Florian Zaruba
On Sat, May 4, 2019 at 7:24 PM <lk...@lkcl.net> wrote:
> however... i honestly found that outputting to yosys "ilang" format, then using yosys "read_ilang {filename.il}; show top" was far more productive and useful than reading the actual (auto-generated) verilog.

I actually had to do this just last night, to discover what I consider to be a bug, but which I'm sure is just a misunderstanding on my part.

If I have nmigen code like this:

ctr = Signal(COUNTER_WIDTH, reset=DEFAULT_VALUE)
m.d.sync += output_port.eq(ctr)

without further qualification or use of `ctr`, it seems that
nmigen will yield a proper, constant-valued register which simulates
properly in Verilator, but which will be treated as an *anonymous
module input* when used with formal verification. This latter
distinction is important, because the formal prover software will
freely fuzz inputs during k-induction proofs unless somehow
constrained with Assume statements. But, in this particular context,
assumptions would be the wrong thing; I'm *specifying explicitly* what
I want in the HDL itself. So, how to fix this?

The work-around for this, I've found, is easy to apply: you just force
the register to equal itself, like so:

ctr = Signal(COUNTER_WIDTH, reset=DEFAULT_VALUE)
m.d.sync += [
    output_port.eq(ctr),
    ctr.eq(ctr),   # self-assignment: marks ctr as driven, so it stays a register
]

This is enough to trick the ilang backend into realizing that ctr is,
in fact, driving *something*, and so is a real register, and not an
input port.

However, I *never* would have found this were it not for looking at
the generated schematic. Because, in the schematic, if ctr appears in
an octagon, you *know* it's treated as an I/O port.

DISCLAIMER: You probably won't run into this edge case, because most designs don't instantiate fixed-value registers like this. I obviously intend to replace this with more complex logic later on. But since I'm very early in my development cycle and don't yet have everything I need to do so, I "stub" out a portion of my circuit with a hard-wired register, like this.

Still, it meant the difference between formal properties never being
met versus formal properties which properly match my happy-path unit
test behavior.

> * unit tests are ABSOLUTELY ESSENTIAL. if you cannot accept this (think you know better, "i'm a good programmer, i don't need to write unit tests"), you will take ten times as long as anyone else... that's if you succeed at all.

If I can add something here:

Not only are unit tests essential, the whole test-driven development process is, in my opinion, essential. It's far too easy to write code which doesn't have adequate coverage otherwise. *More* of my time yesterday was invested in researching why my formal properties *didn't* fail in the expected way than I spent making them pass. I found at least as many bugs, if not more, this way than by just making the bare minimum number of tests pass.

Also, if your tools support it, I'd advocate using formal methods in conjunction with unit tests. Contrary to what many think, I find they *complement* each other. Formal is great for proving fundamental properties of a circuit, while unit testing is great at proving more sophisticated or stateful interactions. The same unit test framework is also used for integration testing. I approach this by
f